high performance computing on graphics processing units: hgpu.org

A fast GEMM implementation on the cypress GPU

Naohito Nakasato

University of Aizu, Aizu-Wakamatsu, Fukushima, Japan

ACM SIGMETRICS Performance Evaluation Review – Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10), Volume 38 Issue 4, March 2011

DOI:10.1145/1964218.1964227

@article{nakasato2011fast,
  title={A fast GEMM implementation on the cypress GPU},
  author={Nakasato, N.},
  journal={ACM SIGMETRICS Performance Evaluation Review},
  volume={38},
  number={4},
  pages={50--55},
  year={2011},
  publisher={ACM}
}

We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels achieve ~2 Tflop/s and ~470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of the theoretical peak performance of the GPU, respectively. To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip. Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s, which is more than 200 times the DDP performance on a single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with the main focus on the SGEMM implementation, since all GEMM kernels share common programming and optimization techniques. While the conventional wisdom of GPU programming recommends heavy use of shared memory on GPUs, we show that the texture cache is very effective on the Cypress architecture.
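The efficiency figures quoted above can be checked with a rough back-of-the-envelope sketch. The peak values below are assumptions taken from the publicly listed Radeon HD 5870 (Cypress) specifications (1600 stream cores at 850 MHz, one multiply-add per cycle, double precision at one fifth of the single-precision rate); they are not stated in the abstract, and the function names are hypothetical. The achieved rates are the rounded values from the abstract, so the resulting percentages only approximately match the quoted 73% and 87%.

```python
# Rough sketch, not the author's benchmark code.
# Assumed hardware figures (public HD 5870 specs, not from the abstract):
STREAM_CORES = 1600      # assumption
CLOCK_MHZ = 850          # assumption
FLOPS_PER_CYCLE = 2      # one multiply-add counted as two flops (assumption)


def gemm_flops(m: int, n: int, k: int) -> int:
    """Dominant flop count of C = alpha*A*B + beta*C for an m x k times k x n product."""
    return 2 * m * n * k


def achieved_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Sustained rate in Gflop/s for one GEMM call that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


def peak_gflops(cores: int, clock_mhz: int, flops_per_cycle: int) -> float:
    """Theoretical peak in Gflop/s."""
    return cores * clock_mhz * flops_per_cycle / 1000.0


SP_PEAK = peak_gflops(STREAM_CORES, CLOCK_MHZ, FLOPS_PER_CYCLE)  # 2720 Gflop/s
DP_PEAK = SP_PEAK / 5                                            # 544 Gflop/s


def efficiency(achieved: float, peak: float) -> float:
    """Fraction of theoretical peak that was reached."""
    return achieved / peak


if __name__ == "__main__":
    # Rounded rates from the abstract: ~2 Tflop/s (SP) and ~470 Gflop/s (DP).
    print(f"SGEMM efficiency: {efficiency(2000.0, SP_PEAK):.1%}")
    print(f"DGEMM efficiency: {efficiency(470.0, DP_PEAK):.1%}")
```

Counting 2·m·n·k flops is the usual convention for reporting GEMM performance; it ignores the lower-order cost of scaling by alpha and beta.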

Tags: ATI, ATI CAL, ATI IL, ATI Radeon HD 5870, Computer science, Linear Algebra, Matrix multiplication, Performance

August 22, 2011 by hgpu
