high performance computing on graphics processing units: hgpu.org

A Fast GEMM Implementation On a Cypress GPU

Naohito Nakasato

University of Aizu, Aizu-Wakamatsu, Fukushima, Japan

1st International Workshop on Performance Modeling Benchmarking and Simulation of High Performance Computing Systems PMBS 10 (2010)

@article{nakasato2010fast,
  title={A Fast GEMM Implementation On a Cypress GPU},
  author={Nakasato, N.},
  journal={University of Aizu},
  year={2010}
}

We present benchmark results for optimized dense matrix multiplication kernels on the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP), and double-double (DDP) precision. Our SGEMM and DGEMM kernels achieve ~2 Tflop/s and ~470 Gflop/s, respectively, which correspond to 73% and 87% of the theoretical peak performance of the GPU. To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip. Furthermore, our matrix multiply kernel in DDP achieves 31 Gflop/s, more than 200 times faster than DDP matrix multiplication on a single core of a recent CPU (measured with mpack version 0.6.5). We describe our GEMM kernels with the main focus on the SGEMM implementation, since all of the GEMM kernels share common programming and optimization techniques. While conventional wisdom in GPU programming recommends heavy use of shared memory, we show that the texture cache is very effective on the Cypress architecture.
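As a rough check on the efficiency figures quoted above, the following sketch relates a measured GEMM rate to a peak rate. The peak values (2.72 Tflop/s SP and 544 Gflop/s DP) are an assumption taken from the commonly published Radeon HD 5870 specifications, not from the abstract, and the function names are illustrative only.

```python
# Assumed peak rates for the Radeon HD 5870 (Cypress), in Gflop/s.
# These come from published board specifications, not from the abstract.
PEAK_SP_GFLOPS = 2720.0
PEAK_DP_GFLOPS = 544.0


def gemm_flops(m, n, k):
    """Operation count for C = A * B with A (m x k) and B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def achieved_gflops(m, n, k, seconds):
    """Sustained rate in Gflop/s for one GEMM call that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


def efficiency(measured_gflops, peak_gflops):
    """Fraction of the peak rate that was actually sustained."""
    return measured_gflops / peak_gflops


# Rounded rates quoted in the abstract (~2 Tflop/s and ~470 Gflop/s).
sp_eff = efficiency(2000.0, PEAK_SP_GFLOPS)
dp_eff = efficiency(470.0, PEAK_DP_GFLOPS)
print(f"SGEMM: {sp_eff:.1%} of assumed peak")
print(f"DGEMM: {dp_eff:.1%} of assumed peak")
```

Because the abstract gives rounded rates, the computed fractions only approximately reproduce the quoted 73% and 87%.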

Tags: ATI, ATI CAL, ATI IL, ATI Radeon HD 5870, ATI Stream, Computer science, Linear Algebra

March 18, 2011 by hgpu


