high performance computing on graphics processing units: hgpu.org


Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

Xiang Cui, Yifeng Chen, Changyou Zhang, Hong Mei

Key Lab. of High Confidence Software Technol., Peking Univ., Beijing, China

IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS), 2010

DOI:10.1109/ICPADS.2010.64

@inproceedings{cui2010auto,
  title={Auto-tuning Dense Matrix Multiplication for GPGPU with Cache},
  author={Cui, X. and Chen, Y. and Zhang, C. and Mei, H.},
  booktitle={2010 IEEE 16th International Conference on Parallel and Distributed Systems},
  pages={237--242},
  year={2010},
  organization={IEEE}
}


In this paper we discuss our experience in improving the performance of GEMM (both single and double precision) on the Fermi architecture using CUDA, and how new Fermi features such as cache affect performance. We find that the addition of cache to the GPU on the one hand helps the processors exploit data locality that arises at runtime, but on the other hand makes the dependence of performance on algorithmic parameters less predictable. Auto-tuning then becomes a useful technique for addressing this issue. Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops respectively on a Tesla C2050. The design and implementation use only CUDA and C and have not benefited from tuning at the level of binary code.
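As a rough illustration of the auto-tuning idea in the abstract, the sketch below times a tiled matrix multiplication for several candidate tile sizes and keeps the fastest. It is a CPU analogue in plain C, not the authors' CUDA tuner. The function names (gemm_tiled, gemm_gflops, autotune_tile) and the candidate tile sizes are hypothetical. The throughput figure uses the conventional count of 2n^3 floating-point operations for an n x n x n GEMM; the abstract does not say how its GFlops numbers were computed, so that is an assumption as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* C += A * B for row-major n x n matrices, processed in tile x tile blocks.
 * Hypothetical helper: the tile size is the single tunable parameter here. */
void gemm_tiled(size_t n, size_t tile,
                const float *A, const float *B, float *C)
{
    assert(n > 0 && tile > 0);
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t kk = 0; kk < n; kk += tile)
            for (size_t jj = 0; jj < n; jj += tile) {
                /* Clamp block bounds so n need not be a multiple of tile. */
                size_t i_end = ii + tile < n ? ii + tile : n;
                size_t k_end = kk + tile < n ? kk + tile : n;
                size_t j_end = jj + tile < n ? jj + tile : n;
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}

/* Throughput in GFLOP/s, assuming the usual 2*n^3 operation count
 * (one multiply and one add per innermost iteration). */
double gemm_gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}

/* Time each candidate tile size once and return the fastest one.
 * Returns 0 if allocation fails or no candidates are given. */
size_t autotune_tile(size_t n, const size_t *candidates, size_t count,
                     double *best_gflops)
{
    float *A = malloc(n * n * sizeof *A);
    float *B = malloc(n * n * sizeof *B);
    float *C = malloc(n * n * sizeof *C);
    size_t best = 0;
    double best_time = 0.0;

    if (A && B && C) {
        for (size_t i = 0; i < n * n; ++i) {
            A[i] = 1.0f;
            B[i] = 2.0f;
        }
        for (size_t c = 0; c < count; ++c) {
            for (size_t i = 0; i < n * n; ++i)
                C[i] = 0.0f;
            clock_t start = clock();
            gemm_tiled(n, candidates[c], A, B, C);
            double t = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (best == 0 || t < best_time) {
                best = candidates[c];
                best_time = t;
            }
        }
        if (best_gflops)
            *best_gflops = best_time > 0.0 ? gemm_gflops(n, best_time) : 0.0;
    }
    free(A);
    free(B);
    free(C);
    return best;
}
```

Calling autotune_tile(1024, sizes, count, &gflops) with an array such as {16, 32, 64} returns whichever tile size ran fastest on the current machine. Measuring rather than predicting is the point: once a cache sits between the code and memory, the best parameter is hard to derive analytically. A more careful tuner would repeat each measurement and search over more than one parameter.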

Tags: Computer science, CUDA, Linear Algebra, Matrix multiplication, nVidia, Tesla C2050

June 19, 2011 by hgpu
