high performance computing on graphics processing units: hgpu.org


Benchmarking GPUs to tune dense linear algebra

Vasily Volkov, James W. Demmel

Computer Science Division, University of California at Berkeley

In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing (2008), pp. 1-11.

DOI:10.1145/1413370.1413402

@conference{volkov2008benchmarking,
  title={Benchmarking GPUs to tune dense linear algebra},
  author={Volkov, V. and Demmel, J.W.},
  booktitle={Proceedings of the 2008 ACM/IEEE conference on Supercomputing},
  pages={1--11},
  year={2008},
  organization={IEEE Press}
}


We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor’s implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.
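The blocking idea and the Gflop/s figures mentioned in the abstract can be illustrated with a small CPU-side sketch. This is not the authors' GPU kernel: the function names (`naive_gemm`, `blocked_gemm`, `gemm_gflops`) and the tile size `block` are hypothetical, chosen only for illustration, and the rate calculation assumes the common convention of counting 2·m·n·k floating-point operations for an m×k by k×n product.

```python
def naive_gemm(A, B):
    """Reference C = A * B for matrices stored as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def blocked_gemm(A, B, block=2):
    """C = A * B computed tile by tile.

    `block` is a hypothetical tile size. Each (i0, j0) tile of C is
    accumulated from matching tiles of A and B, so a small working set
    is reused before moving on.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                # Multiply one tile of A by one tile of B into the C tile.
                for i in range(i0, min(i0 + block, m)):
                    row_a, row_c = A[i], C[i]
                    for p in range(p0, min(p0 + block, k)):
                        a, row_b = row_a[p], B[p]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += a * row_b[j]
    return C


def gemm_gflops(m, n, k, seconds):
    """Rate in Gflop/s, assuming 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9
```

For example, `blocked_gemm([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])` should agree with `naive_gemm` on the same inputs, and under the stated flop-count convention a 1000×1000×1000 multiply finishing in 2 seconds corresponds to `gemm_gflops(1000, 1000, 1000, 2.0)`, that is, 1 Gflop/s.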

Tags: BLAS, Computer science, CUDA, Linear Algebra, nVidia, nVidia GeForce 8600 GTS, nVidia GeForce 8800 GTX, nVidia GeForce 9800 GTX, nVidia GeForce GTX 280

November 4, 2010 by hgpu
