high performance computing on graphics processing units: hgpu.org

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

Rajib Nath, Stanimire Tomov, Tingxing "Tim" Dong, Jack Dongarra

Computer Science and Engineering, University of California, San Diego

ACM/IEEE Conference on Supercomputing (SC’11), 2011

Package: MAGMA

GPUs are excellent accelerators for data-parallel applications with regular data access patterns. Optimizing computations with irregular data access patterns on GPUs, however, is challenging. One such computation is the symmetric matrix-vector product (SYMV) in dense linear algebra. Optimizing the SYMV kernel is important because it underlies fundamental algorithms such as linear solvers and eigenvalue solvers for symmetric matrices. In this work, we present a new algorithm for optimizing the SYMV kernel on GPUs. In single precision, our optimized SYMV achieves up to a 7x speedup over the NVIDIA CUBLAS 4.0 library (the latest release) on the GTX 280 GPU. Our SYMV kernel tuned for the Fermi C2050 is 4.5x faster than CUBLAS 4.0 in single precision and 2x faster in double precision. Moreover, the techniques described in the paper are general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.
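To make the access pattern concrete, here is a minimal sequential sketch of SYMV that reads only the lower triangle of the matrix. It is not the algorithm from the paper, and the name `symv_lower` and the nested-list storage are assumptions made purely for illustration. The point it shows is that each stored off-diagonal element contributes to two entries of the result; the second update, to `y[j]`, is the scattered write that makes a straightforward one-thread-per-row GPU mapping awkward.

```python
def symv_lower(n, a, x):
    """Return y = A x for a symmetric n-by-n matrix A.

    Only the lower triangle of `a` (entries a[i][j] with j <= i) is read;
    the strict upper triangle is never touched. `a` is a list of rows.
    This is an illustrative sequential sketch, not a tuned kernel.
    """
    y = [0.0] * n
    for i in range(n):
        # The diagonal element contributes once.
        y[i] += a[i][i] * x[i]
        for j in range(i):
            aij = a[i][j]
            # Use the element in its stored position A[i][j] ...
            y[i] += aij * x[j]
            # ... and again as its mirror image A[j][i].
            y[j] += aij * x[i]
    return y


if __name__ == "__main__":
    # The upper triangle holds None to show that it is never read.
    a = [
        [2.0, None, None],
        [1.0, 3.0, None],
        [0.0, 4.0, 5.0],
    ]
    print(symv_lower(3, a, [1.0, 2.0, 3.0]))
```

Reading each element once and using it twice halves the memory traffic relative to a general matrix-vector product, which is why the symmetric case is attractive to optimize despite the less regular writes.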

Tags: Algorithms, Computer science, CUBLAS, CUDA, Linear Algebra, nVidia, nVidia GeForce GTX 280, Package, Performance, Tesla C2050

November 20, 2011 by hgpu
