Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Ahmad Abdelfattah, Jack Dongarra, David Keyes, Hatem Ltaief
KAUST Division of Mathematical and Computer Sciences and Engineering, Thuwal, Saudi Arabia
10th International Meeting on High-Performance Computing for Computational Science (VECPAR 2012), 2012









Hardware accelerators are becoming ubiquitous in high-performance scientific computing. They can deliver an unprecedented number of concurrent execution contexts. High-level programming languages (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on nVidia Fermi GPUs. Because of its inherently memory-bound nature, this kernel is critical to the performance of the tridiagonalization of a symmetric dense matrix, which is a preprocessing step for calculating the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the corresponding CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively.
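To show why the symmetric matrix-vector product (SYMV) is memory-bound and why its accesses are irregular, here is a minimal serial sketch of y := alpha*A*x + beta*y. It is not the authors' Fermi kernel. It is modeled on the reference BLAS DSYMV loop with uplo='L', and the function name `symv_lower` and the flat column-major list layout are assumptions made for illustration. Only the lower triangle is stored and read, so each off-diagonal element is loaded once but used twice: once as A[i][j] in a column-wise update of y[i] and once as A[j][i] in a row-wise dot product. That gives roughly 2n² flops for about n²/2 loaded elements, or about 0.5 flop/byte in double precision (1 flop/byte in single), far below what a GPU's arithmetic units could sustain.

```python
def symv_lower(n, alpha, a, lda, x, beta, y):
    """Hypothetical helper: return alpha*A*x + beta*y for a symmetric n-by-n A.

    A is stored column-major in the flat list ``a`` with leading dimension
    ``lda``; only the lower triangle (i >= j) is ever read, mirroring the
    reference BLAS ?SYMV loop with uplo='L'. This is a serial illustration,
    not a GPU kernel.
    """
    # Scale y by beta; beta == 0 overwrites y so stale values (even NaN) vanish.
    y = [0.0] * n if beta == 0.0 else [beta * yi for yi in y]
    for j in range(n):
        temp1 = alpha * x[j]      # scales column j for the column-wise update
        temp2 = 0.0               # accumulates row j (the mirrored upper part)
        y[j] += temp1 * a[j + j * lda]   # diagonal element, used once
        for i in range(j + 1, n):
            aij = a[i + j * lda]  # each off-diagonal element is loaded once...
            y[i] += temp1 * aij   # ...used as A[i][j] (column access)
            temp2 += aij * x[i]   # ...and as A[j][i] (row access)
        y[j] += alpha * temp2
    return y


# Symmetric A = [[4,1,2],[1,3,0],[2,0,5]] in column-major order; the upper
# triangle holds -999.0 to show that those entries are never read.
a = [4.0, 1.0, 2.0,
     -999.0, 3.0, 0.0,
     -999.0, -999.0, 5.0]
print(symv_lower(3, 1.0, a, 3, [1.0, 2.0, 3.0], 0.0, [7.0, 7.0, 7.0]))
# -> [12.0, 7.0, 17.0]
```

In a parallel version, the column-wise updates to y[i] coming from different columns j must be combined, for example through a separate reduction step. The abstract does not describe how the paper's kernel organizes this, so the sketch only illustrates the access pattern, not the optimization.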
