Optimized HPL for AMD GPU and multi-core CPU usage

Matthias Bach, Matthias Kretz, Volker Lindenstruth, David Rohr
Frankfurt Institute for Advanced Studies, Ruth-Moufang-Strasse 1, 60438 Frankfurt am Main, Germany
Computer Science – Research and Development (12 April 2011), pp. 1-12





The installation of the LOEWE-CSC (http://csc.uni-frankfurt.de/csc/?51) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation that can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. The DGEMM library is tuned to hide all DMA transfer times and thus maximize the GPU load. A work-stealing scheduler was implemented to add the remaining CPU resources to the DGEMM. On the GPU, the DGEMM achieves 497 GFlop/s (90.9% of the theoretical peak). Combined with the 24-core Magny-Cours CPUs, 623 GFlop/s (83.6% of the peak) are achieved. The HPL (http://www.netlib.org/benchmark/hpl/algorithm.html) benchmark was modified to perform well with one MPI process per node. The modifications include multi-threading, vectorization, use of the GPU DGEMM, cache optimizations, and a new lookahead algorithm. A Linpack performance of 70% of the theoretical peak is achieved, and this performance scales linearly to hundreds of nodes.
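
The abstract does not describe how the work-stealing scheduler is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the product is split into row tiles held in one shared queue, with one worker taking tiles from the front and a second worker stealing from the back. Both workers here are ordinary Python threads standing in for the GPU and the CPU cores, and all names (`dgemm_work_stealing`, `matmul_rows`, `tile_rows`) are hypothetical.

```python
import threading
from collections import deque


def matmul_rows(A, B, rows):
    """Plain-Python product of the selected rows of A with B."""
    inner = range(len(B))
    cols = range(len(B[0]))
    return {i: [sum(A[i][k] * B[k][j] for k in inner) for j in cols]
            for i in rows}


def dgemm_work_stealing(A, B, tile_rows=2):
    """Compute A*B by letting two workers drain one shared tile queue."""
    n = len(A)
    tiles = deque(range(0, n, tile_rows))  # each item is a tile's first row
    lock = threading.Lock()
    C = [None] * n
    done = {"gpu": 0, "cpu": 0}  # tiles finished per (stand-in) device

    def worker(name, take):
        while True:
            with lock:
                if not tiles:
                    return
                start = take()  # front for "gpu", back for "cpu"
            rows = range(start, min(start + tile_rows, n))
            for i, row in matmul_rows(A, B, rows).items():
                C[i] = row  # tiles are disjoint, so rows never collide
            with lock:
                done[name] += 1

    threads = [
        threading.Thread(target=worker, args=("gpu", tiles.popleft)),
        threading.Thread(target=worker, args=("cpu", tiles.pop)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, done
```

Because neither worker is assigned a fixed share in advance, whichever one finishes its tile first simply takes the next one, so the split adapts to the relative speed of the two devices. The result matrix is the same regardless of which worker processed which tile; only the per-worker counts in `done` vary between runs.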