high performance computing on graphics processing units: hgpu.org

Optimized HPL for AMD GPU and multi-core CPU usage

Matthias Bach, Matthias Kretz, Volker Lindenstruth, David Rohr

Frankfurt Institute for Advanced Studies, Ruth-Moufang-Strasse 1, 60438 Frankfurt am Main, Germany

Computer Science – Research and Development (12 April 2011), pp. 1-12

DOI:10.1007/s00450-011-0161-5

@article{bachoptimized,
  title={Optimized HPL for AMD GPU and multi-core CPU usage},
  author={Bach, M. and Kretz, M. and Lindenstruth, V. and Rohr, D.},
  journal={Computer Science-Research and Development},
  pages={1--12},
  issn={1865-2034},
  publisher={Springer}
}

The installation of the LOEWE-CSC (http://csc.uni-frankfurt.de/csc/?51) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation that can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. The DGEMM library is tuned to hide all DMA transfer times and thus maximize the GPU load. A work-stealing scheduler was implemented to add the remaining CPU resources to the DGEMM. On the GPU, the DGEMM achieves 497 GFlop/s (90.9% of the theoretical peak). Combined with the 24-core Magny-Cours CPUs, 623 GFlop/s (83.6% of the peak) are achieved. The HPL (http://www.netlib.org/benchmark/hpl/algorithm.html) benchmark was modified to perform well with one MPI process per node. The modifications include multi-threading, vectorization, use of the GPU DGEMM, cache optimizations, and a new Lookahead algorithm. A Linpack performance of 70% of the theoretical peak is achieved, and this performance scales linearly to hundreds of nodes.
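
The efficiency figures quoted in the abstract can be turned back into approximate peak rates with simple arithmetic. The following is a minimal sketch, not taken from the paper: the helper name `implied_peak` is hypothetical, the inputs are the rounded values from the abstract, and the CPU share is inferred under the assumption that the combined peak is the sum of the GPU and CPU peaks.

```python
def implied_peak(achieved_gflops: float, efficiency_percent: float) -> float:
    """Return the theoretical peak (GFlop/s) implied by an achieved rate
    and the fraction of peak it represents (hypothetical helper)."""
    if efficiency_percent <= 0:
        raise ValueError("efficiency must be positive")
    return achieved_gflops / (efficiency_percent / 100.0)


# Rounded figures quoted in the abstract.
gpu_peak = implied_peak(497.0, 90.9)    # GPU-only DGEMM
node_peak = implied_peak(623.0, 83.6)   # GPU plus 24 CPU cores

# Assumption: combined peak = GPU peak + CPU peak, so the difference
# approximates the CPU contribution to the theoretical peak.
cpu_peak = node_peak - gpu_peak

print(f"implied GPU peak:  {gpu_peak:.1f} GFlop/s")
print(f"implied node peak: {node_peak:.1f} GFlop/s")
print(f"implied CPU peak:  {cpu_peak:.1f} GFlop/s")
```

Because the inputs are rounded, the results are only estimates. They suggest that the GPU supplies roughly three quarters of the node's theoretical double-precision peak, which would be consistent with the abstract's emphasis on keeping the GPU fully loaded.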

Tags: ATI, ATI Radeon HD 5870, Computer science, GPU cluster, Heterogeneous systems, Linear Algebra, MPI, Performance

April 25, 2011 by hgpu
