high performance computing on graphics processing units: hgpu.org

Optimized HPL for AMD GPU and multi-core CPU usage

Matthias Bach, Matthias Kretz, Volker Lindenstruth, David Rohr

Frankfurt Institute for Advanced Studies, Ruth-Moufang-Strasse 1, 60438 Frankfurt am Main, Germany

Computer Science – Research and Development (12 April 2011), pp. 1-12

DOI:10.1007/s00450-011-0161-5

The installation of the LOEWE-CSC (http://csc.uni-frankfurt.de/csc/?51) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation that can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. The DGEMM library is tuned to hide all DMA transfer times and thus maximize the GPU load. A work-stealing scheduler was implemented to add the remaining CPU resources to the DGEMM. On the GPU, the DGEMM achieves 497 GFlop/s (90.9% of the theoretical peak). Combined with the 24-core Magny-Cours CPUs, 623 GFlop/s (83.6% of the peak) are achieved. The HPL (http://www.netlib.org/benchmark/hpl/algorithm.html) benchmark was modified to perform well with one MPI process per node. The modifications include multi-threading, vectorization, use of the GPU DGEMM, cache optimizations, and a new lookahead algorithm. A Linpack performance of 70% of the theoretical peak is achieved, and this performance scales linearly to hundreds of nodes.
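
The abstract does not describe how the work-stealing scheduler is built, so the following is only a generic sketch of the idea, not the authors' implementation. It splits C = A·B into row tiles, hands most of them to one worker and the rest to another, and lets whichever worker runs out first steal tiles from the back of the other's queue. The names `work_stealing_dgemm`, `tile_rows`, and `gpu_share` are hypothetical, the "gpu" and "cpu" workers are ordinary Python threads, and the 75/25 initial split is an arbitrary assumption.

```python
import threading
from collections import deque


def matmul_rows(A, B, rows):
    """Compute the requested rows of A @ B using plain Python lists."""
    inner, cols = range(len(B)), range(len(B[0]))
    return {i: [sum(A[i][k] * B[k][j] for k in inner) for j in cols]
            for i in rows}


def work_stealing_dgemm(A, B, tile_rows=2, gpu_share=0.75):
    """Multiply A @ B by sharing row tiles between two workers.

    "gpu" and "cpu" are only labels for two threads; gpu_share is a
    hypothetical initial split, not a value taken from the paper.
    """
    n = len(A)
    tiles = [range(s, min(s + tile_rows, n)) for s in range(0, n, tile_rows)]
    split = int(len(tiles) * gpu_share)
    queues = {"gpu": deque(tiles[:split]), "cpu": deque(tiles[split:])}
    lock = threading.Lock()
    C = [None] * n
    done_by = {"gpu": 0, "cpu": 0}

    def next_tile(me, other):
        with lock:
            if queues[me]:
                return queues[me].popleft()  # own work: take from the front
            if queues[other]:
                return queues[other].pop()   # idle: steal from the back
            return None

    def worker(me, other):
        while True:
            tile = next_tile(me, other)
            if tile is None:
                return
            for i, row in matmul_rows(A, B, tile).items():
                C[i] = row                   # tiles cover disjoint rows
            done_by[me] += 1                 # each worker updates its own key

    threads = [threading.Thread(target=worker, args=pair)
               for pair in (("gpu", "cpu"), ("cpu", "gpu"))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, done_by
```

Calling `work_stealing_dgemm(A, B)` returns the product together with a count of how many tiles each worker processed. The count varies from run to run, but the product does not, because every tile is taken from a queue exactly once. Stealing from the opposite end of the victim's queue is a common convention that keeps the two workers from contending for the same tile.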

Tags: ATI, ATI Radeon HD 5870, Computer science, GPU cluster, Heterogeneous systems, Linear Algebra, MPI, Performance

April 25, 2011 by hgpu