Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Feng Wang, Can-Qun Yang, Yun-Fei Du, Juan Chen, Hui-Zhan Yi, Wei-Xia Xu
School of Computer Science, National University of Defense Technology, Changsha 410073, China
Journal of Computer Science and Technology, Volume 26, Number 5, 854-865, 2011


   author={Wang, Feng and Yang, Can-Qun and Du, Yun-Fei and Chen, Juan and Yi, Hui-Zhan and Xu, Wei-Xia},
   affiliation={School of Computer Science, National University of Defense Technology, Changsha, 410073 China},
   title={Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer},
   journal={Journal of Computer Science and Technology},
   publisher={Springer Boston},
   year={2011},
   volume={26},
   number={5},
   pages={854--865},
   keyword={Computer Science},




In this paper we present the implementation of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer of China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model combining MPI, OpenMP and streaming computing is described to exploit the task, thread and data parallelism in Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique that hides the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained using the vendor's library. On the full configuration of TianHe-1 our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
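The abstract mentions a two-level adaptive method for distributing the workload between CPUs and GPUs. The paper's exact scheme is not reproduced here; the following is a minimal sketch of the general idea behind such adaptive splitting — adjusting the GPU's share of the trailing-matrix update so that the CPU and GPU parts of each step finish at roughly the same time. The function name, damping factor, and clamping bounds are illustrative assumptions, not details from the paper:

```python
def update_gpu_share(share, t_cpu, t_gpu, damping=0.5):
    """Adjust the fraction of work assigned to the GPU so that, on the
    next iteration, the CPU and GPU parts finish at about the same time.

    share : current fraction of the work assigned to the GPU (0 < share < 1)
    t_cpu : measured time the CPU took for its (1 - share) fraction
    t_gpu : measured time the GPU took for its share
    """
    # Per-unit-of-work speeds inferred from the last measurement.
    speed_gpu = share / t_gpu
    speed_cpu = (1.0 - share) / t_cpu
    # The share that would equalize finish times at those speeds.
    balanced = speed_gpu / (speed_gpu + speed_cpu)
    # Damping avoids oscillation when the timings are noisy.
    new_share = share + damping * (balanced - share)
    # Keep both devices busy by never assigning everything to one side.
    return min(max(new_share, 0.05), 0.95)

# Example: the GPU got 50% of the work but finished 3x faster than the
# CPU, so the next step shifts more work toward the GPU.
s = update_gpu_share(0.5, t_cpu=3.0, t_gpu=1.0)  # s == 0.625
```

Repeating this adjustment each factorization step lets the split converge toward the actual CPU/GPU performance ratio without any offline tuning.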
