Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark

S.J. Pennycook, S.D. Hammond, S.A. Jarvis, G.R. Mudalige
Performance Computing and Visualisation, Department of Computer Science, University of Warwick, UK
ACM SIGMETRICS Performance Evaluation Review, Volume 38 Issue 4, March 2011


@article{pennycook2011nas-lu,
   title={Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark},
   author={Pennycook, S.J. and Hammond, S.D. and Jarvis, S.A. and Mudalige, G.R.},
   journal={ACM SIGMETRICS Performance Evaluation Review},
   volume={38},
   number={4},
   year={2011}
}



The emergence of Graphics Processing Units (GPUs) as a potential alternative to conventional general-purpose processors has led to significant interest in these architectures from both the academic community and the High Performance Computing (HPC) industry. While GPUs look likely to deliver unparalleled levels of performance, the publication of studies claiming performance improvements in excess of 30,000x is misleading. Significant on-node performance improvements have been demonstrated for code kernels and algorithms amenable to GPU acceleration; studies demonstrating comparable results for full scientific applications requiring multiple-GPU architectures are rare. In this paper we present an analysis of a port of the NAS-LU benchmark to NVIDIA’s Compute Unified Device Architecture (CUDA) – the most stable GPU programming model currently available. Our solution is also extended to multiple nodes and multiple GPU devices. Runtime performance on several GPUs is presented, ranging from low-end, consumer-grade cards such as the 8400GS to NVIDIA’s flagship Fermi HPC processor found in the recently released C2050. We compare the runtimes of these devices to those of several processors, including offerings from Intel, AMD and IBM. In addition, we utilise a recently developed performance model of LU to predict its runtime performance on large-scale distributed GPU clusters, which are expected to become commonplace in future high-end HPC architectural solutions.
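The kind of cluster-scale runtime prediction the abstract describes can be illustrated with a generic analytic model. The sketch below is a hypothetical example only, not the authors' performance model of LU: it assumes runtime decomposes into a compute term that shrinks as GPUs are added and a communication term that grows with device count. All names and constants (`predict_runtime`, the throughput and per-step communication figures) are illustrative assumptions.

```python
# Hypothetical analytic scaling model (NOT the authors' LU model):
#   t(P) = W / (P * r)  +  steps * c * (P - 1)
# where W is total work (FLOPs), r is per-GPU throughput, P is the
# number of GPUs, and c is an assumed per-step communication cost
# that is paid across each device boundary.

def predict_runtime(work_flops, device_gflops, num_devices,
                    per_step_comm_s=0.002, steps=250):
    """Return a predicted runtime in seconds for num_devices GPUs."""
    # Compute time: total work divided across devices.
    compute_s = work_flops / (device_gflops * 1e9 * num_devices)
    # Communication time: grows with the number of device boundaries.
    comm_s = steps * per_step_comm_s * max(num_devices - 1, 0)
    return compute_s + comm_s

if __name__ == "__main__":
    # Illustrative problem: 10 TFLOP of work, 500 GFLOP/s per GPU.
    for p in (1, 4, 16, 64):
        print(p, "GPUs:", round(predict_runtime(1e13, 500.0, p), 3), "s")
```

A model of this shape captures the qualitative behaviour such studies report: runtime initially falls as GPUs are added, then communication dominates and adding further devices slows the run.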

* * *


HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors
