Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark

S.J. Pennycook, S.D. Hammond, S.A. Jarvis, G.R. Mudalige

Performance Computing and Visualisation, Department of Computer Science, University of Warwick, UK

ACM SIGMETRICS Performance Evaluation Review, Volume 38 Issue 4, March 2011

DOI:10.1145/1964218.1964223

@article{pennycook2011performance,
  title={Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark},
  author={Pennycook, SJ and Hammond, SD and Jarvis, SA and Mudalige, GR},
  journal={ACM SIGMETRICS Performance Evaluation Review},
  volume={38},
  number={4},
  pages={23--29},
  year={2011},
  publisher={ACM}
}

The emergence of Graphics Processing Units (GPUs) as a potential alternative to conventional general-purpose processors has led to significant interest in these architectures from both the academic community and the High Performance Computing (HPC) industry. While GPUs look likely to deliver unparalleled levels of performance, published studies claiming performance improvements in excess of 30,000x are misleading. Significant on-node performance improvements have been demonstrated for code kernels and algorithms amenable to GPU acceleration; however, studies demonstrating comparable results for full scientific applications requiring multiple-GPU architectures are rare. In this paper we present an analysis of a port of the NAS-LU benchmark to NVIDIA’s Compute Unified Device Architecture (CUDA), the most stable GPU programming model currently available. Our solution is also extended to multiple nodes and multiple GPU devices. Runtime performance on several GPUs is presented, ranging from low-end, consumer-grade cards such as the 8400GS to NVIDIA’s flagship Fermi HPC processor found in the recently released C2050. We compare the runtimes of these devices with those of several processors, including those from Intel, AMD and IBM. In addition, we utilise a recently developed performance model of LU to predict the runtime performance of LU on large-scale distributed GPU clusters, which are expected to become commonplace in future high-end HPC architectural solutions.

Tags: Benchmarking, Computer science, CUDA, MPI, nVidia, Performance, Tesla C1060, Tesla C2050

May 25, 2011 by hgpu
