Evaluating the Performance of Legacy Applications on Emerging Parallel Architectures

Simon John Pennycook

The University of Warwick

The University of Warwick, 2012

Source code package: NAS-LU Ports (CUDA)

The gap between a supercomputer’s theoretical maximum ("peak") floating-point performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5-20% of any given machine’s peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern "accelerator" architectures: collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the small number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way to map a parallel workload to these new architectures is unclear. The principal focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes, the LU benchmark from the NAS Parallel Benchmark Suite and Sandia’s miniMD benchmark, which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance-portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly tuned for particular hardware. Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures.

Tags: Algorithms, ATI, ATI FirePro V7800, Benchmarking, Computer science, CUDA, MPI, nVidia, nVidia GeForce 8400 GS, nVidia GeForce 9800 GT, nVidia GeForce GTX 680, OpenCL, Performance, Tesla C1060, Tesla C2050, Thesis

May 21, 2013 by hgpu
