NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman
Department of Computer Science, University of Houston, Houston TX, 77004 USA
27th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2014), 2014

@inproceedings{xu2014parallel,
   title={NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model},
   author={Xu, Rengan and Tian, Xiaonan and Chandrasekaran, Sunita and Yan, Yonghong and Chapman, Barbara},
   booktitle={27th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2014)},
   year={2014}
}

The broad adoption of accelerators has boosted interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted rapidly, but it is proprietary and applicable only to NVIDIA GPUs, and the difficulty of writing efficient CUDA code has motivated higher-level programming approaches such as OpenACC. Directive-based programming models such as OpenMP and OpenACC let programmers rapidly prototype applications by adding annotations that guide compiler optimizations. In this paper we study the effectiveness of a high-level directive-based programming model, OpenACC, for parallelizing the NAS Parallel Benchmarks (NPB) on GPGPUs. We apply techniques such as array privatization, memory coalescing, and cache optimization, and examine their impact on benchmark performance. The right choice or combination of techniques and hints is crucial for compilers to generate highly efficient code tuned to a particular type of accelerator; a poorly chosen combination can degrade performance. We also propose a new clause, "scan", that handles scan operations for arbitrary input array sizes. We hope that the practices discussed in this paper will help users migrate their sequential or CPU-parallel codes to GPGPU architectures and achieve high performance.
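
To make the directive-based approach concrete, the following is a minimal OpenACC sketch in C. It is not code from the paper or from NPB; the routine name, array names, and sizes are assumed for illustration. It shows two of the techniques named in the abstract: the private clause gives each gang its own copy of a scratch array (array privatization), and the inner vector loops iterate over the contiguous index so that adjacent lanes access adjacent memory locations (memory coalescing).

#include <stdio.h>

#define N 1024
#define M 1024

/* Hypothetical example (not from the paper or NPB): each row is processed
 * with a per-row scratch buffer 'tmp'.  The private clause gives every
 * gang its own copy of tmp (array privatization), and the inner vector
 * loops run over the contiguous index j, so neighbouring vector lanes
 * touch neighbouring memory locations (memory coalescing). */
void smooth(double out[N][M], double in[N][M])
{
    double tmp[M];                            /* scratch buffer, reused per row */

    #pragma acc parallel loop gang private(tmp) \
                copyout(out[0:N][0:M]) copyin(in[0:N][0:M])
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < M; ++j)
            tmp[j] = 2.0 * in[i][j];          /* fill the private scratch array */

        #pragma acc loop vector
        for (int j = 0; j < M; ++j)
            out[i][j] = tmp[j];               /* unit-stride (coalesced) writes */
    }
}

int main(void)
{
    static double in[N][M], out[N][M];

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            in[i][j] = (double)(i + j);

    smooth(out, in);
    printf("out[1][1] = %f\n", out[1][1]);
    return 0;
}

Without the private clause, all gangs would share a single tmp buffer and race on it while processing different rows concurrently; privatizing it trades a small amount of extra memory per gang for correctness. The same pattern arises frequently when porting CPU-parallel loops that reuse per-iteration scratch storage, which is one of the situations the paper's array-privatization discussion addresses.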