Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli

Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, 02115, USA

IEEE Transactions on Parallel and Distributed Systems, January 2011 (vol. 22 no. 1), pp. 105-118

DOI:10.1109/TPDS.2010.107

The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4x and 13.5x over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
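As a rough illustration of what a data transformation aimed at memory access patterns can look like, the sketch below rewrites a strided access as a contiguous one by changing the data layout from interleaved fields (array of structures) to one array per field (structure of arrays). This is a generic, CPU-side illustration and is not taken from the paper; the helper names `aos_to_soa` and `access_stride` and the sample data are assumptions made for this example.

```python
# Illustrative sketch only (not the paper's method): a strided loop access
# becomes a contiguous one after a layout transformation.

def aos_to_soa(aos, fields):
    """Split an interleaved list [x0, y0, z0, x1, y1, z1, ...] into one
    contiguous list per field: [[x0, x1, ...], [y0, y1, ...], ...]."""
    if len(aos) % fields != 0:
        raise ValueError("array length must be a multiple of the field count")
    return [aos[f::fields] for f in range(fields)]


def access_stride(indices):
    """Return the constant stride of an index sequence, or None if the
    sequence is irregular or too short to have a stride."""
    strides = {b - a for a, b in zip(indices, indices[1:])}
    return strides.pop() if len(strides) == 1 else None


# Hypothetical sample data: x, y, z components interleaved.
points = [1.0, 10.0, 100.0, 2.0, 20.0, 200.0, 3.0, 30.0, 300.0]
fields = 3

# Before: a loop that reads every x component touches indices 0, 3, 6.
before = [i * fields for i in range(len(points) // fields)]

# After: the same loop over the transformed layout touches indices 0, 1, 2.
xs, ys, zs = aos_to_soa(points, fields)
after = list(range(len(xs)))
```

Here `access_stride(before)` reports a stride equal to the number of fields, while `access_stride(after)` reports unit stride. Unit-stride accesses are generally the pattern that vector loads and coalesced memory transactions favor, which is the kind of property a layout transformation tries to establish.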

Tags: ATI, ATI Radeon HD 3870, Brook, Computer science, CUDA, Data parallelism, Memory model, nVidia, nVidia GeForce GTX 285, Review

June 17, 2011 by hgpu
