high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

I-Jui Sung, John A. Stratton, Wen-Mei W. Hwu

Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign

Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) 2010, Vienna, Austria, September 11-15, 2010

DOI:10.1145/1854273.1854336

BibTeX

Download (PDF)

View

Source

1902

views

We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we have enabled automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also present how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to 560% performance increases over the languagedefined layout, and a 7% performance gain in the worst case, in which the language-defined layout and access pattern is already well-vectorizable by the underlying hardware.

Tags: Compilers, Computer science, CUDA, Data Structures and Algorithms, nVidia, nVidia GeForce GTX 280

February 20, 2011 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

Share this:

Recent source codes

Most viewed papers (last 30 days)