high performance computing on graphics processing units: hgpu.org

Data transformations enabling loop vectorization on multithreaded data parallel architectures

Byunghyun Jang, Perhaad Mistry, Dana Schaa, Rodrigo Dominguez, David Kaeli

Department of ECE, Northeastern University, Boston, MA 02115 USA

In PPoPP ’10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2010), pp. 353-354.

DOI:10.1145/1693453.1693510

Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.
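
To make the idea in the abstract concrete, the following is a minimal sketch of one kind of data transformation: turning a non-unit-stride access pattern into a contiguous one so the loop can be processed in vector-width chunks. It is an illustration only, not the mathematical model or the code from the paper. The function names, the `stride`/`offset` description of the access pattern, and the vector width of 4 are all assumptions made for this example.

```python
# Illustrative sketch only. The stride/offset model, the function names,
# and the vector width are assumptions, not the paper's formulation.

def strided_loop(a, stride, offset, n):
    """Original loop: iteration i reads a[stride*i + offset] (non-unit stride)."""
    return [2.0 * a[stride * i + offset] for i in range(n)]


def to_unit_stride(a, stride, offset, n):
    """Data transformation: copy the n strided elements into a contiguous list."""
    return a[offset : offset + stride * n : stride]


def vectorized_loop(b, width=4):
    """Transformed loop: unit-stride reads handled in chunks of `width` lanes."""
    out = []
    for start in range(0, len(b), width):
        lanes = b[start : start + width]     # one contiguous "vector load"
        out.extend(2.0 * x for x in lanes)   # same operation applied to each lane
    return out


a = [float(v) for v in range(16)]
stride, offset, n = 4, 1, 4

b = to_unit_stride(a, stride, offset, n)     # elements a[1], a[5], a[9], a[13]
result = vectorized_loop(b)

# The transformed loop computes the same values as the original strided loop.
assert result == strided_loop(a, stride, offset, n)
```

The copy performed by `to_unit_stride` is a single pass over the data. In this sketch it stands in for the transformation overhead that the abstract says is amortized as the input grows.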

Tags: ATI, ATI Radeon HD 3870, ATI Stream, Brook, Code generation, Compilers, Computer science, Optimization

January 7, 2011 by hgpu
