high performance computing on graphics processing units: hgpu.org


targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance

Alan Gray, Kevin Stratford

EPCC, The University of Edinburgh, Edinburgh EH9 3JZ, UK

16th IEEE International Conference on High Performance Computing and Communications (HPCC), 2014

@article{gray2014targetdp,
  title={targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance},
  author={Gray, Alan and Stratford, Kevin},
  year={2014}
}


To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware. From an application developer’s perspective, it is also important that code can be maintained in a portable manner across a range of hardware. Here we present targetDP, a lightweight programming layer that allows the abstraction of data parallelism for applications that employ structured grids. A single source code may be used to target both thread level parallelism (TLP) and instruction level parallelism (ILP) on either SIMD multi-core CPUs or GPU-accelerated platforms. targetDP is implemented via standard C preprocessor macros and library functions, can be added to existing applications incrementally, and can be combined with higher-level paradigms such as MPI. We present CPU and GPU performance results for a benchmark taken from the lattice Boltzmann application that motivated this work. These demonstrate not only performance portability, but also the improved optimisation resulting from the intelligent exposure of ILP.
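The idea of expressing lattice parallelism through preprocessor macros can be illustrated with a minimal C sketch. The macro names `SITE_TLP` and `SITE_ILP`, the constant `VVL`, and the function `scale_field` are hypothetical names chosen for this illustration and should not be read as the targetDP API. The sketch assumes a CPU build in which the outer loop strides over lattice sites in fixed-size chunks (where thread-level parallelism could be applied) and the inner loop has a short, compile-time length that a compiler may vectorise (exposing instruction-level parallelism). It also assumes the number of sites is a multiple of the chunk size.

```c
#include <assert.h>

/* Hypothetical names for illustration only; not the targetDP API. */

/* Assumed chunk size: number of lattice sites handled per inner loop. */
#define VVL 4

/* Outer loop over chunks of sites. A CPU build could place a threading
 * directive here; a GPU build could instead derive `base` from a thread
 * index. Only the plain serial CPU form is shown. */
#define SITE_TLP(base, n) for ((base) = 0; (base) < (n); (base) += VVL)

/* Inner loop of fixed length VVL, a candidate for SIMD vectorisation. */
#define SITE_ILP(v) for ((v) = 0; (v) < VVL; (v)++)

/* Multiply every site of a scalar lattice field by a constant. */
void scale_field(double *field, int nsites, double a)
{
    int base, v;

    /* The sketch does not handle a partial final chunk. */
    assert(nsites % VVL == 0);

    SITE_TLP(base, nsites) {
        SITE_ILP(v) {
            field[base + v] *= a;
        }
    }
}
```

Calling `scale_field(f, 8, 2.0)` on an 8-element array doubles each element. Because the loop structure lives in the macros, the same kernel body could in principle be recompiled for a different target by changing only the macro definitions, which is the kind of single-source portability the abstract describes.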

Tags: CUDA, Data parallelism, Fluid dynamics, Intel Xeon Phi, Lattice Boltzmann model, nVidia, Tesla K40

May 20, 2014 by hgpu
