high performance computing on graphics processing units: hgpu.org

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Mayank Daga, Thomas Scogland, Wu-chun Feng

Department of Computer Science, Virginia Tech, USA

17th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2011

@article{daga2011architecture,
  title={Architecture-Aware Mapping and Optimization on a 1600-Core GPU},
  author={Daga, M. and Scogland, T. and Feng, W.},
  year={2011}
}

The graphics processing unit (GPU) continues to make inroads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task; it is a multi-dimensional problem that requires deep technical knowledge of GPU architecture. Although substantial literature exists on how to map and optimize GPU performance on the more mature NVIDIA CUDA architecture, the opposite is true for OpenCL on an AMD GPU, such as the 1600-core AMD Radeon HD 5870. Consequently, we present and evaluate architecture-aware mapping and optimizations for the AMD GPU, the most prominent of which include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU mapping and optimizations by applying each in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific GPU optimizations, our optimized OpenCL implementation on an AMD Radeon HD 5870 delivers more than a fourfold improvement in performance over the basic OpenCL implementation. In addition, it outperforms our optimized CUDA version on an NVIDIA GTX280 by 12%. Overall, we achieve a speedup of 371-fold over a serial but hand-tuned SSE version of our molecular modeling application and, in turn, a 46-fold speedup over ideal scaling on an 8-core CPU.
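To make three of the optimizations named in the abstract more concrete, here is a minimal sketch in plain C rather than OpenCL. It is not the authors' GEM code: the function names, the cutoff test, and the q/r arrays are hypothetical and chosen only for illustration. The second function processes four elements per iteration (a rough analog of an OpenCL float4 vector type), keeps its partial sums in local variables (the kind of values a compiler can hold in registers), and replaces the data-dependent branch with a 0/1 mask. It assumes n is a multiple of 4 and that every r[i] is finite and nonzero.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one element per iteration, with a data-dependent branch. */
float sum_within_cutoff_branchy(const float *q, const float *r,
                                size_t n, float cutoff)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (r[i] < cutoff) {
            acc += q[i] / r[i];
        }
    }
    return acc;
}

/* Sketch of the rewritten loop: four elements per iteration, the branch
 * replaced by a 0/1 mask, and four partial sums held in local variables.
 * Assumption: n is a multiple of 4 and every r[i] is finite and nonzero. */
float sum_within_cutoff_unrolled(const float *q, const float *r,
                                 size_t n, float cutoff)
{
    assert(n % 4 == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        /* (r < cutoff) evaluates to 1 or 0, so excluded terms add zero. */
        a0 += (float)(r[i]     < cutoff) * (q[i]     / r[i]);
        a1 += (float)(r[i + 1] < cutoff) * (q[i + 1] / r[i + 1]);
        a2 += (float)(r[i + 2] < cutoff) * (q[i + 2] / r[i + 2]);
        a3 += (float)(r[i + 3] < cutoff) * (q[i + 3] / r[i + 3]);
    }
    return (a0 + a1) + (a2 + a3);
}
```

Both functions compute the same quantity, but the second one sums in a different order, so results can differ in the last bits for general floating-point inputs. Whether such a rewrite pays off depends on the target hardware; the abstract reports the measured effect of the corresponding OpenCL optimizations only for the AMD Radeon HD 5870.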

Tags: ATI, ATI Radeon HD 5870, Computer science, CUDA, Molecular modeling, nVidia, nVidia GeForce GTX 280, OpenCL, Optimization, Performance

January 3, 2012 by hgpu
