Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems

Shuai Che, Jeremy W. Sheaffer, Kevin Skadron

Department of Computer Science, University of Virginia

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11), 2011

@article{che2011dymaxion,
  title={Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems},
  author={Che, S. and Sheaffer, J.W. and Skadron, K.},
  year={2011}
}

Graphics processors (GPUs) have emerged as an important platform for general-purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve a 3.3x speedup on GPU kernels and a 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
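To make "memory layout remapping and index transformation" concrete, here is a minimal host-side C++ sketch of the general idea. It is not the Dymaxion API: the function names `remap_row_to_column_major` and `transformed_index` are hypothetical, and the row-major to column-major conversion is only one assumed example of a remapping. The motivation is that when each GPU thread processes one row of a matrix, a column-major layout places the elements that neighboring threads read at the same step next to each other in memory, which allows the accesses to be coalesced.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Dymaxion or CUDA): copies a row-major
// rows x cols matrix into a new buffer stored in column-major order.
std::vector<float> remap_row_to_column_major(const std::vector<float>& src,
                                             std::size_t rows,
                                             std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Element (r, c) moves from offset r*cols + c to offset c*rows + r.
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    return dst;
}

// Hypothetical index transformation: given the logical coordinates (r, c),
// returns the offset of that element in the remapped (column-major) buffer.
// Code that used to compute r*cols + c would call this instead.
inline std::size_t transformed_index(std::size_t r, std::size_t c,
                                     std::size_t rows) {
    return c * rows + r;
}
```

With this pairing, code keeps addressing elements by their logical coordinates while the physical layout changes underneath. The sketch does not attempt the chunking and overlap with PCI-E transfer that the abstract describes for hiding the remapping cost.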

Tags: Computer science, CUDA, Heterogeneous systems, Memory, nVidia, nVidia GeForce GTX 285, nVidia GeForce GTX 480, Optimization, Performance

October 15, 2011 by hgpu
