Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems

Shuai Che, Jeremy W. Sheaffer, Kevin Skadron
Department of Computer Science, University of Virginia
Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11), 2011

@article{che2011dymaxion,
   title={Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems},
   author={Che, S. and Sheaffer, J.W. and Skadron, K.},
   year={2011}
}

Graphics processors (GPUs) have emerged as an important platform for general-purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA’s CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve 3.3x speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
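
To make the two ideas named in the abstract more concrete, here is a minimal host-side sketch of layout remapping and index transformation, using a row-major to column-major conversion as the example pattern. It is not Dymaxion's API: every function name below is hypothetical, and the code runs on the CPU in plain Python purely to show the index arithmetic. The last function hints at how remapping might be staged in chunks so that, on a real system, it could overlap with data transfer.

```python
# Illustrative sketch only; all names are hypothetical and are NOT Dymaxion's API.

def transform_index(r, c, rows):
    """Map a logical (row, col) coordinate to its offset in the column-major copy."""
    return c * rows + r


def remap_row_to_column_major(data, rows, cols):
    """Return a copy of a flat row-major array reordered to column-major."""
    out = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[transform_index(r, c, rows)] = data[r * cols + c]
    return out


def remap_in_chunks(data, rows, cols, chunk_rows):
    """Remap a few rows at a time, the way a chunked transfer could be staged."""
    out = [None] * (rows * cols)
    for start in range(0, rows, chunk_rows):
        stop = min(start + chunk_rows, rows)
        for r in range(start, stop):
            for c in range(cols):
                out[transform_index(r, c, rows)] = data[r * cols + c]
        # Assumption: in a GPU pipeline, work on one chunk could overlap
        # with the transfer of another chunk.
    return out


rows, cols = 2, 3
row_major = [0, 1, 2, 3, 4, 5]  # logical matrix [[0, 1, 2], [3, 4, 5]]
col_major = remap_row_to_column_major(row_major, rows, cols)
print(col_major)  # [0, 3, 1, 4, 2, 5]
```

After remapping, code that still thinks in logical (row, col) coordinates reads `col_major[transform_index(r, c, rows)]` instead of `row_major[r * cols + c]`. On a GPU, a layout like this lets neighboring threads that each handle one row touch neighboring addresses, which is the kind of access pattern the abstract is concerned with.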