
CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator

Jakob Siegel, Juergen Ributzka, Xiaoming Li
Dept. of Electr. & Comput. Eng., Univ. of Delaware, Newark, DE, USA
International Conference on Parallel Processing Workshops, 2009. ICPPW ’09

@conference{siegel2009cuda,
   title={CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator},
   author={Siegel, J. and Ributzka, J. and Li, X.},
   booktitle={2009 International Conference on Parallel Processing Workshops},
   pages={174--181},
   issn={1530-2016},
   year={2009},
   organization={IEEE}
}


Modern GPUs open a completely new field for optimizing embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of optimization challenges. Some of the most notable are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy, whose organization and performance implications are radically different from those of general-purpose CPUs; and optimizing programs at the instruction level for the GPU. This paper makes two contributions to performance optimization for GPUs. We analyze different approaches for optimizing memory usage and access patterns on GPUs and propose a class of memory layout optimizations that takes full advantage of the unique memory hierarchy of NVIDIA CUDA. Furthermore, we analyze the performance increase from fully unrolling the innermost loop of the algorithm and propose guidelines on how to best unroll a program for the GPU. In particular, even though loop unrolling is a common optimization, the performance improvement on a GPU derives from a completely different aspect of the architecture. To demonstrate these optimizations, we picked an embarrassingly parallel algorithm used to calculate gravitational forces. This algorithm allows us to demonstrate and explain the performance increase gained by the applied optimizations. Our results show that our approach is quite effective. After applying our technique to the algorithm used in the Gravit gravity simulator, we observed a 1.27x speedup over the baseline GPU implementation. This represents an 87x speedup over the original CPU implementation.
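The two techniques the abstract names, a GPU-friendly memory layout and manual unrolling of the innermost loop, can be sketched in host code. The example below is illustrative only, not the paper's implementation: it uses a structure-of-arrays layout (the kind of layout that lets consecutive CUDA threads coalesce their global-memory loads) and an inner gravity loop unrolled by a factor of 4. All names (`ParticlesSoA`, `accel`) and the softening constant are assumptions for the sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Structure-of-Arrays layout: each coordinate lives in its own contiguous
// array, so thread (or iteration) i touches x[i], y[i], z[i] at unit stride.
// This is the access pattern that coalesces on CUDA hardware; the sketch
// itself is plain host C++.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Accumulate the gravitational acceleration on particle i from all particles.
// The inner loop is unrolled by hand (factor 4 here; the paper fully unrolls
// its innermost loop on the GPU).
void accel(const ParticlesSoA& p, std::size_t i,
           float& ax, float& ay, float& az) {
    const float G   = 6.674e-11f;  // gravitational constant
    const float eps = 1e-9f;       // softening term, avoids division by zero
    ax = ay = az = 0.0f;
    const std::size_t n = p.x.size();

    // One pairwise interaction i <- k.
    auto add = [&](std::size_t k) {
        float dx = p.x[k] - p.x[i];
        float dy = p.y[k] - p.y[i];
        float dz = p.z[k] - p.z[i];
        float r2 = dx * dx + dy * dy + dz * dz + eps;
        float inv_r3 = 1.0f / (r2 * std::sqrt(r2));
        float s = G * p.mass[k] * inv_r3;
        ax += s * dx; ay += s * dy; az += s * dz;
    };

    std::size_t j = 0;
    for (; j + 4 <= n; j += 4) {   // unrolled body: 4 interactions per trip
        add(j); add(j + 1); add(j + 2); add(j + 3);
    }
    for (; j < n; ++j) add(j);     // remainder loop
}
```

An array-of-structures layout (`struct Particle { float x, y, z, mass; }` in one array) would instead make neighboring threads load with a stride of 16 bytes, which is the access pattern such layout optimizations are meant to avoid.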
