Evaluation and enhancement of memory efficiency targeting general-purpose computations on scalable data-parallel GPU architectures
College of Engineering, Department of Electrical and Computer Engineering, Northeastern University
Northeastern University, 2011
@phdthesis{jang2011evaluation,
  title={Evaluation and enhancement of memory efficiency targeting general-purpose computations on scalable data-parallel GPU architectures},
  author={Jang, B.},
  school={Northeastern University},
  year={2011}
}
This thesis addresses the memory efficiency of general-purpose applications running on massively multi-threaded, data-parallel GPU architectures. Although scalable, data-parallel GPU architectures and their associated general-purpose programming models offer impressive computational capability and attractive power budgets, the pace of migrating general-purpose applications to this emerging class of architectures is significantly hindered by the limited efficiency of the memory subsystems on these platforms. Programmers are forced to optimize the memory behavior of their code if they want to reap the full benefits of these high-performance, data-parallel architectures. In this thesis, we present a comprehensive study of memory access behavior for data-parallel workloads targeting GPUs and propose an algorithmic methodology to address memory inefficiency. We establish a mathematical model of memory behavior that enables us to optimize memory system performance, and we present a comprehensive analysis of memory access patterns that fully incorporates the influence of thread mapping and explains the memory behavior of kernels running on GPU hardware; this modeling and analysis serves as a theoretical foundation throughout the thesis. We then show how this model of memory system activity can be used to enhance the memory efficiency of kernels through a series of algorithmic enhancement techniques: 1) vectorization via data transformations on vector-based GPU architectures, 2) appropriate memory space selection, and 3) a search for an optimized thread mapping and work-group size. To demonstrate the power of the proposed algorithmic methodology, we develop a tool that implements this approach and test it on a diverse class of general-purpose benchmark applications. The experiments are conducted using the industry-standard heterogeneous programming language OpenCL on two mainstream GPU platforms available on the market.
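To make the first technique concrete, the sketch below contrasts a scalar OpenCL kernel with a float4-vectorized variant of the same element-wise copy. It is an illustrative example rather than code from the thesis: the kernel names are hypothetical, and it assumes a buffer whose length is a multiple of four. On vector-based GPU architectures, the float4 version issues wider memory transactions per work-item, which is the kind of data transformation the abstract refers to.

// Scalar kernel: each work-item loads and stores one float.
__kernel void copy_scalar(__global const float* src,
                          __global float* dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}

// Vectorized kernel: each work-item loads and stores a float4, so a
// quarter as many work-items move the same data through wider memory
// transactions that better utilize vector-based GPU memory paths.
__kernel void copy_vec4(__global const float4* src,
                        __global float4* dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}

On the host side, the vectorized kernel would be launched with a global size of N/4 instead of N. The work-group size mentioned in the third technique corresponds to the local work size argument of clEnqueueNDRangeKernel and can likewise be treated as a tunable parameter.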
January 18, 2012 by hgpu