Improving GPU Performance: Reducing Memory Conflicts and Latency
Technische Universiteit Eindhoven
Technische Universiteit Eindhoven, 2015
@article{van2015improving,
title={Improving GPU Performance},
author={van den Braak, Gerardus Johannes Wilhelmus},
year={2015}
}
Over the last decade Graphics Processing Units (GPUs) have evolved from fixed function computer graphics processors to energy efficient and programmable general purpose compute accelerators. During this period the number of cores in a GPU increased from 128 to 3072, an increase of 24x. However, the peak compute performance only increased by 12x, and memory bandwidth by a mere 3.9x. Algorithms with an abundance of parallelism, such as matrix multiplication, can be implemented relatively easily on these GPUs and scale well with an increase in core count. Other, irregular algorithms are much harder to implement efficiently and benefit less of the increased number of cores. In this work a class of irregular algorithms, the so called ‘voting algorithms’ such as histogram and Hough transform, are analyzed, implemented and optimized on GPUs. Histograms are not only used in statistics or for displaying the distribution of colors in an image, but also for contrast adjustments in images, image segmentation and feature detection, such as in the Scale Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HoG). The Hough transform can be used to detect the lines on a road, or the circles of a traffic sign, but also to track particles, e.g. in the Large Hadron Collider. In voting algorithms a set of input values is mapped to a, usually much smaller, set of output bins. The main challenge in mapping voting algorithms to GPUs is to efficiently update the output bins in a parallel manner. The first contribution of this work is a set of software techniques to improve the parallel updating of the output bins in the voting algorithms. Voting algorithms use atomic operations to update the bins. By duplicating all the bins a significant performance improvement can be achieved. Multi-core CPU implementations are made which utilize the SSE and AVX vector extensions of the processor. These optimizations improve the performance of the histogram application on a CPU by 10x over a single thread CPU implementation. The baseline GPU implementation has a similar performance as a single core CPU implementation, but by using the proposed software techniques the best GPU histogram implementation outperforms the optimized multi-core CPU implementation by 4.8x. The second contribution of this thesis is a hardware change of the scratchpad memory. The GPU’s on-chip scratchpad memory is divided in banks and contains locks to support atomic operations. The duplication of the output bins requires more scratchpad memory and causes an uneven distribution of the memory accesses over the banks and locks. Hash functions in the addressing of the banks and locks are proposed to distribute the memory accesses more equally over the memory’s banks and locks. A simple hardware hash function improves performance up to 4.9x for the aforementioned optimized GPU histogram application. Applications which use the scratchpad memory, but do not rely on atomic operations, still experience an average performance gain of 1.2x by using a more complicated configurable hash function. The final contribution is an extension to the GPU architecture, resulting in a reconfigurable GPU, called R-GPU. This extension improves not only performance but also power and energy efficiency. R-GPU is an addition to a GPU, which can still be used in its original form, but also has the ability to reorganize the cores of a GPU in a reconfigurable network. In R-GPU data movement and control is implicit in the configuration of this network. Each core executes a fixed operation, reducing instruction decode count and increasing power and energy efficiency. R-GPU improves the performance of voting algorithms, e.g. histogram is improved 2.9x over an optimized GPU implementation. Other benchmarks profit as well. On a set of benchmarks an average performance improvement of 2.1x is measured. Especially algorithms which have a limited level of parallelism due to data dependencies, such as calculating an integral image, benefit from the proposed architecture changes. Furthermore, power consumption is reduced by 6%, leading to an energy consumption reduction of 55%, while the area overhead of R-GPU is only 4% of the total GPU’s chip area. With the above software techniques and hardware modifications GPUs are now much more applicable for the class of voting algorithms.
May 11, 2016 by hgpu