Performance and Power Optimization of GPU Architectures for General-purpose Computing

Yue Wang
University of South Florida
University of South Florida, 2014


   title={Performance and Power Optimization of GPU Architectures for General-purpose Computing},

   author={Wang, Yue},



Download Download (PDF)   View View   Source Source   



Power-performance efficiency has become a central focus that is challenging in heterogeneous processing platforms as the power constraints have to be established without hindering the high performance. In this dissertation, a framework for optimizing the power and performance of GPUs in the context of general-purpose computing in GPUs (GPGPU) is proposed. To optimize the leakage power of caches in GPUs, we dynamically switch the L1 and L2 caches into low power modes during periods of inactivity to reduce leakage power. The L1 cache can be put into a low-leakage (sleep) state when a processing unit is stalled due to no ready threads to be scheduled and the L2 can be put into sleep state during its idle period when there is no memory request. The sleep mode is state-retentive, which obviates the necessity to flush the caches after they are woken up, thereby, avoiding any performance degradation. Experimental results indicate that this technique can reduce the leakage power by 52% on average. Further, to improve performance, we redistribute the GPGPU workload across the computing units of the GPU during application execution. The fundamental idea is to monitor the workload on each multi-processing unit and redistribute it by having a portion of its unfinished threads executed in a neighboring multi-processing unit. Experimental results show this technique improves the performance of the GPGPU workload by 15.7%. Finally, to improve both performance and dynamic power of GPUs, we propose two dynamic frequency scaling (DFS) techniques implemented on CPU host threads, one of which is motivated by the significance of the pipeline stalls during GPGPU execution. It applies a feedback controlling algorithm, Proportional-Integral-Derivative (PID), to regulate the frequency of parallel processors and memory channels based on the occupancy of the memory buffering queues. The other technique targets on maximizing the average throughput of all parallel processors under the dynamic power constraints. We formalize this target as a linear programming problem and solve it on the runtime. According to the simulation results, the first technique achieves more than 22% power savings with a 4% improvement in performance and the second technique saves 11% power consumption with 9% performance improvement. The contributions of this dissertation represent a significant advancement in the quest for improving performance and reducing energy consumption of GPGPU.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: