Analysis and Modeling of the Timing Behavior of GPU Architectures

Petros Voudouris
Faculty of Electrical Engineering, Eindhoven University of Technology
Eindhoven University of Technology, 2014


   title={Analysis and Modeling of the Timing Behavior of GPU Architectures},

   author={Voudouris, Petros and van den Braak, Gert-Jan},



Download Download (PDF)   View View   Source Source   



Graphics processing units (GPUs) offer massive parallelism. Since a couple of years GPUs can also be used for more general purpose applications; a wide variety of applications can be accelerated efficiently with the use of the CUDA and OpenCL programming models. Real-time systems frequently use many sensors that produce a big amount of data. GPUs can be used to process the data and remove the workload from the central process unit (CPU). To use the GPUs in real time systems it is required to have time predictable behavior. However, it is hard to give an estimation of the worst case execution time (WCET) of a GPU program, since only few timing details of the GPU architecture are given. In addition, inside the GPUs several arbiters are used, and their scheduling details are not always clearly described. The Nvidia Fermi architecture is analyzed in order to identify the sources of time variation and unpredictability. Initially, all the main components of the architecture are discussed and investigated for sources of time variation and predictability. From the analyzed components of the architecture we chose to continue with the warp scheduler, the scratchpad memory and the out-of-chip global memory. The warp scheduler determines the schedule of the warps and as a result influences significantly the total execution time. The scratchpad memory is widely used in GPU application to hide the memory latency. Finally, the global memory instructions were analyzed since they are used from all the applications. Micro-benchmarks implemented in assembly were used to quantify the sources of variation. We execute the same experiments multiple times in order to determine the variation of the execution time among the experiments. A timing model for the warp scheduler based on results from the experiments was introduced. The model was tested for Convolution separable benchmark from CUDA SDK. The model was accurate for average case (96.5% of the experiments) with 2% and 3% error for Rows and Columns kernel respectively. For the best case the model had -15% and -54% and worst case 28% and 1% error for rows and columns kernel (3.5% of the experiment). In addition, it was shown that the model is scalable up to 6 warps without any modification. Furthermore, the variable-rate warp scheduler was implemented in GPGPU-Sim that improves the time predictability of the GPU by assigning the warps in fixed slots. Initial results show that by assigning one warp with higher scheduling rate can improve the performance also by taking advantage of the inter-warp locality of the warps. Finally, based on the findings of this report, the GPU can be used in real-time systems with soft deadlines. By using the suggested changes of the architecture, the time predictability of the architecture can be improved even more and support applications with more strict timing constrains.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: