high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Analysis and Modeling of the Timing Behavior of GPU Architectures

Analysis and Modeling of the Timing Behavior of GPU Architectures

Petros Voudouris

Faculty of Electrical Engineering, Eindhoven University of Technology

Eindhoven University of Technology, 2014

BibTeX

Download (PDF)

View

Source

2098

views

Graphics processing units (GPUs) offer massive parallelism. Since a couple of years GPUs can also be used for more general purpose applications; a wide variety of applications can be accelerated efficiently with the use of the CUDA and OpenCL programming models. Real-time systems frequently use many sensors that produce a big amount of data. GPUs can be used to process the data and remove the workload from the central process unit (CPU). To use the GPUs in real time systems it is required to have time predictable behavior. However, it is hard to give an estimation of the worst case execution time (WCET) of a GPU program, since only few timing details of the GPU architecture are given. In addition, inside the GPUs several arbiters are used, and their scheduling details are not always clearly described. The Nvidia Fermi architecture is analyzed in order to identify the sources of time variation and unpredictability. Initially, all the main components of the architecture are discussed and investigated for sources of time variation and predictability. From the analyzed components of the architecture we chose to continue with the warp scheduler, the scratchpad memory and the out-of-chip global memory. The warp scheduler determines the schedule of the warps and as a result influences significantly the total execution time. The scratchpad memory is widely used in GPU application to hide the memory latency. Finally, the global memory instructions were analyzed since they are used from all the applications. Micro-benchmarks implemented in assembly were used to quantify the sources of variation. We execute the same experiments multiple times in order to determine the variation of the execution time among the experiments. A timing model for the warp scheduler based on results from the experiments was introduced. The model was tested for Convolution separable benchmark from CUDA SDK. The model was accurate for average case (96.5% of the experiments) with 2% and 3% error for Rows and Columns kernel respectively. For the best case the model had -15% and -54% and worst case 28% and 1% error for rows and columns kernel (3.5% of the experiment). In addition, it was shown that the model is scalable up to 6 warps without any modification. Furthermore, the variable-rate warp scheduler was implemented in GPGPU-Sim that improves the time predictability of the GPU by assigning the warps in fixed slots. Initial results show that by assigning one warp with higher scheduling rate can improve the performance also by taking advantage of the inter-warp locality of the warps. Finally, based on the findings of this report, the GPU can be used in real-time systems with soft deadlines. By using the suggested changes of the architecture, the time predictability of the architecture can be improved even more and support applications with more strict timing constrains.

Tags: Computer science, CUDA, GPGPU-sim, nVidia, Thesis

February 10, 2015 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Analysis and Modeling of the Timing Behavior of GPU Architectures

Your response

Recent source codes

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Most viewed papers (last 30 days)

Analysis and Modeling of the Timing Behavior of GPU Architectures

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)