Understanding and Modeling the Synchronization Cost in the GPU Architecture

James T. Letendre
Rochester Institute of Technology, Kate Gleason College of Engineering
Rochester Institute of Technology, 2013
@phdthesis{letendre2013understanding,
  title={Understanding and Modeling the Synchronization Cost in the GPU Architecture},
  author={Letendre, James T.},
  year={2013},
  school={Rochester Institute of Technology}
}


Graphics Processing Units (GPUs) have grown increasingly popular for general-purpose computation. GPUs are massively parallel processors, which makes them a far better fit than the CPU for many algorithms. The drawback of using a GPU is that it is much less efficient at running algorithms with complex control flow. This has led to GPUs being used as part of heterogeneous systems, usually consisting of a CPU and a GPU, although other types of processors can be added. Models of GPUs are important for predicting how well code will perform on different GPUs, especially those the programmer does not have access to. GPU prices range from hundreds to thousands of dollars, so when designing a system with a particular performance target in mind, it is beneficial to determine which GPU best meets that goal without paying for unneeded performance. Current GPU models were either developed for older generations of GPU architectures, ignore certain costs present in the GPU, or account for those costs inaccurately. The major component ignored by most of the models investigated is the synchronization cost. This cost arises when threads within the GPU need to share data among themselves: to ensure the shared data is consistent, the threads must synchronize so that all writes to memory complete before any thread reads. Synchronization is also the cause of major inaccuracies in the most up-to-date GPU model found. This thesis aims to understand the factors behind the synchronization cost through microbenchmarks; with this understanding, the accuracy of the model can be improved.
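
To make the synchronization cost concrete, below is a minimal sketch of the kind of microbenchmark the abstract describes, written in CUDA. The kernel name, block size, and iteration count are illustrative assumptions, not details taken from the thesis: each thread writes to shared memory, crosses a __syncthreads() barrier, then reads a neighbor's value, with clock64() bracketing the loop to estimate the per-iteration barrier cost.

// Illustrative synchronization-cost microbenchmark (not the thesis's actual harness).
#include <cstdio>
#include <cuda_runtime.h>

#define ITERATIONS 1000   // arbitrary choice for averaging

// Each thread writes to shared memory, synchronizes with its block,
// then reads a neighbor's value. __syncthreads() guarantees all writes
// complete before any read begins -- this barrier is the cost under study.
__global__ void syncCostKernel(float *out, long long *cycles)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;
    float v = (float)tid;

    long long start = clock64();
    for (int i = 0; i < ITERATIONS; ++i) {
        buf[tid] = v;
        __syncthreads();                    // barrier between write and read
        v = buf[(tid + 1) % blockDim.x];    // read a neighbor's write
        __syncthreads();                    // keep iterations from overlapping
    }
    long long stop = clock64();

    out[blockIdx.x * blockDim.x + tid] = v; // keep the compiler from eliding the loop
    if (tid == 0)
        cycles[blockIdx.x] = (stop - start) / ITERATIONS;
}

int main()
{
    const int blocks = 1, threads = 256;
    float *d_out;
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));
    cudaMalloc(&d_cycles, blocks * sizeof(long long));

    syncCostKernel<<<blocks, threads>>>(d_out, d_cycles);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);

    printf("~%lld cycles per iteration (incl. two barriers)\n", h_cycles);
    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}

On real hardware one would vary the block size and the number of barriers per iteration, and subtract a barrier-free baseline, to isolate the synchronization component; the sketch above only illustrates the measurement pattern.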

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

The platforms are

Node 1
  • GPU device 0: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: AMD APP SDK 2.9
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.2
  • SDK: nVidia CUDA Toolkit 6.0.1, AMD APP SDK 2.9

Completed OpenCL project should be uploaded via User dashboard (see instructions and example there), compilation and execution terminal output logs will be provided to the user.

The information send to hgpu.org will be treated according to our Privacy Policy

HGPU group © 2010-2014 hgpu.org

All rights belong to the respective authors

Contact us: