Analyzing the CUDA Applications with its Latency and Bandwidth Tolerance
Department of Computer Science, Jawaharlal Darda Institute of Engineering & Technology, Yavatmal, MS, India
BIOINFO Computer Engineering, Volume 2, Issue 1, pp. 25-30, 2012
@article{pistulkar2012analyzing,
title={Analyzing the CUDA Applications with its Latency and Bandwidth Tolerance},
author={Pistulkar, V. N. and Uttarwar, C. A.},
journal={BIOINFO Computer Engineering},
volume={2},
number={1},
pages={25--30},
year={2012}
}
The CUDA scalable parallel programming model provides readily understood abstractions that free programmers to focus on efficient parallel algorithms. It uses a hierarchy of thread groups, shared memory, and barrier synchronization to express fine-grained and coarse-grained parallelism, with the programmer writing sequential C code for a single thread; by hiding hardware details in this way, it makes parallel programming more accessible to nonexperts. This paper explores the scalability of CUDA applications on systems with varying interconnect latencies and bandwidths. We use a combination of the Ocelot PTX emulator [1] and a discrete-event simulator to evaluate the UIUC Parboil benchmarks [2] on three distinct GPU configurations. We find that these applications are sensitive to neither interconnect latency nor bandwidth, and that integrated GPU-CPU systems are not likely to perform any better than discrete GPUs or GPU clusters.
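To make the abstractions concrete, the following is a minimal illustrative CUDA kernel, not taken from the paper; the name blockSum and all parameters are hypothetical. Each thread runs the same sequential C code, threads are grouped into blocks that cooperate through on-chip shared memory, and __syncthreads() provides the barrier synchronization within a block; coarse-grained parallelism comes from launching many independent blocks.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Block-level sum reduction: each thread block reduces BLOCK_SIZE
// elements of the input using shared memory and barrier synchronization.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[BLOCK_SIZE];         // per-block shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;    // global index of this thread

    buf[tid] = (i < n) ? in[i] : 0.0f;        // sequential code for one thread
    __syncthreads();                          // barrier: wait for all loads

    // Tree reduction within the block (fine-grained parallelism).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];             // one partial sum per block
}

int main()
{
    const int n = 1 << 20;
    const int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    std::vector<float> h_in(n, 1.0f), h_out(blocks);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += h_out[b];
    printf("sum = %.0f (expected %d)\n", total, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Note that nothing in the kernel refers to the number of processors or the interconnect: blocks can be scheduled on any available core, which is precisely the hardware-hiding property whose latency and bandwidth consequences the paper evaluates.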