Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt

University of British Columbia, Vancouver, BC, Canada

In 2009 IEEE International Symposium on Performance Analysis of Systems and Software (April 2009), pp. 163-174.

DOI:10.1109/ISPASS.2009.4919648

@conference{bakhoda2009analyzing,
  title={Analyzing CUDA workloads using a detailed GPU simulator},
  author={Bakhoda, A. and Yuan, G.L. and Fung, W.W.L. and Wong, H. and Aamodt, T.M.},
  booktitle={Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on},
  pages={163--174},
  year={2009},
  organization={IEEE}
}

Modern graphics processing units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow’s manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA’s CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA’s parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
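To make the term "memory request coalescing" concrete, the sketch below counts how many aligned memory segments a group of threads touches under two access patterns. It is only an illustrative toy model, not the paper's simulator: the function name `count_transactions`, the 16-thread group size, the 4-byte word, and the 64-byte segment size are all assumptions chosen for the example.

```python
def count_transactions(addresses, segment_bytes=64):
    """Count the distinct aligned segments touched by one group of accesses.

    Toy model (assumption): accesses that fall in the same aligned segment
    are merged into a single memory transaction.
    """
    return len({addr // segment_bytes for addr in addresses})


THREADS = 16  # assumed group size, for illustration only
WORD = 4      # assumed bytes per access

# Unit stride: thread i reads word i, so addresses are consecutive.
unit_stride = [i * WORD for i in range(THREADS)]

# Large stride: thread i reads word 32*i, so every access lands in its own segment.
large_stride = [i * 32 * WORD for i in range(THREADS)]

coalesced = count_transactions(unit_stride)
scattered = count_transactions(large_stride)

print(f"unit stride:  {coalesced} transaction(s)")
print(f"large stride: {scattered} transaction(s)")
```

Under these assumptions the unit-stride pattern collapses into a single transaction, while the strided pattern needs one transaction per thread. Extra memory traffic of this kind is one source of the memory-system contention the abstract mentions.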

Tags: Computer science, CUDA, nVidia, nVidia GeForce 8600 GTS, PTX

October 28, 2010 by hgpu
