Optimization principles and application performance evaluation of a multithreaded GPU using CUDA


Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, Wen-mei W. Hwu

Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign

PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

DOI:10.1145/1345206.1345220

@conference{ryoo2008optimization,
  title={Optimization principles and application performance evaluation of a multithreaded GPU using CUDA},
  author={Ryoo, S. and Rodrigues, C.I. and Baghsorkhi, S.S. and Stone, S.S. and Kirk, D.B. and Hwu, W.W.},
  booktitle={Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming},
  pages={73--82},
  year={2008},
  organization={ACM}
}


GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor’s organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread’s resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations, and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups of 10.5X to 457X in kernel codes and 1.16X to 431X for the total application.
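The balance between per-thread resource usage and the number of simultaneously active threads can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the paper's code. The per-multiprocessor limits (8192 registers, 16 KB of shared memory, 768 threads, 8 thread blocks) are assumptions based on the commonly documented figures for the GeForce 8800 GTX generation, the function names are hypothetical, and the model ignores hardware allocation granularity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits (assumed values for a GeForce 8800 GTX class GPU)."""
    registers: int = 8192            # 32-bit registers per multiprocessor
    shared_mem_bytes: int = 16 * 1024
    max_threads: int = 768           # simultaneously active threads
    max_blocks: int = 8              # simultaneously active thread blocks


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  limits=SMLimits()):
    """Estimate how many thread blocks fit on one multiprocessor.

    Each resource imposes its own ceiling; the tightest one wins.
    """
    if threads_per_block <= 0 or regs_per_thread <= 0 or smem_per_block < 0:
        raise ValueError("resource usage must be positive")
    by_threads = limits.max_threads // threads_per_block
    by_regs = limits.registers // (regs_per_thread * threads_per_block)
    # A block that uses no shared memory is not limited by it.
    by_smem = (limits.shared_mem_bytes // smem_per_block
               if smem_per_block else limits.max_blocks)
    return min(limits.max_blocks, by_threads, by_regs, by_smem)


def active_threads(threads_per_block, regs_per_thread, smem_per_block,
                   limits=SMLimits()):
    """Estimate the number of simultaneously active threads per multiprocessor."""
    blocks = blocks_per_sm(threads_per_block, regs_per_thread,
                           smem_per_block, limits)
    return blocks * threads_per_block


if __name__ == "__main__":
    # One extra register per thread drops a whole 256-thread block.
    print(active_threads(256, 10, 0))  # 768
    print(active_threads(256, 11, 0))  # 512
```

Under these assumed limits, going from 10 to 11 registers per thread with 256-thread blocks reduces the estimate from three active blocks to two, so a third of the threads available for hiding global memory latency disappear. An optimization that costs a single register can therefore reduce overall performance.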

Tags: Computer science, CUDA, High-level Languages, nVidia, nVidia GeForce 8800 GTX, Performance, Programming techniques

October 27, 2010 by hgpu

