Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

Onur Kayiran, Adwait Jog, Mahmut T. Kandemir, Chita R. Das
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA
The Pennsylvania State University, Technical Report CSE-12-006, 2012

@techreport{kayiran2012neither,
   title={Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs},
   author={Kayiran, Onur and Jog, Adwait and Kandemir, Mahmut T. and Das, Chita R.},
   institution={The Pennsylvania State University},
   number={CSE-12-006},
   year={2012}
}

General-purpose graphics processing units (GPGPUs) are at their best when accelerating computation by exploiting the abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to express work in terms of smaller units, called cooperative thread arrays (CTAs), each consisting of a group of threads. CTAs can be executed in any order, providing ample opportunities for TLP. State-of-the-art GPGPU schedulers allocate the maximum possible number of CTAs per core (limited by available on-chip resources) to enhance performance by exploiting high TLP. However, we demonstrate in this paper that executing the maximum possible number of CTAs on a core is not always the optimal choice from a performance perspective, due to inefficient utilization of core resources. Therefore, we propose a dynamic CTA scheduling mechanism, called DYNCTA, which modulates core-level TLP by allocating an optimal number of CTAs based on application characteristics. DYNCTA allocates more CTAs to compute-intensive applications than to memory-intensive ones to minimize resource contention. Simulation results on a 30-core GPGPU platform with 31 applications demonstrate that the proposed CTA scheduler provides a 28% (up to 3.6x) performance improvement over the existing CTA scheduler, on average. We further enhance DYNCTA by turning off some cores at run-time to limit TLP and power consumption. This scheme, DYNCORE, is shown to provide a 21% speedup while reducing power consumption by 17% and saving 52% in energy, compared to existing CTA schedulers.
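The core idea of modulating per-core CTA count can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's algorithm: the function name, the two stall counters (`idle_cycles`, `mem_wait_cycles`), and the thresholds are all assumptions introduced here to show how a scheduler might raise TLP when a core is starved for work and lower it when memory contention dominates.

```python
# Hypothetical sketch of DYNCTA-style CTA modulation. All names and
# thresholds are illustrative assumptions; the actual mechanism and
# metrics used by DYNCTA are described in the paper.

def adjust_cta_count(active_ctas, idle_cycles, mem_wait_cycles,
                     max_ctas, t_idle=100, t_wait=500):
    """Periodically re-tune the number of CTAs allocated to one core.

    idle_cycles: cycles the core had no ready warps (too little TLP)
    mem_wait_cycles: cycles warps stalled on memory (too much contention)
    """
    if idle_cycles > t_idle and active_ctas < max_ctas:
        return active_ctas + 1   # core is starved: raise TLP
    if mem_wait_cycles > t_wait and active_ctas > 1:
        return active_ctas - 1   # memory contention: lower TLP
    return active_ctas           # current allocation looks balanced

# Example: a memory-bound phase pushes the count down,
# a later compute-bound phase pushes it back up.
n = 8
n = adjust_cta_count(n, idle_cycles=10, mem_wait_cycles=900, max_ctas=8)   # -> 7
n = adjust_cta_count(n, idle_cycles=300, mem_wait_cycles=50, max_ctas=8)   # -> 8
```

The design choice sketched here matches the abstract's claim: memory-intensive phases get fewer concurrent CTAs to reduce contention, while compute-intensive phases get more to keep the core busy.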

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
