Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

Onur Kayiran, Adwait Jog, Mahmut T. Kandemir, Chita R. Das
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA
The Pennsylvania State University, Technical Report CSE-12-006, 2012

@techreport{kayiran2012neither,
   title={Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs},
   author={Kayiran, Onur and Jog, Adwait and Kandemir, Mahmut T. and Das, Chita R.},
   institution={The Pennsylvania State University},
   number={CSE-12-006},
   year={2012}
}

General-purpose graphics processing units (GPGPUs) are at their best when accelerating computation by exploiting the abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to express work in terms of smaller units, called cooperative thread arrays (CTAs), each consisting of a group of threads. CTAs can be executed in any order, providing ample opportunities for TLP. State-of-the-art GPGPU schedulers allocate the maximum possible number of CTAs per core (limited by available on-chip resources) to enhance performance by exploiting high TLP. However, we demonstrate in this paper that executing the maximum possible number of CTAs on a core is not always the optimal choice from a performance perspective, due to inefficient utilization of core resources. Therefore, we propose a dynamic CTA scheduling mechanism, called DYNCTA, which modulates core-level TLP by allocating an optimal number of CTAs based on application characteristics. DYNCTA allocates more CTAs to compute-intensive applications than to memory-intensive ones to minimize resource contention. Simulation results on a 30-core GPGPU platform with 31 applications demonstrate that the proposed CTA scheduler provides a 28% (up to 3.6x) performance improvement over the existing CTA scheduler, on average. We further enhance DYNCTA by turning off some cores at run-time to limit TLP and power consumption. This scheme, DYNCORE, is shown to provide a 21% speedup while reducing power consumption by 17% and saving 52% in energy, compared to existing CTA schedulers.
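The core idea of modulating per-core CTA count can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's algorithm: the function name, the two stall counters (`idle_cycles`, `mem_wait_cycles`), and the thresholds are all assumptions introduced here to show how a scheduler might raise TLP when a core is starved for work and lower it when memory contention dominates.

```python
# Hypothetical sketch of DYNCTA-style CTA modulation. All names and
# thresholds are illustrative assumptions; the actual mechanism and
# metrics used by DYNCTA are described in the paper.

def adjust_cta_count(active_ctas, idle_cycles, mem_wait_cycles,
                     max_ctas, t_idle=100, t_wait=500):
    """Periodically re-tune the number of CTAs allocated to one core.

    idle_cycles: cycles the core had no ready warps (too little TLP)
    mem_wait_cycles: cycles warps stalled on memory (too much contention)
    """
    if idle_cycles > t_idle and active_ctas < max_ctas:
        return active_ctas + 1   # core is starved: raise TLP
    if mem_wait_cycles > t_wait and active_ctas > 1:
        return active_ctas - 1   # memory contention: lower TLP
    return active_ctas           # current allocation looks balanced

# Example: a memory-bound phase pushes the count down,
# a later compute-bound phase pushes it back up.
n = 8
n = adjust_cta_count(n, idle_cycles=10, mem_wait_cycles=900, max_ctas=8)   # -> 7
n = adjust_cta_count(n, idle_cycles=300, mem_wait_cycles=50, max_ctas=8)   # -> 8
```

The design choice sketched here matches the abstract's claim: memory-intensive phases get fewer concurrent CTAs to reduce contention, while compute-intensive phases get more to keep the core busy.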

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
