Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

Onur Kayiran, Adwait Jog, Mahmut T. Kandemir, Chita R. Das
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA
The Pennsylvania State University, Technical Report CSE-12-006, 2012

@techreport{kayiran2012neither,
   title={Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs},
   author={Kayiran, Onur and Jog, Adwait and Kandemir, Mahmut T. and Das, Chita R.},
   institution={The Pennsylvania State University},
   number={CSE-12-006},
   year={2012}
}

General-purpose graphics processing units (GPGPUs) are at their best when accelerating computation by exploiting the abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to divide work into smaller units, called cooperative thread arrays (CTAs), each consisting of a group of threads. The CTAs can be executed in any order, providing ample opportunities for TLP. State-of-the-art GPGPU schedulers allocate the maximum possible number of CTAs per core (limited by available on-chip resources) to enhance performance by exploiting high TLP. However, we demonstrate in this paper that executing the maximum possible number of CTAs on a core is not always the optimal choice from a performance perspective, because it can lead to inefficient utilization of core resources. Therefore, we propose a dynamic CTA scheduling mechanism, called DYNCTA, which modulates core-level TLP by allocating an optimal number of CTAs based on application characteristics. DYNCTA allocates more CTAs to compute-intensive applications than to memory-intensive ones to minimize resource contention. Simulation results on a 30-core GPGPU platform with 31 applications demonstrate that the proposed CTA scheduling provides an average performance improvement of 28% (up to 3.6x) over the existing CTA scheduler. We further enhance DYNCTA by turning off some cores at run time to limit TLP and power consumption. This scheme, DYNCORE, provides a 21% speedup while reducing power consumption by 17% and saving 52% energy, compared to existing CTA schedulers.
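
The abstract only sketches how DYNCTA modulates TLP. As a rough illustration (not the authors' implementation), the C++ fragment below shows one way a per-core controller could realize the described behavior: it periodically samples how often the core sits idle versus stalled on memory, then nudges the per-core CTA limit up during compute-intensive phases and down during memory-intensive ones. All names, counters, and thresholds here are assumptions, not details taken from the paper.

// Minimal sketch of a DYNCTA-style CTA-limit controller.
// Counters, thresholds, and class names are illustrative assumptions.
#include <algorithm>
#include <cstdint>

struct CoreStats {
    uint64_t idle_cycles;     // cycles with no ready warp to issue
    uint64_t mem_wait_cycles; // cycles stalled waiting on memory
    uint64_t window_cycles;   // length of the sampling window
};

class DynCtaController {
public:
    explicit DynCtaController(int max_ctas)
        : max_ctas_(max_ctas), cta_limit_(max_ctas) {}

    // Called once per sampling window. A mostly idle core is starved
    // for work, so the CTA limit is raised to expose more TLP; a core
    // mostly stalled on memory is throttled to reduce contention for
    // caches and memory bandwidth.
    void update(const CoreStats& s) {
        double idle_frac = double(s.idle_cycles) / s.window_cycles;
        double mem_frac  = double(s.mem_wait_cycles) / s.window_cycles;

        if (idle_frac > kIdleHigh)       // compute-starved: add parallelism
            cta_limit_ = std::min(cta_limit_ + 1, max_ctas_);
        else if (mem_frac > kMemHigh)    // memory-bound: throttle parallelism
            cta_limit_ = std::max(cta_limit_ - 1, 1);
        // otherwise the application phase looks balanced; keep the limit
    }

    int cta_limit() const { return cta_limit_; }

private:
    static constexpr double kIdleHigh = 0.2; // assumed threshold
    static constexpr double kMemHigh  = 0.5; // assumed threshold
    int max_ctas_;
    int cta_limit_;
};

A CTA scheduler would consult cta_limit() before dispatching a new CTA to a core. DYNCORE, as described in the abstract, goes further and turns off some cores at run time, trading TLP for lower power when throttling CTAs alone is not enough.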