Improving GPU Performance through Instruction Redistribution and Diversification

Xiang Gong
Northeastern University, Boston, Massachusetts
Northeastern University, 2018







As throughput-oriented accelerators, GPUs provide tremendous processing power by executing a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, often leaving GPU resources under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of TLP due to hardware bottlenecks. Unfortunately, this tolerance is not effectively exploited by the Single Instruction Multiple Thread (SIMT) execution model employed by current GPU compute frameworks. Under a SIMT execution model, GPU applications frequently issue bursts of memory requests that compete for GPU memory resources. Traditionally, hardware units, such as the wavefront scheduler, are used to manage such requests. Compute-bound threads can be scheduled to utilize compute resources while memory requests are serviced. However, the scheduler struggles when memory operations dominate execution, because it cannot effectively hide their long latency. The degree of instruction diversity present in a single application may also be insufficient to fully utilize the resources on a GPU. GPU workloads tend to stress a particular hardware resource while leaving others under-utilized. Coarse-grained hardware resource sharing techniques, such as concurrent kernel execution, fail to guarantee that GPU hardware resources are truly shared by different kernels. Introducing additional kernels that utilize similar resources may add contention to the system, especially if kernel candidates fail to use hardware resources collaboratively. Most previous studies have treated achieving peak GPU performance as a hardware issue. Extensive efforts have been made to remove hardware bottlenecks to improve efficiency. In this thesis, we argue that software plays an equally, if not more, important role.
We need to acknowledge that hardware alone cannot achieve peak performance in a GPU system. We propose novel compiler-centric software techniques that work with the hardware. Our compiler-centric solutions improve GPU performance by redistributing and diversifying instructions at compile time, which simultaneously reduces memory contention and improves utilization of hardware resources. A rebalanced GPU application can achieve much better performance with minimal effort from the programmer and without any hardware changes. To support our study of these novel compiler-based optimizations, we need a complete simulation framework that can work seamlessly with a compiler toolchain. In this thesis, we develop a full compiler toolchain based on LLVM that works seamlessly with the Multi2Sim CPU-GPU simulation framework. In addition to supporting our work, this compiler framework allows future researchers to explore cross-layer optimizations for GPU systems.
