high performance computing on graphics processing units: hgpu.org

Improving GPU Performance through Instruction Redistribution and Diversification

Xiang Gong

Northeastern University, Boston, Massachusetts

Northeastern University, 2018

As throughput-oriented accelerators, GPUs provide tremendous processing power by executing a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, often leaving the GPU's resources under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of TLP due to hardware bottlenecks. Unfortunately, this tolerance is not effectively exploited by the Single Instruction Multiple Thread (SIMT) execution model employed by current GPU compute frameworks. Under a SIMT execution model, GPU applications frequently send bursts of memory requests that compete for GPU memory resources. Traditionally, hardware units, such as the wavefront scheduler, are used to manage such requests. Compute-bound threads can be scheduled to utilize compute resources while memory requests are serviced. However, the scheduler struggles when memory operations dominate execution, and it cannot effectively hide their long latency. The degree of instruction diversity present in a single application may also be insufficient to fully utilize the resources on a GPU. GPU workloads tend to stress a particular hardware resource while leaving others under-utilized. Coarse-grained hardware resource sharing techniques, such as concurrent kernel execution, fail to guarantee that GPU hardware resources are truly shared by different kernels. Introducing additional kernels that utilize similar resources may add contention to the system, especially if kernel candidates fail to use hardware resources collaboratively. Most previous studies treated achieving peak GPU performance as a hardware issue, and extensive efforts have been made to remove hardware bottlenecks to improve efficiency. In this thesis, we argue that software plays an equal, if not more important, role.
We need to acknowledge that hardware alone cannot achieve peak performance in a GPU system. We propose novel compiler-centric software techniques that work with hardware. Our compiler-centric solutions improve GPU performance by redistributing and diversifying instructions at compile time, which reduces memory contention and improves utilization of hardware resources at the same time. A rebalanced GPU application can achieve much better performance with minimal effort from the programmer and without any hardware changes. To support our study of these novel compiler-based optimizations, we need a complete simulation framework that can work seamlessly with a compiler toolchain. In this thesis, we develop a full compiler toolchain based on LLVM that works seamlessly with the Multi2Sim CPU-GPU simulation framework. In addition to supporting our work, this compiler framework allows future researchers to explore cross-layer optimizations for GPU systems.

Tags: AMD Radeon HD 7970, ATI, Compilers, Computer science, OpenCL, Optimization, Thesis

March 10, 2019 by hgpu
