high performance computing on graphics processing units: hgpu.org


The More We Share, The More We Have: Improving GPU performance through Register Sharing

Vishwesh Jatala, Jayvant Anantpur, Amey Karkare

Department of CSE, IIT Kanpur, Kanpur, India

arXiv:1503.05694 [cs.AR], (19 Mar 2015)

@article{jatala2015more,
  title={The More We Share, The More We Have: Improving GPU performance through Register Sharing},
  author={Jatala, Vishwesh and Anantpur, Jayvant and Karkare, Amey},
  year={2015},
  month={mar},
  archivePrefix={"arXiv"},
  primaryClass={cs.AR}
}


Graphics Processing Units (GPUs), which consist of Streaming Multiprocessors (SMs), achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The amount of thread-level parallelism that can be exploited depends on the number of resident threads on each SM. Threads are typically structured into a grid of thread blocks, with each thread block containing a large number of threads. The number of thread blocks, and hence the number of threads, that can be launched on an SM depends on the resource usage of the thread blocks (e.g., number of registers and amount of shared memory). Because threads are allocated to an SM at thread-block granularity, some resources may not be fully used and are therefore wasted. We propose an approach, Register Sharing, that uses the wasted registers in SMs to launch more thread blocks and hence increase the number of resident threads. We further propose three optimizations that make effective use of these extra thread blocks to hide long execution latencies and hence reduce the number of stall cycles. We experimentally validated our approach using the GPGPU-Sim simulator on several applications from three benchmark suites: GPGPU-Sim, Rodinia, and Parboil. We observed a maximum improvement of 24% and an average improvement of 11%, with a very small hardware overhead.
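The register waste described in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the `SMConfig` limits, the `Kernel` figures, and the function names are hypothetical values chosen for illustration, and the model ignores hardware details such as register allocation granularity. It only shows how allocating at thread-block granularity can leave registers idle.

```python
from dataclasses import dataclass


@dataclass
class SMConfig:
    """Hypothetical per-SM resource limits (illustrative, not from the paper)."""
    registers: int = 32768
    shared_memory_bytes: int = 49152
    max_threads: int = 1536
    max_blocks: int = 8


@dataclass
class Kernel:
    """Hypothetical per-kernel resource usage."""
    threads_per_block: int
    registers_per_thread: int
    shared_memory_per_block: int = 0


def resident_blocks(sm: SMConfig, k: Kernel) -> int:
    """Whole thread blocks that fit on one SM under every resource limit."""
    regs_per_block = k.threads_per_block * k.registers_per_thread
    limits = [
        sm.max_blocks,
        sm.max_threads // k.threads_per_block,
        sm.registers // regs_per_block,
    ]
    if k.shared_memory_per_block > 0:
        limits.append(sm.shared_memory_bytes // k.shared_memory_per_block)
    return min(limits)


def wasted_registers(sm: SMConfig, k: Kernel) -> int:
    """Registers left unallocated once all resident blocks are placed."""
    used = resident_blocks(sm, k) * k.threads_per_block * k.registers_per_thread
    return sm.registers - used


if __name__ == "__main__":
    sm = SMConfig()
    kernel = Kernel(threads_per_block=512, registers_per_thread=24)
    print(resident_blocks(sm, kernel))   # 2
    print(wasted_registers(sm, kernel))  # 8192
```

Under these assumed numbers each block needs 12,288 registers, so only two blocks fit and 8,192 registers (a quarter of the register file) sit unused, even though the thread limit alone would admit a third block. Leftover registers of this kind are what the abstract proposes to put to use.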

Tags: Computer science, CUDA, GPGPU-sim, Hardware Architecture, nVidia, Performance

March 20, 2015 by hgpu



