Improving GPGPU Concurrency with Elastic Kernels

Sreepathi Pai, Matthew J. Thazhuthaveetil, R. Govindarajan
Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India
18th International Conference on Architectural Support for Programming Languages and Operating Systems, 2013

@inproceedings{pai2013improving,
   title={Improving GPGPU Concurrency with Elastic Kernels},
   author={Pai, Sreepathi and Thazhuthaveetil, Matthew J. and Govindarajan, R.},
   booktitle={Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
   year={2013}
}


Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs often do not scale to utilize all available resources: over 30% of resources go unused, on average, for the Parboil2 programs used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find that concurrent execution is often no better than serialized execution. We identify the lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.
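To make the elastic-kernel idea concrete, below is a minimal CUDA sketch of the general concept: a kernel whose launched (physical) grid size is decoupled from the logical grid implied by the problem size, so a concurrency policy can cap how many blocks it occupies while another kernel runs in a different stream. This is only an illustration under that assumption; the kernel names, the chosen physical block count, and the grid-stride style virtualization of blockIdx are illustrative and are not the paper's actual transformation or policy code.

#include <cuda_runtime.h>

// Original kernel: one thread per element; grid size is tied to n.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// "Elastic" variant: launched with however many physical blocks a
// concurrency policy chooses; each physical block loops over the
// logical blocks it is responsible for (blockIdx is virtualized).
__global__ void scale_elastic(float *x, float a, int n, int logicalBlocks) {
    for (int lb = blockIdx.x; lb < logicalBlocks; lb += gridDim.x) {
        int i = lb * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int logicalBlocks = (n + threads - 1) / threads;

    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // A policy could cap each kernel's physical grid so that two kernels
    // launched into different streams can share the GPU's SMs.
    // (The value 32 is arbitrary, for illustration only.)
    const int physicalBlocks = 32;

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    scale_elastic<<<physicalBlocks, threads, 0, s1>>>(x, 2.0f, n, logicalBlocks);
    scale_elastic<<<physicalBlocks, threads, 0, s2>>>(y, 0.5f, n, logicalBlocks);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Because each elastic kernel claims only a bounded number of physical blocks, the two launches above can plausibly share the GPU rather than one monopolizing all SMs until completion, which is the serialization behavior the paper observes with the default CUDA policy.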