high performance computing on graphics processing units: hgpu.org

Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse

Guoyang Chen, Xipeng Shen

Computer Science Department, North Carolina State University, 890 Oval Drive, Raleigh, NC, USA 27695

The 48th Annual IEEE/ACM International Symposium on Microarchitecture, 2015

Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to load imbalance; the latter suffers from large runtime overhead. In this work, we propose free launch, a new software approach that overcomes the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism. It employs a novel compiler-based code transformation, named subkernel launch removal, to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel launch-based approach while avoiding its large runtime overhead. Meanwhile, its superior load balancing makes it outperform manual worklist-based techniques by 3X on average.
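The idea of reusing parent threads in place of subkernel launches can be illustrated with a toy CPU-side model. This is only a sketch under stated assumptions, not the paper's compiler transformation or its adaptive assignment policy: the function names (`record_launches`, `assign_round_robin`, `max_load`), the round-robin policy, and the task counts are all hypothetical and chosen for illustration. The model records the child tasks each parent thread would have launched, then redistributes them across the existing parent threads and compares the heaviest per-thread load.

```python
# Toy model of "subkernel launch removal": hypothetical names and data,
# not the authors' implementation.

def record_launches(child_counts):
    """Instead of launching a subkernel, each parent thread records the
    child tasks it would have launched as (parent_id, child_index) pairs."""
    return [(p, c) for p, n in enumerate(child_counts) for c in range(n)]

def assign_round_robin(tasks, num_threads):
    """Reassign the recorded child tasks to the existing parent threads.
    Round-robin is an assumed stand-in for an adaptive assignment policy."""
    per_thread = [[] for _ in range(num_threads)]
    for i, task in enumerate(tasks):
        per_thread[i % num_threads].append(task)
    return per_thread

def max_load(per_thread):
    """Largest number of tasks on any one thread (a proxy for finish time)."""
    return max(len(t) for t in per_thread)

# Made-up, irregular subkernel sizes for 8 parent threads.
child_counts = [1, 0, 7, 2, 0, 0, 5, 1]

tasks = record_launches(child_counts)
balanced = assign_round_robin(tasks, len(child_counts))

# Baseline for comparison: each parent thread runs only its own child tasks.
naive = [[(p, c) for c in range(n)] for p, n in enumerate(child_counts)]

print(max_load(naive), max_load(balanced))  # prints "7 2"
```

In this model the heaviest thread drops from 7 tasks to 2 once the recorded tasks are spread over all parent threads, which is the kind of load-balance effect the abstract attributes to task reassignment. Real GPU behavior also depends on warp and block structure, which this sketch ignores.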

Tags: Compilers, Computer science, CUDA, nVidia, Tesla K20, Tesla K40

November 8, 2015 by hgpu
