high performance computing on graphics processing units: hgpu.org

Reordering GPU Kernel Launches to Enable Efficient Concurrent Execution

Teng Li, Vikram K. Narayana, Tarek El-Ghazawi

Department of Electrical and Computer Engineering, The George Washington University, 801 22nd St NW, Washington, DC, 20052, United States

arXiv:1511.07983 [cs.DC], (25 Nov 2015)

Contemporary GPUs allow concurrent execution of small computational kernels in order to prevent idling of GPU resources. Although independent kernels can potentially run concurrently, the order in which kernels are issued to the GPU significantly influences application performance. A technique for deriving suitable kernel launch orders is therefore presented, with the aim of reducing the total execution time. Experimental results indicate that the proposed method yields solutions that are well above the 90th percentile in the design space of all possible permutations of the kernel launch sequence.
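To make the "percentile in the design space of all permutations" idea concrete, here is a minimal Python sketch. It is not the authors' method: it assumes a hypothetical toy device with a single pool of `capacity` resource units, kernels described only by a `(demand, duration)` pair, and strictly in-order issue, so a kernel that does not fit blocks everything queued behind it. The names `makespan` and `percentile_rank` are illustrative only.

```python
import heapq
from itertools import permutations


def makespan(order, capacity):
    """Total time to run kernels issued strictly in `order` on a toy device.

    Each kernel is a (demand, duration) pair. A kernel starts as soon as
    `demand` units are free, but never before the kernels ahead of it.
    This is an assumed model, not a description of real GPU scheduling.
    """
    running = []        # min-heap of (finish_time, demand) for started kernels
    free = capacity     # resource units currently available
    now = 0             # start time of the most recently issued kernel
    end = 0             # latest finish time seen so far
    for demand, duration in order:
        if demand > capacity:
            raise ValueError("kernel does not fit on the device")
        # Head-of-line blocking: wait for earlier kernels to finish
        # until this one fits.
        while free < demand:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        heapq.heappush(running, (now + duration, demand))
        free -= demand
        end = max(end, now + duration)
    return end


def percentile_rank(order, capacity):
    """Percentage of all launch permutations that are no faster than `order`."""
    target = makespan(order, capacity)
    times = [makespan(p, capacity) for p in permutations(order)]
    return 100.0 * sum(t >= target for t in times) / len(times)


# Two large and two small kernels on a 4-unit device (made-up numbers).
A, B, C, D = (3, 2), (3, 2), (1, 2), (1, 2)
interleaved = [A, C, B, D]   # pairs each large kernel with a small one
grouped = [A, B, C, D]       # issues both large kernels back to back
```

In this toy model the interleaved order lets each large kernel share the device with a small one, while the grouped order serializes the two large kernels and delays the last small one. Exhaustive enumeration is only practical for a handful of kernels, since the number of permutations grows factorially, which is why a method for deriving a good order directly is useful.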

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 580

November 29, 2015 by hgpu
