high performance computing on graphics processing units: hgpu.org


Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Jianlong Zhong, Bingsheng He

School of Computer Engineering, Nanyang Technological University, Singapore, 639798

arXiv:1303.5164 [cs.DC], (21 Mar 2013)

Graphics processors (GPUs) are now widely used as accelerators in shared environments such as clusters and clouds. In such environments, many kernels are submitted to GPUs by different users, and throughput is an important metric for both performance and total cost of ownership. Despite recent improvements in runtime support for concurrent GPU kernel execution, the GPU can still be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system that uses dynamic slicing and scheduling techniques to improve the throughput of concurrent kernel executions on the GPU. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels, called slices. Each slice has tunable occupancy, which allows it to be co-scheduled with other slices so that GPU resources are fully utilized. We develop a novel and effective Markov chain-based performance model to guide the scheduling decisions. Our experimental results demonstrate up to 31.1% and 23.4% performance improvement on NVIDIA Tesla C2050 and GTX 680 GPUs, respectively.
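The slicing idea in the abstract can be illustrated with a small host-side sketch. This is not Kernelet's implementation: the function names `slice_grid` and `interleave` are hypothetical, the slice size is fixed rather than tuned, and the simple alternating order stands in for the Markov chain-based model, which is not reproduced here. The sketch only shows how a kernel's grid of thread blocks might be cut into contiguous slices, each described by a block offset and a block count, so that slices from two kernels can be issued in alternation.

```python
from itertools import zip_longest


def slice_grid(total_blocks, blocks_per_slice):
    """Split a grid of thread blocks into (block_offset, block_count) slices.

    Hypothetical helper: each slice covers a contiguous range of block
    indices, and the last slice may be smaller than the others.
    """
    if total_blocks < 0 or blocks_per_slice <= 0:
        raise ValueError("total_blocks must be >= 0 and blocks_per_slice > 0")
    return [
        (offset, min(blocks_per_slice, total_blocks - offset))
        for offset in range(0, total_blocks, blocks_per_slice)
    ]


def interleave(name_a, slices_a, name_b, slices_b):
    """Alternate slices of two kernels into a single launch order.

    Assumption: plain alternation is used purely for illustration; a real
    scheduler would pick which slices to pair based on a performance model.
    """
    order = []
    for a, b in zip_longest(slices_a, slices_b):
        if a is not None:
            order.append((name_a, a))
        if b is not None:
            order.append((name_b, b))
    return order


if __name__ == "__main__":
    # Two made-up kernels with 10 and 6 thread blocks, respectively.
    for name, (offset, count) in interleave(
        "kernel_A", slice_grid(10, 4), "kernel_B", slice_grid(6, 3)
    ):
        print(f"launch {name}: blocks [{offset}, {offset + count})")
```

In a real GPU runtime, each slice would be launched as a smaller grid, with the block offset added to the block index inside the kernel so that every slice works on its own part of the original index space.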

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 680, PTX, Task scheduling, Tesla C2050

March 23, 2013 by hgpu
