high performance computing on graphics processing units: hgpu.org


Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun

Seoul National University

arXiv:2012.02732 [cs.LG], (4 Dec 2020)

@misc{kwon2020nimble,
  title={Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning},
  author={Woosuk Kwon and Gyeong-In Yu and Eunji Jeong and Byung-Gon Chun},
  year={2020},
  eprint={2012.02732},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}


Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU tasks, existing DL frameworks suffer from inefficiencies such as large scheduling overhead and unnecessary serial execution. To this end, we propose Nimble, a DL execution engine that runs GPU tasks in parallel with minimal scheduling overhead. Nimble introduces a novel technique called ahead-of-time (AoT) scheduling. Here, the scheduling procedure finishes before executing the GPU kernel, thereby removing most of the scheduling overhead during run time. Furthermore, Nimble automatically parallelizes the execution of GPU tasks by exploiting multiple GPU streams in a single GPU. Evaluation on a variety of neural networks shows that compared to PyTorch, Nimble speeds up inference and training by up to 22.34x and 3.61x, respectively. Moreover, Nimble outperforms state-of-the-art inference systems, TensorRT and TVM, by up to 2.81x and 1.70x, respectively.
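The two ideas in the abstract, deciding the schedule before execution and spreading independent GPU tasks over several streams, can be illustrated with a small pure-Python sketch. This is not Nimble's implementation and not the stream assignment algorithm from the paper: the function names `plan_schedule` and `replay`, the greedy stream-assignment rule, and the toy graph are all hypothetical, and streams are modeled as plain integers rather than real CUDA streams.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_schedule(deps):
    """Build the whole schedule once, before anything is executed.

    deps maps each operator name to the list of operators it depends on.
    Returns a list of (op, stream, waits) tuples in launch order, where
    waits lists the dependencies that live on a different stream and
    therefore need an explicit cross-stream synchronization.
    """
    order = list(TopologicalSorter(deps).static_order())
    stream_of = {}
    inherited = set()  # ops whose stream was already handed to a successor
    next_stream = 0
    plan = []
    for op in order:
        stream = None
        # Greedy rule (an assumption of this sketch): continue on the stream
        # of the first dependency that no other successor has claimed yet.
        for parent in deps.get(op, []):
            if parent not in inherited:
                stream = stream_of[parent]
                inherited.add(parent)
                break
        if stream is None:  # nothing to inherit: open a new stream
            stream = next_stream
            next_stream += 1
        stream_of[op] = stream
        waits = [p for p in deps.get(op, []) if stream_of[p] != stream]
        plan.append((op, stream, waits))
    return plan


def replay(plan, launch):
    """Run time: walk the precomputed plan; no scheduling decisions here."""
    for op, stream, waits in plan:
        launch(op, stream, waits)


# Toy graph: two independent convolutions feed one concatenation.
toy_graph = {
    "input": [],
    "conv_a": ["input"],
    "conv_b": ["input"],
    "concat": ["conv_a", "conv_b"],
}

plan = plan_schedule(toy_graph)
replay(plan, lambda op, stream, waits: print(op, "stream", stream, "waits on", waits))
```

In this sketch the two convolutions land on different streams, so a real backend could overlap them, and `replay` only iterates a list, which mirrors the idea of moving scheduling work out of the run-time path. A real engine would issue recorded GPU kernels on actual streams and use events for the `waits` entries.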

Tags: Computer science, CUDA, Deep learning, Neural networks, nVidia, Package, Task scheduling, Tesla V100

December 13, 2020 by hgpu
