CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems

Chao Chen, Chris Porter, Santosh Pande
Amazon Web Services, Santa Clara, CA, USA
27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), 2022


@inproceedings{chen2022case,
   title={CASE: a compiler-assisted SchEduling framework for multi-GPU systems},
   author={Chen, Chao and Porter, Chris and Pande, Santosh},
   booktitle={Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
   year={2022}
}







Modern computing platforms tend to deploy multiple GPUs on a single node to boost performance. GPUs are an expensive resource with large computing capacity, so increasing their utilization without degrading the performance of individual workloads is an important and challenging problem. Although services such as NVIDIA's MPS allow multiple cooperative kernels to run simultaneously on a single device, they do not solve the co-execution problem for uncooperative, independent kernels on such a multi-GPU system. To tackle this problem, we propose CASE, a fully automated compiler-assisted scheduling framework. During compilation, CASE constructs GPU tasks from CUDA programs and instruments the code with a probe before each one. At runtime, each probe conveys its task's resource requirements, such as the memory and number of streaming multiprocessors (SMs) needed, to a user-level scheduler. The scheduler then places each task onto a suitable device by employing a policy appropriate to the system. In our prototype, a throughput-oriented scheduling policy is implemented to evaluate our resource-aware scheduling framework. The Rodinia benchmark suite and the Darknet neural network framework were used in our evaluation. The results show that, compared to existing state-of-the-art methods, CASE improves throughput by up to 2.5x for Rodinia and up to 2.7x for Darknet on modern NVIDIA GPU platforms, mainly because it improves average system utilization by up to 3.36x and job turnaround time by up to 4.9x, while limiting individual kernel performance degradation to within 2.5%. CASE achieved peak system utilization of 78% for Rodinia and 80% for Darknet on a 4xV100 system.
