CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
Amazon Web Services, Santa Clara, CA, USA
27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), 2022
@inproceedings{chen2022case,
title={CASE: a compiler-assisted SchEduling framework for multi-GPU systems},
author={Chen, Chao and Porter, Chris and Pande, Santosh},
booktitle={Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages={17--31},
year={2022}
}
Modern computing platforms tend to deploy multiple GPUs on a single node to boost performance. GPUs are an expensive resource with large computing capacity, so increasing their utilization without degrading the performance of individual workloads is an important and challenging problem. Although services such as NVIDIA’s MPS allow multiple cooperative kernels to run simultaneously on a single device, they do not solve the co-execution problem for uncooperative, independent kernels on such a multi-GPU system. To tackle this problem, we propose CASE, a fully automated compiler-assisted scheduling framework. During the compilation of an application, CASE constructs GPU tasks from CUDA programs and instruments the code with a probe before each one. At runtime, each probe conveys its task’s resource requirements, such as the memory and the number of streaming multiprocessors (SMs) needed, to a user-level scheduler. The scheduler then places each task onto a suitable device by employing a policy appropriate to the system. In our prototype, a throughput-oriented scheduling policy is implemented to evaluate our resource-aware scheduling framework. The Rodinia benchmark suite and the Darknet neural network framework were used in our evaluation. The results show that, compared to existing state-of-the-art methods, CASE improves throughput by up to 2.5x for Rodinia and up to 2.7x for Darknet on modern NVIDIA GPU platforms, mainly because it improves average system utilization by up to 3.36x and job turnaround time by up to 4.9x. Meanwhile, it limits individual kernel performance degradation to within 2.5%. CASE achieved peak system utilization of 78% for Rodinia and 80% for Darknet on a 4xV100 system.
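To make the probe-and-schedule idea concrete, below is a minimal CUDA C++ sketch of the kind of instrumentation the abstract describes: a probe inserted before a GPU task reports the task's memory and SM (thread-block) needs to a scheduler stub, which returns a device for the task to run on. The names and interfaces here (TaskRequirements, case_probe, case_task_end) are hypothetical illustrations, not the paper's actual API, and the first-fit device selection merely stands in for CASE's throughput-oriented policy.

// Hypothetical sketch of CASE-style instrumentation (names are assumptions).
#include <cuda_runtime.h>
#include <cstdio>

// Resource requirements a probe would convey to the user-level scheduler.
struct TaskRequirements {
    size_t device_memory_bytes;  // device memory the task will allocate
    int    num_blocks;           // thread blocks, a proxy for SMs needed
};

// Stand-in for the scheduler call: pick the first GPU with enough free
// memory. A real scheduler would also track per-device SM occupancy.
static int case_probe(const TaskRequirements& req) {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    for (int d = 0; d < device_count; ++d) {
        size_t free_bytes = 0, total_bytes = 0;
        cudaSetDevice(d);
        cudaMemGetInfo(&free_bytes, &total_bytes);
        if (free_bytes >= req.device_memory_bytes) return d;  // first fit
    }
    return 0;  // fall back to device 0 if nothing fits
}

static void case_task_end(int /*device*/) {
    // A real runtime would tell the scheduler the task's memory and SMs
    // are free again; omitted in this sketch.
}

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // --- Inserted by the compiler pass: probe before the GPU task. ---
    TaskRequirements req{3 * bytes, (n + 255) / 256};
    int device = case_probe(req);
    cudaSetDevice(device);

    // --- Original GPU task: allocate, launch, clean up. ---
    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
    vec_add<<<req.num_blocks, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);

    // --- Inserted by the compiler pass: release the task's resources. ---
    case_task_end(device);
    printf("task ran on device %d\n", device);
    return 0;
}

In this sketch the probe runs in the application's process; in the paper's design the resource information is conveyed to a separate user-level scheduler, which applies a system-wide placement policy across all competing applications.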