An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads
University of Utah, Salt Lake City, UT, USA
EasyChair Preprint no. 6531, 2021
@techreport{chiu2021experimental,
  title={An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads},
  author={Chiu, Cheng-Hsiang and Lin, Dian-Lun and Huang, Tsung-Wei},
  year={2021},
  number={6531},
  institution={EasyChair}
}
Task graph parallelism has emerged as an important tool for efficiently executing large machine learning workloads on GPUs. Users describe a GPU workload as a task dependency graph rather than a sequence of individually submitted GPU operations, allowing the runtime to perform whole-graph scheduling optimizations that significantly improve performance. While the new CUDA graph execution model has demonstrated significant success on this front, the counterpart for SYCL, a general-purpose heterogeneous programming model based on standard C++, remains nascent. Unlike CUDA graph, the SYCL runtime leverages out-of-order queues to implicitly build a task execution graph induced by data dependencies. For explicit task dependencies, users are responsible for creating SYCL events and synchronizing them, at a non-negligible cost. Furthermore, SYCL offers no specialized graph execution model that lets users offload a task graph directly onto a device the way CUDA graph does. This paper conducts an experimental study of SYCL’s default task graph parallelism by comparing it with CUDA graph on large-scale machine learning workloads from the recent HPEC Graph Challenge. Our results highlight the need for a new SYCL graph execution model in the standard.
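To make the contrast concrete, the following SYCL 2020 sketch (illustrative only, not taken from the paper) shows the two dependency mechanisms the abstract describes: an out-of-order queue inferring edges implicitly from buffer accessors, versus explicit user-managed events required when using unified shared memory (USM). It assumes a SYCL-capable compiler such as DPC++.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q;  // SYCL queues are out-of-order by default
  constexpr size_t N = 1024;
  std::vector<float> host(N, 1.0f);

  {
    sycl::buffer<float, 1> buf(host.data(), sycl::range<1>(N));

    // Task A: dependencies are tracked implicitly through accessors.
    q.submit([&](sycl::handler& h) {
      sycl::accessor a(buf, h, sycl::read_write);
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { a[i] *= 2.0f; });
    });

    // Task B: the runtime infers the edge A -> B from the shared buffer;
    // no explicit synchronization is written by the user.
    q.submit([&](sycl::handler& h) {
      sycl::accessor a(buf, h, sycl::read_write);
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { a[i] += 1.0f; });
    });
  }  // buffer destruction synchronizes and copies results back to host

  // With USM, dependencies must be declared explicitly via events,
  // which is the per-edge cost the abstract refers to.
  float* p = sycl::malloc_device<float>(N, q);
  sycl::event eA = q.parallel_for(sycl::range<1>(N),
                                  [=](sycl::id<1> i) { p[i] = 1.0f; });
  sycl::event eB = q.submit([&](sycl::handler& h) {
    h.depends_on(eA);  // user states the edge A -> B by hand
    h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { p[i] += 1.0f; });
  });
  eB.wait();
  sycl::free(p, q);
  return 0;
}
```

In both cases the graph is constructed incrementally as tasks are submitted; there is no SYCL equivalent of instantiating a whole CUDA graph once and replaying it, which is the gap the paper's experiments expose.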
September 26, 2021 by hgpu