An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads

Cheng-Hsiang Chiu, Dian-Lun Lin, Tsung-Wei Huang

University of Utah, Salt Lake City, UT, USA

EasyChair Preprint no. 6531, 2021

Task graph parallelism has emerged as an important tool for efficiently executing large machine learning workloads on GPUs. Users describe a GPU workload as a task dependency graph rather than as aggregated GPU operations and dependencies, allowing the runtime to perform whole-graph scheduling optimizations that significantly improve performance. While the new CUDA graph execution model has demonstrated significant success on this front, its counterpart in SYCL, a general-purpose heterogeneous programming model using standard C++, remains nascent. Unlike CUDA graph, the SYCL runtime leverages out-of-order queues to implicitly create a task execution graph induced by data dependencies. For explicit task dependencies, users are responsible for creating SYCL events and synchronizing them at a non-negligible cost. Furthermore, there is no specialized graph execution model that allows users to offload a task graph directly onto a SYCL device in a way similar to CUDA graph. This paper conducts an experimental study of SYCL's default task graph parallelism by comparing it with CUDA graph on large-scale machine learning workloads in the recent HPEC Graph Challenge. Our results highlight the need for a new SYCL graph execution model in the standard.
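To make the idea of a task dependency graph concrete, the following is a minimal host-only sketch in standard C++. It is not the SYCL or CUDA graph API and not the authors' code: `TaskGraph`, `add_task`, `add_dependency`, `run`, and `run_diamond_example` are hypothetical names introduced only for illustration. The sketch runs tasks serially in a dependency-respecting order (Kahn's algorithm), whereas a real runtime may execute independent tasks concurrently on a device.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical host-only task graph: nodes are callables, edges mean
// "must finish before". Illustration only; NOT the SYCL or CUDA API.
class TaskGraph {
 public:
  // Adds a task and returns its id.
  std::size_t add_task(std::function<void()> work) {
    tasks_.push_back(std::move(work));
    successors_.emplace_back();
    in_degree_.push_back(0);
    return tasks_.size() - 1;
  }

  // Declares that task `before` must complete before task `after` starts.
  void add_dependency(std::size_t before, std::size_t after) {
    assert(before < tasks_.size() && after < tasks_.size());
    successors_[before].push_back(after);
    ++in_degree_[after];
  }

  // Runs each task once in a dependency-respecting order (Kahn's algorithm).
  // Returns the ids in execution order; if the graph contains a cycle, the
  // result holds fewer ids than there are tasks.
  std::vector<std::size_t> run() const {
    std::vector<std::size_t> remaining = in_degree_;
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < tasks_.size(); ++i) {
      if (remaining[i] == 0) ready.push(i);
    }

    std::vector<std::size_t> order;
    while (!ready.empty()) {
      const std::size_t id = ready.front();
      ready.pop();
      tasks_[id]();  // all predecessors of `id` have already run
      order.push_back(id);
      for (std::size_t next : successors_[id]) {
        if (--remaining[next] == 0) ready.push(next);
      }
    }
    return order;
  }

 private:
  std::vector<std::function<void()>> tasks_;
  std::vector<std::vector<std::size_t>> successors_;
  std::vector<std::size_t> in_degree_;
};

// Usage: a diamond-shaped graph A -> {B, C} -> D.
inline int run_diamond_example() {
  int a = 0, b = 0, c = 0, d = 0;
  TaskGraph g;
  const std::size_t ta = g.add_task([&] { a = 1; });
  const std::size_t tb = g.add_task([&] { b = a + 1; });
  const std::size_t tc = g.add_task([&] { c = a + 2; });
  const std::size_t td = g.add_task([&] { d = b + c; });
  g.add_dependency(ta, tb);
  g.add_dependency(ta, tc);
  g.add_dependency(tb, td);
  g.add_dependency(tc, td);
  g.run();
  return d;
}
```

In this model the whole graph is known before anything runs, which is what lets a scheduler reason about it as a unit. B and C have no edge between them, so a concurrent runtime would be free to overlap them.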

Tags: Computer science, CUDA, Heterogeneous systems, Machine learning, nVidia, nVidia GeForce GTX 2080, SYCL

September 26, 2021 by hgpu
