Composing Distributed Computations Through Task and Kernel Fusion
Stanford University, USA
arXiv:2406.18109 [cs.DC] (26 Jun 2024)
@misc{yadav2024composingdistributedcomputationstask,
  title={Composing Distributed Computations Through Task and Kernel Fusion},
  author={Rohan Yadav and Shiv Sundram and Wonchan Lee and Michael Garland and Michael Bauer and Alex Aiken and Fredrik Kjolstad},
  year={2024},
  eprint={2406.18109},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2406.18109}
}
We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that enables the analyses needed to fuse distributed tasks to be performed in a scalable manner. We pair task fusion with a JIT compiler to fuse together the kernels within fused tasks. We show empirically that Diffuse's intermediate representation is general enough to serve as a target for two real-world, task-based libraries (cuNumeric and Legate Sparse), letting Diffuse find optimization opportunities across function and library boundaries. Diffuse accelerates unmodified applications developed by composing task-based libraries by 1.86x on average (geo-mean), and by 0.93x-10.7x on up to 128 GPUs. Diffuse also finds optimization opportunities missed by the original application developers, enabling high-level Python programs to match or exceed the performance of an explicitly parallel MPI library.
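To make the kernel-fusion idea concrete, here is a minimal, generic sketch (plain Python, not Diffuse's actual API or IR): two elementwise "kernels" each make a full pass over the data and materialize an intermediate buffer, while the fused version does the same work in one pass with no intermediate. The function names are illustrative only.

```python
# Unfused: two elementwise "kernels", each a full pass over the data,
# with an intermediate buffer between them. In a distributed task-based
# runtime, each would also be a separately launched task.
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

def scale(c, t):
    return [c * ti for ti in t]

# Fused: a single pass with no intermediate allocation -- the kind of
# cross-kernel opportunity that task and kernel fusion targets.
def fused_axpy_scale(a, c, x, y):
    return [c * (a * xi + yi) for xi, yi in zip(x, y)]
```

The fused form saves a full read and write of the intermediate array; across task boundaries in a distributed runtime, it also saves a task launch and its associated communication analysis.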
June 30, 2024 by hgpu