User-Driven Online Kernel Fusion for SYCL

Victor Perez, Lukas Sommer, Victor Lomüller, Kumudha Narasimhan, Mehdi Goli
Codeplay Software Ltd., UK
ACM Transactions on Architecture and Code Optimization, 2022


   title={User-Driven Online Kernel Fusion for SYCL},
   author={Perez, Victor and Sommer, Lukas and Lom{\"u}ller, Victor and Narasimhan, Kumudha and Goli, Mehdi},
   journal={ACM Transactions on Architecture and Code Optimization},
   publisher={ACM New York, NY},
   year={2022}

Heterogeneous programming models are becoming increasingly popular to support ever-evolving hardware architectures, especially new and emerging specialized accelerators optimized for specific tasks. While such programs provide performance portability of existing applications across various heterogeneous architectures to some extent, short-running device kernels can hurt an application's performance because of the overheads of data transfer, synchronization, and kernel launch. In applications with one or two short-running kernels the overhead can be negligible, but it becomes noticeable when short-running kernels dominate the overall number of kernels in an application, as is the case in graph-based neural network models, where there are several small memory-bound nodes alongside a few large compute-bound nodes. To reduce the overhead, combining several kernels into a single, more optimized kernel is an active area of research. However, this task can be time-consuming and error-prone given the huge set of potential combinations. This can push programmers to seek a trade-off between (a) task-specific kernels with low overhead that are hard to maintain and (b) smaller modular kernels with higher overhead that are easier to maintain. While there are DSL-based approaches, such as those provided for machine learning frameworks, which offer the possibility of such a fusion, they are limited to a particular domain and exploit specific knowledge of that domain and, as a consequence, are hard to port elsewhere. This study explores the feasibility of user-driven kernel fusion through an extension to the SYCL API to address the automation of kernel fusion. The proposed solution requires programmers to define the subgraph regions that are potentially suitable for fusion, without any modification to the kernel code or the function signature. We evaluate the performance benefit of our approach on common neural networks and study the performance improvement in detail.
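
The idea of marking a fusion region around unmodified kernels can be illustrated with a minimal host-only sketch in standard C++. This is not the SYCL extension proposed in the paper: the class `FusionQueue` and its members `start_fusion`, `submit`, `complete_fusion`, and `launches` are hypothetical names chosen for illustration, and a "launch" is modeled as a single loop over the index range. The sketch also assumes that every kernel in a region uses the same range and that work-item `i` only reads data written by work-item `i` of earlier kernels, so running the kernels back to back per index is valid.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device queue; NOT the SYCL API.
class FusionQueue {
public:
    // A kernel is called once per work-item index.
    using Kernel = std::function<void(std::size_t)>;

    // Open a fusion region: later submissions are recorded, not launched.
    void start_fusion() { fusing_ = true; }

    void submit(std::size_t range, Kernel kernel) {
        if (!fusing_) {
            // Outside a region, every kernel costs its own launch.
            std::vector<Kernel> single;
            single.push_back(std::move(kernel));
            launch(range, single);
            return;
        }
        // Assumption of this sketch: all kernels in a region share one range.
        assert(pending_.empty() || range == range_);
        range_ = range;
        pending_.push_back(std::move(kernel));
    }

    // Close the region and run all recorded kernels as one launch.
    void complete_fusion() {
        fusing_ = false;
        if (!pending_.empty()) launch(range_, pending_);
        pending_.clear();
    }

    // Number of simulated kernel launches so far.
    int launches() const { return launches_; }

private:
    void launch(std::size_t range, const std::vector<Kernel>& kernels) {
        ++launches_;
        // Per index, run the kernels in submission order.
        for (std::size_t i = 0; i < range; ++i)
            for (const Kernel& k : kernels) k(i);
    }

    bool fusing_ = false;
    std::size_t range_ = 0;
    std::vector<Kernel> pending_;
    int launches_ = 0;
};
```

Submitting two element-wise kernels between `start_fusion()` and `complete_fusion()` leaves the kernel bodies untouched but results in one simulated launch instead of two. A real implementation would additionally have to handle data dependencies, differing ranges, and device code generation, none of which this sketch attempts.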
