APPy: Annotated Parallelism for Python on GPUs
Georgia Institute of Technology, USA
Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction (CC’24), 2024
@inproceedings{zhou2024appy,
title={APPy: Annotated Parallelism for Python on GPUs},
author={Zhou, Tong and Shirako, Jun and Sarkar, Vivek},
booktitle={Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction},
pages={113--125},
year={2024}
}
GPUs are increasingly being used to speed up Python applications in the scientific computing and machine learning domains. Currently, the two common approaches to leveraging GPU acceleration in Python are 1) creating a custom native GPU kernel and importing it as a function that can be called from Python, and 2) using libraries such as CuPy, which provide pre-defined GPU-backed tensor operators. The first approach is very flexible but requires tremendous manual effort to create a correct and high-performance GPU kernel. The second approach dramatically improves productivity, but it is limited in generality, as many applications cannot be expressed purely using CuPy’s pre-defined tensor operators. Additionally, redundant memory accesses often occur between adjacent tensor operators because intermediate results are materialized. In this work, we present APPy (Annotated Parallelism for Python), which enables users to parallelize generic Python loops and tensor expressions for execution on GPUs by adding simple compiler directives (annotations) to Python code. Empirical evaluation on 20 scientific computing kernels from the literature, on a server with an AMD Ryzen 7 5800X 8-Core CPU and an NVIDIA RTX 3090 GPU, demonstrates that with simple pragmas APPy is able to generate more efficient GPU code, achieving significant geometric mean speedups relative to CuPy (30× on average) and to three state-of-the-art Python compilers: Numba (8.3× on average), DaCe-GPU (3.1× on average) and JAX-GPU (18.8× on average).
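The point about materialized intermediate results can be illustrated with a minimal sketch. It uses NumPy on the CPU as a stand-in for a tensor-operator library, and it deliberately does not use APPy's own directive syntax, which is not shown in the abstract. The function names and the computation itself (a scaled vector addition followed by a sum) are hypothetical, chosen only for illustration.

```python
import numpy as np

def scaled_add_sum_operators(a, x, y):
    # Operator-by-operator style: each step writes a full temporary array,
    # which the next operator then has to read back from memory.
    t1 = a * x       # first intermediate array
    t2 = t1 + y      # second intermediate array
    return t2.sum()

def scaled_add_sum_loop(a, x, y):
    # Generic-loop style: one pass over the data, no intermediate arrays.
    # A loop of this shape is the kind of code a directive-based compiler
    # could be asked to parallelize (assumption: illustrative only).
    total = 0.0
    for i in range(x.shape[0]):
        total += a * x[i] + y[i]
    return total

x = np.arange(5, dtype=np.float64)
y = np.ones(5)
result_ops = scaled_add_sum_operators(2.0, x, y)
result_loop = scaled_add_sum_loop(2.0, x, y)
```

Both functions compute the same value. The difference is in memory traffic: the first allocates and traverses two temporary arrays, while the second reads each input element once.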
February 25, 2024 by hgpu