Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

Rendong Liang, Ting Cao, Jicheng Wen, Manni Wang, Yang Wang, Jianhua Zou, Yunxin Liu

Microsoft Research, University of California, Irvine

The 28th Annual International Conference On Mobile Computing And Networking (MobiCom 2022), 2022

Source code package: ArchProbe (a profiler to disclose and quantify hardware features on GPUs)

The mobile GPU, a ubiquitous and powerful accelerator, plays an important role in on-device DNN (Deep Neural Network) inference. The frequent upgrades and diversity of mobile GPUs call for automatic kernel generation to enable fast DNN deployment. However, currently generated kernels perform poorly. The goal of this paper is to rapidly generate high-performance kernels for diverse mobile GPUs. The major challenges are (1) the optimal kernel is unclear because hardware knowledge is lacking, and (2) the kernel must be generated rapidly from a large space of candidates. For the first challenge, we propose a cross-platform profiling tool, the first to disclose and quantify mobile GPU architecture. Its results demystify the hardware bottleneck and direct the solution to the second challenge by exposing the unique high-performance hardware feature, identifying kernels that are inefficient given hardware constraints, and specifying the performance bound for kernels. Directed by these results, we propose Romou, a mobile-GPU-specific kernel compiler. It supports the unique hardware feature in kernel implementation and prunes inefficient kernels against hardware resources. Romou can thus rapidly generate high-performance kernels. Compared to state-of-the-art generated kernels, it achieves up to 14.7x speedup on average for convolution. Up to 99% of the search space is pruned. Its performance is even up to 1.2x faster on average than the state-of-the-art hand-optimized implementation.

Tags: Code generation, Computer science, OpenCL, Package, Performance, Vulkan

March 6, 2022 by hgpu
