A Unified Approach to Variable Renaming for Enhanced Vectorization
Georgia Institute of Technology, Atlanta GA, USA
31st International Workshop on Languages and Compilers for Parallel Computing (LCPC’18), 2018
@inproceedings{chatarasi2018unified,
  title={A Unified Approach to Variable Renaming for Enhanced Vectorization},
  author={Chatarasi, Prasanth and Shirako, Jun and Cohen, Albert and Sarkar, Vivek},
  booktitle={31st International Workshop on Languages and Compilers for Parallel Computing (LCPC'18)},
  year={2018}
}
Although compiler technologies for automatic vectorization have been under development for over four decades, considerable gaps remain in the ability of modern compilers to perform automatic vectorization for SIMD units. One such gap lies in the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and output (write-after-write) dependences. Past approaches, such as variable renaming and variable expansion, break such dependence cycles by either eliminating or repositioning the problematic memory-based dependences. However, this past work suffers from three key limitations: 1) lack of a unified framework that synergistically integrates multiple storage transformations, 2) lack of support for bounding the additional space required to break memory-based dependences, and 3) lack of support for integrating these storage transformations with other code transformations (e.g., statement reordering) to enable vectorization. In this paper, we address these three limitations by integrating both Source Variable Renaming (SoVR) and Sink Variable Renaming (SiVR) transformations into a unified formulation, and by formalizing the "cycle-breaking" problem as a minimum weighted set cover optimization problem. To the best of our knowledge, our work is the first to formalize an optimal solution for cycle breaking that simultaneously considers both SoVR and SiVR transformations, thereby enhancing vectorization and reducing storage expansion relative to performing the transformations independently. We implemented our approach in PPCG, a state-of-the-art optimization framework for loop transformations, and evaluated it on eleven kernels from the TSVC benchmark suite. Our experimental results show a geometric-mean performance improvement of 4.61x on an Intel Xeon Phi (KNL) machine relative to the optimized performance obtained by Intel's ICC v17.0 product compiler. Further, our results demonstrate geometric-mean performance improvements of 1.08x and 1.14x on the Intel Xeon Phi (KNL) and NVIDIA Tesla V100 (Volta) platforms, respectively, relative to past work that performs only the SiVR transformation [5], and of 1.57x and 1.22x on the two platforms relative to past work that uses both SiVR and SoVR transformations.
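To make the kind of dependence cycle concrete, the following C sketch illustrates sink-style renaming on a loop whose two statements form a cycle of memory-based anti dependences. The arrays and the exact renaming scheme are illustrative only (they are not taken from the paper); the sketch shows the general idea behind SiVR rather than the paper's unified algorithm.

/* Original loop: S1 ->(anti, loop-independent) S2 on b[i], and
 * S2 ->(anti, loop-carried) S1 on a[i+1], form a cycle that blocks
 * vectorization. Assumes a has at least n+1 elements.              */
void original(int n, float *a, float *b) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 1.0f;      /* S1: writes a[i], reads b[i]   */
        b[i] = a[i + 1] * 2.0f;  /* S2: writes b[i], reads a[i+1] */
    }
}

/* After sink renaming: S1's write is redirected to a fresh array
 * a_new, so S2's read of a[i+1] always sees the original values of
 * a. The loop-carried anti dependence S2 -> S1 disappears, leaving
 * only a lexically forward anti dependence on b[i], so the first
 * loop is now vectorizable.                                        */
void sivr_renamed(int n, float *a, float *b, float *a_new) {
    for (int i = 0; i < n; i++) {
        a_new[i] = b[i] + 1.0f;   /* S1': sink write renamed */
        b[i] = a[i + 1] * 2.0f;   /* S2 unchanged            */
    }
    for (int i = 0; i < n; i++)   /* copy back if a is live-out */
        a[i] = a_new[i];
}

The copy-back loop is the storage and runtime overhead that motivates the paper's cost-aware formulation: each renaming buys vectorizability at the price of extra arrays and copies.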
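The set-cover view can also be sketched in code. In the formulation described in the abstract, the universe is the set of dependence cycles to be broken, each candidate SoVR/SiVR transformation covers the cycles it breaks, and its weight is the storage it adds. The sketch below uses a greedy heuristic purely for illustration; it is not the paper's solver (which formalizes an optimal solution), and all candidate names, masks, and costs are hypothetical.

#include <stdio.h>

typedef struct {
    const char *name;   /* e.g. "SiVR on a", "SoVR on b" (hypothetical) */
    unsigned cycles;    /* bitmask of cycles this transformation breaks */
    double cost;        /* storage overhead used as the set-cover weight */
} Candidate;

int main(void) {
    Candidate cands[] = {
        {"SoVR on b", 0x1u, 1.0},   /* breaks cycle 0 only   */
        {"SiVR on a", 0x6u, 4.0},   /* breaks cycles 1 and 2 */
        {"SiVR on c", 0x4u, 3.0},   /* breaks cycle 2 only   */
    };
    int ncand = sizeof cands / sizeof cands[0];
    unsigned uncovered = 0x7u;      /* three cycles to break */

    while (uncovered) {
        int best = -1;
        double best_ratio = 0.0;
        for (int i = 0; i < ncand; i++) {
            unsigned gain_mask = cands[i].cycles & uncovered;
            /* __builtin_popcount is a GCC/Clang builtin */
            int gain = __builtin_popcount(gain_mask);
            if (gain == 0) continue;
            double ratio = gain / cands[i].cost;  /* cycles broken per unit cost */
            if (best < 0 || ratio > best_ratio) {
                best = i;
                best_ratio = ratio;
            }
        }
        if (best < 0) break;        /* some cycle cannot be broken */
        printf("apply %s (cost %.1f)\n", cands[best].name, cands[best].cost);
        uncovered &= ~cands[best].cycles;
    }
    return 0;
}

On this toy instance the greedy pass applies "SoVR on b" and then "SiVR on a" for a total weight of 5.0, showing how weighting lets cheaper transformations be preferred when several can break the same cycle.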