
Performance Optimisations for Heterogeneous Managed Runtime Systems

Michail Papadimitriou
University of Manchester
University of Manchester, 2021

@article{papadimitriou2021performance,
   title={Performance Optimisations for Heterogeneous Managed Runtime Systems},
   author={Papadimitriou, Michail},
   year={2021}
}


High demand for increased computational capabilities and power efficiency has led commodity devices to integrate diverse hardware resources. Desktops, laptops, and smartphones have embraced heterogeneity through multi-core Central Processing Units (CPUs), energy-efficient integrated Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), powerful discrete GPUs, and Tensor Processing Units (TPUs). To ease the programmability of these heterogeneous hardware accelerators, several parallel programming frameworks, such as OpenCL and CUDA, have been introduced to support the new diverse computing paradigm. To utilise heterogeneous hardware accelerators, software engineers must depart from conventional software engineering practices, which until now assumed CPU-only execution. Managing this transition requires a deep understanding of the underlying hardware architecture and of parallel programming principles. To this end, several frameworks (e.g., Lift, TVM, Halide, Lime, Aparapi, Dandelion, Marawacc) have made heterogeneous hardware accessible from high-level languages. Yet these frameworks tend either to be tailored to a domain (e.g., machine learning, computer vision) or to expose hardware particularities to the developer. In addition, little work has been done on making niche aspects of heterogeneous hardware a natural extension to languages running on top of conventional managed runtimes, such as the Java Virtual Machine (JVM). This thesis presents novel performance-oriented, architecture-dependent compiler and runtime optimisations that enable Java applications to benefit from heterogeneous execution seamlessly. As a foundation, it uses TornadoVM, an open-source general-purpose programming framework that accommodates the execution of Java programs on heterogeneous hardware.
The key objective is to bridge the performance gap between managed runtime systems and heterogeneous hardware, leading to efficient heterogeneous managed runtimes. This objective is fulfilled by three distinct contributions.
The first contribution concerns the performance and portability of FPGA execution. FPGAs can provide high-performance execution along with power efficiency; however, they rely on a complex process dependent on High-Level Synthesis (HLS) software. This work describes a novel approach to integrating FPGAs into high-level managed programming languages by introducing a series of runtime code specialisation techniques for the seamless execution of Java programs on FPGAs. In the experimental evaluation, FPGA execution achieves a geometric mean speedup of 1.2x (maximum 224x) over sequential Java implementations and a geometric mean speedup of 0.14x (maximum 19.8x) over multithreaded Java implementations. Compared to TornadoVM running on an Intel integrated GPU, it achieves a geometric mean speedup of 0.32x with a maximum of 13.82x.
The second contribution concerns the automatic exploitation of the GPU memory hierarchy to increase performance. The memory hierarchy of heterogeneous hardware is a key factor for performance, yet it is difficult to exploit, even for expert programmers. This work provides an extensible and parameterisable collection of compiler optimisations that automatically exploit data locality and the memory hierarchy of GPUs. These optimisations are implemented on top of the industrial-strength Graal compiler and enable Java programs to utilise the local memory of GPUs without explicit programming. A selection of benchmarks and GPU architectures was used to demonstrate the performance improvements. Compared with baseline implementations of generated parallel code that do not exploit data locality, the experimental evaluation shows speedups of up to 2.5x. Moreover, the new technique reached up to 97% of the performance of native code, highlighting the efficiency of the generated code.
The third and final contribution concerns the concurrent exploitation of multiple heterogeneous hardware accelerators. Heterogeneous managed runtimes need to consider application and device characteristics to perform an efficient allocation. This work addresses the seamless concurrent execution of multiple tasks on multiple devices by extending the virtualisation layer of TornadoVM to execute multiple bytecode interpreters in parallel. Furthermore, the concurrent execution was combined with a machine learning model, based on a multiple-classification architecture of Extra-Trees-Classifiers, to perform efficient device-task allocation. The experimental results show performance improvements of up to 83% compared to all tasks running on the best single device, while attaining up to 91% of the highest achievable performance.
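The FPGA results above pair a geometric mean speedup with a maximum speedup, and the two can differ widely (for example, a geometric mean below 1x alongside a maximum well above 1x). The following is a minimal Java sketch of how such a summary could be computed. The class name `GeoMeanSpeedup`, the method `geometricMean`, and the sample values are all hypothetical illustrations; they are not taken from the thesis or from TornadoVM.

```java
import java.util.Arrays;

public class GeoMeanSpeedup {

    /**
     * Geometric mean of per-benchmark speedups, computed as the
     * exponential of the mean of the logarithms.
     */
    public static double geometricMean(double[] speedups) {
        if (speedups.length == 0) {
            throw new IllegalArgumentException("no speedups given");
        }
        double logSum = 0.0;
        for (double s : speedups) {
            if (s <= 0.0) {
                throw new IllegalArgumentException("speedups must be positive");
            }
            logSum += Math.log(s);
        }
        return Math.exp(logSum / speedups.length);
    }

    public static void main(String[] args) {
        // Hypothetical speedups (baseline time / accelerated time); NOT thesis data.
        double[] speedups = {0.05, 0.1, 0.2, 16.0};
        double max = Arrays.stream(speedups).max().getAsDouble();
        System.out.printf("geomean = %.3fx, max = %.1fx%n",
                geometricMean(speedups), max);
    }
}
```

With these made-up values, one benchmark speeds up 16x while the geometric mean stays below 1x, because the slowdowns on the other three benchmarks dominate the product.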