## Improving Energy Efficiency of Basic Linear Algebra Routines on Heterogeneous Systems with Multiple GPUs

University of California, Riverside

University of California, 2023

@phdthesis{sabzi2023improving,

title={Improving Energy Efficiency of Basic Linear Algebra Routines on Heterogeneous Systems With Multiple GPUs},

author={Sabzi, Hadi Zamani},

year={2023},

school={University of California, Riverside}

}

The current trend of ever-increasing performance in high performance computing (HPC) applications comes with tremendous growth in energy consumption. Because existing libraries are mainly concerned with performance, they do not make efficient use of heterogeneous computing systems, resulting in energy inefficiency. Hence, improving the energy efficiency of critical applications running on HPC systems is necessary to deliver better performance at a given power budget. The aim of this dissertation is to develop techniques and frameworks to improve the energy efficiency of the high performance applications on heterogeneous system with GPUs while maintaining the reliability and performance requirements. In our first approach, we present GreenMM framework for matrix multiplication, which reduces energy consumption in GPUs through undervolting without sacrificing the performance. The idea is to undervolt the GPU beyond the minimum operating voltage (Vmin) to save maximum energy while keeping the frequency constant. Since such undervolting may give rise to faults, we design an Algorithm Based Fault Tolerance (ABFT) algorithm to detect and correct those errors. We target Matrix Multiplication (MM), as a key kernel used in many scientific applications. Empirically, we explore different errors and derive a fault model as a function of undervolting levels and matrix sizes. Then, using the model, we configure the proposed fault tolerant MM algorithm. We show that energy consumption is reduced up to 19.8%. GreenMM also improves the GFLOPS/Watt by 9% with negligible performance overhead. In our second study, we present a framework for GPU applications, which reduces energy consumption in GPUs through Safe Overclocking and Undervolting (SAOU) without sacrificing performance. The idea is to increase the frequency beyond the safe frequency fsafeMax and undervolt below VsafeM in to get maximum energy saving. Since such overclocking and undervolting may give rise to faults, we employ an enhanced checkpoint-recovery technique to cover the possible errors. Empirically, we explore different errors and derive a fault model that is used to determine the appropriate undervolting and overclocking level for maximum energy saving. Similarly, we target MM kernel for error correction using the checkpoint and recovery (CR) technique as an example of scientific applications. In case of MM, SAOU achieves up to 22% energy reduction through undervolting and overclocking without sacrificing the performance. In our third study, we introduce GreenMD, an energy-efficient framework for heterogeneous systems for LU factorization utilizing multi-GPUs. LU factorization is a crucial kernel from the MAGMA library, which is highly optimized. The aim is to apply DVFS by leveraging slacks intelligently on both CPU and multiple GPUs. To predict the slack times, accurate performance models are developed separately for CPUs, GPUs, and PCIe bus based on the algorithmic knowledge and manufacturer’s specifications. We also determine the appropriate level of undervolting for both CPUs and GPUs through offline profiling. Reducing voltage below threshold values may give rise to errors; hence we extract the minimum safe voltages (VsafeM in) for the CPUs and GPUs utilizing a low overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPUs, and total energy about 59%, 21%, and 31%, respectively, while delivering similar performance to the state-of-the-art linear algebra library MAGMA. In our fourth study, we introduce a fault tolerant algorithm for LU factorization on heterogeneous systems with GPUs. We have developed a fault tolerant algorithm by constructing a local and global checksums. The local checksums are used to detect the errors and the global checksums are used to correct the errors. Using the local checksums, we can detect the errors in the middle of the computation which enables us to tolerate more number of faults during the whole execution. LU factorization has three main phases. These phases have different sensitivity to the error. For each phase, we introduce an appropriate level of fault tolerance to prevent the error propagation to other phases. Since, we check the correctness of the computation in each iteration, if there is any error in the system, only the small fraction of the computation is affected and can be covered easily in compared to the previous works such as ABFT.

May 21, 2023 by hgpu