Performance Optimization of GPU ELF-Codes

Fausto Artico
University of Padova, Department of Information Engineering
University of Padova, 2014


   title={Performance Optimization of GPU ELF-Codes},

   author={Artico, Fausto},



Source Source   



GPUs (Graphic Processing Units) are of interest for their favorable ratio GF/s/price. Compared to the beginning – early 1980’s – nowadays GPU architectures are more similar to general purpose architectures but with (much) larger numbers of cores – the GF100 architecture released by NVIDIA in 2009-2010, for example, has a true hardware cache hierarchy, a unified memory address space, double precision performance and has a maximum of 512 cores. Exploiting the computational power of GPUs for non-graphics applications – past or present – has, however, always been hard. Initially, in the early 2000’s, the way to program GPUs was by using graphic libraries APis (exclusively), which made writing non-graphics codes non-trivial and tedious at best, and virtually impossible in the worst case. In 2003, the Brook compiler and runtime system was introduced, giving users the ability to generate GPU code from a high level programming language. In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture). CUDA, a parallel computing platform and programming model specifically developed by NVIDIA for its GPUs, attempts to further facilitate general purpose programming of GPUs. Code edited using CUDA is portable between different NVIDIA GPU architectures and this is one of the reasons because NVIDIA claims that the user’s productivity is much higher than previous solutions, however optimizing GPU code for utmost performance remains very hard, especially for NVIDIA GPUs using the GF100 architecture – e.g., Fermi GPUs and some Tesla GPUs – because a) the real instruction set architecture (ISA) is not publicly available, b) the code of the NVIDIA compiler – nvcc – is not open and c) users can not edit code using the real assembly – ELF in NVIDIA parlance. Compilers, while enabling immense increases in programmer productivity, by eliminating the need to code at the (tedious) assembly level, are incapable of achieving, to date, performance similar to that of an expert assembly programmer with good knowledge of the underlying architecture. In fact, it is widely accepted that high-level language programming and compiling even with a stateof- the-art compilers loose, on average, a factor of 3 in performance – and sometimes much more – over what a good assembly programmer could achieve, and that even on a conventional, simple, single-core machine. Compilers for more complex machines, such as NVIDIA GPUs, are likely to do much worse because among other things, they face (even more) complex trade-offs between often undecidable and NP-hard problems. However, because NVIDIA a) makes it virtually impossible to gain access to the actual assembly language used by its GF100 architecture, b) does not publicly explain many of the internal mechanisms implemented in its compiler – nvcc – and c) makes it virtually impossible to learn the details of its very complex GF100 architecture in sufficient detail to be able to exploit them, obtaining an estimate of the performance difference between CUDA programming and machine-level programming for NVIDIA GPUs using the GF100 architecture – let alone achieving some a priori performance guarantees of shortest execution time – has been, prior to this current work, impossible. To optimize GPU code, users have to use CUDA or PTX (Parallel Thread Execution) – a virtual instruction set architecture. The CUDA or PTX files are given in input to nvcc that produces as output fatbin files. The fatbin files are produced considering the target GPU architecture selected by the user – this is done setting a flag used by nvcc. In a fatbin file, zero or more parts of the fatbin file will be executed by the CPU – think of these parts as the C/C++ parts – while the remaining parts of the fatbin file – think of these parts as the ELF parts – will be executed by the specific model of the GPU for which the CUDA or PTX file has been compiled. The fatbin files are usually very different from the corresponding CUDA or PTX files and this lack of control can completely ruin any effort made at CUDA or PTX level to optimize the ELF part/parts of the fatbin file that will be executed by the target GPU for which the fatbin file has been compiled. We therefore reverse engineer the real ISA used by the GF100 architecture and generate a set of editing guidelines to force nvcc to generate fatbin files with at least the minimum number of resources later necessary to modify them to get the wanted ELF algorithmic implementations – this gives control on the ELF code that is executed by any GPU using the GF100 architecture. During the process of reverse engineering we also discover all the correspondences between PTX instructions and ELF instructions – a single PTX instruction can be transformed in one or more ELF instructions – and the correspondences between PTX registers and ELF registers. Our procedure is completely repeatable for any NVIDIA Kepler GPU – we do not need to rewrite our code. Being able to get the wanted ELF algorithmic implementations is not enough to optimize the ELF code of a fatbin file, we need in fact also to discover, understand, and quantify some not disclosed GPU behaviors that could slow down the execution of ELF code. This is necessary to understand how to execute the optimization process and while we can not report here all the results we have got, we can however say that we will explain to the reader a) how to force even distributions of the GPU thread blocks to the streaming multiprocessors, b) how we have discovered and quantified several warp scheduling phenomenons, c) how to avoid phenomenons of warp scheduling load unbalancing, that it is not possible to control, in the streaming multiprocessors, d) how we have determined, for each ELF instruction, the minimum quantity of time that it is necessary to wait before a warp scheduler can schedule again a warp – yes, the quantity of time can be different for different ELF instructions – e) how we have determined the time that it is necessary to wait before to be able to read again the data in a register previously read or written – this too can be different for different ELF instructions and different whether the data has been previously read or written – and f) how we have discovered the presence of an overhead time for the management of the warps that does not grow linearly to a liner increase of the number of residents warps in a streaming multiprocessor. Next we explain a) the procedures of transformation that it is necessary to apply to the ELF code of a fatbin file to optimize the ELF code and so making its execution time as short as possible, b) why we need to classify the fatbin files generated from the original fatbin file during the process of optimization and how we do this using several criteria that as final result allow us to determine the positions, occupied by each one of the fatbin files generated, in a taxonomy that we have created, c) how using the position of a fatbin file in the taxonomy we determine whether the fatbin file is eligible for an empirical analysis – that we explain – a theoretical analysis or both, and d) how – if the fatbin file is eligible for a theoretical analysis – we execute the theoretical analysis that we have devised and give an a priori – without any previous execution of the fatbin file – shortest ELF code execution time guarantee – this if the fatbin file satisfies all the requirements of the theoretical analysis – for the ELF code of the fatbin file that will be executed by the target GPU for which the fatbin file has been compiled.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: