
Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators

Mehdi Amini
CRI – Centre de Recherche en Informatique
pastel-00958033, (11 March 2014)

@phdthesis{amini:pastel-00958033,

   title={Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators},

   author={Amini, Mehdi},

   url={https://pastel.archives-ouvertes.fr/pastel-00958033},

   number={2012ENMP0105},

   school={Ecole Nationale Sup{\'e}rieure des Mines de Paris},

   year={2014},

   month={Dec},

   keywords={GPU; CUDA; OpenCL; Parall{\'e}lisation automatis{\'e}e; Compilation; Automatic Parallelization},

   type={Theses},

   pdf={https://pastel.archives-ouvertes.fr/pastel-00958033/file/2012ENMP0105.pdf},

   hal_id={pastel-00958033},

   hal_version={v1}

}


Since the beginning of the 2000s, the raw performance of processors has stopped increasing exponentially. Modern graphics processing units (GPUs) are designed as arrays of hundreds or thousands of compute units. Their compute capacity quickly led them to be diverted from their original purpose and used as accelerators for general-purpose computation. However, programming a GPU efficiently to perform computations other than 3D rendering remains challenging.

The current jungle in the hardware ecosystem is mirrored in the software world, with more and more programming models, new languages, different APIs, etc. But no one-size-fits-all solution has emerged.

This thesis proposes a compiler-based solution to partially address the three "P" properties: Performance, Portability, and Programmability. The goal is to automatically transform a sequential program into an equivalent program accelerated with a GPU. A prototype, Par4All, is implemented and validated with numerous experiments. Programmability and portability are enforced by definition; the performance may not be as good as what an expert programmer can obtain, but it has still been measured as excellent for a wide range of kernels and applications.

A survey of GPU architectures and of trends in language and framework design is presented. Data movement between the host and the accelerator is managed without involving the developer. An algorithm is proposed to optimize communication by sending data to the GPU as early as possible and keeping it on the GPU as long as it is not required by the host. Loop transformation techniques for kernel code generation are involved, and even well-known ones have to be adapted to match specific GPU constraints. They are combined in a coherent and flexible way and dynamically scheduled within the compilation process of an interprocedural compiler. Some preliminary work on extending the approach toward multiple GPUs is presented.
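The idea of keeping data on the GPU until the host needs it can be illustrated with a small toy model. The sketch below is not the algorithm from the thesis, which the abstract describes as part of the compilation process; it is a hypothetical runtime simulation in Python, with invented names (`ResidencyTracker`, `launch_kernel`, and so on), that only counts which transfers would be needed. It copies data lazily at kernel launch, so it shows the redundancy-elimination half of the idea and not the "as early as possible" scheduling.

```python
class ResidencyTracker:
    """Toy model (hypothetical, not Par4All): tracks where the
    up-to-date copy of each named array currently lives."""

    def __init__(self):
        self.valid_on_host = set()
        self.valid_on_device = set()
        self.transfers = []  # log of ("h2d" | "d2h", array name)

    def host_write(self, name):
        # The host produces a new value; any device copy becomes stale.
        self.valid_on_host.add(name)
        self.valid_on_device.discard(name)

    def host_read(self, name):
        # Copy back only if the sole up-to-date copy is on the device.
        if name not in self.valid_on_host and name in self.valid_on_device:
            self.transfers.append(("d2h", name))
            self.valid_on_host.add(name)

    def launch_kernel(self, reads, writes):
        # Upload only the inputs that are not already valid on the device.
        for name in reads:
            if name not in self.valid_on_device and name in self.valid_on_host:
                self.transfers.append(("h2d", name))
                self.valid_on_device.add(name)
        # Kernel outputs live on the device; host copies become stale.
        for name in writes:
            self.valid_on_device.add(name)
            self.valid_on_host.discard(name)


def demo():
    """Three kernels update array "a" in sequence, then the host reads it."""
    tracker = ResidencyTracker()
    tracker.host_write("a")
    for _ in range(3):
        tracker.launch_kernel(reads=["a"], writes=["a"])
    tracker.host_read("a")
    return tracker.transfers


if __name__ == "__main__":
    print(demo())
```

Under this model, the three kernel launches in `demo()` need one upload and one download in total, whereas a naive scheme that copies around every kernel would perform three of each.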