Automatic Generation of FFT Libraries for GPU Platforms

Christos Angelopoulos
Carnegie Mellon University, Pittsburgh, PA
Carnegie Mellon University, 2012


   title={Automatic Generation of FFT Libraries for GPU Platforms},

   author={Angelopoulos, Christos},



Download Download (PDF)   View View   Source Source   



Compilers introduce a set of optimizations to speed-up source code. However due to the variety of computation platforms, algorithm complexity and problem sizes, general purpose compilers can fail to improve performance. The burden on library developers increases significantly to write optimized libraries since the user code relies on them for performance. This argument strengthens the case for automatic performance tuning frameworks like SPIRAL, a program generator and optimizer for linear transforms such as the Discrete Fourier Transform (DFT). In this thesis we present the challenges and difficulties of implementing libraries on GPUs and provide an FFT Compiler framework based on SPIRAL supporting GPU platforms. We model all the necessary GPU parameters and implemented a rewriting system and a backend that generates optimized code. Our results are comparable with NVIDIA’s cuFFT hand-written library. Today the majority of work on parallelizing compilers targets mainly large applications on multicore processors designed for coarse grain parallelism. The resulting compilers are quite successful and provide good performance scaling for relative simple programs running on systems with a small number of cores. However, they cannot achieve high throughput and streaming speed-up for complicated libraries targeting a large number of cores demanding fine-grained parallelism. The burden and difficulty of writing high performance software for developers, has increased dramatically. The programmer now has to use multiple threads, vector instruction sets, and tune the code to the memory hierarchy. This process is platform-specific and performance does not port easily. If the developer fails to apply these these optimizations, it may result in dramatic performance losses as shown in Figure 1. This Figure shows the performance in Gflop/s (giga floating point operations per second) for two implementations of the discrete Fourier transform (DFT) on NVIDIA’s GTX480 Fermi GPU. The bottom line shows a straightforward naive implementation based on typical compiler and loop optimizations. The fastest code has been generated by SPIRAL [3], an Automatic Program Generation System. the performance difference between them is a factor of 12. In this work we explain the main reasons for the achieved performance on GPUs. Our contributions include the development of novel algorithms that match the architectural constraints imposed by GPU platforms. We provide algorithmic optimizations for all layers of memory hierarchy on GPU platforms. Our Local Memory FFT kernel algorithms are optimized for fast local memory access without stalls or trading off space for speed. Our Main Memory FFT algorithms tackle the issue with DRAM main memory hierarchy. We provide fast algorithmic implementations that handle small and large stride permutation lengths. Our library is also optimized for maximum vectorization and parallelism across all multiprocessors.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: