Towards a Performance-Portable FFT Library for Heterogeneous Computing
Department of Electrical & Computer Engineering, NSF Center for High-Performance Reconfigurable Computing, Virginia Tech, Blacksburg, VA, USA
Virginia Tech, 2014
@article{del2014towards,
title={Towards a Performance-Portable FFT Library for Heterogeneous Computing},
author={del Mundo, Carlo and Feng, Wu-chun},
year={2014}
}
The fast Fourier transform (FFT), a spectral method that computes the discrete Fourier transform and its inverse, pervades many applications in digital signal processing, such as imaging, tomography, and software-defined radio. Its importance has led the research community to expend significant resources on accelerating the FFT, with FFTW being the most prominent example. With the emergence of the graphics processing unit (GPU) as a massively parallel computing device for high performance, we seek to identify architecture-aware optimizations across two generations of high-end AMD and NVIDIA GPUs: the AMD Radeon HD 6970 and HD 7970, and the NVIDIA Tesla C2075 and K20c. Despite architectural differences across GPU generations and vendors, we identified a homogeneous set of optimizations as being most effective in accelerating FFT performance: (1) register preloading, (2) transposition via local memory, and (3) 8- or 16-byte vector access and scalar arithmetic. We then demonstrate the efficacy of combining certain optimizations in concert, with register preloading, transposition via local memory, and use of constant memory being the most effective combination for all architectures. Our study suggests that the performance of FFTs on graphics processors is primarily limited by global memory data transfer. Overall, our optimizations deliver speed-ups as high as 31.5x over a baseline GPU implementation and 9.1x over a multithreaded FFTW CPU implementation with AVX vector extensions.
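For readers unfamiliar with the transform the abstract refers to, the following is a minimal sketch of a radix-2 FFT and its inverse in plain Python, checked against a direct discrete Fourier transform. It is an illustration only and not the paper's GPU implementation: the function names `fft`, `ifft`, and `dft` are hypothetical, the input length is assumed to be a power of two, and none of the GPU optimizations listed above (register preloading, local-memory transposition, vector access) are modeled.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey transform (illustrative sketch).

    Assumes len(x) is a power of two. With inverse=True the twiddle
    sign is flipped but the result is NOT scaled by 1/n (see ifft).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)  # transform of even-indexed samples
    odd = fft(x[1::2], inverse)   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd half (butterfly step).
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(X):
    """Inverse transform: flipped twiddle sign plus 1/n scaling."""
    n = len(X)
    return [v / n for v in fft(X, inverse=True)]


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Calling `fft` on a power-of-two-length list and comparing it element by element with `dft` on the same list should agree to within floating-point rounding, and `ifft(fft(x))` should recover `x`. The recursive even/odd split performs O(n log n) work, compared with O(n^2) for the direct sum.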
February 17, 2014 by hgpu