high performance computing on graphics processing units: hgpu.org


Towards a Performance-Portable FFT Library for Heterogeneous Computing

Carlo del Mundo, Wu-chun Feng

Department of Electrical & Computer Engineering, NSF Center for High-Performance Reconfigurable Computing, Virginia Tech, Blacksburg, VA, USA

Virginia Tech, 2014

@article{del2014towards,
  title={Towards a Performance-Portable FFT Library for Heterogeneous Computing},
  author={del Mundo, Carlo and Feng, Wu-chun},
  year={2014}
}


The fast Fourier transform (FFT), a spectral method that computes the discrete Fourier transform and its inverse, pervades many applications in digital signal processing, such as imaging, tomography, and software-defined radio. Its importance has led the research community to invest significant effort in accelerating the FFT, with FFTW being the most prominent example. With the emergence of the graphics processing unit (GPU) as a massively parallel computing device for high performance, we seek to identify architecture-aware optimizations across two different generations of high-end AMD and NVIDIA GPUs, namely the AMD Radeon HD 6970 and HD 7970 and the NVIDIA Tesla C2075 and K20c, respectively. Despite architectural differences across GPU generations and vendors, we identified a homogeneous set of optimizations as being most effective in accelerating FFT performance: (1) register preloading, (2) transposition via local memory, and (3) 8- or 16-byte vector access and scalar arithmetic. We then demonstrate the efficacy of combining certain optimizations in concert, with register preloading, transposition via local memory, and use of constant memory being the most effective for all architectures. Our study suggests that the performance of FFTs on graphics processors is primarily limited by global memory data transfer. Overall, our optimizations deliver speed-ups as high as 31.5x over a baseline GPU implementation and 9.1x over a multithreaded FFTW CPU implementation with AVX vector extensions.
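For readers unfamiliar with what the FFT computes, the following is a minimal Python sketch comparing a direct O(n^2) discrete Fourier transform with a recursive radix-2 Cooley-Tukey FFT. It is an illustration only and not the authors' OpenCL implementation: the function names `naive_dft` and `fft_radix2` are hypothetical, the input length is assumed to be a power of two, and none of the GPU optimizations named in the abstract (register preloading, local-memory transposition, vector access) are modeled here.

```python
import cmath


def naive_dft(x):
    """Direct O(n^2) DFT: X[j] = sum_k x[k] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (illustrative sketch).

    Assumes len(x) is a power of two; raises ValueError otherwise.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Transform the even- and odd-indexed halves independently.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])

    # Combine the halves with one butterfly per output pair.
    half = n // 2
    out = [0j] * n
    for j in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + twiddle
        out[j + half] = even[j] - twiddle
    return out
```

Both functions return the same spectrum up to floating-point rounding; the FFT reaches it in O(n log n) operations by reusing the half-size transforms. The abstract's finding that GPU performance is limited mainly by global memory traffic concerns how the inputs to these butterflies are moved and staged, which this sketch does not attempt to capture.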

Tags: ATI, ATI Radeon HD 6970, ATI Radeon HD 7970, FFT, nVidia, OpenCL, Signal processing, Spectral methods, Tesla C2075, Tesla K20

February 17, 2014 by hgpu

