high performance computing on graphics processing units: hgpu.org


Towards a Performance-Portable FFT Library for Heterogeneous Computing

Carlo del Mundo, Wu-chun Feng

Department of Electrical & Computer Engineering, NSF Center for High-Performance Reconfigurable Computing, Virginia Tech, Blacksburg, VA, USA

Virginia Tech, 2014

@article{del2014towards,
  title={Towards a Performance-Portable FFT Library for Heterogeneous Computing},
  author={del Mundo, Carlo and Feng, Wu-chun},
  year={2014}
}


The fast Fourier transform (FFT), a spectral method that computes the discrete Fourier transform and its inverse, pervades many applications in digital signal processing, such as imaging, tomography, and software-defined radio. Its importance has led the research community to expend significant resources on accelerating the FFT, of which FFTW is the most prominent example. With the emergence of the graphics processing unit (GPU) as a massively parallel computing device for high performance, we seek to identify architecture-aware optimizations across two different generations of high-end AMD and NVIDIA GPUs, namely the AMD Radeon HD 6970 and HD 7970 and the NVIDIA Tesla C2075 and K20c, respectively. Despite architectural differences across GPU generations and vendors, we identified a homogeneous set of optimizations as being most effective in accelerating FFT performance: (1) register preloading, (2) transposition via local memory, and (3) 8- or 16-byte vector access and scalar arithmetic. We then demonstrate the efficacy of combining certain optimizations in concert, with register preloading, transposition via local memory, and use of constant memory being the most effective combination for all architectures. Our study suggests that the performance of FFTs on graphics processors is primarily limited by global memory data transfer. Overall, our optimizations deliver speed-ups as high as 31.5x over a baseline GPU implementation and 9.1x over a multithreaded FFTW CPU implementation with AVX vector extensions.
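As a point of reference for the first sentence of the abstract, the following is a minimal sketch of how an FFT computes the same result as the discrete Fourier transform, and how the inverse recovers the input. It is a textbook recursive radix-2 Cooley-Tukey formulation in plain Python, not the authors' OpenCL implementation, and it includes none of the GPU optimizations listed above. The function names dft, fft, and ifft are illustrative choices, and the input length is assumed to be a power of two.

```python
import cmath


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used only as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")
    if n == 1:
        return list(x)
    # Split into even- and odd-indexed halves and transform each recursively.
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd half, then one butterfly.
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse transform: conjugate twiddles plus 1/n normalization."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]
```

Calling fft on a length-8 list and comparing against dft on the same list should agree to within floating-point rounding, and ifft(fft(x)) should return the original samples. The FFT needs on the order of n log n operations rather than n^2, which is one reason memory traffic, rather than arithmetic, can become the limiting factor on a GPU.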

Tags: ATI, ATI Radeon HD 6970, ATI Radeon HD 7970, FFT, nVidia, OpenCL, Signal processing, Spectral methods, Tesla C2075, Tesla K20

February 17, 2014 by hgpu
