A Hybrid GPU/CPU FFT Library for Large FFT Problems

Shuo Chen

University of Delaware

University of Delaware, 2013

Graphics processing units (GPUs) have proven to be a promising platform for accelerating large Fast Fourier Transform (FFT) computations. However, current GPU-based FFT implementations use only the GPU for computation and employ the CPU as a mere memory-transfer controller, so the computing power of today's high-performance CPUs is wasted. In this project, a hybrid optimization framework is proposed that uses both the CPU and the GPU in heterogeneous CPU-GPU systems to compute large-scale 2D and 3D FFTs that exceed GPU memory. This work introduces a flexible partitioning scheme that makes it possible to decompose an FFT across two computing devices with hugely different performance characteristics. The partitioning scheme enables concurrent execution of FFT sub-problems on the CPU and the GPU. Additionally, our approach integrates several FFT decomposition paradigms to tailor the extraction of computation and communication patterns for the CPU and the GPU, and in the process exploits more hidden parallelism than other heterogeneous methods. Our work also adapts automatically to different hardware configurations by tuning for architecture features and for the work distribution between the GPU and the CPU. Several empirical profiling techniques are proposed to characterize the communication and computation of FFT problems on the GPU and the CPU, and we develop effective heuristics to guide the entire empirical tuning process. Our library also overlaps data transfers to achieve higher bandwidth over the PCI bus and, equally importantly, maintains data and layout consistency between the CPU and the GPU. We evaluate our hybrid FFT library from three aspects: optimal load distribution ratios, running time, and precision of results. In particular, the library is compared with the CPU-based libraries FFTW and Intel MKL, as well as with a GPU-based library, on three GPUs: NVIDIA GeForce GTX 480, Tesla C2070 and Tesla C2075. On average, our large FFT library is 121% and 145% faster than the 4-thread SSE-enabled FFTW and the 4-thread SSE-enabled Intel MKL, with maximum speedups of 4.61× and 2.81×, respectively.
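The general idea of splitting a 2D FFT between two devices by a load distribution ratio can be illustrated with a small sketch. It is not the thesis's partitioning scheme: it assumes a plain row-column decomposition, runs both "devices" sequentially on the host as stand-ins, and the names `fft2d_hybrid`, `split_rows`, `balanced_ratio` and `gpu_ratio` are hypothetical. The point is only that the 1D transforms along one dimension are independent, so any share of them can go to either device without changing the result.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def balanced_ratio(gpu_rows_per_sec, cpu_rows_per_sec):
    """Assumed heuristic: the share of rows for the GPU so that both devices
    finish at about the same time, given measured throughputs."""
    return gpu_rows_per_sec / (gpu_rows_per_sec + cpu_rows_per_sec)

def split_rows(n_rows, gpu_ratio):
    """Hypothetical helper: how many rows go to the 'GPU' worker."""
    return max(0, min(n_rows, round(n_rows * gpu_ratio)))

def transform_rows(rows, gpu_ratio):
    """Apply independent 1D FFTs to every row, split between two workers.
    Both parts run on the host here; a real library would run them
    concurrently on different devices."""
    cut = split_rows(len(rows), gpu_ratio)
    gpu_part = [fft1d(r) for r in rows[:cut]]  # stand-in for device work
    cpu_part = [fft1d(r) for r in rows[cut:]]  # stand-in for host work
    return gpu_part + cpu_part

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft2d_hybrid(matrix, gpu_ratio):
    """Row-column 2D FFT: transform rows, transpose, transform rows again."""
    rows_done = transform_rows(matrix, gpu_ratio)
    cols_done = transform_rows(transpose(rows_done), gpu_ratio)
    return transpose(cols_done)
```

For example, `fft2d_hybrid(m, balanced_ratio(3.0, 1.0))` would hand three quarters of the rows to the faster worker. A real hybrid library would additionally have to overlap the transfers with computation and keep the data layout consistent between the two memories, which this sketch leaves out.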

Tags: Computer science, CUDA, FFT, Heterogeneous systems, nVidia, nVidia GeForce GTX 480, Tesla C2070, Tesla C2075, Thesis

November 12, 2013 by hgpu
