Large-scale FFT on GPU clusters

Yifeng Chen, Xiang Cui, Hong Mei

HCST Key Lab at School of EECS, Peking University, Beijing 100871, China

ICS ’10 Proceedings of the 24th ACM International Conference on Supercomputing

DOI:10.1145/1810085.1810128

@conference{chen2010large,
  title={Large-scale FFT on GPU clusters},
  author={Chen, Y. and Cui, X. and Mei, H.},
  booktitle={Proceedings of the 24th ACM International Conference on Supercomputing},
  pages={315--324},
  year={2010},
  organization={ACM}
}

A GPU cluster is a cluster equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g., matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g., finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate, because the bottleneck often lies in the PCI bus between main memory and GPU device memory or in the communication network between workstation nodes. Consequently, optimizing FFT performance on a single GPU device alone will not improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance: first, the use of GPU devices improves the sustained memory bandwidth for processing large data sets; second, GPU device memory allows larger subtasks to be processed whole and hence reduces repeated data transfers between memory and processors; and finally, some costly main-memory operations, such as matrix transposition, can be sped up significantly by GPUs if the necessary data adjustment is performed during data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors (together with the improved communication library in our implementation) contribute to a 24.3x speedup with respect to FFTW and a 7x speedup with respect to Intel MKL for 4096 3D single-precision FFT on a 16-node cluster with 32 GPUs. A speedup of around 5x with respect to both standard libraries is achieved for double precision.
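The idea of adjusting array dimensions while data is being moved can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists in place of host and device memory, works in 2D rather than 3D for brevity, and every function name (`fft1d`, `transfer_transposed`, `fft_rows`, `fft2d_via_transfers`, `dft2d_direct`) is hypothetical. It only shows that a multi-dimensional FFT can be built from FFTs along the contiguous dimension if each tile-by-tile copy writes its data at the transposed position, so that no separate in-memory transposition pass is needed.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transfer_transposed(src, rows, cols, tile):
    """Copy a row-major rows x cols array tile by tile into a new buffer,
    writing each element at its transposed position (result is cols x rows).
    The copy loop stands in for a host/device transfer (an assumption)."""
    dst = [0j] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    dst[c * rows + r] = src[r * cols + c]
    return dst


def fft_rows(buf, rows, cols):
    """Apply a 1D FFT to every contiguous row of a row-major array."""
    out = []
    for r in range(rows):
        out.extend(fft1d(buf[r * cols:(r + 1) * cols]))
    return out


def fft2d_via_transfers(data, rows, cols, tile=2):
    """2D FFT using only contiguous 1D FFTs plus transposing transfers."""
    step1 = fft_rows(data, rows, cols)                    # FFT along columns index
    step2 = transfer_transposed(step1, rows, cols, tile)  # now cols x rows
    step3 = fft_rows(step2, cols, rows)                   # FFT along original rows index
    return transfer_transposed(step3, cols, rows, tile)   # back to rows x cols


def dft2d_direct(data, rows, cols):
    """Slow reference 2D DFT, used only to check the sketch."""
    out = [0j] * (rows * cols)
    for u in range(rows):
        for v in range(cols):
            s = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = -2 * cmath.pi * (u * r / rows + v * c / cols)
                    s += data[r * cols + c] * cmath.exp(1j * angle)
            out[u * cols + v] = s
    return out
```

Calling `fft2d_via_transfers(data, 4, 8)` on a flat list of 32 complex values should agree with `dft2d_direct(data, 4, 8)` up to floating-point rounding. On real hardware the tile copy would presumably be a strided transfer between main memory and device memory, which is where the abstract suggests the saving comes from.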

Tags: Algorithms, Computer science, CUDA, FFT, nVidia, nVidia GeForce GTX 285, Programming techniques, Tesla C1060

December 18, 2010 by hgpu
