high performance computing on graphics processing units: hgpu.org

Optimized Strategies for Mapping Three-dimensional FFTs onto CUDA GPUs

Jing Wu, Joseph JaJa

Dept. of Electrical and Computer Engineering, and Inst. for Advanced Computer Studies, University of Maryland, College Park, MD

Innovative Parallel Computing (INPAR) Workshop, 2012

@article{wu2012optimized,
  title={Optimized Strategies for Mapping Three-dimensional FFTs onto CUDA GPUs},
  author={Wu, Jing and JaJa, Joseph},
  year={2012}
}

We address in this paper the problem of mapping three-dimensional Fast Fourier Transforms (FFTs) onto the recent, highly multithreaded CUDA Graphics Processing Units (GPUs) and present some of the fastest known algorithms for a wide range of 3-D FFTs on the NVIDIA Tesla and Fermi architectures. We exploit the high degree of multithreading offered by the CUDA environment while carefully managing the multiple levels of the memory hierarchy so that: (i) all global memory accesses are coalesced into 128-byte device memory transactions, issued in an order that optimizes effects related to partition camping [19], locality [22], and associativity; and (ii) all computations are carried out in registers, with effective data movement handled by shared memory transposition. In particular, the number of global memory accesses to the entire 3-D dataset is minimized, and the FFT computations along the X dimension are almost completely overlapped with the global memory data transfers needed to compute the FFTs along the Y or Z dimensions. We achieve performance between 135 GFlops and 172 GFlops on the Tesla architecture (Tesla C1060 and GTX280) and between 192 GFlops and 290 GFlops on the Fermi architecture (Tesla C2050 and GTX480). The bandwidths achieved by our algorithms reach over 90 GB/s on the GTX280 and around 140 GB/s on the GTX480.
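The abstract treats the X, Y, and Z dimensions as separate passes because a 3-D Fourier transform is separable: it can be computed as 1-D transforms along each axis in turn. The following is a minimal CPU sketch of that decomposition in plain Python, not the authors' CUDA implementation. The function names (`dft_1d`, `dft_3d_by_axes`, `dft_3d_direct`) and the `data[z][y][x]` layout are assumptions made for illustration, and a naive O(n^2) DFT stands in for a real FFT kernel.

```python
import cmath


def dft_1d(values):
    """Naive O(n^2) DFT of a sequence; a stand-in for a 1-D FFT kernel."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_3d_by_axes(data):
    """Transform data[z][y][x] using one pass of 1-D transforms per axis."""
    nz, ny, nx = len(data), len(data[0]), len(data[0][0])
    out = [[[complex(v) for v in row] for row in plane] for plane in data]

    # X pass: each row is contiguous in this layout.
    for z in range(nz):
        for y in range(ny):
            out[z][y] = dft_1d(out[z][y])

    # Y pass: elements of one column are a full row apart (strided access).
    for z in range(nz):
        for x in range(nx):
            column = dft_1d([out[z][y][x] for y in range(ny)])
            for y in range(ny):
                out[z][y][x] = column[y]

    # Z pass: elements of one pencil are a full plane apart.
    for y in range(ny):
        for x in range(nx):
            pencil = dft_1d([out[z][y][x] for z in range(nz)])
            for z in range(nz):
                out[z][y][x] = pencil[z]

    return out


def dft_3d_direct(data):
    """Reference result from the triple-sum definition of the 3-D DFT."""
    nz, ny, nx = len(data), len(data[0]), len(data[0][0])
    result = [[[0j] * nx for _ in range(ny)] for _ in range(nz)]
    for kz in range(nz):
        for ky in range(ny):
            for kx in range(nx):
                total = 0j
                for z in range(nz):
                    for y in range(ny):
                        for x in range(nx):
                            phase = kz * z / nz + ky * y / ny + kx * x / nx
                            total += data[z][y][x] * cmath.exp(-2j * cmath.pi * phase)
                result[kz][ky][kx] = total
    return result
```

In this layout only the X pass reads contiguous elements; the Y and Z passes are strided. On a GPU, strided passes are the ones that make coalescing and transposition matter, which is consistent with the abstract's emphasis on memory access patterns over arithmetic.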

Tags: Algorithms, Computer science, CUDA, FFT, nVidia, nVidia GeForce GTX 280, nVidia GeForce GTX 480, Performance, Tesla C1060, Tesla C2050

March 30, 2012 by hgpu
