high performance computing on graphics processing units: hgpu.org

Auto-tuning 3-D FFT library for CUDA GPUs

Akira Nukada, Satoshi Matsuoka

Tokyo Institute of Technology and Japan Science and Technology Agency, CREST

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, 2009

DOI:10.1145/1654059.1654090

Existing FFT implementations on GPUs are optimized for specific transform sizes, such as powers of two, and exhibit unstable, peaky performance; that is, they do not perform as well at other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA alleviates this problem by generating high-performance CUDA kernels for FFTs of varying transform sizes. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first time it has been applied comprehensively to bandwidth-intensive, complex kernels such as 3-D FFTs. Bandwidth-oriented optimizations, such as selecting the number of threads and inserting padding to avoid bank conflicts in shared memory, are applied systematically. The resulting auto-tuner is fast and yields performance that beats essentially all 3-D FFT implementations on a single processor to date, while remaining stable irrespective of problem size or the underlying GPU hardware.
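
To illustrate the padding idea mentioned in the abstract, here is a minimal Python sketch of a shared-memory bank model together with an exhaustive search over padding values. It is not the authors' auto-tuner: the function names `conflict_degree` and `best_padding`, the bank count of 32, and the simple word-interleaved bank mapping are all assumptions made for illustration (older CUDA GPUs used 16 banks).

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 word-wide banks; older GPUs had 16


def conflict_degree(row_len, num_threads=32, num_banks=NUM_BANKS):
    """Worst-case number of threads that map to the same bank when
    thread t reads word index t * row_len (a column walk through a
    2-D tile stored row by row). A degree of 1 means no conflict."""
    banks = Counter((t * row_len) % num_banks for t in range(num_threads))
    return max(banks.values())


def best_padding(width, max_pad=4, **kwargs):
    """Try paddings 0..max_pad (hypothetical search range) and return
    (pad, degree) with the lowest conflict degree, preferring the
    smallest padding on ties."""
    candidates = (
        (pad, conflict_degree(width + pad, **kwargs))
        for pad in range(max_pad + 1)
    )
    return min(candidates, key=lambda pc: (pc[1], pc[0]))


if __name__ == "__main__":
    print(conflict_degree(32))  # unpadded tile, every thread hits bank 0
    print(conflict_degree(33))  # one padding word per row
    print(best_padding(32))
```

In this model an unpadded row length of 32 sends every thread to the same bank, while a single padding word per row spreads the accesses across all banks. A real auto-tuner would time candidate kernels on the target GPU rather than rely on a model like this one.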

Tags: Algorithms, Computer science, CUDA, FFT, Mathematical Software, nVidia, Performance

August 22, 2011 by hgpu
