high performance computing on graphics processing units: hgpu.org

An Efficient Implementation of Double Precision 1-D FFT for GPUs Using CUDA

Yanjun Liu, Licai Guo, Bin Luo, Xingyi Zhang

School of Computer Science and Technology, Anhui University, Hefei 230039, China

Journal of Information & Computational Science 9(2), 387-394, 2012

The Fast Fourier Transform (FFT) is a well-known and widely used tool in many scientific and engineering fields. CUFFT, NVIDIA's FFT library included in the CUDA toolkit, supports double-precision FFTs. However, the implementation of CUFFT is not very efficient. In this paper, we implement an efficient double-precision Cooley-Tukey algorithm for GPUs using CUDA. Several programming techniques are employed to exploit the hardware characteristics: on-chip shared memory utilization, removal of redundant computation, and coalescing of global memory accesses. Experiments show that our 1-D FFT is as fast as CUFFT. Furthermore, our FFT implementation is more than twice as fast as CUFFT for small input sizes.
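The abstract names the Cooley-Tukey algorithm but this page does not show the authors' kernels. As a rough illustration only, the following CPU-side Python sketch shows an iterative radix-2 Cooley-Tukey FFT in double precision. It is not the paper's CUDA implementation: the function names `fft_radix2` and `dft_naive` are hypothetical, the radix-2 variant is an assumption, and shared memory use and memory coalescing are not modeled. The twiddle-factor table, computed once and reused in every stage, is one generic way to avoid redundant computation and may differ from what the authors do.

```python
import cmath


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time Cooley-Tukey FFT.

    Illustrative sketch only (hypothetical helper, not the paper's kernel).
    The input length must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    # Copy the input as double-precision complex values.
    a = [complex(v) for v in x]

    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # Twiddle factors are computed once and reused by every stage,
    # instead of being recomputed inside each butterfly.
    twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

    size = 2
    while size <= n:
        half = size // 2
        stride = n // size  # step through the shared twiddle table
        for start in range(0, n, size):
            for k in range(half):
                w = twiddles[k * stride]
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
        size *= 2
    return a


def dft_naive(x):
    """O(n^2) reference DFT, used only to check the FFT result."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, 0.5]
    fast = fft_radix2(signal)
    slow = dft_naive(signal)
    # The largest deviation should be at the level of rounding error.
    print(max(abs(p - q) for p, q in zip(fast, slow)))
```

On a GPU, the inner butterfly loop is the part that would typically be spread across threads, which is where the memory-access techniques listed in the abstract become relevant.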

Tags: Algorithms, Computer science, CUDA, FFT, nVidia, nVidia GeForce GTX 260, Programming techniques

February 17, 2012 by hgpu
