
tcFFT: Accelerating Half-Precision FFT through Tensor Cores

Binrui Li, Shenggan Cheng, James Lin
Shanghai Jiao Tong University, Shanghai, China
arXiv:2104.11471 [cs.DC], 23 Apr 2021

@misc{li2021tcfft,
   title={tcFFT: Accelerating Half-Precision FFT through Tensor Cores},
   author={Binrui Li and Shenggan Cheng and James Lin},
   year={2021},
   eprint={2104.11471},
   archivePrefix={arXiv},
   primaryClass={cs.DC}
}

Fast Fourier Transform (FFT) is an essential tool in scientific and engineering computation. The increasing demand for mixed-precision FFT has made it possible to utilize half-precision floating-point (FP16) arithmetic for faster speed and energy savings. Specializing in lower precision, NVIDIA Tensor Cores can deliver extremely high computational performance. However, their fixed computation pattern makes it hard to exploit this computing power for FFT. We therefore developed tcFFT to accelerate FFT with Tensor Cores. tcFFT supports batched 1D and 2D FFTs of various sizes and exploits a set of optimizations to achieve high performance: 1) single-element manipulation on Tensor Core fragments to support the special operations needed by FFT; 2) a fine-grained data arrangement designed to coordinate with the GPU memory access pattern. We evaluated tcFFT against NVIDIA cuFFT at various sizes and dimensions on NVIDIA V100 and A100 GPUs. The results show that tcFFT outperforms cuFFT by 1.29x-3.24x and 1.10x-3.03x on the two GPUs, respectively. tcFFT has great potential for mixed-precision scientific applications.
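
Tensor Cores expose fixed-size matrix multiply-accumulate operations (e.g. 16x16x16 FP16 tiles through CUDA's WMMA API), and FFT stages can be cast as small matrix multiplies by radix-size DFT matrices, which is the general way such work maps the algorithm onto this hardware. The sketch below is a minimal CUDA illustration, not tcFFT's actual kernel, of the fragment-level single-element manipulation the abstract mentions: after a WMMA multiply, each thread walks the elements of its accumulator fragment and rewrites them individually, the mechanism one would use to apply FFT-specific operations such as twiddle-factor scaling. The kernel name and the uniform scale factor are placeholders.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative only: multiply two 16x16 FP16 tiles on a Tensor Core,
// then modify each accumulator element individually. A placeholder
// uniform scale stands in for an FFT-specific per-element operation.
__global__ void wmma_fragment_demo(const half *a, const half *b, half *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> fc;

    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, __float2half(0.0f));
    wmma::mma_sync(fc, fa, fb, fc);      // fc = fa * fb + fc

    // Single-element manipulation: each thread in the warp owns a few
    // elements of the fragment and can rewrite them one by one.
    for (int i = 0; i < fc.num_elements; ++i)
        fc.x[i] = __hmul(fc.x[i], __float2half(0.5f));  // placeholder scale

    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

// Launch with a full warp, e.g. wmma_fragment_demo<<<1, 32>>>(dA, dB, dC);

WMMA requires Volta-class or newer Tensor Cores, so this compiles with nvcc -arch=sm_70 or above.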