high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Automatic Generation of FFT Libraries for GPU Platforms

Automatic Generation of FFT Libraries for GPU Platforms

Christos Angelopoulos

Carnegie Mellon University, Pittsburgh, PA

Carnegie Mellon University, 2012

BibTeX

Download (PDF)

View

Source

1867

views

Compilers introduce a set of optimizations to speed-up source code. However due to the variety of computation platforms, algorithm complexity and problem sizes, general purpose compilers can fail to improve performance. The burden on library developers increases significantly to write optimized libraries since the user code relies on them for performance. This argument strengthens the case for automatic performance tuning frameworks like SPIRAL, a program generator and optimizer for linear transforms such as the Discrete Fourier Transform (DFT). In this thesis we present the challenges and difficulties of implementing libraries on GPUs and provide an FFT Compiler framework based on SPIRAL supporting GPU platforms. We model all the necessary GPU parameters and implemented a rewriting system and a backend that generates optimized code. Our results are comparable with NVIDIA’s cuFFT hand-written library. Today the majority of work on parallelizing compilers targets mainly large applications on multicore processors designed for coarse grain parallelism. The resulting compilers are quite successful and provide good performance scaling for relative simple programs running on systems with a small number of cores. However, they cannot achieve high throughput and streaming speed-up for complicated libraries targeting a large number of cores demanding fine-grained parallelism. The burden and difficulty of writing high performance software for developers, has increased dramatically. The programmer now has to use multiple threads, vector instruction sets, and tune the code to the memory hierarchy. This process is platform-specific and performance does not port easily. If the developer fails to apply these these optimizations, it may result in dramatic performance losses as shown in Figure 1. This Figure shows the performance in Gflop/s (giga floating point operations per second) for two implementations of the discrete Fourier transform (DFT) on NVIDIA’s GTX480 Fermi GPU. The bottom line shows a straightforward naive implementation based on typical compiler and loop optimizations. The fastest code has been generated by SPIRAL [3], an Automatic Program Generation System. the performance difference between them is a factor of 12. In this work we explain the main reasons for the achieved performance on GPUs. Our contributions include the development of novel algorithms that match the architectural constraints imposed by GPU platforms. We provide algorithmic optimizations for all layers of memory hierarchy on GPU platforms. Our Local Memory FFT kernel algorithms are optimized for fast local memory access without stalls or trading off space for speed. Our Main Memory FFT algorithms tackle the issue with DRAM main memory hierarchy. We provide fast algorithmic implementations that handle small and large stride permutation lengths. Our library is also optimized for maximum vectorization and parallelism across all multiprocessors.

Tags: Algorithms, Computer science, CUDA, FFT, nVidia, nVidia GeForce 8800 GTX, nVidia GeForce GTX 260, nVidia GeForce GTX 280, nVidia GeForce GTX 480, Thesis

July 22, 2013 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Automatic Generation of FFT Libraries for GPU Platforms

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Automatic Generation of FFT Libraries for GPU Platforms

Share this:

Recent source codes

Most viewed papers (last 30 days)