
Scaling Performance of FFT Computation on an Industrial Integrated GPU Co-processor: Experiments with Algorithm Adaptation

Mohamed Amine Bergach, Serge Tissot, Michel Syska, Robert De Simone
Inria
Electronic Chips & Systems Design Initiative (ECSI), 2014

@inproceedings{bergach_scaling_2014,
  address   = {Dresden, Germany},
  title     = {Scaling Performance of {FFT} Computation on an Industrial Integrated {GPU} Co-processor: Experiments with Algorithm Adaptation},
  abstract  = {Recent Intel processors ({IvyBridge}, Haswell) contain an embedded on-chip {GPU} unit, in addition to the main {CPU} processor. In this work we consider the issue of efficiently mapping Fast Fourier Transform computation onto such coprocessor units. To achieve this we pursue three goals: First, we want to study half-systematic ways to adjust the actual variant of the {FFT} algorithm, for a given size, to best fit the local memory capacity (the registers of a given {GPU} block) and perform computations without intermediate calls to distant memory; Second, we want to study, by extensive experimentation, whether the remaining data transfers between memories (initial loads and final stores after each {FFT} computation) can be sustained by local interconnects at a speed matching the integrated {GPU} computations, or conversely if they have a negative impact on performance when computing {FFTs} on {GPUs} ``at full blast''; Third, we want to record the energy consumption as observed in the previous experiments, and compare it to similar {FFT} implementations on the {CPU} side of the chip. We report our work in this short paper and its companion poster, showing graphical results on a range of experiments. In broad terms, our findings are that {GPUs} can compute {FFTs} of a typical size faster than internal on-chip interconnects can provide them with data (by a factor of roughly 2), and that energy consumption is far smaller than on the {CPU} side.},
  publisher = {ECSI},
  author    = {Bergach, Mohamed Amine and Tissot, Serge and Syska, Michel and De Simone, Robert},
  month     = mar,
  year      = {2014}
}


Recent Intel processors (IvyBridge, Haswell) contain an embedded on-chip GPU unit, in addition to the main CPU processor. In this work we consider the issue of efficiently mapping Fast Fourier Transform computation onto such coprocessor units. To achieve this we pursue three goals:

First, we want to study half-systematic ways to adjust the actual variant of the FFT algorithm, for a given size, to best fit the local memory capacity (the registers of a given GPU block) and perform computations without intermediate calls to distant memory;

Second, we want to study, by extensive experimentation, whether the remaining data transfers between memories (initial loads and final stores after each FFT computation) can be sustained by local interconnects at a speed matching the integrated GPU computations, or conversely if they have a negative impact on performance when computing FFTs on GPUs "at full blast";

Third, we want to record the energy consumption as observed in the previous experiments, and compare it to similar FFT implementations on the CPU side of the chip.
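
To make the first goal a bit more concrete, the following is a minimal, self-contained sketch in plain C of the general idea; it is our illustration under stated assumptions, not the authors' implementation, and MAX_LOCAL is a hypothetical per-block register budget. The largest power-of-two transform that fits the budget is selected, the input is loaded once, all butterfly passes operate on the local copy, and the result is stored once at the end (compile with -lm):

/* Sketch of the "fit the FFT to local storage" idea from the first goal
 * above; our illustration, not the authors' code. MAX_LOCAL is a made-up
 * per-block budget (in complex values) standing in for the registers of a
 * GPU block: data are loaded once, every butterfly pass works on the local
 * copy, and results are stored once at the end. */
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define MAX_LOCAL 64  /* assumed local budget, in complex values */

static const double PI = 3.14159265358979323846;

/* Largest power-of-two FFT size that fits the local budget. */
static unsigned fit_fft_size(unsigned budget) {
    unsigned n = 1;
    while (2 * n <= budget) n *= 2;
    return n;
}

/* In-place radix-2 decimation-in-time FFT on a local buffer of length n (power of two). */
static void fft_local(double complex *x, unsigned n) {
    /* Bit-reversal permutation. */
    for (unsigned i = 1, j = 0; i < n; i++) {
        unsigned bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    /* log2(n) butterfly passes, all on data already held locally. */
    for (unsigned len = 2; len <= n; len <<= 1) {
        double complex wl = cexp(-2.0 * I * PI / (double)len);
        for (unsigned i = 0; i < n; i += len) {
            double complex w = 1.0;
            for (unsigned k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = x[i + k + len / 2] * w;
                x[i + k]           = u + v;
                x[i + k + len / 2] = u - v;
                w *= wl;
            }
        }
    }
}

int main(void) {
    unsigned n = fit_fft_size(MAX_LOCAL);         /* e.g. a 64-point FFT */
    double complex buf[MAX_LOCAL];

    for (unsigned i = 0; i < n; i++)              /* single load from "distant" memory */
        buf[i] = cos(2.0 * PI * 3.0 * i / n);     /* 3-cycle test tone */

    fft_local(buf, n);                            /* all work on local data */

    for (unsigned i = 0; i < n; i++)              /* single store / readout */
        if (cabs(buf[i]) > 1e-6)
            printf("bin %2u: |X| = %.2f\n", i, cabs(buf[i]));
    return 0;
}

On an actual integrated GPU the local buffer would be the register file of a work-group, and a kernel would typically use a higher radix (4, 8 or 16) to reduce the number of passes, but the sizing argument is the same.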

We report our work in this short paper and its companion poster, showing graphical results on a range of experiments. In broad terms, our findings are that GPUs can compute FFTs of a typical size faster than internal on-chip interconnects can provide them with data (by a factor of roughly 2), and that energy consumption is far smaller than on the CPU side.
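
On the third goal, the abstract does not say how energy was recorded. One common possibility on recent Intel chips, given here only as an assumption and not as the authors' setup, is the RAPL energy counters exposed through the Linux powercap sysfs interface. The sketch below reads the cumulative package-energy counter before and after a workload; the sysfs path and the available subdomains (core, uncore/graphics) vary between systems, and reading the counter may require elevated privileges:

/* Hedged sketch: measuring the energy of an FFT run via Intel RAPL counters
 * (Linux powercap sysfs). This is an assumption about one possible setup, not
 * necessarily the methodology used in the paper. */
#include <stdio.h>

/* Read the cumulative package-energy counter, in microjoules.
 * The path is typical but may differ; check the `name` files under
 * /sys/class/powercap/ to locate the package domain and its subdomains. */
static long long read_energy_uj(void) {
    const char *path = "/sys/class/powercap/intel-rapl:0/energy_uj";
    FILE *f = fopen(path, "r");
    long long uj = -1;
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1) uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void) {
    long long before = read_energy_uj();
    /* ... run the FFT workload under test here (CPU or GPU variant) ... */
    long long after = read_energy_uj();
    if (before >= 0 && after >= before)
        printf("energy consumed: %.3f J\n", (after - before) / 1e6);
    else
        printf("RAPL reading unavailable (or counter wrapped); skipping\n");
    return 0;
}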

