@inproceedings{bergach2014scaling,

address={Dresden, Germany},

title={Scaling Performance of {FFT} Computation on an Industrial Integrated {GPU} Co-processor: Experiments with Algorithm Adaptation},

abstract={Recent Intel processors ({Ivy Bridge}, Haswell) contain an embedded on-chip {GPU} unit, in addition to the main {CPU} processor. In this work we consider the issue of efficiently mapping Fast Fourier Transform computation onto such co-processor units. To achieve this we pursue three goals:

First, we want to study half-systematic ways to adjust the actual variant of the {FFT} algorithm, for a given size, to best fit the local memory capacity (the registers of a given {GPU} block) and perform computations without intermediate calls to distant memory;

Second, we want to study, by extensive experimentation, whether the remaining data transfers between memories (initial loads and final stores after each {FFT} computation) can be sustained by local interconnects at a speed matching the integrated {GPU} computations, or conversely if they have a negative impact on performance when computing {FFTs} on {GPUs} "at full blast";

Third, we want to record the energy consumption as observed in the previous experiments, and compare it to similar {FFT} implementations on the {CPU} side of the chip.

We report our work in this short paper and its companion poster, showing graphical results on a range of experiments. In broad terms, our findings are that {GPUs} can compute {FFTs} of a typical size faster than internal on-chip interconnects can provide them with data (by a factor of roughly 2), and that energy consumption is far smaller than on the {CPU} side.},

publisher={ECSI},

author={Bergach, Mohamed Amine and Tissot, Serge and Syska, Michel and De Simone, Robert},

month={mar},

year={2014}

}
