Use of Multi-GPU Systems for Larger Than Device FFTs: With Applications in Ultrasound Simulations

Nimalan Nandapalan
Australian National University
Australian National University, 2013







Ultrasound simulations are applications that are both computationally and communication intensive. With better performance, such simulations can be used to design new ultrasound probes, develop better signal processing techniques, train new ultrasonographers, and plan treatments, among many other uses [11]. The pseudo-spectral technique can be used effectively to express the wave-propagation model used in these simulations, and is characterised by its use of the Fast Fourier Transform (FFT). The FFT can account for over half of the time spent by ultrasound simulations, with the remainder consisting of embarrassingly parallel arithmetic [28]. The use of Graphics Processing Units (GPUs) for general-purpose computations such as the FFT has become ubiquitous owing to their favourable performance. Meanwhile, Central Processing Unit (CPU) design has shifted from single-core to multi-core processors, which are now also assembled into multi-socket configurations. Since GPUs are already massively multi-core processors, typically with three or four times as many cores, the question remains: will GPUs follow a similar trend and incorporate multiple devices in individual sockets? The purpose of this thesis is to assess the viability of multi-GPU systems for ultrasound simulations, in terms of cost and performance, compared with other system designs that offer similar computational resources. Current machine hardware can already support multiple GPUs as peripheral devices and offers a glimpse of the potential of future machines; however, relatively little work has been reported on the use of such systems for ultrasound simulations and the FFT algorithm. To address this gap, we benchmark and model the device-to-device communication potential of an existing multi-GPU system. Four communication methods are considered: via the CPU, pointer swapping, hybrid-staged, and kernel-based.
The results reveal that the pointer-swapping and kernel-based methods of managing communication can be up to twice as efficient as the other methods. The communication methods identified in the benchmarks then form the basis of several important generic communication functions, which are in turn used to implement the distributed 3D FFT algorithm required by the ultrasound simulation. With four GPUs, the multi-GPU distributed 3D FFT was found to be up to 18% faster than an existing FFT implementation on a six-core CPU. This multi-GPU distributed 3D FFT implementation is then used in an ultrasound simulation as a proof-of-concept case study of the thesis. By overlapping communication and computation across the CPU and GPU resources, a speed-up of 8% is observed.
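The abstract does not say how the 3D FFT is decomposed across GPUs. The sketch below assumes a common slab decomposition to show where inter-device communication enters a distributed 3D FFT: each device transforms the two axes it holds in full, an all-to-all exchange (presumably the step the generic communication functions would serve) redistributes the data, and a final 1D FFT runs along the axis that has become local. The function name `distributed_fft3d` is hypothetical, and the "devices" are plain NumPy arrays with the exchange emulated by in-memory slicing, so this is an illustration of the technique rather than the thesis's implementation.

```python
import numpy as np


def distributed_fft3d(x, num_devices):
    """Hypothetical sketch of a slab-decomposed 3D FFT over `num_devices` devices.

    Each "device" is emulated by a separate NumPy array; the all-to-all
    exchange is done with in-memory slicing rather than real GPU transfers.
    """
    nx, ny, _ = x.shape
    if nx % num_devices or ny % num_devices:
        raise ValueError("nx and ny must be divisible by num_devices")

    # 1. Decompose along x: device i holds a slab of shape (nx/P, ny, nz).
    slabs = np.split(x, num_devices, axis=0)

    # 2. Each device transforms the two axes it holds in full (y and z).
    slabs = [np.fft.fftn(s, axes=(1, 2)) for s in slabs]

    # 3. All-to-all exchange: every slab is cut into P blocks along y, and
    #    device d collects block d from every slab, stacking them along x.
    #    This is the step that needs device-to-device communication.
    blocks = [np.split(s, num_devices, axis=1) for s in slabs]
    y_slabs = [np.concatenate([b[d] for b in blocks], axis=0)
               for d in range(num_devices)]  # each has shape (nx, ny/P, nz)

    # 4. The x axis is now local to each device, so finish with a 1D FFT along it.
    y_slabs = [np.fft.fft(s, axis=0) for s in y_slabs]

    # Gather for checking; a real code would keep the result distributed.
    return np.concatenate(y_slabs, axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
print(np.allclose(distributed_fft3d(x, 4), np.fft.fftn(x)))  # True
```

Because the multidimensional FFT is separable, transforming y and z first and x last gives the same result as a direct 3D FFT. Under this assumed decomposition only one global exchange is needed, at the cost of limiting the number of devices to at most min(nx, ny).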