DFG Implementation on Multi GPU Cluster with Computation-Communication Overlap

Sylvain Huet, Vincent Boulos, Vincent Fristot, Luc Salvo
GIPSA-lab, UMR5216 CNRS/INPG/UJF/U.Stendhal, F-38402 GRENOBLE CEDEX, France
hal-00657536, 2011

@inproceedings{HUET:2011:HAL-00657536:1,
   hal_id={hal-00657536},
   url={http://hal.archives-ouvertes.fr/hal-00657536/en/},
   title={DFG implementation on multi GPU cluster with computation-communication overlap},
   author={Huet, Sylvain and Boulos, Vincent and Fristot, Vincent and Salvo, Luc},
   language={English},
   affiliation={Grenoble Images Parole Signal Automatique -- GIPSA-lab, Science et Ing{\'e}nierie des Mat{\'e}riaux et Proc{\'e}d{\'e}s -- SIMAP},
   booktitle={Design and Architectures for Signal and Image Processing},
   address={Tampere, Finland},
   audience={international},
   year={2011},
   month={Nov},
   pdf={http://hal.archives-ouvertes.fr/hal-00657536/PDF/DFG_implementation_on_multi_GPU_cluster_with_computation-communication_overlap_.pdf}
}

Nowadays, computers embed many CPUs and at least one GPU. Workstations can host several GPU cards, which are well suited to scientific and engineering computations. Such computers are linked through high-bandwidth networks to form HPC clusters. These machines provide highly parallel multicore architectures while being cost-effective. Moreover, they significantly reduce power dissipation and space requirements compared to classical HPC clusters. NVIDIA and ATI recently announced their Tesla and FireStream boards, which deliver more than 500 gigaflops in double precision while dissipating less than 250 W per GPU board. However, the real challenge is to achieve the highest performance on multi-GPU architectures. The programmer has to design architecture-specific code that handles GPU communications and memory management, task scheduling, and synchronization. A high-level abstract programming model is therefore required to express all these operations. In this paper, we propose a design flow allowing an efficient implementation of a DSP application specified as a data flow graph (DFG) on a multi-GPU computer cluster. We focus particularly on the effective implementation of communications by automating the computation-communication overlap. After presenting the related work, we show the benefit of implementing communication-computation overlap on multi-GPU architectures. Then, we present our design flow, which allows an efficient implementation of an algorithm expressed as a DFG on a multi-GPU architecture. Finally, we apply it to a real-world 3D granulometry application developed for materials research.
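
To give a rough sense of why computation-communication overlap matters, the sketch below compares the total processing time of a stream of data blocks with and without overlap. It is an idealized timing model, not the authors' design flow: the function names, the constant per-block transfer time `t_comm`, the constant per-block compute time `t_comp`, and the assumption that a transfer never waits for a free buffer are all illustrative choices, and result transfers back to the host are ignored.

```python
def serial_time(n_blocks, t_comm, t_comp):
    """Total time when each block is transferred, then processed, with no overlap."""
    return n_blocks * (t_comm + t_comp)


def overlapped_time(n_blocks, t_comm, t_comp):
    """Total time when the transfer of the next block overlaps the current computation.

    Assumption (hypothetical model): transfers run back to back on one link,
    computations run back to back on one GPU, and a block can be computed
    only once its transfer has finished.
    """
    comm_done = 0.0  # time at which the latest transfer finished
    comp_done = 0.0  # time at which the latest computation finished
    for _ in range(n_blocks):
        comm_done += t_comm
        # The GPU starts this block when it is free AND the data has arrived.
        comp_done = max(comp_done, comm_done) + t_comp
    return comp_done


if __name__ == "__main__":
    n, t_comm, t_comp = 4, 2.0, 3.0  # arbitrary illustrative values
    print("no overlap:  ", serial_time(n, t_comm, t_comp))
    print("with overlap:", overlapped_time(n, t_comm, t_comp))
```

Under these assumptions the overlapped time reduces to `min(t_comm, t_comp) + n_blocks * max(t_comm, t_comp)`, so the slower of the two stages sets the throughput and the faster one is almost entirely hidden.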