StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

Cedric Augonnet, Olivier Aumage, Nathalie Furmento, Samuel Thibault, Raymond Namyst

RUNTIME (INRIA Bordeaux – Sud-Ouest), INRIA – CNRS: UMR5800 – Université de Bordeaux

hal-00992208 (16 May 2014)

@techreport{augonnet:hal-00992208,

hal_id={hal-00992208},

url={http://hal.inria.fr/hal-00992208},

title={StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators},

author={Augonnet, C{\'e}dric and Aumage, Olivier and Furmento, Nathalie and Thibault, Samuel and Namyst, Raymond},

language={Anglais},

affiliation={RUNTIME – INRIA Bordeaux – Sud-Ouest , Laboratoire Bordelais de Recherche en Informatique – LaBRI},

type={Rapport de recherche},

institution={INRIA},

number={RR-8538},

year={2014},

month={May},

pdf={http://hal.inria.fr/hal-00992208/PDF/RR-8538.pdf}

}

GPUs are now widespread in HPC clusters, as shown by the top entries of the latest Top500 list. Exploiting such machines is however very challenging, not only because it requires combining two separate paradigms, MPI and CUDA or OpenCL, but also because nodes are heterogeneous and thus require careful load balancing within each node. Current approaches are usually limited to offloading parts of the computation while leaving CPUs idle, or they require static work partitioning between CPUs and GPUs. To handle single-node architecture heterogeneity, we have previously proposed StarPU, a runtime system capable of dynamically scheduling tasks in an optimized way on such machines. We show here how the task paradigm of StarPU has been combined with MPI communications, and how we extended the task paradigm itself to allow mapping the task graph onto MPI clusters so as to automatically achieve an optimized distributed execution. We show how a sequential-like Cholesky source code can be easily extended into a scalable distributed parallel execution, which already exhibits a speedup of 5 on 6 nodes.
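To make the idea of mapping a task graph onto MPI nodes more concrete, here is a minimal sketch in plain Python. It does not use StarPU or its API. It assumes a 2D block-cyclic tile distribution and an "owner computes" rule (each task runs on the rank that owns the tile it writes); both are illustrative assumptions, not details taken from the abstract. The function names (`owner`, `cholesky_tasks`, `map_tasks`) are hypothetical.

```python
from collections import Counter


def owner(i, j, p, q):
    """Rank owning tile (i, j) under an assumed p x q 2D block-cyclic layout."""
    return (i % p) * q + (j % q)


def cholesky_tasks(nt):
    """Yield (kernel, written_tile, read_tiles) in sequential submission order
    for a tiled Cholesky factorization of an nt x nt tile matrix (lower part)."""
    for k in range(nt):
        yield "POTRF", (k, k), []
        for i in range(k + 1, nt):
            yield "TRSM", (i, k), [(k, k)]
        for i in range(k + 1, nt):
            yield "SYRK", (i, i), [(i, k)]
            for j in range(k + 1, i):
                yield "GEMM", (i, j), [(i, k), (j, k)]


def map_tasks(nt, p, q):
    """Place each task on the owner of the tile it writes (assumed rule).

    Returns the number of tasks per rank and the number of remote tile reads.
    The read count is an upper bound on transfers, since a real runtime
    could cache a tile it has already received.
    """
    placement = Counter()
    remote_reads = 0
    for _kernel, written, reads in cholesky_tasks(nt):
        rank = owner(*written, p, q)
        placement[rank] += 1
        remote_reads += sum(1 for t in reads if owner(*t, p, q) != rank)
    return placement, remote_reads
```

For example, `map_tasks(nt=8, p=2, q=3)` would distribute the tasks over six ranks. The loop nest in `cholesky_tasks` stays sequential in style; only the placement rule decides where each task runs and which tiles have to be communicated.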

Tags: Computer science, CUDA, Heterogeneous systems, MPI, nVidia, OpenCL, Tesla M2070

May 18, 2014 by hgpu
