high performance computing on graphics processing units: hgpu.org

DFG Implementation on Multi GPU Cluster with Computation-Communication Overlap

Sylvain Huet, Vincent Boulos, Vincent Fristot, Luc Salvo

GIPSA-lab, UMR5216 CNRS/INPG/UJF/U.Stendhal, F-38402 GRENOBLE CEDEX, France

hal-00657536, 2011

Nowadays, computers embed many CPU cores and at least one GPU. Workstations can host several GPU cards, which are well suited to scientific and engineering computations. Such computers are linked through high-bandwidth networks to form clusters for HPC. These machines provide highly parallel multicore architectures while being cost-effective. Moreover, they significantly reduce dissipated power and space requirements compared to classical HPC clusters. Recently, NVIDIA and ATI announced the Tesla and FireStream boards, which deliver more than 500 gigaflops in double precision while dissipating less than 250 W per GPU board. However, the real challenge is to achieve the highest performance on multi-GPU architectures. The programmer has to design architecture-specific code that covers GPU communications and memory management, task scheduling, and synchronization. A high-level abstract programming model is therefore required to express all these operations. In this paper, we propose a design flow that allows an efficient implementation of a digital signal processing (DSP) application, specified as a data flow graph (DFG), on a multi-GPU computer cluster. We focus particularly on the effective implementation of communications by automating the computation-communication overlap. After presenting the related work, we show the benefit of implementing computation-communication overlap on multi-GPU architectures. Then, we present our design flow, which allows an efficient implementation of an algorithm expressed as a DFG on a multi-GPU architecture. Finally, the design flow is applied to a real-world 3D granulometry application developed for materials research.
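The benefit of computation-communication overlap mentioned in the abstract can be illustrated with a simple timing model. The sketch below is not taken from the paper: the function names, the fixed per-chunk costs, and the assumption that a free buffer is always available for the next transfer are all illustrative. It compares processing a stream of data chunks strictly in sequence with a two-stage pipeline in which the transfer of the next chunk runs while the current chunk is being computed.

```python
def sequential_time(n_chunks, t_comm, t_comp):
    """Total time when each chunk is transferred and then processed,
    with no overlap between communication and computation."""
    return n_chunks * (t_comm + t_comp)


def overlapped_time(n_chunks, t_comm, t_comp):
    """Total time for a two-stage pipeline (hypothetical model).

    Transfers run back to back on the link, and the computation on a
    chunk starts as soon as both its transfer and the previous
    computation have finished. A free buffer is assumed to be
    available for every incoming transfer.
    """
    comm_done = 0.0  # time at which the latest transfer finished
    comp_done = 0.0  # time at which the latest computation finished
    for _ in range(n_chunks):
        comm_done += t_comm
        comp_done = max(comp_done, comm_done) + t_comp
    return comp_done


def overlapped_time_closed_form(n_chunks, t_comm, t_comp):
    """Closed form of the same model: one pipeline fill, then the
    slower stage dominates, then one pipeline drain."""
    if n_chunks == 0:
        return 0.0
    return t_comm + (n_chunks - 1) * max(t_comm, t_comp) + t_comp


if __name__ == "__main__":
    n, t_comm, t_comp = 4, 2.0, 3.0
    print("sequential:", sequential_time(n, t_comm, t_comp))
    print("overlapped:", overlapped_time(n, t_comm, t_comp))
```

With 4 chunks, a transfer cost of 2 time units, and a compute cost of 3 time units per chunk, this model gives 20 time units without overlap and 14 with overlap. In the overlapped case only the first transfer and the slower of the two stages remain on the critical path, which is why hiding communication behind computation pays off most when the two costs are of similar size.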

Tags: Algorithms, CUDA, DSP, GPU cluster, nVidia, nVidia GeForce GTX 285, Signal processing

January 13, 2012 by hgpu
