high performance computing on graphics processing units: hgpu.org


Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs

Joao V. F. Lima, Thierry Gautier, Nicolas Maillard, Vincent Danjean

Grenoble University, France

24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2012

@inproceedings{ferreiralima:hal-00735470,

hal_id={hal-00735470},

url={http://hal.inria.fr/hal-00735470},

title={Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs},

author={Ferreira Lima, Joao Vicente and Gautier, Thierry and Maillard, Nicolas and Danjean, Vincent},

language={English},

affiliation={MOAIS – INRIA Grenoble Rh{\^o}ne-Alpes / LIG Laboratoire d’Informatique de Grenoble, Instituto de Inform{\'a}tica da UFRGS – UFRGS},

booktitle={24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)},

address={Columbia University, New York, United States},

audience={international},

year={2012},

month={Oct},

pdf={http://hal.inria.fr/hal-00735470/PDF/sbac-pad2012.pdf}

}


The race for Exascale computing has naturally led current technologies to converge on multi-CPU/multi-GPU computers, built from thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this computing power, programmers have to solve the problem of scheduling parallel programs on hybrid architectures. Moreover, since GPU performance increases at a much faster rate than the throughput of a PCI bus, the scheduler must manage data transfers efficiently. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations of such platforms, the available software usually computes, before execution, a mapping of the tasks that respects their dependencies and minimizes global data transfers. Such an approach is too rigid: it cannot adapt the execution to variations in the system or in the application’s load. We propose a solution that is orthogonal to the one above: extensions of the XKaapi software stack that make it possible to exploit the full performance of a multi-GPU system through asynchronous GPU tasks. XKaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers and task executions on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain the peak performance (~310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. Our Cholesky factorization, which has more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
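To illustrate why overlapping transfers with execution can matter as much as the scheduling decision itself, here is a minimal timing sketch in Python. It is not XKaapi code and does not reproduce the paper's runtime. The function names `serial_time` and `overlapped_time` are hypothetical, and the model is an assumption: one copy engine, one compute engine, independent tasks, no output transfers, and enough device memory to prefetch the next input.

```python
def serial_time(tasks):
    """Total time when each task copies its input, then computes.

    `tasks` is a list of (copy_time, compute_time) pairs. Nothing
    overlaps, so the durations simply add up.
    """
    return sum(copy + compute for copy, compute in tasks)


def overlapped_time(tasks):
    """Total time for a two-stage pipeline (assumed model, see lead-in).

    Copies are serialized on a single copy engine and kernels on a
    single compute engine, but the copy for the next task may proceed
    while the current task is computing.
    """
    copy_done = 0.0      # time at which the copy engine becomes free
    compute_done = 0.0   # time at which the compute engine becomes free
    for copy, compute in tasks:
        copy_done += copy
        # A kernel starts once its data has arrived and the GPU is idle.
        start = max(copy_done, compute_done)
        compute_done = start + compute
    return compute_done


if __name__ == "__main__":
    # Four hypothetical tasks: 2 time units of transfer, 3 of compute.
    tasks = [(2.0, 3.0)] * 4
    print("serial:    ", serial_time(tasks))
    print("overlapped:", overlapped_time(tasks))
```

In this toy model the four tasks take 20 time units without overlap and 14 with it, because only the first transfer is exposed; the remaining transfers hide behind computation. When transfers dominate (copy time larger than compute time), the pipeline is bound by the copy engine instead, which matches the abstract's remark that PCI bus throughput grows more slowly than GPU performance.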

Tags: Algorithms, Computer science, CUDA, Factorization, Linear Algebra, nVidia, Tesla C2050

September 28, 2012 by hgpu

