
XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

Thierry Gautier, Joao V. F. Lima, Nicolas Maillard, Bruno Raffin
Grenoble University, France
27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013

@inproceedings{gautier2013xkaapi,
   title={XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures},
   author={Gautier, Thierry and Lima, Joao V. F. and Maillard, Nicolas and Raffin, Bruno},
   booktitle={27th IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
   year={2013}
}

Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators such as GPUs. Programming such nodes typically combines OpenMP with CUDA/OpenCL code, and scheduling relies on static partitioning and cost models. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is twofold. First, fine-grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work stealing strategy that includes locality information, a very lightweight task implementation in XKaapi, and an optimized search for ready tasks. Second, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi leads to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
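As a concrete illustration of the data-flow task model applied to POTRF, the sketch below decomposes a tiled Cholesky factorization into the per-tile POTRF/TRSM/GEMM (SYRK) tasks that such a runtime would schedule. This is a minimal, sequential stand-in written in plain C++, not the XKaapi API: the tile kernels are naive replacements for BLAS/cuBLAS calls, and the R/RW comments mark the access-mode annotations from which a data-flow runtime like XKaapi infers dependencies between tasks.

```cpp
// Minimal sequential sketch of tiled Cholesky (POTRF), decomposed into the
// per-tile tasks a data-flow runtime would schedule.  R = read, RW =
// read-write access modes; dependencies follow from these tile accesses.
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int B = 4;                   // tile size (B x B, row-major)
using Tile = std::vector<double>;      // one B x B tile

// POTRF task: in-place Cholesky of a diagonal tile.  RW(Akk)
void potrf(Tile& a) {
    for (int j = 0; j < B; ++j) {
        for (int k = 0; k < j; ++k) a[j*B+j] -= a[j*B+k] * a[j*B+k];
        a[j*B+j] = std::sqrt(a[j*B+j]);
        for (int i = j + 1; i < B; ++i) {
            for (int k = 0; k < j; ++k) a[i*B+j] -= a[i*B+k] * a[j*B+k];
            a[i*B+j] /= a[j*B+j];
        }
    }
}

// TRSM task: Aik := Aik * Lkk^{-T} by forward substitution.  R(Lkk), RW(Aik)
void trsm(const Tile& lkk, Tile& aik) {
    for (int r = 0; r < B; ++r)
        for (int j = 0; j < B; ++j) {
            double s = aik[r*B+j];
            for (int k = 0; k < j; ++k) s -= aik[r*B+k] * lkk[j*B+k];
            aik[r*B+j] = s / lkk[j*B+j];
        }
}

// GEMM/SYRK task: Cij -= Aik * Ajk^T.  R(Aik), R(Ajk), RW(Cij)
void gemm(const Tile& aik, const Tile& ajk, Tile& cij) {
    for (int i = 0; i < B; ++i)
        for (int j = 0; j < B; ++j)
            for (int k = 0; k < B; ++k)
                cij[i*B+j] -= aik[i*B+k] * ajk[j*B+k];
}

int main() {
    const int T = 3;                   // T x T grid of tiles
    const int n = T * B;
    // Build a symmetric, diagonally dominant (hence SPD) test matrix.
    std::vector<Tile> A(T * T, Tile(B * B, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[(i/B)*T + j/B][(i%B)*B + j%B] = (i == j) ? n + 1.0 : 1.0;

    // Right-looking tiled Cholesky: each call below is one task; a runtime
    // such as XKaapi would run independent tasks concurrently on CPUs/GPUs.
    for (int k = 0; k < T; ++k) {
        potrf(A[k*T + k]);                                 // RW A[k][k]
        for (int i = k + 1; i < T; ++i)
            trsm(A[k*T + k], A[i*T + k]);                  // R A[k][k], RW A[i][k]
        for (int i = k + 1; i < T; ++i)
            for (int j = k + 1; j <= i; ++j)
                gemm(A[i*T + k], A[j*T + k], A[i*T + j]);  // R, R, RW A[i][j]
    }
    std::printf("L(0,0) = %f\n", A[0][0]);                 // spot check: sqrt(n+1)
    return 0;
}
```

In the multi-implementation setting described by the paper, each of these task types could carry both a CPU variant (e.g., a BLAS call) and a GPU variant (e.g., a cuBLAS/MAGMA kernel), with the scheduler choosing the implementation where the task is executed.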
