high performance computing on graphics processing units: hgpu.org


Exploiting two-level parallelism by aggregating computing resources in task-based applications over accelerator-based machines

Terry Cojean, Abdou Guermouche, Andra Hugo, Raymond Namyst, Pierre-André Wacrenier

INRIA, LaBRI, University of Bordeaux, Talence, France

hal-01181135 (31 July 2015)

@article{cojean2015exploiting,
  title={Exploiting two-level parallelism by aggregating computing resources in task-based applications over accelerator-based machines},
  author={Cojean, Terry and Guermouche, Abdou and Hugo, Andra and Namyst, Raymond and Wacrenier, Pierre-Andr{\'e}},
  year={2015}
}


Computing platforms are now extremely complex, providing an increasing number of CPUs and accelerators. This trend makes balancing computations between these heterogeneous resources critical to performance. In this paper we tackle the task granularity problem: we propose aggregating several CPUs to execute larger parallel tasks, and thus find a better equilibrium between the workload assigned to the CPUs and the workload assigned to the GPUs. To this end, we rely on the notion of scheduling contexts to isolate the parallel tasks and delegate the management of task parallelism to the inner scheduling strategy. We demonstrate the relevance of our approach with the dense Cholesky factorization kernel implemented on top of the StarPU task-based runtime system. We allow elementary tasks to be parallel and to use the parallel Intel MKL implementation, which is optimized through the OpenMP runtime system. We show how our approach handles the interaction between the StarPU and OpenMP runtime systems and how it exploits the parallelism of modern accelerator-based machines. We present experimental results showing that our solution outperforms state-of-the-art implementations, reaching a peak performance of 4.5 TFlop/s on a platform equipped with 20 CPU cores and 4 GPU devices.
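As a rough illustration of the granularity question raised in the abstract, the Python sketch below enumerates the tasks of a standard right-looking tiled Cholesky factorization and splits a set of CPU cores into fixed-size groups. It is not the authors' implementation and calls no StarPU, OpenMP, or MKL API; the function names (`cholesky_tasks`, `task_counts`, `aggregate_cores`), the matrix and tile sizes, and the group size of five cores are all assumptions chosen for illustration.

```python
from collections import Counter


def cholesky_tasks(nt):
    """Yield (kernel, tile_indices) for a right-looking tiled Cholesky
    factorization of a matrix split into nt x nt tiles."""
    for k in range(nt):
        yield ("POTRF", (k,))                # factorize diagonal tile k
        for i in range(k + 1, nt):
            yield ("TRSM", (i, k))           # solve panel tile (i, k)
        for i in range(k + 1, nt):
            yield ("SYRK", (i, k))           # update diagonal tile (i, i)
            for j in range(k + 1, i):
                yield ("GEMM", (i, j, k))    # update off-diagonal tile (i, j)


def task_counts(matrix_size, tile_size):
    """Count tasks per kernel; assumes tile_size divides matrix_size."""
    nt = matrix_size // tile_size
    return Counter(kernel for kernel, _ in cholesky_tasks(nt))


def aggregate_cores(n_cores, cluster_size):
    """Partition core ids 0..n_cores-1 into consecutive groups.
    Each group stands in for one aggregated CPU resource that would
    run a single parallel task (hypothetical model, not a StarPU call)."""
    cores = list(range(n_cores))
    return [cores[s:s + cluster_size] for s in range(0, n_cores, cluster_size)]


if __name__ == "__main__":
    # Illustrative sizes only: larger tiles mean fewer, coarser tasks.
    for tile in (2, 4):
        counts = task_counts(16, tile)
        print(tile, dict(counts), sum(counts.values()))
    # 20 cores grouped by 5 (the group size is an assumption).
    print(aggregate_cores(20, 5))
```

In this toy model, coarser tiles shrink the number of tasks, and grouping cores reduces the number of CPU-side workers that must be kept busy. This is only the intuition behind running larger parallel tasks on aggregated CPUs, not a performance model.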

Tags: Computer science, CUDA, Factorization, Heterogeneous systems, Linear Algebra, nVidia, OpenMP, Task scheduling, Tesla K40

August 5, 2015 by hgpu

