Strategies for Maximizing Utilization in multi-CPU & multi-GPU Heterogeneous Architectures

Angeles Navarro, Antonio Vilches, Francisco Corbera, Rafael Asenjo
Dept. of Computer Architecture, University of Malaga, Spain
Technical Report. Dept. Comp. Architecture. Univ. of Malaga, 2013

@article{navarro2013strategies,
   title={Strategies for Maximizing Utilization in multi-CPU \& multi-GPU Heterogeneous Architectures},
   author={Navarro, Angeles and Vilches, Antonio and Corbera, Francisco and Asenjo, Rafael},
   year={2013}
}

This paper explores the possibility of efficiently executing a single application using multicores simultaneously with multiple GPU accelerators under a parallel task programming paradigm. In particular, we address the challenge of extending a parallel for template to allow its exploitation on heterogeneous architectures. Previous task frameworks that offer support for heterogeneous systems implement a variety of static and dynamic scheduling strategies, but the size of the chunk of iterations assigned to each device is always fixed. Due to the asymmetry of the computing resources, we propose in this work a dynamic scheduling strategy coupled with an adaptive partitioning scheme that resizes chunks to prevent underutilization and load imbalance of CPUs and GPUs. In this paper we also address the problem of the underutilization of the CPU core on which a host thread operates. To solve it, we propose two different approaches: i) a collaborative host thread strategy, in which the host thread, instead of busy-waiting for the GPU to complete, carries out useful chunk processing; to implement this strategy, we modify our partitioning scheme to provide a chunk to the host thread each time a GPU device gets new work; and ii) a host thread blocking strategy combined with oversubscription, which delegates to the OS the duty of scheduling threads to the available CPU cores in order to guarantee that all cores are doing useful work. Using two benchmarks we evaluate the overhead introduced by our scheduling and partitioning algorithms, finding it to be negligible. We also evaluate the efficiency of the proposed strategies, finding that allowing oversubscription controlled by the OS can be beneficial under certain scenarios.
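
The following is a minimal C++ sketch, not taken from the paper, of the two ideas summarized in the abstract: a shared adaptive partitioner hands out chunks of the iteration space to each device, and the host thread that drives a GPU processes a CPU-sized chunk instead of busy-waiting for the GPU to finish. All identifiers (AdaptivePartitioner, launch_gpu_chunk, gpu_done, process_cpu_chunk) and the chunk sizes are hypothetical placeholders, not the authors' API.

// Sketch only: the GPU calls are stubs; a real code would use CUDA/OpenCL and
// the actual parallel for body, and would pick chunk sizes adaptively per device.
#include <atomic>
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <utility>

struct AdaptivePartitioner {
    std::atomic<std::size_t> next{0};
    std::size_t end;
    explicit AdaptivePartitioner(std::size_t n) : end(n) {}

    // Grab up to `want` iterations from the shared range; each device asks for
    // a chunk sized to its throughput, which shrinks near the end of the range.
    std::pair<std::size_t, std::size_t> grab(std::size_t want) {
        std::size_t begin = next.fetch_add(want);
        if (begin >= end) return {end, end};              // nothing left
        return {begin, std::min(begin + want, end)};
    }
};

// Placeholder device kernels (hypothetical).
void launch_gpu_chunk(std::size_t b, std::size_t e) { (void)b; (void)e; } // async GPU launch
bool gpu_done() { return true; }                                          // poll GPU completion
void process_cpu_chunk(std::size_t b, std::size_t e) {
    for (std::size_t i = b; i < e; ++i) { /* loop body */ }
}

int main() {
    const std::size_t n = 1 << 20;
    AdaptivePartitioner part(n);
    const std::size_t gpu_chunk = 1 << 16;   // larger chunks for the GPU
    const std::size_t cpu_chunk = 1 << 12;   // smaller chunks for a CPU core

    // Host thread driving one GPU: the collaborative strategy.
    std::thread host([&] {
        for (;;) {
            auto [gb, ge] = part.grab(gpu_chunk);
            if (gb == ge) break;
            launch_gpu_chunk(gb, ge);
            // Instead of busy-waiting, keep this core busy with CPU chunks.
            while (!gpu_done()) {
                auto [cb, ce] = part.grab(cpu_chunk);
                if (cb == ce) break;
                process_cpu_chunk(cb, ce);
            }
            // (a real code would synchronize with the GPU here before reusing buffers)
        }
    });

    // Remaining CPU worker(s) consume chunks from the same iteration space.
    std::thread worker([&] {
        for (;;) {
            auto [cb, ce] = part.grab(cpu_chunk);
            if (cb == ce) break;
            process_cpu_chunk(cb, ce);
        }
    });

    host.join();
    worker.join();
    std::printf("processed %zu iterations\n", n);
    return 0;
}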