Reducing overheads of dynamic scheduling on heterogeneous chips
Universidad de Malaga, Andalucia Tech, Dept. of Computer Architecture, Spain
arXiv:1501.03336 [cs.DC], (14 Jan 2015)
@article{corbera2015reducing,
title={Reducing overheads of dynamic scheduling on heterogeneous chips},
author={Corbera, Francisco and Rodriguez, Andres and Asenjo, Rafael and Navarro, Angeles and Vilches, Antonio and Garzaran, Maria J.},
year={2015},
month={jan},
archivePrefix={"arXiv"},
primaryClass={cs.DC}
}
In recent processor development, we have witnessed the integration of GPUs and CPUs on a single chip. This integration reduces data communication overheads and enables an efficient collaboration of both devices in the execution of parallel workloads. In this work, we focus on the problem of efficiently scheduling chunks of iterations of parallel loops among the computing devices on the chip (the GPU and the CPU cores) in the context of irregular applications. In particular, we analyze the sources of overhead that the host thread experiences when a chunk of iterations is offloaded to the GPU while other threads are concurrently executing other chunks on the CPU cores. We carefully study these overheads on different processor architectures and operating systems, using Barnes-Hut as a case study representative of irregular applications. We also propose a set of optimizations to mitigate the overheads that arise in the presence of oversubscription and to take advantage of the different features of the heterogeneous architectures. Thanks to these optimizations, we reduce the Energy-Delay Product (EDP) by 18% and 84% on the Intel Ivy Bridge and Haswell architectures, respectively, and by 57% on the Exynos big.LITTLE architecture.
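To make the scheduling pattern concrete, below is a minimal sketch (not taken from the paper) of the execution model the abstract describes: a host thread offloads large chunks of loop iterations to the GPU while the remaining threads process smaller chunks on the CPU cores, all drawing from a shared iteration counter. The functions process_iteration_on_cpu and process_chunk_on_gpu, as well as the chunk sizes, are illustrative placeholders; the GPU call stands in for a real OpenCL/CUDA kernel launch.

// Minimal sketch of dynamic chunk scheduling across a GPU host thread and CPU workers.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static const int NUM_ITERATIONS = 1 << 20;
static const int CPU_CHUNK = 1024;       // chunk size for a CPU worker
static const int GPU_CHUNK = 64 * 1024;  // larger chunk amortizes the offload overhead

std::atomic<int> next_iter(0);  // shared iteration counter (dynamic scheduling)

// Hypothetical placeholders for the per-iteration work and the GPU offload.
void process_iteration_on_cpu(int i) { (void)i; /* irregular per-iteration work */ }
void process_chunk_on_gpu(int begin, int end) { (void)begin; (void)end; /* enqueue kernel, wait */ }

void cpu_worker() {
    while (true) {
        int begin = next_iter.fetch_add(CPU_CHUNK);
        if (begin >= NUM_ITERATIONS) break;
        int end = std::min(begin + CPU_CHUNK, NUM_ITERATIONS);
        for (int i = begin; i < end; ++i) process_iteration_on_cpu(i);
    }
}

void host_thread() {
    // The host thread draws from the same iteration space but sends its chunks
    // to the GPU; while the kernel runs it mostly waits, which is where the
    // oversubscription-related overheads studied in the paper show up.
    while (true) {
        int begin = next_iter.fetch_add(GPU_CHUNK);
        if (begin >= NUM_ITERATIONS) break;
        int end = std::min(begin + GPU_CHUNK, NUM_ITERATIONS);
        process_chunk_on_gpu(begin, end);
    }
}

int main() {
    unsigned n_threads = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    workers.emplace_back(host_thread);        // one thread drives the GPU
    for (unsigned t = 1; t < n_threads; ++t)  // the rest run chunks on the CPU cores
        workers.emplace_back(cpu_worker);
    for (auto& w : workers) w.join();
    std::printf("done\n");
    return 0;
}

Note that with this layout the host thread and the CPU workers compete for the same cores whenever more software threads than cores are active, which is the oversubscription scenario the paper's optimizations target.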
January 15, 2015 by hgpu