high performance computing on graphics processing units: hgpu.org


Dynamic Load Balancing in GPU-Based Systems – Early Experiments

Alvaro Luiz Fazenda, Celso L. Mendes, Laxmikant V. Kale, Jairo Panetta, Eduardo Rocha Rodrigues

Institute of Science and Technology, Federal University of Sao Paulo (UNIFESP), Sao Jose dos Campos-SP, Brazil

arXiv:1310.4218 [cs.DC], (15 Oct 2013)

@article{2013arXiv1310.4218F,
  author={{Fazenda}, A.~L. and {Mendes}, C.~L. and {Kale}, L.~V. and {Panetta}, J. and {Rocha Rodrigues}, E.},
  title={{Dynamic Load Balancing in GPU-Based Systems -- Early Experiments}},
  journal={ArXiv e-prints},
  archivePrefix={arXiv},
  eprint={1310.4218},
  primaryClass={cs.DC},
  keywords={Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3},
  year={2013},
  month={oct},
  adsurl={http://adsabs.harvard.edu/abs/2013arXiv1310.4218F},
  adsnote={Provided by the SAO/NASA Astrophysics Data System}
}


The dynamic load-balancing framework in Charm++/AMPI, developed at the University of Illinois, uses processor virtualization to allow thread migration across processors. The framework has been applied successfully to many scientific applications, including BRAMS, NAMD, and ChaNGa. Most of these applications perform their computation only on CPUs. However, the use of GPUs to improve computational performance is spreading rapidly in the high-performance computing community. This paper investigates how the same Charm++/AMPI framework can be extended to balance load in a synthetic application inspired by the BRAMS numerical forecast model, running mostly on GPUs rather than on CPUs. Several major questions involving the use of GPUs with AMPI were handled in this work, including how to measure the GPU’s load, how to use and share GPUs among user-level threads, and what results are obtained when applying the mandatory over-decomposition technique to a GPU-accelerated program.
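The idea of over-decomposition combined with migration can be illustrated with a small sketch. It is not taken from the paper and does not use the Charm++/AMPI API: it assumes a simple greedy heuristic (heaviest virtual rank first, onto the least-loaded processor), and the function names `greedy_rebalance` and `block_totals` as well as the load values are hypothetical. The loads stand in for whatever per-rank measurement is available, for example GPU kernel time.

```python
import heapq


def greedy_rebalance(loads, num_procs):
    """Assign virtual ranks to processors with a greedy heuristic.

    loads: dict mapping virtual-rank id -> measured load (hypothetical units,
           e.g. seconds of GPU kernel time per step).
    Returns (assignment, totals): rank -> processor, and per-processor load.
    """
    # Min-heap of (current total load, processor id).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest ranks first; break ties by rank id for determinism.
    for rank, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, proc = heapq.heappop(heap)
        assignment[rank] = proc
        heapq.heappush(heap, (total + load, proc))
    totals = [0.0] * num_procs
    for rank, proc in assignment.items():
        totals[proc] += loads[rank]
    return assignment, totals


def block_totals(loads, num_procs):
    """Per-processor load for a static block placement (no balancing).

    Assumes the number of ranks is a multiple of num_procs.
    """
    ranks = sorted(loads)
    per_proc = len(ranks) // num_procs
    return [
        sum(loads[r] for r in ranks[p * per_proc:(p + 1) * per_proc])
        for p in range(num_procs)
    ]


# Hypothetical case: 8 virtual ranks over-decomposed onto 2 processors,
# with the first four ranks doing four times the work of the rest.
example_loads = {r: (4.0 if r < 4 else 1.0) for r in range(8)}
static = block_totals(example_loads, 2)
_, balanced = greedy_rebalance(example_loads, 2)
print("static placement:", static)
print("after rebalancing:", balanced)
```

With these made-up loads, the static block placement leaves one processor with 16.0 units and the other with 4.0, while the greedy placement gives each processor 10.0. Having more virtual ranks than processors is what gives the balancer pieces to move; with one rank per processor there would be nothing to migrate.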

Tags: Computer science, CUDA, nVidia, OpenACC, Task scheduling, Tesla K20, Virtualization

October 18, 2013 by hgpu
