high performance computing on graphics processing units: hgpu.org


Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming

Ashwin M. Aji, Pavan Balaji, James Dinan, Wu-chun Feng, Rajeev Thakur

Dept. of Computer Science, Virginia Tech

3rd Int’l Workshop on Accelerators and Hybrid Exascale Systems (AsHES) (IPDPS), 2013

@article{aji2013synchronization,
  title={Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming},
  author={Aji, Ashwin M and Balaji, Pavan and Dinan, James and Feng, Wu-chun and Thakur, Rajeev},
  year={2013}
}


Despite the vast interest in accelerator-based systems, programming large multinode GPU systems is still a complex task, particularly with respect to optimal data movement across the host-GPU PCIe connection and then across the network. To address such issues, GPU-integrated MPI solutions have been developed that integrate GPU data movement into existing MPI implementations. Currently available GPU-integrated MPI frameworks differ in the buffer synchronization and ordering semantics they provide to users. The two noteworthy models are (1) the unified virtual addressing (UVA)-based approach and (2) the MPI attributes-based approach. In this paper, we compare these approaches for both programmability and performance, and demonstrate that the UVA-based design is useful for isolated communication with no data dependencies or ordering requirements, while the attributes-based design might be more appropriate when multiple interdependent MPI and GPU operations are interleaved.
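The ordering concern raised in the abstract can be illustrated with a toy model. The sketch below is not the paper's code and uses no real MPI or CUDA calls; `ToyStream`, `run_scenario`, and the scenario names are hypothetical, introduced only for illustration. It assumes that in a UVA-style interface the library receives just a buffer pointer, so the caller must synchronize the stream before sending, while in an attributes-style interface the stream is handed to the library, which can order the send after pending GPU work.

```python
from collections import deque


class ToyStream:
    """Hypothetical stand-in for a GPU stream: queued work runs only when drained."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, op):
        # Asynchronous launch: record the operation, do not run it yet.
        self.pending.append(op)

    def synchronize(self):
        # Run all queued operations in FIFO order.
        while self.pending:
            self.pending.popleft()()


def run_scenario(style):
    """Return the value a receiver would see under the given (assumed) model."""
    buf = [0]    # stands in for a device buffer
    sent = []    # stands in for the data that reaches the receiver
    stream = ToyStream()

    # "Kernel" that produces the data; it is only queued, not yet executed.
    stream.enqueue(lambda: buf.__setitem__(0, 42))

    if style == "uva_unsynchronized":
        # The library sees only a pointer and reads the buffer immediately.
        sent.append(buf[0])
    elif style == "uva_explicit_sync":
        # The caller synchronizes the stream before handing over the buffer.
        stream.synchronize()
        sent.append(buf[0])
    elif style == "attribute":
        # The stream is passed along, so the send is queued behind the kernel.
        stream.enqueue(lambda: sent.append(buf[0]))
    else:
        raise ValueError(f"unknown style: {style}")

    stream.synchronize()
    return sent[0]


if __name__ == "__main__":
    for style in ("uva_unsynchronized", "uva_explicit_sync", "attribute"):
        print(style, run_scenario(style))
```

In this toy model the unsynchronized pointer-only send observes the stale value, while the explicitly synchronized send and the stream-aware send both observe the kernel's result. With an isolated transfer that has no pending GPU work, all three variants would behave the same, which matches the abstract's distinction between isolated and interleaved communication.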

Tags: Computer science, CUDA, Hybrid computing, MPI, nVidia, OpenCL, Tesla C2050

April 3, 2013 by hgpu
