high performance computing on graphics processing units: hgpu.org

Scalable communication for high-order stencil computations using CUDA-aware MPI

Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Department of Computer Science, Aalto University, Konemiehentie 2, 02150 Espoo, Finland

arXiv:2103.01597 [cs.DC], (2 Mar 2021)

@misc{pekkilä2021scalable,
  title={Scalable communication for high-order stencil computations using CUDA-aware MPI},
  author={Johannes Pekkilä and Miikka S. Väisälä and Maarit J. Käpylä and Matthias Rheinhardt and Oskar Lappi},
  year={2021},
  eprint={2103.01597},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, because arithmetic performance has increased faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has widened further with the introduction of graphics processing units, which can provide several times higher throughput than central processing units in data-parallel tasks. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving the intra-node locality of workloads. Measured against a theoretical performance model, our implementation exhibits strong scaling from one to 64 devices at 50%–87% efficiency in sixth-order stencil computations when the problem domain consists of 256^3–1024^3 cells.
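To illustrate why a sixth-order stencil makes communication a concern, the following is a minimal sketch in plain Python, not the authors' implementation. It splits a 1D periodic field into subdomains, pads each with a halo of three cells copied from its neighbours, and applies the standard sixth-order central difference for the first derivative. The function names (`derivative`, `exchange_halos`, `decomposed_derivative`) and the 1D periodic setup are assumptions made for this example; in a multi-GPU code the list copies in `exchange_halos` would be messages between devices, for instance through CUDA-aware MPI.

```python
import math

# Sixth-order central difference coefficients for the first derivative,
# for offsets -3..3. The stencil radius (and so the halo width) is 3.
COEFFS = (-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60)
RADIUS = 3


def derivative(padded, h):
    """Apply the stencil to the interior of an array that carries
    RADIUS halo cells on each side."""
    n = len(padded) - 2 * RADIUS
    return [
        sum(c * padded[i + k] for k, c in enumerate(COEFFS)) / h
        for i in range(n)
    ]


def exchange_halos(subdomains):
    """Pad every subdomain with halo cells from its periodic neighbours.
    Hypothetical stand-in for device-to-device communication."""
    parts = len(subdomains)
    padded = []
    for rank, local in enumerate(subdomains):
        left = subdomains[(rank - 1) % parts][-RADIUS:]
        right = subdomains[(rank + 1) % parts][:RADIUS]
        padded.append(left + local + right)
    return padded


def decomposed_derivative(field, h, parts):
    """Differentiate a periodic field after splitting it into equal parts.
    Assumes len(field) is divisible by parts and each part has >= RADIUS cells."""
    size = len(field) // parts
    subdomains = [field[r * size:(r + 1) * size] for r in range(parts)]
    result = []
    for padded in exchange_halos(subdomains):
        result.extend(derivative(padded, h))
    return result


if __name__ == "__main__":
    n = 64
    h = 2 * math.pi / n
    field = [math.sin(i * h) for i in range(n)]
    approx = decomposed_derivative(field, h, parts=4)
    error = max(abs(a - math.cos(i * h)) for i, a in enumerate(approx))
    print(f"max error against cos(x): {error:.2e}")
```

Each subdomain needs three cells from each neighbour before it can update its own boundary cells. As the domain is divided among more devices, the halo volume grows relative to the interior work, which is one reason data movement limits strong scaling in this kind of computation.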

Tags: Computer science, CUDA, Finite difference, Magnetohydrodynamics, MPI, nVidia, Tesla V100

March 7, 2021 by hgpu
