Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
Department of Computer Science, North Carolina State University
Preprint ANL/MCS-P2028-0212, 2012
@article{jenkins2012enabling,
title={Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments},
author={Jenkins, J. and Balaji, P. and Dinan, J. and Samatova, N.F. and Thakur, R.},
year={2012}
}
Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific and engineering computations. A particular challenge is the efficient transfer of noncontiguous data to and from GPU memory. MPI supports such transfers through the use of datatypes; however, an efficient means of utilizing datatypes for noncontiguous data in GPU memory is not currently known. To address this gap, we present the design and implementation of an efficient MPI datatype processing system, which is capable of processing arbitrary datatypes directly on the GPU. We present a means for converting conventional datatype representations into a GPU-tractable format that exposes parallelism. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead, while enabling the packing of datatypes that do not have a direct CUDA equivalent. These improvements are demonstrated to translate into significant reductions in end-to-end, GPU-to-GPU communication time.
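To make the idea concrete, the sketch below illustrates the kind of fine-grained, element-level packing the abstract describes, for the simplest noncontiguous layout: a strided column vector, which MPI would describe with MPI_Type_vector. This is an illustrative CUDA sketch, not the paper's actual implementation; the kernel and function names (pack_column, pack_column_on_gpu) are hypothetical, and the real system handles arbitrary nested datatypes rather than a single hard-coded stride.

#include <cuda_runtime.h>

// Hypothetical sketch: pack `count` elements spaced `stride`
// doubles apart into a contiguous buffer, assigning one thread
// per element (element-level parallelism).
__global__ void pack_column(const double *src, double *dst,
                            size_t count, size_t stride)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count)
        dst[i] = src[i * stride];  // gather the i-th noncontiguous element
}

// Host-side usage sketch: pack the first column of an N x N
// row-major matrix resident in GPU memory. On the host, MPI would
// describe the same layout as
//   MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col);
void pack_column_on_gpu(const double *d_matrix, double *d_packed, size_t N)
{
    const int threads = 256;
    int blocks = (int)((N + threads - 1) / threads);
    pack_column<<<blocks, threads>>>(d_matrix, d_packed, N, N);
    cudaDeviceSynchronize();
}

Packing in device memory this way yields a contiguous buffer that can be handed to the CUDA runtime (or to a GPU-aware MPI library) for transfer in a single copy, rather than issuing one small copy per noncontiguous element.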
March 18, 2012 by hgpu