MPI Derived Datatypes Processing on Noncontiguous GPU-resident Data
Department of Computer Science, North Carolina State University
North Carolina State University, Preprint ANL/MCS-P4042-0313, 2013
@article{jenkins2013mpi,
  title={MPI Derived Datatypes Processing on Noncontiguous GPU-resident Data},
  author={Jenkins, John and Dinan, James and Balaji, Pavan and Peterka, Tom and Samatova, Nagiza F. and Thakur, Rajeev},
  note={Preprint ANL/MCS-P4042-0313},
  year={2013}
}
Driven by the goals of efficient and generic communication of noncontiguous data layouts in GPU memory, for which solutions do not currently exist, we present a parallel, noncontiguous data-processing methodology through the MPI datatypes specification. Our processing algorithm utilizes a kernel on the GPU to pack arbitrary noncontiguous GPU data by enriching the datatypes encoding to expose a fine-grained, data-point level of parallelism. Additionally, the typically tree-based datatype encoding is preprocessed to enable efficient, cached access across GPU threads. Using CUDA, we show that the computational method outperforms DMA-based alternatives for several common data layouts as well as more complex data layouts for which reasonable DMA-based processing does not exist. Our method incurs low overhead for data layouts that closely match best-case DMA usage or that can be processed by layout-specific implementations. We additionally investigate usage scenarios for data packing that incur resource contention, identifying potential pitfalls for various packing strategies. We also demonstrate the efficacy of kernel-based packing in various communication scenarios, showing multifold improvement in point-to-point communication and evaluating packing within the context of the SHOC stencil benchmark and HACC mesh analysis.
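To make the "data-point level of parallelism" concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of packing a strided layout in the style of MPI_Type_vector: each packed element's source offset is computed independently from the type parameters, so on a GPU every element could be assigned to its own thread. Here a serial C loop stands in for the thread index; the function name `pack_vector` is invented for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch: pack `count` blocks of `blocklen` doubles,
 * spaced `stride` elements apart, into a contiguous buffer.
 * In a CUDA kernel, `i` would be the global thread index; each
 * data point's source offset depends only on the type parameters,
 * exposing fine-grained, data-point parallelism. */
static void pack_vector(const double *src, double *dst,
                        int count, int blocklen, int stride)
{
    int total = count * blocklen;
    for (int i = 0; i < total; i++) {       /* i == per-element thread id */
        int block  = i / blocklen;          /* which block i falls in     */
        int within = i % blocklen;          /* offset inside that block   */
        dst[i] = src[(size_t)block * stride + within];
    }
}
```

For example, with `count = 2`, `blocklen = 2`, `stride = 4` over a source array `{0,1,2,...}`, the packed buffer holds `{0, 1, 4, 5}`. Because each output index maps to exactly one input offset, no synchronization between threads is needed, which is what makes a kernel-based packer competitive with DMA-based copies.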
April 30, 2013 by hgpu