Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects
Sreeram Potluri
The Ohio State University
The Ohio State University, 2014
@phdthesis{potluri2014enabling,
title={Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects},
author={Potluri, Sreeram},
year={2014},
school={The Ohio State University}
}
Accelerators (such as NVIDIA GPUs) and coprocessors (such as the Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems that offer high compute density and high performance per watt. However, these many-core architectures make systems heterogeneous by introducing multiple levels of parallelism and varying computation/communication costs at each level. Application developers likewise use a hierarchy of programming models to extract maximum performance from these heterogeneous systems. Models such as CUDA, OpenCL, and LEO are used to express parallelism across accelerator or coprocessor cores, while higher-level models such as MPI and OpenSHMEM are used to express parallelism across a cluster. The presence of multiple programming models and runtimes, together with the varying communication performance at different levels of the system hierarchy, has hindered applications from achieving peak performance on these systems. Modern interconnects such as InfiniBand enable asynchronous communication progress through RDMA, freeing the cores to do useful computation. MPI and PGAS models offer one-sided communication primitives that extract maximum performance from these high-performance networks, minimize process synchronization overheads, and enable better overlap of computation and communication. However, there is limited literature to guide scientists in taking advantage of these one-sided communication semantics in high-end applications, even more so on heterogeneous clusters.
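The abstract's point about one-sided primitives is easiest to see in code. Below is a minimal C sketch (not taken from the thesis; the buffer size and the dummy computation are placeholder assumptions) of MPI-3 one-sided communication: MPI_Put is issued inside a passive-target epoch, so an RDMA-capable interconnect such as InfiniBand can progress the transfer asynchronously while the origin process keeps computing. OpenSHMEM, as a PGAS model, offers analogous primitives such as shmem_putmem.

/* Hypothetical sketch of MPI-3 one-sided communication with
 * computation/communication overlap; N and the dummy computation
 * are illustrative placeholders, not from the thesis. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024  /* placeholder buffer size */

int main(int argc, char **argv)
{
    int rank;
    double *win_buf, *local;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose a window of N doubles on every process. */
    MPI_Win_allocate(N * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);

    local = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        local[i] = (double)rank;

    /* Passive-target epoch: the target process takes no part in
     * completing the communication. */
    MPI_Win_lock_all(0, win);

    if (rank == 1) {
        /* Initiate a one-sided put into rank 0's window; with RDMA the
         * network can progress this transfer without CPU involvement... */
        MPI_Put(local, N, MPI_DOUBLE, 0, 0, N, MPI_DOUBLE, win);

        /* ...while this process overlaps independent computation. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += local[i] * local[i];

        /* Complete the outstanding put before reusing the buffer. */
        MPI_Win_flush(0, win);
        (void)sum;
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}

Launched with at least two processes (e.g., mpirun -np 2 ./put_overlap), rank 1 moves its buffer into rank 0's window while it computes; a flush, rather than a matching receive on the target, completes the transfer, which is what minimizes process synchronization overhead.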
September 29, 2014 by hgpu