high performance computing on graphics processing units: hgpu.org

Implementing the Himeno benchmark with CUDA on GPU clusters

Everett H. Phillips, Massimiliano Fatica

NVIDIA, US

2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010. Publisher: IEEE. Pages: 1-10

DOI:10.1109/IPDPS.2010.5470394

@conference{phillips2010implementing,
  title={Implementing the Himeno benchmark with CUDA on GPU clusters},
  author={Phillips, E.H. and Fatica, M.},
  booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
  pages={1--10},
  issn={1530-2075},
  organization={IEEE}
}

This paper describes the use of CUDA to accelerate the Himeno benchmark on GPU clusters. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that uses MPI alongside CUDA streams to overlap GPU execution with data transfers scales linearly and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
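As a rough sanity check on the figures quoted in the abstract, the short Python sketch below computes the scaling efficiency implied by the single-GPU and 16-GPU numbers, plus the memory bandwidth that 83% of peak would correspond to. The performance figures come from the abstract, where each is stated as "over ...", so the results are indicative only. The 102 GB/s peak bandwidth for the Tesla C1060 is an assumption taken from the card's published specification, not from this page, and the function names are hypothetical.

```python
# Figures quoted in the abstract (each stated as "over ...", so treat them as lower bounds).
SINGLE_GPU_GFLOPS = 50.0    # one Tesla C1060
CLUSTER_GFLOPS = 800.0      # cluster total
NUM_GPUS = 16
BANDWIDTH_FRACTION = 0.83   # fraction of theoretical peak bandwidth achieved

# Assumption (not stated in the abstract): C1060 theoretical peak memory bandwidth, in GB/s.
ASSUMED_PEAK_BANDWIDTH_GBS = 102.0


def scaling_efficiency(cluster_gflops, single_gflops, num_gpus):
    """Measured cluster throughput divided by ideal linear scaling (1.0 means linear)."""
    return cluster_gflops / (single_gflops * num_gpus)


def achieved_bandwidth(peak_gbs, fraction):
    """Bandwidth in GB/s corresponding to a given fraction of the theoretical peak."""
    return peak_gbs * fraction


efficiency = scaling_efficiency(CLUSTER_GFLOPS, SINGLE_GPU_GFLOPS, NUM_GPUS)
bandwidth = achieved_bandwidth(ASSUMED_PEAK_BANDWIDTH_GBS, BANDWIDTH_FRACTION)

print(f"scaling efficiency: {efficiency:.2f}")
print(f"achieved bandwidth: {bandwidth:.1f} GB/s")
```

With these inputs, 16 GPUs at 50 GFlops each give exactly the 800 GFlops quoted for the cluster, which is consistent with the abstract's claim of linear scaling.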

Tags: Benchmarking, Computer science, CUDA, GPU cluster, nVidia, Performance, Tesla C1060

March 16, 2011 by hgpu
