Implementing the Himeno benchmark with CUDA on GPU clusters

Everett H. Phillips, Massimiliano Fatica
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010. Publisher: IEEE. Pages: 1-10


   title={Implementing the Himeno benchmark with CUDA on GPU clusters},
   author={Phillips, E.H. and Fatica, M.},
   booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},








This paper describes the use of CUDA to accelerate the Himeno benchmark on GPU clusters. The implementation is designed to maximize memory bandwidth utilization: our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and sustains over 50 GFlops. A multi-GPU implementation that uses MPI alongside CUDA streams to overlap GPU execution with data transfers scales linearly and sustains over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
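The figures quoted in the abstract can be checked against each other with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it treats the quoted "over 50" and "over 800" GFlops as exact values, so the resulting efficiency is indicative rather than measured. The peak bandwidth of 102 GB/s for the Tesla C1060 is an assumption (the commonly quoted vendor figure); the abstract gives only the 83% fraction. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract (both throughput values are lower bounds).
SINGLE_GPU_GFLOPS = 50.0    # one Tesla C1060
CLUSTER_GFLOPS = 800.0      # cluster with 16 GPUs
NUM_GPUS = 16
BANDWIDTH_FRACTION = 0.83   # share of theoretical peak bandwidth achieved

# ASSUMPTION (not stated in the abstract): vendor-quoted peak memory
# bandwidth of a Tesla C1060, in GB/s.
ASSUMED_PEAK_BANDWIDTH_GBS = 102.0


def parallel_efficiency(cluster_gflops, single_gflops, num_gpus):
    """Cluster throughput divided by ideal (perfectly linear) throughput."""
    return cluster_gflops / (single_gflops * num_gpus)


def achieved_bandwidth(peak_gbs, fraction):
    """Sustained bandwidth implied by a fraction of the theoretical peak."""
    return peak_gbs * fraction


efficiency = parallel_efficiency(CLUSTER_GFLOPS, SINGLE_GPU_GFLOPS, NUM_GPUS)
bandwidth = achieved_bandwidth(ASSUMED_PEAK_BANDWIDTH_GBS, BANDWIDTH_FRACTION)

print(f"parallel efficiency on {NUM_GPUS} GPUs: {efficiency:.2f}")
print(f"implied sustained bandwidth: {bandwidth:.1f} GB/s")
```

An efficiency close to 1 is what "linear scaling" means here: 16 GPUs deliver roughly 16 times the single-GPU rate. Because the benchmark is bandwidth-bound, the sustained bandwidth figure is the more direct measure of how well a single GPU is used.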
