Implementing the Himeno benchmark with CUDA on GPU clusters
NVIDIA, US
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010. Publisher: IEEE. Pages: 1-10
@conference{phillips2010implementing,
title={Implementing the Himeno benchmark with CUDA on GPU clusters},
author={Phillips, E.H. and Fatica, M.},
booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
pages={1--10},
year={2010},
issn={1530-2075},
organization={IEEE}
}
This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers allows linear scaling and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
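The bandwidth and GFlops figures in the abstract can be related with a rough estimate for a memory-bound stencil. The sketch below is illustrative only and rests on assumptions that the abstract does not state: a theoretical peak bandwidth of about 102 GB/s for the Tesla C1060, the Himeno convention of 34 floating-point operations per grid point, and ideal traffic of 14 single-precision words per grid point (13 arrays read, 1 written, with perfect reuse of neighbouring pressure values). The function name and parameters are hypothetical.

```python
# Back-of-the-envelope bandwidth bound for a memory-bound stencil.
# Assumptions (not taken from the abstract): 102 GB/s peak bandwidth,
# 34 flops and 14 four-byte words of memory traffic per grid point.

def bandwidth_bound_gflops(peak_gb_per_s, efficiency,
                           flops_per_point=34, words_per_point=14,
                           bytes_per_word=4):
    """Return the GFlops rate sustainable at a given fraction of peak bandwidth."""
    achieved_gb_per_s = peak_gb_per_s * efficiency
    bytes_per_point = words_per_point * bytes_per_word
    # GB/s divided by bytes per point gives giga-points per second.
    gpoints_per_s = achieved_gb_per_s / bytes_per_point
    return gpoints_per_s * flops_per_point


single_gpu = bandwidth_bound_gflops(102.0, 0.83)
cluster = 16 * single_gpu  # ideal linear scaling across 16 GPUs
print(f"single GPU: {single_gpu:.1f} GFlops, 16 GPUs: {cluster:.0f} GFlops")
```

Under these assumptions the estimate comes out slightly above 50 GFlops for one GPU and slightly above 800 GFlops for 16 GPUs, which is consistent with the figures quoted in the abstract. It is a plausibility check, not a reconstruction of the authors' measurements.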
March 16, 2011 by hgpu