Implementing the Himeno benchmark with CUDA on GPU clusters
NVIDIA, US
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010. Publisher: IEEE, Pages: 1-10
This paper describes the use of CUDA to accelerate the Himeno benchmark on GPU clusters. The implementation is designed to optimize memory bandwidth utilization. On a single NVIDIA Tesla C1060 GPU, it achieves over 83% of the theoretical peak memory bandwidth and sustains over 50 GFlops. A multi-GPU implementation uses MPI together with CUDA streams to overlap GPU execution with data transfers; it scales linearly and sustains over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
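The linear-scaling claim can be checked against the figures quoted in the abstract. The short Python sketch below is illustrative only: the function names are hypothetical, the performance figures are lower bounds ("over 50", "over 800") rather than exact measurements, and the 102 GB/s peak bandwidth for the Tesla C1060 is an assumption taken from NVIDIA's published specification, not from the abstract.

```python
# Figures quoted in the abstract (each is stated as "over ...", so these
# are lower bounds, not exact measurements).
SINGLE_GPU_GFLOPS = 50.0
CLUSTER_GFLOPS = 800.0
NUM_GPUS = 16
BANDWIDTH_FRACTION = 0.83

# ASSUMPTION (not in the abstract): Tesla C1060 peak memory bandwidth in GB/s.
ASSUMED_PEAK_BANDWIDTH_GBS = 102.0


def parallel_efficiency(cluster_gflops, single_gpu_gflops, num_gpus):
    """Ratio of measured cluster throughput to ideal (linear) throughput."""
    return cluster_gflops / (single_gpu_gflops * num_gpus)


def sustained_bandwidth(peak_gbs, fraction):
    """Bandwidth actually achieved, given the fraction of peak that is used."""
    return peak_gbs * fraction


if __name__ == "__main__":
    eff = parallel_efficiency(CLUSTER_GFLOPS, SINGLE_GPU_GFLOPS, NUM_GPUS)
    bw = sustained_bandwidth(ASSUMED_PEAK_BANDWIDTH_GBS, BANDWIDTH_FRACTION)
    print(f"Per-GPU throughput on the cluster: {CLUSTER_GFLOPS / NUM_GPUS:.1f} GFlops")
    print(f"Parallel efficiency: {eff:.2f}")
    print(f"Sustained bandwidth (under the assumed peak): {bw:.1f} GB/s")
```

With the quoted lower bounds, 800 GFlops across 16 GPUs works out to 50 GFlops per GPU, the same as the single-GPU figure, which is what linear scaling means here.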
March 16, 2011 by hgpu