Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer
Tokyo Institute of Technology
International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’12), 2012
@inproceedings{nukada2012scalable,
title={Scalable Multi-{GPU} 3-{D} {FFT} for {TSUBAME} 2.0 Supercomputer},
author={Nukada, A. and Sato, K. and Matsuoka, S.},
booktitle={Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
pages={44},
year={2012},
organization={IEEE Computer Society Press}
}
For scalable 3-D FFT computation on multiple GPUs, efficient all-to-all communication between GPUs is the most important factor in achieving good performance. Implementations built on point-to-point MPI library functions and CUDA memory copy APIs typically exhibit very large overheads, especially for the small message sizes that arise in all-to-all communication among many nodes. We propose several schemes to minimize these overheads, including the use of the lower-level InfiniBand API to effectively overlap intra- and inter-node communication, as well as auto-tuning strategies that control scheduling and determine rail assignments. As a result, we achieve very good strong scalability as well as good performance: up to 4.8 TFLOPS in double precision using 256 nodes (768 GPUs) of the TSUBAME 2.0 supercomputer.
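To illustrate why small messages become the bottleneck in strong scaling, the following back-of-the-envelope sketch estimates the size of each point-to-point message in one global transpose. It is not taken from the paper: it assumes an n x n x n double-precision complex grid divided evenly over all processes, with every process taking part in a single all-to-all, and the grid size 2048 is a hypothetical value chosen only for illustration. The function names are likewise illustrative. The flop count uses the common 5 N log2 N convention for complex FFTs, which may differ from the counting used by the authors.

```python
import math

BYTES_PER_COMPLEX_DOUBLE = 16  # two 8-byte floats (real and imaginary parts)


def alltoall_message_bytes(n, num_procs, bytes_per_element=BYTES_PER_COMPLEX_DOUBLE):
    """Bytes one process sends to one peer in a single global transpose.

    Assumes an n x n x n grid split evenly over num_procs processes, all of
    which take part in one all-to-all (an illustrative assumption).
    """
    local_elements = n ** 3 / num_procs  # data held by each process
    return local_elements / num_procs * bytes_per_element


def fft_flops(n):
    """Nominal flop count for a complex n^3 FFT: 5 * N * log2(N), N = n^3."""
    total_points = n ** 3
    return 5 * total_points * math.log2(total_points)


if __name__ == "__main__":
    n = 2048  # hypothetical grid size, not taken from the paper
    for num_procs in (48, 192, 768):
        size_kib = alltoall_message_bytes(n, num_procs) / 1024
        print(f"{num_procs:4d} processes: {size_kib:10.1f} KiB per message")
    print(f"nominal flops for n={n}: {fft_flops(n):.3e}")
```

Under these assumptions the per-message size falls with the square of the process count: quadrupling the number of processes makes each message 16 times smaller while each process exchanges data with four times as many peers, so fixed per-message overheads take up a growing share of the communication time.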
November 23, 2012 by hgpu