Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator
The Ohio State University
The Ohio State University, 2016
@phdthesis{shi2016implementation,
title={Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator},
author={Shi, Bobo},
year={2016},
school={The Ohio State University}
}
CCSD(T), a coupled cluster (CC) method, is one of the most accurate methods applicable to reasonably large molecules in computational chemistry. An efficient parallel CCSD(T) implementation would have a significant impact on the application of high-accuracy methods. The Intel Xeon Phi coprocessor and the NVIDIA GPU are among the most important coprocessors/accelerators, offering powerful parallel computing capability through their massively parallel many-core architectures. In this work, CCSD(T) is implemented on both the Intel Xeon Phi coprocessor and the NVIDIA GPU. The CCSD(T) method consists of a sequence of tensor contractions. For efficiency, the result tensor is allocated only on the coprocessor/accelerator and is kept there to receive the results of the successive tensor contractions performed on the device. The input tensors are offloaded from the host to the coprocessor/accelerator for each tensor contraction. After all tensor contractions are finished, the final result is accumulated on the coprocessor/accelerator, which avoids a large data transfer from the coprocessor/accelerator back to the host. The tensor contractions are performed using BLAS DGEMM on the coprocessor/accelerator, and the result is then post-processed in a 6-dimensional loop. For the Intel Xeon Phi implementation, OpenMP is used to bind threads to physical processing units on the coprocessor, and the OpenMP thread affinity is tuned to obtain the best performance. For the GPU, an algorithm is designed to map the 6-dimensional post-processing loop to CUDA threads, and gridDim and blockDim are tuned to reach the best performance. Overall speedups of 4x and of 9x to 13x are obtained for the Intel Xeon Phi and GPU implementations, respectively.
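The abstract does not spell out how the 6-dimensional post-processing loop is mapped to CUDA threads. One common approach, shown here only as a minimal host-side C++ sketch and not as the thesis's actual algorithm, is to flatten the six loop indices into a single index and let each thread recover its six indices by repeated division. The names `Extents`, `unflatten`, `total_elements`, and `emulate_launch` are hypothetical, as is the row-major index ordering; the parameters `grid_dim` and `block_dim` stand in for CUDA's gridDim.x and blockDim.x.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Extents of the six loop indices (hypothetical; the abstract does not
// state the actual index ranges or their ordering).
using Extents = std::array<std::size_t, 6>;

// Total number of loop iterations, i.e. the product of the six extents.
std::size_t total_elements(const Extents& n) {
    std::size_t total = 1;
    for (std::size_t e : n) total *= e;
    return total;
}

// Decompose a flat index into six loop indices, last index fastest.
// In a CUDA kernel, `flat` would typically start at
// blockIdx.x * blockDim.x + threadIdx.x.
std::array<std::size_t, 6> unflatten(std::size_t flat, const Extents& n) {
    assert(flat < total_elements(n));
    std::array<std::size_t, 6> idx{};
    for (int d = 5; d >= 0; --d) {
        idx[d] = flat % n[d];
        flat /= n[d];
    }
    return idx;
}

// Emulate a 1-D launch on the host: every (block, thread) pair walks the
// flat index space with a grid-stride loop. Returns how many times each
// element of the 6-D iteration space was visited.
std::vector<int> emulate_launch(const Extents& n,
                                std::size_t grid_dim,
                                std::size_t block_dim) {
    const std::size_t total = total_elements(n);
    const std::size_t stride = grid_dim * block_dim;
    std::vector<int> visits(total, 0);
    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            for (std::size_t flat = block * block_dim + thread;
                 flat < total; flat += stride) {
                const auto idx = unflatten(flat, n);
                // Recompute the row-major offset from the six indices.
                std::size_t offset = 0;
                for (int d = 0; d < 6; ++d) offset = offset * n[d] + idx[d];
                ++visits[offset];
            }
        }
    }
    return visits;
}
```

With this kind of mapping, changing `grid_dim` and `block_dim` only changes how the flat index space is divided among threads, not which elements are processed, which is what makes the two launch parameters safe to tune for performance.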
September 5, 2016 by hgpu