Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator


Bobo Shi

The Ohio State University

The Ohio State University, 2016

@phdthesis{shi2016implementation,
  title={Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator},
  author={Shi, Bobo},
  year={2016},
  school={The Ohio State University}
}


CCSD(T), part of the coupled cluster (CC) family of methods, is one of the most accurate methods in computational chemistry that is applicable to reasonably large molecules. An efficient parallel CCSD(T) implementation will have a significant impact on the application of high-accuracy methods. The Intel Xeon Phi coprocessor and the NVIDIA GPU are among the most important coprocessors/accelerators, offering powerful parallel computing capability through their massively parallel many-core architectures. In this work, CCSD(T) code is implemented on both the Intel Xeon Phi coprocessor and the NVIDIA GPU. The CCSD(T) method performs tensor contractions. For an efficient implementation, we allocate the result tensor only on the Intel Xeon Phi coprocessor or GPU and keep it there to receive the results of a sequence of tensor contractions performed on that device. The input tensors are offloaded from the host to the coprocessor/accelerator for each tensor contraction. After all the tensor contractions are finished, the final result is accumulated on the coprocessor/accelerator to avoid a huge data transfer from the coprocessor/accelerator to the host. The tensor contractions are performed using BLAS DGEMM on the coprocessor/accelerator. The result is then post-processed using a 6-dimensional loop. For the Intel Xeon Phi implementation, OpenMP is used to bind threads to physical processing units on the Xeon Phi coprocessor, and the OpenMP thread affinity is tuned to obtain the best performance. For the GPU, an algorithm is designed to map the 6-dimensional post-processing loop to CUDA threads, and gridDim and blockDim are tuned to reach the best performance. Overall speedups of 4x and 9x to 13x are obtained for the Intel Xeon Phi and GPU implementations, respectively.
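The abstract does not spell out how the 6-dimensional post-processing loop is mapped to CUDA threads. The following is a minimal Python sketch of one common scheme, given purely as an illustration and not as the thesis's algorithm: each thread takes one flat index (the analogue of blockIdx.x * blockDim.x + threadIdx.x) and decodes it into six loop indices, with the last index varying fastest. The names unflatten6, launch_config, simulate_kernel, and reference_order are hypothetical, and the row-major ordering and one-thread-per-iteration layout are assumptions.

```python
from itertools import product


def unflatten6(tid, dims):
    """Decode a flat thread id into six loop indices (last index fastest).

    Assumed row-major ordering; the thesis may use a different mapping.
    """
    indices = []
    for extent in reversed(dims):
        tid, remainder = divmod(tid, extent)
        indices.append(remainder)
    return tuple(reversed(indices))


def launch_config(total, block_dim):
    """Smallest grid_dim such that grid_dim * block_dim >= total."""
    return (total + block_dim - 1) // block_dim


def simulate_kernel(dims, block_dim):
    """Emulate a 1-D CUDA launch on the host, one loop iteration per thread."""
    total = 1
    for extent in dims:
        total *= extent
    grid_dim = launch_config(total, block_dim)
    visited = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            tid = block_idx * block_dim + thread_idx
            if tid >= total:
                # Surplus threads in the last block do no work.
                continue
            visited.append(unflatten6(tid, dims))
    return visited


def reference_order(dims):
    """Index tuples in the order a plain six-deep nested loop visits them."""
    return list(product(*(range(extent) for extent in dims)))
```

Under these assumptions, simulate_kernel(dims, block_dim) visits exactly the index tuples of the nested loop for any positive block_dim. Because correctness does not depend on the block size, blockDim and gridDim remain free to be tuned for performance, as the abstract describes.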

Tags: Algorithms, Chemistry, Computational chemistry, CUDA, Intel Xeon Phi, nVidia, OpenMP, Tesla K40, Thesis

September 5, 2016 by hgpu

