Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator

Bobo Shi

The Ohio State University

The Ohio State University, 2016

CCSD(T), a coupled cluster (CC) method, is one of the most accurate methods in computational chemistry that remains applicable to reasonably large molecules. An efficient parallel CCSD(T) implementation would therefore have a significant impact on the application of high-accuracy methods. The Intel Xeon Phi coprocessor and the NVIDIA GPU are two of the most important coprocessors/accelerators, offering powerful parallel computing capability through their massively parallel many-core architectures. In this work, CCSD(T) is implemented on both the Intel Xeon Phi coprocessor and the NVIDIA GPU. The CCSD(T) method performs a sequence of tensor contractions. To make the implementation efficient, the result tensor is allocated only on the coprocessor/accelerator and kept there, where it receives the results of each tensor contraction performed on the device. The input tensors are offloaded from the host to the coprocessor/accelerator for each tensor contraction. After all tensor contractions are finished, the final result is accumulated on the coprocessor/accelerator, which avoids a large data transfer from the coprocessor/accelerator back to the host. The tensor contractions are performed with BLAS DGEMM on the coprocessor/accelerator, and the result is then post-processed in a 6-dimensional loop. For the Intel Xeon Phi implementation, OpenMP is used to bind threads to physical processing units on the coprocessor, and the OpenMP thread affinity is tuned to obtain the best performance. For the GPU, an algorithm is designed to map the 6-dimensional post-processing loop to CUDA threads, and gridDim and blockDim are tuned to reach the best performance. The overall speedup is 4x for the Intel Xeon Phi implementation and 9x to 13x for the GPU implementation.
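The abstract does not say how the 6-dimensional post-processing loop is mapped to CUDA threads. One common approach, shown here only as a minimal host-side C++ sketch and not as the thesis's actual algorithm, is to give each thread a flat id and decompose that id into six loop indices. The names `Extents`, `unflatten`, `nested_order`, `total_elements`, and `blocks_needed` are hypothetical, and the row-major ordering (last index fastest) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Extents of the six loop indices (hypothetical; not taken from the thesis).
using Extents = std::array<std::size_t, 6>;
using Index6 = std::array<std::size_t, 6>;

// Total number of loop iterations, i.e. the number of work items to cover.
std::size_t total_elements(const Extents& n) {
    std::size_t total = 1;
    for (std::size_t extent : n) total *= extent;
    return total;
}

// Decompose a flat work-item id into six loop indices, last index fastest.
// In a CUDA kernel the flat id would typically be computed as
// blockIdx.x * blockDim.x + threadIdx.x (assumed 1D launch).
Index6 unflatten(std::size_t flat, const Extents& n) {
    Index6 idx{};
    for (std::size_t d = 6; d-- > 0;) {
        assert(n[d] > 0);
        idx[d] = flat % n[d];
        flat /= n[d];
    }
    return idx;
}

// Reference traversal: the same iteration space as an explicit six-deep loop nest.
std::vector<Index6> nested_order(const Extents& n) {
    std::vector<Index6> out;
    out.reserve(total_elements(n));
    for (std::size_t a = 0; a < n[0]; ++a)
        for (std::size_t b = 0; b < n[1]; ++b)
            for (std::size_t c = 0; c < n[2]; ++c)
                for (std::size_t i = 0; i < n[3]; ++i)
                    for (std::size_t j = 0; j < n[4]; ++j)
                        for (std::size_t k = 0; k < n[5]; ++k)
                            out.push_back({a, b, c, i, j, k});
    return out;
}

// Smallest block count such that blocks * threads_per_block >= total.
// Threads whose flat id is >= total would have to exit early in a real kernel.
std::size_t blocks_needed(std::size_t total, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (total + threads_per_block - 1) / threads_per_block;
}
```

With this kind of mapping, the block size is a free parameter and the block count follows from it by ceiling division, so tuning blockDim of the sort the abstract mentions would not require changing the index arithmetic. Whether the thesis uses a flat 1D launch or a multi-dimensional grid is not stated in the abstract.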

Tags: Algorithms, Chemistry, Computational chemistry, CUDA, Intel Xeon Phi, nVidia, OpenMP, Tesla K40, Thesis

September 5, 2016 by hgpu
