high performance computing on graphics processing units: hgpu.org


Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors

Amrita Mathuriya, Ye Luo, Anouar Benali, Luke Shulenburger, Jeongnim Kim

Intel Corporation

arXiv:1611.02665 [cs.DC], (8 Nov 2016)

@article{mathuriya2016optimization,
  title={Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors},
  author={Mathuriya, Amrita and Luo, Ye and Benali, Anouar and Shulenburger, Luke and Kim, Jeongnim},
  year={2016},
  month={nov},
  archivePrefix={"arXiv"},
  primaryClass={cs.DC}
}


B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, and their evaluation has historically taken as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to use the caches and wide vector units of modern CPUs efficiently. We present node-level optimizations of B-spline evaluations on multi/many-core shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply a data layout transformation from array-of-structures (AoS) to structure-of-arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in B-spline orbital evaluation kernels, paving the way towards strong scaling of QMC simulations. These optimizations are portable across four distinct cache-coherent architectures and result in performance enhancements of up to 5.6x on the Intel Xeon Phi processor 7250P (KNL), 5.7x on the Intel Xeon Phi coprocessor 7120P, 10x on an Intel Xeon processor E5v4 CPU and 9.5x on the BlueGene/Q processor. Our nested threading implementation shows nearly ideal parallel efficiency on KNL up to 16 threads. We employ roofline performance analysis to model the impact of our optimizations. This work, combined with our current efforts to optimize other QMC kernels, results in a greater than 4.5x speedup of miniQMC on KNL.
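The abstract does not show the kernels themselves, so the following is only a minimal sketch of the two layout ideas it names: transposing array-of-structures records into a structure-of-arrays, then blocking the result into tiles. The record fields (`value`, `gx`, `gy`, `gz`), the function names, and the block size are all hypothetical, chosen for illustration. Pure Python does not expose SIMD units, so this shows the data movement only, not the performance effect.

```python
from array import array

# Hypothetical component names for one orbital: value plus a 3-component gradient.
COMPONENTS = ("value", "gx", "gy", "gz")

def make_aos(n_orbitals):
    """Build an array-of-structures: one (value, gx, gy, gz) record per orbital."""
    return [(float(i), 0.1 * i, 0.2 * i, 0.3 * i) for i in range(n_orbitals)]

def aos_to_soa(aos):
    """Transpose AoS records into one contiguous array of doubles per component."""
    return {
        name: array("d", (record[k] for record in aos))
        for k, name in enumerate(COMPONENTS)
    }

def block_soa(soa, block_size):
    """Split every component array into tiles of at most block_size orbitals."""
    n_orbitals = len(soa[COMPONENTS[0]])
    return [
        {name: comp[start:start + block_size] for name, comp in soa.items()}
        for start in range(0, n_orbitals, block_size)
    ]

def sum_values_blocked(blocks):
    """Walk the tiles one at a time, touching only the 'value' stream of each."""
    total = 0.0
    for block in blocks:
        total += sum(block["value"])
    return total
```

In a compiled language, each per-component array is a unit-stride stream that a vector unit can load directly, and the tile size would presumably be tuned so that one tile's working set stays in cache. Both points are general properties of SoA and blocking rather than details taken from the paper.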

Tags: Computer science, Intel Xeon Phi, QMC

November 10, 2016 by hgpu
