high performance computing on graphics processing units: hgpu.org

Accelerating HPC codes on Intel(R) Omni-Path Architecture networks: From particle physics to Machine Learning

Peter Boyle, Michael Chuvelev, Guido Cossu, Christopher Kelly, Christoph Lehner, Lawrence Meadows

The University of Edinburgh

arXiv:1711.04883 [cs.DC], (13 Nov 2017)

Package: Grid: Data parallel C++ mathematical object library

We discuss practical methods to ensure near-wirespeed performance from clusters with either one or two Intel(R) Omni-Path host fabric interfaces (HFI) per node and Intel(R) Xeon Phi(TM) 72xx (Knights Landing) processors, running the Linux operating system. The study evaluates the performance improvements achievable and the required programming approaches in two distinct example problems: first, Cartesian communicator halo exchange, appropriate for the structured grid PDE solvers that arise in quantum chromodynamics simulations of particle physics; and second, gradient reduction, appropriate to synchronous stochastic gradient descent for machine learning. As an example, we accelerate a published Baidu Research reduction code and obtain a factor of ten speedup over the original code using the techniques discussed in this paper. This demonstrates how a factor of ten speedup in strongly scaled distributed machine learning could be achieved when synchronous stochastic gradient descent is massively parallelised with a fixed mini-batch size. We find a significant improvement in performance robustness when memory is obtained using carefully allocated 2MB "huge" virtual memory pages, implying that non-standard allocation routines should be used for communication buffers. These can be accessed via an LD_PRELOAD override in the manner suggested by libhugetlbfs. We make use of the Intel(R) MPI 2019 library "Technology Preview" and underlying software to enable thread concurrency throughout the communication software stack via multiple PSM2 endpoints per process and use of multiple independent MPI communicators. When using a single MPI process per node, we find that this greatly accelerates delivered bandwidth in many-core Intel(R) Xeon Phi processors.
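The abstract does not spell out the reduction algorithm it accelerates. As an illustration only, the sketch below assumes a ring all-reduce, a common pattern for the gradient reduction step of synchronous stochastic gradient descent. It simulates the ranks inside a single Python process with plain lists, so it shows the data movement but none of the networking, threading, or huge-page techniques the paper is about. The function name `ring_allreduce` and the chunking scheme are hypothetical and are not taken from the paper or from any library.

```python
def ring_allreduce(grads):
    """Simulate a sum all-reduce over a ring of ranks (illustrative sketch).

    grads[r] is the local gradient vector held by simulated rank r.
    Returns a list in which every rank holds the element-wise sum.
    """
    n = len(grads)
    length = len(grads[0])
    # Split every vector into n contiguous chunks, described by index bounds.
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    bufs = [list(g) for g in grads]  # copy so the caller's data is untouched

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod n
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends are "simultaneous".
        outgoing = []
        for r in range(n):
            c = (r - step) % n
            lo, hi = bounds[c]
            outgoing.append((c, bufs[r][lo:hi]))
        for r in range(n):
            dst = (r + 1) % n
            c, data = outgoing[r]
            lo, _ = bounds[c]
            for i, value in enumerate(data):
                bufs[dst][lo + i] += value
    # Now rank r holds the fully reduced chunk (r + 1) mod n.

    # Phase 2: all-gather. At step s, rank r forwards chunk (r + 1 - s) mod n
    # to its right-hand neighbour, which overwrites its stale copy.
    for step in range(n - 1):
        outgoing = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo, hi = bounds[c]
            outgoing.append((c, bufs[r][lo:hi]))
        for r in range(n):
            dst = (r + 1) % n
            c, data = outgoing[r]
            lo, hi = bounds[c]
            bufs[dst][lo:hi] = data
    return bufs


if __name__ == "__main__":
    # Four simulated ranks, each with a 7-element integer "gradient".
    local = [[rank + i for i in range(7)] for rank in range(4)]
    reduced = ring_allreduce(local)
    print(reduced[0])
```

In this scheme each rank sends roughly 2(n-1)/n times the vector length in total, independent of which rank it is, so the per-link bandwidth actually delivered tends to dominate the cost for large gradients. In a real MPI code each simulated send would be a point-to-point message between neighbouring ranks.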

Tags: Benchmarking, Computer science, Intel Xeon Phi, Machine learning, MPI, OpenMP, Package, Performance, Physics

November 16, 2017 by hgpu
