high performance computing on graphics processing units: hgpu.org

SIMD Parallel Gibbs Sampling of Probabilistic Directed Acyclic Graphs

Alireza S. Mahani, Mansour T.A. Sharabiani

Sentrana Inc., Washington DC, USA

arXiv:1310.1537 [stat.CO], (6 Oct 2013)

@ARTICLE{2013arXiv1310.1537M,
  author={{Mahani}, A.~S. and {Sharabiani}, M.~T.~A.},
  title={{SIMD Parallel Gibbs Sampling of Probabilistic Directed Acyclic Graphs}},
  journal={ArXiv e-prints},
  archivePrefix={arXiv},
  eprint={1310.1537},
  primaryClass={stat.CO},
  keywords={Statistics – Computation, Computer Science – Artificial Intelligence, Computer Science – Distributed, Parallel, and Cluster Computing},
  year={2013},
  month={oct},
  adsurl={http://adsabs.harvard.edu/abs/2013arXiv1310.1537M},
  adsnote={Provided by the SAO/NASA Astrophysics Data System}
}

We present a single-chain parallelization strategy for Gibbs sampling of probabilistic Directed Acyclic Graphs, where contributions from child nodes to the conditional posterior distribution of a given node are calculated concurrently. For statistical models with many independent observations, such parallelism takes a Single-Instruction-Multiple-Data (SIMD) form, and can be efficiently implemented using multicore parallelization and vector instructions on x86 processors. Since all tasks have near-identical durations in SIMD parallelism, multicore parallelization benefits from static scheduling to minimize thread synchronization overhead. For multi-socket servers, a compact processor affinity minimizes cross-chip communication during the reduction phase, leading to better scaling of performance with the number of cores. Effective vectorization requires coherent memory access patterns, obtained for example by converting an array of node structures into a structure of arrays. When calculating each child node’s contribution involves a loop, e.g., to calculate the inner product of the covariate and coefficient vectors, manual unrolling of this inner loop is necessary to facilitate vectorization of the outer loop. After these optimizations, we achieve nearly 10x speedup using only 4 cores of an Intel x86-64 processor with Advanced Vector Extensions, even for datasets of modest size. SIMD parallel Gibbs can be combined with parallel sampling of conditionally-independent nodes for nested parallel Gibbs sampling of Hierarchical Bayesian models. Our optimization techniques improve the scaling of performance with the number of cores and the width of vector units, thus paving the way for further speedup on highly-parallel, SIMD-oriented coprocessors such as Intel Xeon Phi and Graphics Processing Units.
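As a rough illustration of two of the optimizations named in the abstract (converting an array of node structures into a structure of arrays, and manually unrolling the inner product), the following Python sketch sums the child-node contributions to the log conditional posterior of a coefficient vector. The Bernoulli-logit likelihood, the two-covariate layout, and the names aos_to_soa, log_lik_aos and log_lik_soa are assumptions made for this example and are not taken from the paper. Pure Python neither vectorizes the loop nor runs it on several cores; the sketch only shows the data layout and loop shape that compiled code would present to the compiler.

```python
import math


def aos_to_soa(nodes):
    """Convert an array of structures (one record per observation)
    into a structure of arrays (one contiguous list per field).

    Assumes every record has a response "y" and exactly two covariates "x".
    """
    return {
        "y": [n["y"] for n in nodes],
        "x0": [n["x"][0] for n in nodes],
        "x1": [n["x"][1] for n in nodes],
    }


def log_lik_aos(nodes, beta):
    """Reference version: array-of-structures layout with a generic
    inner loop over covariates nested inside the loop over observations."""
    total = 0.0
    for n in nodes:
        eta = sum(b * x for b, x in zip(beta, n["x"]))  # inner loop
        # Bernoulli-logit log-likelihood of one child node (assumed model).
        total += n["y"] * eta - math.log1p(math.exp(eta))
    return total


def log_lik_soa(soa, beta):
    """Structure-of-arrays version with the inner product unrolled by hand,
    so the only remaining loop is the one over observations."""
    b0, b1 = beta
    total = 0.0
    for y, x0, x1 in zip(soa["y"], soa["x0"], soa["x1"]):
        eta = b0 * x0 + b1 * x1  # unrolled inner product, no inner loop
        total += y * eta - math.log1p(math.exp(eta))
    return total
```

In compiled code, the single loop left in log_lik_soa is the one that would be divided statically among threads and vectorized, with the per-thread partial sums combined in a final reduction.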

Tags: Artificial intelligence, Bayesian, Intel Phi, Intel Xeon Phi, OpenMP, Statistics

October 22, 2013 by hgpu
