high performance computing on graphics processing units: hgpu.org

Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips

Johannes Hofmann, Jan Treibig, Georg Hager, Gerhard Wellein

Chair for Computer Architecture, University Erlangen-Nuremberg

arXiv:1401.7494 [cs.DC], (29 Jan 2014)

DOI:10.1145/2568058.2568068

@article{2014arXiv1401.7494H,
  author={Hofmann, Johannes and Treibig, Jan and Hager, Georg and Wellein, Gerhard},
  title={Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips},
  journal={ArXiv e-prints},
  archivePrefix={"arXiv"},
  eprint={1401.7494},
  primaryClass={"cs.DC"},
  keywords={Distributed, Parallel, and Cluster Computing,Performance},
  year={2014},
  month={jan}
}

Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance on current architectures and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by back projection, a vital operation in computed tomography applications. The underlying algorithm is challenging to vectorize because, apart from a streaming part, it also contains a bilinear interpolation that requires scattered access to image data. We analyze the performance of SSE (128 bit), AVX (256 bit), AVX2 (256 bit), and IMCI (512 bit) implementations on recent Intel x86 systems. Special emphasis is put on the vector gather implementation on the Intel Haswell and Knights Corner microarchitectures. Finally, we discuss why GPU implementations perform much better for this specific algorithm.
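To make the vectorization difficulty concrete, here is a minimal scalar sketch in C of a back projection update with bilinear interpolation. It is not the authors' code or the RabbitCT reference implementation: the function names, the row-major 3x4 projection matrix `A`, the zero value for samples outside the image, and the 1/w² weighting are assumptions chosen for illustration. The loop over voxels is the streaming part; the four loads in `bilinear_sample` hit data-dependent addresses, which is the scattered access a SIMD version would have to handle with gather operations.

```c
#include <stddef.h>

/* Hypothetical helper: bilinearly interpolate img at (x, y).
   Samples whose 2x2 neighbourhood leaves the image return 0 (an assumption). */
static float bilinear_sample(const float *img, int width, int height,
                             float x, float y)
{
    if (x < 0.0f || y < 0.0f ||
        x >= (float)(width - 1) || y >= (float)(height - 1))
        return 0.0f;

    int x0 = (int)x;               /* truncation equals floor for x >= 0 */
    int y0 = (int)y;
    float fx = x - (float)x0;      /* fractional offsets inside the cell */
    float fy = y - (float)y0;

    /* Four loads from two rows; the addresses depend on the voxel. */
    const float *row0 = img + (size_t)y0 * (size_t)width;
    const float *row1 = row0 + width;
    float top    = row0[x0] * (1.0f - fx) + row0[x0 + 1] * fx;
    float bottom = row1[x0] * (1.0f - fx) + row1[x0 + 1] * fx;
    return top * (1.0f - fy) + bottom * fy;
}

/* Hypothetical helper: update one line of n voxels along x from a single
   projection image. A is an assumed row-major 3x4 projection matrix. */
static void backproject_line(float *vol_line, int n,
                             const float *img, int width, int height,
                             const float A[12],
                             float wx0, float dx, float wy, float wz)
{
    for (int i = 0; i < n; ++i) {
        float wx = wx0 + dx * (float)i;

        /* Streaming part: project the voxel into detector coordinates. */
        float u = A[0] * wx + A[1] * wy + A[2]  * wz + A[3];
        float v = A[4] * wx + A[5] * wy + A[6]  * wz + A[7];
        float w = A[8] * wx + A[9] * wy + A[10] * wz + A[11];
        float inv_w = 1.0f / w;

        /* Scattered part: interpolate and accumulate with 1/w^2 weighting. */
        vol_line[i] += inv_w * inv_w *
                       bilinear_sample(img, width, height,
                                       u * inv_w, v * inv_w);
    }
}
```

In this sketch, neighbouring voxels in a line generally map to image positions that are not contiguous in memory, so a vectorized variant cannot simply use packed loads for the four interpolation inputs.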

Tags: Algorithms, Computed tomography, Computer science, CT, CUDA, Image reconstruction, Intel Xeon Phi, Medicine, nVidia, nVidia GeForce GTX 680, Performance

January 30, 2014 by hgpu


