high performance computing on graphics processing units: hgpu.org


Performance Evaluation of the Intel Many Integrated Core Architecture for 3D Image Reconstruction in Computed Tomography

Johannes Hofmann

Department of Computer Science and Erlangen Regional Computer Center, High Performance Computing Group, Friedrich-Alexander-University Erlangen-Nuremberg

Friedrich-Alexander-University Erlangen-Nuremberg, 2013

@article{hofmann2013performance,
  title={Performance Evaluation of the Intel Many Integrated Core Architecture for 3D Image Reconstruction in Computed Tomography},
  author={Hofmann, Johannes},
  year={2013}
}


The computational effort of 3D image reconstruction in Computed Tomography (CT) has long required special-purpose hardware. Custom-built FPGA systems and GPUs are still widely used today, in particular in interventional settings, where radiologists impose a hard time constraint on reconstruction. However, it has recently been shown that even commodity CPUs are capable of performing the reconstruction within the imposed time constraint. In this thesis, we examine the suitability of the Intel Many Integrated Core (MIC) architecture for running the Feldkamp-Davis-Kress (FDK) algorithm, the most commonly used algorithm for 3D image reconstruction in cone-beam computed tomography. Compared to traditional CPUs, the MIC accelerator card, which focuses on numerical applications, is expected to deliver higher performance using the same programming models such as C, C++, and Fortran. A thorough analysis of the MIC architecture is performed to determine potential hardware bottlenecks and to distinguish its design from a current state-of-the-art two-socket Intel Sandy Bridge EP CPU system. We study the challenges of efficiently parallelizing the FDK kernel on the Intel MIC and find that careful OpenMP scheduling and thread placement are required due to the lack of a shared last-level cache. Efficient data sharing on the Intel MIC can only occur between hardware threads of a core via its local L1 and L2 cache segments. Apart from parallelization, SIMD vectorization is critical for good performance on the Intel MIC, whose vector registers are twice the size of those found in contemporary CPUs. To classify the difficulty of harnessing the full potential of vectorization on the MIC platform, we explore various approaches to vectorizing the kernel: auto-vectorization using the Intel C Compiler and the Intel SPMD Compiler, as well as manual vectorization using C with intrinsics and manual assembly coding.
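To make the parallelization and vectorization challenges more concrete, the following is a minimal sketch of a voxel-driven back-projection loop of the kind used in FDK-style reconstruction. It is not the thesis code: the function names, the row-major 3x4 projection matrix `A`, the cubic `n` x `n` x `n` volume layout, and the `schedule(static)` clause are all assumptions made for illustration. The point it shows is that each voxel reads four detector pixels at a data-dependent position, so neighboring voxels generally load from non-contiguous addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Bilinear interpolation at detector position (u, v).
 * Returns 0 when the 2x2 neighborhood is not fully inside the image. */
static float sample_bilinear(const float *img, int width, int height,
                             float u, float v)
{
    /* The negated form also rejects NaN coordinates. */
    if (!(u >= 0.0f && v >= 0.0f &&
          u < (float)(width - 1) && v < (float)(height - 1)))
        return 0.0f;

    int iu = (int)u;              /* truncation equals floor for u >= 0 */
    int iv = (int)v;
    float fu = u - (float)iu;
    float fv = v - (float)iv;

    /* Four loads at a data-dependent address: this is the scattered access. */
    const float *p = img + (size_t)iv * (size_t)width + (size_t)iu;
    float top = (1.0f - fu) * p[0]     + fu * p[1];
    float bot = (1.0f - fu) * p[width] + fu * p[width + 1];
    return (1.0f - fv) * top + fv * bot;
}

/* Accumulate one projection image into an n*n*n volume (hypothetical layout).
 * A is an assumed row-major 3x4 projection matrix; origin and spacing map
 * voxel indices to world coordinates along every axis. */
void backproject(float *vol, int n, const float *img, int width, int height,
                 const float A[12], float origin, float spacing)
{
    assert(vol != NULL && img != NULL && n > 0);

    /* Assumed scheduling: one contiguous block of z-slices per thread. */
    #pragma omp parallel for schedule(static)
    for (int z = 0; z < n; ++z) {
        float wz = origin + (float)z * spacing;
        for (int y = 0; y < n; ++y) {
            float wy = origin + (float)y * spacing;
            for (int x = 0; x < n; ++x) {
                float wx = origin + (float)x * spacing;

                /* Project the voxel center onto the detector plane. */
                float u = A[0] * wx + A[1] * wy + A[2]  * wz + A[3];
                float v = A[4] * wx + A[5] * wy + A[6]  * wz + A[7];
                float w = A[8] * wx + A[9] * wy + A[10] * wz + A[11];
                if (w <= 0.0f)
                    continue;     /* voxel is not in front of the source */

                float iw = 1.0f / w;
                size_t idx = ((size_t)z * (size_t)n + (size_t)y) * (size_t)n
                             + (size_t)x;
                /* Distance weighting with 1/w^2, then accumulate. */
                vol[idx] += iw * iw *
                    sample_bilinear(img, width, height, u * iw, v * iw);
            }
        }
    }
}
```

The arithmetic in the innermost loop maps naturally onto SIMD lanes along `x`, whereas the four pixel loads inside `sample_bilinear` do not, because their addresses differ from lane to lane. Without an OpenMP-enabled compiler the pragma is ignored and the loop simply runs serially.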
We used the fastest available CPU implementation from Treibig et al., developed for the RabbitCT benchmarking framework, as the starting point for our optimizations. By making improvements to the original implementation, we speed up execution by 25% on the CPU. In line with the estimate of our performance model, measurements on the Intel MIC deliver a speedup of 1.5 in comparison to the reference CPU system. Our analysis reveals that the major bottleneck for our application is a shortcoming in hardware: the majority of the data required for the reconstruction is scattered in memory, and gathering this data into vector registers for processing is still done sequentially on the Intel MIC. While computations in the kernel benefit from vectorization, the sequential loading limits the maximum achievable speedup in accordance with Amdahl's law.
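The Amdahl's law argument can be written out as a short worked example. The symbols and numbers below are illustrative assumptions, not measurements from the thesis: p denotes the fraction of kernel run time that benefits from vectorization, and N the number of SIMD lanes (16 if single-precision values in 512-bit registers are assumed).

```latex
S(p, N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(p, N) = \frac{1}{1 - p}

% Illustrative values only: p = 0.5, N = 16
S(0.5, 16) = \frac{1}{0.5 + 0.5/16} = \frac{1}{0.53125} \approx 1.88
```

Under these assumed values, even arbitrarily wide vector registers could not push the speedup beyond 2, because the sequentially gathered portion (1 - p) stays fixed.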

Tags: Algorithms, Computed tomography, CT, CUDA, Image processing, Image reconstruction, Intel Xeon Phi, Medicine, nVidia, nVidia GeForce GTX 680, Thesis, Tomography

January 25, 2014 by hgpu
