
Performance Evaluation of the Intel Many Integrated Core Architecture for 3D Image Reconstruction in Computed Tomography

Johannes Hofmann
Department of Computer Science and Erlangen Regional Computer Center, High Performance Computing Group, Friedrich-Alexander-University Erlangen-Nuremberg
Friedrich-Alexander-University Erlangen-Nuremberg, 2013

@article{hofmann2013performance,
   title={Performance Evaluation of the Intel Many Integrated Core Architecture for 3D Image Reconstruction in Computed Tomography},
   author={Hofmann, Johannes},
   year={2013}
}


The computational effort of 3D image reconstruction in Computed Tomography (CT) has long required special-purpose hardware. Custom-built FPGA systems and GPUs are still widely used today, in particular in interventional settings, where the reconstruction must complete within a hard time constraint set by radiologists. However, it has recently been shown that even commodity CPUs are now capable of performing the reconstruction within the imposed time constraint. In this thesis, we examine the Intel Many Integrated Core (MIC) architecture for its suitability to run the Feldkamp-Davis-Kress (FDK) algorithm, the most commonly used algorithm for 3D image reconstruction in cone-beam computed tomography. In comparison to traditional CPUs, the MIC accelerator card, which focuses on numerical applications, is expected to deliver higher performance using the same programming models, such as C, C++, and Fortran. A thorough analysis of the MIC architecture is performed to determine potential hardware bottlenecks and to distinguish its design from a current state-of-the-art two-socket Intel Sandy Bridge EP CPU system. We study the challenges of efficiently parallelizing the FDK kernel on the Intel MIC and find that careful OpenMP scheduling and thread placement are required due to the lack of a shared last-level cache. Efficient data sharing on the Intel MIC can only occur between hardware threads of a core via its local L1 and L2 cache segments. Apart from parallelization, SIMD vectorization is critical for good performance on the Intel MIC, whose vector registers are twice the size of the vector registers found in contemporary CPUs. To classify the difficulty of harnessing the full potential of vectorization on the MIC platform, we explore various approaches to vectorizing the kernel: auto-vectorization using the Intel C Compiler and the Intel SPMD Compiler, as well as manual vectorization using C with intrinsics and manual assembly coding.
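As an illustration of the kind of kernel discussed above, the following is a minimal C sketch of a voxel-driven backprojection loop. It is not the thesis implementation: the function names `pixel` and `backproject`, the use of voxel indices as world coordinates, the row-major 3x4 projection matrix `A`, and the `schedule(static)` clause are all assumptions made for illustration. The OpenMP pragma is simply ignored if the code is compiled without OpenMP support.

```c
#include <stddef.h>

/* Hypothetical helper: read pixel (iu, iv); returns 0 outside the image. */
static float pixel(const float *img, int width, int height, int iu, int iv)
{
    if (iu < 0 || iv < 0 || iu >= width || iv >= height)
        return 0.0f;
    return img[(size_t)iv * width + iu];
}

/* Round towards negative infinity without needing libm. */
static int floor_to_int(double x)
{
    int i = (int)x;
    return ((double)i > x) ? i - 1 : i;
}

/* Accumulate one projection image into an n*n*n volume.
 * A is an assumed row-major 3x4 projection matrix; voxel indices are
 * used directly as world coordinates to keep the sketch short. */
void backproject(float *vol, int n, const float *img, int width, int height,
                 const double A[12])
{
    /* Placeholder scheduling choice; the text notes that scheduling and
     * thread placement need careful tuning on the MIC. */
    #pragma omp parallel for schedule(static)
    for (int z = 0; z < n; ++z) {
        for (int y = 0; y < n; ++y) {
            for (int x = 0; x < n; ++x) {
                /* Project the voxel onto the detector plane. */
                double w = A[8] * x + A[9] * y + A[10] * z + A[11];
                double u = (A[0] * x + A[1] * y + A[2] * z + A[3]) / w;
                double v = (A[4] * x + A[5] * y + A[6] * z + A[7]) / w;

                int iu = floor_to_int(u);
                int iv = floor_to_int(v);
                float fu = (float)(u - iu);
                float fv = (float)(v - iv);

                /* Four loads whose addresses depend on the voxel position. */
                float p00 = pixel(img, width, height, iu,     iv);
                float p10 = pixel(img, width, height, iu + 1, iv);
                float p01 = pixel(img, width, height, iu,     iv + 1);
                float p11 = pixel(img, width, height, iu + 1, iv + 1);

                /* Bilinear interpolation followed by distance weighting. */
                float val = (1.0f - fv) * ((1.0f - fu) * p00 + fu * p10)
                          +         fv  * ((1.0f - fu) * p01 + fu * p11);
                vol[((size_t)z * n + y) * n + x] += (float)(val / (w * w));
            }
        }
    }
}
```

The arithmetic in the innermost loop is regular across neighboring voxels and therefore a natural target for SIMD, whereas the four pixel loads use data-dependent addresses.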
We used the fastest available CPU implementation from Treibig et al., developed for the RabbitCT benchmarking framework, as the starting point for our optimizations. By making improvements to the original implementation, we speed up execution by 25% on the CPU. In line with the estimate of our performance model, measurements on the Intel MIC deliver a speedup of 1.5 in comparison to the reference CPU system. Our analysis reveals that the major bottleneck for our application is a shortcoming in hardware: the majority of the data required for the reconstruction is scattered in memory, and gathering this data into vector registers for processing is still done sequentially on the Intel MIC. While the computations in the kernel benefit from vectorization, the sequential loading limits the maximum achievable speedup in accordance with Amdahl’s law.
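The Amdahl's law argument can be made concrete with a small C sketch. The function name `amdahl_speedup` and the example fractions are hypothetical and are not measurements from the thesis; the factor of 16 assumes single-precision values in a 512-bit vector register.

```c
/* Amdahl's law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor n and the remaining (1 - p) stays sequential.
 * Both arguments are illustrative inputs, not values from the thesis. */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, under the assumption that only half of the kernel's runtime vectorizes perfectly across 16 lanes, `amdahl_speedup(0.5, 16.0)` evaluates to roughly 1.88, far below the ideal factor of 16. This is the sense in which sequential gathering caps the benefit of wider vector registers.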