CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms

Daren Lee, Ivo Dinov, Bin Dong, Boris Gutman, Igor Yanovsky, Arthur W. Toga

Laboratory of Neuro Imaging, David Geffen School of Medicine, UCLA, 635 Charles Young Drive South Suite 225, Los Angeles, CA 90095, USA

Computer Methods and Programs in Biomedicine (15 December 2010)

DOI:10.1016/j.cmpb.2010.10.013

@article{lee2010cuda,
  title={CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms},
  author={Lee, D. and Dinov, I. and Dong, B. and Gutman, B. and Yanovsky, I. and Toga, A.W.},
  journal={Computer Methods and Programs in Biomedicine},
  issn={0169-2607},
  year={2010},
  publisher={Elsevier}
}

As neuroimaging algorithms and technology continue to grow in complexity and image resolution faster than CPU performance improves, data-parallel computing methods will become increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computation times by orders of magnitude. However, this massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms on the CUDA architecture. For compute-bound algorithms, register usage is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and by maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, the data are fit into the fast but limited GPU resources by reorganizing them into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance is optimized for the algorithm’s access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm.
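To see why reducing per-thread register usage matters when a single thread block is meant to fill a multiprocessor, the following back-of-the-envelope sketch may help. It is not taken from the paper: the function name `max_threads_per_block` is hypothetical, and the defaults (16384 registers per multiprocessor, a 512-thread block limit, 32-thread warps) are assumptions typical of compute capability 1.3 hardware from that period. Real devices also allocate registers in coarser units, which this sketch ignores.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def max_threads_per_block(regs_per_thread, regs_per_sm=16384, hw_thread_limit=512):
    """Largest warp-aligned block size whose registers fit on one multiprocessor.

    regs_per_sm and hw_thread_limit are assumed values (compute capability
    1.3 class hardware), not figures reported in the abstract.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # Threads that fit if registers were the only constraint.
    by_registers = regs_per_sm // regs_per_thread
    # Apply the hardware cap on threads per block.
    capped = min(by_registers, hw_thread_limit)
    # Round down to a whole number of warps.
    return (capped // WARP_SIZE) * WARP_SIZE


if __name__ == "__main__":
    for regs in (64, 40, 32):
        print(regs, "registers/thread ->", max_threads_per_block(regs), "threads/block")
```

Under these assumptions, a kernel using 64 registers per thread is limited to 256 threads per block, 40 registers allows 384, and 32 registers reaches the 512-thread cap, which is the kind of gain that moving reused variables into shared memory is meant to unlock.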

Tags: CUDA, Image processing, Image registration, Medicine, nVidia

December 20, 2010 by hgpu

* * *

high performance computing on graphics processing units: hgpu.org