Non-separable 2D, 3D and 4D filtering with CUDA

Anders Eklund, Paul Dufort

Virginia Tech Carilion Research Institute, Virginia Tech, Roanoke, Virginia, USA

Chapter in book "GPU pro 5", A K Peters/CRC Press, pp. 465-487, 2014

@article{eklund2014non,
  title={Non-separable 2D, 3D and 4D filtering with CUDA},
  author={Eklund, Anders and Dufort, Paul},
  journal={GPU Pro},
  volume={5},
  pages={465--487},
  year={2014}
}

Source code package: NonSeparableFilteringCUDA

We have presented solutions for fast non-separable floating-point convolution in 2, 3 and 4 dimensions, using the CUDA programming language. We believe that these implementations will serve as a complement to the NPP library, which currently only supports 2D filters and images stored as integers. The shared memory implementation with loop unrolling is approximately twice as fast as the simple texture memory implementation, which is similar to results obtained by Nvidia for separable 2D convolution.

For 3D and 4D data it might seem strange to use convolution instead of an FFT, but the convolution approach can, for example, handle larger datasets. In our work on 4D image denoising, the FFT-based approach was on average only three times faster than spatial convolution (compared to about 30 times faster in the benchmarks given here). The main reason was the high resolution of the data (512 x 512 x 445 x 20 elements), which made it impossible to load all the data into global memory at once. Because of its higher memory consumption, the FFT-based approach could load fewer slices into global memory per run than the spatial approach. Since only a subset of the slices (and time points) is valid after filtering, the FFT-based approach needed more runs to process all the slices.

Finally, we note two additional topics that readers may wish to consider for more advanced study. First, applications in which several filters are applied simultaneously to the same data (for example, six complex-valued quadrature filters to estimate a local structure tensor in 3D) can lead to different conclusions about the relative performance of spatial convolution and FFT-based filtering. Second, filter networks can be used to speed up spatial convolution by combining the results of many small filter kernels, resulting in a proportionally higher gain for 3D and 4D than for 2D convolution.
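The memory argument above can be made concrete with a little arithmetic. The sketch below is not taken from the chapter: the function names, the filter size of 11 and the slices-per-run figures (64 and 32) are hypothetical values chosen only for illustration, and 4-byte floats are assumed. Only the data dimensions come from the text.

```python
import math


def dataset_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense array with the given shape.

    Assumes 4-byte (single precision) elements unless told otherwise.
    """
    return math.prod(shape) * bytes_per_element


def runs_needed(total_slices, slices_per_run, filter_size):
    """Number of runs needed to filter every slice along one dimension.

    Each run loads `slices_per_run` consecutive slices, but only
    slices_per_run - (filter_size - 1) of them have a complete
    neighbourhood and are therefore valid after filtering.
    """
    valid_per_run = slices_per_run - (filter_size - 1)
    if valid_per_run <= 0:
        raise ValueError("a run must hold more than filter_size - 1 slices")
    # Integer ceiling division.
    return -(-total_slices // valid_per_run)


# Data dimensions quoted in the text.
size = dataset_bytes((512, 512, 445, 20))
print(size, "bytes, about", round(size / 2**30, 2), "GiB")

# Hypothetical figures: the spatial approach fits 64 slices per run,
# the FFT-based approach only 32, with an 11-tap filter along the slice axis.
print("spatial runs:", runs_needed(445, 64, 11))  # 9
print("FFT runs:", runs_needed(445, 32, 11))      # 21
```

Under these assumed numbers, halving the slices per run more than doubles the number of runs, because the fixed overlap of filter_size - 1 slices takes a larger share of each smaller block.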

Tags: CUDA, FFT, Filtering, Image processing, nVidia, nVidia GeForce GTX 680, Package, Signal denoising

May 5, 2014 by hgpu
