Non-separable 2D, 3D and 4D filtering with CUDA

Anders Eklund, Paul Dufort
Virginia Tech Carilion Research Institute, Virginia Tech, Roanoke, Virginia, USA
Chapter in the book "GPU Pro 5", A K Peters/CRC Press, pp. 465-487, 2014










We have presented solutions for fast non-separable floating-point convolution in 2, 3 and 4 dimensions, implemented in CUDA. We believe that these implementations will serve as a complement to the NPP library, which currently only supports 2D filters and images stored as integers. The shared memory implementation with loop unrolling is approximately twice as fast as the simple texture memory implementation, which is similar to the results obtained by NVIDIA for separable 2D convolution. For 3D and 4D data it might seem strange to use convolution instead of an FFT, but the convolution approach can, for example, handle larger datasets. In our work on 4D image denoising, the FFT-based approach was on average only three times faster than spatial convolution (compared to about 30 times faster in the benchmarks given here). The main reason for this was the high resolution of the data (512 x 512 x 445 x 20 elements), which made it impossible to load all the data into global memory at once. Due to its higher memory consumption, the FFT-based approach was forced to load fewer slices into global memory per run than the spatial approach. As only a subset of the slices (and time points) is valid after the filtering, the FFT-based approach required a larger number of runs to process all the slices. Finally, we close by noting two additional topics that readers may wish to consider for more advanced study. First, applications in which several filters are applied simultaneously to the same data (for example, six complex-valued quadrature filters to estimate a local structure tensor in 3D) can lead to different conclusions regarding the performance of spatial convolution versus FFT-based filtering. Second, filter networks can be used to speed up spatial convolution by combining the results of many small filter kernels, resulting in a proportionally higher gain for 3D and 4D than for 2D convolution.
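The point about valid slices can be made concrete with a little arithmetic. The sketch below is an illustration only and is not taken from the chapter: the function names, the filter size of 11 and the two chunk sizes (64 and 32 slices per run) are hypothetical. It assumes that a filter of size k along the slice axis leaves chunk - (k - 1) valid output slices per run, and that the halo at the ends of the volume is supplied by padding. Only the total of 445 slices comes from the text above.

```python
def valid_slices(chunk: int, filter_size: int) -> int:
    """Output slices per run that are not affected by the chunk borders."""
    valid = chunk - (filter_size - 1)
    if valid <= 0:
        raise ValueError("chunk must be larger than filter_size - 1")
    return valid


def runs_needed(total: int, chunk: int, filter_size: int) -> int:
    """Runs required to produce `total` output slices from overlapping chunks."""
    valid = valid_slices(chunk, filter_size)
    return (total + valid - 1) // valid  # integer ceiling of total / valid


if __name__ == "__main__":
    total_slices = 445   # slice count quoted in the text above
    filter_size = 11     # hypothetical filter extent along the slice axis
    spatial_chunk = 64   # hypothetical: slices per run, spatial convolution
    fft_chunk = 32       # hypothetical: fewer slices fit per run with the FFT

    for name, chunk in (("spatial", spatial_chunk), ("FFT", fft_chunk)):
        print(name, valid_slices(chunk, filter_size),
              runs_needed(total_slices, chunk, filter_size))
```

Under these assumed numbers, halving the chunk size more than doubles the number of runs, because the fixed halo of filter_size - 1 slices takes up a larger share of each smaller chunk.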