https://hgpu.org/?p=2523
Fast GPGPU Data Rearrangement Kernels using CUDA