https://hgpu.org/?p=28354
Efficient GPU implementation of a class of array permutations