https://hgpu.org/?p=10937
An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation