https://hgpu.org/?p=16688
A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs