https://hgpu.org/?p=3384
An Empirically Optimized Radix Sort for GPU