https://hgpu.org/?p=1533
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort