https://hgpu.org/?p=26072
COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs