https://hgpu.org/?p=18910
Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs