https://hgpu.org/?p=17713
A Fast and Generic GPU-Based Parallel Reduction Implementation