https://hgpu.org/?p=2416
Parallel Prefix Sum (Scan) with CUDA