https://hgpu.org/?p=9605
Parallelizing General Histogram Application for CUDA Architectures