https://hgpu.org/?p=5714
Thread Block Compaction for Efficient SIMT Control Flow