https://hgpu.org/?p=20104
Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures