https://hgpu.org/?p=19212
FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads