high performance computing on graphics processing units: hgpu.org


FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads

Guoping Long, Jun Yang, Wei Lin

Alibaba Group

arXiv:1911.11576 [cs.DC], (24 Nov 2019)

@misc{long2019fusionstitching,
  title={FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads},
  author={Guoping Long and Jun Yang and Wei Lin},
  year={2019},
  eprint={1911.11576},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}


Performance optimization is the art of continuously seeking a harmonious mapping between the application domain and hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for optimizing such workloads mainly focuses on compute-intensive ops (GEMM, convolution, etc.). Yet we show in this work that the performance of memory-intensive computations is vital to end-to-end (E2E) performance in practical DL models. We propose FusionStitching, an optimization framework capable of fusing memory-intensive elementwise, reduction, and fine-grained GEMM/Batched-GEMM ops, with or without data dependences, into large computation units, then mapping and transforming them into efficient GPU kernels. We formulate the fusion plan optimization as an integer linear programming (ILP) problem, and propose a set of empirical heuristics to reduce the combinatorial search space. In order to map optimized fusion plans to hardware, we propose a technique to effectively compose various groups of computations into a single GPU kernel by fully leveraging on-chip resources such as scratchpads and registers. Experimental results on six benchmarks and four industry-scale practical models are encouraging. Overall, FusionStitching achieves up to 5.7x speedup over the TensorFlow baseline, and 1.25x to 1.85x speedup over the current state of the art, with 1.4x on average (geometric mean).
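To make the motivation concrete, here is a minimal Python sketch of why fusing memory-intensive ops helps. It is not the FusionStitching implementation: the function names, the choice of ops (multiply, tanh, sum), and the simple access-counting model are all assumptions made for illustration. The unfused version runs three separate "kernels" and writes each intermediate to memory, while the fused version keeps intermediates in local variables, standing in for registers or scratchpad.

```python
import math

def unfused(x, y):
    """Three separate passes; each intermediate is written out and read back."""
    n = len(x)
    traffic = 0  # hypothetical count of element reads and writes to off-chip memory

    a = [xi * yi for xi, yi in zip(x, y)]  # pass 1: elementwise multiply
    traffic += 2 * n + n                   # read x and y, write a

    b = [math.tanh(ai) for ai in a]        # pass 2: elementwise tanh
    traffic += n + n                       # read a, write b

    s = 0.0
    for bi in b:                           # pass 3: sum reduction
        s += bi
    traffic += n + 1                       # read b, write the scalar result
    return s, traffic

def fused(x, y):
    """One pass; intermediates never leave local variables."""
    n = len(x)
    s = 0.0
    for xi, yi in zip(x, y):
        s += math.tanh(xi * yi)            # multiply, tanh, and accumulate together
    traffic = 2 * n + 1                    # read x and y, write the scalar result
    return s, traffic
```

Under this toy model, four-element inputs cost 25 memory accesses unfused and 9 fused, while both versions return the same sum. A real GPU kernel would also have to handle thread mapping and cross-thread reduction, which this sketch ignores.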

Tags: Computer science, CUDA, Deep learning, nVidia, Performance, Tesla V100

December 1, 2019 by hgpu
