high performance computing on graphics processing units: hgpu.org

FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs

Guoping Long, Jun Yang, Kai Zhu, Wei Lin

Alibaba Inc.

arXiv:1811.05213 [cs.DC], (13 Nov 2018)

In recent years, there has been a surge in machine learning applications in industry. Many of them are based on popular AI frameworks such as TensorFlow, Torch, Caffe, or MXNet, and are powered by accelerator platforms such as GPUs. One important challenge of running TensorFlow computations on GPUs is the fine granularity problem: the FLOPs of individual ops are far from enough to fully exploit the computing power of the underlying accelerators. The XLA framework provides a solid foundation for exploring this problem further. In this paper, we propose FusionStitching, a novel, comprehensive op fusion and code generation system that stitches computations into large GPU kernels. Experimental results on four public models and two of our large in-house applications show a further 55% (geometric mean) reduction in GPU kernel launches compared to the XLA fusion baseline. This improves the end-to-end performance of both of our latency-critical in-house applications by up to 20%.
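
To make the fine granularity problem concrete, the toy sketch below models each op as one "kernel launch" (one full pass over the data) and then greedily merges consecutive fusible ops into a single pass. It is only an illustration of the general idea of op fusion, not the FusionStitching algorithm or XLA's fusion pass; the names `Op`, `fusible`, `run`, and `fuse` are hypothetical and introduced here for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Op:
    """A toy op: a scalar function applied to every element of the input."""
    name: str
    fn: Callable[[float], float]
    fusible: bool = True  # assumption: False stands in for ops a fuser must not merge


def run(ops: List[Op], data: List[float]) -> Tuple[List[float], int]:
    """Execute ops in order; each op is one 'launch' (one pass over the data)."""
    launches = 0
    for op in ops:
        data = [op.fn(x) for x in data]
        launches += 1
    return data, launches


def fuse(ops: List[Op]) -> List[Op]:
    """Greedily merge runs of consecutive fusible ops into single composed ops."""
    fused: List[Op] = []
    for op in ops:
        if fused and fused[-1].fusible and op.fusible:
            prev = fused[-1]
            # Bind prev.fn and op.fn as defaults so each lambda keeps its own pair.
            composed = lambda x, f=prev.fn, g=op.fn: g(f(x))
            fused[-1] = Op(prev.name + "+" + op.name, composed)
        else:
            fused.append(op)
    return fused


ops = [
    Op("mul2", lambda x: x * 2.0),
    Op("add1", lambda x: x + 1.0),
    Op("relu", lambda x: max(x, 0.0)),
]
data = [-1.0, 0.5, 2.0]

unfused_out, unfused_launches = run(ops, data)
fused_out, fused_launches = run(fuse(ops), data)

print(unfused_launches, fused_launches)  # 3 launches unfused, 1 launch fused
print(unfused_out == fused_out)          # fusion does not change the result
```

In this model the three small ops collapse into one pass over the data, which mirrors why fewer, larger kernels reduce launch overhead. A real fuser must additionally reason about data dependencies, tensor shapes, and on-chip memory, none of which this sketch attempts.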

Tags: Code generation, Computer science, CUDA, Deep learning, Machine learning, nVidia, TensorFlow

November 18, 2018 by hgpu
