high performance computing on graphics processing units: hgpu.org

Easy and Efficient Transformer: Scalable Inference Solution For large NLP model

Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao

Fuxi AI Lab, NetEase Inc., Hangzhou, China

arXiv:2104.12470 [cs.CL], (26 Apr 2021)

Ultra-large-scale pre-trained models can effectively improve results on a variety of tasks, but they also bring a heavy computational burden at inference time. This paper introduces a series of optimization methods for ultra-large-scale pre-trained models that combine algorithm characteristics with GPU hardware characteristics and, on this basis, proposes an inference engine, Easy and Efficient Transformer (EET), which delivers a significant performance improvement over existing schemes. First, we introduce a pre-padding decoding mechanism that improves token parallelism for generation tasks. Second, we design highly optimized kernels that remove sequence masks and achieve cost-free calculation for padding tokens, as well as supporting long sequences and large embedding sizes. Third, we introduce a user-friendly inference system with an easy service pipeline, which greatly reduces the difficulty of engineering deployment while delivering high throughput. Compared to Faster Transformer's implementation for GPT-2 on A100, EET achieves a state-of-the-art speedup of 1.5-15x, varying with context length. EET is available.
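The abstract names a pre-padding decoding mechanism without giving details. As a rough illustration only, the sketch below shows one plausible reading of the idea: padding each prompt on the left so that all rows in a batch end at the same column and every generated token lands at the same position. The helper names (`pre_pad`, `append_step`), the padding id, and the token values are hypothetical and are not taken from EET's code or API.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id, chosen only for this illustration


def pre_pad(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[int]]:
    """Left-pad every prompt to the length of the longest prompt in the batch.

    Returns the padded batch and the number of padding tokens added per row.
    Assumes the batch is non-empty.
    """
    max_len = max(len(seq) for seq in batch)
    pad_counts = [max_len - len(seq) for seq in batch]
    padded = [[pad_id] * n + seq for n, seq in zip(pad_counts, batch)]
    return padded, pad_counts


def append_step(padded: List[List[int]], next_tokens: List[int]) -> List[List[int]]:
    """Append one generated token per row; every row grows at the same column."""
    return [row + [tok] for row, tok in zip(padded, next_tokens)]


# Three prompts of different lengths (made-up token ids).
prompts = [[11, 12, 13], [21], [31, 32]]
padded, pad_counts = pre_pad(prompts)

# One decoding step with made-up next tokens: all rows stay aligned.
stepped = append_step(padded, [14, 22, 33])
```

Because the padding sits in front of the real tokens, the last real token of every prompt shares one column index, so a decoding step can write to a single position for the whole batch. The per-row padding counts are what a kernel would need in order to skip the padded positions; how EET actually does this is not described in the abstract.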

Tags: Computer science, CUDA, NLP, nVidia, nVidia GeForce RTX 2080 Ti, Package, Tesla A100

May 2, 2021 by hgpu
