Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model
Fuxi AI Lab, NetEase Inc., Hangzhou, China
arXiv:2104.12470 [cs.CL], (26 Apr 2021)
@misc{li2021easy,
title={Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model},
author={Gongzheng Li and Yadong Xi and Jingzhen Ding and Duan Wang and Bai Liu and Changjie Fan and Xiaoxi Mao and Zeng Zhao},
year={2021},
eprint={2104.12470},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Ultra-large-scale pre-trained models can effectively improve performance on a wide variety of tasks, but they also impose a heavy computational burden at inference time. This paper introduces a series of optimization methods for ultra-large-scale pre-trained models that combine algorithmic characteristics with the hardware characteristics of GPU processors, and on that basis proposes an inference engine, Easy and Efficient Transformer (EET), which delivers a significant performance improvement over existing solutions. We first introduce a pre-padding decoding mechanism that improves token parallelism for generation tasks. We then design highly optimized kernels that remove sequence masks and make the computation for padding tokens cost-free, while also supporting long sequences and large embedding sizes. Third, we introduce a user-friendly inference system with a simple service pipeline that greatly reduces the difficulty of engineering deployment while sustaining high throughput. Compared to Faster Transformer's implementation of GPT-2 on an A100, EET achieves a state-of-the-art speedup of 1.5-15x, varying with context length. EET is available.
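The pre-padding idea the abstract describes, left-padding prompts so that every sequence in a batch ends at the same position and newly generated tokens stay aligned across the batch, can be illustrated outside EET with a standard Hugging Face GPT-2 model. This is a minimal sketch of the concept only, not EET's API; the model, tokenizer, and generation parameters below are illustrative assumptions.

```python
# Minimal sketch of pre-padding (left-padding) for batched generation.
# Shorter prompts are padded on the LEFT, so all prompts end at the same
# column and each decoding step appends tokens for the whole batch in
# lockstep, without per-sequence handling of trailing padding.
# Uses Hugging Face GPT-2 as a stand-in model; this is not EET's interface.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # pre-padding: pad tokens go in front

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompts = ["The quick brown fox", "Hello"]
# Left-padded batch: every prompt ends at the same index.
batch = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_new_tokens=8,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(out, skip_special_tokens=True):
    print(text)
```

With right-padding, each sequence would finish its prompt at a different position and the decoder would need per-sequence masking at every step; with pre-padding, the generated tokens occupy the same column for all sequences, which is what lets EET's kernels drop the sequence masks and treat padding-token computation as cost-free, as described in the abstract.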
May 2, 2021 by hgpu