24939

Easy and Efficient Transformer: Scalable Inference Solution For large NLP mode

Gongzheng li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao
Fuxi AI Lab, NetEase Inc., Hangzhou, China
arXiv:2104.12470 [cs.CL], (26 Apr 2021)

@misc{li2021easy,

   title={Easy and Efficient Transformer : Scalable Inference Solution For large NLP mode},

   author={Gongzheng li and Yadong Xi and Jingzhen Ding and Duan Wang and Bai Liu and Changjie Fan and Xiaoxi Mao and Zeng Zhao},

   year={2021},

   eprint={2104.12470},

   archivePrefix={arXiv},

   primaryClass={cs.CL}

}

The ultra-large-scale pre-training model can effectively improve the effect of a variety of tasks, and it also brings a heavy computational burden to inference. This paper introduces a series of ultra-large-scale pre-training model optimization methods that combine algorithm characteristics and GPU processor hardware characteristics, and on this basis, propose an inference engine — Easy and Efficient Transformer (EET), Which has a significant performance improvement over the existing schemes. We firstly introduce a pre-padding decoding mechanism that improves token parallelism for generation tasks. Then we design high optimized kernels to remove sequence masks and achieve cost-free calculation for padding tokens, as well as support long sequence and long embedding sizes. Thirdly a user-friendly inference system with an easy service pipeline was introduced which greatly reduces the difficulty of engineering deployment with high throughput. Compared to Faster Transformer’s implementation for GPT-2 on A100, EET achieves a 1.5-15x state-of-art speedup varying with context length.EET is available.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors

Contact us: