high performance computing on graphics processing units: hgpu.org


Efficient Inference For Neural Machine Translation

Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, Ilya Chatsviorkin

Apple Inc.

arXiv:2010.02416 [cs.CL], (7 Oct 2020)

@misc{hsu2020efficient,
  title={Efficient Inference For Neural Machine Translation},
  author={Yi-Te Hsu and Sarthak Garg and Yi-Hsiu Liao and Ilya Chatsviorkin},
  year={2020},
  eprint={2010.02416},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}


Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without sacrificing translation quality. We conduct an empirical study that stacks various approaches and demonstrates that a combination of replacing decoder self-attention with simplified recurrent units, adopting a deep encoder and shallow decoder architecture, and pruning multi-head attention can achieve up to 109% and 84% speedup on CPU and GPU, respectively, and reduce the number of parameters by 25% while maintaining the same translation quality in terms of BLEU.
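To put the reported speedups in more familiar terms, the short sketch below converts a percentage speedup into a throughput ratio and a relative inference time. It assumes that "X% speedup" means throughput is (1 + X/100) times the baseline. The abstract does not spell out that definition, so treat it as an interpretation. The function name `speedup_to_ratios` is hypothetical and used only for illustration.

```python
from typing import Tuple


def speedup_to_ratios(speedup_percent: float) -> Tuple[float, float]:
    """Convert a percentage speedup into (throughput ratio, relative time).

    Assumption: "X% speedup" means throughput is (1 + X/100) times the
    baseline, so the time per translated sentence shrinks by the inverse.
    """
    if speedup_percent <= -100.0:
        raise ValueError("speedup must be greater than -100%")
    throughput_ratio = 1.0 + speedup_percent / 100.0
    relative_time = 1.0 / throughput_ratio
    return throughput_ratio, relative_time


# Figures quoted in the abstract: up to 109% on CPU and 84% on GPU.
for device, pct in (("CPU", 109.0), ("GPU", 84.0)):
    ratio, rel_time = speedup_to_ratios(pct)
    print(f"{device}: {ratio:.2f}x throughput, {rel_time:.1%} of baseline time")
```

Under this reading, the best-case CPU configuration would run at roughly 2.09 times the baseline throughput and the GPU configuration at roughly 1.84 times.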

Tags: Computer science, NLP, nVidia, Tesla V100

October 11, 2020 by hgpu

