Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation
LIUM, University of Le Mans, 72085 Le Mans cedex 9, France
NAACL workshop on the Future of Language Modeling, 2012
@article{schwenk2012large,
title={Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation},
author={Schwenk, H. and Rousseau, A. and Attik, M.},
journal={NAACL-HLT 2012},
pages={11},
year={2012}
}
Language models play an important role in large vocabulary speech recognition and statistical machine translation systems. Back-off language models have been the dominant approach for several decades. Some years ago, there was a clear tendency to build huge language models trained on hundreds of billions of words. Lately, this tendency has changed, and recent work concentrates on data selection. Continuous space methods are a very competitive approach, but they have a high computational complexity and are not yet in widespread use. This paper presents an experimental comparison of all these approaches on a large statistical machine translation task. We also describe an open-source implementation to train and use continuous space language models (CSLM) for such large tasks. We describe an efficient implementation of the CSLM using graphical processing units from Nvidia. By these means, we are able to train a CSLM on more than 500 million words in 20 hours. This CSLM provides an improvement of up to 1.8 BLEU points with respect to the best back-off language model that we were able to build.
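For readers unfamiliar with continuous space language models, the following is a minimal NumPy sketch of the forward pass of a feed-forward CSLM of the kind the paper builds on: context words are projected into a continuous space, concatenated, passed through a non-linear hidden layer, and a softmax over a vocabulary shortlist yields next-word probabilities. This is not the authors' toolkit, and all layer sizes are illustrative assumptions; the paper's contribution is an efficient batched GPU implementation of such networks.

```python
# Minimal sketch (assumed architecture, not the authors' CSLM toolkit) of the
# forward pass of a feed-forward continuous space language model.
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10000      # softmax shortlist of most frequent words (assumed)
embed_dim = 256         # projection-layer dimension (assumed)
context_len = 3         # n-gram order minus one, i.e. a 4-gram model (assumed)
hidden_dim = 512        # hidden-layer size (assumed)

# Model parameters, randomly initialised for illustration.
projection = rng.normal(0, 0.1, (vocab_size, embed_dim))
W_hidden = rng.normal(0, 0.1, (context_len * embed_dim, hidden_dim))
b_hidden = np.zeros(hidden_dim)
W_out = rng.normal(0, 0.1, (hidden_dim, vocab_size))
b_out = np.zeros(vocab_size)

def cslm_probabilities(context_word_ids):
    """Return P(w | context) over the shortlist for one n-gram context."""
    # Map each context word to its continuous representation and concatenate.
    x = projection[context_word_ids].reshape(-1)
    # Non-linear hidden layer (tanh, as in classical feed-forward neural LMs).
    h = np.tanh(x @ W_hidden + b_hidden)
    # Softmax output layer over the shortlist.
    logits = h @ W_out + b_out
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Example: next-word distribution given three (hypothetical) context word ids.
probs = cslm_probabilities(np.array([12, 305, 4071]))
print(probs.shape, probs.sum())
```

In practice, these matrix-vector products are grouped into large matrix-matrix products over many n-grams at once, which is what makes a GPU implementation worthwhile.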
June 11, 2012 by hgpu