## Learning Sparse Recurrent Neural Networks in Language Modeling

The Ohio State University

The Ohio State University, 2014

@phdthesis{shao2014learning,
  title={Learning Sparse Recurrent Neural Networks in Language Modeling},
  author={Shao, Yuanlong},
  year={2014},
  school={The Ohio State University}
}

In the context of statistical language modeling, we explored the task of learning an Elman network with sparse weight matrices, as a pilot study towards learning a sparsely connected, fully recurrent neural network, which would be potentially useful in many cases. We also explored how efficient and scalable such learning can be in practice. In particular, we explored these tasks: (1) We adapted the Iterative Hard Thresholding (IHT) algorithm to BackPropagation Through Time (BPTT) learning. (2) To accelerate convergence of the IHT algorithm, we designed a scheme for expanding the network by replicating the existing hidden neurons, so that training can start from a small, dense network that has already been learned. (3) We implemented this algorithm on the GPU. With small mini-batch sizes and large network sizes (e.g., 2000 hidden neurons), it achieves a 160-fold speedup compared to the RNNLM toolkit on the CPU. With larger mini-batch sizes there could be a further 10-fold speedup, though the convergence rate becomes an issue in such cases and further effort is needed to address this problem. (4) Lacking a theoretical convergence guarantee for the IHT algorithm in our problem setting, we conducted an empirical study showing that learning a sparse network does give competitive perplexity in language modeling. In particular, we showed that a sparse network learned in this way can outperform a dense network when the number of effective parameters is kept the same. (5) We gathered performance metrics comparing the computational efficiency of the matrix operations of interest in both sparse and dense settings. The results suggest that, for network sizes we can currently train in reasonable time, sparse matrices do not offer a computational advantage over dense matrices unless we are allowed to have very sparse networks. Thus for research purposes we may want to focus on using dense matrices, while for engineering purposes a more flexible matrix design leveraging the power of both dense and sparse matrices might be necessary.
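
To make item (1) concrete, here is a minimal sketch of a single IHT update on one weight matrix: a plain gradient step followed by a projection that keeps only the `k` largest-magnitude weights. It is an illustration under stated assumptions, not the thesis's implementation. The function names `hard_threshold` and `iht_step`, the choice of a per-matrix top-`k` rule, and the fixed learning rate `lr` are all hypothetical; in the setting described above, the gradient would come from BPTT rather than being passed in by hand.

```python
import heapq


def hard_threshold(weights, k):
    """Keep the k largest-magnitude entries of a 2-D weight matrix; zero the rest.

    Assumption: sparsity is enforced per matrix with a global top-k rule.
    """
    if k <= 0:
        return [[0.0 for _ in row] for row in weights]
    # Tag every entry with its magnitude and position, then pick the top k.
    flat = [(abs(w), i, j)
            for i, row in enumerate(weights)
            for j, w in enumerate(row)]
    keep = {(i, j) for _, i, j in heapq.nlargest(k, flat)}
    return [[w if (i, j) in keep else 0.0 for j, w in enumerate(row)]
            for i, row in enumerate(weights)]


def iht_step(weights, grads, lr, k):
    """One IHT update: gradient step, then projection onto k-sparse matrices.

    `grads` stands in for the gradient that BPTT would supply.
    """
    stepped = [[w - lr * g for w, g in zip(w_row, g_row)]
               for w_row, g_row in zip(weights, grads)]
    return hard_threshold(stepped, k)


if __name__ == "__main__":
    W = [[0.5, -0.1], [0.05, -0.8]]
    G = [[1.0, -2.0], [0.0, 1.0]]
    # After the step, only the two largest-magnitude weights survive.
    print(iht_step(W, G, lr=0.1, k=2))
```

Because the projection is applied after every update, weights that were zeroed can re-enter the support later if their gradients are large enough, which is what distinguishes IHT from pruning a network once and then fixing the sparsity pattern.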

May 7, 2014 by hgpu