## Combining approximate inference methods for efficient learning on large computer clusters

Frankfurt Institute for Advanced Studies, Germany

Workshop on Big Learning: Algorithms, Systems, and Tools for Learning at Scale (NIPS’11), 2011

@inproceedings{dai2012combining,
  title={Combining approximate inference methods for efficient learning on large computer clusters},
  author={Dai, Z. and Shelton, J. A. and Bornschein, J. and Sheikh, A. S. and L{\"u}cke, J.},
  booktitle={NIPS Workshop on Big Learning: Algorithms, Systems, and Tools for Learning at Scale},
  year={2012}
}

An important challenge in machine learning is to develop learning algorithms that can handle large amounts of data at a realistically large scale. This entails not only the development of algorithms that can be efficiently trained to infer the parameters of a model from a given dataset, but also demands careful thought about the tools (both software and hardware) used in their implementation. We extend the previously developed framework for parallel Expectation Maximization (EM) learning [1] to different models with corresponding parallelization techniques. To further tackle problems of computational complexity, and to exploit the capabilities of parallel computing hardware (CPU/GPU clusters), we developed a set of techniques that can be tailored to specific large-scale learning problems. For instance, we design a dynamic data repartitioning technique for "Gaussian sparse coding" (Sec. 3.2), use specialized GPU kernels for translation-invariant learning (Sec. 3.3), and show how sampling can be used to further scale learning on very high-dimensional data (Sec. 3.4). We propose these as examples of a parallelization toolbox whose components can be creatively combined and exploited in model- and task-driven ways. The framework is a lightweight, easy-to-use implementation in Python that facilitates the development of massively parallel machine learning algorithms, using the Message Passing Interface (MPI) for communication between compute nodes. Once algorithms are integrated into the framework, they can be executed on large numbers of processor cores and applied to large datasets. Some of the numerical experiments we performed ran on InfiniBand-interconnected clusters, used up to 5000 parallel processor cores, and performed more than 10^17 floating-point operations. For reasonably balanced meta-parameters (number of data points vs. number of latent variables vs. number of model parameters to be inferred), we observe close to linear runtime scaling with respect to the number of cores in use.
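The data-parallel EM pattern underlying such a framework can be illustrated with a small sketch: each worker holds a shard of the data, computes partial sufficient statistics in the E-step, and the partials are summed globally (the role an MPI allreduce plays on a real cluster) before the M-step. The sketch below simulates this with plain NumPy for a two-component 1-D Gaussian mixture; the model and all names are illustrative and not taken from the paper's actual framework.

```python
import numpy as np

# Synthetic data from two Gaussians, split across 4 simulated workers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 500),
                       rng.normal(3.0, 1.0, 500)])
shards = np.array_split(data, 4)

mu = np.array([-1.0, 1.0])     # initial component means
sigma = np.array([1.0, 1.0])   # initial standard deviations
pi = np.array([0.5, 0.5])      # mixing weights

for _ in range(50):
    # E-step, independently per shard: partial sufficient statistics.
    partials = []
    for x in shards:
        x = x[:, None]
        logp = (-0.5 * ((x - mu) / sigma) ** 2
                - np.log(sigma) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)   # responsibilities
        partials.append((r.sum(0), (r * x).sum(0), (r * x ** 2).sum(0)))
    # Global reduction: on a cluster this would be MPI_Allreduce(SUM).
    N, S1, S2 = (sum(p[i] for p in partials) for i in range(3))
    # M-step from the global statistics only.
    mu = S1 / N
    sigma = np.sqrt(np.maximum(S2 / N - mu ** 2, 1e-12))
    pi = N / N.sum()
```

Because the M-step depends on the data only through the reduced statistics, communication per iteration is constant in the dataset size, which is what makes the near-linear scaling reported above plausible.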
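The dynamic data repartitioning idea mentioned for Gaussian sparse coding (Sec. 3.2) can also be sketched in miniature: if per-data-point cost varies (e.g. with the number of active latent variables), a static equal-count split leaves workers unevenly loaded, so points are periodically reassigned to balance cumulative cost. The greedy prefix split below is a hypothetical illustration, not the paper's algorithm.

```python
import numpy as np

def repartition(costs, n_workers):
    """Assign each data point a worker id so that the total estimated
    cost per worker is roughly equal (greedy prefix-sum split)."""
    target = costs.sum() / n_workers
    cum = np.cumsum(costs)
    # Point i goes to the worker whose cost bucket its prefix sum falls in.
    return np.minimum((cum / target).astype(int), n_workers - 1)

# Illustrative per-point cost estimates (e.g. active latents per point).
rng = np.random.default_rng(1)
costs = rng.integers(1, 10, size=1000).astype(float)
owner = repartition(costs, 8)
loads = np.bincount(owner, weights=costs, minlength=8)
```

After repartitioning, each worker's load deviates from the ideal share by at most roughly one point's cost, so no single node stalls the synchronous EM iteration.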

January 19, 2012 by hgpu