Efficient Algorithms for Learning Restricted Boltzmann Machines
Abstract
Probabilistic generative models learn useful features from unlabeled data, which can then be used for subsequent problem-specific tasks such as classification, regression or information retrieval. The restricted Boltzmann machine (RBM) is one such important energy-based probabilistic generative model. RBMs are also the building blocks for several deep generative models. RBMs are difficult to train and evaluate, mainly because the normalizing constant (known as the partition function) of the distribution they represent is computationally hard to evaluate.
Therefore, various approximate methods (based on noisy gradients of the log-likelihood estimated through sampling) are used to train RBMs. Thus, building efficient learning algorithms for the RBM is an important problem.
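To fix notation for what follows (a standard form assumed here only for illustration, with binary visible units $\mathbf{v}$, binary hidden units $\mathbf{h}$, weights $W$ and biases $\mathbf{b},\mathbf{c}$), the RBM defines
\[
p(\mathbf{v},\mathbf{h}) \;=\; \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z},
\qquad
E(\mathbf{v},\mathbf{h}) \;=\; -\mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h} - \mathbf{v}^{\top}W\mathbf{h},
\qquad
Z \;=\; \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})},
\]
and it is the sum over exponentially many configurations in $Z$ that makes exact gradients and exact likelihood evaluation intractable.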
In this thesis, we consider the problem of maximum likelihood learning of RBMs. We consider both binary-binary RBMs (BB-RBMs) and Gaussian-binary RBMs (GB-RBMs). We propose a new algorithm for learning BB-RBMs by exploiting the property that the BB-RBM log-likelihood function is a difference of convex functions w.r.t. its parameters. In standard difference of convex functions programming (DCP), the optimization proceeds by solving a convex optimization problem at each iteration. In the case of the RBM, this convex objective function contains the partition function and hence its gradient computation may be intractable. We propose a stochastic variant of the difference of convex functions optimization algorithm, termed S-DCP, where the convex optimization problem at each iteration is approximately solved through a few iterations of stochastic gradient descent. The resulting algorithm is simple, and the contrastive divergence~(CD) algorithm, the current standard algorithm for learning RBMs, can be derived as a special case of it. Through empirical studies, we show that S-DCP improves the optimization dynamics of learning BB-RBMs.
We further modify this algorithm to accommodate centered gradients. Through extensive empirical studies on a number of benchmark datasets, we demonstrate the superior performance of the proposed algorithms.
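The decomposition that S-DCP exploits can be sketched as follows (an illustrative form in the notation above, not the exact statement from the thesis). Maximizing the average log-likelihood over training samples $\{\mathbf{v}^{(n)}\}_{n=1}^{N}$ amounts to minimizing
\[
F(\theta) \;=\; \underbrace{\log Z(\theta)}_{f(\theta)} \;-\; \underbrace{\frac{1}{N}\sum_{n=1}^{N}\log \sum_{\mathbf{h}} e^{-E(\mathbf{v}^{(n)},\mathbf{h})}}_{g(\theta)},
\]
where both $f$ and $g$ are convex in $\theta=(W,\mathbf{b},\mathbf{c})$ because the energy is affine in $\theta$. The DCP iteration linearizes $g$ at the current iterate $\theta^{(t)}$ and solves the convex subproblem
\[
\theta^{(t+1)} \;=\; \arg\min_{\theta}\;\bigl[\, f(\theta) \;-\; \theta^{\top}\nabla g(\theta^{(t)}) \,\bigr],
\]
whose gradient involves $\nabla \log Z$, an expectation under the model that must be estimated by sampling; S-DCP replaces the exact solution of this subproblem by a few steps of stochastic gradient descent.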
It is well documented in the literature that learning Gaussian-binary RBMs is more difficult than learning binary-binary RBMs. We extend the S-DCP algorithm to learn GB-RBMs by proving that the GB-RBM log-likelihood function is also a difference of convex functions w.r.t. the weights and the hidden biases, under the assumption that the conditional distributions of the visible units have a fixed variance. Through extensive empirical studies on a number of benchmark datasets, we demonstrate that S-DCP learns good models more efficiently than CD and persistent CD~(PCD), the current standard algorithms for learning GB-RBMs. We further modify the S-DCP algorithm to accommodate a variance update (outside the inner loop of the convex optimization) so that the variance parameter of the visible units can also be learnt instead of being kept fixed. We empirically analyse the resulting algorithm and show that it is more efficient than the current algorithms.
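For concreteness, one commonly used GB-RBM energy (an assumed parameterization given here for illustration; the thesis may use an equivalent variant) is
\[
E(\mathbf{v},\mathbf{h}) \;=\; \sum_{i} \frac{(v_i - b_i)^2}{2\sigma^2} \;-\; \mathbf{c}^{\top}\mathbf{h} \;-\; \sum_{i,j} \frac{v_i}{\sigma^2}\, w_{ij}\, h_j .
\]
With $\sigma^2$ held fixed, the energy is affine in $W$ and $\mathbf{c}$, which is what allows the difference-of-convex structure, and hence S-DCP, to carry over to the GB-RBM.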
Second-order learning methods provide invariance to re-parameterization of the model and also improve the optimization dynamics of the learning algorithm by providing parameter-specific (adaptive) learning rates. However, the Hessian of the log-likelihood, required for second-order learning, can only be estimated through sampling, and this noisy estimate makes the optimization algorithm unstable. Moreover, the computation of the Hessian inverse is expensive. We propose a second-order learning algorithm on the convex S-DCP objective function using a diagonal approximation of the Hessian which, we show, can be computed easily along with the gradient estimates. To compensate for the noise in the Hessian estimate and to make the algorithm stable, we use exponential averaging over these estimates. We show empirically that the resulting algorithm, termed S-DCP-D, is computationally cheap, stable, and improves the performance of S-DCP further. Our empirical results show that the centered S-DCP as well as the diagonally scaled S-DCP are effective and efficient methods for learning RBMs.
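A minimal sketch of the diagonally scaled update is given below, assuming the caller supplies a noisy per-parameter curvature estimate; the names sdcp_d_update and diag_hess_est, and the damping scheme, are illustrative and not the thesis implementation.

import numpy as np

def sdcp_d_update(theta, grad, diag_hess_est, avg_diag,
                  lr=0.01, beta=0.9, damping=1e-4):
    """One diagonally scaled update (illustrative sketch only).

    theta         : current parameters (e.g., flattened W, b, c)
    grad          : stochastic gradient of the convex S-DCP objective at theta
    diag_hess_est : current noisy estimate of the diagonal of the Hessian
    avg_diag      : exponential average of past diagonal estimates
    """
    # Exponentially average the noisy diagonal estimates to stabilise them.
    avg_diag = beta * avg_diag + (1.0 - beta) * diag_hess_est
    # Per-parameter (adaptive) step sizes: scale the gradient by the damped,
    # averaged curvature instead of inverting a full Hessian.
    theta = theta - lr * grad / (np.abs(avg_diag) + damping)
    return theta, avg_diag

Only element-wise operations are involved, so the overhead over plain S-DCP is small, which is the sense in which the method is computationally cheap.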
In all methods for learning RBMs, the log-likelihood achieved on held-out test samples is used to evaluate the quality of the learnt RBMs and to fix the hyperparameters. However, the presence of the partition function makes estimating the log-likelihood intractable for models with large dimension. Currently, sampling-based methods are used to approximate the log-likelihood. We provide an empirical analysis of these sampling-based algorithms for estimating the log-likelihood and suggest some simple techniques to improve these estimates.
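As one standard example of this class of estimators (given here only to fix ideas), annealed importance sampling~(AIS) estimates the partition function from $M$ annealing runs through intermediate unnormalized distributions $p_0^{*},\dots,p_K^{*}$ that interpolate between a tractable base model (with partition function $Z_0$) and the target RBM:
\[
\hat{Z} \;=\; Z_0 \,\frac{1}{M}\sum_{m=1}^{M}\;\prod_{k=1}^{K}\frac{p_k^{*}\bigl(\mathbf{v}_k^{(m)}\bigr)}{p_{k-1}^{*}\bigl(\mathbf{v}_k^{(m)}\bigr)},
\qquad
\widehat{\log p}(\mathbf{v}) \;=\; \log p^{*}(\mathbf{v}) - \log \hat{Z},
\]
where $\mathbf{v}_k^{(m)}$ is obtained by a Markov transition that leaves $p_{k-1}$ invariant. The variance of these importance weights is what can make the resulting log-likelihood estimates unreliable, and it is this behaviour that our empirical analysis examines.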