A fast learning algorithm for deep belief nets
To appear in Neural Computation 2006
Geoffrey E. Hinton and Simon Osindero
Department of Computer Science, University of Toronto
10 King's College Road
Toronto, Canada M5S 3G4
{hinton, osindero}@cs.toronto.edu
Yee-Whye Teh
Department of Computer Science
National University of Singapore
3 Science Drive 3, Singapore, 117543
tehyw@comp.nus.edu.sg
Abstract
We show how to use “complementary priors” to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
1 Introduction
Learning is difficult in densely-connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer where the prior assumes independence. Also, variational learning still requires all of the parameters to be learned together and makes the learning time scale poorly as the number of parameters increases.
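To be concrete about what the variational approach optimizes: it replaces the true posterior over hidden states h given a data vector v with a simpler distribution Q(h|v) and maximizes a lower bound on the log probability of the data. In a generic notation (introduced here for illustration, not tied to any particular layer of the network), the standard bound is

\log p(\mathbf{v}) \;\ge\; \sum_{\mathbf{h}} Q(\mathbf{h}|\mathbf{v})\,\big[\log p(\mathbf{h}) + \log p(\mathbf{v}|\mathbf{h})\big] \;-\; \sum_{\mathbf{h}} Q(\mathbf{h}|\mathbf{v})\,\log Q(\mathbf{h}|\mathbf{v}),

and it is tight only when Q(h|v) equals the true posterior p(h|v), which is exactly what is hard to arrange when explaining away makes the true posterior far from factorial.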
We describe a model in which the top two hidden layers form an undirected associative memory (see figure 1) and the remaining hidden layers form a directed acyclic graph that converts the representations in the associative memory into observable variables such as the pixels of an image. This hybrid model has some attractive features:
1. There is a fast, greedy learning algorithm that can find a fairly good set of parameters quickly, even in deep networks with millions of parameters and many hidden layers.
2. The learning algorithm is unsupervised but can be applied to labeled data by learning a model that generates both the label and the data.
3. There is a fine-tuning algorithm that learns an excellent generative model which outperforms discriminative methods on the MNIST database of hand-written digits.
4. The generative model makes it easy to interpret the distributed representations in the deep hidden layers.
5. The inference required for forming a percept is both fast and accurate.
6. The learning algorithm is local: adjustments to a synapse strength depend only on the states of the pre-synaptic and post-synaptic neuron.
7. The communication is simple: neurons only need to communicate their stochastic binary states.
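As a concrete illustration of this hybrid structure, the following sketch (in numpy) draws a sample from such a model: alternating Gibbs sampling in the top-level undirected associative memory, followed by a single top-down pass through the directed layers to produce pixels. The layer sizes, function names, and random weights are illustrative assumptions rather than the paper's trained network, and biases are omitted for brevity; how the weights are actually learned is the subject of the sections outlined next.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Stochastic binary states: each unit turns on with probability p.
    return (rng.random(p.shape) < p).astype(float)

# Hypothetical layer sizes: a 28x28 image (784 pixels), two directed hidden
# layers, and a large top layer that, together with the layer below it,
# forms the undirected associative memory.
layer_sizes = [784, 500, 500]   # visible layer and the directed hidden layers
top_size = 2000                 # top layer of the undirected associative memory

# Top-down generative weights for the directed part (biases omitted).
gen_weights = [rng.normal(0.0, 0.01, size=(layer_sizes[i + 1], layer_sizes[i]))
               for i in range(len(layer_sizes) - 1)]
# Symmetric weights of the top-level restricted Boltzmann machine.
rbm_weights = rng.normal(0.0, 0.01, size=(layer_sizes[-1], top_size))

def generate_image(n_gibbs=200):
    # 1) Let the associative memory settle by alternating Gibbs sampling.
    top = sample_bernoulli(np.full(top_size, 0.5))
    for _ in range(n_gibbs):
        below = sample_bernoulli(sigmoid(rbm_weights @ top))   # layer below the top
        top = sample_bernoulli(sigmoid(below @ rbm_weights))   # top layer
    # 2) Convert the memory's state into pixels with one top-down directed pass.
    state = below
    for w in reversed(gen_weights):
        state = sample_bernoulli(sigmoid(state @ w))
    return state.reshape(28, 28)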
Section 2 introduces the idea of a “complementary” prior which exactly cancels the “explaining away” phenomenon that makes inference difficult in directed models. An example of a directed belief network with complementary priors is presented. Section 3 shows the equivalence between restricted Boltzmann machines and infinite directed networks with tied weights.
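A fact worth keeping in mind for Sections 2 and 3 is that, in a restricted Boltzmann machine, the hidden units are conditionally independent given the visible units (and vice versa), so inference involves no explaining away. With w_{ij} the symmetric weight between visible unit i and hidden unit j, and a_i, b_j the biases (a generic notation introduced here, not the paper's), the two conditional distributions are

p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i v_i\,w_{ij}\Big), \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_j h_j\,w_{ij}\Big),

where \sigma(x) = 1/(1 + e^{-x}) is the logistic function.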
Section 4 introduces a fast, greedy learning algorithm for constructing multi-layer directed networks one layer at a time. Using a variational bound it shows that as each new layer is added, the overall generative model improves. The greedy algorithm bears some resemblance to boosting in its repeated use of the same “weak” learner, but instead of re-weighting each data-vector to ensure that the next step learns something new, it re-represents it. The “weak” learner that