• The inference required for forming a percept is both fast and accurate.
• The learning algorithm is local. Adjustments to a synapse strength depend only on the states of the presynaptic and postsynaptic neurons, as the sketch following this list illustrates.
• The communication is simple. Neurons need only communicate their stochastic binary states.
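To make the last two points concrete, here is a minimal sketch (in Python/NumPy, with variable names of our own choosing, not the letter's notation) of a locally computable update of the kind used later in the letter: each element of the weight change depends only on the stochastic binary states of the presynaptic and postsynaptic units that the weight connects.

```python
import numpy as np

def local_weight_update(v_data, h_data, v_recon, h_recon, lr=0.1):
    """Pairwise, correlational update of the contrastive form used later
    in the letter: dW[i, j] depends only on the binary states of visible
    (presynaptic) unit i and hidden (postsynaptic) unit j."""
    return lr * (np.outer(v_data, h_data) - np.outer(v_recon, h_recon))
```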
Section 2 introduces the idea of a “complementary” prior that exactly
cancels the “explaining away” phenomenon that makes inference difficult
in directed models. An example of a directed belief network with com-
plementary priors is presented. Section 3 shows the equivalence between
restricted Boltzmann machines and infinite directed networks with tied
weights.
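As a rough illustration of this equivalence (a sketch only, assuming a standard binary restricted Boltzmann machine; the symbols W, b_vis, and b_hid are ours), alternating Gibbs sampling reuses the same tied weight matrix at every step, which is what allows the restricted Boltzmann machine to be read as an infinitely deep directed net:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p):
    """Sample stochastic binary states from Bernoulli probabilities."""
    return (rng.random(p.shape) < p).astype(float)

def alternating_gibbs(v, W, b_vis, b_hid, steps=20):
    """Alternate between hidden and visible layers using the same weight
    matrix W (and its transpose) at every step.  Each additional step
    corresponds to one more pair of layers in the equivalent infinite
    directed net with tied weights."""
    for _ in range(steps):
        h = sample_binary(sigmoid(v @ W + b_hid))
        v = sample_binary(sigmoid(h @ W.T + b_vis))
    return v, h
```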
Section 4 introduces a fast, greedy learning algorithm for constructing
multilayer directed networks one layer at a time. Using a variational bound,
it shows that as each new layer is added, the overall generative model
improves. The greedy algorithm bears some resemblance to boosting in
its repeated use of the same “weak” learner, but instead of reweighting
each data vector to ensure that the next step learns something new, it re-
represents it. The “weak” learner that is used to construct deep directed
nets is itself an undirected graphical model.
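The shape of the greedy procedure can be sketched as follows (an outline only; `train_rbm` is a hypothetical stand-in for the "weak" learner, which sections 3 and 4 take to be a restricted Boltzmann machine): after each layer is learned, the data are re-represented as that layer's hidden activities and handed to the next learner.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def greedy_layerwise(data, layer_sizes, train_rbm):
    """Greedy, layer-by-layer construction of a deep net.

    `train_rbm(x, num_hidden)` is a hypothetical stand-in for the RBM
    learning procedure and is assumed to return a weight matrix W and a
    hidden-bias vector b_hid.  Instead of reweighting the data as
    boosting does, each stage re-represents the data as the hidden
    activities of the layer it has just learned."""
    layers = []
    x = data
    for num_hidden in layer_sizes:
        W, b_hid = train_rbm(x, num_hidden)
        layers.append((W, b_hid))
        x = sigmoid(x @ W + b_hid)   # re-represent the data for the next stage
    return layers
```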
Section 5 shows how the weights produced by the fast, greedy al-
gorithm can be fine-tuned using the “up-down” algorithm. This is a
contrastive version of the wake-sleep algorithm (Hinton, Dayan, Frey,
& Neal, 1995) that does not suffer from the “mode-averaging” prob-
lems that can cause the wake-sleep algorithm to learn poor recognition
weights.
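For orientation, the sketch below shows the flavor of a wake-sleep-style update for a single pair of layers (our own simplification, not the up-down algorithm itself): the sleep phase trains the recognition weights on fantasies drawn from the generative model, and it is this ingredient that the contrastive version of section 5 modifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v_data, h_prior, W_rec, W_gen, b_hid, b_vis, lr=0.01):
    """One wake-sleep-style update for a single pair of layers.

    `h_prior` stands in for the probabilities supplied by the layers
    above; all names here are our own.  Wake phase: the recognition
    weights pick hidden states and the generative weights learn to
    reconstruct the data.  Sleep phase: the generative weights fantasize
    a visible vector and the recognition weights learn to recover the
    hidden states that produced it."""
    # Wake (up) phase: recognition drives the hidden states,
    # generative weights learn to reconstruct the data below them.
    h = sample_binary(sigmoid(v_data @ W_rec + b_hid))
    v_recon = sigmoid(h @ W_gen + b_vis)
    W_gen = W_gen + lr * np.outer(h, v_data - v_recon)

    # Sleep (down) phase: the generative model fantasizes a visible vector,
    # recognition weights learn to recover the hidden states behind it.
    h_fantasy = sample_binary(h_prior)
    v_fantasy = sample_binary(sigmoid(h_fantasy @ W_gen + b_vis))
    h_pred = sigmoid(v_fantasy @ W_rec + b_hid)
    W_rec = W_rec + lr * np.outer(v_fantasy, h_fantasy - h_pred)
    return W_rec, W_gen
```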
Section 6 shows the pattern recognition performance of a network with
three hidden layers and about 1.7 million weights on the MNIST set of
handwritten digits. When no knowledge of geometry is provided and there
is no special preprocessing, the generalization performance of the network
is 1.25% errors on the 10,000-digit official test set. This beats the 1.5%
achieved by the best backpropagation nets when they are not handcrafted
for this particular application. It is also slightly better than the 1.4% errors
reported by Decoste and Schoelkopf (2002) for support vector machines on
the same task.
Finally, section 7 shows what happens in the mind of the network when
it is running without being constrained by visual input. The network has a
full generative model, so it is easy to look into its mind—we simply generate
an image from its high-level representations.
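Concretely, generating an image from the high-level representations can be sketched as ancestral sampling (the function signature and names below are our assumptions, not the letter's code): let the top-level undirected associative memory settle by Gibbs sampling, then perform a single top-down pass through the directed generative connections.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p):
    return (rng.random(p.shape) < p).astype(float)

def generate_image(top_rbm, generative_layers, gibbs_steps=100):
    """Fantasize an image from the generative model.

    `top_rbm` is a hypothetical (W, b_vis, b_hid) triple for the
    top-level associative memory; `generative_layers` is a list of
    (W_gen, b_gen) pairs for the directed connections, ordered from the
    top of the net down to the image layer."""
    W, b_vis, b_hid = top_rbm
    v = sample_binary(np.full(b_vis.shape, 0.5))   # arbitrary starting state
    # Let the top-level associative memory settle by alternating Gibbs sampling.
    for _ in range(gibbs_steps):
        h = sample_binary(sigmoid(v @ W + b_hid))
        v = sample_binary(sigmoid(h @ W.T + b_vis))
    # One top-down pass through the directed generative connections.
    x = v
    for W_gen, b_gen in generative_layers:
        x = sample_binary(sigmoid(x @ W_gen + b_gen))
    return x
```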
Throughout the letter, we consider nets composed of stochastic binary
variables, but the ideas can be generalized to other models in which the log
probability of a variable is an additive function of the states of its directly
connected neighbors (see appendix A for details).
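For the stochastic binary units used here, that additive dependence takes the familiar logistic form: writing b_i for the bias of unit i and w_{ij} for the weight on its connection to neighbor j (standard notation rather than a definition given in this section), the probability that unit i turns on is

```latex
p(s_i = 1) \;=\; \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ij}\right)} .
```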