• The inference required for forming a percept is both fast and accurate.
• The learning algorithm is local. Adjustments to a synapse strength depend only on the states of the presynaptic and postsynaptic neurons, as the sketch following this list illustrates.
• The communication is simple. Neurons need only communicate their stochastic binary states.
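To make the last two points concrete, here is a minimal sketch (in Python/NumPy, with variable names of our own choosing, not the letter's notation) of a locally computable update of the kind used later in the letter: each element of the weight change depends only on the stochastic binary states of the presynaptic and postsynaptic units that the weight connects.

```python
import numpy as np

def local_weight_update(v_data, h_data, v_recon, h_recon, lr=0.1):
    """Pairwise, correlational update of the contrastive form used later
    in the letter: dW[i, j] depends only on the binary states of visible
    (presynaptic) unit i and hidden (postsynaptic) unit j."""
    return lr * (np.outer(v_data, h_data) - np.outer(v_recon, h_recon))
```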
Section 2 introduces the idea of a “complementary” prior that exactly
cancels the “explaining away” phenomenon that makes inference difficult
in directed models. An example of a directed belief network with com-
plementary priors is presented. Section 3 shows the equivalence between
restricted Boltzmann machines and infinite directed networks with tied
weights.
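As a rough illustration of this equivalence (a sketch only, assuming a standard binary restricted Boltzmann machine; the symbols W, b_vis, and b_hid are ours), alternating Gibbs sampling reuses the same tied weight matrix at every step, which is what allows the restricted Boltzmann machine to be read as an infinitely deep directed net:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p):
    """Sample stochastic binary states from Bernoulli probabilities."""
    return (rng.random(p.shape) < p).astype(float)

def alternating_gibbs(v, W, b_vis, b_hid, steps=20):
    """Alternate between hidden and visible layers using the same weight
    matrix W (and its transpose) at every step.  Each additional step
    corresponds to one more pair of layers in the equivalent infinite
    directed net with tied weights."""
    for _ in range(steps):
        h = sample_binary(sigmoid(v @ W + b_hid))
        v = sample_binary(sigmoid(h @ W.T + b_vis))
    return v, h
```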
Section 4 introduces a fast, greedy learning algorithm for constructing
multilayer directed networks one layer at a time. Using a variational bound,
it shows that as each new layer is added, the overall generative model
improves. The greedy algorithm bears some resemblance to boosting in
its repeated use of the same “weak” learner, but instead of reweighting
each data vector to ensure that the next step learns something new, it re-
represents it. The “weak” learner that is used to construct deep directed
nets is itself an undirected graphical model.
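The shape of the greedy procedure can be sketched as follows (an outline only; `train_rbm` is a hypothetical stand-in for the "weak" learner, which sections 3 and 4 take to be a restricted Boltzmann machine): after each layer is learned, the data are re-represented as that layer's hidden activities and handed to the next learner.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def greedy_layerwise(data, layer_sizes, train_rbm):
    """Greedy, layer-by-layer construction of a deep net.

    `train_rbm(x, num_hidden)` is a hypothetical stand-in for the RBM
    learning procedure and is assumed to return a weight matrix W and a
    hidden-bias vector b_hid.  Instead of reweighting the data as
    boosting does, each stage re-represents the data as the hidden
    activities of the layer it has just learned."""
    layers = []
    x = data
    for num_hidden in layer_sizes:
        W, b_hid = train_rbm(x, num_hidden)
        layers.append((W, b_hid))
        x = sigmoid(x @ W + b_hid)   # re-represent the data for the next stage
    return layers
```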
Section 5 shows how the weights produced by the fast, greedy al-
gorithm can be fine-tuned using the “up-down” algorithm. This is a
contrastive version of the wake-sleep algorithm (Hinton, Dayan, Frey,
& Neal, 1995) that does not suffer from the “mode-averaging” prob-
lems that can cause the wake-sleep algorithm to learn poor recognition
weights.
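For orientation, the sketch below shows the flavor of a wake-sleep-style update for a single pair of layers (our own simplification, not the up-down algorithm itself): the sleep phase trains the recognition weights on fantasies drawn from the generative model, and it is this ingredient that the contrastive version of section 5 modifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v_data, h_prior, W_rec, W_gen, b_hid, b_vis, lr=0.01):
    """One wake-sleep-style update for a single pair of layers.

    `h_prior` stands in for the probabilities supplied by the layers
    above; all names here are our own.  Wake phase: the recognition
    weights pick hidden states and the generative weights learn to
    reconstruct the data.  Sleep phase: the generative weights fantasize
    a visible vector and the recognition weights learn to recover the
    hidden states that produced it."""
    # Wake (up) phase: recognition drives the hidden states,
    # generative weights learn to reconstruct the data below them.
    h = sample_binary(sigmoid(v_data @ W_rec + b_hid))
    v_recon = sigmoid(h @ W_gen + b_vis)
    W_gen = W_gen + lr * np.outer(h, v_data - v_recon)

    # Sleep (down) phase: the generative model fantasizes a visible vector,
    # recognition weights learn to recover the hidden states behind it.
    h_fantasy = sample_binary(h_prior)
    v_fantasy = sample_binary(sigmoid(h_fantasy @ W_gen + b_vis))
    h_pred = sigmoid(v_fantasy @ W_rec + b_hid)
    W_rec = W_rec + lr * np.outer(v_fantasy, h_fantasy - h_pred)
    return W_rec, W_gen
```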
Section 6 shows the pattern recognition performance of a network with
three hidden layers and about 1.7 million weights on the MNIST set of
handwritten digits. When no knowledge of geometry is provided and there
is no special preprocessing, the generalization performance of the network
is 1.25% errors on the 10,000-digit official test set. This beats the 1.5%
achieved by the best backpropagation nets when they are not handcrafted
for this particular application. It is also slightly better than the 1.4% errors
reported by Decoste and Schoelkopf (2002) for support vector machines on
the same task.
Finally, section 7 shows what happens in the mind of the network when
it is running without being constrained by visual input. The network has a
full generative model, so it is easy to look into its mind—we simply generate
an image from its high-level representations.
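Concretely, generating an image from the high-level representations can be sketched as ancestral sampling (the function signature and names below are our assumptions, not the letter's code): let the top-level undirected associative memory settle by Gibbs sampling, then perform a single top-down pass through the directed generative connections.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p):
    return (rng.random(p.shape) < p).astype(float)

def generate_image(top_rbm, generative_layers, gibbs_steps=100):
    """Fantasize an image from the generative model.

    `top_rbm` is a hypothetical (W, b_vis, b_hid) triple for the
    top-level associative memory; `generative_layers` is a list of
    (W_gen, b_gen) pairs for the directed connections, ordered from the
    top of the net down to the image layer."""
    W, b_vis, b_hid = top_rbm
    v = sample_binary(np.full(b_vis.shape, 0.5))   # arbitrary starting state
    # Let the top-level associative memory settle by alternating Gibbs sampling.
    for _ in range(gibbs_steps):
        h = sample_binary(sigmoid(v @ W + b_hid))
        v = sample_binary(sigmoid(h @ W.T + b_vis))
    # One top-down pass through the directed generative connections.
    x = v
    for W_gen, b_gen in generative_layers:
        x = sample_binary(sigmoid(x @ W_gen + b_gen))
    return x
```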
Throughout the letter, we consider nets composed of stochastic binary
variables, but the ideas can be generalized to other models in which the log
probability of a variable is an additive function of the states of its directly
connected neighbors (see appendix A for details).
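For the stochastic binary units used here, that additive dependence takes the familiar logistic form: writing b_i for the bias of unit i and w_{ij} for the weight on its connection to neighbor j (standard notation rather than a definition given in this section), the probability that unit i turns on is

```latex
p(s_i = 1) \;=\; \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ij}\right)} .
```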