A fast learning algorithm for deep belief nets
To appear in Neural Computation 2006
Geoffrey E. Hinton and Simon Osindero
Department of Computer Science, University of Toronto
10 King's College Road
Toronto, Canada M5S 3G4
{hinton, osindero}@cs.toronto.edu
Yee-Whye Teh
Department of Computer Science
National University of Singapore
3 Science Drive 3, Singapore, 117543
tehyw@comp.nus.edu.sg
Abstract
We show how to use “complementary priors” to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
1 Introduction
Learning is difficult in densely-connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer where the prior assumes independence. Also, variational learning still requires all of the parameters to be learned together and makes the learning time scale poorly as the number of parameters increases.
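To be concrete about what the variational approach optimizes: it replaces the true posterior over hidden states h given a data vector v with a simpler distribution Q(h|v) and maximizes a lower bound on the log probability of the data. In a generic notation (introduced here for illustration, not tied to any particular layer of the network), the standard bound is

\log p(\mathbf{v}) \;\ge\; \sum_{\mathbf{h}} Q(\mathbf{h}|\mathbf{v})\,\big[\log p(\mathbf{h}) + \log p(\mathbf{v}|\mathbf{h})\big] \;-\; \sum_{\mathbf{h}} Q(\mathbf{h}|\mathbf{v})\,\log Q(\mathbf{h}|\mathbf{v}),

and it is tight only when Q(h|v) equals the true posterior p(h|v), which is exactly what is hard to arrange when explaining away makes the true posterior far from factorial.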
We describe a model in which the top two hidden layers form an undirected associative memory (see figure 1) and the remaining hidden layers form a directed acyclic graph that converts the representations in the associative memory into observable variables such as the pixels of an image. This hybrid model has some attractive features:
1. There is a fast, greedy learning algorithm that can find a fairly good set of parameters quickly, even in deep networks with millions of parameters and many hidden layers.
2. The learning algorithm is unsupervised but can be applied to labeled data by learning a model that generates both the label and the data.
3. There is a fine-tuning algorithm that learns an excellent generative model which outperforms discriminative methods on the MNIST database of hand-written digits.
4. The generative model makes it easy to interpret the distributed representations in the deep hidden layers.
5. The inference required for forming a percept is both fast and accurate.
6. The learning algorithm is local: adjustments to a synapse strength depend only on the states of the pre-synaptic and post-synaptic neuron.
7. The communication is simple: neurons only need to communicate their stochastic binary states.
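As a concrete illustration of this hybrid structure, the following sketch (in numpy) draws a sample from such a model: alternating Gibbs sampling in the top-level undirected associative memory, followed by a single top-down pass through the directed layers to produce pixels. The layer sizes, function names, and random weights are illustrative assumptions rather than the paper's trained network, and biases are omitted for brevity; how the weights are actually learned is the subject of the sections outlined next.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Stochastic binary states: each unit turns on with probability p.
    return (rng.random(p.shape) < p).astype(float)

# Hypothetical layer sizes: a 28x28 image (784 pixels), two directed hidden
# layers, and a large top layer that, together with the layer below it,
# forms the undirected associative memory.
layer_sizes = [784, 500, 500]   # visible layer and the directed hidden layers
top_size = 2000                 # top layer of the undirected associative memory

# Top-down generative weights for the directed part (biases omitted).
gen_weights = [rng.normal(0.0, 0.01, size=(layer_sizes[i + 1], layer_sizes[i]))
               for i in range(len(layer_sizes) - 1)]
# Symmetric weights of the top-level restricted Boltzmann machine.
rbm_weights = rng.normal(0.0, 0.01, size=(layer_sizes[-1], top_size))

def generate_image(n_gibbs=200):
    # 1) Let the associative memory settle by alternating Gibbs sampling.
    top = sample_bernoulli(np.full(top_size, 0.5))
    for _ in range(n_gibbs):
        below = sample_bernoulli(sigmoid(rbm_weights @ top))   # layer below the top
        top = sample_bernoulli(sigmoid(below @ rbm_weights))   # top layer
    # 2) Convert the memory's state into pixels with one top-down directed pass.
    state = below
    for w in reversed(gen_weights):
        state = sample_bernoulli(sigmoid(state @ w))
    return state.reshape(28, 28)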
Section 2 introduces the idea of a “complementary” prior which exactly cancels the “explaining away” phenomenon that makes inference difficult in directed models. An example of a directed belief network with complementary priors is presented. Section 3 shows the equivalence between restricted Boltzmann machines and infinite directed networks with tied weights.
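A fact worth keeping in mind for Sections 2 and 3 is that, in a restricted Boltzmann machine, the hidden units are conditionally independent given the visible units (and vice versa), so inference involves no explaining away. With w_{ij} the symmetric weight between visible unit i and hidden unit j, and a_i, b_j the biases (a generic notation introduced here, not the paper's), the two conditional distributions are

p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i v_i\,w_{ij}\Big), \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_j h_j\,w_{ij}\Big),

where \sigma(x) = 1/(1 + e^{-x}) is the logistic function.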
Section 4 introduces a fast, greedy learning algorithm for constructing multi-layer directed networks one layer at a time. Using a variational bound it shows that as each new layer is added, the overall generative model improves. The greedy algorithm bears some resemblance to boosting in its repeated use of the same “weak” learner, but instead of re-weighting each data-vector to ensure that the next step learns something new, it re-represents it. The “weak” learner that