Journal of Machine Learning Research 11 (2010) 625-660 Submitted 8/09; Published 2/10
Why Does Unsupervised Pre-training Help Deep Learning?
Dumitru Erhan∗ DUMITRU.ERHAN@UMONTREAL.CA
Yoshua Bengio YOSHUA.BENGIO@UMONTREAL.CA
Aaron Courville AARON.COURVILLE@UMONTREAL.CA
Pierre-Antoine Manzagol PIERRE-ANTOINE.MANZAGOL@UMONTREAL.CA
Pascal Vincent PASCAL.VINCENT@UMONTREAL.CA
Département d’informatique et de recherche opérationnelle
Université de Montréal
2920, chemin de la Tour
Montréal, Québec, H3T 1J8, Canada
Samy Bengio BENGIO@GOOGLE.COM
Google Research
1600 Amphitheatre Parkway
Mountain View, CA, 94043, USA
Editor: Léon Bottou
Abstract
Much recent research has been devoted to learning algorithms for deep architectures such as Deep
Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several
areas, mostly on vision and language data sets. The best results obtained on supervised learning
tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase.
Even though these new algorithms have enabled training deep models, many questions remain as to
the nature of this difficult learning problem. The main question investigated here is the following:
how does unsupervised pre-training work? Answering this question is important if learning in
deep architectures is to be further improved. We propose several explanatory hypotheses and test
them through extensive simulations. We empirically show the influence of pre-training with respect
to architecture depth, model capacity, and number of training examples. The experiments confirm
and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-
training guides the learning towards basins of attraction of minima that support better generalization
from the training data set; the evidence from these results supports a regularization explanation for
the effect of pre-training.
Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising
auto-encoders, non-convex optimization
1. Introduction
Deep learning methods aim at learning feature hierarchies with features from higher levels of the
hierarchy formed by the composition of lower level features. They include learning methods for a
wide array of deep architectures (Bengio, 2009 provides a survey), including neural networks with
many hidden layers (Bengio et al., 2007; Ranzato et al., 2007; Vincent et al., 2008; Collobert and
Weston, 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006),
∗. Part of this work was done while Dumitru Erhan was at Google Research.
© 2010 Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent and Samy Bengio.
among others (Zhu et al., 2009; Weston et al., 2008). Theoretical results (Yao, 1985; Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006), reviewed and discussed by Bengio and LeCun
(2007), suggest that in order to learn the kind of complicated functions that can represent high-level
abstractions (e.g., in vision, language, and other AI-level tasks), one may need deep architectures.
The recent surge in experimental work in the field seems to support this notion, accumulating evi-
dence that in challenging AI-related tasks—such as computer vision (Bengio et al., 2007; Ranzato
et al., 2007; Larochelle et al., 2007; Ranzato et al., 2008; Lee et al., 2009; Mobahi et al., 2009; Osin-
dero and Hinton, 2008), natural language processing (NLP) (Collobert and Weston, 2008; Weston
et al., 2008), robotics (Hadsell et al., 2008), or information retrieval (Salakhutdinov and Hinton,
2007; Salakhutdinov et al., 2007)—deep learning methods significantly out-perform comparable
but shallow competitors, and often match or beat the state-of-the-art.
These recent demonstrations of the potential of deep learning algorithms were achieved despite
the serious challenge of training models with many layers of adaptive parameters. In virtually all
instances of deep learning, the objective function is a highly non-convex function of the parameters,
with the potential for many distinct local minima in the model parameter space. The principal
difficulty is that not all of these minima provide equivalent generalization errors and, we suggest,
that for deep architectures, the standard training schemes (based on random initialization) tend to
place the parameters in regions of the parameter space that generalize poorly—as was frequently
observed empirically but rarely reported (Bengio and LeCun, 2007).
The breakthrough to effective training strategies for deep architectures came in 2006 with
the algorithms for training deep belief networks (DBN) (Hinton et al., 2006) and stacked auto-
encoders (Ranzato et al., 2007; Bengio et al., 2007), which are all based on a similar approach:
greedy layer-wise unsupervised pre-training followed by supervised fine-tuning. Each layer is pre-
trained with an unsupervised learning algorithm, learning a nonlinear transformation of its input
(the output of the previous layer) that captures the main variations in its input. This unsupervised
pre-training sets the stage for a final training phase where the deep architecture is fine-tuned with
respect to a supervised training criterion with gradient-based optimization. While the improvement
in performance of trained deep models offered by the pre-training strategy is impressive, little is
understood about the mechanisms underlying this success.
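To make the greedy procedure concrete, the following is a minimal sketch (not the paper's implementation) of layer-wise pre-training with plain tied-weight auto-encoders followed by supervised fine-tuning of the whole stack; the layer sizes, learning rates, epoch counts, and the use of simple auto-encoders instead of RBMs or denoising auto-encoders are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=100):
    """Unsupervised step: fit one tied-weight auto-encoder to reconstruct X."""
    n, n_in = X.shape
    W = rng.normal(0.0, 0.01, (n_in, n_hidden))
    b_h, b_r = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b_h)              # encoder
        R = sigmoid(H @ W.T + b_r)            # decoder (tied weights)
        d_r = (R - X) * R * (1 - R)           # backprop of squared reconstruction error
        d_h = (d_r @ W) * H * (1 - H)
        W -= lr / n * (X.T @ d_h + d_r.T @ H)
        b_h -= lr / n * d_h.sum(0)
        b_r -= lr / n * d_r.sum(0)
    return W, b_h

def pretrain_stack(X, layer_sizes):
    """Greedy phase: train layers one at a time, each on the previous layer's output."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)
        params.append([W, b])
        H = sigmoid(H @ W + b)                # representation fed to the next layer
    return params

def finetune(params, X, y, n_classes, lr=0.1, epochs=200):
    """Supervised phase: add a softmax output and backprop through the whole stack."""
    n = X.shape[0]
    V = rng.normal(0.0, 0.01, (params[-1][0].shape[1], n_classes))
    c = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot targets
    for _ in range(epochs):
        acts = [X]
        for W, b in params:                   # forward pass through pre-trained layers
            acts.append(sigmoid(acts[-1] @ W + b))
        logits = acts[-1] @ V + c
        P = np.exp(logits - logits.max(1, keepdims=True))
        P /= P.sum(1, keepdims=True)
        d = (P - Y) / n                       # cross-entropy gradient at the output
        gV, gc = acts[-1].T @ d, d.sum(0)
        d = (d @ V.T) * acts[-1] * (1 - acts[-1])
        V, c = V - lr * gV, c - lr * gc
        for i in range(len(params) - 1, -1, -1):
            W, b = params[i]
            gW, gb = acts[i].T @ d, d.sum(0)
            if i > 0:                         # propagate before overwriting this layer
                d = (d @ W.T) * acts[i] * (1 - acts[i])
            params[i][0], params[i][1] = W - lr * gW, b - lr * gb
    return params, V, c

# Toy usage on synthetic data (shapes and sizes are illustrative only).
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
stack = pretrain_stack(X, layer_sizes=[16, 8])
stack, V, c = finetune(stack, X, y, n_classes=2)
```

Calling finetune directly from small random weights, without the pretrain_stack step, corresponds to the standard training scheme against which pre-training is compared throughout the paper.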
The objective of this paper is to explore, through extensive experimentation, how unsupervised
pre-training works to render learning deep architectures more effective and why it appears to
work so much better than traditional neural network training methods. There are a few reasonable
hypotheses why unsupervised pre-training might work. One possibility is that unsupervised pre-
training acts as a kind of network pre-conditioner, putting the parameter values in the appropriate
range for further supervised training. Another possibility, suggested by Bengio et al. (2007), is that
unsupervised pre-training initializes the model to a point in parameter space that somehow renders
the optimization process more effective, in the sense of achieving a lower minimum of the empirical
cost function.
Here, we argue that our experiments support a view of unsupervised pre-training as an unusual
form of regularization: minimizing variance and introducing bias towards configurations of the pa-
rameter space that are useful for unsupervised learning. This perspective places unsupervised pre-
training well within the family of recently developed semi-supervised methods. The unsupervised
pre-training approach is, however, unique among semi-supervised training strategies in that it acts by
defining a particular initialization point for standard supervised training rather than either modifying
the supervised objective function (Barron, 1991) or explicitly imposing constraints on the parame-
ters throughout training (Lasserre et al., 2006). This type of initialization-as-regularization strategy
has precedence in the neural networks literature, in the shape of the early stopping idea (Sjöberg
and Ljung, 1995; Amari et al., 1997), and in the Hidden Markov Models (HMM) community (Bahl
et al., 1986; Povey and Woodland, 2002) where it was found that first training an HMM as a genera-
tive model was essential (as an initialization step) before fine-tuning it discriminatively. We suggest
that, in the highly non-convex situation of training a deep architecture, defining a particular initial-
ization point implicitly imposes constraints on the parameters in that it specifies which minima (out
of a very large number of possible minima) of the cost function are allowed. In this way, it may
be possible to think of unsupervised pre-training as being related to the approach of Lasserre et al.
(2006).
Another important and distinct property of the unsupervised pre-training strategy is that in the
standard situation of training using stochastic gradient descent, the beneficial generalization effects
due to pre-training do not appear to diminish as the number of labeled examples grows very large.
We argue that this is a consequence of the combination of the non-convexity (multi-modality) of the
objective function and the dependency of the stochastic gradient descent method on example order-
ing. We find that early changes in the parameters have a greater impact on the final region (basin
of attraction of the descent procedure) in which the learner ends up. In particular, unsupervised
pre-training sets the parameters in a region from which better basins of attraction can be reached, in
terms of generalization. Hence, although unsupervised pre-training is a regularizer, it can have a
positive effect on the training objective when the number of training examples is large.
As previously stated, this paper is concerned with an experimental assessment of the various
competing hypotheses regarding the role of unsupervised pre-training in the recent success of deep
learning methods. To this end, we present a series of experiments designed to pit these hypotheses
against one another in an attempt to resolve some of the mystery surrounding the effectiveness of
unsupervised pre-training.
In the first set of experiments (in Section 6), we establish the effect of unsupervised pre-training
on improving the generalization error of trained deep architectures. In this section we also exploit
dimensionality reduction techniques to illustrate how unsupervised pre-training affects the location
of minima in parameter space.
In the second set of experiments (in Section 7), we directly compare the two alternative hy-
potheses (pre-training as a pre-conditioner; and pre-training as an optimization scheme) against
the hypothesis that unsupervised pre-training is a regularization strategy. In the final set of experiments (in Section 8), we explore the role of unsupervised pre-training in the online learning setting,
where the number of available training examples grows very large. In these experiments, we test
key aspects of our hypothesis relating to the topology of the cost function and the role of unsuper-
vised pre-training in manipulating the region of parameter space from which supervised training is
initiated.
Before delving into the experiments, we begin with a more in-depth view of the challenges in
training deep architectures and how we believe unsupervised pre-training works towards overcom-
ing these challenges.
2. The Challenges of Deep Learning
In this section, we present a perspective on why standard training of deep models through gradient
backpropagation appears to be so difficult. First, it is important to establish what we mean in stating
that training is difficult.
We believe the central challenge in training deep architectures is dealing with the strong depen-
dencies that exist during training between the parameters across layers. One way to conceive the
difficulty of the problem is that we must simultaneously:
1. adapt the lower layers in order to provide adequate input to the final (end of training) setting
of the upper layers
2. adapt the upper layers to make good use of the final (end of training) setting of the lower
layers.
The second problem is easy on its own (i.e., when the final setting of the other layers is known). It is
not clear how difficult the first one is, and we conjecture that a particular difficulty arises when both
sets of layers must be learned jointly, as the gradient of the objective function is limited to a local
measure given the current setting of other parameters. Furthermore, because with enough capacity
the top two layers can easily overfit the training set, training error does not necessarily reveal the
difficulty in optimizing the lower layers. As shown in our experiments here, the standard training
schemes tend to place the parameters in regions of the parameter space that generalize poorly.
A separate but related issue appears if we focus our consideration of traditional training methods
for deep architectures on stochastic gradient descent. A sequence of examples along with an online
gradient descent procedure defines a trajectory in parameter space, which converges in some sense
(the error does not improve anymore, maybe because we are near a local minimum). The hypothesis
is that small perturbations of that trajectory (either by initialization or by changes in which examples
are seen when) have more effect early on. Early in the process of following the stochastic gradient,
changes in the weights tend to increase their magnitude and, consequently, the amount of non-
linearity of the network increases. As this happens, the set of regions accessible by stochastic
gradient descent on samples of the training distribution becomes smaller. Early on in training, small
perturbations allow the model parameters to switch from one basin to a nearby one, whereas later
on (typically with larger parameter values), it is unlikely to “escape” from such a basin of attraction.
Hence the early examples can have a larger influence and, in practice, trap the model parameters in
particular regions of parameter space that correspond to the specific and arbitrary ordering of the
training examples.¹ An important consequence of this phenomenon is that even in the presence of
a very large (effectively infinite) amount of supervised data, stochastic gradient descent is subject
to a degree of overfitting to the training data presented early in the training process. In that sense,
unsupervised pre-training interacts intimately with the optimization process, and when the number
of training examples becomes large, its positive effect is seen not only on generalization error but
also on training error.
1. This process seems similar to the “critical period” phenomena observed in neuroscience and psychology (Bornstein,
1987).
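As a toy probe of this ordering effect (not an experiment from the paper; the network size, data, and learning rate are arbitrary assumptions), one can train the same small network twice from the same initialization and compare how far the final parameters move when the example order is perturbed during the first epoch versus during the last one:

```python
import numpy as np

def run_sgd(X, y, swap_epoch=None, lr=0.5, epochs=50, seed=0):
    """Plain per-example SGD on a one-hidden-layer net. All runs share the same
    initialization and sweep order, except that during `swap_epoch` the examples
    are presented in reversed order."""
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)
    W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)
    for epoch in range(epochs):
        idx = np.arange(len(X))
        if epoch == swap_epoch:
            idx = idx[::-1]
        for i in idx:
            x, t = X[i:i + 1], y[i:i + 1]
            h = np.tanh(x @ W1 + b1)
            p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
            d2 = p - t                               # logistic-loss gradient at the output
            d1 = (d2 @ W2.T) * (1 - h ** 2)
            W2 -= lr * h.T @ d2
            b2 -= lr * d2.ravel()
            W1 -= lr * x.T @ d1
            b1 -= lr * d1.ravel()
    return np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # a non-linearly-separable toy task

baseline = run_sgd(X, y)
print("reordering the FIRST epoch moves the solution by",
      np.linalg.norm(run_sgd(X, y, swap_epoch=0) - baseline))
print("reordering the LAST epoch moves the solution by",
      np.linalg.norm(run_sgd(X, y, swap_epoch=49) - baseline))
```

The hypothesis above predicts the first distance to be markedly larger on most seeds; the snippet only illustrates the protocol and does not reproduce the paper's experiments.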
3. Unsupervised Pre-training Acts as a Regularizer
As stated in the introduction, we believe that greedy layer-wise unsupervised pre-training overcomes
the challenges of deep learning by introducing a useful prior to the supervised fine-tuning training
procedure. We claim that the regularization effect is a consequence of the pre-training procedure
establishing an initialization point of the fine-tuning procedure inside a region of parameter space
in which the parameters are henceforth restricted. The parameters are restricted to a relatively small
volume of parameter space that is delineated by the boundary of the local basin of attraction of the
supervised fine-tuning cost function.
The pre-training procedure increases the magnitude of the weights, and in standard deep models with a
sigmoidal nonlinearity this has the effect of rendering both the function more nonlinear and
the cost function locally more complicated, with more topological features such as peaks, troughs
and plateaus. The existence of these topological features makes it locally more difficult for a
gradient descent procedure to travel significant distances in parameter space. This is the core of the
restrictive property imposed by the pre-training procedure and hence the basis of its regularizing
properties.
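One concrete way to see why larger weights make it harder for gradient descent to travel far (a toy numeric illustration, not taken from the paper): with Gaussian inputs, increasing the scale of a sigmoid unit's incoming weights pushes most pre-activations into the saturated region, where the local derivative h(1-h) collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # toy inputs; dimensions are arbitrary

for scale in (0.1, 1.0, 5.0):
    w = rng.normal(0.0, scale, 20)        # incoming weights of growing magnitude
    h = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid unit output
    g = h * (1 - h)                       # derivative of the output wrt the pre-activation
    print(f"weight std {scale:>4}: mean derivative {g.mean():.4f}, "
          f"saturated fraction {(g < 0.01).mean():.2f}")
```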
But unsupervised pre-training restricts the parameters to particular regions: those that corre-
spond to capturing structure in the input distribution P(X). To simply state that unsupervised pre-
training is a regularization strategy somewhat undermines the significance of its effectiveness. Not
all regularizers are created equal and, in comparison to standard regularization schemes such as
L₁ and L₂ parameter penalization, unsupervised pre-training is dramatically effective. We believe
the credit for its success can be attributed to the unsupervised training criteria optimized during
unsupervised pre-training.
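To contrast the two kinds of regularizer in code terms, here is a schematic sketch (a linear model is used purely as a stand-in, and the function names are ours): an L₂ penalty reshapes the gradient at every step, whereas initialization-as-regularization leaves the supervised objective untouched and acts only through the starting point.

```python
import numpy as np

def grad_supervised(theta, batch):
    """Gradient of the supervised cost; a linear least-squares stand-in."""
    X, y = batch
    return X.T @ (X @ theta - y) / len(y)

def train_with_l2(theta0, batches, lr=0.1, lam=1e-2):
    # Classical regularizer: the penalty term lam * ||theta||^2 alters the
    # objective (and hence the gradient) at every update.
    theta = theta0.copy()
    for batch in batches:
        theta -= lr * (grad_supervised(theta, batch) + 2 * lam * theta)
    return theta

def train_from_pretrained(theta_pretrained, batches, lr=0.1):
    # Initialization-as-regularization: the objective is the plain supervised
    # cost; the restriction comes from starting inside the basin of attraction
    # selected by unsupervised pre-training.
    theta = theta_pretrained.copy()
    for batch in batches:
        theta -= lr * grad_supervised(theta, batch)
    return theta
```

The structural point is that the second loop is identical to unregularized training; only the value of its starting parameter vector differs.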
During each phase of the greedy unsupervised training strategy, layers are trained to represent
the dominant factors of variation extant in the data. This has the effect of leveraging knowledge
of X to form, at each layer, a representation of X consisting of statistically reliable features of
X that can then be used to predict the output (usually a class label) Y. This perspective places
unsupervised pre-training well within the family of learning strategies collectively known as semi-
supervised methods. As with other recent work demonstrating the effectiveness of semi-supervised
methods in regularizing model parameters, we claim that the effectiveness of the unsupervised pre-
training strategy is limited to the extent that learning P(X) is helpful in learning P(Y|X). Here,
we find transformations of X—learned features—that are predictive of the main factors of variation
in P(X), and when the pre-training strategy is effective,² some of these learned features of X are
also predictive of Y. In the context of deep learning, the greedy unsupervised strategy may also
have a special function. To some degree it resolves the problem of simultaneously learning the
parameters at all layers (mentioned in Section 2) by introducing a proxy criterion. This proxy
criterion encourages significant factors of variation, present in the input data, to be represented in
intermediate layers.
To clarify this line of reasoning, we can formalize the effect of unsupervised pre-training in
inducing a prior distribution over the parameters. Let us assume that parameters are forced to be
chosen in a bounded region S ⊂ ℝ^d. Let S be split in regions {R_k} that are the basins of attraction
of descent procedures in the training error (note that {R_k} depends on the training set, but the
dependency decreases as the number of examples increases). We have ∪_k R_k = S and R_i ∩ R_j = ∅
for i ≠ j. Let v_k = ∫ 1_{θ∈R_k} dθ be the volume associated with region R_k (where θ are our model's
2. Acting as a form of (data-dependent) “prior” on the parameters, as we are about to formalize.
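The remainder of this passage is cut off in the excerpt; as a hedged sketch of where the formalization is headed, the prior induced by an initialization scheme can be written as a mixture that is uniform within each basin, with basin weights r_k (random initialization) or π_k (pre-training). These symbols are introduced here for illustration and may not match the paper's exact notation.

```latex
% Illustrative formalization (symbols r_k and \pi_k are our assumptions):
% each initialization scheme induces a data-dependent prior over parameters
% that is uniform inside every basin of attraction R_k.
\[
  P_{\text{random}}(\theta) = \sum_k \mathbf{1}_{\theta \in R_k}\,\frac{r_k}{v_k},
  \qquad
  P_{\text{pre-train}}(\theta) = \sum_k \mathbf{1}_{\theta \in R_k}\,\frac{\pi_k}{v_k},
  \qquad
  \sum_k r_k = \sum_k \pi_k = 1 .
\]
```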