Journal of Machine Learning Research 11 (2010) 625-660 Submitted 8/09; Published 2/10
Why Does Unsupervised Pre-training Help Deep Learning?
Dumitru Erhan∗ DUMITRU.ERHAN@UMONTREAL.CA
Yoshua Bengio YOSHUA.BENGIO@UMONTREAL.CA
Aaron Courville AARON.COURVILLE@UMONTREAL.CA
Pierre-Antoine Manzagol PIERRE-ANTOINE.MANZAGOL@UMONTREAL.CA
Pascal Vincent PASCAL.VINCENT@UMONTREAL.CA
Département d’informatique et de recherche opérationnelle
Université de Montréal
2920, chemin de la Tour
Montréal, Québec, H3T 1J8, Canada
Samy Bengio BENGIO@GOOGLE.COM
Google Research
1600 Amphitheatre Parkway
Mountain View, CA, 94043, USA
Editor: Léon Bottou
Abstract
Much recent research has been devoted to learning algorithms for deep architectures such as Deep
Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several
areas, mostly on vision and language data sets. The best results obtained on supervised learning
tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase.
Even though these new algorithms have enabled training deep models, many questions remain as to
the nature of this difficult learning problem. The main question investigated here is the following:
how does unsupervised pre-training work? Answering this question is important if learning in
deep architectures is to be further improved. We propose several explanatory hypotheses and test
them through extensive simulations. We empirically show the influence of pre-training with respect
to architecture depth, model capacity, and number of training examples. The experiments confirm
and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-
training guides the learning towards basins of attraction of minima that support better generalization
from the training data set; the evidence from these results supports a regularization explanation for
the effect of pre-training.
Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising
auto-encoders, non-convex optimization
1. Introduction
Deep learning methods aim at learning feature hierarchies with features from higher levels of the
hierarchy formed by the composition of lower level features. They include learning methods for a
wide array of deep architectures (Bengio, 2009 provides a survey), including neural networks with
many hidden layers (Bengio et al., 2007; Ranzato et al., 2007; Vincent et al., 2008; Collobert and
Weston, 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006),
∗. Part of this work was done while Dumitru Erhan was at Google Research.
© 2010 Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent and Samy Bengio.
among others (Zhu et al., 2009; Weston et al., 2008). Theoretical results (Yao, 1985; Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006), reviewed and discussed by Bengio and LeCun
(2007), suggest that in order to learn the kind of complicated functions that can represent high-level
abstractions (e.g., in vision, language, and other AI-level tasks), one may need deep architectures.
The recent surge in experimental work in the field seems to support this notion, accumulating evi-
dence that in challenging AI-related tasks—such as computer vision (Bengio et al., 2007; Ranzato
et al., 2007; Larochelle et al., 2007; Ranzato et al., 2008; Lee et al., 2009; Mobahi et al., 2009; Osin-
dero and Hinton, 2008), natural language processing (NLP) (Collobert and Weston, 2008; Weston
et al., 2008), robotics (Hadsell et al., 2008), or information retrieval (Salakhutdinov and Hinton,
2007; Salakhutdinov et al., 2007)—deep learning methods significantly out-perform comparable
but shallow competitors, and often match or beat the state-of-the-art.
These recent demonstrations of the potential of deep learning algorithms were achieved despite
the serious challenge of training models with many layers of adaptive parameters. In virtually all
instances of deep learning, the objective function is a highly non-convex function of the parameters,
with the potential for many distinct local minima in the model parameter space. The principal
difficulty is that not all of these minima provide equivalent generalization errors and, we suggest,
that for deep architectures, the standard training schemes (based on random initialization) tend to
place the parameters in regions of the parameter space that generalize poorly—as was frequently
observed empirically but rarely reported (Bengio and LeCun, 2007).
The breakthrough to effective training strategies for deep architectures came in 2006 with
the algorithms for training deep belief networks (DBN) (Hinton et al., 2006) and stacked auto-
encoders (Ranzato et al., 2007; Bengio et al., 2007), which are all based on a similar approach:
greedy layer-wise unsupervised pre-training followed by supervised fine-tuning. Each layer is pre-
trained with an unsupervised learning algorithm, learning a nonlinear transformation of its input
(the output of the previous layer) that captures the main variations in its input. This unsupervised
pre-training sets the stage for a final training phase where the deep architecture is fine-tuned with
respect to a supervised training criterion with gradient-based optimization. While the improvement
in performance of trained deep models offered by the pre-training strategy is impressive, little is
understood about the mechanisms underlying this success.
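To make the greedy procedure concrete, the following is a minimal sketch (not the paper's implementation) of layer-wise pre-training with plain tied-weight auto-encoders followed by supervised fine-tuning of the whole stack; the layer sizes, learning rates, epoch counts, and the use of simple auto-encoders instead of RBMs or denoising auto-encoders are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=100):
    """Unsupervised step: fit one tied-weight auto-encoder to reconstruct X."""
    n, n_in = X.shape
    W = rng.normal(0.0, 0.01, (n_in, n_hidden))
    b_h, b_r = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b_h)              # encoder
        R = sigmoid(H @ W.T + b_r)            # decoder (tied weights)
        d_r = (R - X) * R * (1 - R)           # backprop of squared reconstruction error
        d_h = (d_r @ W) * H * (1 - H)
        W -= lr / n * (X.T @ d_h + d_r.T @ H)
        b_h -= lr / n * d_h.sum(0)
        b_r -= lr / n * d_r.sum(0)
    return W, b_h

def pretrain_stack(X, layer_sizes):
    """Greedy phase: train layers one at a time, each on the previous layer's output."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)
        params.append([W, b])
        H = sigmoid(H @ W + b)                # representation fed to the next layer
    return params

def finetune(params, X, y, n_classes, lr=0.1, epochs=200):
    """Supervised phase: add a softmax output and backprop through the whole stack."""
    n = X.shape[0]
    V = rng.normal(0.0, 0.01, (params[-1][0].shape[1], n_classes))
    c = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot targets
    for _ in range(epochs):
        acts = [X]
        for W, b in params:                   # forward pass through pre-trained layers
            acts.append(sigmoid(acts[-1] @ W + b))
        logits = acts[-1] @ V + c
        P = np.exp(logits - logits.max(1, keepdims=True))
        P /= P.sum(1, keepdims=True)
        d = (P - Y) / n                       # cross-entropy gradient at the output
        gV, gc = acts[-1].T @ d, d.sum(0)
        d = (d @ V.T) * acts[-1] * (1 - acts[-1])
        V, c = V - lr * gV, c - lr * gc
        for i in range(len(params) - 1, -1, -1):
            W, b = params[i]
            gW, gb = acts[i].T @ d, d.sum(0)
            if i > 0:                         # propagate before overwriting this layer
                d = (d @ W.T) * acts[i] * (1 - acts[i])
            params[i][0], params[i][1] = W - lr * gW, b - lr * gb
    return params, V, c

# Toy usage on synthetic data (shapes and sizes are illustrative only).
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
stack = pretrain_stack(X, layer_sizes=[16, 8])
stack, V, c = finetune(stack, X, y, n_classes=2)
```

Calling finetune directly from small random weights, without the pretrain_stack step, corresponds to the standard training scheme against which pre-training is compared throughout the paper.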
The objective of this paper is to explore, through extensive experimentation, how unsupervised
pre-training works to render learning deep architectures more effective and why it appears to
work so much better than traditional neural network training methods. There are a few reasonable
hypotheses why unsupervised pre-training might work. One possibility is that unsupervised pre-
training acts as a kind of network pre-conditioner, putting the parameter values in the appropriate
range for further supervised training. Another possibility, suggested by Bengio et al. (2007), is that
unsupervised pre-training initializes the model to a point in parameter space that somehow renders
the optimization process more effective, in the sense of achieving a lower minimum of the empirical
cost function.
Here, we argue that our experiments support a view of unsupervised pre-training as an unusual
form of regularization: minimizing variance and introducing bias towards configurations of the pa-
rameter space that are useful for unsupervised learning. This perspective places unsupervised pre-
training well within the family of recently developed semi-supervised methods. The unsupervised
pre-training approach is, however, unique among semi-supervised training strategies in that it acts by
defining a particular initialization point for standard supervised training rather than either modifying
the supervised objective function (Barron, 1991) or explicitly imposing constraints on the parame-
ters throughout training (Lasserre et al., 2006). This type of initialization-as-regularization strategy
has precedence in the neural networks literature, in the shape of the early stopping idea (Sjöberg
and Ljung, 1995; Amari et al., 1997), and in the Hidden Markov Models (HMM) community (Bahl
et al., 1986; Povey and Woodland, 2002) where it was found that first training an HMM as a genera-
tive model was essential (as an initialization step) before fine-tuning it discriminatively. We suggest
that, in the highly non-convex situation of training a deep architecture, defining a particular initial-
ization point implicitly imposes constraints on the parameters in that it specifies which minima (out
of a very large number of possible minima) of the cost function are allowed. In this way, it may
be possible to think of unsupervised pre-training as being related to the approach of Lasserre et al.
(2006).
Another important and distinct property of the unsupervised pre-training strategy is that in the
standard situation of training using stochastic gradient descent, the beneficial generalization effects
due to pre-training do not appear to diminish as the number of labeled examples grows very large.
We argue that this is a consequence of the combination of the non-convexity (multi-modality) of the
objective function and the dependency of the stochastic gradient descent method on example order-
ing. We find that early changes in the parameters have a greater impact on the final region (basin
of attraction of the descent procedure) in which the learner ends up. In particular, unsupervised
pre-training sets the parameters in a region from which better basins of attraction can be reached, in
terms of generalization. Hence, although unsupervised pre-training is a regularizer, it can have a
positive effect on the training objective when the number of training examples is large.
As previously stated, this paper is concerned with an experimental assessment of the various
competing hypotheses regarding the role of unsupervised pre-training in the recent success of deep
learning methods. To this end, we present a series of experiments designed to pit these hypotheses
against one another in an attempt to resolve some of the mystery surrounding the effectiveness of
unsupervised pre-training.
In the first set of experiments (in Section 6), we establish the effect of unsupervised pre-training
on improving the generalization error of trained deep architectures. In this section we also exploit
dimensionality reduction techniques to illustrate how unsupervised pre-training affects the location
of minima in parameter space.
In the second set of experiments (in Section 7), we directly compare the two alternative hy-
potheses (pre-training as a pre-conditioner; and pre-training as an optimization scheme) against
the hypothesis that unsupervised pre-training is a regularization strategy. In the final set of experiments (in Section 8), we explore the role of unsupervised pre-training in the online learning setting,
where the number of available training examples grows very large. In these experiments, we test
key aspects of our hypothesis relating to the topology of the cost function and the role of unsuper-
vised pre-training in manipulating the region of parameter space from which supervised training is
initiated.
Before delving into the experiments, we begin with a more in-depth view of the challenges in
training deep architectures and how we believe unsupervised pre-training works towards overcom-
ing these challenges.
2. The Challenges of Deep Learning
In this section, we present a perspective on why standard training of deep models through gradient
backpropagation appears to be so difficult. First, it is important to establish what we mean in stating
that training is difficult.
We believe the central challenge in training deep architectures is dealing with the strong depen-
dencies that exist during training between the parameters across layers. One way to conceive the
difficulty of the problem is that we must simultaneously:
1. adapt the lower layers in order to provide adequate input to the final (end of training) setting
of the upper layers
2. adapt the upper layers to make good use of the final (end of training) setting of the lower
layers.
The second problem is easy on its own (i.e., when the final setting of the other layers is known). It is
not clear how difficult the first one is, and we conjecture that a particular difficulty arises when both
sets of layers must be learned jointly, as the gradient of the objective function is limited to a local
measure given the current setting of other parameters. Furthermore, because with enough capacity
the top two layers can easily overfit the training set, training error does not necessarily reveal the
difficulty in optimizing the lower layers. As shown in our experiments here, the standard training
schemes tend to place the parameters in regions of the parameter space that generalize poorly.
A separate but related issue appears if we focus our consideration of traditional training methods
for deep architectures on stochastic gradient descent. A sequence of examples along with an online
gradient descent procedure defines a trajectory in parameter space, which converges in some sense
(the error does not improve anymore, maybe because we are near a local minimum). The hypothesis
is that small perturbations of that trajectory (either by initialization or by changes in which examples
are seen when) have more effect early on. Early in the process of following the stochastic gradient,
changes in the weights tend to increase their magnitude and, consequently, the amount of non-
linearity of the network increases. As this happens, the set of regions accessible by stochastic
gradient descent on samples of the training distribution becomes smaller. Early on in training, small
perturbations allow the model parameters to switch from one basin to a nearby one, whereas later
on (typically with larger parameter values), it is unlikely to “escape” from such a basin of attraction.
Hence the early examples can have a larger influence and, in practice, trap the model parameters in
particular regions of parameter space that correspond to the specific and arbitrary ordering of the
training examples.¹ An important consequence of this phenomenon is that even in the presence of
a very large (effectively infinite) amount of supervised data, stochastic gradient descent is subject
to a degree of overfitting to the training data presented early in the training process. In that sense,
unsupervised pre-training interacts intimately with the optimization process, and when the number
of training examples becomes large, its positive effect is seen not only on generalization error but
also on training error.
1. This process seems similar to the “critical period” phenomena observed in neuroscience and psychology (Bornstein,
1987).
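As a toy probe of this ordering effect (not an experiment from the paper; the network size, data, and learning rate are arbitrary assumptions), one can train the same small network twice from the same initialization and compare how far the final parameters move when the example order is perturbed during the first epoch versus during the last one:

```python
import numpy as np

def run_sgd(X, y, swap_epoch=None, lr=0.5, epochs=50, seed=0):
    """Plain per-example SGD on a one-hidden-layer net. All runs share the same
    initialization and sweep order, except that during `swap_epoch` the examples
    are presented in reversed order."""
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)
    W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)
    for epoch in range(epochs):
        idx = np.arange(len(X))
        if epoch == swap_epoch:
            idx = idx[::-1]
        for i in idx:
            x, t = X[i:i + 1], y[i:i + 1]
            h = np.tanh(x @ W1 + b1)
            p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
            d2 = p - t                               # logistic-loss gradient at the output
            d1 = (d2 @ W2.T) * (1 - h ** 2)
            W2 -= lr * h.T @ d2
            b2 -= lr * d2.ravel()
            W1 -= lr * x.T @ d1
            b1 -= lr * d1.ravel()
    return np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # a non-linearly-separable toy task

baseline = run_sgd(X, y)
print("reordering the FIRST epoch moves the solution by",
      np.linalg.norm(run_sgd(X, y, swap_epoch=0) - baseline))
print("reordering the LAST epoch moves the solution by",
      np.linalg.norm(run_sgd(X, y, swap_epoch=49) - baseline))
```

The hypothesis above predicts the first distance to be markedly larger on most seeds; the snippet only illustrates the protocol and does not reproduce the paper's experiments.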
3. Unsupervised Pre-training Acts as a Regularizer
As stated in the introduction, we believe that greedy layer-wise unsupervised pre-training overcomes
the challenges of deep learning by introducing a useful prior to the supervised fine-tuning training
procedure. We claim that the regularization effect is a consequence of the pre-training procedure
establishing an initialization point of the fine-tuning procedure inside a region of parameter space
in which the parameters are henceforth restricted. The parameters are restricted to a relatively small
volume of parameter space that is delineated by the boundary of the local basin of attraction of the
supervised fine-tuning cost function.
The pre-training procedure increases the magnitude of the weights, and in standard deep models with a
sigmoidal nonlinearity this has the effect of rendering both the function more nonlinear and
the cost function locally more complicated, with more topological features such as peaks, troughs
and plateaus. The existence of these topological features makes it locally more difficult for a
gradient descent procedure to travel significant distances in parameter space. This is the core of the
restrictive property imposed by the pre-training procedure and hence the basis of its regularizing
properties.
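One concrete way to see why larger weights make it harder for gradient descent to travel far (a toy numeric illustration, not taken from the paper): with Gaussian inputs, increasing the scale of a sigmoid unit's incoming weights pushes most pre-activations into the saturated region, where the local derivative h(1-h) collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # toy inputs; dimensions are arbitrary

for scale in (0.1, 1.0, 5.0):
    w = rng.normal(0.0, scale, 20)        # incoming weights of growing magnitude
    h = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid unit output
    g = h * (1 - h)                       # derivative of the output wrt the pre-activation
    print(f"weight std {scale:>4}: mean derivative {g.mean():.4f}, "
          f"saturated fraction {(g < 0.01).mean():.2f}")
```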
But unsupervised pre-training restricts the parameters to particular regions: those that corre-
spond to capturing structure in the input distribution P(X). To simply state that unsupervised pre-
training is a regularization strategy somewhat undermines the significance of its effectiveness. Not
all regularizers are created equal and, in comparison to standard regularization schemes such as
L₁ and L₂ parameter penalization, unsupervised pre-training is dramatically effective. We believe
the credit for its success can be attributed to the unsupervised training criteria optimized during
unsupervised pre-training.
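To contrast the two kinds of regularizer in code terms, here is a schematic sketch (a linear model is used purely as a stand-in, and the function names are ours): an L₂ penalty reshapes the gradient at every step, whereas initialization-as-regularization leaves the supervised objective untouched and acts only through the starting point.

```python
import numpy as np

def grad_supervised(theta, batch):
    """Gradient of the supervised cost; a linear least-squares stand-in."""
    X, y = batch
    return X.T @ (X @ theta - y) / len(y)

def train_with_l2(theta0, batches, lr=0.1, lam=1e-2):
    # Classical regularizer: the penalty term lam * ||theta||^2 alters the
    # objective (and hence the gradient) at every update.
    theta = theta0.copy()
    for batch in batches:
        theta -= lr * (grad_supervised(theta, batch) + 2 * lam * theta)
    return theta

def train_from_pretrained(theta_pretrained, batches, lr=0.1):
    # Initialization-as-regularization: the objective is the plain supervised
    # cost; the restriction comes from starting inside the basin of attraction
    # selected by unsupervised pre-training.
    theta = theta_pretrained.copy()
    for batch in batches:
        theta -= lr * grad_supervised(theta, batch)
    return theta
```

The structural point is that the second loop is identical to unregularized training; only the value of its starting parameter vector differs.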
During each phase of the greedy unsupervised training strategy, layers are trained to represent
the dominant factors of variation extant in the data. This has the effect of leveraging knowledge
of X to form, at each layer, a representation of X consisting of statistically reliable features of
X that can then be used to predict the output (usually a class label) Y. This perspective places
unsupervised pre-training well within the family of learning strategies collectively known as semi-
supervised methods. As with other recent work demonstrating the effectiveness of semi-supervised
methods in regularizing model parameters, we claim that the effectiveness of the unsupervised pre-
training strategy is limited to the extent that learning P(X) is helpful in learning P(Y|X). Here,
we find transformations of X—learned features—that are predictive of the main factors of variation
in P(X), and when the pre-training strategy is effective,² some of these learned features of X are
also predictive of Y. In the context of deep learning, the greedy unsupervised strategy may also
have a special function. To some degree it resolves the problem of simultaneously learning the
parameters at all layers (mentioned in Section 2) by introducing a proxy criterion. This proxy
criterion encourages significant factors of variation, present in the input data, to be represented in
intermediate layers.
To clarify this line of reasoning, we can formalize the effect of unsupervised pre-training in
inducing a prior distribution over the parameters. Let us assume that parameters are forced to be
chosen in a bounded region S ⊂ ℝ^d. Let S be split in regions {R_k} that are the basins of attraction
of descent procedures in the training error (note that {R_k} depends on the training set, but the
dependency decreases as the number of examples increases). We have ∪_k R_k = S and R_i ∩ R_j = ∅
for i ≠ j. Let v_k = ∫ 1_{θ∈R_k} dθ be the volume associated with region R_k (where θ are our model's
2. Acting as a form of (data-dependent) “prior” on the parameters, as we are about to formalize.
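The remainder of this passage is cut off in the excerpt; as a hedged sketch of where the formalization is headed, the prior induced by an initialization scheme can be written as a mixture that is uniform within each basin, with basin weights r_k (random initialization) or π_k (pre-training). These symbols are introduced here for illustration and may not match the paper's exact notation.

```latex
% Illustrative formalization (symbols r_k and \pi_k are our assumptions):
% each initialization scheme induces a data-dependent prior over parameters
% that is uniform inside every basin of attraction R_k.
\[
  P_{\text{random}}(\theta) = \sum_k \mathbf{1}_{\theta \in R_k}\,\frac{r_k}{v_k},
  \qquad
  P_{\text{pre-train}}(\theta) = \sum_k \mathbf{1}_{\theta \in R_k}\,\frac{\pi_k}{v_k},
  \qquad
  \sum_k r_k = \sum_k \pi_k = 1 .
\]
```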