arXiv:1206.5533v2 [cs.LG] 16 Sep 2012

Practical Recommendations for Gradient-Based Training of Deep Architectures

Yoshua Bengio

Version 2, Sept. 16th, 2012
Abstract

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.
1 Introduction

Following a decade of lower activity, research in artificial neural networks was revived after a 2006 breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) in the area of Deep Learning, based on greedy layer-wise unsupervised pre-training of each layer of features. See (Bengio, 2009) for a review. Many of the practical recommendations that justified the previous edition of this book are still valid, and new elements were added, while some survived longer by virtue of the practical advantages they provided. The panorama presented in this chapter regards some of these surviving or novel elements of practice, focusing on learning algorithms aiming at training deep neural networks, but leaving most of the material specific to the Boltzmann machine family to another chapter (Hinton, 2013).

Although such recommendations come out of a living practice that emerged from years of experimentation and to some extent mathematical justification, they should be challenged. They constitute a good starting point for the experimenter and user of learning algorithms but very often have not been formally validated, leaving open many questions that can be answered either by theoretical analysis or by solid comparative experimental work (ideally by both). A good indication of the need for such validation is that different researchers and research groups do not always agree on the practice of training neural networks.
Several of the recommendations presented here can be found implemented in the Deep Learning Tutorials [1] and in the related Pylearn2 library [2], all based on the Theano library (discussed below) written in the Python programming language.

The 2006 Deep Learning breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) centered on the use of unsupervised representation learning to help learning internal representations [3] by providing a local training signal at each level of a hierarchy of features [4].

[1] http://deeplearning.net/tutorial/
[2] http://deeplearning.net/software/pylearn2
[3] A neural network computes a sequence of data transformations, each step encoding the raw input into an intermediate or internal representation, in principle to make the prediction or modeling task of interest easier.
[4] In standard multi-layer neural networks trained using back-propagated gradients, the only signal that drives parameter updates is provided at the output of the network (and then propagated backwards). Some unsupervised learning algorithms provide a local source of guidance for the parameter update in each layer, based only on the inputs and outputs of that layer.
Unsupervised representation learning algorithms can be applied several times to learn different layers of a deep model. Several unsupervised representation learning algorithms have been proposed since then. Those covered in this chapter (such as auto-encoder variants) retain many of the properties of artificial multi-layer neural networks, relying on the back-propagation algorithm to estimate stochastic gradients. Deep Learning algorithms such as those based on the Boltzmann machine and those based on auto-encoder or sparse coding variants often include a supervised fine-tuning stage. This supervised fine-tuning, as well as the gradient descent performed with auto-encoder variants, also involves the back-propagation algorithm, just as when training deterministic feedforward or recurrent artificial neural networks. Hence this chapter also includes recommendations for training ordinary supervised deterministic neural networks, or more generally, most machine learning algorithms relying on iterative gradient-based optimization of a parametrized learner with respect to an explicit training criterion.
This chapter assumes that the reader already understands the standard algorithms for training supervised multi-layer neural networks, with the loss gradient computed thanks to the back-propagation algorithm (Rumelhart et al., 1986). It starts by explaining basic concepts behind Deep Learning and the greedy layer-wise pretraining strategy (Section 1.1), and recent unsupervised pre-training algorithms (denoising and contractive auto-encoders) that are closely related in the way they are trained to standard multi-layer neural networks (Section 1.2). It then reviews in Section 2 basic concepts in iterative gradient-based optimization and in particular the stochastic gradient method, gradient computation with a flow graph, and automatic differentiation. The main section of this chapter is Section 3, which explains hyper-parameters in general, their optimization, and specifically covers the main hyper-parameters of neural networks. Section 4 briefly describes simple ideas and methods to debug and visualize neural networks, while Section 5 covers parallelism, sparse high-dimensional inputs, symbolic inputs and embeddings, and multi-relational learning. The chapter closes (Section 6) with open questions on the difficulty of training deep architectures and improving the optimization methods for neural networks.
1.1 Deep Learning and Greedy Layer-Wise Pretraining

The notion of reuse, which explains the power of distributed representations (Bengio, 2009), is also at the heart of the theoretical advantages behind Deep Learning. Complexity theory of circuits, e.g. (Håstad, 1986; Håstad and Goldmann, 1991), (which include neural networks as special cases) has much preceded the recent research on deep learning. The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor (Bengio, 2009). The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone non-linearity on top of an affine transformation), computation of a kernel, or logic gates. Theoretical results (Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006b; Bengio and LeCun, 2007; Bengio and Delalleau, 2011) clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep. If the same set of functions can be represented from within a family of architectures associated with a smaller VC-dimension (e.g. fewer hidden units [5]), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency and statistical efficiency.

[5] Note that in our experiments, deep architectures tend to generalize very well even when they have quite large numbers of parameters.
Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples, so long as the factors (unobserved random variables explaining the data) relevant to the questions we will ask later (e.g. classes to be predicted) are somehow salient in the input distribution itself. This is true under the manifold hypothesis, which states that natural classes and other high-level concepts in which humans are interested are associated with low-dimensional regions in input space (manifolds) near which the distribution concentrates, and that different class manifolds are well-separated by regions of very low density. It means that a small semantic change around a particular example can be captured by changing only a few numbers in a high-level abstract representation space. As a consequence, feature learning and Deep Learning are intimately related to principles of unsupervised learning, and they can work in the semi-supervised setting (where only a few examples are labeled), as well as in the transfer learning and multi-task settings (where we aim to generalize to new classes or tasks). The underlying hypothesis is that many of the underlying factors are shared across classes or tasks. Since representation learning aims to extract and isolate these factors, representations can be shared across classes and tasks.
One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training (Bengio et al., 2007). The idea, first introduced in Hinton et al. (2006), is to train one layer of a deep architecture at a time using unsupervised representation learning. Each level takes as input the representation learned at the previous level and learns a new representation. The learned representation(s) can then be used as input to predict variables of interest, for example to classify objects. After unsupervised pre-training, one can also perform supervised fine-tuning of the whole system [6], i.e., optimize not just the classifier but also the lower levels of the feature hierarchy with respect to some objective of interest. Combining unsupervised pre-training and supervised fine-tuning usually gives better generalization than pure supervised learning from a purely random initialization. The unsupervised representation learning algorithms for pre-training proposed in 2006 were the Restricted Boltzmann Machine or RBM (Hinton et al., 2006), the auto-encoder (Bengio et al., 2007) and a sparsifying form of auto-encoder similar to sparse coding (Ranzato et al., 2007).

[6] The whole system composes the computation of the representation with the computation of the predictor's output.
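To make the procedure concrete, here is a minimal sketch in Python/NumPy (not the Theano/Pylearn2 code referred to above; the function names, hyper-parameter values, and tied-weight auto-encoder choice are all illustrative assumptions) that pre-trains one level at a time, each level encoding the output of the previous one:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder_layer(X, n_hidden, lr=0.1, n_epochs=10):
    """Train one auto-encoder level on representations X; return encoder params."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_in, n_hidden))
    b = np.zeros(n_hidden)
    c = np.zeros(n_in)                  # decoder bias (tied weights: decoder uses W.T)
    for _ in range(n_epochs):
        for x in X:
            h = sigmoid(x @ W + b)      # encoder: h = f(x)
            r = sigmoid(h @ W.T + c)    # decoder: r = g(h)
            # hand-derived gradient of the squared reconstruction error 0.5*||r - x||^2
            dr = (r - x) * r * (1 - r)  # gradient wrt decoder pre-activation
            dh = (dr @ W) * h * (1 - h) # back-propagated to encoder pre-activation
            W -= lr * (np.outer(x, dh) + np.outer(dr, h))  # encoder + decoder terms
            b -= lr * dh
            c -= lr * dr
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train one layer at a time; each level encodes the previous level's output."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)          # representation fed to the next level
    return params, H                    # H can now feed a supervised predictor

X = rng.random((200, 32))               # toy data in [0, 1]
params, top_features = greedy_pretrain(X, layer_sizes=[16, 8])
print(top_features.shape)               # (200, 8)
```

After the loop, the top-level representation (together with the stacked encoder parameters) is what a supervised fine-tuning stage would then optimize jointly with a predictor.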
1.2 Denoising and Contractive Auto-Encoders

An auto-encoder has two parts: an encoder function f that maps the input x to a representation h = f(x), and a decoder function g that maps h back in the space of x in order to reconstruct x. In the regular auto-encoder the reconstruction function r(·) = g(f(·)) is trained to minimize the average value of a reconstruction loss on the training examples. Note that reconstruction loss should be high for most other input configurations [7]. The regularization mechanism makes sure that reconstruction cannot be perfect everywhere, while minimizing the reconstruction loss at training examples digs a hole in reconstruction error where the density of training examples is large. Examples of reconstruction loss functions include $||x - r(x)||^2$ (for real-valued inputs) and $-\sum_i \left[ x_i \log r_i(x) + (1 - x_i) \log(1 - r_i(x)) \right]$ (when interpreting $x_i$ as a bit or a probability of a binary event). Auto-encoders capture the input distribution by learning to better reconstruct more likely input configurations. The difference between the reconstruction vector and the input vector can be shown to be related to the log-density gradient as estimated by the learner (Vincent, 2011; Bengio et al., 2012) and the Jacobian matrix of the reconstruction with respect to the input gives information about the second derivative of the density, i.e., in which direction the density remains high when you are on a high-density manifold (Rifai et al., 2011a; Bengio et al., 2012).

[7] Different regularization mechanisms have been proposed to push reconstruction error up in low density areas: denoising criterion, contractive criterion, and code sparsity. It has been argued that such constraints play a role similar to the partition function for Boltzmann machines (Ranzato et al., 2008a).

In the Denoising Auto-Encoder (DAE) and the Contractive Auto-Encoder (CAE), the training procedure also introduces robustness (insensitivity to small variations), respectively in the reconstruction r(x) or in the representation f(x). In the DAE (Vincent et al., 2008, 2010), this is achieved by training with stochastically corrupted inputs, but trying to reconstruct the uncorrupted inputs. In the CAE (Rifai et al., 2011a), this is achieved by adding an explicit regularizing term to the training criterion, proportional to the norm of the Jacobian of the encoder, $\left\|\frac{\partial f(x)}{\partial x}\right\|^2$. But the CAE and the DAE are closely related (Bengio et al., 2012): when the noise is Gaussian and small, the denoising error minimized by the DAE is equivalent to minimizing the norm of the Jacobian of the reconstruction function r(·) = g(f(·)), whereas the CAE minimizes the norm of the Jacobian of the encoder f(·). Besides Gaussian noise, another interesting form of corruption has been very successful with DAEs: it is called the masking corruption and consists in randomly zeroing out a large fraction (like 20% or even 50%) of the inputs, where the zeroed-out subset is randomly selected for each example. In addition to the contractive effect, it forces the learned encoder to be able to rely only on an arbitrary subset of the input features.
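For the common case of a sigmoid encoder, h = sigmoid(Wx + b), the encoder Jacobian has the closed form diag(h ⊙ (1 − h)) W (Rifai et al., 2011a), so the CAE penalty can be computed cheaply without explicit differentiation. A small sketch under that assumption (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cae_penalty(x, W, b):
    """Squared Frobenius norm of the encoder Jacobian, ||df(x)/dx||^2.

    For h = sigmoid(W @ x + b) the Jacobian is diag(h * (1 - h)) @ W, so
    the penalty reduces to sum_i (h_i (1 - h_i))^2 * sum_j W_ij^2.
    """
    h = sigmoid(W @ x + b)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(8, 32)), np.zeros(8), rng.normal(size=32)
# This term would be added to the reconstruction loss, scaled by a coefficient.
print(cae_penalty(x, W, b))
```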
Another way to prevent the auto-encoder from perfectly reconstructing everywhere is to introduce a sparsity penalty on h, discussed below (Section 3.1).
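As an illustration of denoising training with the masking corruption described above, here is a hedged toy sketch (Python/NumPy, tied weights, using the cross-entropy reconstruction loss given earlier; the names and settings are assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, W, b, c, lr=0.1, corruption=0.3):
    """One stochastic gradient step of a toy denoising auto-encoder (tied weights)."""
    mask = rng.random(x.shape) > corruption  # masking corruption: zero a random subset
    x_tilde = x * mask
    h = sigmoid(x_tilde @ W + b)             # encode the *corrupted* input
    r = sigmoid(h @ W.T + c)                 # reconstruction
    # cross-entropy loss measured against the *uncorrupted* input x;
    # for a sigmoid output the gradient wrt the decoder pre-activation is r - x
    dr = r - x
    dh = (dr @ W) * h * (1 - h)
    W -= lr * (np.outer(x_tilde, dh) + np.outer(dr, h))
    b -= lr * dh
    c -= lr * dr
    return -np.sum(x * np.log(r + 1e-12) + (1 - x) * np.log(1 - r + 1e-12))

n_in, n_hidden = 32, 16
W = rng.normal(0.0, 0.01, size=(n_in, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_in)
X = (rng.random((500, n_in)) > 0.5).astype(float)  # toy binary data
for epoch in range(5):
    losses = [dae_step(x, W, b, c) for x in X]
    print(epoch, np.mean(losses))
```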
1.3 Online Learning and Optimization of Generalization Error

The objective of learning is not to minimize training error or even the training criterion. The latter is a surrogate for generalization error, i.e., performance on new (out-of-sample) examples, and there are no hard guarantees that minimizing the training criterion will yield good generalization error: it depends on the appropriateness of the parametrization and training criterion (with the corresponding prior they imply) for the task at hand.

Many learning tasks of interest will require huge quantities of data (most of which will be unlabeled) and as the number of examples increases, so long as capacity is limited (the number of parameters is small compared to the number of examples), training error and generalization error approach each other. In the regime of such large datasets, we can consider that the learner sees an unending stream of examples (e.g., think about a process that harvests text and images from the web and feeds it to a machine learning algorithm). In that context, it is most efficient to simply update the parameters of the model after each example or a few examples, as they arrive. This is the ideal online learning scenario, and in a simplified setting, we can even consider each new example z as being sampled i.i.d. from an unknown generating distribution with probability density p(z). More realistically, examples in online learning do not arrive i.i.d. but instead from an unknown stochastic process which exhibits serial correlation and other temporal dependencies. Many learning algorithms rely on gradient-based numerical optimization of a training criterion. Let L(z, θ) be the loss incurred on example z when the parameter vector takes value θ. The gradient vector for the loss associated with a single example is $\frac{\partial L(z,\theta)}{\partial \theta}$.
If we consider the simplified case of i.i.d. data, there is an interesting observation to be made: the online learner is performing stochastic gradient descent on its generalization error. Indeed, the generalization error C of a learner with parameters θ and loss function L is

$$C = \mathbb{E}[L(z, \theta)] = \int p(z) L(z, \theta)\, dz$$

while the stochastic gradient from sample z is

$$\hat{g} = \frac{\partial L(z, \theta)}{\partial \theta}$$

with z a random variable sampled from p. The gradient of the generalization error is

$$\frac{\partial C}{\partial \theta} = \frac{\partial}{\partial \theta} \int p(z) L(z, \theta)\, dz = \int p(z) \frac{\partial L(z, \theta)}{\partial \theta}\, dz = \mathbb{E}[\hat{g}]$$

showing that the online gradient $\hat{g}$ is an unbiased estimator of the generalization error gradient $\frac{\partial C}{\partial \theta}$. It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error.
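A toy numerical check of this claim (a hypothetical setup, not from the chapter): take $L(z, \theta) = \frac{1}{2}(\theta - z)^2$ with z ~ N(3, 1), so the generalization error is minimized at θ = E[z] = 3, and take one stochastic gradient step per fresh sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# L(z, theta) = 0.5 * (theta - z)^2 with z ~ N(3, 1).
# C(theta) = E[L(z, theta)] is minimized at theta = E[z] = 3, and
# g_hat = theta - z is an unbiased estimate of dC/dtheta = theta - E[z].
theta = 0.0
for t in range(1, 100001):
    z = rng.normal(3.0, 1.0)      # one fresh, non-repeated example from the stream
    g_hat = theta - z             # stochastic gradient dL/dtheta
    theta -= (1.0 / t) * g_hat    # Robbins-Monro decreasing learning rate
print(theta)                      # close to 3.0: SGD on the generalization error
```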
2 Gradients

2.1 Gradient Descent and Learning Rate
The gradient or an estimator of the gradient is used as the core part of the computation of parameter updates for gradient-based numerical optimization algorithms. For example, simple online (or stochastic) gradient descent (Robbins and Monro, 1951; Bottou and LeCun, 2004) updates the parameters after each example is seen, according to

$$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{\partial L(z_t, \theta)}{\partial \theta}$$

where $z_t$ is an example sampled at iteration t and where $\epsilon_t$ is a hyper-parameter that is called the learning rate and whose choice is crucial. If the learning rate is too large [8], the average loss will increase. The optimal learning rate is usually close to (within a factor of 2 of) the largest learning rate that does not cause divergence of the training criterion, an observation that can guide heuristics for setting the learning rate (Bengio, 2011), e.g., start with a large learning rate and if the training criterion diverges, try again with a learning rate 3 times smaller, etc., until no divergence is observed.

[8] Above a value which is approximately 2 times the largest eigenvalue of the average loss Hessian matrix.
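That heuristic is easy to mechanize. Below is a hedged sketch in which train_for_a_while is a hypothetical stand-in for running a short training session and returning the final training criterion; the toy quadratic criterion illustrates footnote [8], diverging exactly when the learning rate exceeds 2 divided by the curvature:

```python
import math

def find_learning_rate(train_for_a_while, lr=1.0, shrink=3.0, max_tries=12):
    """Start with a large learning rate; if training diverges, retry with a
    rate `shrink` times smaller, until no divergence is observed."""
    for _ in range(max_tries):
        loss = train_for_a_while(lr)
        if math.isfinite(loss):
            return lr                   # first (largest) tried rate that stayed stable
        lr /= shrink
    raise RuntimeError("no stable learning rate found")

# Toy training criterion: L(theta) = 0.5 * h * theta^2 with curvature h = 4,
# so gradient descent diverges for lr > 2 / h = 0.5 (cf. footnote [8]).
def train_for_a_while(lr, h=4.0, steps=100):
    theta = 1.0
    for _ in range(steps):
        theta -= lr * h * theta
        if not math.isfinite(theta) or abs(theta) > 1e6:
            return float("inf")         # treat blow-up as divergence
    return 0.5 * h * theta ** 2

print(find_learning_rate(train_for_a_while))   # 1 diverges, 1/3 is stable: ~0.33
```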
See Bottou (2013) for a deeper treatment of stochastic gradient descent, including suggestions to set the learning rate schedule and improve the asymptotic convergence through averaging.
In practice, we use mini-batch updates based on an average of the gradients [9] inside each block of B examples:

$$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{1}{B} \sum_{t'=Bt+1}^{B(t+1)} \frac{\partial L(z_{t'}, \theta)}{\partial \theta} \qquad (1)$$

With B = 1 we are back to ordinary online gradient descent, while with B equal to the training set size, this is standard (also called "batch") gradient descent. With intermediate values of B there is generally a sweet spot. When B increases we can get more multiply-add operations per second by taking advantage of parallelism or efficient matrix-matrix multiplications (instead of separate matrix-vector multiplications), often gaining a factor of 2 in practice in overall training time. On the other hand, as B increases, the number of updates per computation done decreases, which slows down convergence (in terms of error vs number of multiply-add operations performed) because fewer updates can be done in the same computing time. Combining these two opposing effects yields a typical U-curve with a sweet spot at an intermediate value of B.

[9] Compared to a sum, an average makes a small change in B have only a small effect on the optimal learning rate, with an increase in B generally allowing a small increase in the learning rate because of the reduced variance of the gradient.
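A minimal sketch of the mini-batch update of Eq. (1), using a toy linear least-squares learner in place of a neural network (all names and hyper-parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: linear regression, L(z, theta) = 0.5 * (x @ theta - y)^2
n, d = 1024, 5
true_theta = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_theta + 0.1 * rng.normal(size=n)

def minibatch_sgd(X, y, B=32, lr=0.05, n_epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):                  # one epoch = one sweep through the data
        perm = rng.permutation(len(X))
        for start in range(0, len(X), B):
            idx = perm[start:start + B]
            err = X[idx] @ theta - y[idx]
            grad = X[idx].T @ err / len(idx)   # average gradient over the block, Eq. (1)
            theta -= lr * grad
    return theta

theta = minibatch_sgd(X, y)
print(np.linalg.norm(theta - true_theta))      # small: close to the true parameters
```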
Keep in mind that even the true gradient direction (averaging over the whole training set) is only the steepest descent direction locally, and may not point in the right direction when considering larger steps. In particular, because the training criterion is not quadratic in the parameters, as one moves in parameter space the optimal descent direction keeps changing. Because the gradient direction is not quite the right direction of descent, there is no point in spending a lot of computation to estimate it precisely for gradient descent. Instead, doing more updates more frequently helps to explore more and faster, especially with large learning rates. In addition, smaller values of B may benefit from more exploration in parameter space and a form of regularization, both due to the "noise" injected in the gradient estimator, which may explain the better test results sometimes observed with smaller B.
When the training set is finite, training proceeds by sweeps through the training set called epochs, and full training usually requires many epochs (iterations through the training set). Note that stochastic gradient (either one example at a time or with mini-batches) is different from ordinary gradient descent.