arXiv:1206.5533v2 [cs.LG] 16 Sep 2012

Practical Recommendations for Gradient-Based Training of Deep Architectures

Yoshua Bengio

Version 2, Sept. 16th, 2012
Abstract

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.
1 Introduction

Following a decade of lower activity, research in artificial neural networks was revived after a 2006 breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) in the area of Deep Learning, based on greedy layer-wise unsupervised pre-training of each layer of features. See (Bengio, 2009) for a review. Many of the practical recommendations that justified the previous edition of this book are still valid, and new elements were added, while some survived longer by virtue of the practical advantages they provided. The panorama presented in this chapter regards some of these surviving or novel elements of practice, focusing on learning algorithms aiming at training deep neural networks, but leaving most of the material specific to the Boltzmann machine family to another chapter (Hinton, 2013).

Although such recommendations come out of a living practice that emerged from years of experimentation and to some extent mathematical justification, they should be challenged. They constitute a good starting point for the experimenter and user of learning algorithms but very often have not been formally validated, leaving open many questions that can be answered either by theoretical analysis or by solid comparative experimental work (ideally by both). A good indication of the need for such validation is that different researchers and research groups do not always agree on the practice of training neural networks.
Several of the recommendations presented here can be found implemented in the Deep Learning Tutorials [1] and in the related Pylearn2 library [2], all based on the Theano library (discussed below) written in the Python programming language.

The 2006 Deep Learning breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) centered on the use of unsupervised representation learning to help learning internal representations [3] by providing a local training signal at each level of a hierarchy of features [4].

[1] http://deeplearning.net/tutorial/
[2] http://deeplearning.net/software/pylearn2
[3] A neural network computes a sequence of data transformations, each step encoding the raw input into an intermediate or internal representation, in principle to make the prediction or modeling task of interest easier.
[4] In standard multi-layer neural networks trained using back-propagated gradients, the only signal that drives parameter updates is provided at the output of the network (and then propagated backwards). Some unsupervised learning algorithms provide a local source of guidance for the parameter update in each layer, based only on the inputs and outputs of that layer.
Unsupervised representation learning algorithms can be applied several times to learn different layers of a deep model. Several unsupervised representation learning algorithms have been proposed since then. Those covered in this chapter (such as auto-encoder variants) retain many of the properties of artificial multi-layer neural networks, relying on the back-propagation algorithm to estimate stochastic gradients. Deep Learning algorithms such as those based on the Boltzmann machine and those based on auto-encoder or sparse coding variants often include a supervised fine-tuning stage. This supervised fine-tuning, as well as the gradient descent performed with auto-encoder variants, also involves the back-propagation algorithm, just as when training deterministic feedforward or recurrent artificial neural networks. Hence this chapter also includes recommendations for training ordinary supervised deterministic neural networks, or more generally, most machine learning algorithms relying on iterative gradient-based optimization of a parametrized learner with respect to an explicit training criterion.
This chapter assumes that the reader already understands the standard algorithms for training supervised multi-layer neural networks, with the loss gradient computed thanks to the back-propagation algorithm (Rumelhart et al., 1986). It starts by explaining basic concepts behind Deep Learning and the greedy layer-wise pretraining strategy (Section 1.1), and recent unsupervised pre-training algorithms (denoising and contractive auto-encoders) that are closely related in the way they are trained to standard multi-layer neural networks (Section 1.2). It then reviews in Section 2 basic concepts in iterative gradient-based optimization and in particular the stochastic gradient method, gradient computation with a flow graph, and automatic differentiation. The main section of this chapter is Section 3, which explains hyper-parameters in general, their optimization, and specifically covers the main hyper-parameters of neural networks. Section 4 briefly describes simple ideas and methods to debug and visualize neural networks, while Section 5 covers parallelism, sparse high-dimensional inputs, symbolic inputs and embeddings, and multi-relational learning. The chapter closes (Section 6) with open questions on the difficulty of training deep architectures and improving the optimization methods for neural networks.
1.1 Deep Learning and Greedy Layer-Wise Pretraining

The notion of reuse, which explains the power of distributed representations (Bengio, 2009), is also at the heart of the theoretical advantages behind Deep Learning. Complexity theory of circuits, e.g. (Håstad, 1986; Håstad and Goldmann, 1991), (which include neural networks as special cases) has much preceded the recent research on deep learning. The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor (Bengio, 2009). The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone non-linearity on top of an affine transformation), computation of a kernel, or logic gates. Theoretical results (Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006b; Bengio and LeCun, 2007; Bengio and Delalleau, 2011) clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep. If the same set of functions can be represented from within a family of architectures associated with a smaller VC-dimension (e.g. fewer hidden units [5]), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency and statistical efficiency.

[5] Note that in our experiments, deep architectures tend to generalize very well even when they have quite large numbers of parameters.
Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples, so long as the factors (unobserved random variables explaining the data) relevant to the questions we will ask later (e.g. classes to be predicted) are somehow salient in the input distribution itself. This is true under the manifold hypothesis, which states that natural classes and other high-level concepts in which humans are interested are associated with low-dimensional regions in input space (manifolds) near which the distribution concentrates, and that different class manifolds are well-separated by regions of very low density. It means that a small semantic change around a particular example can be captured by changing only a few numbers in a high-level abstract representation space. As a consequence, feature learning and Deep Learning are intimately related to principles of unsupervised learning, and they can work in the semi-supervised setting (where only a few examples are labeled), as well as in the transfer learning and multi-task settings (where we aim to generalize to new classes or tasks). The underlying hypothesis is that many of the underlying factors are shared across classes or tasks. Since representation learning aims to extract and isolate these factors, representations can be shared across classes and tasks.
One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training (Bengio et al., 2007). The idea, first introduced in Hinton et al. (2006), is to train one layer of a deep architecture at a time using unsupervised representation learning. Each level takes as input the representation learned at the previous level and learns a new representation. The learned representation(s) can then be used as input to predict variables of interest, for example to classify objects. After unsupervised pre-training, one can also perform supervised fine-tuning of the whole system [6], i.e., optimize not just the classifier but also the lower levels of the feature hierarchy with respect to some objective of interest. Combining unsupervised pre-training and supervised fine-tuning usually gives better generalization than pure supervised learning from a purely random initialization. The unsupervised representation learning algorithms for pre-training proposed in 2006 were the Restricted Boltzmann Machine or RBM (Hinton et al., 2006), the auto-encoder (Bengio et al., 2007) and a sparsifying form of auto-encoder similar to sparse coding (Ranzato et al., 2007).

[6] The whole system composes the computation of the representation with the computation of the predictor's output.
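To make the procedure concrete, here is a minimal sketch in Python/NumPy (not the Theano/Pylearn2 code referred to above; the function names, hyper-parameter values, and tied-weight auto-encoder choice are all illustrative assumptions) that pre-trains one level at a time, each level encoding the output of the previous one:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder_layer(X, n_hidden, lr=0.1, n_epochs=10):
    """Train one auto-encoder level on representations X; return encoder params."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_in, n_hidden))
    b = np.zeros(n_hidden)
    c = np.zeros(n_in)                  # decoder bias (tied weights: decoder uses W.T)
    for _ in range(n_epochs):
        for x in X:
            h = sigmoid(x @ W + b)      # encoder: h = f(x)
            r = sigmoid(h @ W.T + c)    # decoder: r = g(h)
            # hand-derived gradient of the squared reconstruction error 0.5*||r - x||^2
            dr = (r - x) * r * (1 - r)  # gradient wrt decoder pre-activation
            dh = (dr @ W) * h * (1 - h) # back-propagated to encoder pre-activation
            W -= lr * (np.outer(x, dh) + np.outer(dr, h))  # encoder + decoder terms
            b -= lr * dh
            c -= lr * dr
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train one layer at a time; each level encodes the previous level's output."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)          # representation fed to the next level
    return params, H                    # H can now feed a supervised predictor

X = rng.random((200, 32))               # toy data in [0, 1]
params, top_features = greedy_pretrain(X, layer_sizes=[16, 8])
print(top_features.shape)               # (200, 8)
```

After the loop, the top-level representation (together with the stacked encoder parameters) is what a supervised fine-tuning stage would then optimize jointly with a predictor.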
1.2 Denoising and Contractive Auto-Encoders

An auto-encoder has two parts: an encoder function f that maps the input x to a representation h = f(x), and a decoder function g that maps h back in the space of x in order to reconstruct x. In the regular auto-encoder the reconstruction function r(·) = g(f(·)) is trained to minimize the average value of a reconstruction loss on the training examples. Note that reconstruction loss should be high for most other input configurations [7]. The regularization mechanism makes sure that reconstruction cannot be perfect everywhere, while minimizing the reconstruction loss at training examples digs a hole in reconstruction error where the density of training examples is large. Examples of reconstruction loss functions include $||x - r(x)||^2$ (for real-valued inputs) and $-\sum_i \left[ x_i \log r_i(x) + (1 - x_i) \log(1 - r_i(x)) \right]$ (when interpreting $x_i$ as a bit or a probability of a binary event). Auto-encoders capture the input distribution by learning to better reconstruct more likely input configurations. The difference between the reconstruction vector and the input vector can be shown to be related to the log-density gradient as estimated by the learner (Vincent, 2011; Bengio et al., 2012) and the Jacobian matrix of the reconstruction with respect to the input gives information about the second derivative of the density, i.e., in which direction the density remains high when you are on a high-density manifold (Rifai et al., 2011a; Bengio et al., 2012).

[7] Different regularization mechanisms have been proposed to push reconstruction error up in low density areas: denoising criterion, contractive criterion, and code sparsity. It has been argued that such constraints play a role similar to the partition function for Boltzmann machines (Ranzato et al., 2008a).

In the Denoising Auto-Encoder (DAE) and the Contractive Auto-Encoder (CAE), the training procedure also introduces robustness (insensitivity to small variations), respectively in the reconstruction r(x) or in the representation f(x). In the DAE (Vincent et al., 2008, 2010), this is achieved by training with stochastically corrupted inputs, but trying to reconstruct the uncorrupted inputs. In the CAE (Rifai et al., 2011a), this is achieved by adding an explicit regularizing term to the training criterion, proportional to the norm of the Jacobian of the encoder, $\left\|\frac{\partial f(x)}{\partial x}\right\|^2$. But the CAE and the DAE are closely related (Bengio et al., 2012): when the noise is Gaussian and small, the denoising error minimized by the DAE is equivalent to minimizing the norm of the Jacobian of the reconstruction function r(·) = g(f(·)), whereas the CAE minimizes the norm of the Jacobian of the encoder f(·). Besides Gaussian noise, another interesting form of corruption has been very successful with DAEs: it is called the masking corruption and consists in randomly zeroing out a large fraction (like 20% or even 50%) of the inputs, where the zeroed-out subset is randomly selected for each example. In addition to the contractive effect, it forces the learned encoder to be able to rely only on an arbitrary subset of the input features.
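For the common case of a sigmoid encoder, h = sigmoid(Wx + b), the encoder Jacobian has the closed form diag(h ⊙ (1 − h)) W (Rifai et al., 2011a), so the CAE penalty can be computed cheaply without explicit differentiation. A small sketch under that assumption (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cae_penalty(x, W, b):
    """Squared Frobenius norm of the encoder Jacobian, ||df(x)/dx||^2.

    For h = sigmoid(W @ x + b) the Jacobian is diag(h * (1 - h)) @ W, so
    the penalty reduces to sum_i (h_i (1 - h_i))^2 * sum_j W_ij^2.
    """
    h = sigmoid(W @ x + b)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(8, 32)), np.zeros(8), rng.normal(size=32)
# This term would be added to the reconstruction loss, scaled by a coefficient.
print(cae_penalty(x, W, b))
```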
Another way to prevent the auto-encoder from perfectly reconstructing everywhere is to introduce a sparsity penalty on h, discussed below (Section 3.1).
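As an illustration of denoising training with the masking corruption described above, here is a hedged toy sketch (Python/NumPy, tied weights, using the cross-entropy reconstruction loss given earlier; the names and settings are assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, W, b, c, lr=0.1, corruption=0.3):
    """One stochastic gradient step of a toy denoising auto-encoder (tied weights)."""
    mask = rng.random(x.shape) > corruption  # masking corruption: zero a random subset
    x_tilde = x * mask
    h = sigmoid(x_tilde @ W + b)             # encode the *corrupted* input
    r = sigmoid(h @ W.T + c)                 # reconstruction
    # cross-entropy loss measured against the *uncorrupted* input x;
    # for a sigmoid output the gradient wrt the decoder pre-activation is r - x
    dr = r - x
    dh = (dr @ W) * h * (1 - h)
    W -= lr * (np.outer(x_tilde, dh) + np.outer(dr, h))
    b -= lr * dh
    c -= lr * dr
    return -np.sum(x * np.log(r + 1e-12) + (1 - x) * np.log(1 - r + 1e-12))

n_in, n_hidden = 32, 16
W = rng.normal(0.0, 0.01, size=(n_in, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_in)
X = (rng.random((500, n_in)) > 0.5).astype(float)  # toy binary data
for epoch in range(5):
    losses = [dae_step(x, W, b, c) for x in X]
    print(epoch, np.mean(losses))
```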
1.3 Online Learning and Optimization of Generalization Error

The objective of learning is not to minimize training error or even the training criterion. The latter is a surrogate for generalization error, i.e., performance on new (out-of-sample) examples, and there are no hard guarantees that minimizing the training criterion will yield good generalization error: it depends on the appropriateness of the parametrization and training criterion (with the corresponding prior they imply) for the task at hand.

Many learning tasks of interest will require huge quantities of data (most of which will be unlabeled) and as the number of examples increases, so long as capacity is limited (the number of parameters is small compared to the number of examples), training error and generalization error approach each other. In the regime of such large datasets, we can consider that the learner sees an unending stream of examples (e.g., think about a process that harvests text and images from the web and feeds it to a machine learning algorithm). In that context, it is most efficient to simply update the parameters of the model after each example or a few examples, as they arrive. This is the ideal online learning scenario, and in a simplified setting, we can even consider each new example z as being sampled i.i.d. from an unknown generating distribution with probability density p(z). More realistically, examples in online learning do not arrive i.i.d. but instead from an unknown stochastic process which exhibits serial correlation and other temporal dependencies. Many learning algorithms rely on gradient-based numerical optimization of a training criterion. Let L(z, θ) be the loss incurred on example z when the parameter vector takes value θ. The gradient vector for the loss associated with a single example is $\frac{\partial L(z,\theta)}{\partial \theta}$.
If we consider the simplified case of i.i.d. data, there is an interesting observation to be made: the online learner is performing stochastic gradient descent on its generalization error. Indeed, the generalization error C of a learner with parameters θ and loss function L is

$$C = \mathbb{E}[L(z, \theta)] = \int p(z) L(z, \theta)\, dz$$

while the stochastic gradient from sample z is

$$\hat{g} = \frac{\partial L(z, \theta)}{\partial \theta}$$

with z a random variable sampled from p. The gradient of the generalization error is

$$\frac{\partial C}{\partial \theta} = \frac{\partial}{\partial \theta} \int p(z) L(z, \theta)\, dz = \int p(z) \frac{\partial L(z, \theta)}{\partial \theta}\, dz = \mathbb{E}[\hat{g}]$$

showing that the online gradient $\hat{g}$ is an unbiased estimator of the generalization error gradient $\frac{\partial C}{\partial \theta}$. It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error.
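A toy numerical check of this claim (a hypothetical setup, not from the chapter): take $L(z, \theta) = \frac{1}{2}(\theta - z)^2$ with z ~ N(3, 1), so the generalization error is minimized at θ = E[z] = 3, and take one stochastic gradient step per fresh sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# L(z, theta) = 0.5 * (theta - z)^2 with z ~ N(3, 1).
# C(theta) = E[L(z, theta)] is minimized at theta = E[z] = 3, and
# g_hat = theta - z is an unbiased estimate of dC/dtheta = theta - E[z].
theta = 0.0
for t in range(1, 100001):
    z = rng.normal(3.0, 1.0)      # one fresh, non-repeated example from the stream
    g_hat = theta - z             # stochastic gradient dL/dtheta
    theta -= (1.0 / t) * g_hat    # Robbins-Monro decreasing learning rate
print(theta)                      # close to 3.0: SGD on the generalization error
```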
2 Gradients

2.1 Gradient Descent and Learning Rate
The gradient or an estimator of the gradient is used as the core part of the computation of parameter updates for gradient-based numerical optimization algorithms. For example, simple online (or stochastic) gradient descent (Robbins and Monro, 1951; Bottou and LeCun, 2004) updates the parameters after each example is seen, according to

$$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{\partial L(z_t, \theta)}{\partial \theta}$$

where $z_t$ is an example sampled at iteration t and where $\epsilon_t$ is a hyper-parameter that is called the learning rate and whose choice is crucial. If the learning rate is too large [8], the average loss will increase. The optimal learning rate is usually close to (within a factor of 2 of) the largest learning rate that does not cause divergence of the training criterion, an observation that can guide heuristics for setting the learning rate (Bengio, 2011), e.g., start with a large learning rate and if the training criterion diverges, try again with a learning rate 3 times smaller, etc., until no divergence is observed.

[8] Above a value which is approximately 2 times the largest eigenvalue of the average loss Hessian matrix.
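That heuristic is easy to mechanize. Below is a hedged sketch in which train_for_a_while is a hypothetical stand-in for running a short training session and returning the final training criterion; the toy quadratic criterion illustrates footnote [8], diverging exactly when the learning rate exceeds 2 divided by the curvature:

```python
import math

def find_learning_rate(train_for_a_while, lr=1.0, shrink=3.0, max_tries=12):
    """Start with a large learning rate; if training diverges, retry with a
    rate `shrink` times smaller, until no divergence is observed."""
    for _ in range(max_tries):
        loss = train_for_a_while(lr)
        if math.isfinite(loss):
            return lr                   # first (largest) tried rate that stayed stable
        lr /= shrink
    raise RuntimeError("no stable learning rate found")

# Toy training criterion: L(theta) = 0.5 * h * theta^2 with curvature h = 4,
# so gradient descent diverges for lr > 2 / h = 0.5 (cf. footnote [8]).
def train_for_a_while(lr, h=4.0, steps=100):
    theta = 1.0
    for _ in range(steps):
        theta -= lr * h * theta
        if not math.isfinite(theta) or abs(theta) > 1e6:
            return float("inf")         # treat blow-up as divergence
    return 0.5 * h * theta ** 2

print(find_learning_rate(train_for_a_while))   # 1 diverges, 1/3 is stable: ~0.33
```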
See Bottou (2013) for a deeper treatment of stochastic gradient descent, including suggestions to set the learning rate schedule and improve the asymptotic convergence through averaging.
In practice, we use mini-batch updates based on an average of the gradients [9] inside each block of B examples:

$$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{1}{B} \sum_{t'=Bt+1}^{B(t+1)} \frac{\partial L(z_{t'}, \theta)}{\partial \theta} \qquad (1)$$

With B = 1 we are back to ordinary online gradient descent, while with B equal to the training set size, this is standard (also called "batch") gradient descent. With intermediate values of B there is generally a sweet spot. When B increases we can get more multiply-add operations per second by taking advantage of parallelism or efficient matrix-matrix multiplications (instead of separate matrix-vector multiplications), often gaining a factor of 2 in practice in overall training time. On the other hand, as B increases, the number of updates per computation done decreases, which slows down convergence (in terms of error vs number of multiply-add operations performed) because fewer updates can be done in the same computing time. Combining these two opposing effects yields a typical U-curve with a sweet spot at an intermediate value of B.

[9] Compared to a sum, an average makes a small change in B have only a small effect on the optimal learning rate, with an increase in B generally allowing a small increase in the learning rate because of the reduced variance of the gradient.
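A minimal sketch of the mini-batch update of Eq. (1), using a toy linear least-squares learner in place of a neural network (all names and hyper-parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: linear regression, L(z, theta) = 0.5 * (x @ theta - y)^2
n, d = 1024, 5
true_theta = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_theta + 0.1 * rng.normal(size=n)

def minibatch_sgd(X, y, B=32, lr=0.05, n_epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):                  # one epoch = one sweep through the data
        perm = rng.permutation(len(X))
        for start in range(0, len(X), B):
            idx = perm[start:start + B]
            err = X[idx] @ theta - y[idx]
            grad = X[idx].T @ err / len(idx)   # average gradient over the block, Eq. (1)
            theta -= lr * grad
    return theta

theta = minibatch_sgd(X, y)
print(np.linalg.norm(theta - true_theta))      # small: close to the true parameters
```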
Keep in mind that even the true gradient direction (averaging over the whole training set) is only the steepest descent direction locally, and may not point in the right direction when considering larger steps. In particular, because the training criterion is not quadratic in the parameters, as one moves in parameter space the optimal descent direction keeps changing. Because the gradient direction is not quite the right direction of descent, there is no point in spending a lot of computation to estimate it precisely for gradient descent. Instead, doing more updates more frequently helps to explore more and faster, especially with large learning rates. In addition, smaller values of B may benefit from more exploration in parameter space and a form of regularization, both due to the "noise" injected in the gradient estimator, which may explain the better test results sometimes observed with smaller B.
When the training set is finite, training proceeds by sweeps through the training set called epochs, and full training usually requires many epochs (iterations through the training set). Note that stochastic gradient (either one example at a time or with mini-batches) is different from ordinary gradient descent.