Chapter 8
Optimization for Training Deep Models
Deep learning algorithms involve optimization in many contexts. For example,
performing inference in models such as PCA involves solving an optimization
problem. We often use analytical optimization to write proofs or design algorithms.
Of all of the many optimization problems involved in deep learning, the most
difficult is neural network training. It is quite common to invest days to months of
time on hundreds of machines in order to solve even a single instance of the neural
network training problem. Because this problem is so important and so expensive,
a specialized set of optimization techniques has been developed for solving it.
This chapter presents these optimization techniques for neural network training.
If you are unfamiliar with the basic principles of gradient-based optimization,
we suggest reviewing Chapter 4. That chapter includes a brief overview of numerical
optimization in general.
This chapter focuses on one particular case of optimization: finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$, which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.
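
As a concrete illustration of such a cost function, the sketch below combines a training-set performance measure with a weight decay penalty. The linear model, the squared-error loss, and the name `cost` are assumptions made only for this example, not the book's prescription:

```python
import numpy as np

def cost(theta, X, Y, lam=0.01):
    """Illustrative J(theta): a performance measure (mean squared error
    over the entire training set) plus an additional L2 regularization
    term. The model and loss are assumed for the sake of the example."""
    predictions = X @ theta                      # f(x; theta) for an assumed linear model
    data_loss = np.mean((predictions - Y) ** 2)  # performance measure on the training set
    regularizer = lam * np.sum(theta ** 2)       # additional regularization term
    return data_loss + regularizer
```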
We begin with a description of how optimization used as a training algorithm
for a machine learning task differs from pure optimization. Next, we present several
of the concrete challenges that make optimization of neural networks difficult. We
then define several practical algorithms, including both optimization algorithms
themselves and strategies for initializing the parameters. More advanced algorithms
adapt their learning rates during training or leverage information contained in
the second derivatives of the cost function. Finally, we conclude with a review of
several optimization strategies that are formed by combining simple optimization
algorithms into higher-level procedures.
8.1 How Learning Differs from Pure Optimization
Optimization algorithms used for training of deep models differ from traditional optimization algorithms in several ways. Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure $P$, that is defined with respect to the test set and may also be intractable. We therefore optimize $P$ only indirectly. We reduce a different cost function $J(\theta)$ in the hope that doing so will improve $P$. This is in contrast to pure optimization, where minimizing $J$ is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.
Typically, the cost function can be written as an average over the training set, such as

$$J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y), \tag{8.1}$$

where $L$ is the per-example loss function, $f(x; \theta)$ is the predicted output when the input is $x$, and $\hat{p}_{\text{data}}$ is the empirical distribution. In the supervised learning case, $y$ is the target output. Throughout this chapter, we develop the unregularized supervised case, where the arguments to $L$ are $f(x; \theta)$ and $y$. However, it is trivial to extend this development, for example, to include $\theta$ or $x$ as arguments, or to exclude $y$ as an argument, in order to develop various forms of regularization or unsupervised learning.
Eq. 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data generating distribution $p_{\text{data}}$ rather than just over the finite training set:

$$J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} L(f(x; \theta), y). \tag{8.2}$$
8.1.1 Empirical Risk Minimization
The goal of a machine learning algorithm is to reduce the expected generalization error given by Eq. 8.2. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution $p_{\text{data}}$. If we knew the true distribution $p_{\text{data}}(x, y)$, risk minimization would be an optimization task
solvable by an optimization algorithm. However, when we do not know $p_{\text{data}}(x, y)$ but only have a training set of samples, we have a machine learning problem.
The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution $p(x, y)$ with the empirical distribution $\hat{p}(x, y)$ defined by the training set. We now minimize the empirical risk

$$\mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}(x,y)} [L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}), \tag{8.3}$$

where $m$ is the number of training examples.
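
To make Eq. 8.3 concrete, here is a minimal sketch that averages a per-example loss over the $m$ training examples; the helper names and the squared-error/linear-model choices are illustrative assumptions:

```python
import numpy as np

def empirical_risk(loss, predict, theta, X, Y):
    # Eq. 8.3: (1/m) * sum_i L(f(x_i; theta), y_i)
    return sum(loss(predict(x, theta), y) for x, y in zip(X, Y)) / len(X)

# Illustrative stand-ins for L and f:
squared_error = lambda y_hat, y: (y_hat - y) ** 2
linear_model = lambda x, theta: float(np.dot(x, theta))

theta = np.array([1.0, -0.5])
X = np.array([[1.0, 2.0], [0.5, 1.0]])
Y = np.array([0.2, 0.1])
print(empirical_risk(squared_error, linear_model, theta, X, Y))  # 0.025
```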
The training process based on minimizing this average training error is known
as empirical risk minimization. In this setting, machine learning is still very similar
to straightforward optimization. Rather than optimizing the risk directly, we
optimize the empirical risk, and hope that the risk decreases significantly as well.
A variety of theoretical results establish conditions under which the true risk can
be expected to decrease by various amounts.
However, empirical risk minimization is prone to overfitting. Models with
high capacity can simply memorize the training set. In many cases, empirical
risk minimization is not really feasible. The most effective modern optimization
algorithms are based on gradient descent, but many useful loss functions, such
as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined
everywhere). These two problems mean that, in the context of deep learning, we
rarely use empirical risk minimization. Instead, we must use a slightly different
approach, in which the quantity that we actually optimize is even more different
from the quantity that we truly want to optimize.
8.1.2 Surrogate Loss Functions and Early Stopping
Sometimes, the loss function we actually care about (say classification error) is not
one that can be optimized efficiently. For example, exactly minimizing expected 0-1
loss is typically intractable (exponential in the input dimension), even for a linear
classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes
a surrogate loss function instead, which acts as a proxy but has advantages. For
example, the negative log-likelihood of the correct class is typically used as a
surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate
the conditional probability of the classes, given the input, and if the model can
do that well, then it can pick the classes that yield the least classification error in
expectation.
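
A small numerical sketch of this contrast (the binary setup with labels in {-1, +1} and the logistic model are assumptions for illustration): the 0-1 loss is piecewise constant, while the negative log-likelihood surrogate keeps supplying gradient signal:

```python
import numpy as np

def zero_one_loss(score, y):
    # Prediction is sign(score); the derivative is zero or undefined everywhere.
    return float(np.sign(score) != y)

def nll_surrogate(score, y):
    # Negative log-likelihood of the correct class under an assumed
    # logistic model: p(correct) = sigmoid(y * score), with y in {-1, +1}.
    return -np.log(1.0 / (1.0 + np.exp(-y * score)))

# The 0-1 loss is already zero, yet the surrogate keeps decreasing
# as the classifier becomes more confident:
for score in (0.5, 1.0, 2.0, 4.0):
    print(score, zero_one_loss(score, +1), round(nll_surrogate(score, +1), 3))
```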
In some cases, a surrogate loss function actually results in being able to learn
more. For example, the test set 0-1 loss often continues to decrease for a long
time after the training set 0-1 loss has reached zero, when training using the
log-likelihood surrogate. This is because even when the expected 0-1 loss is zero,
one can improve the robustness of the classifier by further pushing the classes apart
from each other, obtaining a more confident and reliable classifier, thus extracting
more information from the training data than would have been possible by simply
minimizing the average 0-1 loss on the training set.
A very important difference between optimization in general and optimization
as we use it for training algorithms is that training algorithms do not usually halt
at a local minimum. Instead, a machine learning algorithm usually minimizes
a surrogate loss function but halts when a convergence criterion based on early
stopping (Sec. 7.8) is satisfied. Typically the early stopping criterion is based on
the true underlying loss function, such as 0-1 loss measured on a validation set,
and is designed to cause the algorithm to halt whenever overfitting begins to occur.
Training often halts while the surrogate loss function still has large derivatives,
which is very different from the pure optimization setting, where an optimization
algorithm is considered to have converged when the gradient becomes very small.
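
A schematic of such a training loop follows. The model interface (`fit_one_epoch`, `zero_one_error`, the checkpoint methods) and the patience-based stopping rule are hypothetical, sketched only to show the control flow:

```python
def train_with_early_stopping(model, train_data, val_data,
                              max_epochs=100, patience=5):
    """Minimize a surrogate loss on the training set, but halt when the
    true underlying loss (0-1 error on a validation set) stops improving,
    typically while the surrogate still has large derivatives."""
    best_val_error, stale_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.fit_one_epoch(train_data)             # gradient steps on the surrogate loss
        val_error = model.zero_one_error(val_data)  # early-stopping criterion: true loss
        if val_error < best_val_error:
            best_val_error, stale_epochs = val_error, 0
            model.save_checkpoint()                 # remember the best parameters so far
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                               # overfitting has begun; halt
    model.restore_checkpoint()
    return model
```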
8.1.3 Batch and Minibatch Algorithms
One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a sum
over the training examples. Optimization algorithms for machine learning typically
compute each update to the parameters based on an expected value of the cost
function estimated using only a subset of the terms of the full cost function.
For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

$$\theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}, y^{(i)}; \theta). \tag{8.4}$$
Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

$$J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x, y; \theta). \tag{8.5}$$
Most of the properties of the objective function $J$ used by most of our optimization algorithms are also expectations over the training set. For example, the
most commonly used property is the gradient:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \nabla_{\theta} \log p_{\text{model}}(x, y; \theta). \tag{8.6}$$
Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.
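
In code, that sampling strategy might look like the following sketch, where `grad_fn(theta, x, y)` is an assumed per-example gradient oracle:

```python
import numpy as np

def minibatch_gradient(grad_fn, theta, X, Y, batch_size=100, rng=None):
    # Estimate the expectation in Eq. 8.6 by averaging per-example
    # gradients over a small random sample instead of the whole dataset.
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)
    per_example = [grad_fn(theta, X[i], Y[i]) for i in idx]
    return np.mean(per_example, axis=0)  # unbiased estimate of the full gradient
```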
Recall that the standard error of the mean (Eq. 5.46) estimated from $n$ samples is given by $\sigma/\sqrt{n}$, where $\sigma$ is the true standard deviation of the value of the samples. The denominator of $\sqrt{n}$ shows that there are less than linear returns to using more examples to estimate the gradient. Compare two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former, but reduces the standard error of the mean only by a factor of 10. Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.
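
The $\sigma/\sqrt{n}$ scaling is easy to check numerically; this sketch (the Gaussian toy data is an assumption) shows that 100 times more samples shrink the spread of the estimate only about tenfold:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(0.0, 1.0, size=100_000)  # sigma = 1
for n in (100, 10_000):
    means = [rng.choice(population, size=n).mean() for _ in range(1_000)]
    # Empirical spread of the estimate vs. the sigma/sqrt(n) prediction:
    print(n, round(float(np.std(means)), 4), 1.0 / np.sqrt(n))
```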
Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all $m$ samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using $m$ times less computation than the naive approach. In practice, we are unlikely to truly encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.
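
The worst case is easy to exhibit directly; in the sketch below (the squared-error gradient is an illustrative choice), the gradient computed from one sample equals the full-batch gradient over $m$ identical copies:

```python
import numpy as np

def per_example_grad(theta, x, y):
    # Gradient of the squared error (x . theta - y)^2, assumed for illustration.
    return 2.0 * (x @ theta - y) * x

theta = np.array([0.5, -1.0])
x, y = np.array([1.0, 2.0]), 3.0
X, Y = np.tile(x, (1000, 1)), np.full(1000, y)  # m = 1000 identical copies

full_batch = np.mean([per_example_grad(theta, xi, yi) for xi, yi in zip(X, Y)], axis=0)
print(np.allclose(full_batch, per_example_grad(theta, x, y)))  # True: one sample suffices
```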
Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch. This terminology can be somewhat confusing
because the word “batch” is also often used to describe the minibatch used by
minibatch stochastic gradient descent. Typically the term “batch gradient descent”
implies the use of the full training set, while the use of the term “batch” to describe
a group of examples does not. For example, it is very common to use the term
“batch size” to describe the size of a minibatch.
Optimization algorithms that use only a single example at a time are sometimes
called stochastic and sometimes online methods. The term online is usually reserved
for the case where the examples are drawn from a stream of continually created
examples rather than from a fixed-size training set over which several passes are
made.
Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all of the training examples.