CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
solvable by an optimization algorithm. However, when we do not know
$p_{\text{data}}(x, y)$ but only have a training set of samples, we have a
machine learning problem.
The simplest way to convert a machine learning problem back into an op-
timization problem is to minimize the expected loss on the training set. This
means replacing the true distribution $p(x, y)$ with the empirical
distribution $\hat{p}(x, y)$
defined by the training set. We now minimize the empirical risk
$$\mathbb{E}_{x,y \sim \hat{p}_{\text{data}}(x,y)}\left[ L(f(x; \theta), y) \right] = \frac{1}{m} \sum_{i=1}^{m} L\left(f\left(x^{(i)}; \theta\right), y^{(i)}\right) \qquad (8.3)$$
where $m$ is the number of training examples.
The training process based on minimizing this average training error is known
as empirical risk minimization. In this setting, machine learning is still very similar
to straightforward optimization. Rather than optimizing the risk directly, we
optimize the empirical risk, and hope that the risk decreases significantly as well.
A variety of theoretical results establish conditions under which the true risk can
be expected to decrease by various amounts.
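To make equation 8.3 concrete, the following sketch computes the empirical risk of a toy linear model under squared loss. The model, the dataset, and the choice of squared loss are illustrative assumptions, not part of the text; the point is only that the empirical risk is a plain average of per-example losses over the training set.

```python
import numpy as np

# Hypothetical setup: a linear model f(x; theta) = x @ theta with squared
# loss, on synthetic data. This illustrates empirical risk minimization;
# none of these particular choices come from the chapter.
rng = np.random.default_rng(0)
m = 100                                          # number of training examples
X = rng.normal(size=(m, 3))                      # training inputs
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=m)    # noisy targets

def empirical_risk(theta, X, y):
    """Average training loss, as in equation 8.3, with squared loss."""
    predictions = X @ theta
    losses = (predictions - y) ** 2              # per-example loss L(f(x; theta), y)
    return losses.mean()                         # (1/m) * sum over the m examples
```

At the data-generating parameters only the injected noise contributes to the average loss, so `empirical_risk(theta_true, X, y)` is much smaller than the risk at, say, the zero vector; an optimizer minimizing the empirical risk would be driven toward parameters like `theta_true`.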
However, empirical risk minimization is prone to overfitting. Models with
high capacity can simply memorize the training set. In many cases, empirical
risk minimization is not really feasible. The most effective modern optimization
algorithms are based on gradient descent, but many useful loss functions, such
as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined
everywhere). These two problems mean that, in the context of deep learning, we
rarely use empirical risk minimization. Instead, we must use a slightly different
approach, in which the quantity that we actually optimize is even more different
from the quantity that we truly want to optimize.
8.1.2 Surrogate Loss Functions and Early Stopping
Sometimes, the loss function we actually care about (say classification error) is not
one that can be optimized efficiently. For example, exactly minimizing expected 0-1
loss is typically intractable (exponential in the input dimension), even for a linear
classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes
a surrogate loss function instead, which acts as a proxy but has advantages. For
example, the negative log-likelihood of the correct class is typically used as a
surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate
the conditional probability of the classes, given the input, and if the model can
do that well, then it can pick the classes that yield the least classification error in
expectation.
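A small sketch can make the contrast concrete. Below, a binary logistic classifier with a single assumed weight `w` is evaluated under both the 0-1 loss and its negative log-likelihood surrogate; the specific functions and values are illustrative, not from the text. The 0-1 loss is piecewise constant in `w`, so it offers no gradient signal, while the surrogate varies smoothly and rewards improvement.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zero_one_loss(w, x, y):
    # 1 if the predicted class differs from y, else 0. Piecewise constant
    # in w: its derivative is zero or undefined everywhere.
    prediction = float(sigmoid(w * x) >= 0.5)
    return float(prediction != y)

def nll_surrogate(w, x, y):
    # Negative log-likelihood of the correct class: smooth in w, so
    # gradient-based optimizers can make progress on it.
    p = sigmoid(w * x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x, y = 2.0, 1.0  # a single example with positive label (assumed values)
# Both w = 1 and w = 2 already classify the example correctly, so the
# 0-1 loss cannot distinguish them, but the surrogate still decreases as
# the model becomes more confident in the correct class.
```

This is why minimizing the surrogate can keep improving the model even after the 0-1 loss on the training set has stopped changing: the negative log-likelihood continues to push the class probabilities in the right direction.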