Gradient descent is one of the most popular algorithms to perform optimization and by far
the most common way to optimize neural networks. At the same time, every state-of-the-
art Deep Learning library contains implementations of various algorithms to optimize
gradient descent (e.g. lasagne's, caffe's, and keras' documentation). These algorithms,
however, are often used as black-box optimizers, as practical explanations of their
strengths and weaknesses are hard to come by.
This blog post aims at providing you with intuitions towards the behaviour of different
algorithms for optimizing gradient descent that will help you put them to use. We are first
going to look at the different variants of gradient descent. We will then briefly summarize
challenges during training. Subsequently, we will introduce the most common optimization
algorithms by showing their motivation to resolve these challenges and how this leads to
the derivation of their update rules. We will also take a short look at algorithms and
architectures to optimize gradient descent in a parallel and distributed setting. Finally, we
will consider additional strategies that are helpful for optimizing gradient descent.
Gradient descent is a way to minimize an objective function $J(\theta)$ parameterized by a model's parameters $\theta \in \mathbb{R}^d$ by updating the parameters in the opposite direction of the gradient of the objective function $\nabla_\theta J(\theta)$ w.r.t. the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley. If you are unfamiliar with gradient descent, you can find a good introduction on optimizing neural networks here.
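To make the update rule concrete, here is a minimal one-step sketch on a toy one-dimensional objective; the function $J(\theta) = \theta^2$, the starting point, and the learning rate are illustrative choices, not values from this post:

# Toy example: a single gradient descent step on J(theta) = theta**2,
# whose gradient is dJ/dtheta = 2 * theta.
theta = 3.0          # arbitrary initial parameter value
learning_rate = 0.1  # step size eta (illustrative)

grad = 2 * theta                      # gradient of the objective at theta
theta = theta - learning_rate * grad  # step in the opposite direction of the gradient
print(theta)                          # 3.0 - 0.1 * 6.0 = 2.4, i.e. closer to the minimum at 0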
Gradient descent variants
There are three variants of gradient descent, which differ in how much data we use to
compute the gradient of the objective function. Depending on the amount of data, we
make a trade-off between the accuracy of the parameter update and the time it takes to
perform an update.
Batch gradient descent
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters $\theta$ for the entire training dataset:

$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$.
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.
In code, batch gradient descent looks something like this:
for i in range(nb_epochs):
    # placeholder helper: computes the gradient of the loss over the
    # entire training set w.r.t. the current parameter vector
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params. Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters. If you derive the gradients yourself, then gradient checking is a good idea. (See here for some great tips on how to check gradients properly.)
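Since the text recommends gradient checking for hand-derived gradients, here is a minimal sketch of one common way to do it, comparing an analytic gradient against a centered finite-difference estimate; the gradient_check function, its tolerance, and the example objective are assumptions for illustration, not part of the original post:

import numpy as np

def gradient_check(loss_fn, grad_fn, params, eps=1e-5):
    # Compare the analytic gradient grad_fn(params) with a centered
    # finite-difference estimate of loss_fn, assuming params is a 1-D array.
    analytic = grad_fn(params)
    numeric = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        numeric[i] = (loss_fn(params + step) - loss_fn(params - step)) / (2 * eps)
    # Relative error; values around 1e-7 or smaller usually indicate a correct gradient.
    return np.max(np.abs(analytic - numeric) /
                  (np.abs(analytic) + np.abs(numeric) + 1e-12))

# Example: J(theta) = sum(theta**2) has gradient 2 * theta.
print(gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t,
                     np.array([1.0, -2.0, 3.0])))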
We then update our parameters in the opposite direction of the gradients, with the learning rate determining how big of an update we perform. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
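As a concrete instantiation of the pseudocode above, the following sketch runs batch gradient descent on a small linear regression problem with a mean squared error loss; the data, model, and hyperparameters are illustrative assumptions, not something prescribed by the post:

import numpy as np

# Toy data for a 1-D linear regression y ≈ w * x + b (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)

params = np.zeros(2)   # params[0] = w, params[1] = b
learning_rate = 0.1
nb_epochs = 200

def evaluate_gradient(params, x, y):
    # Gradient of the mean squared error over the *whole* dataset.
    w, b = params
    error = (w * x + b) - y
    return np.array([np.mean(2 * error * x), np.mean(2 * error)])

for i in range(nb_epochs):
    params_grad = evaluate_gradient(params, x, y)
    params = params - learning_rate * params_grad

print(params)  # approaches [3.0, 0.5], the global minimum of this convex loss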
Stochastic gradient descent
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$:

$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$.
Batch gradient descent performs redundant computations for large datasets, as it
recomputes gradients for similar examples before each parameter update. SGD does away
with this redundancy by performing one update at a time. It is therefore usually much
faster and can also be used to learn online.
SGD performs frequent updates with a high variance that cause the objective function to
fluctuate heavily as in Image 1.
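In code, a per-example SGD loop might look like the following sketch; it reuses the same placeholders (data, loss_function, evaluate_gradient, params) as the batch pseudocode above, and the per-epoch reshuffle is an assumption in the same spirit rather than an exact excerpt:

import random

for i in range(nb_epochs):
    random.shuffle(data)  # reshuffle so examples are not visited in a fixed order
    for example in data:
        # gradient of the loss for a single training example (and its label)
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad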