Deep learning via Hessian-free optimization
James Martens JMARTENS@CS.TORONTO.EDU
University of Toronto, Ontario, M5S 1A1, Canada
Abstract
We develop a 2nd-order optimization method
based on the “Hessian-free” approach, and apply
it to training deep auto-encoders. Without using
pre-training, we obtain results superior to those
reported by Hinton & Salakhutdinov (2006) on
the same tasks they considered. Our method is
practical, easy to use, scales nicely to very large
datasets, and isn’t limited in applicability to auto-
encoders, or any specific model class. We also
discuss the issue of “pathological curvature” as
a possible explanation for the difficulty of deep-
learning and how 2nd-order optimization, and our
method in particular, effectively deals with it.
1. Introduction
Learning the parameters of neural networks is perhaps one
of the most well studied problems within the field of ma-
chine learning. Early work on backpropagation algorithms
showed that the gradient of the neural net learning objective
could be computed efficiently and used within a gradient-
descent scheme to learn the weights of a network with mul-
tiple layers of non-linear hidden units. Unfortunately, this
technique doesn’t seem to generalize well to networks that
have very many hidden layers (i.e. deep networks). The
common experience is that gradient-descent progresses ex-
tremely slowly on deep nets, seeming to halt altogether be-
fore making significant progress, resulting in poor perfor-
mance on the training set (under-fitting).
It is well known within the optimization community that
gradient descent is unsuitable for optimizing objectives
that exhibit pathological curvature. 2nd-order optimization
methods, which model the local curvature and correct for
it, have been demonstrated to be quite successful on such
objectives. There are even simple 2D examples such as the
Rosenbrock function where these methods can demonstrate
considerable advantages over gradient descent. Thus it is
reasonable to suspect that the deep learning problem could
be resolved by the application of such techniques. Unfortu-
nately, there has yet to be a demonstration that any of these
methods are effective on deep learning problems that are
known to be difficult for gradient descent.
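To make the curvature argument concrete: a 2nd-order method replaces the raw gradient step with the minimizer of the local quadratic model f(θ + d) ≈ f(θ) + ∇f(θ)ᵀd + ½dᵀHd, namely the Newton step d = −H⁻¹∇f(θ). The sketch below (ours, not from the paper) contrasts plain gradient descent with undamped Newton steps on the Rosenbrock function just mentioned; the starting point, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def grad(p):
    x, y = p
    return np.array([-2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2),
                     200.0 * (y - x ** 2)])

def hess(p):
    x, y = p
    return np.array([[2.0 - 400.0 * (y - x ** 2) + 800.0 * x ** 2, -400.0 * x],
                     [-400.0 * x, 200.0]])

p_gd = np.array([-1.0, 1.0])   # start inside the curved valley
p_nt = np.array([-1.0, 1.0])
for _ in range(50):
    p_gd = p_gd - 1e-3 * grad(p_gd)                         # larger rates diverge here
    p_nt = p_nt - np.linalg.solve(hess(p_nt), grad(p_nt))   # undamped Newton step

print("gradient descent:", rosenbrock(p_gd))   # still far from the optimum
print("Newton:          ", rosenbrock(p_nt))   # essentially 0 after a few steps
```

From this starting point the Newton iterates land on the minimum at (1, 1) within a couple of steps, while gradient descent creeps along the narrow curved valley; this is exactly the pathological-curvature behavior described above.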
Much of the recent work on applying 2nd-order methods
to learning has focused on making them practical for large
datasets. This is usually attemptedby adoptingan “on-line”
approachakin to the one used in stochastic gradient descent
(SGD). The only demonstrated advantages of these meth-
ods over SGD is that they can sometimes converge in fewer
training epochs and that they require less tweaking of meta-
parameters, such as learning rate schedules.
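For concreteness, the computational trick behind the “Hessian-free” name is that a Newton-like step never requires forming or inverting H: linear conjugate gradient (CG) can solve Hd = −g using only Hessian-vector products, each obtainable for roughly the cost of one extra gradient evaluation (exactly, via Pearlmutter’s R-operator, or approximately by finite differences as below). The following is a minimal illustrative sketch, not the paper’s implementation: grad_f, the tolerance, and the iteration cap are assumptions, and the paper’s use of a damped Gauss-Newton curvature matrix is omitted.

```python
import numpy as np

def hess_vec(grad_f, theta, v, eps=1e-6):
    # Finite-difference Hessian-vector product:
    #   H v  ~=  (grad(theta + eps * v) - grad(theta)) / eps
    # costs one extra gradient evaluation per product.
    return (grad_f(theta + eps * v) - grad_f(theta)) / eps

def hessian_free_step(grad_f, theta, cg_iters=50, tol=1e-10):
    # Approximately solve  H d = -g  by linear conjugate gradient,
    # touching H only through Hessian-vector products.
    # (Plain CG assumes positive-definite curvature; the paper instead
    # uses a damped Gauss-Newton matrix to guarantee this.)
    g = grad_f(theta)
    d = np.zeros_like(theta)
    r = -g                      # residual of H d = -g at d = 0
    p = r.copy()
    rr = r @ r
    for _ in range(cg_iters):
        Hp = hess_vec(grad_f, theta, p)
        alpha = rr / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return theta + d
```

Each CG iteration costs about one extra gradient evaluation, so the per-update work scales with the number of CG steps rather than with any explicit manipulation of the full Hessian.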
The most important recent advance in learning for deep
networks has been the development of layer-wise unsu-
pervised pre-training methods (Hinton & Salakhutdinov,
2006; Bengio et al., 2007). Applying these methods before
running SGD seems to overcome the difficulties associated
with deep learning. Indeed, there have been many suc-
cessful applications of these methods to hard deep learn-
ing problems, such as auto-encoders and classification nets.
But the question remains: why does pre-training work and
why is it necessary? Some researchers (e.g. Erhan et al.,
2010) have investigated this question and proposed various
explanations such as a higher prevalence of bad local op-
tima in the learning objectives of deep models.
Another explanation is that these objectives exhibit patho-
logical curvature making them nearly impossible for
curvature-blind methods like gradient-descent to success-
fully navigate. In this paper we will argue in favor of this
explanation and provide a solution in the form of a pow-
erful semi-online 2nd-order optimization algorithm which
is practical for very large models and datasets. Using
this technique, we are able to overcome the under-fitting
problem encountered when training deep auto-encoder
neural nets far more effectively than the pre-training +
fine-tuning approach proposed by Hinton & Salakhutdinov
(2006). Being an optimization algorithm, our approach
doesn’t deal specifically with the problem of over-fitting;
however, we show that this is only a serious issue for one of
the three deep auto-encoder problems considered by Hin-
ton & Salakhutdinov, and can be handled by the usual
methods of regularization.
These results also help us address the question of why
deep-learning is hard and why pre-training sometimes helps.