Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.
Abstract
Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper versus less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to better understand why standard gradient descent from random initialization does so poorly with deep neural networks, to better understand these recent relative successes, and to help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
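As an illustrative aside (not part of the original abstract): the initialization scheme announced here scales a uniform weight distribution by each layer's fan-in and fan-out. A minimal NumPy sketch under that reading, with layer sizes chosen purely for illustration:

```python
import numpy as np

def normalized_init(fan_in, fan_out, rng=None):
    """Draw W uniformly in [-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))],
    which keeps activation and back-propagated gradient variances roughly
    constant from layer to layer."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Illustrative layer sizes (an assumption, not the paper's experimental setup).
W1 = normalized_init(784, 1000)
W2 = normalized_init(1000, 10)
```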
1 Deep Neural Networks
Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see Bengio (2009) for a review) because of their theoretical appeal, their inspiration from biology and human cognition, and their empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures.
Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a "better" basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.
Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate how these are affected by the choice of activation function (with the idea that it might affect saturation) and by the initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).
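The kind of monitoring described above can be made concrete with a small sketch: run a forward pass through sigmoid layers and record each layer's activation mean and standard deviation, flagging saturation when activations pile up near 0 or 1. The network shape, sigmoid choice, and helper names here are illustrative assumptions, not the authors' exact experimental code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_with_stats(x, weights, biases):
    """Forward pass through sigmoid layers, recording per-layer activation
    statistics so saturation (activations stuck near 0 or 1) is visible."""
    stats, h = [], x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
        stats.append({"mean": float(h.mean()),
                      "std": float(h.std()),
                      "frac_saturated": float(np.mean((h < 0.05) | (h > 0.95)))})
    return h, stats

# Illustrative 4-hidden-layer network using the fan-in/fan-out-scaled
# uniform initialization sketched earlier.
rng = np.random.default_rng(0)
sizes = [784, 1000, 1000, 1000, 10]
weights = [rng.uniform(-np.sqrt(6 / (m + n)), np.sqrt(6 / (m + n)), (m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
_, stats = forward_with_stats(rng.standard_normal((64, 784)), weights, biases)
```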