Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot, Yoshua Bengio

... training pairs (x, y) and used to update parameters θ in that direction, with θ ← θ − εg. The learning rate ε is a hyper-parameter that is optimized based on validation set error after a large number of updates (5 million).

We varied the type of non-linear activation function in the hidden layers: the sigmoid 1/(1 + e^(−x)), the hyperbolic tangent tanh(x), and a newly proposed activation function (Bergstra et al., 2009) called the softsign, x/(1 + |x|). The softsign is similar to the hyperbolic tangent (its range is −1 to 1) but its tails are quadratic polynomials rather than exponentials, i.e., it approaches its asymptotes much more slowly.

In the comparisons, we search for the best hyper-parameters (learning rate and depth) separately for each model. Note that the best depth was always five for Shapeset-3×2, except for the sigmoid, for which it was four.

We initialized the biases to be 0 and the weights W_ij at each layer with the following commonly used heuristic:

    W_ij ~ U[−1/√n, 1/√n],    (1)

where U[−a, a] is the uniform distribution in the interval (−a, a) and n is the size of the previous layer (the number of columns of W).

3 Effect of Activation Functions and Saturation During Training

Two things we want to avoid, and that can be revealed from the evolution of activations, are excessive saturation of activation functions on one hand (then gradients will not propagate well), and overly linear units (they will not compute something interesting).

3.1 Experiments with the Sigmoid

The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b). In this section we will see another symptomatic behavior due to this activation function in deep feedforward networks.

We want to study possible saturation by looking at the evolution of activations during training. The figures in this section show results on the Shapeset-3×2 data, but similar behavior is observed with the other datasets. Figure 2 shows the evolution of the activation values (after the non-linearity) at each hidden layer during training of a deep architecture with sigmoid activation functions. Layer 1 refers to the output of the first hidden layer, and there are four hidden layers. The graph shows the means and standard deviations of these activations. These statistics, along with histograms, are computed at different times during learning, by looking at activation values for a fixed set of 300 test examples.

Figure 2: Mean and standard deviation (vertical bars) of the activation values (output of the sigmoid) during supervised learning, for the different hidden layers of a deep architecture. The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly desaturates around epoch 100.

We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0. Inversely, the other layers have a mean activation value that is above 0.5 and decreasing as we go from the output layer to the input layer. We have found that this kind of saturation can last very long in deeper networks with sigmoid activations; e.g., the depth-five model never escaped this regime during training. The big surprise is that for an intermediate number of hidden layers (here four), the saturation regime may be escaped. At the same time that the top hidden layer moves out of saturation, the first hidden layer begins to saturate and therefore to stabilize.

We hypothesize that this behavior is due to the combination of random initialization and the fact that a hidden unit output of 0 corresponds to a saturated sigmoid. Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior. Our proposed explanation rests on the hypothesis that the transformation that the lower layers of the randomly initialized network compute initially is not useful to the classification task, unlike the transformation obtained from unsupervised pre-training. The logistic layer output softmax(b + Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other and possibly more dominant variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0. In the case of symmetric activation functions like the hyperbolic tangent and the softsign, sitting around 0 is good because it allows gradients to flow backwards. However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients from flowing backward and prevent the lower layers from learning useful features. Eventually but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime. Note however that, even after this, the network moves into a solution that is of poorer quality (also in terms of generalization) than those found with symmetric activation functions, as can be seen in Figure 11.

3.2 Experiments with the Hyperbolic Tangent

As discussed above, the hyperbolic tangent networks do not suffer from the kind of saturation behavior of the top hidden layer observed with sigmoid networks, because of its symmetry around 0.
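For concreteness, the three activation functions compared in this study can be written in a few lines of NumPy. This is a minimal sketch of ours (the function names are our own, not from the paper):

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^{-x}): range (0, 1), not symmetric around 0
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    # x / (1 + |x|): range (-1, 1), with polynomial rather than
    # exponential tails
    return x / (1.0 + np.abs(x))

# tanh saturates exponentially fast, while the softsign approaches
# its asymptotes much more slowly, e.g. at x = 10:
print(float(np.tanh(10.0)))  # extremely close to 1
print(softsign(10.0))        # 10/11, still clearly away from 1
```

The symmetry of tanh and softsign around 0 is what lets small activations coexist with well-flowing gradients, unlike the sigmoid, whose zero output is a saturation point.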
However, with our standard weight initialization U[−1/√n, 1/√n], we observe a sequentially occurring saturation phenomenon starting with layer 1 and propagating up in the network, as illustrated in Figure 3. Why this is happening remains to be understood.

Figure 3: Top: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of the activation values for the hyperbolic tangent networks in the course of learning. We see the first hidden layer saturating first, then the second, etc. Bottom: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the softsign during learning. Here the different layers saturate less and do so together.

3.3 Experiments with the Softsign

The softsign x/(1 + |x|) is similar to the hyperbolic tangent but might behave differently in terms of saturation because of its smoother asymptotes (polynomial instead of exponential). We see on Figure 3 that the saturation does not occur one layer after the other like for the hyperbolic tangent. It is faster at the beginning and then slow, and all layers move together towards larger weights.

We can also see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4). Whereas the latter yields modes of the activations distribution mostly at the extremes (asymptotes −1 and 1) or around 0, the softsign network has modes of activations around its knees (between the linear regime around 0 and the flat regime around −1 and 1). These are the areas where there is substantial non-linearity but where the gradients would flow well.

Figure 4: Activation values normalized histogram at the end of learning, averaged across units of the same layer and across 300 test examples. Top: activation function is hyperbolic tangent, we see important saturation of the lower layers. Bottom: activation function is softsign, we see many activation values around (−0.8, −0.6) and (0.6, 0.8), where the units do not saturate but are non-linear.

4 Studying Gradients and their Propagation

4.1 Effect of the Cost Function

We have found that the logistic regression or conditional log-likelihood cost function (−log P(y|x), coupled with softmax outputs) worked much better (for classification problems) than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986). This is not a new observation (Solla et al., 1988) but we find it important to stress here. We found that the plateaus in the training criterion (as a function of the parameters) are less present with the log-likelihood cost function. We can see this on Figure 5, which plots the training criterion as a function of two weights for a two-layer network (one hidden layer) with hyperbolic tangent units, and a random input and target signal. There are clearly more severe plateaus with the quadratic cost.

Figure 5: Cross entropy (black, surface on top) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, W1 respectively on the first layer and W2 on the second, output layer.

4.2 Gradients at Initialization

4.2.1 Theoretical Considerations and a New Normalized Initialization

We study the back-propagated gradients, or equivalently the gradient of the cost function on the biases at each layer. Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network. We will also start by studying the linear regime.

For a dense artificial neural network using a symmetric activation function f with unit derivative at 0 (i.e. f′(0) = 1), if we write z^i for the activation vector of layer i, and s^i the argument vector of the activation function at layer i, we have s^i = z^i W^i + b^i and z^(i+1) = f(s^i). From these definitions we obtain the following:

    ∂Cost/∂s^i_k = f′(s^i_k) W^(i+1)_{k,•} ∂Cost/∂s^(i+1),    (2)

    ∂Cost/∂w^i_{l,k} = z^i_l ∂Cost/∂s^i_k.    (3)

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at the initialization, that the weights are initialized independently and that the input features' variances are the same (= Var[x]). Then we can say that, with n_i the size of layer i and x the network input,

    f′(s^i_k) ≈ 1,    (4)

    Var[z^i] = Var[x] ∏_{i′=0}^{i−1} n_{i′} Var[W^{i′}].    (5)

We write Var[W^{i′}] for the shared scalar variance of all weights at layer i′. Then for a network with d layers,

    Var[∂Cost/∂s^i] = Var[∂Cost/∂s^d] ∏_{i′=i}^{d} n_{i′+1} Var[W^{i′}],    (6)

    Var[∂Cost/∂w^i] = ∏_{i′=0}^{i−1} n_{i′} Var[W^{i′}] × ∏_{i′=i}^{d−1} n_{i′+1} Var[W^{i′}] × Var[x] Var[∂Cost/∂s^d].    (7)

From a forward-propagation point of view, to keep information flowing we would like that

    ∀(i, i′), Var[z^i] = Var[z^{i′}].    (8)

From a back-propagation point of view we would similarly like to have

    ∀(i, i′), Var[∂Cost/∂s^i] = Var[∂Cost/∂s^{i′}].    (9)

These two conditions transform to:

    ∀i, n_i Var[W^i] = 1,    (10)

    ∀i, n_{i+1} Var[W^i] = 1.    (11)

As a compromise between these two constraints, we might want to have

    ∀i, Var[W^i] = 2 / (n_i + n_{i+1}).    (12)

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights, we could get the following interesting properties:

    ∀i, Var[∂Cost/∂s^i] = [n Var[W]]^(d−i) Var[x],    (13)

    ∀i, Var[∂Cost/∂w^i] = [n Var[W]]^d Var[x] Var[∂Cost/∂s^d].    (14)

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

    n Var[W] = 1/3,    (15)

where n is the layer size (assuming all layers are of the same size). This will cause the variance of the back-propagated gradient to be dependent on the layer (and decreasing).

The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradient variance as one moves up or down the network. We call it the normalized initialization:

    W ~ U[−√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1})].    (16)

4.2.2 Gradient Propagation Study

To empirically validate the above theoretical ideas, we have plotted some normalized histograms of activation values, weight gradients and of the back-propagated gradients at initialization with the two different initialization methods. The results displayed (Figures 6, 7 and 8) are from experiments on Shapeset-3×2, but qualitatively similar results were obtained with the other datasets.

We monitor the singular values of the Jacobian matrix associated with layer i:

    J^i = ∂z^(i+1) / ∂z^i.    (17)

When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from z^i to z^(i+1), as well as to the ratio of average activation variance going from z^i to z^(i+1). With our normalized initialization, this ratio is around 0.8, whereas with the standard initialization, it drops down to 0.5.

Figure 7: Back-propagated gradients normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak decreases for higher layers.

What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weight gradients is roughly constant across layers, as shown on Figure 8. However, this is explained by our theoretical analysis above (eq. 14). Interestingly, as shown in Figure 9, these observations on the weight gradient of standard and normalized initialization change during training (here for a tanh network). Indeed,
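The two initialization schemes discussed above, the standard heuristic of eq. (1) and the normalized initialization of eq. (16), can be sketched as follows. This is our own illustrative code; only the formulas come from the paper:

```python
import numpy as np

def standard_init(n_in, n_out, rng):
    # Eq. (1): W_ij ~ U[-1/sqrt(n), 1/sqrt(n)] with n the fan-in,
    # giving n * Var[W] = 1/3 (eq. 15)
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def normalized_init(n_in, n_out, rng):
    # Eq. (16): W ~ U[-sqrt(6)/sqrt(n_j + n_{j+1}),
    #                  sqrt(6)/sqrt(n_j + n_{j+1})],
    # chosen so that Var[W] = 2 / (n_j + n_{j+1}) (eq. 12)
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

rng = np.random.default_rng(0)
W_std = standard_init(1000, 1000, rng)
W_norm = normalized_init(1000, 1000, rng)

# Var of U[-a, a] is a^2/3: about 1/(3n) for the standard scheme and
# 2/(n_in + n_out) = 1/n here for the normalized one
print(1000 * W_std.var(), 1000 * W_norm.var())
```

With equal fan-in and fan-out, the normalized scheme makes n·Var[W] = 1, so the per-layer variance factor in eqs. (13) and (14) is 1 instead of the 1/3 of eq. (15).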
Indeed 80.60.40.200.20.4 whereas the gradients have initially roughly the same mag nitude, they diverge from each other(with larger gradients Layer 2 in the lower layers)as training progresses, especially with the standard initialization Note that this might be one of the advantages of the normalized initialization since hav Layer 5 g gradient ry different magnitudes at different lay ers may yield to illconditioning and slower training 0.6 Activation value Finally, we observe that the softsign networks share simi Figure 6: Activation values normalized histograms with rarities with the tanh networks with normalized initializa hyperbolic tangent activation, with standard(top) vs nor tion, as can be seen by comparing the evolution of activa malized initialization(bottom). Top: 0peak increases for tions in both cases(resp. Figure 3bottom and Figure 10) higher layers 5 Error Curves and conclusions 4.3 Backpropagated Gradients During Learning The final consideration that we care for is the success The dynamic of learning in such networks is complex and of training with different strategies, and this is best il we would like to develop better tools to analyze and track lustrated with error curves showing the evolution of test it. In particular, we cannot use simple variance calculations error as training progresses and asymptotes. Figure 11 in our theoretical analysis because the weights values are shows such curves with online training on shapeset3x 2 not anymore independent of the activation values and the while Table 1 gives final test error for all the datasets linearity hypothesis is also violated studied(Shapeset3x 2, MNIST, CIFAR10, and Small As first noted by bradley (2009), we observe(Figure 7)that ImageNet). As a baseline, we optimized rBF SVM mod at the beginning of training, after the standard initializa els on one hundred thousand Shapeset examples and ob tained 59. 47% test error. while on the same set we obtained tion(eg. 
1), the variance of the backpropagated gradients gets smaller as it is propagated downwards. However we 50.47% with a depth five hyperbolic tangent network with normalized initialization find that this trend is reversed very quickly during learning Using our normalized initialization we do not see such de These results illustrate the effect of the choice of activa creasing backpropagated gradients(bottom of Figure 7). tion and initialization. As a reference we include in Fig 254 Xavier Glorot. Yoshua bengio 100 0.05 √ 0.0250.020.015001000500.0050.010.0150.020.025 Weights gradi Epochs of 20k minibatch updates layer I ayer 4 005 0.05 708090 0.1 Weights gradients Epochs of 20k minibatch updates Figure 8: Weight gradient normalized histograms with hy Figure 9: Standard deviation intervals of the weights gradi perbolic tangent activation just after initialization, with ents with hyperbolic tangents with standard initialization standard initialization(top) and normalized initialization ( top) and normalized(bottom) during training. We see (bottom), for different layers. Even though with standard that the normalization allows to keep the same variance initialization the backpropagated gradients get smaller, the of the weights gradient across layers, during training(top weight gradients do not! smaller variance for higher layers) Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden lay ers. 
N after the activation function name indicates the use 样解 母日母日 Layer 1 yer 关Lyer3 of normalized initialization results in bold are statisticall different from nonbold ones under the null hypothesis test 白日日母 Layer5 with p=0.005 字字 TYPE Shapeset MNIST CIFAR10 ImageNet Epochs ofr 20k minibaich updates Figure 10: 98 percentile(markers alone) and standard de Softsign l6.27 l.64 5.78 6914 viation(solid lines with markers )of the distribution of ac Softsignn 16.06 1.72 53.8 6813 Tanh 27.15 176 55.9 tivation values for hyperbolic tangent with normalized ini 15.60 tilization during learning Sigmoid 57.28 70.66 activations(flowing upward) and gradients(flowin ure ll the error curve for the supervised finetuning from backward ). the initialization obtained after unsupervised pretrainin with denoising autoencoders( Vincent et al., 2008). For Others methods can alleviate discrepancies between lay each network the learning rate is separately chosen to mir rs during learning, e. g, exploiting second order informa imize error on the validation set. we can remark that on tion to set the learning rate separately for each parame Shapeset3x 2, because of the task difficulty, we observe er. For example, we can exploit the diagonal of the Hes important saturations during learning, this might explain sian (LeCun et al., 1998b)or a gradient variance estimate that the normalized initialization or the softsign effects are Both those methods have been applied for Shapeset3x 2 more visible with hyperbolic tangent and standard initialization. We ob served a gain in performance but not reaching the result ob Several conclusions can be drawn from these error curves: tained from normalized initialization. 
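The per-layer monitoring used throughout this section can be illustrated with a toy computation in the linear regime of eqs. (2)-(15): back-propagate a unit-variance gradient through randomly initialized square layers and record its variance at each layer. This is our own sketch; the depth and layer size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 200, 5

def layer_stack(bound):
    # depth square weight matrices drawn from U[-bound, bound]
    return [rng.uniform(-bound, bound, size=(n, n)) for _ in range(depth)]

std_W = layer_stack(1.0 / np.sqrt(n))                # eq. (1)
norm_W = layer_stack(np.sqrt(6.0) / np.sqrt(n + n))  # eq. (16)

def backprop_variances(Ws):
    # In the linear regime the gradient at layer i is the gradient at
    # layer i+1 times W^T (eq. 2 with f' = 1); each step scales the
    # variance by n * Var[W] (eq. 13)
    g = rng.standard_normal((64, n))  # unit-variance gradient at the top
    out = []
    for W in reversed(Ws):
        g = g @ W.T
        out.append(float(g.var()))
    return out

# Standard init: n * Var[W] = 1/3, so the variance shrinks geometrically
# toward the input layer; normalized init keeps it roughly constant
print(backprop_variances(std_W))
print(backprop_variances(norm_W))
```

In this toy setting the first list shrinks by roughly a factor of 3 per layer while the second stays near 1, mirroring the histograms of Figure 7.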
Several conclusions can be drawn from these error curves:

- The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.

- The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

- For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.

In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

The other conclusions from this study are the following:

- Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets.

- Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.

- Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and allows to eliminate a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning.

- Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures.

Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR-10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1-127. Also published as a book, Now Publishers, 2009.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19 (pp. 153-160). MIT Press.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157-166.

Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.

Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009 (pp. 153-160).

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). University of Toronto.

Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10, 1-40.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278-2324.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag.

Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21 (pp. 1081-1088).

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2, 625-639.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.

Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008 (pp. 1168-1175). New York, NY, USA: ACM.

Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114-128.