Understanding the difficulty of training deep feedforward neural networks


Xavier Glorot, Yoshua Bengio

...training pairs (x, y) and used to update parameters θ in that direction, with θ ← θ − εg. The learning rate ε is a hyper-parameter that is optimized based on validation set error after a large number of updates (5 million).

We varied the type of non-linear activation function in the hidden layers: the sigmoid 1/(1 + e^(−x)), the hyperbolic tangent tanh(x), and a newly proposed activation function (Bergstra et al., 2009) called the softsign, x/(1 + |x|). The softsign is similar to the hyperbolic tangent (its range is −1 to 1) but its tails are quadratic polynomials rather than exponentials, i.e., it approaches its asymptotes much slower.

In the comparisons, we search for the best hyper-parameters (learning rate and depth) separately for each model. Note that the best depth was always five for Shapeset-3x2, except for the sigmoid, for which it was four.

We initialized the biases to be 0 and the weights W_ij at each layer with the following commonly used heuristic:

    W_ij ~ U[−1/√n, 1/√n],                                    (1)

where U[−a, a] is the uniform distribution in the interval (−a, a) and n is the size of the previous layer (the number of columns of W).

3 Effect of Activation Functions and Saturation During Training

Two things we want to avoid, and that can be revealed from the evolution of activations, are excessive saturation of activation functions on one hand (then gradients will not propagate well), and overly linear units (they will not compute something interesting).

3.1 Experiments with the Sigmoid

The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b). In this section we will see another symptomatic behavior due to this activation function in deep feedforward networks.

We want to study possible saturation by looking at the evolution of activations during training. The figures in this section show results on the Shapeset-3x2 data, but similar behavior is observed with the other datasets. Figure 2 shows the evolution of the activation values (after the non-linearity) at each hidden layer during training of a deep architecture with sigmoid activation functions. Layer 1 refers to the output of the first hidden layer, and there are four hidden layers. The graph shows the means and standard deviations of these activations. These statistics, along with histograms, are computed at different times during learning, by looking at activation values for a fixed set of 300 test examples.

Figure 2: Mean and standard deviation (vertical bars) of the activation values (output of the sigmoid) during supervised learning, for the different hidden layers of a deep architecture (x-axis: epochs of 20k mini-batch updates). The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly desaturates around epoch 100.

We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0. Inversely, the other layers have a mean activation value that is above 0.5, decreasing as we go from the output layer to the input layer. We have found that this kind of saturation can last very long in deeper networks with sigmoid activations; e.g., the depth-five model never escaped this regime during training. The big surprise is that for an intermediate number of hidden layers (here four), the saturation regime may be escaped. At the same time that the top hidden layer moves out of saturation, the first hidden layer begins to saturate and therefore to stabilize.

We hypothesize that this behavior is due to the combination of random initialization and the fact that a hidden unit output of 0 corresponds to a saturated sigmoid. Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior. Our proposed explanation rests on the hypothesis that the transformation that the lower layers of the randomly initialized network compute initially is not useful to the classification task, unlike the transformation obtained from unsupervised pre-training. The logistic layer output softmax(b + Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other, and possibly more dominant, variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0.

In the case of symmetric activation functions like the hyperbolic tangent and the softsign, sitting around 0 is good because it allows gradients to flow backwards. However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients from flowing backward and prevent the lower layers from learning useful features. Eventually but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime. Note however that, even after this, the network moves into a solution that is of poorer quality (also in terms of generalization) than those found with symmetric activation functions, as can be seen in Figure 11.

3.2 Experiments with the Hyperbolic Tangent

As discussed above, hyperbolic tangent networks do not suffer from the kind of saturation behavior of the top hidden layer observed with sigmoid networks, because of their symmetry around 0.
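The three activations compared above, together with the initialization heuristic of eq. (1), can be sketched as follows. This is a NumPy illustration under our own naming, not the authors' code:

```python
import numpy as np

# The three activations compared in the paper.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    # Bergstra et al. (2009): x / (1 + |x|), range (-1, 1),
    # with polynomial rather than exponential tails.
    return x / (1.0 + np.abs(x))

# tanh is available as np.tanh.

# Commonly used heuristic of eq. (1): W_ij ~ U[-1/sqrt(n), 1/sqrt(n)],
# with n the size of the previous layer; biases start at 0.
def standard_init(rng, fan_in, fan_out):
    bound = 1.0 / np.sqrt(fan_in)
    W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b
```

Evaluating `softsign` and `np.tanh` at the same point shows the slower approach to the asymptotes: softsign(10) ≈ 0.91 while tanh(10) is already within 10^-8 of 1.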
However, with our standard weight initialization (eq. 1), we observe a sequentially occurring saturation phenomenon starting with layer 1 and propagating up in the network, as illustrated in Figure 3. Why this is happening remains to be understood.

Figure 3: Top: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of the activation values for the hyperbolic tangent networks in the course of learning. We see the first hidden layer saturating first, then the second, etc. Bottom: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the softsign during learning. Here the different layers saturate less and do so together.

3.3 Experiments with the Softsign

The softsign x/(1 + |x|) is similar to the hyperbolic tangent but might behave differently in terms of saturation because of its smoother asymptotes (polynomial instead of exponential). We see on Figure 3 that the saturation does not occur one layer after the other like for the hyperbolic tangent. It is faster at the beginning and then slow, and all layers move together towards larger weights.

We can also see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4). Whereas the latter yields modes of the activations distribution mostly at the extremes (asymptotes −1 and 1) or around 0, the softsign network has modes of activations around its knees (between the linear regime around 0 and the flat regime around −1 and 1). These are the areas where there is substantial non-linearity but where the gradients would flow well.

Figure 4: Activation values normalized histogram at the end of learning, averaged across units of the same layer and across 300 test examples. Top: activation function is hyperbolic tangent; we see important saturation of the lower layers. Bottom: activation function is softsign; we see many activation values around (−0.8, −0.6) and (0.6, 0.8), where the units do not saturate but are non-linear.

4 Studying Gradients and their Propagation

4.1 Effect of the Cost Function

We have found that the logistic regression or conditional log-likelihood cost function (−log P(y|x), coupled with softmax outputs) worked much better (for classification problems) than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986). This is not a new observation (Solla et al., 1988), but we find it important to stress here. We found that the plateaus in the training criterion (as a function of the parameters) are less present with the log-likelihood cost function. We can see this on Figure 5, which plots the training criterion as a function of two weights for a two-layer network (one hidden layer) with hyperbolic tangent units, and a random input and target signal. There are clearly more severe plateaus with the quadratic cost.

4.2 Gradients at Initialization

4.2.1 Theoretical Considerations and a New Normalized Initialization

We study the back-propagated gradients, or equivalently the gradient of the cost function with respect to the biases at each layer. Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network. We will also start by studying the linear regime.
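The linear-regime analysis can be previewed numerically. The following sketch (our illustration, not the paper's experiment) propagates unit-variance inputs through a deep linear network initialized with eq. (1): since Var[U[−a, a]] = a²/3, each layer multiplies the activation variance by n·Var[W] = 1/3, so the signal shrinks geometrically with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 500, 6
x = rng.normal(size=(1000, n))   # unit-variance inputs
z = x
ratios = []                       # per-layer variance ratios Var[z^{i+1}] / Var[z^i]
for _ in range(depth):
    # eq. (1): W ~ U[-1/sqrt(n), 1/sqrt(n)], so Var[W] = 1/(3n)
    W = rng.uniform(-1.0 / np.sqrt(n), 1.0 / np.sqrt(n), size=(n, n))
    z_next = z @ W                # linear activation: f(s) = s
    ratios.append(z_next.var() / z.var())
    z = z_next
```

Each entry of `ratios` comes out close to 1/3, and after six layers the activation variance has collapsed by roughly (1/3)^6; the same multiplicative argument applies to the back-propagated gradients, matching Bradley's observation.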
For a dense artificial neural network using a symmetric activation function f with unit derivative at 0 (i.e. f'(0) = 1), if we write z^i for the activation vector of layer i, and s^i for the argument vector of the activation function at layer i, we have s^i = z^i W^i + b^i and z^(i+1) = f(s^i). From these definitions we obtain the following:

    ∂Cost/∂s_k^i = f'(s_k^i) W_{k,·}^(i+1) ∂Cost/∂s^(i+1)                    (2)

    ∂Cost/∂w_{l,k}^i = z_l^i ∂Cost/∂s_k^i                                    (3)

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at initialization, that the weights are initialized independently, and that the input features' variances are the same (= Var[x]). Then we can say that, with n_i the size of layer i and x the network input,

    f'(s_k^i) ≈ 1,                                                           (4)

    Var[z^i] = Var[x] ∏_{i'=0}^{i−1} n_{i'} Var[W^{i'}].                     (5)

We write Var[W^{i'}] for the shared scalar variance of all weights at layer i'. Then for a network with d layers,

    Var[∂Cost/∂s^i] = Var[∂Cost/∂s^d] ∏_{i'=i}^{d} n_{i'+1} Var[W^{i'}],     (6)

    Var[∂Cost/∂w^i] = ∏_{i'=0}^{i−1} n_{i'} Var[W^{i'}] ∏_{i'=i}^{d−1} n_{i'+1} Var[W^{i'}] Var[x] Var[∂Cost/∂s^d].   (7)

From a forward-propagation point of view, to keep information flowing we would like that

    ∀ (i, i'), Var[z^i] = Var[z^{i'}].                                       (8)

From a back-propagation point of view we would similarly like to have

    ∀ (i, i'), Var[∂Cost/∂s^i] = Var[∂Cost/∂s^{i'}].                         (9)

These two conditions transform to:

    ∀ i, n_i Var[W^i] = 1,                                                   (10)

    ∀ i, n_{i+1} Var[W^i] = 1.                                               (11)

As a compromise between these two constraints, we might want to have

    ∀ i, Var[W^i] = 2 / (n_i + n_{i+1}).                                     (12)

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights, we could get the following interesting properties:

    ∀ i, Var[∂Cost/∂s^i] = [n Var[W]]^(d−i) Var[x],                          (13)

    ∀ i, Var[∂Cost/∂w^i] = [n Var[W]]^d Var[x] Var[∂Cost/∂s^d].              (14)

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

    n Var[W] = 1/3,                                                          (15)

where n is the layer size (assuming all layers of the same size). This will cause the variance of the back-propagated gradient to be dependent on the layer (and decreasing).

The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network. We call it the normalized initialization:

    W ~ U[ −√6/√(n_i + n_{i+1}), √6/√(n_i + n_{i+1}) ].                      (16)

Figure 5: Cross entropy (black, surface on top) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, W1 respectively on the first layer and W2 on the second (output) layer.

4.2.2 Gradient Propagation Study

To empirically validate the above theoretical ideas, we have plotted some normalized histograms of activation values, weight gradients, and back-propagated gradients at initialization with the two different initialization methods. The results displayed (Figures 6, 7 and 8) are from experiments on Shapeset-3x2, but qualitatively similar results were obtained with the other datasets.

We monitor the singular values of the Jacobian matrix associated with layer i:

    J^i = ∂z^(i+1) / ∂z^i.                                                   (17)

When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from z^i to z^(i+1), as well as to the ratio of average activation variance going from z^i to z^(i+1). With our normalized initialization, this ratio is around 0.8, whereas with the standard initialization, it drops down to 0.5.

Figure 6: Activation values normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak increases for higher layers.

Figure 7: Back-propagated gradients normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak decreases for higher layers.

What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weight gradients is roughly constant across layers, as shown on Figure 8. However, this is explained by our theoretical analysis above (eq. 14). Interestingly, as shown in Figure 9, these observations on the weight gradient of standard and normalized initialization change during training (here for a tanh network). Indeed, whereas the gradients initially have roughly the same magnitude, they diverge from each other (with larger gradients in the lower layers) as training progresses, especially with the standard initialization. Note that this might be one of the advantages of the normalized initialization, since having gradients of very different magnitudes at different layers may yield ill-conditioning and slower training.

Finally, we observe that the softsign networks share similarities with the tanh networks with normalized initialization, as can be seen by comparing the evolution of activations in both cases (resp.
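The normalized initialization of eq. (16) and its effect on variance flow can be sketched as follows. This is our own toy check under assumed names, not the paper's experiment; the 0.8 vs 0.5 figures quoted in the text are singular-value ratios measured on Shapeset, so this sketch only reproduces the qualitative gap between the two schemes:

```python
import numpy as np

def normalized_init(rng, fan_in, fan_out):
    # eq. (16): W ~ U[-sqrt(6)/sqrt(n_i + n_{i+1}), sqrt(6)/sqrt(n_i + n_{i+1})],
    # which gives Var[W] = 2 / (n_i + n_{i+1}) as in eq. (12).
    bound = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def standard_init(rng, fan_in, fan_out):
    # eq. (1), for comparison: n * Var[W] = 1/3 (eq. 15).
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def variance_ratios(init, n=400, depth=5, seed=0):
    """Per-layer ratios Var[z^{i+1}] / Var[z^i] for a tanh net at initialization."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(2000, n))
    ratios = []
    for _ in range(depth):
        z_next = np.tanh(z @ init(rng, n, n))
        ratios.append(z_next.var() / z.var())
        z = z_next
    return ratios
```

With the normalized scheme the per-layer ratio stays fairly close to 1 after the first layer, while with the standard scheme it hovers near 1/3, so activation variance decays much faster with depth.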
Figure 3-bottom and Figure 10).

Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard (top) and normalized (bottom) initialization, for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangents, with standard (top) and normalized (bottom) initialization, during training. We see that the normalization allows keeping the same variance of the weight gradients across layers during training (top: smaller variance for higher layers).

Figure 10: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.

4.3 Back-propagated Gradients During Learning

The dynamic of learning in such networks is complex and we would like to develop better tools to analyze and track it. In particular, we cannot use simple variance calculations in our theoretical analysis, because the weight values are no longer independent of the activation values and the linearity hypothesis is also violated.

As first noted by Bradley (2009), we observe (Figure 7) that at the beginning of training, after the standard initialization (eq. 1), the variance of the back-propagated gradients gets smaller as it is propagated downwards. However, we find that this trend is reversed very quickly during learning. Using our normalized initialization we do not see such decreasing back-propagated gradients (bottom of Figure 7).

5 Error Curves and Conclusions

The final consideration that we care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3x2, while Table 1 gives final test error for all the datasets studied (Shapeset-3x2, MNIST, CIFAR-10, and Small-ImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth-five hyperbolic tangent network with normalized initialization.

Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

    TYPE         Shapeset   MNIST   CIFAR-10   Small-ImageNet
    Softsign     16.27      1.64    55.78      69.14
    Softsign N   16.06      1.72    53.8       68.13
    Tanh         27.15      1.76    55.9       …
    Tanh N       15.60      …       …          …
    Sigmoid      …          …       57.28      70.66

These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3x2, because of the task difficulty, we observe important saturations during learning; this might explain why the effects of the normalized initialization or the softsign are more visible there.

Several conclusions can be drawn from these error curves:

- The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.

- The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

- For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both of those methods have been applied for Shapeset-3x2 with hyperbolic tangent and standard initialization. We observed a gain in performance, but not reaching the result obtained from normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.

In all reported experiments we have used the same number of units per layer.
However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

Figure 11: Test error during online training on the Shapeset-3x2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR-10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

The other conclusions from this study are the following:

- Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets.

- Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.

- Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and allows eliminating a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning.

- Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures.

References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1-127. Also published as a book, Now Publishers, 2009.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19 (pp. 153-160). MIT Press.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157-166.

Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.

Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009 (pp. 153-160).

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). University of Toronto.

Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10, 1-40.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278-2324.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag.

Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21 (pp. 1081-1088).

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2, 625-639.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.

Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008 (pp. 1168-1175). New York, NY, USA: ACM.

Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-Markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114-128.

