Rectifier Nonlinearities Improve Neural Network Acoustic Models

            { w^(i)T x         if w^(i)T x > 0
    h^(i) = {                                        (3)
            { 0.01 w^(i)T x    otherwise

Figure 1 shows the LReL function, which is nearly identical to the standard ReL function. The LReL sacrifices hard-zero sparsity for a gradient which is potentially more robust during optimization. We experiment on both types of rectifier, as well as the sigmoidal tanh nonlinearity.

3. Experiments

We perform LVCSR experiments on the 300 hour Switchboard conversational telephone speech corpus (LDC97S62). The baseline GMM system and forced alignments for DNN training are created using the Kaldi open-source toolkit (Povey et al., 2011). We use a system with 3,034 senones and train DNNs to estimate senone likelihoods in a hybrid HMM speech recognition system. The input features for DNNs are MFCCs with a context window of +/-3 frames. Per-speaker CMVN is applied as well as fMLLR. The features are dimension reduced with LDA to a final vector of 300 dimensions and globally normalized to have 0 mean and unit variance. Overall, the HMM/GMM system training largely follows an existing Kaldi recipe and we defer to that original work for details (Vesely et al., 2013). For recognition evaluation, we use both the Switchboard and CallHome subsets of the HUB5 2000 data (LDC2002S09).

We are most interested in the effect of nonlinearity choice on DNN performance. For this reason, we use simple initialization and training procedures for DNN optimization. We randomly initialize all hidden layer weights with a mean 0 uniform distribution. The scaling of the uniform interval is set based on layer size to prevent sigmoidal saturation in the initial network (Glorot et al., 2011). The output layer is a standard softmax classifier, and cross entropy with no regularization serves as the loss function. We note that training and development set cross entropies are closely matched throughout training, suggesting that regularization is not necessary for the task. Networks are optimized using stochastic gradient descent (SGD) with momentum and a mini-batch size of 256 examples. The momentum term is initially given a small weight, and increases to 0.9 after 40,000 SGD iterations. We use a constant step size of 0.01. For each model we initially searched over several values for the step size parameter, [0.1, 0.05, 0.01, 0.005, 0.001]. For each nonlinearity type the value 0.01 led to fastest convergence without diverging as a result of taking overly large steps. Network training stops after two complete passes through the 300 hour training set. Hidden layers contain 2,048 hidden units, and we explore models with varying numbers of hidden layers.

3.1. Impact of Nonlinearity

Our first experiment compares sigmoidal DNNs against DNNs trained using the two rectifier functions discussed in section 2. DNNs with 2, 3, and 4 hidden layers are trained for all nonlinearity types. We reserved 25,000 examples from the training set to obtain a held-out estimate of the frame-wise cross entropy and accuracy of the neural network acoustic models. Such a measurement is important because recognizer word error rate (WER) is only loosely correlated with the cross entropy metric used in our DNN acoustic model training. Table 1 shows the results for both frame-wise metrics and WER.

DNNs with rectifier nonlinearities substantially outperform sigmoidal DNNs in all error metrics, and across all DNN depths. Rectifier DNNs produce WER reductions of up to 2% absolute on the full Eval2000 dataset as compared with sigmoidal DNNs, a substantial improvement for this task. Furthermore, deeper 4 layer sigmoidal DNNs perform slightly worse than 2 layer rectifier DNNs despite having 1.76 times more free parameters. The performance gains observed in our sigmoidal DNNs relative to the GMM baseline system are on par with other recent work with DNN acoustic models on the Switchboard task (Yu et al., 2013). We note that in preliminary experiments we found tanh units to perform slightly better than logistic sigmoids, another sigmoidal nonlinearity commonly used in DNNs.

The choice of rectifier function used in the DNN appears unimportant for both frame-wise and WER metrics. Both the leaky and standard ReL networks perform similarly, suggesting the leaky rectifiers' non-zero gradient does not substantially impact training optimization. During training we observed leaky rectifier DNNs converge slightly faster, which is perhaps due to the difference in gradient between the two rectifiers.

In addition to performing better overall, rectifier DNNs benefit more from depth as compared with sigmoidal DNNs. Each time we add a hidden layer, rectifier DNNs show a greater absolute reduction in WER than sigmoidal DNNs. We believe this effect results from the lack of vanishing gradients in rectifier networks. The largest models we train still underfit the training set.

Table 1. Results for DNN systems in terms of frame-wise error metrics on the development set as well as word error rates (%) on the Hub5 2000 evaluation sets. The Hub5 set (EV) contains the Switchboard (SWBD) and CallHome (CH) evaluation subsets. Frame-wise error metrics were evaluated on 25,000 frames held out from the training set. (Cells lost in extraction are marked "--".)

Model          Dev CrossEnt   Dev Acc (%)   SWBD WER   CH WER   EV WER
GMM Baseline   N/A            N/A           25.1       40.6     32.6
2 Layer Tanh   2.09           48.0          21.0       34.3     27.7
2 Layer ReLU   1.91           51.7          19.1       32.3     25.7
2 Layer LReLU  1.90           51.8          19.1       32.1     25.6
3 Layer Tanh   --             --            20.0       32.7     26.4
3 Layer ReLU   1.83           53.3          18.1       30.6     24.4
3 Layer LReLU  --             53.4          17.8       30.7     24.3
4 Layer Tanh   1.98           --            19.5       32.3     25.9
4 Layer ReLU   1.79           --            17.3       29.9     23.6
4 Layer LReLU  1.78           53.9          17.3       29.9     23.7
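The two rectifier functions compared in Table 1 reduce to a one-line forward computation each; a minimal NumPy sketch (the function names and toy inputs are ours; the 0.01 leak factor follows equation 3):

```python
import numpy as np

def relu(z):
    # Standard rectifier: h = max(z, 0); the gradient is exactly 0
    # wherever z <= 0, giving hard-zero sparsity.
    return np.maximum(z, 0.0)

def leaky_relu(z, leak=0.01):
    # Leaky rectifier (equation 3): passes z through when z > 0,
    # otherwise scales it by 0.01, so the gradient is non-zero
    # over the entire domain.
    return np.where(z > 0.0, z, leak * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # zeros for all non-positive inputs
print(leaky_relu(z))  # small negative values where relu outputs 0
```

For positive inputs the two functions are identical; they differ only in how negative pre-activations (and their gradients) are treated, which is why the paper finds their frame-wise and WER results so close.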
3.2. Analyzing Coding Properties

Previous work in DNNs for speech and with ReL networks suggests that sparsity of hidden layer representations plays an important role for both classifier performance and invariance to input perturbations. Although sparsity and invariance are not necessarily coupled, we seek to better understand how ReL and tanh networks differ. Further, one might hypothesize that ReL networks exhibit "mostly linear" behavior where units saturate at 0 rarely. We analyze the hidden representations of our trained DNN acoustic models in an attempt to explain the performance gains observed when using ReL nonlinearities.

We compute the last hidden layer representations of 4-layer DNNs trained with each nonlinearity type from section 3.1 for 10,000 input samples from the held-out set. For each hidden unit, we compute its empirical activation probability, the fraction of examples for which the unit is not saturated. We consider ReL and LReL units saturated when the activation is nonpositive, h(x) <= 0. Sigmoidal tanh units have negative and positive saturation, measured by an activation h(x) <= -0.95 and h(x) >= 0.95 respectively. For the sigmoidal units we also measure the fraction of units that saturate on the negative asymptote (h(x) <= -0.95), as this corresponds to the "off" position.

Figure 2. Empirical activation probability of hidden units in the final hidden layer of 4 hidden layer DNNs. Hidden units (x axis) are sorted by their probability of activation. In ReL networks, we consider any positive value as active (h(x) > 0). In tanh networks we consider activation in terms of not saturating in the "off" position (h(x) > -0.95, "tanh neg") as well as not saturating on either asymptote (-0.95 < h(x) < 0.95, "tanh both").

Figure 2 shows the activation probabilities for hidden units in the last hidden layer for each network type. The units are sorted in decreasing order of activation probability.

ReL DNNs contain substantially more sparse representations than sigmoidal DNNs. We measure lifetime sparsity, the average empirical activation probability of all units in the layer for a large sample of inputs (Willmore & Tolhurst, 2001). The average activation probability for the ReL hidden layer is 0.11, more than a factor of 6 less than the average probability for tanh units (considering tanh to be active or "on" when h(x) > -0.95). If we believe sparse activation of a hidden unit is important for invariance to input stimuli, then rectifier networks have a clear advantage. Notice that in terms of sparsity the two types of rectifier evaluated are nearly identical.

Sparsity, however, is not a complete picture of code quality. In a sparse code, only a few coding units represent any particular stimulus on average. However, it is possible to use the same coding units for each stimulus and still obtain a sparse code. For example, a layer with four coding units and hidden unit activation probabilities [1, 0, 0, 0] has average lifetime sparsity 0.25. Such a code is sparse on average, but not disperse. Dispersion measures whether the set of active units is different for each stimulus (Willmore et al., 2000). A different four unit code with activation probabilities [0.25, 0.25, 0.25, 0.25] again has lifetime sparsity 0.25, but is perfectly disperse because units share coding equally. We can informally compare dispersion by comparing the slope of curves in figure 2. A flat curve corresponds to a perfectly disperse code in this case.

We measure dispersion quantitatively for the hidden layers presented in figure 2.(1) We compute the standard deviation of empirical activation probabilities across all units in the hidden layer. A perfectly disperse code, where all units code equally, has standard deviation 0. Both types of rectifier layer have standard deviation 0.04, significantly lower than the tanh layer's standard deviation of 0.14.
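The coding measurements above reduce to a few lines of array arithmetic; a hedged NumPy sketch (the function name, variable names, and toy activation matrices are ours, not from the paper; activations are assumed stored one row per example):

```python
import numpy as np

def activation_stats(h, threshold=0.0):
    """h: (num_examples, num_units) matrix of hidden-layer activations.
    A unit counts as active on an example when its activation exceeds
    `threshold` (0 for rectifier units, -0.95 for tanh 'off' saturation)."""
    # Empirical activation probability per unit: fraction of examples
    # on which that unit is not saturated.
    p_active = np.mean(h > threshold, axis=0)
    # Lifetime sparsity: average activation probability over all units.
    lifetime_sparsity = p_active.mean()
    # Dispersion: std of activation probabilities across units;
    # 0 means a perfectly disperse code where all units code equally.
    dispersion = p_active.std()
    return p_active, lifetime_sparsity, dispersion

# Toy illustration of the paper's four-unit example: one always-on unit
# gives lifetime sparsity 0.25 but a concentrated code, while four units
# each active on a quarter of examples gives the same sparsity with
# dispersion 0.
h_concentrated = np.array([[1.0, 0.0, 0.0, 0.0]] * 8)
h_disperse = np.kron(np.eye(4), np.ones((2, 1)))  # each unit active on 2 of 8 rows
_, s1, d1 = activation_stats(h_concentrated)   # s1 = 0.25, d1 ~ 0.43
_, s2, d2 = activation_stats(h_disperse)       # s2 = 0.25, d2 = 0.0
```

The two toy layers have identical lifetime sparsity but very different dispersion, which is exactly the distinction the standard-deviation metric in the text captures.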
This indicates that ReL networks, as compared with tanh networks, pro duce sparse codes where informaLiOn is distributed more uniformly across hidden units. There are sev eral results from information theory, learning theory, and computational neuroscience which suggest sparse- disperse codes are important, and translate to im proved performance or invariance 4. Conclusion This work focuscs on thc impact of nonlincarity choice in dnn acoustic Models without sophist cated pretraining or regularization techniques. DNNS with rectifiers produce substantial gains on the 300 hour Switchboard task compared to sigmoidal DNNs Leaky rectifiers, with non-zero gradient over the entire domain, perform nearly identically to standard recti- fier DNNs. This suggests gradient-based optimization for inodel training is not adversely affected by chie use of rectifier nonlinearities. Further. ReL dnns with out prctraining or advanced optimization strategics perform on par with established benchmarks Switchboard task. Our analysis of hidden layer rep resentations revealed substantial differences in both sparsity and dispersion when using ReL units com d with tanh units. The increased sparsity and dis persion of rel hidden layers may help to explain their mproved performance in supervised acoustic modcl training. I This metric for dispersion differs from metrics in previ ous work. Previous work focuses on analyzing linear filters with Gaussian-distribted inputs. Our metric captures the idea of dispersion more suitably for non-linear coding units and non-Gaussian inputs Rectifier Nonlinearities Improve Neural Network Acoustic Model References Yu, D, Seltzer, M. L, Li, . Huang, J, and Seide, F Bengio, Y, Simard, P, and Frasconi, P Learning long Feature Learning in Deep Neural Networks Studies torm dependencies with gradicnt descent is difficult on Speech Recognition Tasks. In ICLR, 2013 IEEE TrurLsuclions oT Neurul Nelworks, 5(2), 1994. 
Coates, A. and Ng, A.Y. The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization. In ICML, 2011.

Dahl, G.E., Yu, D., Deng, L., and Acero, A. Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing, 2011.

Dahl, G.E., Sainath, T.N., and Hinton, G.E. Improving Deep Neural Networks for LVCSR using Rectified Linear Units and Dropout. In ICASSP, 2013.

Glorot, X., Bordes, A., and Bengio, Y. Deep Sparse Rectifier Networks. In AISTATS, pp. 315-323, 2011.

Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(November):82-97, 2012.

Kingsbury, B., Sainath, T.N., and Soltau, H. Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization. In Interspeech, 2012.

Krizhevsky, A., Sutskever, I., and Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Vesely, K., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., and Stemmer, G. The Kaldi speech recognition toolkit. In ASRU, 2011.

Vesely, K., Ghoshal, A., Burget, L., and Povey, D. Sequence-discriminative training of deep neural networks. In Interspeech, 2013.

Willmore, B. and Tolhurst, D.J. Characterizing the sparseness of neural codes. Network: Computation in Neural Systems, 12(3):255-270, 2001.

Willmore, B., Watters, P.A., and Tolhurst, D.J. A comparison of natural-image-based models of simple-cell coding. Perception, 29(9):1017-1040, 2000.

Zeiler, M.D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q.V., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G.E. On Rectified Linear Units for Speech Processing. In ICASSP, 2013.
