Attention-Based Models for Speech Recognition

Figure 1: Two steps of the proposed attention-based recurrent sequence generator (ARSG) with a hybrid attention mechanism (computing α), based on both content (h) and location (previous α) information. The dotted lines correspond to Eq. (1), thick solid lines to Eq. (2), and dashed lines to Eqs. (3)-(4).

where s_{i-1} is the (i-1)-th state of the recurrent neural network to which we refer as the generator, α_i ∈ R^L is a vector of the attention weights, also often called the alignment [2]. Using the terminology from [4], we call g_i a glimpse. The step is completed by computing a new generator state

    s_i = Recurrency(s_{i-1}, g_i, y_i).

Long short-term memory units (LSTM, [11]) and gated recurrent units (GRU, [12]) are typically used as the recurrent activation, to which we refer as a recurrency. The process is graphically illustrated in Fig. 1.

Inspired by [6], we distinguish between location-based, content-based and hybrid attention mechanisms. Attend in Eq. (1) describes the most generic, hybrid attention. If the term α_{i-1} is dropped from the Attend arguments, i.e., α_i = Attend(s_{i-1}, h), we call it content-based (see, e.g., [2] or [3]). In this case, Attend is often implemented by scoring each element in h separately and normalizing the scores:

    e_{i,j} = \mathrm{Score}(s_{i-1}, h_j),                                      (5)
    \alpha_{i,j} = \exp(e_{i,j}) \Big/ \sum_{j=1}^{L} \exp(e_{i,j}).             (6)

The main limitation of such a scheme is that identical or very similar elements of h are scored equally regardless of their position in the sequence. This is the issue of "similar speech fragments" raised above. Often this issue is partially alleviated by an encoder such as a BiRNN [2] or a deep convolutional network [3] that encodes contextual information into every element of h. However, the capacity of the elements of h is always limited, and thus disambiguation by context is only possible to a limited extent.

Alternatively, a location-based attention mechanism computes the alignment from the generator state and the previous alignment only, such that α_i = Attend(s_{i-1}, α_{i-1}). For instance, Graves [1] used a location-based attention mechanism based on a Gaussian mixture model in his handwriting synthesis model. In the case of speech recognition, this type of location-based attention mechanism would have to predict the distance between consecutive phonemes using s_{i-1} only, which we expect to be hard due to the large variance of this quantity.

Given these limitations of both content-based and location-based mechanisms, we argue that a hybrid attention mechanism is a natural candidate for speech recognition. Informally, we would like an attention model that uses the previous alignment α_{i-1} to select a short list of elements from h, from which the content-based attention, in Eqs. (5)-(6), will select the relevant ones without confusion.

2.2 Proposed Model: ARSG with Convolutional Features

We start from the ARSG-based model with the content-based attention mechanism proposed in [2]. This model can be described by Eqs. (5)-(6), where

    e_{i,j} = w^\top \tanh(W s_{i-1} + V h_j + b),                               (7)

where w and b are vectors, and W and V are matrices.

We extend this content-based attention mechanism of the original model to be location-aware by making it take into account the alignment produced at the previous step. First, we extract k vectors f_{i,j} ∈ R^k for every position j of the previous alignment α_{i-1} by convolving it with a matrix F ∈ R^{k×r}:

    f_i = F * \alpha_{i-1}.                                                      (8)

These additional vectors f_{i,j} are then used by the scoring mechanism e_{i,j}:

    e_{i,j} = w^\top \tanh(W s_{i-1} + V h_j + U f_{i,j} + b).                   (9)
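To make Eqs. (5)-(9) concrete, the following NumPy sketch computes one step of the location-aware attention: the convolutional location features of Eq. (8), the hybrid scores of Eq. (9), the softmax normalization of Eq. (6), and the resulting glimpse. It is a minimal illustration written for this text, not the authors' implementation; the array names, shapes, and the use of np.convolve for the 1-D convolution are assumptions.

```python
import numpy as np

def hybrid_attention_step(s_prev, h, alpha_prev, W, V, U, F, w, b):
    """One step of location-aware (hybrid) attention, cf. Eqs. (5)-(9).

    s_prev:     (n,)   previous generator state s_{i-1}
    h:          (L, m) encoded input sequence h_1..h_L
    alpha_prev: (L,)   previous alignment alpha_{i-1}
    W: (d, n), V: (d, m), U: (d, k), F: (k, r), w: (d,), b: (d,)
    Returns the new alignment alpha_i (L,) and the glimpse g_i (m,).
    """
    # Eq. (8): convolve the previous alignment with k filters of width r,
    # giving k location features f_{i,j} for every position j.
    f = np.stack([np.convolve(alpha_prev, F[c], mode="same")
                  for c in range(F.shape[0])], axis=1)            # (L, k)
    # Eqs. (7)/(9): score each position with a one-hidden-layer MLP.
    scores = np.tanh(s_prev @ W.T + h @ V.T + f @ U.T + b) @ w    # (L,)
    # Eq. (6): softmax normalization of the scores.
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Glimpse: expected encoded frame under the alignment.
    g = alpha @ h
    return alpha, g
```

Dropping the `f @ U.T` term recovers the purely content-based scoring of Eq. (7) used by the baseline model described later.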
2.3 Score Normalization: Sharpening and Smoothing

There are three potential issues with the normalization in Eq. (6).

First, when the input sequence h is long, the glimpse g_i is likely to contain noisy information from many irrelevant feature vectors h_j, as the normalized scores α_{i,j} are all positive and sum to 1. This makes it difficult for the proposed ARSG to focus clearly on a few relevant frames at each time i. Second, the attention mechanism is required to consider all L frames each time it decodes a single output y_i; for an output of length T this leads to a computational complexity of O(LT). This may easily become prohibitively expensive when input utterances are long (an issue that is less serious for machine translation, because in that case the input sequence is made of words, not of 20 ms acoustic frames). The other side of the coin is that the softmax normalization in Eq. (6) prefers to focus mostly on a single feature vector h_j. This prevents the model from aggregating multiple top-scored frames to form a glimpse g_i.

Sharpening. There is a straightforward way to address the first issue of a noisy glimpse by "sharpening" the scores α_{i,j}. One way to sharpen the weights is to introduce an inverse temperature β > 1 into the softmax function such that

    \alpha_{i,j} = \exp(\beta e_{i,j}) \Big/ \sum_{j=1}^{L} \exp(\beta e_{i,j}),   (10)

or to keep only the top-k frames according to the scores and re-normalize them. These sharpening methods, however, still require us to compute the score of every frame at each step (O(LT)), and they worsen the second issue of an overly narrow focus.

We also propose and investigate a windowing technique. At each time i, the attention mechanism considers only a subsequence h̃ = (h_{p_i - w}, ..., h_{p_i + w - 1}) of the whole sequence h, where w ≪ L is a predefined window width and p_i is the median of the alignment α_{i-1}. The scores for h_j ∉ h̃ are not computed, resulting in a lower complexity of O(L + T). This windowing technique is similar to taking the top-k frames and similarly has a sharpening effect. The proposed sharpening based on windowing can be used both during training and evaluation. Later, in the experiments, we only consider the case where it is used during evaluation.

Smoothing. We observed that the proposed sharpening methods indeed helped with long utterances. However, all of them, and especially selecting the frame with the highest score, negatively affected the model's performance on the standard development set, which mostly consists of short utterances. This observation led us to hypothesize that it is helpful for the model to aggregate selections from multiple top-scored frames. In a sense this brings more diversity, i.e., more effective training examples, to the output part of the model, as more input locations are considered. To facilitate this effect, we replace the unbounded exponential of the softmax in Eq. (6) with the bounded logistic sigmoid σ, such that

    \alpha_{i,j} = \sigma(e_{i,j}) \Big/ \sum_{j=1}^{L} \sigma(e_{i,j}).           (11)

This has the effect of smoothing the focus found by the attention mechanism.
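The sharpening and smoothing variants above differ only in how the raw scores e_{i,j} are turned into weights. The fragment below contrasts them side by side; it is an illustrative NumPy sketch under assumed shapes, not the paper's code, the β = 2 default is only an example value, and the windowed variant here still receives all scores for simplicity, whereas the paper's windowing skips the frames outside the window entirely.

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

def sharpened(e, beta=2.0):
    # Eq. (10): an inverse temperature beta > 1 concentrates the
    # weights on the highest-scoring frames.
    return softmax(beta * e)

def smoothed(e):
    # Eq. (11): the bounded sigmoid replaces the exponential, so several
    # top-scoring frames can keep comparable weight.
    s = 1.0 / (1.0 + np.exp(-e))
    return s / s.sum()

def windowed(e, p_median, w):
    # Windowing: only the 2w frames around the median p_i of the previous
    # alignment receive non-zero weight.
    alpha = np.zeros_like(e)
    lo, hi = max(0, p_median - w), min(len(e), p_median + w)
    alpha[lo:hi] = softmax(e[lo:hi])
    return alpha
```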
Figure 2: Decoding performance w.r.t. the beam size. The models, especially the one with smooth focus, perform well even with a beam width as small as 2. For a rigorous comparison, if decoding failed to generate ⟨eos⟩, we considered the utterance wrongly recognized without retrying with a larger beam size.

3 Related Work

Speech recognizers based on connectionist temporal classification (CTC, [13]) and its extension, the RNN Transducer [14], are the closest to the ARSG model considered in this paper. They follow earlier work on end-to-end trainable deep learning over sequences with gradient signals flowing through the alignment process [15]. They have been shown to perform well on the phoneme recognition task [16]. Furthermore, CTC was recently found to be able to directly transcribe text from speech without any intermediate phonetic representation [17].

The considered ARSG differs from both CTC and the RNN Transducer in two ways. First, whereas the attention mechanism deterministically aligns the input and the output sequences, CTC and the RNN Transducer treat the alignment as a latent random variable over which MAP (maximum a posteriori) inference is performed. This deterministic nature of the ARSG's alignment mechanism allows the beam search procedure to be simpler. Furthermore, we empirically observe that a much smaller beam width can be used with the deterministic mechanism, which allows faster decoding (see Sec. 4.2 and Fig. 2). Second, the alignment mechanism of both CTC and the RNN Transducer is constrained to be "monotonic" to keep marginalization of the alignment tractable. The proposed attention mechanism, on the other hand, can produce non-monotonic alignments, which makes it suitable for a larger variety of tasks beyond speech recognition.

A hybrid attention model using a convolution operation was also proposed in [6] for neural Turing machines (NTM). At each time step, the NTM computes content-based attention weights which are then convolved with a predicted shifting distribution. Unlike the NTM's approach, the hybrid mechanism proposed here lets learning figure out how the content-based and location-based addressing should be combined by a deep, parametric function (see Eq. (9)). Sukhbaatar et al. [18] describe a similar hybrid attention mechanism, where location embeddings are used as input to the attention model. This approach has an important disadvantage: the model cannot work with an input sequence longer than those seen during training. Our approach, on the other hand, works well on sequences many times longer than those seen during training (see Sec. 5).

4 Experimental Setup

We closely followed the procedure in [16]. All experiments were performed on the TIMIT corpus [19]. We used the train-dev-test split from the Kaldi [20] TIMIT s5 recipe. We trained on the standard 462-speaker set with all SA utterances removed and used the 50-speaker dev set for early stopping. We tested on the 24-speaker core test set. All networks were trained on 40 mel-scale filter bank features together with the energy in each frame, plus first and second temporal differences, yielding in total 123 features per frame. Each feature was rescaled to have zero mean and unit variance over the training set. Networks were trained on the full 61-phone set extended with an extra "end-of-sequence" token that was appended to each target sequence. Similarly, we appended an all-zero frame at the end of each input sequence to indicate the end of the utterance. Decoding was performed using the 61+1 phoneme set, while scoring was done on the 39 phoneme set.
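For illustration, the sketch below assembles the 123-dimensional inputs described above: 40 filter-bank coefficients plus energy, stacked with their first and second temporal differences and standardized with training-set statistics. Only the 41 + 41 + 41 feature layout and the zero-mean, unit-variance rescaling come from the paper; the delta formula, the helper names, and the assumption that `fbank_energy` has already been computed are ours.

```python
import numpy as np

def delta(feats, width=2):
    """Regression-style temporal difference over +-width frames (an assumed choice)."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(n * (padded[width + n:width + n + len(feats)]
                   - padded[width - n:width - n + len(feats)])
              for n in range(1, width + 1))
    return num / (2 * sum(n * n for n in range(1, width + 1)))

def timit_features(fbank_energy, train_mean, train_std):
    """fbank_energy: (T, 41) = 40 mel filter banks + per-frame energy.

    Returns a (T, 123) matrix of static, delta and delta-delta features,
    z-scored with mean/std computed over the training set.
    """
    d1 = delta(fbank_energy)
    d2 = delta(d1)
    feats = np.concatenate([fbank_energy, d1, d2], axis=1)  # (T, 123)
    return (feats - train_mean) / train_std
```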
4.1 Training Procedure

One property of ARSG models is that different subsets of parameters are reused a different number of times: L times for the parameters of the encoder, LT times for the attention weights, and T times for all the other parameters of the ARSG. This makes the scales of the derivatives w.r.t. the parameters vary significantly, and we handle it by using an adaptive learning rate algorithm, AdaDelta [21], which has two hyper-parameters ε and ρ. All weight matrices were initialized from a Gaussian distribution with standard deviation 0.01. Recurrent weights were furthermore orthogonalized.

As TIMIT is a relatively small dataset, proper regularization is crucial. We used adaptive weight noise as the main regularizer [22]. (Applying the weight noise from the beginning of training caused severe underfitting.) We first trained our models with a column norm constraint [23] with maximum norm 1 until the lowest development negative log-likelihood was achieved. During this time, ε and ρ were set to 10^{-8} and 0.95, respectively. At this point, we began using the adaptive weight noise and scaled down the model complexity cost L_C by a factor of 10, while disabling the column norm constraint. Once the new lowest development log-likelihood was reached, we fine-tuned the model with a smaller ε = 10^{-10}, until no improvement in the development phoneme error rate (PER) was observed for 100K weight updates. A batch size of 1 was used throughout training.

Figure 3: Alignments produced by the baseline model on utterance FDHC0_SX209 ("Michael colored the bedroom wall with crayons"). The vertical bars indicate ground-truth phone locations from TIMIT. Each row of the upper image indicates the frames selected by the attention mechanism to emit a phone symbol. The network has clearly learned to produce a left-to-right alignment with a tendency to look slightly ahead, and is not confused by the repeated "kcl k" phrase. Best viewed in color.

4.2 Details of Evaluated Models

We evaluated the ARSG with different attention mechanisms. The encoder was a 3-layer BiRNN with 256 GRU units in each direction, and the activations of the 512 top-layer units were used as the representation h. The generator had a single recurrent layer of 256 GRU units. Generate in Eq. (3) had a hidden layer of 64 maxout units. The initial states of both the encoder and the generator were treated as additional parameters.

Our baseline model uses a purely content-based attention mechanism (see Eqs. (5)-(7)); the scoring network in Eq. (7) had 512 hidden units. The other two models use the convolutional features of Eq. (8) with k = 10 and r = 201; one of them additionally uses the smoothing from Sec. 2.3.

Decoding Procedure. A left-to-right beam search over phoneme sequences was used during decoding [24]. Beam search was stopped when the "end-of-sequence" token ⟨eos⟩ was emitted. We started with a beam width of 10, increasing it up to 40 when the network failed to produce ⟨eos⟩ with the narrower beam. As shown in Fig. 2, decoding with a wider beam gives little to no benefit.
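The adaptive beam width used above amounts to a small retry loop: decode with a narrow beam and widen it only if the network never emits ⟨eos⟩. The sketch below assumes a hypothetical beam_search(model, utterance, beam_width) helper that returns the best hypothesis or None when ⟨eos⟩ is not produced; both the helper and the intermediate width of 20 are assumptions, since the paper only states that the width grows from 10 up to 40.

```python
def decode_with_adaptive_beam(model, utterance, widths=(10, 20, 40)):
    """Retry decoding with progressively wider beams until <eos> is emitted."""
    for width in widths:
        hypothesis = beam_search(model, utterance, beam_width=width)  # hypothetical helper
        if hypothesis is not None:   # <eos> was produced with this beam width
            return hypothesis
    return None                      # counted as wrongly recognized (cf. Fig. 2)
```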
5 Results

All the models achieved competitive PERs (see Table 1). With the convolutional features we see a 3.7% relative improvement over the baseline, and a further 5.9% with the smoothing.

Table 1: Phoneme error rates (PER). The bold-faced PER corresponds to the best error rate obtained with an attention-based recurrent sequence generator (ARSG) incorporating convolutional attention features and a smooth focus.

    Model                                                  Dev     Test
    Baseline model                                         15.9%   18.7%
    Baseline + Conv. Features                              16.1%   18.0%
    Baseline + Conv. Features + Smooth Focus               15.8%   17.6%
    RNN Transducer [16]                                    N/A     17.7%
    HMM over Time and Frequency Convolutional Net [25]     13.9%   16.7%

To our surprise (see Sec. 2.1), the baseline model learned to align properly. An alignment produced by the baseline model on a sequence with repeated phonemes (utterance FDHC0_SX209) is presented in Fig. 3, which demonstrates that the baseline model is not confused by short-range repetitions. We can also see from the figure that it prefers to select frames that are near the beginning of, or even slightly before, the phoneme location provided as part of the dataset. The alignments produced by the other models were visually very similar.

Figure 4: Results of force-aligning the concatenated utterances (number of incorrectly aligned phones vs. utterance length, per model and decoding algorithm: plain softmax, softmax with β = 2, keeping the top 50 frames, and a window of ±150; for both the same-utterance and mixed-utterance datasets). Each dot represents a single utterance created by concatenating either multiple copies of the same utterance or different randomly chosen utterances. The highest robustness is achieved when the hybrid attention mechanism is combined with the proposed sharpening technique (see the bottom-right plot).

5.1 Forced Alignment of Long Utterances

The good performance of the baseline model raised the question of how it distinguishes between repetitions of similar phoneme sequences and how reliably it decodes longer sequences with more repetitions. We created two datasets of long utterances: one by repeating each test utterance, and the other by concatenating randomly chosen test utterances. In both cases the waveforms were cross-faded with a 0.05 s silence inserted as the "pau" phone. We concatenated up to 15 utterances.

First, we checked the forced alignment on these longer utterances by forcing the generator to emit the correct phonemes. An alignment was considered correct if 90% of the alignment weight lies inside the ground-truth phoneme window extended by 20 frames on each side. Under this definition, all phones but the ⟨eos⟩ shown in Fig. 3 are properly aligned.
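The correctness criterion above is easy to state programmatically. The following sketch checks whether 90% of the attention mass used to emit one phone falls inside its ground-truth frame span extended by 20 frames on each side; the array layout is an assumption made for this illustration, not code from the paper.

```python
import numpy as np

def alignment_is_correct(alpha, start, end, margin=20, mass=0.9):
    """alpha: (L,) attention weights used to emit one phone;
    start, end: inclusive ground-truth frame span of that phone."""
    lo = max(0, start - margin)
    hi = min(len(alpha), end + margin + 1)
    return alpha[lo:hi].sum() >= mass * alpha.sum()
```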
The first column of Fig. 4 shows the number of correctly aligned frames w.r.t. the utterance length (in frames) for some of the considered models. One can see that the baseline model was able to decode sequences of up to about 120 phones when a single utterance was repeated, and of up to about 150 phones when different utterances were concatenated. Even when it failed, it still correctly aligned about 50 phones. On the other hand, the model with the hybrid attention mechanism with convolutional features was able to align sequences up to 200 phones long; however, once it began to fail, it aligned almost no phones correctly. The model with the smoothing behaved similarly to the one with convolutional features only.

We examined the failed alignments to understand these two different modes of failure; some examples are shown in the supplementary materials. We found that the baseline model properly aligns roughly the first 40 phones, then jumps to the end of the recording and cycles over the last 10 phones. This behavior suggests that it learned to track its approximate location in the source sequence. However, the tracking capability is limited to the lengths observed during training; once the tracker saturates, it jumps to the end of the recording.

In contrast, when the location-aware network failed, it simply stopped aligning: no particular frames were selected for each phone. We attribute this behavior to the issue of the noisy glimpse discussed in Sec. 2.3. With a long utterance there are many irrelevant frames that negatively affect the weight assigned to the correct frames. In line with this conjecture, the location-aware network works slightly better on repetitions of the same utterance, where all frames are somehow relevant, than on concatenations of different utterances, where each misaligned frame is irrelevant.

To gain more insight, we applied the alignment sharpening schemes described in Sec. 2.3. In the remaining columns of Fig. 4, we see that the sharpening methods help the location-aware network find proper alignments, while they have little effect on the baseline network. The windowing technique helps both the baseline and the location-aware networks, with the location-aware network properly aligning nearly all sequences. During visual inspection, we noticed that in the middle of very long utterances the baseline model was confused by repetitions of similar content within the window, and that such confusions did not happen at the beginning. This supports our conjecture above.

Figure 5: Phoneme error rates obtained when decoding long sequences (PER vs. number of repetitions, for the baseline, convolutional-features and smooth-focus models with a window of ±75, on both the same-utterance and mixed-utterance datasets). Each network was decoded with the alignment sharpening techniques that produced proper forced alignments. The proposed ARSGs are clearly more robust to the length of the utterances than the baseline is.

5.2 Decoding Long Utterances

We evaluated the models on the long sequences. Each model was decoded using the alignment sharpening techniques that helped it obtain proper forced alignments. The results are presented in Fig. 5. The baseline model fails to decode long utterances, even when a narrow window is used to constrain the alignments it produces. The two other, location-aware networks are able to decode utterances formed by concatenating up to 11 test utterances. Better results were obtained with a wider window, presumably because it more closely resembles the training conditions, in which the attention mechanism saw the whole input sequence at each step. With the wide window, both networks scored about 20% PER on the long utterances, indicating that the proposed location-aware attention mechanism can scale to sequences much longer than those in the training set, with only minor modifications required at the decoding stage.

6 Conclusions

We proposed and evaluated a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism which combines both content and location information in order to select the next position in the input sequence for decoding. One desirable property of the proposed model is that it can recognize utterances much longer than the ones it was trained on. In the future, we expect this model to be used to directly recognize text from speech [10, 17], in which case it may become important to incorporate a monolingual language model into the ARSG architecture [26].

This work has contributed two novel ideas for attention mechanisms: a better normalization approach yielding smoother alignments, and a generic principle for extracting and using features from previous alignments. Both of these can potentially be applied beyond speech recognition.
For instance, the proposed attention mechanism can be used without modification in neural Turing machines, or, using 2-D instead of 1-D convolution, for improving image caption generation [3].

Acknowledgments

All experiments were conducted using the Theano [27, 28], PyLearn2 [29], and Blocks [30] libraries. The authors would like to acknowledge the support of the following agencies for research funding and computing support: National Science Center (Poland), NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. Bahdanau also thanks Planet Intelligent Systems GmbH and Yandex.

References

[1] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, August 2013.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, September 2014.
[3] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044, February 2015.
[4] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204-2212, 2014.
[5] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602 [cs, stat], December 2014.
[6] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.
[7] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv:1410.3916, 2014.
[8] Mark Gales and Steve Young. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195-304, January 2007.
[9] G. Hinton, Li Deng, Dong Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, November 2012.
[10] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[12] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014, October 2014.
[13] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML 2006, 2006.
[14] Alex Graves. Sequence transduction with recurrent neural networks. In ICML 2012, 2012.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[16] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP 2013, pages 6645-6649. IEEE, 2013.
[17] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML 2014, pages 1764-1772, 2014.
[18] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised memory networks. arXiv preprint arXiv:1503.08895, 2015.
[19] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic-phonetic continuous speech corpus, 1993.
[20] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In Proc. ASRU, pages 1-4, 2011.
[21] Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012.
[22] Alex Graves. Practical variational inference for neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348-2356. Curran Associates, Inc., 2011.
[23] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[24] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014.
[25] Laszlo Toth. Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition. In ICASSP 2014, pages 190-194, 2014.
[26] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535, 2015.
[27] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.
[28] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[29] Ian J. Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frédéric Bastien, and Yoshua Bengio. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013.
[30] Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and Fuel: Frameworks for deep learning. arXiv:1506.00619 [cs, stat], June 2015.
