This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.
Towards End-to-End Speech Recognition with Recurrent Neural Networks

Figure 1. Long Short-term Memory cell.
Figure 2. Bidirectional Recurrent Neural Network.
Figure 3. Deep Recurrent Neural Network.

Bidirectional RNNs process the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer. As illustrated in Fig. 2, a BRNN computes the forward hidden sequence \overrightarrow{h}, the backward hidden sequence \overleftarrow{h} and the output sequence y by iterating the backward layer from t = T to 1, the forward layer from t = 1 to T, and then updating the output layer:

    \overrightarrow{h}_t = \mathcal{H}(W_{x \overrightarrow{h}} x_t + W_{\overrightarrow{h} \overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})    (8)
    \overleftarrow{h}_t = \mathcal{H}(W_{x \overleftarrow{h}} x_t + W_{\overleftarrow{h} \overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})    (9)
    y_t = W_{\overrightarrow{h} y} \overrightarrow{h}_t + W_{\overleftarrow{h} y} \overleftarrow{h}_t + b_y    (10)

where \mathcal{H} is the hidden layer function. Combining BRNNs with LSTM gives bidirectional LSTM (Graves & Schmidhuber, 2005), which can access long-range context in both input directions.

A crucial element of the recent success of hybrid systems is the use of deep architectures, which are able to build up progressively higher-level representations of acoustic data. Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next, as shown in Fig. 3. Assuming the same hidden layer function is used for all N layers in the stack, the hidden vector sequences h^n are iteratively computed from n = 1 to N and t = 1 to T:

    h^n_t = \mathcal{H}(W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)    (11)

where h^0 = x. The network outputs y_t are

    y_t = W_{h^N y} h^N_t + b_y    (12)

Deep bidirectional RNNs can be implemented by replacing each hidden sequence h^n with the forward and backward sequences \overrightarrow{h}^n and \overleftarrow{h}^n, and ensuring that every hidden layer receives input from both the forward and backward layers at the level below. If LSTM is used for the hidden layers, the complete architecture is referred to as deep bidirectional LSTM (Graves et al., 2013).

3. Connectionist Temporal Classification

Neural networks (whether feedforward or recurrent) are typically trained as frame-level classifiers in speech recognition. This requires a separate training target for every frame, which in turn requires the alignment between the audio and transcription sequences to be determined by the HMM. However, the alignment is only reliable once the classifier is trained, leading to a circular dependency between segmentation and recognition (known as Sayre's paradox in the closely related field of handwriting recognition). Furthermore, the alignments are irrelevant to most speech recognition tasks, where only the word-level transcriptions matter. Connectionist Temporal Classification (CTC) (Graves, 2012, Chapter 7) is an objective function that allows an RNN to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences.

The output layer contains a single unit for each of the transcription labels (characters, phonemes, musical notes etc.), plus an extra unit referred to as the 'blank' which corresponds to a null emission. Given a length-T input sequence x, the output vectors y_t are normalised with the softmax function, then interpreted as the probability of emitting the label (or blank) with index k at time t.
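As a concrete sketch, this per-frame softmax normalisation can be written out as follows (plain Python; the frame count, label count and logit values are illustrative, and the function name is ours):

```python
import math

def ctc_emission_probs(logits):
    """Normalise each per-frame output vector y_t with the softmax,
    giving Pr(k, t | x) for every label k (by convention the last
    index here is the 'blank' unit)."""
    probs = []
    for frame in logits:                       # one vector y_t per frame
        m = max(frame)                         # subtract max for stability
        exps = [math.exp(v - m) for v in frame]
        s = sum(exps)
        probs.append([e / s for e in exps])
    return probs

# Toy example: 4 frames, 2 labels plus blank.
logits = [[2.0, 0.5, 0.1],
          [0.2, 1.7, 0.3],
          [0.1, 0.1, 3.0],
          [1.0, 1.0, 1.0]]
probs = ctc_emission_probs(logits)
```

Each row of `probs` is a proper distribution over the labels and the blank, which is what the CTC machinery below consumes.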
Formally, the emission probabilities are

    \Pr(k, t \mid x) = \frac{\exp(y_t^k)}{\sum_{k'} \exp(y_t^{k'})}    (13)

where y_t^k is element k of y_t. A CTC alignment a is a length-T sequence of blank and label indices. The probability \Pr(a \mid x) of a is the product of the emission probabilities at every time-step:

    \Pr(a \mid x) = \prod_{t=1}^{T} \Pr(a_t, t \mid x)    (14)

For a given transcription sequence, there are as many possible alignments as there are different ways of separating the labels with blanks. For example (using '−' to denote blanks), the alignments (a, −, b, c, −, −) and (−, −, a, −, b, c) both correspond to the transcription (a, b, c). When the same label appears on successive time-steps in an alignment, the repeats are removed: therefore (a, b, b, b, c, c) and (a, −, b, −, c, c) also both correspond to (a, b, c). Denoting by B an operator that removes first the repeated labels, then the blanks from alignments, and observing that the total probability of an output transcription y is equal to the sum of the probabilities of the alignments corresponding to it, we can write

    \Pr(y \mid x) = \sum_{a \in B^{-1}(y)} \Pr(a \mid x)    (15)

This 'integrating out' over possible alignments is what allows the network to be trained with unsegmented data. The intuition is that, because we don't know where the labels within a particular transcription will occur, we sum over all the places where they could occur. Eq. (15) can be efficiently evaluated and differentiated using a dynamic programming algorithm (Graves et al., 2006). Given a target transcription y*, the network can then be trained to minimise the CTC objective function:

    CTC(x) = −\log \Pr(y^* \mid x)    (16)

4. Expected Transcription Loss

The CTC objective function maximises the log probability of getting the sequence transcription completely correct. The relative probabilities of the incorrect transcriptions are therefore ignored, which implies that they are all equally bad. In most cases, however, transcription performance is assessed in a more nuanced way. In speech recognition, for example, the standard measure is the word error rate (WER), defined as the edit distance between the true word sequence and the most probable word sequence emitted by the transcriber. We would therefore prefer transcriptions with low WER to be more probable than those with high WER. In the interest of reducing the gap between the objective function and the test criteria, this section proposes a method that allows an RNN to be trained to optimise the expected value of an arbitrary loss function defined over output transcriptions (such as WER).

The network structure and the interpretation of the output activations as the probability of emitting a label (or blank) at a particular time-step remain the same as for CTC. Given input sequence x, the distribution \Pr(y \mid x) over transcription sequences y defined by CTC, and a real-valued transcription loss function \mathcal{L}(x, y), the expected transcription loss \mathcal{L}(x) is defined as

    \mathcal{L}(x) = \sum_y \Pr(y \mid x) \mathcal{L}(x, y)    (17)

In general we will not be able to calculate this expectation exactly, and will instead use Monte-Carlo sampling to approximate both \mathcal{L} and its gradient. Substituting Eq. (15) into Eq. (17) we see that

    \mathcal{L}(x) = \sum_y \sum_{a \in B^{-1}(y)} \Pr(a \mid x) \mathcal{L}(x, y)    (18)
                   = \sum_a \Pr(a \mid x) \mathcal{L}(x, B(a))    (19)

Eq. (14) shows that samples can be drawn from \Pr(a \mid x) by independently picking from \Pr(k, t \mid x) at each time-step and concatenating the results, making it straightforward to approximate the loss:

    \mathcal{L}(x) \approx \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(x, B(a^i)), \quad a^i \sim \Pr(a \mid x)    (20)

To differentiate \mathcal{L} with respect to the network outputs, first observe from Eq. (14) that

    \frac{\partial \log \Pr(a \mid x)}{\partial \Pr(k, t \mid x)} = \frac{\delta_{a_t, k}}{\Pr(k, t \mid x)}    (21)

Then substitute into Eq. (19), applying the identity \nabla_x f(x) = f(x) \nabla_x \log f(x), to yield

    \frac{\partial \mathcal{L}(x)}{\partial \Pr(k, t \mid x)} = \sum_a \Pr(a \mid x) \frac{\partial \log \Pr(a \mid x)}{\partial \Pr(k, t \mid x)} \mathcal{L}(x, B(a))
                                                             = \sum_{a : a_t = k} \Pr(a \mid x, a_t = k) \mathcal{L}(x, B(a))
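Both the alignment sum of Eq. (15) and the sampled loss of Eq. (20) can be checked on a toy example. The sketch below is purely illustrative: brute-force enumeration is exponential in T (real implementations use the forward-backward dynamic programme of Graves et al. (2006)), character-level edit distance stands in for the transcription loss, and all names are ours:

```python
import itertools
import random

BLANK = 0  # index of the blank symbol; labels are 1, 2, ...

def collapse(a):
    """The operator B: remove repeated labels first, then blanks."""
    out, prev = [], None
    for k in a:
        if k != prev:            # drop repeated labels
            out.append(k)
        prev = k
    return tuple(k for k in out if k != BLANK)   # then drop blanks

def ctc_prob(y, probs):
    """Brute-force Eq. (15): sum Pr(a|x) over all alignments a with
    B(a) = y, where Pr(a|x) is the product of emissions (Eq. (14))."""
    T, K = len(probs), len(probs[0])
    total = 0.0
    for a in itertools.product(range(K), repeat=T):
        if collapse(a) == tuple(y):
            p = 1.0
            for t, k in enumerate(a):
                p *= probs[t][k]
            total += p
    return total

def edit_distance(a, b):
    """Levenshtein distance, standing in for the loss L(x, y)."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # skip element of a
                                   d[j - 1] + 1,     # skip element of b
                                   prev + (x != y))  # substitute / match
    return d[len(b)]

def expected_loss(probs, target, n, rng):
    """Monte-Carlo estimate of Eq. (20): sample each a_t independently
    from Pr(k, t|x) (valid by Eq. (14)), collapse, and average the loss."""
    total = 0.0
    for _ in range(n):
        a = [rng.choices(range(len(p)), weights=p)[0] for p in probs]
        total += edit_distance(collapse(a), target)
    return total / n

# Uniform emissions over {blank, a=1, b=2} for T = 3 frames.
uniform = [[1/3, 1/3, 1/3]] * 3
# Emissions sharply peaked on the alignment (a, blank, b).
peaked = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

Under the uniform emissions every one of the 27 alignments has probability 1/27, so summing `ctc_prob` over the distinct collapsed transcriptions recovers 1; under the peaked emissions the sampled expected loss of the target (a, b) is exactly zero. The same dynamic programme run over words rather than characters gives the word error rate used later in the experiments.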
This expectation can also be approximated with Monte-Carlo sampling. Because the output probabilities are independent, an unbiased sample a^i from \Pr(a \mid x) can be converted to an unbiased sample from \Pr(a \mid x, a_t = k) by setting a^i_t = k. Every a^i can therefore be used to provide a gradient estimate for every \Pr(k, t \mid x) as follows:

    \frac{\partial \mathcal{L}(x)}{\partial \Pr(k, t \mid x)} \approx \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(x, B(a^{i,t,k}))    (22)

with a^i \sim \Pr(a \mid x), a^{i,t,k}_{t'} = a^i_{t'} \; \forall t' \neq t and a^{i,t,k}_t = k. For our experiments, five samples per sequence gave sufficiently low variance gradient estimates for effective training.

The advantage of reusing the alignment samples (as opposed to picking separate alignments for every k, t) is that the noise due to the loss variance largely cancels out, and only the difference in loss due to altering individual labels is added to the gradient. As has been widely discussed in the policy gradients literature and elsewhere (Peters & Schaal, 2008), noise minimisation is crucial when optimising with stochastic gradient estimates. The \Pr(k, t \mid x) derivatives are passed through the softmax function to give

    \frac{\partial \mathcal{L}(x)}{\partial y_t^k} \approx \frac{\Pr(k, t \mid x)}{N} \sum_{i=1}^{N} \left[ \mathcal{L}(x, B(a^{i,t,k})) - Z(i, t) \right]

where

    Z(i, t) = \sum_{k'} \Pr(k', t \mid x) \mathcal{L}(x, B(a^{i,t,k'}))

The derivative added to y_t^k by a given a^i is therefore equal to the difference between the loss with a^i_t = k and the expected loss with a^i_t sampled from \Pr(k', t \mid x). This means the network only receives an error term for changes to the alignment that alter the loss. For example, if the loss function is the word error rate and the sampled alignment yields the character transcription 'WTRD ERROR RATE', the gradient would encourage outputs changing the second output label to 'O', discourage outputs making changes to the other two words, and be close to zero everywhere else. Note that in order to calculate the word error rate, an end-of-word label must be used as a delimiter.

For the sampling procedure to be effective, there must be a reasonable probability of picking alignments whose variants receive different losses. The vast majority of alignments drawn from a randomly initialised network will give completely wrong transcriptions, and there will therefore be little chance of altering the loss by modifying a single output. We therefore recommend that expected loss minimisation is used to retrain a network already trained with CTC, rather than applied from the start. Sampling alignments is cheap, so the only significant computational cost in the procedure is recalculating the loss for the alignment variants; for many loss functions (including word error rate) this could be optimised by only recalculating the part of the loss corresponding to the alignment change.

5. Decoding

Decoding a CTC network (that is, finding the most probable output transcription y for a given input sequence x) can be done to a first approximation by picking the single most probable output at every time-step and returning the corresponding transcription:

    \arg\max_y \Pr(y \mid x) \approx B(\arg\max_a \Pr(a \mid x))

More accurate decoding can be performed with a beam search algorithm, which also makes it possible to integrate a language model. The algorithm is similar to decoding methods used for HMM-based systems, but differs slightly due to the changed interpretation of the network outputs. In a hybrid system the network outputs are interpreted as posterior probabilities of state occupancy, which are then combined with transition probabilities provided by a language model and an HMM. With CTC the network outputs themselves represent transition probabilities (in HMM terms, the label activations are the probability of making transitions into different states and the blank activation is the probability of remaining in the current state). The situation is further complicated by the removal of repeated emissions on successive time-steps, which makes it necessary to distinguish alignments ending with blanks from those ending with labels.

The pseudocode in Algorithm 1 describes a simple beam search procedure for a CTC network, which allows the integration of a dictionary and language model. Define \Pr^−(y, t), \Pr^+(y, t) and \Pr(y, t) respectively as the blank, non-blank and total probabilities assigned to some (partial) output transcription y at time t by the beam search, and set \Pr(y, t) = \Pr^−(y, t) + \Pr^+(y, t). Define the extension probability \Pr(k, y, t) of y by label k at time t as follows:

    \Pr(k, y, t) = \Pr(k, t \mid x) \Pr(k \mid y) \begin{cases} \Pr^−(y, t−1) & \text{if } y^e = k \\ \Pr(y, t−1) & \text{otherwise} \end{cases}
where \Pr(k, t \mid x) is the CTC emission probability of k at t, as defined in Eq. (13), \Pr(k \mid y) is the transition probability from y to y + k, and y^e is the final label in y. Lastly, define \hat{y} as the prefix of y with the last label removed, and ∅ as the empty sequence, noting that \Pr^+(∅, t) = 0 \; \forall t.

Algorithm 1  CTC Beam Search
    Initialise: B ← {∅}; \Pr^−(∅, 0) ← 1
    for t = 1 … T do
        B̂ ← the W most probable sequences in B
        B ← {}
        for y ∈ B̂ do
            if y ≠ ∅ then
                \Pr^+(y, t) ← \Pr^+(y, t−1) \Pr(y^e, t \mid x)
                if \hat{y} ∈ B̂ then
                    \Pr^+(y, t) ← \Pr^+(y, t) + \Pr(y^e, \hat{y}, t)
            \Pr^−(y, t) ← \Pr(y, t−1) \Pr(−, t \mid x)
            Add y to B
            for k = 1 … K do
                \Pr^−(y + k, t) ← 0
                \Pr^+(y + k, t) ← \Pr(k, y, t)
                Add (y + k) to B
    Return: \max_{y \in B} \Pr^{1/|y|}(y, T)

The transition probabilities \Pr(k \mid y) can be used to integrate prior linguistic information into the search. If no such knowledge is present (as with standard CTC), all \Pr(k \mid y) are set to 1. Constraining the search to dictionary words can be easily implemented by setting \Pr(k \mid y) = 1 if (y + k) is in the dictionary and 0 otherwise. To apply a statistical language model, note that \Pr(k \mid y) should represent normalised label-to-label transition probabilities. To convert a word-level language model to a label-level one, first note that any label sequence y can be expressed as the concatenation y = (w + p), where w is the longest complete sequence of dictionary words in y and p is the remaining word prefix. Both w and p may be empty. Then we can write

    \Pr(k \mid y) = \left[ \frac{\sum_{w' \in (p+k)^*} \Pr(w' \mid w)}{\sum_{w' \in p^*} \Pr(w' \mid w)} \right]^{\gamma}    (23)

where \Pr(w' \mid w) is the probability assigned to the transition from the word history w to the word w', p^* is the set of dictionary words prefixed by p, and \gamma is the language model weighting factor.

The length normalisation in the final step of the algorithm is helpful when decoding with a language model, as otherwise sequences with fewer transitions are unfairly favoured; it has little impact otherwise.

6. Experiments

The experiments were carried out on the Wall Street Journal (WSJ) corpus (available as LDC corpus LDC93S6B and LDC94S13B). The RNN was trained on both the 14 hour subset 'train-si84' and the full 81 hour set, with the 'test-dev93' development set used for validation. There were a total of 43 characters (including upper case letters, punctuation and a space character to delimit the words). The input data were presented as spectrograms derived from the raw audio files using the 'specgram' function of the matplotlib python toolkit, with width 254 Fourier windows and an overlap of 127 frames, giving 128 inputs per frame.

The network had five levels of bidirectional LSTM hidden layers, with 500 cells in each layer, giving a total of 26.5M weights. It was trained using stochastic gradient descent with one weight update per utterance, a learning rate of 10^−4 and a momentum of 0.9.

The RNN was compared to a baseline deep neural network-HMM hybrid (DNN-HMM). The DNN-HMM was created using alignments from an SGMM-HMM system trained using Kaldi recipe 's5', model 'tri4b' (Povey et al., 2011). The 14 hour subset was first used to train a Deep Belief Network (DBN) (Hinton & Salakhutdinov, 2006) with six hidden layers of 2000 units each. The input was 15 frames of Mel-scale log filterbanks (1 centre frame ± 7 frames of context) with 40 coefficients, deltas and accelerations. The DBN was trained layerwise and then used to initialise a DNN, which was trained to classify the central input frame into one of 3385 triphone states. The DNN was trained with stochastic gradient descent, starting with a learning rate of 0.1 and a momentum of 0.9. The learning rate was divided by two at the end of each epoch which failed to reduce the frame error rate on the development set. After six failed attempts, the learning rate was frozen. The DNN posteriors were divided by the square root of the state priors during decoding.

The RNN was first decoded with no dictionary or language model, using the space character to segment the character outputs into words and thereby calculate the WER. The network was then decoded with a 146K word dictionary, followed by monogram, bigram and trigram language models. The dictionary was built by extending the default WSJ dictionary with 125K words using the augmentation rules implemented in Kaldi recipe 's5'. The language models were built on this extended dictionary, using data from the WSJ CD (see script 'wsj_extend_dict.sh' in recipe 's5'). The language model weight was optimised separately for all experiments. For the RNN experiments with no linguistic information, and those with only a dictionary, the beam search algorithm in Section 5 was used for decoding.
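Algorithm 1 reduces to the following Python sketch when every transition probability Pr(k|y) is 1 (no dictionary or language model) and the final length normalisation is omitted; prefixes are tuples of label indices, the blank has index 0, and all names are ours:

```python
from collections import defaultdict

BLANK = 0  # index of the blank symbol; labels are 1 .. K

def ctc_beam_search(probs, beam_width):
    """Simplified Algorithm 1: prefix beam search over the CTC
    emissions, with all Pr(k|y) = 1 and no length normalisation.
    probs[t][k] is the emission probability Pr(k, t|x)."""
    p_b = defaultdict(float)   # Pr-(y, t): alignments ending in blank
    p_nb = defaultdict(float)  # Pr+(y, t): alignments ending in a label
    p_b[()] = 1.0
    for frame in probs:
        # Keep the W most probable prefixes from the previous time-step.
        prev = sorted(set(p_b) | set(p_nb),
                      key=lambda y: p_b[y] + p_nb[y], reverse=True)
        new_b, new_nb = defaultdict(float), defaultdict(float)
        for y in prev[:beam_width]:
            total = p_b[y] + p_nb[y]
            new_b[y] += total * frame[BLANK]          # emit a blank
            if y:                                     # repeat final label
                new_nb[y] += p_nb[y] * frame[y[-1]]
            for k in range(1, len(frame)):            # extend by label k
                if y and k == y[-1]:
                    # Repeating a label needs an intervening blank, so
                    # only blank-final alignments may extend (cf. the
                    # extension probability Pr(k, y, t)).
                    new_nb[y + (k,)] += p_b[y] * frame[k]
                else:
                    new_nb[y + (k,)] += total * frame[k]
        p_b, p_nb = new_b, new_nb
    return max(set(p_b) | set(p_nb), key=lambda y: p_b[y] + p_nb[y])

# Three frames over {blank, 1, 2}; the most probable transcription is (1, 2).
probs = [[0.1, 0.8, 0.1],
         [0.6, 0.2, 0.2],
         [0.1, 0.1, 0.8]]
best = ctc_beam_search(probs, beam_width=4)
```

With a sufficiently wide beam the search is exact on toy inputs, which makes it easy to sanity-check an implementation against a brute-force sum over alignments; a dictionary or language model would be added by multiplying in Pr(k|y) wherever a prefix is extended.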
For both training sets, the RNN was trained with CTC, as described in Section 3, using the characters in the transcripts as the target sequences. The RNN was then retrained to minimise the expected word error rate using the method from Section 4, with five alignment samples per sequence.

For the RNN experiments with a language model, an alternative method was used, partly due to implementation difficulties and partly to ensure a fair comparison with the baseline system: an N-best list of at most 300 candidate transcriptions was extracted from the baseline DNN-HMM and rescored by the RNN using Eq. (16). The RNN scores were then combined with the language model to rerank the N-best lists, and the WER of the best resulting transcripts was recorded. The best results were obtained with an RNN score weight of 7.7 and a language model weight of 16.

For the 81 hour training set, the oracle error rates for the monogram, bigram and trigram candidate lists ranged from 8.9% down to 1.4%, while the anti-oracle (rank 300) error rates varied from 45.5% for monograms to 33% for trigrams. Using larger N-best lists (up to N = 1000) did not yield significant performance improvements, from which we concluded that the lists were large enough to approximate the true decoding performance of the RNN.

An additional experiment was performed to measure the effect of combining the RNN and DNN. The candidate scores for 'RNN-WER' trained on the 81 hour set were blended with the DNN acoustic model scores and used to rerank the candidates. Best results were obtained with a language model weight of 1, an RNN score weight of 1 and a DNN weight of 1.

Table 1. Wall Street Journal results. All scores are word error rate / character error rate (where known) on the evaluation set. 'LM' is the language model used for decoding. '14 Hr' and '81 Hr' refer to the amount of data used for training.

    SYSTEM       LM          14 HR        81 HR
    RNN-CTC      NONE        74.2/30.9    30.1/9.2
    RNN-CTC      DICTIONARY  69.2/30.0    24.0/8.0
    RNN-CTC      MONOGRAM    25.8         15.8
    RNN-CTC      BIGRAM      -            10.4
    RNN-CTC      TRIGRAM     13.5         8.7
    RNN-WER      NONE        74.5/31.3    27.3/8.4
    RNN-WER      DICTIONARY  69.7/31.0    21.9/7.3
    RNN-WER      MONOGRAM    26.0         15.2
    RNN-WER      BIGRAM      -            9.8
    RNN-WER      TRIGRAM     -            8.2
    BASELINE     DICTIONARY  -            -
    BASELINE     MONOGRAM    23.4         19.9
    BASELINE     BIGRAM      -            9.4
    BASELINE     TRIGRAM     9.4          7.8
    COMBINATION  TRIGRAM     -            6.7

The results in Table 1 demonstrate that on the full training set the character-level RNN outperforms the baseline model when no language model is present. The RNN retrained to minimise word error rate (labelled 'RNN-WER' to distinguish it from the original 'RNN-CTC' network) performed particularly well in this regime. This is likely due to two factors: firstly the RNN is able to learn a more powerful acoustic model, as it has access to more acoustic context; and secondly it is able to learn an implicit language model from the training transcriptions. However the baseline system overtook the RNN as the LM was strengthened: in this case the RNN's implicit LM may work against it by interfering with the explicit model. Nonetheless the difference was small, considering that so much more prior information (audio pre-processing, pronunciation dictionary, state tying, forced alignment) was encoded into the baseline system. Unsurprisingly, the gap between RNN-CTC and RNN-WER also shrank as the LM became more dominant.

The baseline system improved only incrementally from the 14 hour to the 81 hour training set, while the RNN error rate dropped dramatically. A possible explanation is that 14 hours of transcribed speech is insufficient for the RNN to learn how to 'spell' enough of the words it needs for accurate transcription, whereas it is enough to learn to identify phonemes.

The combined model performed considerably better than either the RNN or the baseline individually. The improvement of more than 1% absolute over the baseline is considerably larger than the slight gains usually seen with model averaging; this is presumably due to the greater difference between the systems.

7. Discussion

To provide character-level transcriptions, the network must not only learn how to recognise speech sounds, but how to transform them into letters. In other words, it must learn how to spell. This is challenging, especially in an orthographically irregular language like English. The following examples from the evaluation set, decoded with no dictionary or language model, give some insight into how the network operates:

    target: TO ILLUSTRATE THE POINT A PROMINENT MIDDLE EAST ANALYST IN WASHINGTON RECOUNTS A CALL FROM ONE CAMPAIGN
    output: TWO ALSTRAIT THE POINT A PROMINENT MIDILLE EAST ANALYST IM WASHINGTON RECOUNCACALL FROM ONE CAMPAIGN

    target: T. W. A. ALSO PLANS TO HANG ITS BOUTIQUE SHINGLE IN AIRPORTS AT LAMBERT SAINT
    output: T. W. A. ALSO PLANS TOHING ITS BOOTIK SINGLE IN AIRPORTS AT LAMBERT SAINT

    target: ALL THE EQUITY RAISING IN MILAN GAVE THAT STOCK MARKET INDIGESTION LAST YEAR
    output: ALL THE EQUITY RAISING IN MULONG GAVE THAT STACRK MARKET IN TO JUSTIAN LAST YEAR

    target: THERE'S UNREST BUT WE'RE NOT GOING TO LOSE THEM TO DUKAKIS
    output: THERES UNREST BUT WERE NOT GOING TO LOSE THEM TO DEKAKIS

Like all speech recognition systems, the network makes phonetic mistakes, such as 'SINGLE' instead of 'SHINGLE', and sometimes confuses homophones like 'two' and 'to'. The latter problem may be harder than usual to fix with a language model, as words that are close in sound can be quite distant in spelling. Unlike phonetic systems, the network also makes lexical errors (e.g. 'BOOTIK' for 'BOUTIQUE') and errors that combine the two, such as 'ALSTRAIT' for 'ILLUSTRATE'.

It is able to correctly transcribe fairly complex words such as 'campaign', 'analyst' and 'equity' that appear frequently in financial texts (possibly learning them as special cases), but struggles with both the sound and spelling of unfamiliar words, especially proper names such as 'Milan' and 'Dukakis'. This suggests that out-of-vocabulary words may still be a problem for character-level recognition, even in the absence of a dictionary. However, the fact that the network can spell at all shows that it is able to infer significant linguistic information from the training transcripts, paving the way for a truly end-to-end speech recognition system.

Figure 4. Network outputs. The figure shows the frame-level character probabilities emitted by the CTC layer (a different colour for each character, dotted grey line for blanks), along with the corresponding training errors, while processing an utterance. The target transcription was 'HIS_FRIENDS_', where the underscores are end-of-word markers. The network was trained with WER loss, which tends to give very sharp output decisions, and hence sparse error signals (if an output probability is 1, nothing else can be sampled, so the gradient is 0 even if the output is wrong). In this case the only gradient comes from the extraneous apostrophe before the 'S'. Note that the characters in common sequences such as 'IS', 'RI' and 'END' are emitted very close together, suggesting that the network learns them as single sounds.

In the future, it would be interesting to apply the system to datasets where the language model plays a lesser role, such as spontaneous speech, or where the training set is sufficiently large that the network can learn a language model from the transcripts alone. Another promising direction would be to integrate the language model into the CTC or expected transcription loss objective functions during training.

Acknowledgements

The authors wish to thank Daniel Povey for his assistance with Kaldi. This work was partially supported by the Canadian Institute for Advanced Research.
8. Conclusion

This paper has demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing and no explicit phonetic representation. We have also introduced a novel objective function that allows the network to be directly optimised for word error rate, and shown how to integrate the network outputs with a language model during decoding. Finally, by combining the new model with a baseline, we have achieved state-of-the-art accuracy on the Wall Street Journal corpus for speaker-independent recognition.

References

Bahl, L., Brown, P., De Souza, P. V., and Mercer, R. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In ICASSP '86, volume 11, pp. 49-52, April 1986.

Bisani, Maximilian and Ney, Hermann. Open vocabulary speech recognition with flat hybrid models. In INTERSPEECH, pp. 725-728, 2005.

Bourlard, Hervé A. and Morgan, Nelson. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Norwell, MA, USA, 1993.

Ciresan, Dan C., Meier, Ueli, Masci, Jonathan, and Schmidhuber, Jürgen. A committee of neural networks for traffic sign classification. In IJCNN, pp. 1918-1921. IEEE, 2011.

Davis, S. and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357-366, August 1980.

Eyben, F., Wöllmer, M., Schuller, B., and Graves, A. From speech to letters - using a novel neural network architecture for grapheme based ASR. In Proc. Automatic Speech Recognition and Understanding Workshop (ASRU 2009), Merano, Italy. IEEE, 2009.

Galescu, Lucian. Recognition of out-of-vocabulary words with sub-lexical language models. In INTERSPEECH, 2003.

Gers, F., Schraudolph, N., and Schmidhuber, J. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115-143, 2002.

Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610, June/July 2005.

Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, Pittsburgh, USA, 2006.

Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In Proc. ICASSP 2013, Vancouver, Canada, May 2013.

Graves, Alex. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385 of Studies in Computational Intelligence. Springer, 2012.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, July 2006.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jaitly, Navdeep and Hinton, Geoffrey E. Learning a better representation of speech sound waves using restricted Boltzmann machines. In ICASSP, pp. 5884-5887, 2011.

Jaitly, Navdeep, Nguyen, Patrick, Senior, Andrew W., and Vanhoucke, Vincent. Application of pretrained deep neural networks to large vocabulary speech recognition. In INTERSPEECH, 2012.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

Lee, Li and Rose, R. A frequency warping approach to speaker normalization. IEEE Transactions on Speech and Audio Processing, 6(1):49-60, January 1998.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011.

Schuster, M. and Paliwal, K. K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673-2681, 1997.