Speech Recognition_Columbia (Columbia University speech recognition lecture notes)

Columbia University Spring 2016 speech recognition lecture notes, covering GMMs, HMMs, acoustic models, language models, model robustness, deep learning for speech recognition, and other recent topics.
Speech Signal Representations

[Figure 6.28 Triangular filters H_1[k], H_2[k], ..., H_M[k], with band-edge frequencies f[0], f[1], ..., f[7], used in the computation of the mel-cepstrum using Eq. (6.140).]

The mel-frequency cepstrum is then the discrete cosine transform of the M filter outputs:

c[n] = Σ_{m=1}^{M} S[m] cos(π n (m - 1/2)/M),  0 ≤ n < M    (6.145)

where M varies for different implementations from 24 to 40. For speech recognition, typically only the first 13 cepstrum coefficients are used. It is important to note that the MFCC representation is no longer a homomorphic transformation. It would be if the order of summation and logarithm in Eq. (6.144) were reversed:

S[m] = Σ_k ln(|X[k]|² H_m[k]),  1 ≤ m ≤ M    (6.146)

In practice, however, the MFCC representation is approximately homomorphic for filters that have a smooth transfer function. The advantage of the MFCC representation using Eq. (6.144) instead of Eq. (6.146) is that the filter energies are more robust to noise and spectral estimation errors. This algorithm has been used extensively as a feature vector for speech recognition systems. While the definition of cepstrum in Section 6.4.1 uses an inverse DFT, since S[m] is even, a DCT-II can be used instead (see Chapter 5).

6.5.3. Perceptual Linear Prediction (PLP)

Perceptual Linear Prediction (PLP) [16] uses the standard Durbin recursion to compute LPC coefficients, and typically the LPC coefficients are transformed to LPC-cepstrum using the recursion in Section 6.4.2.1. But unlike standard linear prediction, the autocorrelation coefficients are not computed in the time domain through Eq. (6.55). The autocorrelation R[n] is the inverse Fourier transform of the power spectrum X(ω) of the signal. We cannot compute the continuous-frequency Fourier transform easily.

CHAPTER 9 Linear Predictive Analysis of Speech Signals

9.1 INTRODUCTION

Linear predictive analysis of speech signals is one of the most powerful speech analysis techniques.
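Before moving on, the MFCC computation of Eqs. (6.144)-(6.145) above can be sketched in a few lines. This is a minimal illustration, not code from the lecture notes: the function name, the 1e-12 log floor, and the idea of passing a precomputed power spectrum and triangular filter matrix are all assumptions made for the example.

```python
import numpy as np

def mfcc_from_power_spectrum(power_spec, mel_filters, n_ceps=13):
    """Mel-cepstrum: DCT of the log mel filter-bank energies (Eqs. 6.144-6.145)."""
    M = mel_filters.shape[0]
    # Eq. (6.144): log of the summed filter energies (the noise-robust ordering,
    # as opposed to summing the logs as in Eq. (6.146)).
    S = np.log(mel_filters @ power_spec + 1e-12)
    # Eq. (6.145): c[n] = sum_{m=1}^{M} S[m] cos(pi * n * (m - 1/2) / M)
    m = np.arange(1, M + 1)
    n = np.arange(n_ceps)
    return np.cos(np.pi * np.outer(n, m - 0.5) / M) @ S

# Example with random stand-ins for a real frame's power spectrum and filters.
rng = np.random.default_rng(0)
power = rng.random(257) + 0.1          # stand-in for |X[k]|^2
filters = rng.random((24, 257))        # stand-in for M = 24 triangular filters
c = mfcc_from_power_spectrum(power, filters)
```

Note that, as the text says, only the first 13 of the M possible cepstral coefficients are retained.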
This method has become the predominant technique for estimating the parameters of the discrete-time model for speech production (i.e., pitch, formants, short-time spectra, vocal tract area functions) and is widely used for representing speech in low bit rate transmission or storage and for automatic speech and speaker recognition. The importance of this method lies both in its ability to provide accurate estimates of the speech parameters and in its relative ease of computation. In this chapter we present the fundamental concepts of linear predictive analysis of speech, and we discuss some of the issues involved in using linear predictive analysis in practical speech applications.

The philosophy of linear prediction is intimately related to the basic speech synthesis model discussed in Chapter 5, where it was shown that a sampled speech signal can be modeled as the output of a linear, time-varying system (difference equation) excited by either quasi-periodic pulses (during voiced speech) or random noise (during unvoiced speech). The difference equation of the speech model suggests that a speech sample can be approximated as a linear combination of p past speech samples. By locally minimizing the sum of the squared differences between the actual speech samples and the linearly predicted samples, a unique set of predictor coefficients can be determined, and by equating the prediction coefficients to the coefficients of the difference equation of the model, we obtain a robust, reliable, and accurate method for estimating the parameters that characterize the linear time-varying system in the speech production model. In the speech processing field, linear prediction was first used in speech coding applications, and the term "linear predictive coding" (or LPC) quickly gained widespread usage.
As linear predictive analysis methods became more widely used in speech processing, the term "LPC" persisted, and now it is often used as a term for linear predictive analysis techniques in general. Wherever we use the term linear predictive coding or LPC, we intend it to have the general meaning, i.e., not restricted just to coding.

The techniques and methods of linear prediction have been available in the engineering literature for a long time [38, 417]. One of the earliest applications of the theory of linear prediction was the work of Robinson, who used linear prediction in seismic signal processing [322, 323]. The ideas of linear prediction have been used in the areas of control and information theory under the names of system estimation and system identification. The term "system identification" is particularly descriptive in speech applications, since the predictor coefficients are assumed to characterize an all-pole model of the system in the source/system model of speech production.

As applied in speech processing, the term "linear predictive analysis" refers to a variety of essentially equivalent formulations of the problem of modeling the speech signal [12, 161, 218, 232]. The differences among these formulations are often philosophical, or in point of view toward the problem of speech modeling, and mainly concern the details of the computations used to obtain the predictor coefficients. Thus, as applied to speech, the various (often equivalent) formulations of linear prediction analysis have been:

1. the covariance method [12]
2. the autocorrelation formulation [217, 229, 232]
3. the lattice method [48, 219]
4. the inverse filter formulation [232]
5. the spectral estimation formulation [48]
6. the maximum likelihood formulation [161, 162]
7. the inner product formulation [232]

In this chapter we will examine, in detail, the similarities and differences among the first three basic methods of analysis listed above, since all the other formulations are essentially equivalent to one of these three.

The importance of linear prediction lies in the accuracy with which the basic model applies to speech. Thus a major part of this chapter is devoted to a discussion of how a variety of speech parameters can be reliably estimated using linear prediction methods. Furthermore, some typical examples of speech applications that rely primarily on linear predictive analysis are discussed here, and in Chapters 10-14, to show the wide range of problems to which LPC methods have been successfully applied.

9.2 BASIC PRINCIPLES OF LINEAR PREDICTIVE ANALYSIS

Throughout this book we have repeatedly referred to the basic discrete-time model for speech production that was developed in Chapter 5. The particular form of this model that is most appropriate for the discussion of linear predictive analysis is depicted in Figure 9.1. In this model, the composite spectrum effects are represented by a time-varying digital filter whose steady-state system function is represented by the all-pole rational function

H(z) = S(z) / (G U(z)) = 1 / (1 - Σ_{k=1}^{p} a_k z^{-k})    (9.1)

[FIGURE 9.1 Block diagram of the simplified model for speech production: an impulse train generator (driven by the pitch period) and a random noise generator feed a voiced/unvoiced switch; the selected excitation, scaled by the gain G, drives the time-varying digital filter H(z), controlled by the vocal tract parameters, to produce s[n].]

We will refer to H(z) as the vocal tract system function, even though it represents not only the effects of the vocal tract resonances but also the effects of radiation at the lips and, in the case of voiced speech, the spectral effects of the glottal pulse shape.
This system is excited by a quasi-periodic impulse train for voiced speech [since the glottal pulse shape is included in H(z)] or a random noise sequence for unvoiced speech. Thus the parameters of this model are:

Excitation parameters:
- voiced/unvoiced classification
- pitch period for voiced speech
- gain parameter G

Vocal tract system parameters:
- coefficients {a_k, k = 1, 2, ..., p} of the all-pole digital filter

These parameters, of course, all vary slowly with time. The pitch period and voiced/unvoiced classification can be estimated using one of the many methods to be discussed in Chapter 10. As discussed in Chapter 5, this simplified all-pole model is a natural representation for non-nasal voiced sounds. For nasal and fricative sounds, however, the detailed acoustic theory calls for both poles and zeros in the vocal tract transfer function. We will subsequently see that if the digital filter order, p, is high enough, the all-pole model provides a good enough representation for almost all the sounds of speech, including nasal sounds and fricative sounds. The major advantage of this model is that the gain parameter, G, and the filter coefficients, {a_k}, can be estimated in a very straightforward and computationally efficient way by the method of linear predictive analysis.
Furthermore, one of the methods of pitch detection discussed in Chapter 10 is based on using the linear predictor as an inverse filter to extract the error signal, e[n], representative of the excitation signal, u[n].

For the system of Figure 9.1, the model speech samples, s[n], are related to the excitation, u[n], by the simple difference equation

s[n] = Σ_{k=1}^{p} a_k s[n-k] + G u[n]    (9.2)

A pth-order linear predictor with prediction coefficients {α_k, k = 1, 2, ..., p} is defined as a system whose input is s[n] and whose output is s̃[n], defined as

s̃[n] = Σ_{k=1}^{p} α_k s[n-k]    (9.3)

The system function of this pth-order linear predictor is the z-transform polynomial

P(z) = S̃(z)/S(z) = Σ_{k=1}^{p} α_k z^{-k}    (9.4)

P(z) is often referred to as the predictor polynomial. The prediction error, e[n], is defined as the difference between s[n] and s̃[n]; i.e.,

e[n] = s[n] - s̃[n] = s[n] - Σ_{k=1}^{p} α_k s[n-k]    (9.5)

From Eq. (9.5), it follows that the prediction error sequence is the output of a system whose input is s[n] and whose system function is

A(z) = E(z)/S(z) = 1 - P(z) = 1 - Σ_{k=1}^{p} α_k z^{-k}    (9.6)

The z-transform, A(z), is a polynomial in z^{-1}. It is often called the prediction error polynomial, or equivalently, the LPC polynomial. If the speech signal obeys the model of Eq. (9.2) exactly, so that the model output s[n] is equal to the actual sampled speech signal, then a comparison of Eqs. (9.2) and (9.5) shows that if {α_k} = {a_k} for all k, then e[n] = G u[n].
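The relationship between Eqs. (9.2) and (9.5) can be checked numerically: synthesize a signal from the all-pole difference equation and form the prediction error with matched coefficients, and the residual reduces to G u[n]. The coefficient values below are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

# Illustrative 2nd-order model coefficients a_k and gain G (not from the text).
a = np.array([1.3, -0.8])
G = 0.5
p = len(a)

# Eq. (9.2): synthesize s[n] from a white-noise excitation u[n].
rng = np.random.default_rng(1)
u = rng.standard_normal(200)
s = np.zeros(200)
for n in range(200):
    s[n] = sum(a[k] * s[n - k - 1] for k in range(p) if n - k - 1 >= 0) + G * u[n]

# Eq. (9.5): prediction error with alpha_k = a_k (samples before the segment
# are taken as zero, matching the synthesis above).
e = s.copy()
for k in range(1, p + 1):
    e[k:] -= a[k - 1] * s[:-k]

# As stated in the text, e[n] equals G * u[n] when the predictor matches the model.
```

With any mismatch between the α_k's and the a_k's, e[n] would instead retain a filtered component of the speech signal, which is what the error-minimization of Section 9.2.1 exploits.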
Thus, the prediction error filter, A(z), is an inverse filter for the vocal tract system, H(z), of Eq. (9.1); i.e.,

A(z) = G U(z)/S(z) = 1/H(z)    (9.7)

[FIGURE 9.2 Signal processing operations for inverse filtering and reconstruction of the speech signal: (a) inverse filtering of s[n] by A(z) to give the error signal, e[n], and direct filtering of the error signal, e[n], by 1/A(z) to reconstruct the original speech signal, s[n]; (b) implementation of the system 1/A(z), showing the predicted signal s̃[n] fed back through P(z) and added to e[n].]

Figure 9.2 shows the relationships between the signals s[n], e[n], and s̃[n]. The error signal, e[n], is obtained as the output of the inverse filter, A(z), with input s[n], as shown in Figure 9.2a. By assumption, e[n] is the vocal tract excitation, so it should be a quasi-periodic impulse train for voiced speech and random noise for unvoiced speech. The original speech signal, s[n], is obtained by processing the error signal with the all-pole filter, H(z), as also shown in Figure 9.2a. Figure 9.2b shows the feedback processing loop for reconstructing s[n] from e[n]; i.e., feedback of the predicted signal s̃[n] and addition of the error signal, e[n]. Thus we see that H(z) is a pth-order all-pole rational function of the form

H(z) = 1/A(z) = 1 / (1 - Σ_{k=1}^{p} α_k z^{-k})    (9.8)

and A(z) is the pth-order polynomial in Eq. (9.6). The system whose transfer function is H(z) is often called the vocal tract model system or the LPC model system.

9.2.1 Basic Formulation of Linear Prediction Analysis Equations

The basic problem of linear predictive analysis is to determine the set of predictor coefficients {α_k, k = 1, 2, ..., p} directly from the speech signal so as to obtain a good estimate of the time-varying spectral properties of the speech signal through the use of Eq. (9.8).¹ Because of the time-varying nature of the speech signal, the predictor coefficients must be estimated by a short-time analysis procedure based on finding the set of predictor coefficients that minimize the mean-squared prediction error over a short segment of the speech waveform. The resulting parameters are then assumed to be the parameters of the system function, H(z), in the model for speech production in Figure 9.1.

It may not be immediately obvious that this approach will lead to useful results, but it can be justified in several ways. First, recall that if {α_k} = {a_k}, k = 1, 2, ..., p, then e[n] = G u[n]. For voiced speech this means that e[n] would consist of a train of impulses; i.e., e[n] would be small most of the time. For unvoiced speech, the inverse filter flattens the short-time spectrum, thereby creating white noise. Thus, finding α_k's that minimize the prediction error seems consistent with this observation. A second motivation for this approach follows from the fact that if a signal is generated by Eq. (9.2) with non-time-varying coefficients and excited either by a single impulse or by a stationary white noise input, then it can be shown that the predictor coefficients that result from minimizing the mean-squared prediction error (over all time) are identical to the coefficients of Eq. (9.2). A third, very pragmatic, justification for using the minimum mean-squared prediction error as a basis for estimating the model parameters is that this approach leads to a set of linear equations that can be efficiently solved to obtain the predictor parameters. Perhaps more to the point, the ultimate justification for linear predictive analysis of speech is simply pragmatic: the linear prediction analysis model works exceedingly well.

¹ As we have stressed before in our discussion of the speech model, the z-transform representations that we have used are not strictly valid for representing the speech model, since the system is time-varying. Thus, the z-transform equations and Figure 9.2 are assumed to be valid over only short-time frames, with parameter updates from frame-to-frame.
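The inverse-filtering and reconstruction operations of Figure 9.2 can be sketched directly from the difference equations. This is an illustrative round-trip check, with arbitrarily chosen stable coefficients; the function names are inventions for the example.

```python
import numpy as np

def inverse_filter(s, alpha):
    """Figure 9.2a: e[n] = s[n] - sum_k alpha_k s[n-k], i.e., filtering by A(z)."""
    e = s.copy()
    for k, ak in enumerate(alpha, start=1):
        e[k:] -= ak * s[:-k]
    return e

def all_pole_filter(e, alpha):
    """Figure 9.2b: s[n] = e[n] + sum_k alpha_k s[n-k], i.e., filtering by 1/A(z)."""
    s = np.zeros_like(e)
    for n in range(len(e)):
        s[n] = e[n] + sum(ak * s[n - k]
                          for k, ak in enumerate(alpha, start=1) if n - k >= 0)
    return s

alpha = np.array([1.3, -0.8])    # illustrative, stable predictor coefficients
rng = np.random.default_rng(2)
s = all_pole_filter(rng.standard_normal(300), alpha)   # synthetic signal from H(z)
e = inverse_filter(s, alpha)     # error signal, as in Figure 9.2a
s_rec = all_pole_filter(e, alpha)  # reconstruction of s[n], as in Figure 9.2b
```

Because A(z) and 1/A(z) are exact inverses (with matching zero initial conditions), s_rec reproduces s sample for sample, which is the point of the feedback loop in Figure 9.2b.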
The short-time total squared prediction error is defined as

E_n = Σ_m e_n²[m] = Σ_m (s_n[m] - s̃_n[m])²    (9.9a)
    = Σ_m (s_n[m] - Σ_{k=1}^{p} α_k s_n[m-k])²    (9.9b)

where s_n[m] is a segment of speech that has been selected in the vicinity of sample n; i.e.,

s_n[m] = s[m + n]    (9.10)

for m in some finite interval around n. The range of summation in Eqs. (9.9a) and (9.9b) is temporarily left unspecified, but since we wish to develop a short-time analysis technique, the sum will always be over a finite interval. Also note that to obtain an average (or mean) squared error, we should divide the sum by the length of the speech segment. However, this constant is irrelevant to the set of linear equations that we will obtain and therefore is omitted. We can find the values of α_k that minimize E_n in Eq. (9.9b) by setting ∂E_n/∂α_i = 0, i = 1, 2, ..., p, thereby obtaining the equations

Σ_m s_n[m-i] s_n[m] = Σ_{k=1}^{p} α̂_k Σ_m s_n[m-i] s_n[m-k],  1 ≤ i ≤ p,    (9.11)

where α̂_k are the values of α_k that minimize E_n. (Since α̂_k is unique, we will henceforth drop the caret and use the notation α_k to denote the values that minimize E_n.) If we define

φ_n[i, k] = Σ_m s_n[m-i] s_n[m-k]    (9.12)

then Eq. (9.11) can be written more compactly as

Σ_{k=1}^{p} α_k φ_n[i, k] = φ_n[i, 0],  1 ≤ i ≤ p    (9.13)

This set of p equations in p unknowns can be solved efficiently for the unknown predictor coefficients {α_k} that minimize the total squared prediction error for the segment s_n[m]. Using Eqs. (9.9b) and (9.11), the minimum mean-squared prediction error can be shown to be

E_n = Σ_m s_n²[m] - Σ_{k=1}^{p} α_k Σ_m s_n[m] s_n[m-k]    (9.14)

and using Eq. (9.12), we can express E_n as

E_n = φ_n[0, 0] - Σ_{k=1}^{p} α_k φ_n[0, k]    (9.15)

where {α_k, k = 1, 2, ..., p} is the set of predictor coefficients satisfying Eq. (9.13). Thus the total minimum error consists of a fixed component, φ_n[0, 0], which is equal to the total sum of squares (energy) of the segment s_n[m], and a component that depends on the predictor coefficients.

To solve for the optimum predictor coefficients, we must first compute the quantities φ_n[i, k] for 1 ≤ i ≤ p and 0 ≤ k ≤ p. Once this is done, we only have to solve Eq. (9.13) to obtain the α_k's.
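The procedure just described, computing φ_n[i, k] from Eq. (9.12) and then solving the p×p system of Eq. (9.13), can be sketched as follows. Since the text deliberately leaves the summation range open at this point, the sketch picks one concrete choice (m = p, ..., N-1, so every delayed sample stays inside the segment); the function name and the impulse-excited sanity check are illustrative assumptions.

```python
import numpy as np

def solve_lpc(s, p):
    """Build phi[i,k] (Eq. 9.12) over m = p..len(s)-1 and solve Eq. (9.13)."""
    N = len(s)
    phi = np.empty((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            # phi[i,k] = sum_m s[m-i] * s[m-k], m = p..N-1
            phi[i, k] = np.dot(s[p - i:N - i], s[p - k:N - k])
    alpha = np.linalg.solve(phi[1:, 1:], phi[1:, 0])   # Eq. (9.13)
    E = phi[0, 0] - alpha @ phi[0, 1:]                 # Eq. (9.15)
    return alpha, E

# Sanity check: a signal that obeys Eq. (9.2) exactly, excited by a single
# impulse, should yield alpha_k = a_k and E ~= 0.
a_true = np.array([0.9, -0.5])
s = np.zeros(400)
s[0] = 1.0
for n in range(1, 400):
    s[n] = sum(a_true[k] * s[n - k - 1] for k in range(2) if n - k - 1 >= 0)
alpha, E = solve_lpc(s, 2)
```

The choice of summation range is exactly what distinguishes the covariance and autocorrelation formulations listed earlier; the range used here corresponds to the covariance-style convention.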
Thus, in principle, linear prediction analysis is very straightforward. However, the details of the computation of φ_n[i, k] and the subsequent solution of the equations are somewhat intricate, and further discussion is required.

So far we have not explicitly indicated the limits on the sums in Eqs. (9.9a) or (9.9b) and in Eq. (9.11); however, it should be emphasized that the limits on the sum in Eq. (9.11) are identical to the limits assumed for the mean-squared prediction error in Eqs. (9.9a) or (9.9b). As we have stated, if we wish to develop a short-time analysis

² While the α_k's are functions of n (the time index at which they are estimated), it is cumbersome and generally unnecessary to show this dependence explicitly. It is also advantageous to drop the subscripts n on E_n, s_n[m], and φ_n[i, k] when no confusion will result.
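For the autocorrelation formulation listed earlier, φ_n[i, k] depends only on |i - k| and Eq. (9.13) becomes a Toeplitz system, which the "standard Durbin recursion" cited in the PLP passage solves in O(p²) operations. The details are not given in this excerpt, so the following is a generic textbook sketch of that recursion, with illustrative autocorrelation values.

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: given autocorrelations r[0..p], solve the
    Toeplitz normal equations for alpha_k and the minimum prediction error.
    Generic textbook form; not code from these notes."""
    alpha = np.zeros(p)
    E = float(r[0])
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        k = (r[i] - np.dot(alpha[:i - 1], r[i - 1:0:-1])) / E
        new = alpha.copy()
        new[i - 1] = k
        if i > 1:
            # alpha_j^(i) = alpha_j^(i-1) - k * alpha_{i-j}^(i-1), j = 1..i-1
            new[:i - 1] = alpha[:i - 1] - k * alpha[i - 2::-1]
        alpha = new
        E *= (1.0 - k * k)
    return alpha, E

# Example: autocorrelation values chosen purely for illustration.
r = np.array([1.0, 0.5, 0.2, 0.05])
alpha, E = levinson_durbin(r, 3)
```

For a valid (positive-definite) autocorrelation sequence, every reflection coefficient satisfies |k| < 1, so the error E stays positive and decreases at each order, which is one reason this recursion is the standard solver for the autocorrelation method.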
