Speech Recognition_Columbia (Columbia University speech recognition lecture notes)


-
Columbia University Spring 2016 speech recognition lecture notes, covering GMMs, HMMs, acoustic models, language models, model robustness, deep learning for speech recognition, and other recent topics.
318 Speech Signal Representations

[Figure 6.28: Triangular filters H_1[k], H_2[k], ..., H_M[k] (with boundary frequencies f[0], f[1], ..., f[7]) used in the computation of the mel-cepstrum using Eq. (6.140).]

The mel-frequency cepstrum is then the discrete cosine transform of the M filter outputs:

c[n] = Σ_{m=1}^{M} S[m] cos(πn(m − 1/2)/M),  0 ≤ n < M    (6.145)

where M varies for different implementations from 24 to 40. For speech recognition, typically only the first 13 cepstrum coefficients are used. It is important to note that the MFCC representation is no longer a homomorphic transformation. It would be if the order of summation and logarithm in Eq. (6.144) were reversed:

S[m] = Σ_{k=0}^{N−1} ln( |X[k]|² H_m[k] ),  1 ≤ m ≤ M    (6.146)

In practice, however, the MFCC representation is approximately homomorphic for filters that have a smooth transfer function. The advantage of the MFCC representation using Eq. (6.144) instead of Eq. (6.146) is that the filter energies are more robust to noise and spectral estimation errors. This algorithm has been used extensively as a feature vector for speech recognition systems. While the definition of cepstrum in Section 6.4.1 uses an inverse DFT, since S[m] is even, a DCT-II can be used instead (see Chapter 5).

6.5.3. Perceptual Linear Prediction (PLP)

Perceptual Linear Prediction (PLP) [16] uses the standard Durbin recursion of Section 6.3.2.1.2 to compute LPC coefficients, and typically the LPC coefficients are transformed to LPC-cepstrum using the recursion in Section 6.4.2.1. But unlike standard linear prediction, the autocorrelation coefficients are not computed in the time domain through Eq. (6.55). The autocorrelation R[n] is the inverse Fourier transform of the power spectrum X(ω) of the signal. We cannot compute the continuous-frequency Fourier transform easily.

Chapter 9: Linear Predictive Analysis of Speech Signals

9.1 INTRODUCTION

Linear predictive analysis of speech signals is one of the most powerful speech analysis techniques.
This method has become the predominant technique for estimating the parameters of the discrete-time model for speech production (i.e., pitch, formants, short-time spectra, vocal tract area functions) and is widely used for representing speech in low-bit-rate transmission or storage and for automatic speech and speaker recognition. The importance of this method lies both in its ability to provide accurate estimates of the speech parameters and in its relative ease of computation. In this chapter we present the fundamental concepts of linear predictive analysis of speech, and we discuss some of the issues involved in using linear predictive analysis in practical speech applications.

The philosophy of linear prediction is intimately related to the basic speech synthesis model discussed in Chapter 5, where it was shown that a sampled speech signal can be modeled as the output of a linear, time-varying system (difference equation) excited by either quasi-periodic pulses (during voiced speech) or random noise (during unvoiced speech). The difference equation of the speech model suggests that a speech sample can be approximated as a linear combination of p past speech samples. By locally minimizing the sum of the squared differences between the actual speech samples and the linearly predicted samples, a unique set of predictor coefficients can be determined, and by equating the prediction coefficients to the coefficients of the difference equation of the model, we obtain a robust, reliable, and accurate method for estimating the parameters that characterize the linear time-varying system in the speech production model. In the speech processing field, linear prediction was first used in speech coding applications, and the term "linear predictive coding" (or LPC) quickly gained widespread usage.
As linear predictive analysis methods became more widely used in speech processing, the term "LPC" persisted, and now it is often used as a term for linear predictive analysis techniques in general. Wherever we use the term linear predictive coding or LPC, we intend it to have the general meaning, i.e., not restricted just to coding.

The techniques and methods of linear prediction have been available in the engineering literature for a long time [38, 417]. One of the earliest applications of the theory of linear prediction was the work of Robinson, who used linear prediction in seismic signal processing [322, 323]. The ideas of linear prediction have been used in the areas of control and information theory under the names of system estimation and system identification. The term "system identification" is particularly descriptive in speech applications since the predictor coefficients are assumed to characterize an all-pole model of the system in the source/system model of speech production.

As applied in speech processing, the term "linear predictive analysis" refers to a variety of essentially equivalent formulations of the problem of modeling the speech signal [12, 161, 218, 232]. The differences among these formulations are often philosophical or in point of view toward the problem of speech modeling. The differences mainly concern the details of the computations used to obtain the predictor coefficients. Thus, as applied to speech, the various (often equivalent) formulations of linear prediction analysis have been:

1. the covariance method [12]
2. the autocorrelation formulation [217, 229, 232]
3. the lattice method [48, 219]
4. the inverse filter formulation [232]
5. the spectral estimation formulation [48]
6. the maximum likelihood formulation [161, 162]
7. the inner product formulation [232]

In this chapter we will examine, in detail, the similarities and differences among the first three basic methods of analysis listed above, since all the other formulations are essentially equivalent to one of these three.

The importance of linear prediction lies in the accuracy with which the basic model applies to speech. Thus a major part of this chapter is devoted to a discussion of how a variety of speech parameters can be reliably estimated using linear prediction methods. Furthermore, some typical examples of speech applications that rely primarily on linear predictive analysis are discussed here, and in Chapters 10-14, to show the wide range of problems to which LPC methods have been successfully applied.

9.2 BASIC PRINCIPLES OF LINEAR PREDICTIVE ANALYSIS

Throughout this book we have repeatedly referred to the basic discrete-time model for speech production that was developed in Chapter 5. The particular form of this model that is most appropriate for the discussion of linear predictive analysis is depicted in Figure 9.1. In this model, the composite spectrum effects are represented by a time-varying digital filter whose steady-state system function is represented by the all-pole rational function

H(z) = S(z) / (G U(z)) = 1 / (1 − Σ_{k=1}^{p} a_k z^{−k})    (9.1)

[Figure 9.1: Block diagram of the simplified model for speech production: an impulse train generator (with pitch period, for voiced speech) and a random noise generator (for unvoiced speech) feed, through a voiced/unvoiced switch and gain G, a time-varying digital filter H(z) whose output is s[n].]

We will refer to H(z) as the vocal tract system function even though it represents not only the effects of the vocal tract resonances but also the effects of radiation at the lips and, in the case of voiced speech, the spectral effects of the glottal pulse shape.
This system is excited by a quasi-periodic impulse train for voiced speech [since the glottal pulse shape is included in H(z)] or a random noise sequence for unvoiced speech. Thus the parameters of this model are:

Excitation parameters:
  voiced/unvoiced classification
  pitch period for voiced speech
  gain parameter G

Vocal tract system parameters:
  coefficients {a_k, k = 1, 2, ..., p} of the all-pole digital filter

These parameters, of course, all vary slowly with time. The pitch period and voiced/unvoiced classification can be estimated using one of the many methods to be discussed in Chapter 10. As discussed in Chapter 5, this simplified all-pole model is a natural representation for non-nasal voiced sounds. For nasal and fricative sounds, however, the detailed acoustic theory calls for both poles and zeros in the vocal tract transfer function. We will subsequently see that if the digital filter order, p, is high enough, the all-pole model provides a good enough representation for almost all the sounds of speech, including nasal sounds and fricative sounds. The major advantage of this model is that the gain parameter, G, and the filter coefficients, {a_k}, can be estimated in a very straightforward and computationally efficient way by the method of linear predictive analysis.
Furthermore, one of the methods of pitch detection discussed in Chapter 10 is based on using the linear predictor as an inverse filter to extract the error signal, e[n], representative of the excitation signal, u[n].

For the system of Figure 9.1, the model speech samples, s[n], are related to the excitation, u[n], by the simple difference equation

s[n] = Σ_{k=1}^{p} a_k s[n − k] + G u[n]    (9.2)

A p-th-order linear predictor with prediction coefficients {α_k, k = 1, 2, ..., p} is defined as a system whose input is s[n] and whose output is s̃[n], defined as

s̃[n] = Σ_{k=1}^{p} α_k s[n − k]    (9.3)

The system function of this p-th-order linear predictor is the z-transform polynomial

P(z) = S̃(z) / S(z) = Σ_{k=1}^{p} α_k z^{−k}    (9.4)

P(z) is often referred to as the predictor polynomial. The prediction error, e[n], is defined as the difference between s[n] and s̃[n]; i.e.,

e[n] = s[n] − s̃[n] = s[n] − Σ_{k=1}^{p} α_k s[n − k]    (9.5)

From Eq. (9.5), it follows that the prediction error sequence is the output of a system whose input is s[n] and whose system function is

A(z) = E(z) / S(z) = 1 − P(z) = 1 − Σ_{k=1}^{p} α_k z^{−k}    (9.6)

The z-transform, A(z), is a polynomial in z. It is often called the prediction error polynomial, or equivalently, the LPC polynomial. If the speech signal obeys the model of Eq. (9.2) exactly, so that the model output s[n] is equal to the actual sampled speech signal, then a comparison of Eqs. (9.2) and (9.5) shows that if {α_k} = {a_k} for all k, then e[n] = G u[n].
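The relation e[n] = G u[n] can be checked numerically. Below is a minimal sketch (not from the text; the coefficient values, gain, and frame length are made up for illustration): it generates s[n] from the difference equation of Eq. (9.2) with known coefficients and an impulse-train excitation, then forms the prediction error of Eq. (9.5) with α_k = a_k, and the residual recovers the scaled excitation exactly.

```python
import numpy as np

# Hypothetical model: order p = 2, made-up coefficients and gain.
a = np.array([0.9, -0.5])   # a_k, k = 1..p
G = 2.0
p = len(a)

# Excitation u[n]: a quasi-periodic impulse train (the voiced case).
N = 64
u = np.zeros(N)
u[::16] = 1.0

# Eq. (9.2): s[n] = sum_k a_k s[n-k] + G u[n]
s = np.zeros(N)
for n in range(N):
    for k in range(1, p + 1):
        if n - k >= 0:
            s[n] += a[k - 1] * s[n - k]
    s[n] += G * u[n]

# Eq. (9.5): e[n] = s[n] - sum_k alpha_k s[n-k], with alpha_k = a_k
e = np.zeros(N)
for n in range(N):
    e[n] = s[n]
    for k in range(1, p + 1):
        if n - k >= 0:
            e[n] -= a[k - 1] * s[n - k]

# Since the signal obeys the model exactly, e[n] = G u[n].
assert np.allclose(e, G * u)
```

The residual is small (zero, here) everywhere except at the excitation impulses, which is exactly the behavior the pitch-detection application above exploits.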
Thus, the prediction error filter, A(z), is an inverse filter for the vocal tract system, H(z), of Eq. (9.1); i.e.,

A(z) = G U(z) / S(z) = 1 / H(z)    (9.7)

[Figure 9.2: Signal processing operations for inverse filtering and reconstruction of the speech signal: (a) inverse filtering of s[n] to give the error signal, e[n], and direct filtering of the error signal, e[n], to reconstruct the original speech signal, s[n]; (b) implementation of the system H(z), showing the predicted signal s̃[n].]

Figure 9.2 shows the relationships between the signals s[n], e[n], and s̃[n]. The error signal, e[n], is obtained as the output of the inverse filter, A(z), with input s[n], as shown in Figure 9.2a. By assumption, e[n] is the vocal tract excitation, so it should be a quasi-periodic impulse train for voiced speech and random noise for unvoiced speech. The original speech signal, s[n], is obtained by processing the error signal with the all-pole filter, H(z), as also shown in Figure 9.2a. Figure 9.2b shows the feedback processing loop for reconstructing s[n] from e[n]; i.e., feedback of the predicted signal s̃[n] and addition of the error signal, e[n]. Thus we see that H(z) is a p-th-order all-pole rational function of the form

H(z) = 1 / A(z) = 1 / (1 − Σ_{k=1}^{p} α_k z^{−k})    (9.8)

and A(z) is the p-th-order polynomial in Eq. (9.6). The system whose transfer function is H(z) is often called the vocal tract model system or the LPC model system.

9.2.1 Basic Formulation of Linear Prediction Analysis Equations

The basic problem of linear predictive analysis is to determine the set of predictor coefficients {α_k, k = 1, 2, ..., p} directly from the speech signal so as to obtain a good estimate of the time-varying spectral properties of the speech signal through the use of Eq. (9.8). (As we have stressed before in our discussion of the speech model, the z-transform representations that we have used are not strictly valid for representing the speech model since the system is time-varying. Thus, the z-transform equations and Figure 9.2 are assumed to be valid over only short-time frames, with parameter updates from frame to frame.) Because of the time-varying nature of the speech signal, the predictor coefficients must be estimated by a short-time analysis procedure based on finding the set of predictor coefficients that minimize the mean-squared prediction error over a short segment of the speech waveform. The resulting parameters are then assumed to be the parameters of the system function, H(z), in the model for speech production in Figure 9.1.

It may not be immediately obvious that this approach will lead to useful results, but it can be justified in several ways. First, recall that if {α_k} = {a_k}, k = 1, 2, ..., p, then e[n] = G u[n]. For voiced speech this means that e[n] would consist of a train of impulses; i.e., e[n] would be small most of the time. For unvoiced speech, the inverse filter flattens the short-time spectrum, thereby creating white noise. Thus, finding α_k's that minimize the prediction error seems consistent with this observation. A second motivation for this approach follows from the fact that if a signal is generated by Eq. (9.2) with non-time-varying coefficients and excited either by a single impulse or by a stationary white noise input, then it can be shown that the predictor coefficients that result from minimizing the mean-squared prediction error (over all time) are identical to the coefficients of Eq. (9.2). A third, very pragmatic justification for using the minimum mean-squared prediction error as a basis for estimating the model parameters is that this approach leads to a set of linear equations that can be efficiently solved to obtain the predictor parameters. Perhaps more to the point, the ultimate justification for linear predictive analysis of speech is simply pragmatic: the linear prediction analysis model works exceedingly well.
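The inverse-filtering and reconstruction operations of Figure 9.2 follow directly from Eqs. (9.5) and (9.8). The short sketch below (illustrative only; the coefficients are made up and the test signal is random rather than speech) applies A(z) to a signal and then runs the feedback loop of Figure 9.2b, confirming that H(z) = 1/A(z) restores the original samples.

```python
import numpy as np

def inverse_filter(s, a):
    """A(z): e[n] = s[n] - sum_k a_k s[n-k]  (Eq. 9.5)."""
    p = len(a)
    e = s.copy()
    for k in range(1, p + 1):
        e[k:] -= a[k - 1] * s[:-k]
    return e

def all_pole_filter(e, a):
    """H(z) = 1/A(z), implemented as the feedback loop of Figure 9.2b:
    s[n] = e[n] + sum_k a_k s[n-k]."""
    p = len(a)
    s = np.zeros_like(e)
    for n in range(len(e)):
        s[n] = e[n] + sum(a[k - 1] * s[n - k]
                          for k in range(1, p + 1) if n - k >= 0)
    return s

# Round trip: inverse filtering followed by the feedback loop
# reconstructs s[n] exactly (to floating-point precision).
rng = np.random.default_rng(0)
s = rng.standard_normal(100)
a = np.array([1.2, -0.6])   # hypothetical predictor coefficients
assert np.allclose(all_pole_filter(inverse_filter(s, a), a), s)
```

Note that both filters drop the terms with n − k < 0, i.e., they assume zero initial conditions; this is consistent with treating each short-time frame in isolation, as the footnote above requires.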
The short-time total squared prediction error is defined as

E_n = Σ_m e_n²[m] = Σ_m (s_n[m] − s̃_n[m])²    (9.9a)
    = Σ_m ( s_n[m] − Σ_{k=1}^{p} α_k s_n[m − k] )²    (9.9b)

where s_n[m] is a segment of speech that has been selected in the vicinity of sample n; i.e.,

s_n[m] = s[m + n]    (9.10)

for m in some finite interval around n. The range of summation in Eqs. (9.9a) and (9.9b) is temporarily left unspecified, but since we wish to develop a short-time analysis technique, the sum will always be over a finite interval. Also note that to obtain an average (or mean) squared error, we should divide the sum by the length of the speech segment. However, this constant is irrelevant to the set of linear equations that we will obtain and therefore is omitted. We can find the values of α_k that minimize E_n in Eq. (9.9b) by setting ∂E_n/∂α_i = 0, i = 1, 2, ..., p, thereby obtaining the equations

Σ_m s_n[m − i] s_n[m] = Σ_{k=1}^{p} α̂_k Σ_m s_n[m − i] s_n[m − k],  1 ≤ i ≤ p,    (9.11)

where α̂_k are the values of α_k that minimize E_n. (Since α̂_k is unique, we will henceforth drop the caret and use the notation α_k to denote the values that minimize E_n.) If we define

φ_n[i, k] = Σ_m s_n[m − i] s_n[m − k],    (9.12)

then Eq. (9.11) can be written more compactly as

Σ_{k=1}^{p} α_k φ_n[i, k] = φ_n[i, 0],  1 ≤ i ≤ p.    (9.13)

This set of p equations in p unknowns can be solved efficiently for the unknown predictor coefficients {α_k} that minimize the total squared prediction error for the segment s_n[m]. Using Eqs. (9.9b) and (9.11), the minimum mean-squared prediction error can be shown to be

E_n = Σ_m s_n²[m] − Σ_{k=1}^{p} α_k Σ_m s_n[m] s_n[m − k],    (9.14)

and using Eq. (9.12), we can express E_n as

E_n = φ_n[0, 0] − Σ_{k=1}^{p} α_k φ_n[0, k],    (9.15)

where {α_k, k = 1, 2, ..., p} is the set of predictor coefficients satisfying Eq. (9.13). Thus the total minimum error consists of a fixed component, φ_n[0, 0], which is equal to the total sum of squares (energy) of the segment s_n[m], and a component that depends on the predictor coefficients.

To solve for the optimum predictor coefficients, we must first compute the quantities φ_n[i, k] for 1 ≤ i ≤ p and 0 ≤ k ≤ p. Once this is done, we only have to solve Eq. (9.13) to obtain the α_k's.
Thus, in principle, linear prediction analysis is very straightforward. However, the details of the computation of φ_n[i, k] and the subsequent solution of the equations are somewhat intricate, and further discussion is required.

So far we have not explicitly indicated the limits on the sums in Eqs. (9.9a) or (9.9b) and in Eq. (9.11); however, it should be emphasized that the limits on the sum in Eq. (9.11) are identical to the limits assumed for the mean-squared prediction error in Eqs. (9.9a) or (9.9b). As we have stated, if we wish to develop a short-time analysis technique, the sum must be over a finite interval.

(While the α_k's are functions of n, the time index at which they are estimated, it is cumbersome and generally unnecessary to show this dependence explicitly. It is also advantageous to drop the subscripts n on E_n, s_n[m], and φ_n[i, k] when no confusion will result.)
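The computation just described can be sketched in a few lines. The following is a minimal illustration (not the book's code; the choice of summation limits m = p, ..., L−1 is one assumption, a covariance-style formulation, and the test frame is the impulse response of a made-up order-2 model): it builds φ[i, k] from Eq. (9.12) over a finite frame, solves the p × p linear system of Eq. (9.13), and evaluates the minimum error of Eq. (9.15). Because the frame exactly obeys an order-2 recursion, the solved coefficients match the true ones and the minimum error is essentially zero.

```python
import numpy as np

def lpc_normal_equations(s_frame, p):
    """Solve Eq. (9.13) for one analysis frame, using
    phi[i, k] = sum_m s[m - i] s[m - k]  (Eq. 9.12)
    with the sum over m = p .. L-1 so every index stays in range
    (an assumed, covariance-style choice of limits)."""
    L = len(s_frame)
    phi = np.empty((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            phi[i, k] = sum(s_frame[m - i] * s_frame[m - k]
                            for m in range(p, L))
    alpha = np.linalg.solve(phi[1:, 1:], phi[1:, 0])  # Eq. (9.13)
    E_min = phi[0, 0] - alpha @ phi[0, 1:]            # Eq. (9.15)
    return alpha, E_min

# Test frame: impulse response of a known order-2 recursion
# (poles at z = 0.8 and z = 0.5).
true_alpha = np.array([1.3, -0.4])
s = np.zeros(50)
s[0] = 1.0
for n in range(1, 50):
    for k in (1, 2):
        if n - k >= 0:
            s[n] += true_alpha[k - 1] * s[n - k]

alpha, E_min = lpc_normal_equations(s, p=2)
# The recursion holds exactly for m >= 2, so alpha recovers
# true_alpha and the minimum error E_min is essentially zero.
```

This illustrates the second justification given above: for a signal generated by Eq. (9.2) with a single-impulse excitation, minimizing the squared prediction error recovers the model coefficients exactly.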
