Artificial Intelligence - Speech Recognition - Research on Low-Cost Speech Recognition Technology (PDF)
Abstract
Speech is the most efficient and direct way for people to communicate. Compared with other
human-machine interfaces such as the keyboard and mouse, speech is the most convenient input
method. Building on research that dates back to the 1950s, speech recognition technology is now
moving from the laboratory to the real world. Researchers are working to integrate Automatic
Speech Recognition (ASR) systems into mobile devices such as cellular phones, PDAs and wireless
car kits. As a result, low-cost, noise-robust speech recognition has become an active research
topic that receives considerable attention from industry. Low-cost ASR shares its basic theory
with general ASR, so the technical problems of general ASR remain. In addition, low-cost ASR faces
problems of its own because of limited computational power and scarce resources. This dissertation
proposes a series of algorithms suited to low-cost ASR systems, organized around the main sources
of variability that affect ASR. These algorithms improve the performance of a low-cost ASR system
while adding little computational load or resource demand to the overall system.
VOPER is a world-leading low-cost ASR system for embedded platforms that can be widely integrated
into mobile phones, PDAs and wireless car kits. This dissertation focuses on improving ASR
performance without adding much cost to the overall system. Based on the sources of variability
that affect ASR performance, we study the following problems: real-time endpoint detection in
mobile environments, feature extraction based on AMR vocoder parameters, environmental
compensation of the acoustic model, and fast speaker adaptation.
The algorithms proposed in this dissertation are built on the VOPER architecture and aim to
improve VOPER's performance. For example, the endpoint detection algorithm based on a noise model
has been integrated into VOPER and is used in an industrial product. The idea behind feature
extraction from AMR vocoder parameters is to remove VOPER's front-end module and thereby reduce
the computational load of the overall system; we carry out a feasibility study on this topic with
the goal of embedding VOPER in all kinds of cellular phones. The research on environmental
compensation aims to improve VOPER's performance in noisy environments. The research on speaker
adaptation aims to improve VOPER's performance for a specific speaker while maintaining
performance for other speakers; this algorithm is now integrated into VOPER.
Endpoint detection can improve both the speed and the accuracy of an ASR system, but endpoint
detection in noisy environments remains an unsolved problem. We first propose a real-time endpoint
detection algorithm based on multiple features. Using multiple features improves the robustness of
endpoint detection across different noisy environments, and the experiments show that performance
improves as the number of features increases. Second, we propose a robust endpoint detection
algorithm for mobile environments. The algorithm is based on a noise model and uses a two-level
decision strategy. The noise model describes the spectral characteristics of the background noise.
Noise and speech are first discriminated by the model, and then decision logic based on a
four-state automaton smooths and revises the discrimination results. The two decision levels
interact with each other throughout the detection procedure. Extensive experiments were carried
out to evaluate the algorithm. It has low complexity and is now used in a newly launched product.
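The abstract does not specify the four states or the transition rules, so the following is only a minimal sketch of how a four-state automaton might smooth frame-level speech/noise decisions. The state names (`SILENCE`, `ONSET`, `SPEECH`, `HANGOVER`), the function name `smooth_endpoints`, and the thresholds `min_speech` and `max_gap` are all assumptions made for illustration. It covers only the second decision level and does not model the interaction between the two levels described above.

```python
def smooth_endpoints(frame_is_speech, min_speech=3, max_gap=5):
    """Turn noisy per-frame speech flags into (start, end) segments, end exclusive.

    Hypothetical four-state logic: a speech run shorter than min_speech frames
    is discarded as a noise burst, and a pause of at most max_gap frames inside
    an utterance is bridged.
    """
    SILENCE, ONSET, SPEECH, HANGOVER = range(4)
    state = SILENCE
    segments = []
    start = 0
    run = 0  # frames spent so far in the tentative state (ONSET or HANGOVER)
    for i, flag in enumerate(frame_is_speech):
        if state == SILENCE:
            if flag:
                start, run = i, 1
                state = SPEECH if min_speech <= 1 else ONSET
        elif state == ONSET:
            if flag:
                run += 1
                if run >= min_speech:
                    state = SPEECH  # enough consecutive speech frames: confirm
            else:
                state = SILENCE  # too short: treat as a noise burst
        elif state == SPEECH:
            if not flag:
                state, run = HANGOVER, 1
        else:  # HANGOVER
            if flag:
                state = SPEECH  # short pause inside the utterance
            else:
                run += 1
                if run > max_gap:
                    # the utterance ended where the non-speech run began
                    segments.append((start, i - run + 1))
                    state = SILENCE
    if state in (SPEECH, HANGOVER):
        # input ended while an utterance was still open
        trailing = run if state == HANGOVER else 0
        segments.append((start, len(frame_is_speech) - trailing))
    return segments
```

With `min_speech=3`, an isolated one-frame detection is dropped, while a one-frame dip inside a longer utterance does not split it. In a real-time setting the same transitions could be applied one frame at a time as frames arrive.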
To integrate an ASR system into general-purpose cellular phones, we carry out a feasibility study
on feature extraction based on the GSM vocoder. We use features derived from the Motorola i250 AMR
codec parameters for speech recognition, so that an ASR system on a general cellular phone can use
the communications chip for front-end processing. The experiments show that features based on AMR
codec parameters can also achieve satisfactory performance in noisy environments with moderate
SNR.
The research on environmental compensation aims to reduce the degradation caused by environmental
noise. Parallel Model Combination (PMC) is based on an environmental model. Here a new approach is
proposed to compensate the static parameters of the HMMs. The distributions of the static
observations of corrupted speech are approximated directly from the clean-speech models and a
noise model. Compared with traditional methods, which model the observations of corrupted speech
with a presumed distribution, the new approach avoids the error introduced by that presumption,
especially at low SNR. The experiments indicate that the new approach outperforms the traditional
ones in accuracy and noise robustness. Moreover, the approach has low complexity and can work
together with endpoint detection to perform real-time compensation.
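For background, PMC-style compensation for additive noise usually starts from the relation that corrupted speech in the log-spectral domain is y = log(exp(s) + exp(n)), where s and n are the clean-speech and noise log spectra. The sketch below shows the simple "log-add" approximation from that family, which applies the relation to the model means only. It is not the new approach proposed in the dissertation, which the abstract does not detail. The function name `log_add` and the plain-list representation of the means are assumptions.

```python
import math


def log_add(clean_mean, noise_mean):
    """Combine log-spectral means of clean speech and additive noise, bin by bin.

    Applies y = log(exp(s) + exp(n)) to the means only; variances are left
    untouched in this simplified sketch.
    """
    combined = []
    for s, n in zip(clean_mean, noise_mean):
        hi, lo = max(s, n), min(s, n)
        # log(e^s + e^n) computed stably as hi + log(1 + e^(lo - hi))
        combined.append(hi + math.log1p(math.exp(lo - hi)))
    return combined
```

When the noise is far below the speech in a given bin, the compensated mean stays essentially at the clean value; when the two are equal, it rises by log 2.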
Speaker adaptation techniques are widely used to improve the performance of speaker-independent
(SI) ASR systems using only a small amount of speaker-dependent (SD) data. An offline fast speaker
adaptation algorithm is proposed to improve the performance of SI ASR on embedded systems. We
assume that the baseline recognizer uses HMMs to model the speech production process and mixtures
of continuous-density Gaussians to model the HMM output distributions. A single-Gaussian HMM is
trained on a small amount of pre-designed speech data. Once the new model has been estimated, it
is merged into the old one. A series of experiments was carried out to evaluate both the SI and
the SD characteristics of the adapted model. The results show that the algorithm improves the
performance of the SI ASR system for both native and non-native speakers. After speaker
adaptation, the HMMs retain their speaker-independent character. Moreover, the HMMs can be adapted
to more than one speaker's voice. The adaptation procedure has low complexity and does not
increase memory usage. The computation is done offline and does not affect recognition speed. The
algorithm is therefore well suited to embedded implementation, and it is patented.
The speech corpora in the dissertation are provided by Motorola Labs China Research Center.
Key Words: Speech Recognition, Low Cost, Real Time, Endpoint Detection, Noise Robust,
Environmental Compensation, Speaker Adaptation
Chapter 1 Introduction
1.1. Development and Current Applications of Automatic Speech Recognition
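Speech is the most direct and effective way for people to communicate with one another.
Communicating with machines by voice not only improves working efficiency but also improves
safety. The goal of Automatic Speech Recognition (ASR) is to let machines, computers in
particular, "understand" spoken language, providing a good human-machine interface through which
people and computers can communicate smoothly. As a human-machine interface, speech is the most
natural input method compared with keyboard and mouse input.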
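Because speech recognition has such a wide range of applications, it has attracted attention since
the 1950s [1][2][3][4]. In 1952, Davis [5] built a speaker-dependent isolated-digit recognition
system that used the spectral formants of vowel segments as feature parameters. In 1956, Olson and
Belar [6] built a speaker-dependent system that could recognize 10 distinct syllables, using
spectral features derived from an analog filter bank. In 1959, Forgie [7] built a
speaker-independent vowel recognition system that could recognize 10 different vowels, likewise
using filter-bank spectral information.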
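The 1970s saw two breakthroughs in speech recognition research: Sakoe and Chiba [8] applied
dynamic programming (DP) to speech recognition, and Itakura [10] proposed using Linear Predictive
Coding (LPC) for speech recognition. Both have had a major influence on the development of speech
recognition research up to the present. Dynamic programming aligns two different utterances along
the time axis, a technique also known as Dynamic Time Warping (DTW). The Soviet scientist Vintsyuk
[9] had proposed using dynamic programming for time alignment as early as 1968, but the idea was
not taken up by Western scientists until the early 1980s. LPC was first applied successfully to
low-bit-rate speech coding. At Bell Labs, Itakura combined linear predictive analysis with dynamic
time warping to build a successful speech recognition system [11].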
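To make the time-alignment idea concrete, here is a minimal sketch of the textbook DTW recurrence. It assumes each utterance is already a sequence of equal-length feature vectors and uses a Euclidean local distance with the three basic step directions. These choices, and the function name `dtw_distance`, are illustrative assumptions and not the specific formulations of the cited papers, which add slope constraints and weighting.

```python
import math


def dtw_distance(a, b):
    """Accumulated distance of the best monotonic alignment between a and b.

    a and b are sequences of feature vectors (lists of floats of equal length).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame is repeated
                cost[i][j - 1],      # b advances, a frame is repeated
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

A template-matching recognizer of this kind would compute the distance from the input utterance to each stored reference template and pick the word whose template gives the smallest value. The cost is proportional to the product of the two sequence lengths.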
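Beginning in the 1980s, the introduction of the Hidden Markov Model (HMM) [12] shifted speech
recognition from template matching toward statistical modeling. Baker [13] and Jelinek [14] had
applied HMMs to speech recognition research as early as the late 1970s, but widespread use did not
begin until the mid-1980s [1][4]. From the late 1980s, another statistical modeling approach, the
Artificial Neural Network (ANN) [15][16], also gradually came into use in speech recognition
systems. At present, however, researchers tend to prefer HMMs because, although both are
statistical model-based methods, the HMM process lends itself more readily to description by
statistical parameters.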
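By the 1990s, speech recognition systems performed well under certain application conditions.
Sub-word units such as phonemes were adopted as the basic units of recognition, making
large-vocabulary recognition possible. Large-vocabulary, speaker-independent, continuous speech
recognition systems appeared one after another, for example the SPHINX series from Carnegie Mellon
University [17], IBM's ViaVoice, Microsoft's Whisper, Bell Labs' PLATO, MIT's SUMMIT and SRI's
DECIPHER. In recent years, with the discovery of new methods such as fast dynamic search
algorithms, search strategies and pruning strategies, and with further improvements to sub-word,
lexical and grammar models, the speed, accuracy and reliability of speech recognition systems have
also improved markedly. For a speaker-dependent isolated-word system with a 20,000-word
vocabulary, the word error rate can be below 0.1% [18], and a speaker-dependent continuous speech
recognition system with a 10,000-word vocabulary can reach an error rate of about 5% [19].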
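Research on Chinese speech recognition [20][21] began in the 1970s. After more than thirty years
of development, China's speech recognition research is essentially on a par with work abroad; in
Chinese speech recognition it has its own characteristics and strengths and has reached an
internationally advanced level. Domestic institutions working on speech recognition include
Tsinghua University, the University of Science and Technology of China, the Institute of
Automation and the Institute of Acoustics of the Chinese Academy of Sciences, Harbin Institute of
Technology, Shanghai Jiao Tong University and National Taiwan University. In the 1998 863 Program
evaluation, the Chinese continuous speech recognition system built by the group led by Professor
Wang Zuoying in the Department of Electronic Engineering at Tsinghua University achieved a
character recognition rate above 90%, representing the current state of the art in China.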
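Progress in speech recognition technology has created strong pressure to bring it into practical
use. In recent years multimedia technology has advanced rapidly, placing new demands on speech
processing technology as well.
In current applications, the main requirements on a speech recognition system include: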
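- Vocabulary size. The number of words the system can recognize depends on the intended
  application; a dictation system, for example, generally requires a very large vocabulary.
- Recognition accuracy. Accuracy usually depends on vocabulary size. For example, recognition of
  the ten digits can reach or approach 100% under laboratory conditions, whereas for
  large-vocabulary systems an error rate below 5% is hard to achieve. In general, a system is
  considered to perform well if its error rate is below 8% for a large vocabulary or below 5% for
  a small vocabulary.
- Real-time operation. Users usually expect a fast response, so the design must pay attention to
  system complexity and algorithmic complexity.
- Speaker adaptability. Different users have different pronunciation styles and characteristics,
  and a system that supports multiple users should be able to adapt to all of them.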
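Typical current systems include CMU's SPHINX-II, IBM's ViaVoice and Microsoft's Whisper. In a
large-vocabulary speaker-independent recognition experiment, SPHINX-II achieved a recognition rate
of about 97%; in a small-vocabulary speaker-independent continuous speech recognition experiment,
Bell Labs' PLATO system achieved a word recognition rate of 98.29%.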
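In everyday life, speech recognition is needed in personal mobile communication devices, handheld
computers, intelligent robots, technical support centers, automated transactions in the financial
sector, voice identification by criminal investigation agencies, and voice-controlled command in
military and other settings [22][26].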
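The main application areas of speech recognition technology today include:
1. Computer input
The computer keyboard is based on the 26 letters of the English alphabet, but there are special
circumstances in which a user cannot operate a computer by hand. In addition, the difficulty of
entering Chinese has long been the biggest obstacle to the spread of computers in China. Many
Chinese input methods have appeared, but they have tended not to spread widely because of problems
such as speed and difficulty of learning. As a friendly human-machine interface, speech input
directly through a microphone attached to the computer has great market potential. IBM's ViaVoice
dictation system was launched to meet exactly this market demand.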