Artificial Intelligence - Speech Recognition - Research on Low-Cost Speech Recognition Technology

Uploaded 2022-06-27 (PDF, 1.7 MB)

Abstract

Speech is the most efficient and direct way for people to communicate. Compared with other human-machine interfaces such as the keyboard and mouse, speech is the most convenient input method. Thanks to research progress dating from the 1950s to the present, speech recognition technology is now moving from the laboratory into real-world use. Researchers are working to integrate Automatic Speech Recognition (ASR) systems into mobile devices such as cellular phones, PDAs, and wireless car kits, so low-cost, noise-robust speech recognition has become an active research topic on which industry places great emphasis. Low-cost ASR shares its basic theory with general ASR, which means the technical problems of general ASR remain in low-cost ASR. In addition, low-cost ASR faces its own problems because of limited computational power and limited resources. This dissertation proposes a series of algorithms for low-cost ASR systems, each addressing one of the main sources of variability that affect ASR performance. These algorithms improve the performance of low-cost ASR systems while adding little computational load or resource demand to the overall system.

VOPER is a world-leading low-cost ASR system for embedded platforms that can be widely integrated into mobile phones, PDAs, and wireless car kits. This dissertation focuses on how to improve ASR performance without adding much cost to the overall system. Based on the sources of variability that affect ASR performance, we study the following problems: real-time endpoint detection in mobile environments, feature extraction based on AMR vocoder parameters, environmental compensation of acoustic models, and fast speaker adaptation.

The proposed algorithms are built on the VOPER architecture and aim to improve VOPER's performance. For example, the noise-model-based endpoint detection algorithm has been integrated into VOPER and is used in an industrial product. Feature extraction based on AMR vocoder parameters aims to remove VOPER's front-end module and thereby reduce the computational load of the overall system; we carried out a feasibility study on this topic with the goal of embedding VOPER in all kinds of cellular phones. The research on environmental compensation aims to improve VOPER's performance in noisy environments. The research on speaker adaptation aims to improve VOPER's performance for a specific speaker while maintaining its performance on other speakers; this speaker adaptation algorithm is now integrated into VOPER.

Endpoint detection can improve both the speed and the accuracy of an ASR system, but endpoint detection in noisy environments remains an unsolved problem. We first propose a real-time endpoint detection algorithm based on multiple features. Using multiple features improves the robustness of endpoint detection across different noisy environments, and experiments show that detection performance improves as the number of features increases. Second, we propose a robust endpoint detection algorithm for mobile environments. The algorithm is based on a noise model and uses a two-level decision-making strategy. The noise model describes the spectral characteristics of the background noise. The model first discriminates between noise and speech, and a decision logic based on a four-state automaton then smooths and revises the discrimination results. The two decision levels interact with each other throughout the detection procedure. Extensive experiments were carried out to evaluate the algorithm, which has low complexity and is now used in a newly launched product.
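The two-level strategy above can be illustrated with a small sketch. It is not the dissertation's algorithm: the frame features (log energy and zero-crossing rate), the diagonal-Gaussian noise model fitted on leading noise-only frames, the distance threshold, and the four states with their hold counts (`onset_hold`, `offset_hold`) are all hypothetical choices made for the example, and the feedback between the two levels described above is omitted.

```python
import math
from enum import Enum


def frame_features(frame):
    """Per-frame features (illustrative choice): log energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return [math.log(energy + 1e-10), crossings / (len(frame) - 1)]


def fit_noise_model(feature_frames, var_floor=1e-4):
    """Diagonal Gaussian over the features of frames assumed to contain only noise."""
    n, dim = len(feature_frames), len(feature_frames[0])
    mean = [sum(f[d] for f in feature_frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in feature_frames) / n, var_floor)
           for d in range(dim)]
    return mean, var


def is_speech_like(feat, model, threshold=3.0):
    """Level 1: flag a frame whose RMS normalized distance from the noise model is large."""
    mean, var = model
    dist = math.sqrt(sum((x - m) ** 2 / v for x, m, v in zip(feat, mean, var)) / len(feat))
    return dist > threshold


class State(Enum):
    SILENCE = 0
    MAYBE_SPEECH = 1
    SPEECH = 2
    MAYBE_SILENCE = 3


def detect_endpoints(flags, onset_hold=3, offset_hold=5):
    """Level 2: smooth per-frame flags with a four-state automaton.
    Returns a list of (start, end) frame indices, end exclusive."""
    state, segments = State.SILENCE, []
    start = end = None
    count = 0
    for t, speech in enumerate(flags):
        if state is State.SILENCE:
            if speech:
                state, start, count = State.MAYBE_SPEECH, t, 1
        elif state is State.MAYBE_SPEECH:
            if not speech:
                state = State.SILENCE  # too short: treat as a noise burst
            else:
                count += 1
                if count >= onset_hold:
                    state = State.SPEECH
        elif state is State.SPEECH:
            if not speech:
                state, end, count = State.MAYBE_SILENCE, t, 1
        else:  # State.MAYBE_SILENCE
            if speech:
                state = State.SPEECH  # short pause inside speech
            else:
                count += 1
                if count >= offset_hold:
                    segments.append((start, end))
                    state = State.SILENCE
    if state is State.SPEECH:
        segments.append((start, len(flags)))
    elif state is State.MAYBE_SILENCE:
        segments.append((start, end))
    return segments
```

Requiring several consecutive speech-like frames before entering `SPEECH`, and several non-speech frames before leaving it, is what lets the second level absorb isolated noise bursts and short pauses inside an utterance; the hold lengths trade detection delay against robustness.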

To integrate ASR systems into ordinary cellular phones, we carried out a feasibility study on feature extraction based on the GSM vocoder. We perform speech recognition with features derived from the AMR codec parameters of the Motorola i250, so that an ASR system on an ordinary cellular phone can use the phone's communication chip for front-end processing. Experiments show that features based on AMR codec parameters also achieve satisfactory performance in noisy environments with moderate SNR.
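One common way to turn codec parameters into recognition features is to convert the codec's linear-prediction (LP) coefficients into LP cepstral coefficients. The sketch below shows only that final step and is an assumption, not the feature set the dissertation reports: AMR transmits its LP filter as quantized line spectral pairs, so it assumes these have already been converted back to LP coefficients a_1..a_p. Codecs often write the inverse filter as A(z) = 1 + Σ a_k z^{-k}; with that convention, negate the coefficients before calling the (hypothetically named) `lpc_to_cepstrum`.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LP coefficients to LP cepstral coefficients c_1..c_{n_ceps}.

    Assumed convention: a[k-1] holds a_k for the all-pole model
    H(z) = G / (1 - sum_{k=1}^{p} a_k z^{-k}).  Recursion:
        c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},   1 <= n <= p
        c_n =       sum_{k=n-p}^{n-1} (k/n) c_k a_{n-k}, n > p
    c_0 (which depends on the gain G) is not computed.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] left unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


if __name__ == "__main__":
    # A single pole at z = 0.5 has the closed-form cepstrum c_n = 0.5**n / n.
    print(lpc_to_cepstrum([0.5], 4))
```

The recursion needs only multiply-adds over the p codec coefficients, which fits the goal stated above of reusing the phone's existing codec output instead of running a separate front end.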

Environmental compensation aims to reduce the performance degradation caused by environmental noise. Parallel Model Combination (PMC) is based on an environmental model. Here we propose a new approach to compensating the static parameters of HMMs, in which the distributions of the static observations of corrupted speech are approximated directly from the clean-speech models and a noise model. Traditional methods model the observations of corrupted speech with a presumed distribution; the new approach to static parameter compensation avoids the error introduced by that presumption, especially at low SNR. Experiments indicate that the new approach outperforms the traditional ones in accuracy and noise robustness. Moreover, it has low complexity and can work together with endpoint detection to perform real-time compensation.
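The corrupted-speech statistics that such compensation methods must approximate come from the log-spectral mismatch function y = log(e^x + e^n), where x is clean speech and n is additive noise in one log filter-bank channel. The sketch below is neither the dissertation's method nor PMC itself: it estimates the mean and variance of y by sampling Gaussian x and n with assumed parameters, and compares the mean with the simple log-add value log(e^{μx} + e^{μn}), which ignores the variances. Because log-sum-exp is convex, Jensen's inequality puts the true mean of y above the log-add value, and the gap is largest when speech and noise levels are comparable.

```python
import math
import random


def log_add(x, n):
    """Mismatch function y = log(exp(x) + exp(n)), computed without overflow."""
    m = max(x, n)
    return m + math.log(math.exp(x - m) + math.exp(n - m))


def corrupted_stats(mu_x, var_x, mu_n, var_n, n_samples=20000, seed=0):
    """Monte Carlo mean and variance of y for Gaussian clean speech x and noise n."""
    rng = random.Random(seed)
    ys = [
        log_add(rng.gauss(mu_x, math.sqrt(var_x)), rng.gauss(mu_n, math.sqrt(var_n)))
        for _ in range(n_samples)
    ]
    mean = sum(ys) / n_samples
    var = sum((y - mean) ** 2 for y in ys) / (n_samples - 1)
    return mean, var


if __name__ == "__main__":
    for mu_x in (0.0, 2.0, 6.0):  # clean-speech log level; noise mean fixed at 0
        mean, var = corrupted_stats(mu_x, 1.0, 0.0, 1.0)
        print(f"mu_x={mu_x}: sampled mean={mean:.3f} var={var:.3f} "
              f"log-add={log_add(mu_x, 0.0):.3f}")
```

Sampling is far too slow for an embedded recognizer; it only makes visible the error that a fixed distributional presumption or a mean-only approximation can introduce.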

Speaker adaptation techniques are widely used to improve the performance of speaker-independent (SI) ASR systems using only a small amount of speaker-dependent (SD) data. We propose an offline fast speaker adaptation algorithm to improve the performance of SI ASR on embedded systems. We assume that the baseline recognizer uses HMMs to model the speech production process and continuous-density Gaussian mixtures to model the HMM output distributions. A single-Gaussian HMM is trained on a small amount of pre-designed speech data, and the newly estimated model is then merged into the original one. A series of experiments were carried out to evaluate both the SI and the SD characteristics of the adapted model. The algorithm is evaluated by a series …
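The abstract says only that the newly estimated single-Gaussian model is merged into the original one; it does not give the merge rule. One simple possibility, shown here purely as a hypothetical sketch for a single HMM state with diagonal covariances, is to append the speaker-specific Gaussian as an extra mixture component with weight `alpha` and scale the existing weights by 1 - alpha, so the original SI components are kept. The names `Gaussian`, `StateMixture`, `estimate_gaussian`, `merge_speaker_gaussian`, and the value of `alpha` are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    mean: List[float]
    var: List[float]  # diagonal covariance


@dataclass
class StateMixture:
    weights: List[float]
    components: List[Gaussian]


def estimate_gaussian(frames, var_floor=1e-3):
    """ML estimate of a diagonal Gaussian from the adaptation frames aligned to one state."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
           for d in range(dim)]
    return Gaussian(mean, var)


def merge_speaker_gaussian(state, new, alpha=0.3):
    """Hypothetical merge: keep the SI components and append the SD Gaussian with weight alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    weights = [w * (1.0 - alpha) for w in state.weights] + [alpha]
    return StateMixture(weights, state.components + [new])
```

Keeping the SI components is one way to pursue the stated goal of helping the target speaker without hurting others; the variance floor guards against the degenerate variances that a handful of adaptation frames can produce.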

Chapter 1 Introduction

1.1 Development and Current Applications of Automatic Speech Recognition

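Speech is the most direct and effective way for people to communicate with each other. Communicating with machines by voice can improve not only work efficiency but also safety. The goal of automatic speech recognition (ASR) is to enable machines, especially computers, to "understand" spoken human language and thus provide a good human-machine interface through which people and computers can communicate smoothly. As a human-machine interface, speech is the most natural input method compared with the keyboard and mouse.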

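Because speech recognition has such broad applications, the technology has attracted attention since the 1950s [1][2][3][4]. In 1952, Davis [5] built a speaker-dependent isolated-digit recognizer that used the formant frequencies of vowel segments as features. In 1956, Olson and Belar [6] built a speaker-dependent system that could recognize 10 different syllables, using spectral features from an analog filter bank. In 1959, Forgie [7] built a speaker-independent vowel recognizer that could recognize 10 different vowels, also using filter-bank spectral information.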

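The 1970s brought two breakthroughs in speech recognition research: Sakoe and Chiba [8] applied dynamic programming (DP) to speech recognition, and Itakura [10] proposed using linear predictive coding (LPC) for speech recognition. Both advances have strongly influenced the subsequent development of the field. Dynamic programming aligns two utterances along the time axis; this alignment is also called dynamic time warping (DTW). As early as 1968, the Soviet scientist Vintsyuk [9] proposed using dynamic programming for time alignment, but his work did not become known to Western researchers until the early 1980s. LPC was first applied successfully to low-bit-rate speech coding; at Bell Laboratories, Itakura combined linear predictive analysis with dynamic time warping to build a speech recognition system [11].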
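The time alignment described above can be illustrated with a textbook DTW recursion. This is a minimal sketch, not Sakoe and Chiba's exact formulation: it uses Euclidean frame distances and the three basic local steps, with no slope constraints, step weighting, or path-length normalization, and the function names are illustrative.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences of feature vectors.
    Local cost: Euclidean distance; steps (i-1, j), (i, j-1), (i-1, j-1)."""
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def recognize(utterance, templates):
    """Isolated-word template matching: return the label of the closest template."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))
```

Because a template frame may be matched to several input frames (and vice versa), a slowly spoken word can still align with zero cost to a faster template of the same word, which is exactly the time-axis variability that DTW was introduced to absorb.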

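From the 1980s onward, the introduction of the hidden Markov model (HMM) [12] shifted speech recognition from template matching to statistical modeling. Baker [13] and Jelinek [14] had applied HMMs to speech recognition as early as the late 1970s, but HMMs did not come into wide use until the mid-1980s [1][4]. From the late 1980s, another statistical approach, the artificial neural network (ANN) [15][16], was also gradually applied to speech recognition. Researchers currently tend to prefer HMMs, however: although both are statistical methods, the HMM process lends itself better to description by statistical parameters.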
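The statistical view that HMMs brought can be made concrete with the forward algorithm, which computes the likelihood P(O | λ) of an observation sequence O under a model λ; an isolated-word recognizer can then pick the word model with the highest likelihood. The sketch below assumes a discrete-observation HMM with toy parameters (the continuous-density models discussed in this dissertation use Gaussian-mixture output densities instead) and rescales at every frame to avoid numerical underflow.

```python
import math


def forward_log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm for a discrete HMM.
    pi[i]: initial probability, A[i][j]: transition probability,
    B[j][k]: probability of symbol k in state j.  Returns log P(obs | model)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # sequence impossible under this model
        alpha = [a / scale for a in alpha]  # normalize; the scales multiply to P(obs)
        log_lik += math.log(scale)
    return log_lik
```

Summing the logs of the per-frame scale factors gives the same result as multiplying unscaled forward variables, but stays within floating-point range for long utterances.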

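By the 1990s, speech recognition systems performed well under certain application conditions. Using subword units such as phonemes as the basic recognition units made large-vocabulary recognition possible, and large-vocabulary, speaker-independent, continuous speech recognition systems appeared one after another, for example Carnegie Mellon University's SPHINX series [17], IBM's ViaVoice, Microsoft's Whisper, Bell Labs' PLATO, MIT's SUMMIT, and SRI's DECIPHER. In recent years, new methods such as fast dynamic search algorithms and improved search and pruning strategies, together with further improvements in subword, lexical, and grammar models, have markedly increased recognition speed, accuracy, and reliability. A speaker-dependent isolated-word recognition system with a 20,000-word vocabulary can achieve a word error rate below 0.1% [18], and a speaker-dependent continuous speech recognition system with a 10,000-word vocabulary can reach an error rate of about 5% [19].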
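The pruning strategies mentioned above keep decoding fast by discarding unlikely partial hypotheses early. A minimal sketch of beam pruning inside a log-domain Viterbi search over one discrete HMM is shown below; the beam width is an assumed parameter and the helper names are illustrative. Large-vocabulary decoders apply the same idea across much larger search networks, and a narrow beam can discard the path that would eventually have won, so beam width trades accuracy for speed.

```python
import math


def _log(p):
    """Log of a probability, with log(0) = -inf."""
    return math.log(p) if p > 0 else -math.inf


def _prune(scores, beam):
    """Drop impossible states and states scoring more than `beam` below the best."""
    finite = {i: s for i, s in scores.items() if s > -math.inf}
    if not finite:
        return {}
    best = max(finite.values())
    return {i: s for i, s in finite.items() if s >= best - beam}


def viterbi_beam(pi, A, B, obs, beam=math.inf):
    """Log-domain Viterbi search over a discrete HMM with beam pruning.
    Returns (best log score, best state path), or (-inf, []) if no path survives."""
    n = len(pi)
    scores = _prune({i: _log(pi[i]) + _log(B[i][obs[0]]) for i in range(n)}, beam)
    back = []  # back[t - 1][j]: best predecessor of state j at frame t
    for t in range(1, len(obs)):
        new_scores, pointers = {}, {}
        for j in range(n):
            best_s, best_i = -math.inf, None
            for i, s in scores.items():  # only states that survived pruning
                cand = s + _log(A[i][j])
                if cand > best_s:
                    best_s, best_i = cand, i
            if best_i is not None:
                new_scores[j] = best_s + _log(B[j][obs[t]])
                pointers[j] = best_i
        back.append(pointers)
        scores = _prune(new_scores, beam)
    if not scores:
        return -math.inf, []
    last = max(scores, key=scores.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return scores[last], path[::-1]
```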

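Chinese speech recognition research [20][21] began in the 1970s. After more than thirty years of development, speech recognition research in China has largely kept pace with work abroad; Mandarin speech recognition has its own characteristics and strengths and has reached an internationally advanced level. Institutions in China currently working on speech recognition include Tsinghua University, the University of Science and Technology of China, the Institute of Automation and the Institute of Acoustics of the Chinese Academy of Sciences, Harbin Institute of Technology, Shanghai Jiao Tong University, and National Taiwan University. In the 1998 evaluation under the national 863 Program, the Mandarin continuous speech recognition system developed by Professor Wang Zuoying's group in the Department of Electronic Engineering at Tsinghua University achieved a character recognition rate above 90%, representing the state of the art in China.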

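Progress in speech recognition has created strong pressure to bring the technology into practical use. At the same time, rapid advances in multimedia technology in recent years have placed new demands on speech processing.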

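Current applications place the following main requirements on speech recognition systems:

- Vocabulary size: the number of words the system can recognize. The vocabulary size depends on the application; dictation systems, for example, generally require a very large vocabulary.

- Recognition accuracy: accuracy usually depends on vocabulary size. Recognition of the ten digits can reach or approach 100% in a laboratory environment, whereas an error rate below 5% is hard to achieve for large-vocabulary systems. In general, a large-vocabulary system with an error rate below 8%, or a small-vocabulary system with an error rate below 5%, is considered to perform well.

- Real-time performance: users often expect a fast response, so the overall system complexity and the complexity of the algorithms must be considered during design.

- Speaker adaptability: different users have different pronunciation habits and characteristics, and a system that supports multiple users should be able to adapt to all of them.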
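Error rates like those in this list are commonly measured as word error rate (WER): the recognizer's output is aligned with a reference transcription by minimum edit distance, and the substitutions S, deletions D, and insertions I are counted against the N reference words, WER = (S + D + I) / N. The sketch below is a plain illustration with a hypothetical function name; for Mandarin, character error rate is often reported instead, which the same function computes if the strings contain characters separated by spaces.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via Levenshtein alignment over whitespace-separated tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j]: minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[len(ref)][len(hyp)] / len(ref)
```

Because insertions are counted, WER can exceed 100% when the recognizer outputs many spurious words.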

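Typical current systems include CMU's SPHINX-II, IBM's ViaVoice, and Microsoft's Whisper. In large-vocabulary speaker-independent recognition experiments, SPHINX-II achieves a recognition rate of about 97%; in small-vocabulary speaker-independent continuous speech recognition experiments, Bell Labs' PLATO system achieves a word recognition rate of 98.29%.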

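In everyday life, speech recognition is needed in personal mobile communication devices, handheld computers, intelligent robots, technical support centers, automated transactions in the financial sector, voice identification by criminal investigation agencies, and voice-controlled command in military and other settings [22][26].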

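The main current application areas of speech recognition include:

1. Computer input

Computer keyboards are based on the 26 letters of the English alphabet, but in some circumstances users cannot operate a computer with their fingers. In addition, the difficulty of entering Chinese text has long been the biggest obstacle to the spread of computers in China. Although many Chinese input methods have appeared, they are often hard to popularize because they are slow or difficult to learn. Speech, as a friendly human-machine interface, can be entered directly through a microphone attached to the computer, and this has great market potential; IBM's ViaVoice dictation system was developed to meet exactly this demand.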
