Towards End-to-End Speech Recognition with Deep Multipath Convolutional Neural Networks
Wei Zhang^{1,3}, Minghao Zhai^{1,3}, Zilong Huang^{1,3}, Chen Liu^{1,3}, Wei Li^{2}, and Yi Cao^{1,3(✉)}

1 School of Mechanical Engineering, Jiangnan University, Wuxi 214122, Jiangsu, China
caoyi@jiangnan.edu.cn
2 Suzhou Vocational Institute of Industrial Technology, Suzhou 215104, Jiangsu, China
3 Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Wuxi 214122, Jiangsu, China
Abstract. Deep learning approaches have been widely applied to Automatic Speech Recognition (ASR), where they have achieved a high level of accuracy; in particular, the Convolutional Neural Network (CNN) has recently been investigated for ASR. However, because a conventional CNN increases the network's depth along a single branch, it may not be wide enough to capture adequate features from human speech signals. We therefore propose a deep and wide CNN architecture referred to as the Multipath Convolutional Neural Network (MCNN). MCNN-CTC combines three parallel paths with the Connectionist Temporal Classification (CTC) objective function, yielding an end-to-end system that can fully exploit the spectral and temporal structures of speech signals simultaneously. Experimental results show that the proposed MCNN-CTC structure reduces the error rate of the end-to-end acoustic model. In the absence of a Language Model (LM), our MCNN-CTC acoustic model achieves a relative error-rate reduction of 1.10%–12.08% compared with traditional HMM-based or DCNN-CTC-based models, with strong generalization performance.
Keywords: Automatic Speech Recognition (ASR) · Acoustic Model (AM) · MCNN-CTC · Connectionist Temporal Classification (CTC)
1 Introduction
Automatic Speech Recognition (ASR) is an automatic method designed to translate human speech content into textual form [1]. Deep learning has been successfully applied in ASR to increase accuracy [2–4]. More recently, CNNs have been successful in acoustic modeling [5, 6]: applied to ASR in combination with HMMs [5], in a way analogous to regular Deep Neural Networks (DNNs) [7, 8], they lead to a hybrid system. DNN-HMM uses a discriminant
© Springer Nature Switzerland AG 2019
H. Yu et al. (Eds.): ICIRA 2019, LNAI 11745, pp. 1–10, 2019.
https://doi.org/10.1007/978-3-030-27529-7_29
model to replace the GMM-HMM generative model, taking advantage of the DNN's powerful fitting ability to model the posterior probability of each frame. The HMM still handles temporal modelling and decoding, whereas the neural network generates the posterior probability of the corresponding state [4].
Many problems arise from this hybrid system: its modules are trained separately, each with a different criterion, which may not be optimal for the final task. Consequently, additional hyperparameter tuning is required throughout all training stages, which is not only time-consuming but also highly laborious [9]. In contrast, end-to-end models have recently been proposed because of the simplicity of their modeling process, and their recognition accuracy is gradually approaching that of hybrid systems [10–12]. CTC is an objective function introduced by Graves to simplify this process [13, 14]: it infers speech-label alignments automatically, leading to an end-to-end system. This approach has generated promising results, as can be seen in Deep Speech [15, 16] and EESEN [10].
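To make the alignment-free property concrete, the simplest way to read out a CTC-trained network is greedy (best-path) decoding: take the most probable label at each frame, collapse consecutive repeats, and drop the blank symbol. The sketch below is illustrative, not the systems' actual implementation; the label inventory and blank index are assumptions.

```python
# Greedy (best-path) CTC decoding sketch: per-frame argmax, then
# collapse runs of identical labels and remove the blank symbol.

BLANK = 0  # index reserved for the CTC blank label (an assumption)

def ctc_greedy_decode(frame_probs):
    """frame_probs: T rows of per-label probabilities (T x num_labels)."""
    # Step 1: pick the argmax label for every frame (the "best path").
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # Step 2: collapse repeated labels, then drop blanks.
    decoded, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded

# Toy posteriors over labels {0: blank, 1: 'a', 2: 'b'} for six frames.
probs = [
    [0.1, 0.8, 0.1],  # 'a'
    [0.1, 0.8, 0.1],  # 'a' (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank separates the two 'a's
    [0.1, 0.7, 0.2],  # 'a'
    [0.1, 0.2, 0.7],  # 'b'
    [0.8, 0.1, 0.1],  # blank
]
print(ctc_greedy_decode(probs))  # [1, 1, 2], i.e. "aab"
```

Note how the blank between frames 2 and 4 is what allows the decoded repeated label "aa" to survive the collapsing step.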
We propose the MCNN model and construct the MCNN-CTC acoustic model in combination with the CTC objective function, which obtains significant recognition results. Based on the CTC loss function, this paper studies speech recognition on small and medium datasets in detail. The merits of MCNN-CTC include: (a) the acoustic model can extract more useful features along both the time dimension and the frequency axis; (b) MCNN has a wider network structure, which can extract sufficient speech features and has stronger nonlinear capability; (c) thanks to the CTC loss, MCNN-CTC can be trained in an end-to-end manner [17].
The rest of this paper is organized as follows. Section 2 describes the network architecture of MCNN-CTC. A concise introduction to the CTC objective function and decoding algorithm is given in Sect. 3. We present the experimental results in Sect. 4 and conclude with future work in Sect. 5.
2 Multipath Convolutional Neural Networks
As can be seen clearly from Fig. 1, MCNN is an augmentation of the CNN's width: it can extract additional detailed features from speech in terms of width, as compared to the basic extraction of high-dimensional speech features in terms of depth. Therefore, MCNN is able to increase recognition performance.
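The multipath idea can be sketched as several parallel convolution paths with different receptive-field widths running over the same input, whose outputs are concatenated. The kernel sizes, path count, and random filter weights below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of multipath feature extraction: three parallel 1-D filter
# paths with different kernel widths, outputs concatenated along a
# new path axis. Kernel sizes (3, 5, 7) are assumed for illustration.
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1-D filtering (cross-correlation) of x with kernel."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

def multipath_features(x, kernel_sizes=(3, 5, 7)):
    """One filter path per kernel size; crop to a common length; stack."""
    rng = np.random.default_rng(0)  # fixed seed: toy random filter weights
    paths = [conv1d(x, rng.standard_normal(k)) for k in kernel_sizes]
    shortest = min(len(p) for p in paths)           # align path lengths
    return np.stack([p[:shortest] for p in paths])  # (num_paths, shortest)

x = np.linspace(0.0, 1.0, 20)  # stand-in for one spectral feature row
feats = multipath_features(x)
print(feats.shape)  # (3, 14): the widest kernel sets the common length
```

Each path sees the signal at a different temporal scale, which is the intuition behind widening the network rather than only deepening it.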
The MCNN's structure is shown in Fig. 1. The full structure of MCNN comprises three sub-networks, which extract speech features and concatenate them.
The calculation formulas are shown in Eqs. (1)–(3):

h^(l) = σ(W^(l) ∗ h^(l−1) + b^(l))  (1)

In formula (1), h^(l−1) and h^(l) represent two adjacent feature layers, ∗ represents the convolution operation, and W^(l) and b^(l) represent the weight and bias matrices obtained from network training, respectively; W^(l) is convolved with h^(l−1), and σ(·) represents the activation function. In formula (2), t_nl^out represents the output value of the