Towards End-to-End Speech Recognition with Deep Multipath Convolutional Neural Networks
Wei Zhang^{1,3}, Minghao Zhai^{1,3}, Zilong Huang^{1,3}, Chen Liu^{1,3}, Wei Li^{2}, and Yi Cao^{1,3(✉)}

1 School of Mechanical Engineering, Jiangnan University, Wuxi 214122, Jiangsu, China
caoyi@jiangnan.edu.cn
2 Suzhou Vocational Institute of Industrial Technology, Suzhou 215104, Jiangsu, China
3 Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Wuxi 214122, Jiangsu, China
Abstract. Deep learning approaches have been widely applied to Automatic Speech Recognition (ASR), where they have achieved a high level of accuracy; in particular, the Convolutional Neural Network (CNN) has recently been investigated for ASR. However, because a conventional CNN increases the network's depth along a single branch, it may not be wide enough to capture adequate features from human speech signals. We therefore propose a deep and wide CNN architecture referred to as the Multipath Convolutional Neural Network (MCNN). MCNN-CTC combines three parallel paths with the Connectionist Temporal Classification (CTC) objective function, yielding an end-to-end system that can fully exploit the spectral and temporal structures of speech signals simultaneously. Experimental results show that the proposed MCNN-CTC structure reduces the error rate of the end-to-end acoustic model. In the absence of a Language Model (LM), our MCNN-CTC acoustic model achieves a relative error-rate reduction of 1.10%–12.08% compared with traditional HMM-based or DCNN-CTC-based models, with strong generalization performance.
Keywords: Automatic Speech Recognition (ASR) · Acoustic Model (AM) · MCNN-CTC · Connectionist Temporal Classification (CTC)
1 Introduction
Automatic Speech Recognition (ASR) is an automatic method designed to translate human speech content into textual form [1]. Deep learning has been successfully applied in ASR to increase accuracy [2–4]. More recently, CNNs have been successful in acoustic modeling [5, 6]: applied to ASR in combination with HMMs [5], in a way analogous to regular Deep Neural Networks (DNNs) [7, 8], they lead to a hybrid system. DNN-HMM uses a discriminant
© Springer Nature Switzerland AG 2019
H. Yu et al. (Eds.): ICIRA 2019, LNAI 11745, pp. 1–10, 2019.
https://doi.org/10.1007/978-3-030-27529-7_29
model to replace the GMM-HMM generative model, taking advantage of the DNN's powerful fitting ability to model the posterior probability of each frame. The HMM still handles temporal modelling and decoding, whereas the neural network generates the posterior probability of the corresponding state [4].
Many problems arise from this hybrid system: its modules are trained separately, each with a different criterion, which may not be optimal for the final task. Consequently, additional hyperparameter tuning is required throughout all training stages, which is not only time-consuming but also highly laborious [9]. In contrast, end-to-end models have recently been proposed because of the simplicity of their modeling process, and their recognition accuracy is gradually approaching that of hybrid systems [10–12]. CTC is an objective function introduced by Graves to simplify this process [13, 14]: it infers speech-label alignments automatically, leading to an end-to-end system. This approach has generated promising results, as can be seen in Deep Speech [15, 16] and EESEN [10].
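To make the alignment-free property concrete, the simplest way to read out a CTC-trained network is greedy (best-path) decoding: take the most probable label at each frame, collapse consecutive repeats, and drop the blank symbol. The sketch below is illustrative, not the systems' actual implementation; the label inventory and blank index are assumptions.

```python
# Greedy (best-path) CTC decoding sketch: per-frame argmax, then
# collapse runs of identical labels and remove the blank symbol.

BLANK = 0  # index reserved for the CTC blank label (an assumption)

def ctc_greedy_decode(frame_probs):
    """frame_probs: T rows of per-label probabilities (T x num_labels)."""
    # Step 1: pick the argmax label for every frame (the "best path").
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # Step 2: collapse repeated labels, then drop blanks.
    decoded, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded

# Toy posteriors over labels {0: blank, 1: 'a', 2: 'b'} for six frames.
probs = [
    [0.1, 0.8, 0.1],  # 'a'
    [0.1, 0.8, 0.1],  # 'a' (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank separates the two 'a's
    [0.1, 0.7, 0.2],  # 'a'
    [0.1, 0.2, 0.7],  # 'b'
    [0.8, 0.1, 0.1],  # blank
]
print(ctc_greedy_decode(probs))  # [1, 1, 2], i.e. "aab"
```

Note how the blank between frames 2 and 4 is what allows the decoded repeated label "aa" to survive the collapsing step.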
We propose the MCNN model and construct the MCNN-CTC acoustic model in combination with the CTC objective function, which obtains significant recognition results. Based on the CTC loss function, this paper studies speech recognition on small and medium datasets in detail. The merits of MCNN-CTC include: (a) the acoustic model can extract more useful features along both the time dimension and the frequency axis; (b) MCNN has a wider network structure, which can extract sufficient speech features and has stronger nonlinear capability; (c) thanks to the CTC loss, MCNN-CTC can be trained in an end-to-end manner [17].
The rest of this paper is organized as follows. Section 2 describes the network architecture of MCNN-CTC. A concise introduction to the CTC objective function and decoding algorithm is given in Sect. 3. We present the experimental results in Sect. 4 and conclude with future work in Sect. 5.
2 Multipath Convolutional Neural Networks
As can be seen clearly from Fig. 1, MCNN is an augmentation of the CNN's width: it can extract additional detailed features from speech in terms of width, as compared to the basic extraction of high-dimensional speech features in terms of depth. Therefore, MCNN is able to increase recognition performance.
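The multipath idea can be sketched as several parallel convolution paths with different receptive-field widths running over the same input, whose outputs are concatenated. The kernel sizes, path count, and random filter weights below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of multipath feature extraction: three parallel 1-D filter
# paths with different kernel widths, outputs concatenated along a
# new path axis. Kernel sizes (3, 5, 7) are assumed for illustration.
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1-D filtering (cross-correlation) of x with kernel."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

def multipath_features(x, kernel_sizes=(3, 5, 7)):
    """One filter path per kernel size; crop to a common length; stack."""
    rng = np.random.default_rng(0)  # fixed seed: toy random filter weights
    paths = [conv1d(x, rng.standard_normal(k)) for k in kernel_sizes]
    shortest = min(len(p) for p in paths)           # align path lengths
    return np.stack([p[:shortest] for p in paths])  # (num_paths, shortest)

x = np.linspace(0.0, 1.0, 20)  # stand-in for one spectral feature row
feats = multipath_features(x)
print(feats.shape)  # (3, 14): the widest kernel sets the common length
```

Each path sees the signal at a different temporal scale, which is the intuition behind widening the network rather than only deepening it.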
The MCNN's structure is shown in Fig. 1. The full structure of MCNN comprises three sub-networks, which extract speech features and concatenate them.
The calculation formulas are shown in Eqs. (1)–(3):

h^(l) = σ(W^(l) ∗ h^(l−1) + b^(l))  (1)

In formula (1), h^(l−1) and h^(l) represent two adjacent feature layers, ∗ represents the convolution operation, and W^(l) and b^(l) represent the weight and bias matrices obtained from network training, respectively; W^(l) is convolved with h^(l−1), and σ(·) represents the activation function. In formula (2), t_nl^out represents the output value of the