SPEECH RECOGNITION FOR AIR TRAFFIC CONTROL VIA FEATURE LEARNING AND
END-TO-END TRAINING
Peng Fan¹, Dongyue Guo¹, Yi Lin¹,², Bo Yang¹,², Jianwei Zhang¹,²
¹ National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China
² College of Computer Science, Sichuan University, Chengdu 610065, China
ABSTRACT
In this work, we propose a new automatic speech recognition (ASR)
system for air traffic control (ATC) based on feature learning and an
end-to-end training procedure. The proposed model integrates a
feature learning block, a recurrent neural network (RNN), and the
connectionist temporal classification loss to build an end-to-end
ASR model. To cope with the complex environments of ATC speech, a
learning block is designed to extract informative features from raw
waveforms for acoustic modeling, replacing handcrafted features.
Both SincNet and 1D convolution blocks are applied to process the
raw waveforms, and their outputs are concatenated and fed to the
RNN layers for temporal modeling. Because it learns representations
directly from raw waveforms, the proposed model can be optimized in
a complete end-to-end manner, i.e., from waveform to text. Finally,
the multilingual nature of ATC speech is handled by constructing a
combined vocabulary of Chinese characters and English letters. The
proposed approach is validated on a multilingual real-world corpus
(ATCSpeech), and the experimental results demonstrate that it
outperforms other baselines, achieving a 6.9% character error rate.
Index Terms— Automatic speech recognition, feature learning,
air traffic control, multilingual, end-to-end training
1. INTRODUCTION
Automatic speech recognition (ASR) translates speech into
computer-readable text [1]. In air traffic control (ATC), radio
speech is the primary means of communication between air traffic
controllers (ATCos) and pilots. ASR is introduced into the ATC
system to transcribe the speech of ATCos and pilots, which can
reduce the workload on ATCos and help ensure flight safety [2].
Compared to common ASR research, the ATC domain poses many new
challenges and difficulties. In general, ATCos and pilots speak
English. In China, however, ATCos and pilots more frequently
communicate in Chinese on domestic flights. That is to say, speech
on the same frequency is usually in both Chinese and English, i.e.,
multilingual ASR is required for the ATC domain [3]. Our previous
work introduced ASR into the ATC safety monitoring framework and
also converted ATCo and pilot speech into instructions for
controlling intent inference [4].
Recently, end-to-end speech recognition systems have provided
higher performance than traditional methods for common ASR tasks
[5]. However, current end-to-end ASR systems usually use
mel-frequency cepstral coefficients (MFCCs) or filter-bank (FBANK)
features to process the raw waveform speech instead of directly
inputting the raw waveform.
Yi Lin is the corresponding author. This work was supported by the National
Natural Science Foundation of China (No.62001315).
SincNet, a deep learning-based feature extraction method, was
proposed for the ASR and speaker recognition tasks, and the
experimental results showed that neural networks based on SincNet
achieved better performance on both tasks. SincNet can learn more
informative and discriminative features from raw waveforms [6], [7].
Other deep learning-based models, such as wav2vec, have also been
proposed to extract speech features from raw waveforms. The wav2vec
model explores unsupervised pre-training for speech recognition by
learning representations from raw audio through several 1D
convolution layers. The wav2vec model is trained on large amounts of
unlabeled speech, and the resulting representations serve as the
input of the acoustic model for the ASR task [8].
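As a rough illustration of the SincNet idea, each first-layer filter can be written as a band-pass kernel that is the difference of two low-pass sinc filters, so that only a low and a high cutoff frequency need to be learned per filter. The sketch below builds one such kernel in plain Python. The 16 kHz sampling rate, the 300 to 3400 Hz band, the kernel size, and the function names are assumptions made for this example; they are not taken from this paper.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=101, sample_rate=16000):
    """Band-pass FIR kernel built as the difference of two low-pass sinc filters.

    In a SincNet-style layer the two cutoffs would be trainable parameters;
    here they are fixed numbers (illustrative values only).
    """
    assert kernel_size % 2 == 1, "use an odd length so the kernel is symmetric"
    f1 = f_low_hz / sample_rate   # normalized cutoffs, in cycles per sample
    f2 = f_high_hz / sample_rate
    half = kernel_size // 2
    kernel = []
    for i in range(kernel_size):
        n = i - half
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # A Hamming window smooths the truncation of the ideal response.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel


def gain(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at a single frequency."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * k) for k, h in enumerate(kernel))
    im = sum(h * math.sin(w * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)


kernel = sinc_bandpass(300, 3400)
print(round(gain(kernel, 1500), 2), round(gain(kernel, 6000), 2))
```

A frequency inside the band passes with a gain close to one, while a frequency well outside it is strongly attenuated. This is what lets a learned pair of cutoffs act as an interpretable filter.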
In previous studies, well-designed MFCC or FBANK features were
applied to perform preliminary processing on the raw waveform. In
this procedure, the raw speech is divided into frames with a 25 ms
frame length and a 10 ms shift, and a series of signal processing
transformations is applied to convert the 1D waveform into a 2D
feature map. After the raw waveform speech is processed by MFCC or
FBANK, the extracted feature map is fed into the neural network for
acoustic modeling. This method has achieved state-of-the-art results
in many ASR tasks. However, because the design of FBANK and MFCC is
based on the human ear's response to audio, these features may lose
some of the information in the raw waveform. Considering the complex
ATC environment, handcrafted feature engineering may not be an
optimal option for ASR tasks. Therefore, learning mechanisms such as
SincNet and wav2vec were proposed to learn informative and
discriminative features from raw waveforms, and they achieved
performance improvements for common ASR applications [7], [8].
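The framing step described above can be made concrete with a small sketch. It splits a 1D waveform into overlapping frames using the 25 ms length and 10 ms shift given in the text. The 16 kHz sampling rate and the function name `frame_signal` are assumptions for illustration; the subsequent MFCC or FBANK transformations are not shown.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1D list of samples into overlapping fixed-length frames.

    sample_rate is an assumed value; frame_ms and shift_ms follow the text.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    if len(samples) < frame_len:
        return []
    num_frames = 1 + (len(samples) - frame_len) // shift
    return [samples[i * shift: i * shift + frame_len] for i in range(num_frames)]


one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # prints "98 400"
```

Under these assumptions, one second of audio becomes a 98 x 400 array of raw samples, and each frame is then reduced to a short feature vector. That reduction is the point at which waveform information can be discarded.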
In this work, an end-to-end neural network is designed to
achieve the ASR task in the ATC domain, in which a novel feature
learning block is proposed to extract high-level speech
representations from raw waveforms. Both a SincNet block and a 1D
convolution block are designed to learn features from raw waveforms.
The backbone network is constructed by cascading convolutional
neural network (CNN) and recurrent neural network (RNN) layers and
is jointly optimized with the feature learning block by the
connectionist temporal classification (CTC) loss function. Most
importantly, the proposed model leverages the feature learning block
to implement end-to-end training, predicting the text sequence from
the raw waveform without any pretraining.
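To illustrate how a CTC-trained model with a combined Chinese and English vocabulary can turn frame-level outputs into text, the following sketch applies the standard CTC greedy rule: take the best symbol per frame, merge consecutive repeats, then drop blanks. The toy vocabulary, the one-hot scores, and the function names are hypothetical. This section does not specify the paper's actual vocabulary or decoding procedure.

```python
BLANK = "<blank>"
# Toy combined vocabulary (an assumption): blank, space, English letters,
# and a handful of Chinese characters.
VOCAB = [BLANK, " "] + list("abcdefghijklmnopqrstuvwxyz") + list("国航上升保持")


def ctc_greedy_decode(frame_scores, vocab=VOCAB):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    out = []
    prev = None
    for idx in best:
        if idx != prev and vocab[idx] != BLANK:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


def one_hot(symbol, vocab=VOCAB):
    """Build a fake per-frame score vector that puts all mass on one symbol."""
    return [1.0 if s == symbol else 0.0 for s in vocab]


# Nine frames: repeated symbols collapse unless a blank separates them.
path = ["国", "国", BLANK, "航", BLANK, "a", "a", BLANK, "a"]
scores = [one_hot(s) for s in path]
print(ctc_greedy_decode(scores))  # prints "国航aa"
```

Because Chinese characters and English letters share one output layer, a single utterance can mix both languages without a separate language identification step.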
The model proposed in this paper introduces a feature learning
approach for speech recognition tasks in the ATC domain, and the
arXiv:2111.02654v1 [cs.SD] 4 Nov 2021