SPEECH RECOGNITION FOR AIR TRAFFIC CONTROL VIA FEATURE LEARNING AND END-TO-END TRAINING

Peng Fan, Dongyue Guo, Yi Lin (1,2), Bo Yang (1,2), Jianwei Zhang (1,2)

National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China,

College of Computer Science, Sichuan University, Chengdu 610065, China

ABSTRACT

In this work, we propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems. The proposed model integrates a feature learning block, a recurrent neural network (RNN), and the connectionist temporal classification loss to build an end-to-end ASR model. To cope with the complex environments of ATC speech, a learning block is designed to extract informative features from raw waveforms for acoustic modeling, instead of relying on handcrafted features. Both SincNet and 1D convolution blocks are applied to process the raw waveforms, and their outputs are concatenated and fed to the RNN layers for temporal modeling. Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner, i.e., from waveform to text. Finally, the multilingual issue in the ATC domain is addressed by constructing a combined vocabulary of Chinese characters and English letters. The proposed approach is validated on a multilingual real-world corpus (ATCSpeech), and the experimental results demonstrate that it outperforms the other baselines, achieving a 6.9% character error rate.

Index Terms: Automatic speech recognition, feature learning, air traffic control, multilingual, end-to-end training
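The abstract reports its result as a character error rate (CER). As an illustration only, the sketch below computes CER in the usual way, as the character-level edit distance divided by the number of reference characters. The function names are hypothetical, and ignoring whitespace when counting characters is an assumption; the scoring script used for the reported 6.9% is not described in this excerpt.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference symbol
                            curr[j - 1] + 1,      # insert a hypothesis symbol
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """CER over a corpus: total character edits / total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: whitespace is not scored as a character.
        ref_chars = [c for c in ref if not c.isspace()]
        hyp_chars = [c for c in hyp if not c.isspace()]
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # Made-up strings, not taken from ATCSpeech.
    print(character_error_rate(["abcd"], ["abxd"]))  # one substitution in four characters
```

Because the unit is the character, the same function applies unchanged to Chinese characters and English letters.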

1. INTRODUCTION

Automatic speech recognition (ASR) translates speech into computer-readable text [1]. In air traffic control (ATC), radio speech is the primary means of communication between air traffic controllers (ATCos) and pilots. ASR is introduced into the ATC system to transcribe the speech of ATCos and pilots, which can reduce the ATCo workload and help ensure flight safety [2].

Compared with common ASR research, the ATC domain poses many new challenges and difficulties. In general, ATCos and pilots speak English. In China, however, ATCos and pilots more frequently communicate in Chinese for domestic flights. That is to say, speech on the same frequency is usually in both Chinese and English, i.e., multilingual ASR is required for the ATC domain [3]. Our previous work introduced ASR into the ATC safety monitoring framework and also converted ATCo and pilot speech into instructions for controlling intent inference [4].
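The abstract describes handling the multilingual requirement with a combined vocabulary of Chinese characters and English letters. A minimal sketch of such a vocabulary is given below, under several assumptions that do not come from the paper: the `<blank>` token name, index 0 for the blank, upper-casing of English text, keeping the space as a token, and the sample transcripts, which are invented for illustration.

```python
BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def build_vocabulary(transcripts):
    """Map every distinct character (Chinese or English) to an integer id."""
    # Assumption: English is upper-cased; Chinese characters are unaffected.
    symbols = sorted({ch for text in transcripts for ch in text.upper()})
    tokens = [BLANK] + symbols  # assumption: blank occupies index 0
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(text, vocab):
    """Convert a transcript into a list of label ids."""
    return [vocab[ch] for ch in text.upper()]


def decode(ids, vocab):
    """Convert label ids back into text, skipping the blank symbol."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] != BLANK)


if __name__ == "__main__":
    # Invented example transcripts, not taken from ATCSpeech.
    corpus = ["国航四幺八 上升到九千", "air china four one eight climb"]
    vocab = build_vocabulary(corpus)
    ids = encode("climb 九千", vocab)
    print(len(vocab), ids, decode(ids, vocab))
```

With one shared label set, a single output layer can emit either language, so no language identification step is needed before recognition.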

Recently, end-to-end speech recognition systems have provided higher performance than traditional methods for common ASR tasks [5]. However, current end-to-end ASR systems usually use mel-frequency cepstral coefficients (MFCCs) or filter-bank (FBANK) features to process the raw speech waveform instead of taking the raw waveform directly as input.

Yi Lin is the corresponding author. This work was supported by the National Natural Science Foundation of China (No. 62001315).

SincNet, a deep learning-based feature extraction method, was proposed for the ASR and speaker recognition tasks, and the experimental results showed that neural networks based on SincNet achieved better performance on both tasks. SincNet can learn more informative and discriminative features from the raw waveform [6], [7]. Other deep learning-based models, such as wav2vec, were also proposed to extract speech features from raw waveforms. The wav2vec model explores unsupervised pre-training for speech recognition by learning representations from raw audio through several 1D convolution layers. It is trained on large amounts of unlabeled speech, and the resulting representations serve as the input of the acoustic model for the ASR task [8].
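To make the SincNet idea concrete, the sketch below builds one band-pass kernel as the difference of two sinc low-pass filters, g[n] = 2 f2 sinc(2π f2 n) − 2 f1 sinc(2π f1 n), multiplied by a Hamming window. In SincNet only the two cutoff frequencies of each filter are learned; here they are fixed numbers. The function names, the 8 kHz sample rate, the 300 to 3400 Hz band, and the kernel length of 101 are assumptions chosen for illustration, not settings from the paper.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass_kernel(low_hz, high_hz, sample_rate, kernel_size):
    """Windowed band-pass FIR kernel defined only by its two cutoff frequencies."""
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd and at least 3")
    f1 = low_hz / sample_rate   # normalized low cutoff (cycles per sample)
    f2 = high_hz / sample_rate  # normalized high cutoff
    half = (kernel_size - 1) // 2
    kernel = []
    for i in range(kernel_size):
        n = i - half  # time index centred on zero
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * hamming)
    return kernel


if __name__ == "__main__":
    # Assumed values: 8 kHz audio and a 300-3400 Hz pass band.
    taps = sinc_bandpass_kernel(300.0, 3400.0, 8000, 101)
    print(len(taps), taps[50])
```

A filter of this kind has two trainable parameters regardless of its length, whereas an ordinary 1D convolution filter of the same length has one parameter per tap.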

In previous studies, well-designed MFCC or FBANK features were applied to perform preliminary processing on the raw waveform. In this procedure, the raw speech is divided into frames with a 25 ms frame length and a 10 ms shift, and a series of signal processing transformations is applied to convert the 1D waveform into a 2D feature map. After the raw speech waveform is processed by MFCC or FBANK, the extracted feature map is fed into the neural network for acoustic modeling. This method has achieved state-of-the-art results in many ASR tasks. However, because the design of FBANK and MFCC is based on the human ear's response to audio, it may lose some of the information in the raw speech waveform. Considering the complex ATC environment, handcrafted feature engineering may not be an optimal option for ASR tasks. Therefore, learning mechanisms such as SincNet and wav2vec were proposed to learn informative and discriminative features from raw waveforms, and they achieved the desired performance improvements for common ASR applications [7], [8].
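The framing step of the handcrafted pipeline described above (25 ms frames with a 10 ms shift) can be sketched as follows. The 16 kHz sample rate in the example and the choice to drop a trailing partial frame rather than pad it are assumptions; the excerpt does not state either.

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1D waveform into overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    shift = sample_rate * shift_ms // 1000      # samples between frame starts
    if len(samples) < frame_len:
        return []
    # Assumption: a trailing partial frame is dropped, not zero-padded.
    num_frames = 1 + (len(samples) - frame_len) // shift
    return [samples[i * shift:i * shift + frame_len] for i in range(num_frames)]


if __name__ == "__main__":
    one_second = [0.0] * 16000  # assumed 16 kHz sample rate
    frames = frame_signal(one_second, 16000)
    print(len(frames), len(frames[0]))  # 98 frames of 400 samples
```

Stacking a per-frame feature vector (for example, filter-bank energies) over these frames yields the 2D feature map mentioned in the text.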

In this work, an end-to-end neural network is designed to achieve the ASR task in the ATC domain, in which a novel feature learning block is proposed to extract high-level speech representations from raw waveforms. Both SincNet and 1D convolution blocks are designed to learn features from raw waveforms. The backbone network is constructed by cascading convolutional neural network (CNN) and recurrent neural network (RNN) layers and is jointly optimized with the feature learning block by the connectionist temporal classification (CTC) loss function. Most importantly, the proposed model leverages the feature learning block to implement end-to-end training, which predicts the text sequence from the raw waveform without any pretraining.
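For readers unfamiliar with the CTC loss mentioned above, the sketch below computes it for one utterance with the standard forward recursion over the blank-augmented label sequence. It is a plain-probability illustration, not the paper's implementation: practical toolkits work in the log domain for numerical stability, and the function names and the use of index 0 for the blank are assumptions.

```python
import math


def ctc_forward_probability(probs, target, blank=0):
    """Sum the probability of every frame-level path that collapses to target.

    probs[t][k] is the network's probability of label k at frame t;
    target is a list of label ids without blanks.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    extended = [blank]
    for label in target:
        extended += [label, blank]
    num_states = len(extended)

    alpha = [0.0] * num_states
    alpha[0] = probs[0][blank]
    if num_states > 1:
        alpha[1] = probs[0][extended[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]              # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]     # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][extended[s]]
        alpha = new_alpha

    # Valid paths end on the last label or on the final blank.
    result = alpha[-1]
    if num_states > 1:
        result += alpha[-2]
    return result


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of target; infinite if no alignment exists."""
    p = ctc_forward_probability(probs, target, blank)
    return math.inf if p == 0.0 else -math.log(p)


if __name__ == "__main__":
    uniform = [[0.5, 0.5], [0.5, 0.5]]  # two frames, labels {blank, "a"}
    print(ctc_forward_probability(uniform, [1]))  # 0.75
```

In the example, three of the four possible two-frame paths ("aa", "a-", "-a") collapse to "a", which gives 0.75. Because the loss sums over all alignments, training needs only the transcript and no frame-level labels.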

The model proposed in this paper introduces a feature learning approach for speech recognition tasks in the ATC domain, and the

arXiv:2111.02654v1 [cs.SD] 4 Nov 2021
