NEURAL NETWORK BASED SPECTRAL MASK ESTIMATION
FOR ACOUSTIC BEAMFORMING
Jahn Heymann, Lukas Drude, Reinhold Haeb-Umbach
University of Paderborn, Department of Communications Engineering, Paderborn, Germany
ABSTRACT
We present a neural network based approach to acoustic beamform-
ing. The network is used to estimate spectral masks from which
the Cross-Power Spectral Density matrices of speech and noise are
estimated, which in turn are used to compute the beamformer co-
efficients. The network training is independent of the number and
the geometric configuration of the microphones. We further show
that it is possible to train the network on clean speech only, avoid-
ing the need for stereo data with separated speech and noise. Two
types of networks are evaluated: a small feed-forward network
with only one hidden layer, and a more elaborate bi-directional
Long Short-Term Memory network. We compare our system with
different parametric approaches to mask estimation, combined with
different beamforming algorithms. We show that our system yields
superior results, both in terms of perceptual speech quality and with
respect to speech recognition error rate. The results for the simple
feed-forward network are especially encouraging considering its low
computational requirements.
Index Terms— Robust Speech Recognition, Acoustic Beam-
forming, Feature Enhancement, Deep Neural Network
1. INTRODUCTION
Automatic Speech Recognition (ASR) performance experienced a
big boost in recent years with the rise of Deep Neural Networks
(DNNs) combined with ever increasing computational power and
the availability of hundreds of hours of transcribed speech data for
training. Trained in a noise-aware scenario with enough data, the
modeling power of DNNs rendered many of the signal or feature en-
hancement techniques developed for GMM-HMM systems superflu-
ous. Only some pre-processing steps are still able to bring noticeable
improvements. Especially if multi-channel audio data is available,
acoustic beamforming is one technique to achieve substantial gains.
Despite recent attempts to take advantage of multi-channel data
within a (convolutional) DNN, or even to train a network directly on
multi-channel waveforms, model-based beamforming has still proved
to be superior [1, 2].
The model-based data-dependent beamforming operation re-
quires an estimate of either the Direction-of-Arrival (DoA) or the
(relative) transfer functions from the acoustic source to the micro-
phones. For the first, the geometry of the microphone array has to be
known, while the latter usually requires an estimation of the statistics
of the target speech signal. Further, advanced beamforming opera-
tions require an estimate of the Cross-Power Spectral Density (PSD)
matrix of the noise. These statistics can be obtained by estimating
spectral masks for speech and noise, and this is where data-driven
approaches can be incorporated, as is shown in this paper.
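To make the pipeline sketched above concrete, the following NumPy snippet estimates a PSD matrix by weighting each frame's outer product with the corresponding mask value and averaging over time. The function name `masked_psd` and the tensor layout are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def masked_psd(Y, M):
    """Mask-weighted cross-power spectral density estimate (sketch).

    Y: complex STFT, shape (T, F, D) -- frames, frequency bins, channels.
    M: real-valued mask, shape (T, F), values in [0, 1].
    Returns PSD matrices of shape (F, D, D), one D x D matrix per bin.
    """
    # Weight each frame's outer product y y^H by its mask value, sum over time.
    num = np.einsum('tf,tfd,tfe->fde', M, Y, Y.conj())
    # Normalize by the total mask weight per bin; guard against all-zero masks.
    den = np.maximum(M.sum(axis=0), 1e-10)
    return num / den[:, None, None]
```

Applying this once with the speech mask and once with the noise mask yields the two matrices the beamformer needs.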
This work was in part supported by Deutsche Forschungsgemeinschaft
under contract no. Ha 3455/11-1.
While many model-based methods exist for spectral mask estima-
tion (e.g., [3, 4, 5, 6, 7, 8]), we want to leverage the power of a
discriminatively trained data-driven approach to estimate a spectral
mask for the speech and the noise component. A distinctive advan-
tage of the proposed neural network based mask estimation is that
we are able to jointly estimate a spectral mask for all frequencies,
whereas it is common practice to treat individual frequencies sep-
arately in conventional parametric mask estimation. We show by
example that this property better captures speech characteristics, so
that the beamformer is not easily fooled into mistaking high-energy
noise sources for speech.
Due to their very nature, data-driven approaches usually perform
best when they are exposed during training to all the variability en-
countered at test time. In our scenario, this noise-aware training re-
quires separate speech and noise data. This is often used to argue against such
an approach. We show that this requirement can be relaxed to some
extent and that even with only (clean) speech data available for mask
estimation good results can be achieved.
Apart from a comparison of parametric versus data-driven mask
estimation, we also compare two beamformer designs: the well-
known Minimum Variance Distortionless Response (MVDR) beam-
former and the Generalized Eigenvalue (GEV) beamformer with an
optional distortion reduction filter [9].
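As a minimal sketch of the two designs under their standard textbook formulations (not necessarily the paper's exact implementation): per frequency bin, the GEV beamformer maximizes the output SNR, which amounts to taking the principal eigenvector of Φ_NN⁻¹ Φ_XX, while MVDR additionally requires a steering vector d. Function names and shapes below are illustrative assumptions.

```python
import numpy as np

def gev_weights(phi_xx, phi_nn):
    """GEV beamformer: maximize the SNR ratio w^H Phi_XX w / w^H Phi_NN w.

    Solved for one frequency bin as the principal eigenvector of
    Phi_NN^{-1} Phi_XX.  phi_xx, phi_nn: (D, D) complex Hermitian matrices.
    """
    vals, vecs = np.linalg.eig(np.linalg.solve(phi_nn, phi_xx))
    return vecs[:, np.argmax(vals.real)]

def mvdr_weights(phi_nn, d):
    """MVDR beamformer for steering vector d:
    w = Phi_NN^{-1} d / (d^H Phi_NN^{-1} d), so that w^H d = 1 (distortionless).
    """
    num = np.linalg.solve(phi_nn, d)
    return num / (d.conj() @ num)
```

In a mask-based system the steering vector d is not known a priori; one common choice is the principal eigenvector of the estimated speech PSD matrix.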
2. MASK ESTIMATION
2.1. Neural mask estimation
Our proposed mask estimator consists of multiple neural networks
with shared weights – one for each microphone channel. In this pa-
per we experimented with a small feed-forward (FF) network and
a bi-directional Long Short-Term Memory (BLSTM) network. Ta-
bles 1 and 2 show the configuration of their layers. The input (y_t)
for each network is a single frame of the spectral magnitude of one
channel. Note that this means that the FF network has no temporal
context. The output size of the network depends on the training
method.
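To make the frame-wise operation concrete, here is a hypothetical single-hidden-layer mask estimator in NumPy. The layer sizes are placeholders rather than the configurations from Tables 1 and 2, and the weights are random rather than trained; the sigmoid output keeps every mask value in (0, 1), as required for a soft tf-bin mask.

```python
import numpy as np

rng = np.random.default_rng(0)

def ff_mask_net(y_t, W1, b1, W2, b2):
    """Single-hidden-layer feed-forward mask estimator (illustrative sketch).

    y_t: spectral magnitude of one frame, shape (F,).
    Returns a speech mask per frequency bin, shape (F,), values in (0, 1).
    """
    h = np.maximum(W1 @ y_t + b1, 0.0)             # hidden layer, ReLU
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))    # sigmoid output layer

# Placeholder dimensions; the actual layer sizes are given in Tables 1 and 2.
F, H = 513, 1024
W1, b1 = rng.standard_normal((H, F)) * 0.01, np.zeros(H)
W2, b2 = rng.standard_normal((F, H)) * 0.01, np.zeros(F)
```

Because the network sees a single frame at a time, any temporal smoothing must come from the model itself (as in the BLSTM variant) rather than from the input.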
In case of noise-aware training, two masks are estimated: the
first indicates which time frequency (tf) bins are presumably dom-
inated by speech, while the second one indicates which are dom-
inated by noise. When trained on clean speech only, we estimate
solely the mask for the speech component, M_X, and calculate the
mask for the noise component as 1 − M_X for each tf bin. The masks
for each channel are then condensed to a single speech and a single
noise mask using a median operation. The median is preferred over
a mean computation because of its resilience to outliers. Outliers
may be caused by broken or occluded microphones. The resulting
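The channel-wise condensation described above can be sketched in a single NumPy operation; the array layout (channels on the first axis) is our assumption.

```python
import numpy as np

def condense_masks(masks):
    """Condense per-channel masks into one mask via a per-bin median.

    masks: shape (D, T, F) -- one mask per microphone channel.
    The median is robust to outlier channels, e.g. a broken or
    occluded microphone whose mask estimate is unreliable.
    """
    return np.median(masks, axis=0)
```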
condensed masks are used to estimate the PSD matrices Φ_XX of
speech and Φ_NN of noise, from which the beamformer coefficients
are obtained. Note that by treating each channel separately, spatial
IEEE ICASSP