NEURAL NETWORK BASED SPECTRAL MASK ESTIMATION
FOR ACOUSTIC BEAMFORMING
Jahn Heymann, Lukas Drude, Reinhold Haeb-Umbach
University of Paderborn, Department of Communications Engineering, Paderborn, Germany
ABSTRACT
We present a neural network based approach to acoustic beamform-
ing. The network is used to estimate spectral masks from which
the Cross-Power Spectral Density matrices of speech and noise are
estimated, which in turn are used to compute the beamformer co-
efficients. The network training is independent of the number and
the geometric configuration of the microphones. We further show
that it is possible to train the network on clean speech only, avoid-
ing the need for stereo data with separated speech and noise. Two
types of networks are evaluated: a small feed-forward network
with only one hidden layer, and a more elaborate bi-directional
Long Short-Term Memory network. We compare our system with
different parametric approaches to mask estimation, combined with
different beamforming algorithms. We show that our system yields
superior results, both in terms of perceptual speech quality and with
respect to speech recognition error rate. The results for the simple
feed-forward network are especially encouraging considering its low
computational requirements.
Index Terms— Robust Speech Recognition, Acoustic Beam-
forming, Feature Enhancement, Deep Neural Network
1. INTRODUCTION
Automatic Speech Recognition (ASR) performance experienced a
big boost in recent years with the rise of Deep Neural Networks
(DNNs) combined with ever increasing computational power and
the availability of hundreds of hours of transcribed speech data for
training. Trained in a noise-aware scenario with enough data, the
modeling power of DNNs rendered many of the signal or feature en-
hancement techniques developed for GMM-HMM systems superflu-
ous. Only some pre-processing steps are still able to bring noticeable
improvements. Especially if multi-channel audio data is available,
acoustic beamforming is one technique to achieve substantial gains.
Despite recent attempts to take advantage of multi-channel data
within a (convolutional) DNN, or even to train a network directly on
multi-channel waveforms, model-based beamforming has still proved
to be superior [1, 2].
The model-based data-dependent beamforming operation re-
quires an estimate of either the Direction-of-Arrival (DoA) or the
(relative) transfer functions from the acoustic source to the micro-
phones. For the first, the geometry of the microphone array has to be
known, while the latter usually requires an estimation of the statistics
of the target speech signal. Further, advanced beamforming opera-
tions require an estimate of the Cross-Power Spectral Density (PSD)
matrix of the noise. These statistics can be obtained by estimating
spectral masks for speech and noise, and this is where data-driven
approaches can be incorporated, as is shown in this paper.
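To make the pipeline sketched above concrete, the following NumPy snippet estimates a PSD matrix by weighting each frame's outer product with the corresponding mask value and averaging over time. The function name `masked_psd` and the tensor layout are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def masked_psd(Y, M):
    """Mask-weighted cross-power spectral density estimate (sketch).

    Y: complex STFT, shape (T, F, D) -- frames, frequency bins, channels.
    M: real-valued mask, shape (T, F), values in [0, 1].
    Returns PSD matrices of shape (F, D, D), one D x D matrix per bin.
    """
    # Weight each frame's outer product y y^H by its mask value, sum over time.
    num = np.einsum('tf,tfd,tfe->fde', M, Y, Y.conj())
    # Normalize by the total mask weight per bin; guard against all-zero masks.
    den = np.maximum(M.sum(axis=0), 1e-10)
    return num / den[:, None, None]
```

Applying this once with the speech mask and once with the noise mask yields the two matrices the beamformer needs.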
This work was in part supported by Deutsche Forschungsgemeinschaft
under contract no. Ha 3455/11-1.
While many model-based methods exist for spectral mask estima-
tion (e.g., [3, 4, 5, 6, 7, 8]), we want to leverage the power of a
discriminatively trained data-driven approach to estimate a spectral
mask for the speech and the noise component. A distinctive advan-
tage of the proposed neural network based mask estimation is that
we are able to jointly estimate a spectral mask for all frequencies,
whereas it is common practice to treat individual frequencies sep-
arately in conventional parametric mask estimation. We show by
example that this property better captures speech characteristics, so
that the beamformer is not easily fooled into mistaking high-energy
noise sources for speech.
Due to their very nature, data-driven approaches usually perform
best when they are exposed during training to all the variability en-
countered at test time. In our scenario, this noise-aware training re-
quires separate speech and noise data. This is often used to argue against such
an approach. We show that this requirement can be relaxed to some
extent and that even with only (clean) speech data available for mask
estimation good results can be achieved.
Apart from a comparison of parametric versus data-driven mask
estimation, we also compare two beamformer designs: the well-
known Minimum Variance Distortionless Response (MVDR) beam-
former and the Generalized Eigenvalue (GEV) beamformer with an
optional distortion reduction filter [9].
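As a minimal sketch of the two designs under their standard textbook formulations (not necessarily the paper's exact implementation): per frequency bin, the GEV beamformer maximizes the output SNR, which amounts to taking the principal eigenvector of Φ_NN⁻¹ Φ_XX, while MVDR additionally requires a steering vector d. Function names and shapes below are illustrative assumptions.

```python
import numpy as np

def gev_weights(phi_xx, phi_nn):
    """GEV beamformer: maximize the SNR ratio w^H Phi_XX w / w^H Phi_NN w.

    Solved for one frequency bin as the principal eigenvector of
    Phi_NN^{-1} Phi_XX.  phi_xx, phi_nn: (D, D) complex Hermitian matrices.
    """
    vals, vecs = np.linalg.eig(np.linalg.solve(phi_nn, phi_xx))
    return vecs[:, np.argmax(vals.real)]

def mvdr_weights(phi_nn, d):
    """MVDR beamformer for steering vector d:
    w = Phi_NN^{-1} d / (d^H Phi_NN^{-1} d), so that w^H d = 1 (distortionless).
    """
    num = np.linalg.solve(phi_nn, d)
    return num / (d.conj() @ num)
```

In a mask-based system the steering vector d is not known a priori; one common choice is the principal eigenvector of the estimated speech PSD matrix.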
2. MASK ESTIMATION
2.1. Neural mask estimation
Our proposed mask estimator consists of multiple neural networks
with shared weights – one for each microphone channel. In this pa-
per we experimented with a small feed-forward (FF) network and
a bi-directional Long Short-Term Memory (BLSTM) network. Ta-
bles 1 and 2 show the configuration of their layers. The input (y_t)
for each network is a single frame of the spectral magnitude of one
channel. Note that this means that the FF network has no temporal
context. The output size of the network depends on the training
method.
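To make the frame-wise operation concrete, here is a hypothetical single-hidden-layer mask estimator in NumPy. The layer sizes are placeholders rather than the configurations from Tables 1 and 2, and the weights are random rather than trained; the sigmoid output keeps every mask value in (0, 1), as required for a soft tf-bin mask.

```python
import numpy as np

rng = np.random.default_rng(0)

def ff_mask_net(y_t, W1, b1, W2, b2):
    """Single-hidden-layer feed-forward mask estimator (illustrative sketch).

    y_t: spectral magnitude of one frame, shape (F,).
    Returns a speech mask per frequency bin, shape (F,), values in (0, 1).
    """
    h = np.maximum(W1 @ y_t + b1, 0.0)             # hidden layer, ReLU
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))    # sigmoid output layer

# Placeholder dimensions; the actual layer sizes are given in Tables 1 and 2.
F, H = 513, 1024
W1, b1 = rng.standard_normal((H, F)) * 0.01, np.zeros(H)
W2, b2 = rng.standard_normal((F, H)) * 0.01, np.zeros(F)
```

Because the network sees a single frame at a time, any temporal smoothing must come from the model itself (as in the BLSTM variant) rather than from the input.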
In case of noise-aware training, two masks are estimated: the
first indicates which time frequency (tf) bins are presumably dom-
inated by speech, while the second one indicates which are dom-
inated by noise. When trained on clean speech only, we estimate
solely the mask for the speech component, M_X, and calculate the
mask for the noise component as 1 − M_X for each tf bin. The masks
for each channel are then condensed to a single speech and a single
noise mask using a median operation. The median is preferred over
a mean computation because of its resilience to outliers. Outliers
may be caused by broken or occluded microphones. The resulting
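The channel-wise condensation described above can be sketched in a single NumPy operation; the array layout (channels on the first axis) is our assumption.

```python
import numpy as np

def condense_masks(masks):
    """Condense per-channel masks into one mask via a per-bin median.

    masks: shape (D, T, F) -- one mask per microphone channel.
    The median is robust to outlier channels, e.g. a broken or
    occluded microphone whose mask estimate is unreliable.
    """
    return np.median(masks, axis=0)
```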
condensed masks are used to estimate the PSD matrices Φ_XX of
speech and Φ_NN of noise, from which the beamformer coefficients
are obtained. Note that by treating each channel separately, spatial
IEEE ICASSP