A Reverberation-Time-Aware Speech Dereverberation Method Based on Deep Neural Networks (CSDN文库 resource)


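To discuss the reverberation-time-aware speech dereverberation method based on deep neural networks, we first need to understand how reverberation affects a speech signal. Reverberation arises from the reflection, scattering, and absorption of sound waves in an enclosed space, and it adds many delayed and attenuated copies of the original speech to the received signal. In hands-free, eyes-busy applications, where the microphone is placed at a distance from the talker, the received signal contains these copies, which degrades speech quality and intelligibility.

The paper proposes a reverberation-time-aware (RTA) speech dereverberation framework based on deep neural networks (DNNs) that is designed to handle a wide range of reverberation times. It highlights three key steps in building a robust system. First, in contrast to existing algorithms that use a sigmoid activation function and min-max normalization, the proposed method uses a linear activation function at the output layer and global mean-variance normalization of the target features. These changes are intended to help the network learn the complicated nonlinear mapping from reverberant to anechoic speech and to improve the restoration of low-frequency and intermediate-frequency content. Second, the paper studies two key design parameters, the frame shift used in speech framing and the acoustic context window size at the DNN input, and shows that parameters that depend on the reverberation time (RT60) are needed in the DNN training stage to optimize performance in diverse reverberant environments. Third, the reverberation time is estimated in order to select a suitable frame shift and context window size for feature extraction, and the resulting log-power spectrum features are then fed to the trained DNN for dereverberation.

The core concept of the framework is reverberation-time awareness: by estimating the reverberation time, the system selects the feature extraction and processing parameters that suit the current reverberant environment.

The experimental results reported in the paper indicate that the proposed framework outperforms conventional DNNs that do not take the reverberation time into account, and that its performance is only slightly worse than that of the oracle systems with known reverberation times, even under extremely weak and severe reverberant conditions. The framework also generalizes well to unseen room sizes, loudspeaker and microphone positions, and recorded room impulse responses.

In summary, the method adopts design choices and parameter selection strategies that differ from earlier DNN-based approaches, and it maintains effective dereverberation even when the speech is heavily reverberated. These findings are relevant to both speech enhancement and speech recognition.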
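The first step described in the summary above, global mean-variance normalization of the target features, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the toy data is random, and computing one mean and standard deviation per frequency bin (rather than a single scalar pair) is an assumption.

```python
import numpy as np


def fit_global_mvn(target_lps):
    """Estimate normalization statistics over all target frames.

    target_lps has shape (num_frames, num_bins) and is assumed to pool the
    anechoic log-power spectra of every training utterance.
    Assumption: statistics are computed per frequency bin.
    """
    mean = target_lps.mean(axis=0)
    std = np.maximum(target_lps.std(axis=0), 1e-8)  # avoid division by zero
    return mean, std


def normalize(lps, mean, std):
    """Map features to zero mean and unit variance."""
    return (lps - mean) / std


def denormalize(lps_norm, mean, std):
    """Undo the normalization, e.g. on the output of a linear output layer."""
    return lps_norm * std + mean


# Toy stand-in for pooled anechoic training features (random, not real speech).
rng = np.random.default_rng(0)
toy_targets = rng.normal(loc=-5.0, scale=3.0, size=(1000, 257))

mean, std = fit_global_mvn(toy_targets)
normalized = normalize(toy_targets, mean, std)
restored = denormalize(normalized, mean, std)
```

Because the normalized targets are unbounded, a linear output layer can represent them directly, whereas a sigmoid output would be limited to values between 0 and 1.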


102 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 1, JANUARY 2017

A Reverberation-Time-Aware Approach to Speech Dereverberation Based on Deep Neural Networks

Bo Wu, Kehuang Li, Minglei Yang, and Chin-Hui Lee, Fellow, IEEE

Abstract—A reverberation-time-aware deep-neural-network (DNN)-based speech dereverberation framework is proposed to handle a wide range of reverberation times. There are three key steps in designing a robust system. First, in contrast to sigmoid activation and min–max normalization in state-of-the-art algorithms, a linear activation function at the output layer and global mean-variance normalization of target features are adopted to learn the complicated nonlinear mapping function from reverberant to anechoic speech and to improve the restoration of the low-frequency and intermediate-frequency contents. Next, two key design parameters, namely, frame shift size in speech framing and acoustic context window size at the DNN input, are investigated to show that RT60-dependent parameters are needed in the DNN training stage in order to optimize the system performance in diverse reverberant environments. Finally, the reverberation time is estimated to select the proper frame shift and context window sizes for feature extraction before feeding the log-power spectrum features to the trained DNNs for speech dereverberation. Our experimental results indicate that the proposed framework outperforms the conventional DNNs without taking the reverberation time into account, while achieving a performance only slightly worse than the oracle cases with known reverberation times even for extremely weak and severe reverberant conditions. It also generalizes well to unseen room sizes, loudspeaker and microphone positions, and recorded room impulse responses.
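The final step of the abstract, selecting the frame shift and context window size from an estimated reverberation time, can be pictured as a simple table lookup. The sketch below is illustrative only: the RT60 boundaries, frame shifts, and context sizes in the table are invented placeholders, not values from the paper, and the RT60 estimate is assumed to come from a separate estimator that is not shown.

```python
from bisect import bisect_left

# Hypothetical lookup table, NOT taken from the paper:
# (upper RT60 bound in seconds, frame shift in samples, context window in frames)
HYPOTHETICAL_RT60_TABLE = [
    (0.3, 128, 3),
    (0.6, 256, 7),
    (float("inf"), 512, 11),
]


def select_parameters(rt60_estimate):
    """Return (frame_shift, context_frames) for an estimated RT60 in seconds."""
    if rt60_estimate < 0:
        raise ValueError("RT60 must be non-negative")
    upper_bounds = [row[0] for row in HYPOTHETICAL_RT60_TABLE]
    # First row whose upper bound is greater than or equal to the estimate.
    index = bisect_left(upper_bounds, rt60_estimate)
    _, frame_shift, context_frames = HYPOTHETICAL_RT60_TABLE[index]
    return frame_shift, context_frames


weak = select_parameters(0.2)
moderate = select_parameters(0.5)
severe = select_parameters(1.2)
```

In a full system, the selected pair would determine both how the input features are extracted and which RT60-dependent configuration of the trained DNN is applied.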

Index Terms—Acoustic context, deep neural networks (DNNs), frame shift, linear output layer, mean-variance normalization, reverberation-time-aware (RTA), speech dereverberation.

I. INTRODUCTION

WHEN a microphone is placed at a distance from a talker in an enclosed space in hands-free, eyes-busy speech applications, the received signal will be a collection of many delayed and attenuated copies of the original speech signals, caused by the reflections from walls, ceilings, and floors [1]. As a result, reverberation often seriously degrades speech quality and intelligibility. Such deteriorations can cause decreased performances for automatic speech recognition, hearing aids and source localization. Thus, an effective dereverberation solution will benefit many speech applications.

Manuscript received April 29, 2016; revised August 8, 2016 and October 20, 2016; accepted October 20, 2016. Date of publication October 31, 2016; date of current version November 28, 2016. This work was supported in part by the National Natural Science Foundation of China (61571344). The work of B. Wu was supported by a grant from the China Scholarship Council. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. DeLiang Wang.

B. Wu and M. Yang are with the National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710126, China (e-mail: rambowu11@gmail.com; mlyang@xidian.edu.cn).

K. Li and C.-H. Lee are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: kehlekernel@gmail.com; chl@ece.gatech.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASLP.2016.2623559

Many dereverberation techniques have been proposed in the past (e.g., [2]–[6]). One direct way is to estimate an inverse filter of the room impulse response (RIR) [7] to deconvolve the reverberant signal. However, a minimum phase assumption is often needed, which is almost never satisfied in practice [7]. The RIR can also be varying in time and hard to estimate [1]. The work presented in [2] estimated a fixed length of an inverse filter of RIR by maximizing the kurtosis of the linear prediction (LP) residual for the reduction of early reverberation, without taking into account the impact of RIR in distinct reverberant environments on the system performance. The inverse filtering is only effective in a short reverberation time (RT60) [8] range, from 0.2 to 0.4 s. Mosayyebpour et al. [4] presented an iterative method to blindly determine the filter length according to the reverberant condition, which could be used in highly reverberant rooms. Nevertheless, the stopping criterion was empirically chosen. Kinoshita et al. [3] estimated the late reverberations using long-term multi-step linear prediction, and then reduced the late reverberation effect by employing spectral subtraction. Some studies attempted to separate speech and reverberation via homomorphic transformation [9], [10]. Nevertheless, they are not very effective when the human auditory system is the target. Other methods dealt with dereverberation by exploiting the essential properties of speech such as harmonic filtering [11]. The dereverberation filter is only estimated from voiced speech segments, therefore achieving a poor dereverberation performance for unvoiced speech segments.

Recently, due to their strong regression capabilities, deep neural networks (DNNs) [12], [13] have also been utilized in speech enhancement [14], [15], source separation [16], [17] and bandwidth expansion [18], [19]. Han et al. [5], [20] also proposed to dereverberate speech using DNNs, to learn a spectral mapping from reverberant to anechoic speech. Although the results reflect the state-of-the-art performances, they constrained the DNN prediction of log-spectral magnitude to a unit range and normalized the target features into the same range, preventing a good dereverberation performance, especially at low RT60s. Moreover, their system is environmentally insensitive and is therefore unable to realize its full potential.

In this study, we utilize an improved DNN dereverberation system we proposed recently [21] by adopting a linear output layer and globally normalizing the target features into zero mean and unit variance, and then investigate the effects of frame shift and acoustic context sizes on the dereverberated speech quality using DNNs at different RT60s. We show that, on the one hand, low frame shifts cannot obtain good performances in a strong reverberant condition, even at the price of increased computational complexities. We next demonstrate that, on the other hand, a large number of speech frames covering a large acoustic context commonly used at the DNN input often degrades the quality of dereverberated speech at a low RT60. Based on these observations, a frame-shift-aware DNN (FSA-DNN) and an acoustic-context-aware DNN (ACA-DNN) are [proposed, treating the frame shift and acoustic] context sizes as key DNN design parameters. We further explore a reverberation-time-aware DNN (RTA-DNN) that outperforms separate systems by considering both effects.

Fig. 1. A DNN-based speech dereverberation system.
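The two design parameters discussed in the introduction, the frame shift and the acoustic context window, can be made concrete with a short sketch. The helper names, the frame length of 512 samples, and the edge-padding strategy are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def frame_signal(signal, frame_length, frame_shift):
    """Split a 1-D signal into overlapping frames without padding."""
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_length) // frame_shift
    starts = frame_shift * np.arange(num_frames)
    return np.stack([signal[s:s + frame_length] for s in starts])


def splice_context(features, half_window):
    """Concatenate each frame with half_window neighbours on each side.

    Frames at the edges are padded by repeating the first or last frame
    (an assumption; other padding schemes are possible).
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")
    shifted = [padded[k:k + num_frames] for k in range(2 * half_window + 1)]
    return np.concatenate(shifted, axis=1)


signal = np.arange(1600, dtype=float)

# A smaller frame shift produces more frames for the same signal.
frames_large_shift = frame_signal(signal, frame_length=512, frame_shift=256)
frames_small_shift = frame_signal(signal, frame_length=512, frame_shift=128)

# A context of 2 frames on each side multiplies the input dimension by 5.
spliced = splice_context(frames_large_shift, half_window=2)
```

Halving the frame shift roughly doubles the number of frames to process, which is the computational cost mentioned above, while widening the context window increases the DNN input dimension.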

The rest of the paper is organized as follows. We first describe the proposed RTA-DNN dereverberation system in Section II. Motivations for exploring the frame shift and acoustic context parameters in reverberant situation are given in Section III. Experimental results are next provided and analyzed in Section IV. The generalization capabilities of the proposed DNN models are illustrated in Section V. Finally we summarize our findings in Section VI.

II. DNN-BASED SPEECH DEREVERBERATION MODEL

A block diagram of the DNN-based speech dereverberation system is illustrated in Fig. 1. In the training stage, a regression DNN [14] is trained by a set of multi-condition data, consisting of pairs of reverberant and anechoic speech represented by log-power spectra (LPS). In the dereverberation stage, the well-trained DNN is fed with the LPS features of input speech to generate the corresponding enhanced LPS features. The required phase is directly extracted from the reverberant speech, because human ears are considered to be not sensitive to such information [22]. Finally the dereverberated waveform is reconstructed from the estimated spectral magnitude and the reverberant speech phase with an overlap-add method [23].
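The dereverberation stage described above (LPS extraction, enhancement, and reconstruction with the reverberant phase by overlap-add) can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the Hann window, the frame length of 512 and frame shift of 256 samples, the natural logarithm, and the window-squared normalization are illustrative choices, and the trained DNN is replaced by an identity placeholder.

```python
import numpy as np

FRAME_LENGTH = 512  # assumed frame length in samples
FRAME_SHIFT = 256   # assumed frame shift in samples


def stft(signal, frame_length=FRAME_LENGTH, frame_shift=FRAME_SHIFT):
    """Windowed short-time Fourier transform, one row per frame."""
    window = np.hanning(frame_length)
    num_frames = 1 + (len(signal) - frame_length) // frame_shift
    frames = np.stack([
        signal[i * frame_shift:i * frame_shift + frame_length] * window
        for i in range(num_frames)
    ])
    return np.fft.rfft(frames, axis=1)


def log_power_spectrum(spectrum, floor=1e-12):
    """Log-power spectrum (LPS); the floor avoids log(0)."""
    return np.log(np.abs(spectrum) ** 2 + floor)


def reconstruct(enhanced_lps, reverberant_phase,
                frame_length=FRAME_LENGTH, frame_shift=FRAME_SHIFT):
    """Combine estimated magnitude with reverberant phase, then overlap-add."""
    magnitude = np.sqrt(np.exp(enhanced_lps))
    spectrum = magnitude * np.exp(1j * reverberant_phase)
    frames = np.fft.irfft(spectrum, n=frame_length, axis=1)
    window = np.hanning(frame_length)
    length = frame_shift * (len(frames) - 1) + frame_length
    output = np.zeros(length)
    weight = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * frame_shift
        output[start:start + frame_length] += frame * window
        weight[start:start + frame_length] += window ** 2
    # Normalize by the accumulated squared window to undo the windowing.
    return output / np.maximum(weight, 1e-8)


rng = np.random.default_rng(0)
reverberant = rng.standard_normal(4096)  # random stand-in for a waveform

spectrum = stft(reverberant)
lps = log_power_spectrum(spectrum)
phase = np.angle(spectrum)
enhanced_lps = lps  # placeholder: a trained DNN would produce this
rebuilt = reconstruct(enhanced_lps, phase)
```

With the identity placeholder, the interior of `rebuilt` matches the input closely, which is a convenient sanity check of the analysis and synthesis chain before a real model is plugged in.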

A. Output Layer Activation and Target Feature Normalization

1) Sigmoid Activation and Min–Max Normalization: In [5], Han et al. proposed to learn the log-spectral mapping function from reverberant to anechoic speech using a nonlinear DNN-based regression model, and achieved state-of-the-art dereverberation performance. They constrained the DNN output of log-spectral magnitude to a unit range of [0, 1] by using a sigmoid activation function and normalized the target (anechoic) features into the same range. A minimum mean squared error (MMSE) objective function between the DNN output and target features can be illustrated as follows:

$$\min E = \frac{1}{N}\sum_{n=1}^{N}\sum_{d=1}^{D}\left(\hat{X}^{n}_{d} - X^{n}_{d,\mathrm{norm}}\right)^{2} \tag{1}$$

where $E$ denotes the mean squared error. $\hat{X}^{n}_{d}$ and $X^{n}_{d,\mathrm{norm}}$ represent the $d$-th DNN output and normalized target feature at frame index $n$, with $D$ and $N$ denoting the size of the feature vector and mini-batch, respectively.

$$\hat{X}^{n}_{d} = \frac{1}{1 + e^{-\left(\sum_{m=1}^{M} W_{m,d} S^{n}_{m} + b_{d}\right)}} \tag{2}$$

$$X^{n}_{d,\mathrm{norm}} = \frac{X^{n}_{d} - X_{\min}}{X_{\max} - X_{\min}} \tag{3}$$

where $W_{m,d}$ and $b_{d}$ represent the weights and biases between the last hidden layer and output layer, respectively, with $M$ denoting the last hidden layer size. $S^{n}_{m}$ is the output of the $m$-th neuron of the last hidden layer at frame index $n$. $X_{\max}$ and $X_{\min}$ denote the maximum and minimum values of the spectral feature among all target utterances, respectively. In our experiments, there is no obvious difference between using a dimension-dependent min/max value vector and using the min/max value calculated among all dimensions in Eq. (3).

Fig. 2. Plots of normalized log-power spectra of a target frame with HWW-DNN (in dash curve and line) and the proposed DNN (in solid curve and line). “normalized” and “bias” denote each spectrum and its mean, respectively.
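The sigmoid output layer, the min-max normalization, and the squared-error objective of Eqs. (1) to (3) can be sketched in NumPy as follows. The toy dimensions, the random data, and the choice to sum the error over feature dimensions and average it over the mini-batch are assumptions made for illustration.

```python
import numpy as np


def min_max_normalize(targets, x_min, x_max):
    """Eq. (3): map target features into the range [0, 1]."""
    return (targets - x_min) / (x_max - x_min)


def sigmoid_output(hidden, weights, biases):
    """Eq. (2): sigmoid output layer on top of the last hidden layer.

    hidden has shape (N, M), weights (M, D), and biases (D,).
    """
    pre_activation = hidden @ weights + biases
    return 1.0 / (1.0 + np.exp(-pre_activation))


def mse_objective(outputs, normalized_targets):
    """Eq. (1): squared error summed over D, averaged over the N frames."""
    num_frames = outputs.shape[0]
    return np.sum((outputs - normalized_targets) ** 2) / num_frames


rng = np.random.default_rng(0)
N, M, D = 4, 3, 5                       # toy mini-batch, hidden, and output sizes
targets = rng.normal(-5.0, 3.0, (N, D))  # random stand-in for target LPS
hidden = rng.normal(size=(N, M))
weights = np.zeros((M, D))               # zero weights give outputs of 0.5
biases = np.zeros(D)

# Global (dimension-independent) minimum and maximum over all targets.
normalized_targets = min_max_normalize(targets, targets.min(), targets.max())
outputs = sigmoid_output(hidden, weights, biases)
error = mse_objective(outputs, normalized_targets)
```

Since the sigmoid confines every output to the open interval (0, 1), the targets must be squeezed into the same range, which is the dynamic range limitation the text goes on to discuss.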

The dash curve in Fig. 2 illustrates the normalized log-power spectrum of a target frame, $X_{d,\mathrm{norm}}$, using Eq. (3). We refer to the DNN obtained here as HWW-DNN. It can be seen that the dynamic range here is small, resulting in blurred harmonics and thus preventing an accurate restoration of the estimated anechoic spectrogram. In addition, the DNN also needs to learn the non-zero bias of the target feature vectors. This scheme seriously degrades the dereverberation performances, especially when the training set size is small.

2) Linear Activation and Mean-Variance Normalization: To deal with the abovementioned drawbacks of the target feature mapping scheme discussed in Section II-A1, we propose to use a linear activation function at the output layer of the DNN and to globally normalize the target features over all the target utterances into zero mean and unit variance [21]. This is the popular mean variance normalization (MVN) strategy [24] commonly used in the speech recognition community. The DNN output and
