Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online)
E-mail: jzus@zju.edu.cn
Review:
Past review, current progress, and challenges ahead on the cocktail party problem*

Yan-min QIAN†‡1, Chao WENG1, Xuan-kai CHANG2, Shuai WANG2, Dong YU1

1 Tencent AI Lab, Tencent, Bellevue 98004, USA
2 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
‡ Corresponding author
† E-mail: yanminqian@tencent.com
* Project supported by the Tencent and Shanghai Jiao Tong University Joint Project
ORCID: Yan-min QIAN, http://orcid.org/0000-0002-0314-3790
© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Received Dec. 8, 2017; Revision accepted Jan. 17, 2018; Crosschecked Jan. 25, 2018
Abstract: The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple
speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of
automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the
last two decades in attacking this problem. We focus our discussions on the speech separation problem given
its central role in the cocktail party environment, and describe the conventional single-channel techniques such as
computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, the
conventional multi-channel techniques such as beamforming and multi-channel blind source separation, and the newly
developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and
permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker
identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using a more powerful model and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.
Key words: Cocktail party problem; Computational auditory scene analysis; Non-negative matrix factorization;
Permutation invariant training; Multi-talker speech processing
https://doi.org/10.1631/FITEE.1700814 CLC number: TP391.4
1 Introduction
Although the accuracy of automatic speech
recognition (ASR) systems has surpassed the thresh-
old for adoption for many real-world applications
(Hinton et al., 2012; Abdel-Hamid et al., 2014;
Yu and Deng, 2014; Bi et al., 2015; Peddinti et al.,
2015; Sainath et al., 2015; Qian et al., 2016;
Sercu et al., 2016; Xiong et al., 2016; Yu and Li,
2017), there are still difficulties to be solved to
make ASR systems more robust and more widely
deployed (Qian et al., 2018). The cocktail party
problem, i.e., tracing and recognizing the speech
from a specific speaker when multiple speakers talk
simultaneously and when other background noise is
involved, is one such problem. The cocktail party
problem has been widely observed. Solving it could
enable many scenarios and applications, such as
meeting transcription, multi-party human–machine
interaction, and hearing impairment assistants,
where overlapped speech cannot be ignored.
There is a long history of research on the cock-
tail party problem (Cherry, 1953; Wang and Brown,
2006; Kolbæk et al., 2017a; Yu et al., 2017b). Al-
though the processing mechanisms seem clear and
related tasks are easy for humans, researchers
have found it surprisingly difficult to give ma-
chines the same ability. Although many approaches
were proposed and attempted in the early days,
including those based on signal processing tech-
niques (Ephraim and Malah, 1985; Hu and Loizou,
2007, 2008), computational auditory scene analy-
sis (CASA) (Brown and Cooke, 1994; Ellis, 1996;
Wang and Brown, 2006), non-negative matrix fac-
torization (NMF) (Raj et al., 2010; Schuller et al.,
2010; Chen et al., 2014), and microphone array
techniques (Fischer and Simmer, 1996; Kellermann,
1997; Anguera et al., 2007; Benesty et al., 2007), few of these approaches achieved robust performance
with a high separation quality, especially when only
a single channel of the mixed signal is available or
the speakers are facing the same direction.
Inspired by the great success of deep learn-
ing in speech recognition (Sainath et al., 2013;
Xiong et al., 2016; Yu et al., 2016) and speaker
identification (Lei et al., 2014; Variani et al., 2014;
Liu et al., 2015), deep learning-based techniques
have been developed recently to address the cock-
tail party problem. These new techniques signifi-
cantly outperform the conventional approaches, and
performance improvements are particularly impres-
sive for recent techniques such as deep clustering
(DPCL) (Hershey et al., 2016), the deep attractor
network (DANet) (Chen et al., 2017b), and permu-
tation invariant training (PIT) (Yu et al., 2017a,b).
The preliminary success ignites new hope and pro-
vides important stepping stones towards eventually
solving the cocktail party problem.
This paper aims to provide a comprehensive sur-
vey of the popular and effective solutions to the
cocktail party problem developed in the past two
decades. We focus on the recent progress achieved
with deep learning technologies and the remaining
difficulties and challenges ahead. We hope this sur-
vey can help readers become familiar with this active
research area, and gain insights into the possible re-
search directions for addressing this interesting and
important problem.
2 Cocktail party problem
Natural auditory environments, such as cock-
tail parties, usually contain many concurrently ex-
isting sounds, including speech signals from multi-
ple speakers and other sounds such as music and
instruments. The cocktail party problem is the
task of separating these mixed sounds and paying
attention to only one or two sounds of interest, of-
ten speech signals, in such complex auditory envi-
ronments (Fig. 1). The cocktail party problem is
quite interesting yet difficult to solve. Although
there is a long history of research on how humans
behave in the cocktail party environment and many
attempts have been made to develop computer al-
gorithms to match a machine’s ability to that of
humans in such environments, the cocktail party
problem remains a challenge to be solved to en-
able a truly free conversation between humans and
computers.
Fig. 1 A typical cocktail party scene (image from
Daniel Hagerman: High Society Cocktail Party—End
of Prohibition, 1933)
Although the cocktail party problem is difficult
for computers, it seems to be easy for humans. Hu-
mans can separate a signal consisting of multiple
sources and attend to and recognize a single source
(Mesgarani and Chang, 2012; Chen, 2017). For in-
stance, at a typical cocktail party, people can easily
concentrate on the speech of the conversational talk-
ers, the song from the singers, or the melody from the
musical instruments. Mesgarani and Chang (2012)
conducted research on the cortical representation of multi-talker mixed speech, and concluded that the human auditory system restores the representation of the speaker of interest while suppressing irrelevant
competing speech. In fact, this ability exists in not
only humans but also other species. For example,
animals can easily identify the sounds from mates or
enemies in crowded environments where many ani-
mals vocalize at the same time (McDermott, 2009).
To match a computer’s ability to that of hu-
mans and animals in the cocktail party environment,
we need to attack two distinct challenges. The first
challenge is how to separate sounds from the mixed
signal, which is the sum of all sounds in the com-
plex auditory scene. Humans are typically interested
in and capable of concentrating on only one or two
sound sources at the same time and thus need only
to separate these sounds from the mixture. How-
ever, computers can multi-task, and thus it is desir-
able to separate all sound sources from the mixture.
The second challenge, which is very important in
multi-talker conversation, is how to trace and hold
attention to the sound source of interest and switch
attention among sources. In most cases, these two
challenges are intertwined: the attention to the tar-
get source of interest can benefit from good sepa-
ration and the separation can benefit from speaker
tracing.
The term ‘cocktail party problem’ was coined in
Cherry’s classic paper (Cherry, 1953). This paper
studied whether humans can select one speech signal
over another, whether they retain anything about the
non-selected signal, and how they can switch their
attention between signals. About four decades later,
Bregman (1990) began studying sound segregation,
termed ‘auditory scene analysis’. In fact, most of the
past and current work on the cocktail party prob-
lem focused on the first challenge (Du et al., 2014;
Xu et al., 2014; Wang et al., 2014; Weninger et al.,
2015; Chen, 2017), i.e., sound segregation, which is
also the main focus of this paper.
To evaluate the performance of the solu-
tion to the cocktail party problem, many met-
rics have been proposed to measure the ability
of sound separation (usually speech separation)
and target source attention (usually target speaker
tracing). For example, for the speech separa-
tion task, the metrics for speech quality, such as
perceptual evaluation of speech quality (PESQ)
(Rix et al., 2001), source-to-noise ratio (SNR),
source-to-distortion ratio (SDR), source-to-artifacts
ratio (SAR) (Vincent et al., 2006), and short-time
objective intelligibility (STOI) (Taal et al., 2010),
are commonly used. In some scenarios, the perfor-
mance measurement is task dependent. For example,
in the multi-talker speech recognition task, speech
separation is just an intermediate step and the essen-
tial metric of the system is the recognition accuracy
measured with, e.g., the word error rate (WER).
In the multi-talker speaker identification task, the
equal error rate (EER) is often used to evaluate
the performance of the solution in the cocktail party
environment.
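As a concrete illustration of what such separation metrics measure, the following is a minimal sketch of a scale-invariant signal-to-distortion ratio (SI-SDR), a simplified projection-based variant of SDR that is widely used in recent separation work; the function name and this particular definition are illustrative assumptions rather than the exact BSS Eval metrics of Vincent et al. (2006).

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference
    (removing any gain mismatch) and compare the energy of the projected
    target with the energy of the residual."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((target @ target) / (residual @ residual + eps))

# Example: a lightly corrupted copy of the reference scores high,
# while an unrelated noise signal scores far lower.
# rng = np.random.default_rng(0)
# ref = rng.standard_normal(16000)
# print(si_sdr(ref, ref + 0.1 * rng.standard_normal(16000)))
# print(si_sdr(ref, rng.standard_normal(16000)))
```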
Although researchers have not achieved a solu-
tion yet, many technologies have been proposed to
attack the cocktail party problem over the past two
decades. In Sections 3–7 we will review the most
popular ones.
3 Conventional single-channel tech-
niques
3.1 Computational auditory scene analysis
Although speech separation has proved to be
difficult for computers, it is remarkably easy for the
human auditory system. An obvious idea is to study
how humans separate speech and learn from them.
CASA follows this idea exactly.
In psychoacoustic research, the perceptual pro-
cess of separating mixtures of sound sources is called
‘auditory scene analysis (ASA)’ (Bregman, 1990).
Research in ASA has inspired CASA (Hu and Wang,
2004; Wang, 2005; Wang and Brown, 2006), in which
certain segmentation rules based on perceptual
grouping cues are (often semi-manually) designed
to operate on low-level features to estimate a time–
frequency (T-F) mask that isolates the signal com-
ponents belonging to different speakers. This mask
is then used to reconstruct the signal. For exam-
ple, natural speech contains both voiced and un-
voiced portions, and voiced portions account for
about 75%–80% of spoken English (Hu and Wang,
2008). Because voiced speech is characterized by
periodicity (or harmonicity), harmonicity has been
used as a primary cue in many CASA systems for
segregating voiced speech (Brown and Cooke, 1994).
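To make the notion of a T-F mask concrete, the sketch below builds an oracle (ideal) binary mask from reference spectrograms and applies it to the mixture; a real CASA system estimates such a mask from grouping cues such as harmonicity rather than from references, and the STFT parameters and function name here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def oracle_binary_mask_separation(mixture, references, fs=16000, nperseg=512):
    """Assign each T-F bin of the mixture to the source whose reference
    magnitude dominates that bin (an ideal binary mask), then invert each
    masked spectrogram back to a waveform."""
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)
    mags = np.stack([np.abs(stft(r, fs=fs, nperseg=nperseg)[2]) for r in references])
    winner = np.argmax(mags, axis=0)                  # dominant source per T-F bin
    separated = []
    for i in range(len(references)):
        mask = (winner == i).astype(float)            # binary T-F mask for source i
        _, x_hat = istft(Y * mask, fs=fs, nperseg=nperseg)
        separated.append(x_hat)
    return separated
```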
Although CASA was proposed more than a
decade ago, techniques based on the same principles
are still being developed. Hu and Wang (2010) used
a tandem algorithm to generate multiple simultane-
ous speech streams, and then grouped them sequen-
tially by maximizing a joint speaker recognition score
where speakers are described with Gaussian mixture
models (GMMs). Hu and Wang (2013) proposed to
use the information from a co-channel signal to im-
prove the segmentation and grouping in CASA. An
input scene is decomposed into T-F segments, each
of which originates primarily from a single sound
source. Grouping selectively aggregates segments to
form streams corresponding to sound sources. Both
simultaneous and sequential grouping techniques are
used. Simultaneous grouping organizes sound com-
ponents across frequencies to produce simultaneous
streams, and sequential grouping links them across
time to form final sound streams.
Although CASA simulates the high-level behav-
ior of human listening, it suffers from many draw-
backs. First, it works only on speech and may fail in the broader perspective of audio source separation. Second, most of the rules are manually de-
signed based on a limited number of observations
and generalize poorly. Third, since the final separa-
tion is based on T-F segmentation (i.e., each T-F bin
belongs to only one sound source), the best possi-
ble result is agreement with the oracle binary mask,
which has been shown to be suboptimal in most sce-
narios (Wang, 2005; Kjems et al., 2009). Fourth, the
entire system heavily depends on the accuracy of
the pitch tracker, which is not robust under complex
acoustic conditions. Fifth, it is limited because it
cannot learn from data automatically.
3.2 Non-negative matrix factorization
In CASA, the T-F bins are grouped together
based mainly on the hand-designed rules from hu-
man observations. To discover the complex inherent characteristics from data, data-driven methods were proposed. NMF (Lee and Seung, 2001), along
with other matrix decomposition models, was built
based on the assumption that the audio spectrogram
has a low rank structure that can be represented
with a small number of bases. Under certain con-
ditions, the decomposition in NMF is unique and
no other orthogonality or independence assumptions
are needed. Specifically, in NMF,

Y = \sum_{s} W_s H_s,   (1)

where each source s is modeled by a low-rank approximation with the non-negative matrices W_s and H_s, and these per-source approximations are then summed to form the mixture Y. Because of the non-negativity of the decomposition matrices, there is no cancellation between sources in the reconstruction of the mixture spectrum Y, which models the additivity between the mixed sources.
Fig. 2  The training phase, where a dictionary set is learned for each individual source, and the testing phase, where the activations are inferred by non-negative matrix factorization and then used to reconstruct the source signals given the dictionaries and the testing data

Fig. 2 illustrates the basic NMF process. In the training stage, each clean source, e.g., speech, noise, and music, is decomposed and mapped into a set of bases and activations, and a source-specific
dictionary W is formed. During the testing stage, all
the source-specific dictionaries learned are merged
into a combined dictionary. This combined dictio-
nary is fixed and only activation H is optimized for
each source, in which case the optimization is convex
and a global optimum can be achieved. Each source
in the mixture is then reconstructed by the bases
and the corresponding activations. The basic NMF
algorithm is
\min_{W,H} D(Y \,\|\, WH),   (2)
s.t. \; W, H \geq 0,   (3)

where D(\cdot\,\|\,\cdot) denotes a divergence such as the Euclidean distance or the generalized Kullback-Leibler divergence.
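As an illustration of the train/test procedure described above, the following is a minimal NumPy sketch of Euclidean NMF with the standard multiplicative updates: a dictionary is learned per clean source, the dictionaries are then concatenated and held fixed, and only the activations are inferred on the mixture before each source is reconstructed with a soft, Wiener-like mask. The function names, the number of bases, and the masking step are illustrative choices, not the exact recipe of any of the methods cited above.

```python
import numpy as np

def train_dictionary(V, n_bases=40, n_iter=200, eps=1e-10):
    """Learn a non-negative dictionary W for one clean-source magnitude
    spectrogram V (freq x time) by minimizing ||V - WH||^2 with the
    standard multiplicative updates (Lee and Seung, 2001)."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], n_bases)) + eps
    H = rng.random((n_bases, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W

def separate(Y, dictionaries, n_iter=200, eps=1e-10):
    """Hold the concatenated dictionary fixed, infer activations H for the
    mixture magnitude spectrogram Y, and reconstruct each source with a
    soft (Wiener-like) mask."""
    W = np.concatenate(dictionaries, axis=1)
    rng = np.random.default_rng(1)
    H = rng.random((W.shape[1], Y.shape[1])) + eps
    for _ in range(n_iter):                      # only H is updated here
        H *= (W.T @ Y) / (W.T @ W @ H + eps)
    estimates, start = [], 0
    for Ws in dictionaries:
        stop = start + Ws.shape[1]
        Vs = Ws @ H[start:stop]                  # per-source approximation
        estimates.append(Vs / (W @ H + eps) * Y) # redistribute mixture energy
        start = stop
    return estimates
```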
Several variations of the NMF methods have been
proposed. For example, the sparse NMF (Hoyer,
2004; Schmidt and Olsson, 2006; Virtanen, 2007)
forces activation H to be sparse. In the convolu-
tional NMF (Behnke, 2003; Bello, 2010; Chen et al.,
2014), the spectrogram is decomposed into the con-
volution (instead of multiplication) of the basis and
the activation. The robust NMF (Zhang et al., 2011;
Chen and Ellis, 2013) combines NMF with robust
principal component analysis.
The success of NMF is limited by several factors. First, it is limited by what the bases can represent: other attributes
and regularities (e.g., temporal dynamics) of speech
signals are not exploited. Second, the power of the
model is limited by its linear system formulation,
which prevents it from achieving a high separation
quality. Third, the decomposition during testing is computationally expensive, limiting its application
in real-time scenarios. Fourth, the size of the model
parameters is determined by, and increases linearly
with, the number of clean sources in the training set.
This limits its effectiveness in exploiting a large
training set. Fifth, during testing, each source has
to have a dictionary learned during the training stage
(i.e., the source is included in the training set), which
is not feasible in most real-world applications.
3.3 Generative models
NMF cannot model temporal dynamics. To
address this limitation, several studies have
been conducted (Kristjansson et al., 2006; Virtanen,
2006; Hershey et al., 2007; Cooke et al., 2010;
Hershey et al., 2010; Rennie et al., 2010), most of
which are based on the Gaussian mixture model-
hidden Markov model (GMM-HMM) framework,
a popular generative model in single-talker speech
recognition. Among all these GMM-HMM sepa-
ration models, the factorial hidden Markov model
(FHMM) (Ghahramani and Jordan, 1996) is the
most interesting and performs best. In FHMM, each
source signal is modeled with an HMM trained on
the data for that source. For each signal source s, if we define the clean signal as \{x^s_t\} (t \in \{1, 2, \cdots, T\}), the hidden states as \{v^s_t\}, and the discrete mixture states as \{m^s_t\}, the HMM has the characteristics

p(v^s_t \mid v^s_{1:t-1}) = p(v^s_t \mid v^s_{t-1}),   (4)

p(x^s_t \mid v^s_{1:T}) = p(x^s_t \mid v^s_t) = \sum_{m^s_t} p(x^s_t \mid m^s_t)\, p(m^s_t \mid v^s_t),   (5)
where Eq. (4) describes the transition probability and Eq. (5) describes the observation probability under the Markov independence assumption. Given the mixed signal \{y_t\} of S signal sources, the new generative model, called the 'interaction model', can be defined as

p(\{y_t\}, \{x_t\}, \{m_t\}, \{v_t\}) = \prod_{t=1}^{T} p(y_t \mid \{x^s_t\}) \cdot \prod_{t=1}^{T} \prod_{s=1}^{S} p(x^s_t \mid m^s_t)\, p(m^s_t \mid v^s_t)\, p(v^s_t \mid v^s_{t-1}),   (6)

where \{x^s_t\} is not observable.
The process of inferring the hidden state sequence \{\hat{v}^s_t\} for each source s under the maximum a posteriori (MAP) criterion requires computing p(y_t \mid \{v^s_t\}) as

p(y_t \mid \{v^s_t\}) = \sum_{m^1_t, m^2_t, \cdots, m^S_t} p(y_t \mid \{m^i_t\}) \prod_{s} p(m^s_t \mid v^s_t) = \sum_{\{m^i_t\}} p(y_t \mid \{m^i_t\}) \prod_{s} p(m^s_t \mid v^s_t),   (7)

where

p(y_t \mid \{m^s_t\}) = \int \cdots \int p(y_t, \{x^s_t\} \mid \{m^s_t\})\, \mathrm{d}x^1_t\, \mathrm{d}x^2_t \cdots \mathrm{d}x^S_t.   (8)

Note that p(y_t \mid \{v^s_t\}) does not factor over the speakers, so the exact MAP state sequences of the speakers must be jointly estimated.
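To make the joint-estimation statement concrete, the following is a minimal sketch of exact MAP decoding for two sources via Viterbi search over the product state space, assuming that the joint acoustic log-likelihoods log p(y_t | v^1_t, v^2_t) have already been computed; the array shapes and the function name are illustrative assumptions, and the cost of this brute-force search over the product space is exactly why the approximations discussed below are needed.

```python
import numpy as np

def joint_viterbi(log_obs, log_trans1, log_trans2, log_init1, log_init2):
    """Exact joint MAP decoding for two sources: Viterbi over the product
    state space (v1, v2), because p(y_t | v1_t, v2_t) does not factor.
    log_obs:    (T, N1, N2) joint observation log-likelihoods
    log_trans*: (N*, N*)    per-source transition log-probabilities
    log_init*:  (N*,)       per-source initial log-probabilities"""
    T, N1, N2 = log_obs.shape
    # Joint transition and initial scores over the product space (index = v1 * N2 + v2).
    log_trans = (log_trans1[:, None, :, None]
                 + log_trans2[None, :, None, :]).reshape(N1 * N2, N1 * N2)
    delta = (log_init1[:, None] + log_init2[None, :]).reshape(-1) + log_obs[0].reshape(-1)
    back = np.zeros((T, N1 * N2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans                    # (previous, current)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N1 * N2)] + log_obs[t].reshape(-1)
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                              # backtrace
        path[t - 1] = back[t, path[t]]
    return np.unravel_index(path, (N1, N2))                    # (v1 sequence, v2 sequence)
```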
To reconstruct the features of source s at time t, the posterior expected value needs to be computed as

E(x^s_t \mid y_t, \{\hat{v}^i_t\}) = \sum_{\{m^i_t\}} p(\{m^i_t\} \mid y_t, \{\hat{v}^i_t\}) \cdot E(x^s_t \mid y_t, \{m^i_t\}),   (9)

where

E(x^s_t \mid y_t, \{m^i_t\}) = \int \cdots \int x^s_t\, p(\{x^i_t\} \mid y_t, \{m^i_t\})\, \mathrm{d}x^1_t\, \mathrm{d}x^2_t \cdots \mathrm{d}x^S_t.   (10)
The computation process is very complicated and intractable because all these estimates are coupled over the states of the speakers. Several approximations for the interaction function have been developed to allow the integral in Eq. (10) to be computed analytically. The computation process can be divided into two parts, i.e., computing the acoustic state likelihoods p(y_t \mid \{m^s_t\}) and combining these likelihoods to infer the MAP configuration of the dynamic state variables \{\hat{v}^s_t\}. The former part includes approximations using the log-sum model and the max model, and the latter part includes loopy belief propagation.
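As a small illustration of the max-model approximation mentioned above, the sketch below evaluates the acoustic likelihood of a mixed log-spectrum for one pair of acoustic states, under the common assumptions that the mixture log-spectrum is the element-wise maximum of the two source log-spectra and that each state is a diagonal Gaussian; this two-speaker, single-Gaussian-per-state version is a simplification of the full FHMM machinery, and the function name is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

def max_model_loglik(y, mu1, sd1, mu2, sd2, eps=1e-300):
    """log p(y | m1, m2) under the max model y = max(x1, x2): per frequency
    bin, either source 1 explains the observation while source 2 lies below
    it, or vice versa:
        p(y) = N(y; mu1, sd1) * Phi((y - mu2) / sd2)
             + N(y; mu2, sd2) * Phi((y - mu1) / sd1).
    All arguments are vectors over frequency bins (diagonal Gaussian states)."""
    p = (norm.pdf(y, mu1, sd1) * norm.cdf((y - mu2) / sd2)
         + norm.pdf(y, mu2, sd2) * norm.cdf((y - mu1) / sd1))
    return float(np.sum(np.log(p + eps)))
```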
Table 1 compares FHMM with other conven-
tional techniques on the 2006 two-talker speech
separation and recognition challenge (SSC) task
(Cooke et al., 2010). All generative models outper-
form CASA and NMF. Among the generative mod-
els, FHMM (Hershey et al., 2010) performs the best