Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online)
E-mail: jzus@zju.edu.cn
Review:
Past review, current progress, and challenges ahead on the cocktail party problem*

Yan-min QIAN†‡1, Chao WENG1, Xuan-kai CHANG2, Shuai WANG2, Dong YU1

1 Tencent AI Lab, Tencent, Bellevue 98004, USA
2 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
‡ Corresponding author
† E-mail: yanminqian@tencent.com
* Project supported by the Tencent and Shanghai Jiao Tong University Joint Project
ORCID: Yan-min QIAN, http://orcid.org/0000-0002-0314-3790
© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Received Dec. 8, 2017; Revision accepted Jan. 17, 2018; Crosschecked Jan. 25, 2018
Abstract: The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple
speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of
automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the
last two decades in attacking this problem. We focus our discussions on the speech separation problem given
its central role in the cocktail party environment, and describe the conventional single-channel techniques such as
computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, the
conventional multi-channel techniques such as beamforming and multi-channel blind source separation, and the newly
developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and
permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker
identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using a more powerful model and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.
Key words: Cocktail party problem; Computational auditory scene analysis; Non-negative matrix factorization;
Permutation invariant training; Multi-talker speech processing
https://doi.org/10.1631/FITEE.1700814 CLC number: TP391.4
1 Introduction
Although the accuracy of automatic speech
recognition (ASR) systems has surpassed the thresh-
old for adoption for many real-world applications
(Hinton et al., 2012; Abdel-Hamid et al., 2014;
Yu and Deng, 2014; Bi et al., 2015; Peddinti et al.,
2015; Sainath et al., 2015; Qian et al., 2016;
Sercu et al., 2016; Xiong et al., 2016; Yu and Li,
2017), there are still difficulties to be solved to
make ASR systems more robust and more widely
deployed (Qian et al., 2018). The cocktail party
problem, i.e., tracing and recognizing the speech
from a specific speaker when multiple speakers talk
simultaneously and when other background noise is
involved, is one such problem. The cocktail party
problem has been widely observed. Solving it could
enable many scenarios and applications, such as
meeting transcription, multi-party human–machine
interaction, and hearing impairment assistants,
where overlapped speech cannot be ignored.
There is a long history of research on the cock-
tail party problem (Cherry, 1953; Wang and Brown,
2006; Kolbæk et al., 2017a; Yu et al., 2017b). Al-
though the processing mechanisms seem clear and
related tasks are easy for humans, researchers
have found it surprisingly difficult to give ma-
chines the same ability. Although many approaches
were proposed and attempted in the early days,
including those based on signal processing tech-
niques (Ephraim and Malah, 1985; Hu and Loizou,
2007, 2008), computational auditory scene analy-
sis (CASA) (Brown and Cooke, 1994; Ellis, 1996;
Wang and Brown, 2006), non-negative matrix fac-
torization (NMF) (Raj et al., 2010; Schuller et al.,
2010; Chen et al., 2014), and microphone array
techniques (Fischer and Simmer, 1996; Kellermann,
1997; Anguera et al., 2007; Benesty et al., 2007), few of these approaches achieved robust performance
with a high separation quality, especially when only
a single channel of the mixed signal is available or
the speakers are facing the same direction.
Inspired by the great success of deep learn-
ing in speech recognition (Sainath et al., 2013;
Xiong et al., 2016; Yu et al., 2016) and speaker
identification (Lei et al., 2014; Variani et al., 2014;
Liu et al., 2015), deep learning-based techniques
have been developed recently to address the cock-
tail party problem. These new techniques signifi-
cantly outperform the conventional approaches, and
performance improvements are particularly impres-
sive for recent techniques such as deep clustering
(DPCL) (Hershey et al., 2016), the deep attractor
network (DANet) (Chen et al., 2017b), and permu-
tation invariant training (PIT) (Yu et al., 2017a,b).
The preliminary success ignites new hope and pro-
vides important stepping stones towards eventually
solving the cocktail party problem.
This paper aims to provide a comprehensive sur-
vey of the popular and effective solutions to the
cocktail party problem developed in the past two
decades. We focus on the recent progress achieved
with deep learning technologies and the remaining
difficulties and challenges ahead. We hope this sur-
vey can help readers become familiar with this active
research area, and gain insights into the possible re-
search directions for addressing this interesting and
important problem.
2 Cocktail party problem
Natural auditory environments, such as cock-
tail parties, usually contain many concurrently ex-
isting sounds, including speech signals from multi-
ple speakers and other sounds such as music and
instruments. The cocktail party problem is the
task of separating these mixed sounds and paying
attention to only one or two sounds of interest, of-
ten speech signals, in such complex auditory envi-
ronments (Fig. 1). The cocktail party problem is
quite interesting yet difficult to solve. Although
there is a long history of research on how humans
behave in the cocktail party environment and many
attempts have been made to develop computer al-
gorithms to match a machine’s ability to that of
humans in such environments, the cocktail party
problem remains a challenge to be solved to en-
able a truly free conversation between humans and
computers.
Fig. 1 A typical cocktail party scene (image from
Daniel Hagerman: High Society Cocktail Party—End
of Prohibition, 1933)
Although the cocktail party problem is difficult
for computers, it seems to be easy for humans. Hu-
mans can separate a signal consisting of multiple
sources and attend to and recognize a single source
(Mesgarani and Chang, 2012; Chen, 2017). For in-
stance, at a typical cocktail party, people can easily
concentrate on the speech of the conversational talk-
ers, the song from the singers, or the melody from the
musical instruments. Mesgarani and Chang (2012)
conducted research on the cortical representation of multi-talker mixed speech, and concluded that the human auditory system restores the representation of the speaker of interest while suppressing irrelevant
competing speech. In fact, this ability exists in not
only humans but also other species. For example,
animals can easily identify the sounds from mates or
enemies in crowded environments where many ani-
mals vocalize at the same time (McDermott, 2009).
To match a computer’s ability to that of hu-
mans and animals in the cocktail party environment,
we need to attack two distinct challenges. The first
challenge is how to separate sounds from the mixed
signal, which is the sum of all sounds in the com-
plex auditory scene. Humans are typically interested
in and capable of concentrating on only one or two
sound sources at the same time and thus need only
to separate these sounds from the mixture. How-
ever, computers can multi-task, and thus it is desir-
able to separate all sound sources from the mixture.
The second challenge, which is very important in
multi-talker conversation, is how to trace and hold
attention to the sound source of interest and switch
attention among sources. In most cases, these two
challenges are intertwined: the attention to the tar-
get source of interest can benefit from good sepa-
ration and the separation can benefit from speaker
tracing.
The term ‘cocktail party problem’ was coined in
Cherry’s classic paper (Cherry, 1953). This paper
studied whether humans can select one speech signal
over another, whether they retain anything about the
non-selected signal, and how they can switch their
attention between signals. About four decades later,
Bregman (1990) began studying sound segregation,
termed ‘auditory scene analysis’. In fact, most of the
past and current work on the cocktail party prob-
lem focused on the first challenge (Du et al., 2014;
Xu et al., 2014; Wang et al., 2014; Weninger et al.,
2015; Chen, 2017), i.e., sound segregation, which is
also the main focus of this paper.
To evaluate the performance of the solu-
tion to the cocktail party problem, many met-
rics have been proposed to measure the ability
of sound separation (usually speech separation)
and target source attention (usually target speaker
tracing). For example, for the speech separa-
tion task, the metrics for speech quality, such as
perceptual evaluation of speech quality (PESQ)
(Rix et al., 2001), source-to-noise ratio (SNR),
source-to-distortion ratio (SDR), source-to-artifacts
ratio (SAR) (Vincent et al., 2006), and short-time
objective intelligibility (STOI) (Taal et al., 2010),
are commonly used. In some scenarios, the perfor-
mance measurement is task dependent. For example,
in the multi-talker speech recognition task, speech
separation is just an intermediate step and the essen-
tial metric of the system is the recognition accuracy
measured with, e.g., the word error rate (WER).
In the multi-talker speaker identification task, the
equal error rate (EER) is often used to evaluate
the performance of the solution in the cocktail party
environment.
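As a concrete illustration of what such separation metrics measure, the following is a minimal sketch of a scale-invariant signal-to-distortion ratio (SI-SDR), a simplified projection-based variant of SDR that is widely used in recent separation work; the function name and this particular definition are illustrative assumptions rather than the exact BSS Eval metrics of Vincent et al. (2006).

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference
    (removing any gain mismatch) and compare the energy of the projected
    target with the energy of the residual."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((target @ target) / (residual @ residual + eps))

# Example: a lightly corrupted copy of the reference scores high,
# while an unrelated noise signal scores far lower.
# rng = np.random.default_rng(0)
# ref = rng.standard_normal(16000)
# print(si_sdr(ref, ref + 0.1 * rng.standard_normal(16000)))
# print(si_sdr(ref, rng.standard_normal(16000)))
```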
Although researchers have not achieved a solu-
tion yet, many technologies have been proposed to
attack the cocktail party problem over the past two
decades. In Sections 3–7 we will review the most
popular ones.
3 Conventional single-channel tech-
niques
3.1 Computational auditory scene analysis
Although speech separation has proved to be
difficult for computers, it is remarkably easy for the
human auditory system. An obvious idea is to study
how humans separate speech and learn from them.
CASA follows this idea exactly.
In psychoacoustic research, the perceptual pro-
cess of separating mixtures of sound sources is called
‘auditory scene analysis (ASA)’ (Bregman, 1990).
Research in ASA has inspired CASA (Hu and Wang,
2004; Wang, 2005; Wang and Brown, 2006), in which
certain segmentation rules based on perceptual
grouping cues are (often semi-manually) designed
to operate on low-level features to estimate a time–
frequency (T-F) mask that isolates the signal com-
ponents belonging to different speakers. This mask
is then used to reconstruct the signal. For exam-
ple, natural speech contains both voiced and un-
voiced portions, and voiced portions account for
about 75%–80% of spoken English (Hu and Wang,
2008). Because voiced speech is characterized by
periodicity (or harmonicity), harmonicity has been
used as a primary cue in many CASA systems for
segregating voiced speech (Brown and Cooke, 1994).
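To make the notion of a T-F mask concrete, the sketch below builds an oracle (ideal) binary mask from reference spectrograms and applies it to the mixture; a real CASA system estimates such a mask from grouping cues such as harmonicity rather than from references, and the STFT parameters and function name here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def oracle_binary_mask_separation(mixture, references, fs=16000, nperseg=512):
    """Assign each T-F bin of the mixture to the source whose reference
    magnitude dominates that bin (an ideal binary mask), then invert each
    masked spectrogram back to a waveform."""
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)
    mags = np.stack([np.abs(stft(r, fs=fs, nperseg=nperseg)[2]) for r in references])
    winner = np.argmax(mags, axis=0)                  # dominant source per T-F bin
    separated = []
    for i in range(len(references)):
        mask = (winner == i).astype(float)            # binary T-F mask for source i
        _, x_hat = istft(Y * mask, fs=fs, nperseg=nperseg)
        separated.append(x_hat)
    return separated
```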
Although CASA was proposed more than a
decade ago, techniques based on the same principles
are still being developed. Hu and Wang (2010) used
a tandem algorithm to generate multiple simultane-
ous speech streams, and then grouped them sequen-
tially by maximizing a joint speaker recognition score
where speakers are described with Gaussian mixture
models (GMMs). Hu and Wang (2013) proposed to
use the information from a co-channel signal to im-
prove the segmentation and grouping in CASA. An
input scene is decomposed into T-F segments, each
of which originates primarily from a single sound
source. Grouping selectively aggregates segments to
form streams corresponding to sound sources. Both
simultaneous and sequential grouping techniques are
used. Simultaneous grouping organizes sound com-
ponents across frequencies to produce simultaneous
streams, and sequential grouping links them across
time to form final sound streams.
Although CASA simulates the high-level behav-
ior of human listening, it suffers from many draw-
backs. First, it works only on speech and may fail in the broader perspective of audio source separation. Second, most of the rules are manually de-
signed based on a limited number of observations
and generalize poorly. Third, since the final separa-
tion is based on T-F segmentation (i.e., each T-F bin
belongs to only one sound source), the best possi-
ble result is agreement with the oracle binary mask,
which has been shown to be suboptimal in most sce-
narios (Wang, 2005; Kjems et al., 2009). Fourth, the
entire system heavily depends on the accuracy of
the pitch tracker, which is not robust under complex
acoustic conditions. Fifth, it is limited because it
cannot learn from data automatically.
3.2 Non-negative matrix factorization
In CASA, the T-F bins are grouped together
based mainly on the hand-designed rules from hu-
man observations. To discover the complex inherent characteristics from data, data-driven methods were proposed. NMF (Lee and Seung, 2001), along
with other matrix decomposition models, was built
based on the assumption that the audio spectrogram
has a low rank structure that can be represented
with a small number of bases. Under certain con-
ditions, the decomposition in NMF is unique and
no other orthogonality or independence assumptions
are needed. Specifically, in NMF,

Y = \sum_{s} W_s H_s,   (1)

where each source s is modeled by a low-rank approximation with the non-negative matrices W_s and H_s, and these per-source approximations are then summed to form the mixture Y. Because of the non-negativity of the decomposition matrices, there is no cancellation between sources in the reconstruction of the mixture spectrum Y, which models the additivity between the mixed sources.
Fig. 2  The training phase, where a dictionary set is learned for each individual source, and the testing phase, where the activations are inferred by non-negative matrix factorization and then used to reconstruct the source signals given the dictionaries and the testing data

Fig. 2 illustrates the basic NMF process. In the training stage, each clean source, e.g., speech, noise, and music, is decomposed and mapped into a set of bases and activations, and a source-specific
dictionary W is formed. During the testing stage, all
the source-specific dictionaries learned are merged
into a combined dictionary. This combined dictio-
nary is fixed and only activation H is optimized for
each source, in which case the optimization is convex
and a global optimum can be achieved. Each source
in the mixture is then reconstructed by the bases
and the corresponding activations. The basic NMF
algorithm is
\min_{W,H} D(Y \,\|\, WH),   (2)
s.t. \; W, H \geq 0,   (3)

where D(\cdot\,\|\,\cdot) denotes a divergence such as the Euclidean distance or the generalized Kullback-Leibler divergence.
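As an illustration of the train/test procedure described above, the following is a minimal NumPy sketch of Euclidean NMF with the standard multiplicative updates: a dictionary is learned per clean source, the dictionaries are then concatenated and held fixed, and only the activations are inferred on the mixture before each source is reconstructed with a soft, Wiener-like mask. The function names, the number of bases, and the masking step are illustrative choices, not the exact recipe of any of the methods cited above.

```python
import numpy as np

def train_dictionary(V, n_bases=40, n_iter=200, eps=1e-10):
    """Learn a non-negative dictionary W for one clean-source magnitude
    spectrogram V (freq x time) by minimizing ||V - WH||^2 with the
    standard multiplicative updates (Lee and Seung, 2001)."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], n_bases)) + eps
    H = rng.random((n_bases, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W

def separate(Y, dictionaries, n_iter=200, eps=1e-10):
    """Hold the concatenated dictionary fixed, infer activations H for the
    mixture magnitude spectrogram Y, and reconstruct each source with a
    soft (Wiener-like) mask."""
    W = np.concatenate(dictionaries, axis=1)
    rng = np.random.default_rng(1)
    H = rng.random((W.shape[1], Y.shape[1])) + eps
    for _ in range(n_iter):                      # only H is updated here
        H *= (W.T @ Y) / (W.T @ W @ H + eps)
    estimates, start = [], 0
    for Ws in dictionaries:
        stop = start + Ws.shape[1]
        Vs = Ws @ H[start:stop]                  # per-source approximation
        estimates.append(Vs / (W @ H + eps) * Y) # redistribute mixture energy
        start = stop
    return estimates
```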
Several variations of the NMF methods have been
proposed. For example, the sparse NMF (Hoyer,
2004; Schmidt and Olsson, 2006; Virtanen, 2007)
forces activation H to be sparse. In the convolu-
tional NMF (Behnke, 2003; Bello, 2010; Chen et al.,
2014), the spectrogram is decomposed into the con-
volution (instead of multiplication) of the basis and
the activation. The robust NMF (Zhang et al., 2011;
Chen and Ellis, 2013) combines NMF with robust
principal component analysis.
The success of NMF is limited by several factors. First, it is limited by what the bases can represent: other attributes
and regularities (e.g., temporal dynamics) of speech
signals are not exploited. Second, the power of the
model is limited by its linear system formulation,
which prevents it from achieving a high separation
quality. Third, the decomposition during testing is computationally expensive, limiting its application
in real-time scenarios. Fourth, the size of the model
parameters is determined by, and increases linearly
with, the number of clean sources in the training set.
This limits its effectiveness in exploiting a large
training set. Fifth, during testing, each source has
to have a dictionary learned during the training stage
(i.e., the source is included in the training set), which
is not feasible in most real-world applications.
3.3 Generative models
NMF cannot model temporal dynamics. To
address this limitation, several studies have
been conducted (Kristjansson et al., 2006; Virtanen,
2006; Hershey et al., 2007; Cooke et al., 2010;
Hershey et al., 2010; Rennie et al., 2010), most of
which are based on the Gaussian mixture model-
hidden Markov model (GMM-HMM) framework,
a popular generative model in single-talker speech
recognition. Among all these GMM-HMM sepa-
ration models, the factorial hidden Markov model
(FHMM) (Ghahramani and Jordan, 1996) is the
most interesting and performs best. In FHMM, each
source signal is modeled with an HMM trained on
the data for that source. For each signal source s, if we define the clean signal as \{x^s_t\} (t \in \{1, 2, \cdots, T\}), the hidden states as \{v^s_t\}, and the discrete mixture states as \{m^s_t\}, the HMM has the characteristics

p(v^s_t \mid v^s_{1:t-1}) = p(v^s_t \mid v^s_{t-1}),   (4)

p(x^s_t \mid v^s_{1:T}) = p(x^s_t \mid v^s_t) = \sum_{m^s_t} p(x^s_t \mid m^s_t)\, p(m^s_t \mid v^s_t),   (5)
where Eq. (4) describes the transition probability and Eq. (5) describes the observation probability under the Markov independence assumption. Given the mixed signal \{y_t\} of S signal sources, the new generative model, called the 'interaction model', can be defined as

p(\{y_t\}, \{x_t\}, \{m_t\}, \{v_t\}) = \prod_{t=1}^{T} p(y_t \mid \{x^s_t\}) \cdot \prod_{t=1}^{T} \prod_{s=1}^{S} p(x^s_t \mid m^s_t)\, p(m^s_t \mid v^s_t)\, p(v^s_t \mid v^s_{t-1}),   (6)

where \{x^s_t\} is not observable.
The process of inferring the hidden state sequence \{\hat{v}^s_t\} for each source s under the maximum a posteriori (MAP) criterion requires computing p(y_t \mid \{v^s_t\}) as

p(y_t \mid \{v^s_t\}) = \sum_{m^1_t, m^2_t, \cdots, m^S_t} p(y_t \mid \{m^i_t\}) \prod_{s} p(m^s_t \mid v^s_t) = \sum_{\{m^i_t\}} p(y_t \mid \{m^i_t\}) \prod_{s} p(m^s_t \mid v^s_t),   (7)

where

p(y_t \mid \{m^s_t\}) = \int \cdots \int p(y_t, \{x^s_t\} \mid \{m^s_t\})\, \mathrm{d}x^1_t\, \mathrm{d}x^2_t \cdots \mathrm{d}x^S_t.   (8)

Note that p(y_t \mid \{v^s_t\}) does not factor over the speakers, so the exact MAP state sequences of the speakers must be jointly estimated.
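To make the joint-estimation statement concrete, the following is a minimal sketch of exact MAP decoding for two sources via Viterbi search over the product state space, assuming that the joint acoustic log-likelihoods log p(y_t | v^1_t, v^2_t) have already been computed; the array shapes and the function name are illustrative assumptions, and the cost of this brute-force search over the product space is exactly why the approximations discussed below are needed.

```python
import numpy as np

def joint_viterbi(log_obs, log_trans1, log_trans2, log_init1, log_init2):
    """Exact joint MAP decoding for two sources: Viterbi over the product
    state space (v1, v2), because p(y_t | v1_t, v2_t) does not factor.
    log_obs:    (T, N1, N2) joint observation log-likelihoods
    log_trans*: (N*, N*)    per-source transition log-probabilities
    log_init*:  (N*,)       per-source initial log-probabilities"""
    T, N1, N2 = log_obs.shape
    # Joint transition and initial scores over the product space (index = v1 * N2 + v2).
    log_trans = (log_trans1[:, None, :, None]
                 + log_trans2[None, :, None, :]).reshape(N1 * N2, N1 * N2)
    delta = (log_init1[:, None] + log_init2[None, :]).reshape(-1) + log_obs[0].reshape(-1)
    back = np.zeros((T, N1 * N2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans                    # (previous, current)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N1 * N2)] + log_obs[t].reshape(-1)
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                              # backtrace
        path[t - 1] = back[t, path[t]]
    return np.unravel_index(path, (N1, N2))                    # (v1 sequence, v2 sequence)
```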
To reconstruct the features of source s at time t, the posterior expected value needs to be computed as

E(x^s_t \mid y_t, \{\hat{v}^i_t\}) = \sum_{\{m^i_t\}} p(\{m^i_t\} \mid y_t, \{\hat{v}^i_t\}) \cdot E(x^s_t \mid y_t, \{m^i_t\}),   (9)

where

E(x^s_t \mid y_t, \{m^i_t\}) = \int \cdots \int x^s_t\, p(\{x^i_t\} \mid y_t, \{m^i_t\})\, \mathrm{d}x^1_t\, \mathrm{d}x^2_t \cdots \mathrm{d}x^S_t.   (10)
The computation process is very complicated and intractable because all these estimates are coupled over the states of the speakers. Several approximations for the interaction function have been developed to allow the integral in Eq. (10) to be computed analytically. The computation process can be divided into two parts, i.e., computing the acoustic state likelihoods p(y_t \mid \{m^s_t\}) and combining these likelihoods to infer the MAP configuration of the dynamic state variables \{\hat{v}^s_t\}. The former part includes approximations using the log-sum model and the max model, and the latter part includes loopy belief propagation.
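As a small illustration of the max-model approximation mentioned above, the sketch below evaluates the acoustic likelihood of a mixed log-spectrum for one pair of acoustic states, under the common assumptions that the mixture log-spectrum is the element-wise maximum of the two source log-spectra and that each state is a diagonal Gaussian; this two-speaker, single-Gaussian-per-state version is a simplification of the full FHMM machinery, and the function name is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

def max_model_loglik(y, mu1, sd1, mu2, sd2, eps=1e-300):
    """log p(y | m1, m2) under the max model y = max(x1, x2): per frequency
    bin, either source 1 explains the observation while source 2 lies below
    it, or vice versa:
        p(y) = N(y; mu1, sd1) * Phi((y - mu2) / sd2)
             + N(y; mu2, sd2) * Phi((y - mu1) / sd1).
    All arguments are vectors over frequency bins (diagonal Gaussian states)."""
    p = (norm.pdf(y, mu1, sd1) * norm.cdf((y - mu2) / sd2)
         + norm.pdf(y, mu2, sd2) * norm.cdf((y - mu1) / sd1))
    return float(np.sum(np.log(p + eps)))
```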
Table 1 compares FHMM with other conven-
tional techniques on the 2006 two-talker speech
separation and recognition challenge (SSC) task
(Cooke et al., 2010). All generative models outper-
form CASA and NMF. Among the generative mod-
els, FHMM (Hershey et al., 2010) performs the best