Unsupervised Learning of Finite Mixture Models
Mario A.T. Figueiredo, Senior Member, IEEE, and Anil K. Jain, Fellow, IEEE
Abstract: This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify to the good performance of our approach.

Index Terms: Finite mixtures, unsupervised learning, model selection, minimum message length criterion, Bayesian methods, expectation-maximization algorithm, clustering.
1 INTRODUCTION

Finite mixtures are a flexible and powerful probabilistic modeling tool for univariate and multivariate data. The usefulness of mixture models in any area which involves the statistical modeling of data (such as pattern recognition, computer vision, signal and image analysis, machine learning) is currently widely acknowledged.
In statistical pattern recognition, finite mixtures allow a
formal (probabilistic model-based) approach to unsuper-
vised learning (i.e., clustering) [28], [29], [35], [37], [57]. In
fact, finite mixtures naturally model observations which are
assumed to have been produced by one (randomly selected
and unknown) of a set of alternative random sources.
Inferring (the parameters of) these sources and identifying
which source produced each observation leads to a
clustering of the set of observations. With this model-based
approach to clustering (as opposed to heuristic methods
like k-means or hierarchical agglomerative methods [28]),
issues like the selection of the number of clusters or the
assessment of the validity of a given model can be
addressed in a principled and formal way.
The usefulness of mixture models is not limited to
unsupervised learning applications. Mixture models are
able to represent arbitrarily complex probability density
functions (pdf's). This fact makes them an excellent choice
for representing complex class-conditional pdf's (i.e., like-
lihood functions) in (Bayesian) supervised learning scenar-
ios [25], [26], [55], or priors for Bayesian parameter
estimation [16]. Mixture models can also be used to perform
feature selection [43].
The standard method used to fit finite mixture models to
observed data is the expectation-maximization (EM) algorithm
[18], [36], [37], which converges to a maximum likelihood (ML)
estimate of the mixture parameters. However, the EM
algorithm for finite mixture fitting has several drawbacks: it
is a local (greedy) method, thus sensitive to initialization
because the likelihood function of a mixture model is not
unimodal; for certain types of mixtures, it may converge to the
boundary of the parameter space (where the likelihood is
unbounded) leading to meaningless estimates.
An important issue in mixture modeling is the selection
of the number of components. The usual trade-off in model
order selection problems arises: With too many compo-
nents, the mixture may over-fit the data, while a mixture
with too few components may not be flexible enough to
approximate the true underlying model.
In this paper, we deal simultaneously with the above
mentioned problems. We propose an inference criterion for
mixture models and an algorithm to implement it which:
1) automatically selects the number of components, 2) is less
sensitive to initialization than EM, and 3) avoids the
boundary of the parameter space.
Although most of the literature on finite mixtures focuses
on mixtures of Gaussian densities, many other types of
probability density functions have also been considered. The
approach proposed in this paper can be applied to any type of
parametric mixture model for which it is possible to write an
EM algorithm.
The rest of the paper is organized as follows: In Section 2, we
review finite mixture models and the EM algorithm; this is
standard material and our purpose is to introduce the
problem and define notation. In Section 3, we review
previous work on the problem of learning mixtures with an
unknown number of components and dealing with the
drawbacks of the EM algorithm. In Section 4, we describe
the proposed inference criterion, while the algorithm which
implements it is presented in Section 5. Section 6 reports
experimental results and Section 7 ends the paper by
presenting some concluding remarks.
• M.A.T. Figueiredo is with the Institute of Telecommunications and the Department of Electrical and Computer Engineering, Instituto Superior Técnico, 1049-001 Lisboa, Portugal. E-mail: mtf@lx.it.pt.
• A.K. Jain is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824. E-mail: jain@cse.msu.edu.
Manuscript received 5 July 2000; revised 8 Feb. 2001; accepted 30 July 2001.
Recommended for acceptance by W.T. Freeman.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number 112382.
0162-8828/02/$17.00 © 2002 IEEE
2 LEARNING FINITE MIXTURE MODELS
2.1 Finite Mixture Models
Let $\mathbf{Y} = [Y_1, \ldots, Y_d]^T$ be a $d$-dimensional random variable, with $\mathbf{y} = [y_1, \ldots, y_d]^T$ representing one particular outcome of $\mathbf{Y}$. It is said that $\mathbf{Y}$ follows a $k$-component finite mixture distribution if its probability density function can be written as
\[
p(\mathbf{y} \mid \theta) = \sum_{m=1}^{k} \alpha_m \, p(\mathbf{y} \mid \theta_m), \qquad (1)
\]
where $\alpha_1, \ldots, \alpha_k$ are the mixing probabilities, each $\theta_m$ is the set of parameters defining the $m$th component, and $\theta \equiv \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}$ is the complete set of parameters needed to specify the mixture. Of course, being probabilities, the $\alpha_m$ must satisfy
\[
\alpha_m \ge 0, \quad m = 1, \ldots, k, \quad \text{and} \quad \sum_{m=1}^{k} \alpha_m = 1. \qquad (2)
\]
In this paper, we assume that all the components have the
same functional form (for example, they are all d-variate
Gaussian), each one being thus fully characterized by the
parameter vector $\theta_m$. For detailed and comprehensive accounts on mixture models, see [35], [37], [57]; here, we simply review the fundamental ideas and define our notation.
Given a set of $n$ independent and identically distributed samples $\mathcal{Y} = \{\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(n)}\}$, the log-likelihood corresponding to a $k$-component mixture is
\[
\log p(\mathcal{Y} \mid \theta) = \log \prod_{i=1}^{n} p(\mathbf{y}^{(i)} \mid \theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{k} \alpha_m \, p(\mathbf{y}^{(i)} \mid \theta_m). \qquad (3)
\]
It is well known that the maximum likelihood (ML) estimate
\[
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \, \{\log p(\mathcal{Y} \mid \theta)\}
\]
cannot be found analytically. The same is true for the Bayesian maximum a posteriori (MAP) criterion,
\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, \{\log p(\mathcal{Y} \mid \theta) + \log p(\theta)\},
\]
given some prior $p(\theta)$ on the parameters. Of course, the maximizations defining the ML or MAP estimates are under the constraints in (2).
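For concreteness, the following Python sketch (not part of the original paper) evaluates the mixture log-likelihood (3) for the case of $d$-variate Gaussian components; the use of scipy.stats.multivariate_normal for the component densities $p(\mathbf{y} \mid \theta_m)$ is an illustrative assumption made here, not something prescribed by the text.

```python
import numpy as np
from scipy.stats import multivariate_normal


def mixture_log_likelihood(Y, alphas, means, covs):
    """Log-likelihood (3) of samples Y under a k-component Gaussian mixture.

    Y      : (n, d) array of samples y^(i)
    alphas : (k,) mixing probabilities satisfying the constraints in (2)
    means  : (k, d) component means (the theta_m, Gaussian case)
    covs   : (k, d, d) component covariance matrices
    """
    k = len(alphas)
    # p(y^(i) | theta_m) for every sample/component pair: an (n, k) array
    dens = np.column_stack([
        multivariate_normal.pdf(Y, mean=means[m], cov=covs[m])
        for m in range(k)
    ])
    # sum over components inside the log (Eq. (1)), then over the n samples
    return np.sum(np.log(dens @ alphas))
```

For a single component ($k = 1$), this reduces to the usual Gaussian log-likelihood, which provides a simple sanity check.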
2.2 The EM Algorithm
The usual choice for obtaining ML or MAP estimates of the
mixture parameters is the EM algorithm [18], [35], [36], [37].
EM is an iterative procedure which finds local maxima of
$\log p(\mathcal{Y} \mid \theta)$ or $\log p(\mathcal{Y} \mid \theta) + \log p(\theta)$. For the case of Gaussian
mixtures, the convergence behavior of EM is well studied
[37], [63]. It was recently shown that EM belongs to a class
of iterative methods called proximal point algorithms (PPA;
for an introduction to PPA and a comprehensive set of
references see [4], chapter 5) [13]. Seeing EM under this new
light opens the door to several extensions and general-
izations. An earlier related result, although without
identifying EM as a PPA, appeared in [41].
The EM algorithm is based on the interpretation of $\mathcal{Y}$ as incomplete data. For finite mixtures, the missing part is a set of $n$ labels $\mathcal{Z} = \{\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(n)}\}$ associated with the $n$ samples, indicating which component produced each sample. Each label is a binary vector $\mathbf{z}^{(i)} = [z_1^{(i)}, \ldots, z_k^{(i)}]$, where $z_m^{(i)} = 1$ and $z_p^{(i)} = 0$, for $p \ne m$, means that sample $\mathbf{y}^{(i)}$ was produced by the $m$th component. The complete log-likelihood (i.e., the one from which we could estimate $\theta$ if the complete data $\mathcal{X} = \{\mathcal{Y}, \mathcal{Z}\}$ was observed [36]) is
\[
\log p(\mathcal{Y}, \mathcal{Z} \mid \theta) = \sum_{i=1}^{n} \sum_{m=1}^{k} z_m^{(i)} \log\!\left[\alpha_m \, p(\mathbf{y}^{(i)} \mid \theta_m)\right]. \qquad (4)
\]
The EM algorithm produces a sequence of estimates $\{\hat{\theta}(t),\ t = 0, 1, 2, \ldots\}$ by alternatingly applying two steps (until some convergence criterion is met):

• E-step: Computes the conditional expectation of the complete log-likelihood, given $\mathcal{Y}$ and the current estimate $\hat{\theta}(t)$. Since $\log p(\mathcal{Y}, \mathcal{Z} \mid \theta)$ is linear with respect to the missing $\mathcal{Z}$, we simply have to compute the conditional expectation $\mathcal{W} \equiv E[\mathcal{Z} \mid \mathcal{Y}, \hat{\theta}(t)]$ and plug it into $\log p(\mathcal{Y}, \mathcal{Z} \mid \theta)$. The result is the so-called Q-function:
\[
Q(\theta, \hat{\theta}(t)) \equiv E\!\left[\log p(\mathcal{Y}, \mathcal{Z} \mid \theta) \,\middle|\, \mathcal{Y}, \hat{\theta}(t)\right] = \log p(\mathcal{Y}, \mathcal{W} \mid \theta). \qquad (5)
\]
Since the elements of $\mathcal{Z}$ are binary, their conditional expectations are given by
\[
w_m^{(i)} \equiv E\!\left[z_m^{(i)} \,\middle|\, \mathcal{Y}, \hat{\theta}(t)\right] = \Pr\!\left[z_m^{(i)} = 1 \,\middle|\, \mathbf{y}^{(i)}, \hat{\theta}(t)\right] = \frac{\hat{\alpha}_m(t)\, p(\mathbf{y}^{(i)} \mid \hat{\theta}_m(t))}{\sum_{j=1}^{k} \hat{\alpha}_j(t)\, p(\mathbf{y}^{(i)} \mid \hat{\theta}_j(t))}, \qquad (6)
\]
where the last equality is simply Bayes law ($\alpha_m$ is the a priori probability that $z_m^{(i)} = 1$, while $w_m^{(i)}$ is the a posteriori probability that $z_m^{(i)} = 1$, after observing $\mathbf{y}^{(i)}$).
• M-step: Updates the parameter estimates according to
\[
\hat{\theta}(t+1) = \arg\max_{\theta} \, \{Q(\theta, \hat{\theta}(t)) + \log p(\theta)\},
\]
in the case of MAP estimation, or
\[
\hat{\theta}(t+1) = \arg\max_{\theta} \, Q(\theta, \hat{\theta}(t)),
\]
for the ML criterion; in both cases, under the constraints in (2).
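As a minimal sketch of the two steps above for $d$-variate Gaussian mixtures under the ML criterion (no prior term), the following Python routine alternates the E-step (6) with the standard closed-form M-step updates for the mixing probabilities, means, and covariances. The initialization strategy and the small ridge added to the covariances are pragmatic choices made here for illustration, not part of the algorithm as presented in the paper (Section 3.2 discusses exactly these weak points of plain EM).

```python
import numpy as np
from scipy.stats import multivariate_normal


def em_gaussian_mixture(Y, k, n_iter=200, tol=1e-6, seed=0):
    """Plain ML EM for a k-component Gaussian mixture: E-step (6) + closed-form M-step."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    # crude initialization: uniform weights, k random samples as means, pooled covariance
    alphas = np.full(k, 1.0 / k)
    means = Y[rng.choice(n, size=k, replace=False)].copy()
    covs = np.stack([np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)] * k)
    log_lik_trace = []
    for _ in range(n_iter):
        # E-step: responsibilities w_m^(i) of Eq. (6), via Bayes law
        dens = np.column_stack([
            multivariate_normal.pdf(Y, mean=means[m], cov=covs[m])
            for m in range(k)
        ])                                      # (n, k): p(y^(i) | theta_m)
        weighted = dens * alphas                # alpha_m * p(y^(i) | theta_m)
        log_lik = np.sum(np.log(weighted.sum(axis=1)))
        W = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: maximize Q(theta, theta_hat(t)); standard Gaussian-mixture updates
        Nm = W.sum(axis=0)                      # effective number of samples per component
        alphas = Nm / n
        means = (W.T @ Y) / Nm[:, None]
        for m in range(k):
            diff = Y - means[m]
            covs[m] = (W[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(d)
        log_lik_trace.append(log_lik)
        if len(log_lik_trace) > 1 and abs(log_lik_trace[-1] - log_lik_trace[-2]) < tol:
            break                               # convergence criterion on the log-likelihood
    return alphas, means, covs, log_lik_trace
```

On synthetic data, one can verify that the sequence of log-likelihood values is nondecreasing, which is the basic monotonicity guarantee EM provides.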
3 PREVIOUS WORK
3.1 Estimating the Number of Components
Let us start by defining $\mathcal{M}_k$ as the class of all possible $k$-component mixtures built from a certain type of pdf's (e.g., all $d$-variate Gaussian mixtures with unconstrained covariance matrices). The ML criterion cannot be used to estimate $k$, the number of mixture components, because $\mathcal{M}_k \subseteq \mathcal{M}_{k+1}$, that is, these classes are nested. As an illustration, let $\theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_{k-1}, \alpha_k\}$ define a mixture in $\mathcal{M}_k$, and $\theta' = \{\theta_1, \ldots, \theta_k, \theta_{k+1}, \alpha_1, \ldots, \alpha_{k-1}, \alpha_k', \alpha_{k+1}'\}$ define a mixture in $\mathcal{M}_{k+1}$. If $\theta_{k+1} = \theta_k$ and $\alpha_k = \alpha_k' + \alpha_{k+1}'$, then $\theta$ and $\theta'$ represent the same probability density function. Consequently, the maximized likelihood $p(\mathcal{Y} \mid \hat{\theta}_{\mathrm{ML}})$ is a nondecreasing function of $k$ and is thus useless as a criterion to estimate the number of components.
Several model selection methods have been proposed to
estimate the number of components of a mixture. The vast
majority of these methods can be classified, from a computa-
tional point of view, into two classes: deterministic and
stochastic.
3.1.1 Deterministic Methods
The methods in this class start by obtaining a set of candidate models (usually by EM) for a range of values of $k$ (from $k_{\min}$ to $k_{\max}$) which is assumed to contain the true/optimal $k$. The number of components is then selected according to
\[
\hat{k} = \arg\min_{k} \left\{ \mathcal{C}(\hat{\theta}(k), k),\ k = k_{\min}, \ldots, k_{\max} \right\}, \qquad (7)
\]
where $\mathcal{C}(\hat{\theta}(k), k)$ is some model selection criterion and $\hat{\theta}(k)$ is an estimate of the mixture parameters assuming that the mixture has $k$ components. Usually, these criteria have the form
\[
\mathcal{C}(\hat{\theta}(k), k) = -\log p(\mathcal{Y} \mid \hat{\theta}(k)) + \mathcal{P}(k),
\]
where $\mathcal{P}(k)$ is an increasing function penalizing higher values of $k$. Examples of such criteria that have been used for mixtures include:
• Approximate Bayesian criteria, like the one in [50] (termed the Laplace-empirical criterion, LEC, in [37]), and Schwarz's Bayesian inference criterion (BIC) [10], [17], [22], [53].
• Approaches based on information/coding theory concepts, such as Rissanen's minimum description length (MDL) [49], which formally coincides with BIC, the minimum message length (MML) criterion [42], [60], [61], Akaike's information criterion (AIC) [62], and the informational complexity criterion (ICOMP) [8].
• Methods based on the complete likelihood (4) (also called classification likelihood), such as the approximate weight of evidence (AWE) [1], the classification likelihood criterion (CLC) [7], the normalized entropy criterion (NEC) [6], [12], and the integrated classification likelihood (ICL) criterion [5].
A more detailed review of these methods is found in [37]
(Chapter 6), which also includes a comparative study where
ICL and LEC are found to outperform the other criteria.
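A minimal sketch of this deterministic recipe, i.e., of (7), follows; it uses BIC as the criterion $\mathcal{C}(\hat{\theta}(k), k)$ and relies on scikit-learn's GaussianMixture merely to produce the candidate fits $\hat{\theta}(k)$, which is an implementation convenience assumed here rather than anything used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_k_by_bic(Y, k_min=1, k_max=10, n_restarts=5, seed=0):
    """Deterministic model selection as in (7): fit candidates for each k, keep the best score."""
    scores = {}
    for k in range(k_min, k_max + 1):
        # several random restarts partially mitigate EM's sensitivity to initialization
        gm = GaussianMixture(n_components=k, n_init=n_restarts, random_state=seed)
        gm.fit(Y)
        scores[k] = gm.bic(Y)   # plays the role of C(theta_hat(k), k) = -log-lik + penalty
    best_k = min(scores, key=scores.get)
    return best_k, scores
```

Any of the criteria listed above could be substituted for BIC by replacing the score with the corresponding penalized log-likelihood.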
3.1.2 Stochastic and Resampling Methods
Markov chain Monte Carlo (MCMC) methods can be used
in two different ways for mixture inference: to implement
model selection criteria (e.g., [2], [39], [51]); or, in a fully Bayesian way, to sample from the full a posteriori
distribution with k considered unknown [40], [45], [48].
Despite their formal appeal, we think that MCMC-based
techniques are still far too computationally demanding to
be useful in pattern recognition applications.
Resampling-based schemes [33] and cross-validation
approaches [54] have also been used to estimate the number
of mixture components. In terms of computational load,
these methods are closer to stochastic techniques than to
deterministic ones.
3.2 The Drawbacks of EM-Based Methods
Basically, all deterministic algorithms for fitting mixtures
with unknown numbers of components use the
EM algorithm. Although some of these methods perform well, a major drawback remains: a whole set of candidate
models has to be obtained, and the following well-known
problems associated with EM emerge.
3.2.1 The Initialization Issue
EM is highly dependent on initialization. Common (time-
consuming) solutions include one (or a combination of
several) of the following strategies: using multiple random
starts and choosing the final estimate with the highest
likelihood [25], [36], [37], [50], and initialization by clustering
algorithms [25], [36], [37]. Recently, a modified EM algorithm
using split and merge operations to escape from local maxima
of the log-likelihood has been proposed [59].
Deterministic annealing (DA) has been used with success
to avoid the initialization dependence of k-means type
algorithms for hard-clustering [27], [38], [52]. The resulting
algorithm is similar to EM for Gaussian mixtures under the
constraint of covariance matrices of the form T I, where T is
called the temperature and I is the identity matrix.
DA clustering algorithms begin at a high temperature (corresponding to $w_m^{(i)} \simeq 1/k$, a high-entropy, uninformative initialization); $T$ is then lowered according to some cooling schedule until $T \simeq 0$. The heuristic behind DA is that forcing the entropy of the assignments to decrease slowly avoids premature (hard) decisions that may correspond to poor local minima. The constraint on the covariance matrices makes DA clustering inapplicable to mixture model fitting, when seen as a density estimation problem. It is also not clear how it could be applied to non-Gaussian mixtures. However, it turns out that it is possible to obtain deterministic annealing versions of EM for mixtures, without constraining the covariance matrices, by modifying the E-step [31], [58].

Recently, we have shown (see [20]) that the EM algorithm exhibits a self-annealing behavior [44], that is, it works like a DA algorithm without a prespecified cooling schedule. Basically, all that is necessary is an uninformative (high-entropy) initialization of the type $w_m^{(i)} \simeq 1/k$ (called random starting in [37]), and EM will automatically anneal without the need for externally imposing a cooling schedule. This fact explains the good performance of the random starting method, recently reported in [37].
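A short sketch of such a random-starting initialization is given below. The small symmetry-breaking jitter and its scale are illustrative choices (the text only requires $w_m^{(i)} \simeq 1/k$); the resulting responsibilities would be fed to one M-step, such as the one in the EM sketch of Section 2.2, to obtain the initial parameter estimates.

```python
import numpy as np


def high_entropy_init(n, k, jitter=1e-2, seed=0):
    """Near-uniform responsibilities w_m^(i) ~ 1/k (the 'random starting' strategy).

    A tiny random perturbation breaks the symmetry between the k components;
    applying one M-step to these responsibilities yields the initial parameters.
    """
    rng = np.random.default_rng(seed)
    W = np.full((n, k), 1.0 / k) + jitter * rng.random((n, k))
    return W / W.sum(axis=1, keepdims=True)   # each row sums to one, entropy close to log k
```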
3.2.2 The Boundary of the Parameter Space
EM may converge to the boundary of the parameter space.
For example, when fitting a Gaussian mixture with uncon-
strained covariance matrices, one of the $\alpha_m$'s may approach
zero and the corresponding covariance matrix may become
arbitrarily close to singular. When the number of components
assumed is larger than the optimal/true one, this tends to
happen frequently, thus being a serious problem for methods
that require mixture estimates for various values of k. This
problem can be avoided through the use of soft constraints on
the covariance matrices, as suggested in [31].
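One common way of realizing such a constraint is a shrinkage of each covariance estimate toward a scaled identity; the sketch below shows this form only as an illustration (it is not necessarily the specific formulation of [31]) and would be applied to each covariance update inside the M-step.

```python
import numpy as np


def regularize_covariance(cov, strength=0.05):
    """Shrink a covariance estimate toward (average variance) * I to keep it nonsingular.

    strength in [0, 1]: 0 returns the raw ML estimate, larger values pull it more
    strongly toward a spherical matrix with the same average variance.
    """
    d = cov.shape[0]
    target = (np.trace(cov) / d) * np.eye(d)
    return (1.0 - strength) * cov + strength * target
```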
4 THE PROPOSED CRITERION
The well-known deterministic methods (see (7)) are model-class selection criteria: They select a model-class ($\mathcal{M}_k$) based on its "best" representative ($\hat{\theta}(k)$). However, in mixture models, the distinction between model-class selection and model estimation is unclear; e.g., a 3-component mixture in which one of the mixing probabilities is zero is indistinguishable from a 2-component mixture. These observations suggest a shift of approach: Let $k$ be some arbitrarily large value and infer the structure of the mixture by letting the estimates of some of the mixing probabilities be zero. This approach coincides with the MML philosophy [61], [60], which does not adopt the "model-class/model" hierarchy, but directly aims at finding the "best" overall model in the entire set of available models,
\[
\bigcup_{k = k_{\min}}^{k_{\max}} \mathcal{M}_k,
\]
rather than selecting one among a set of candidate models $\{\hat{\theta}(k),\ k = k_{\min}, \ldots, k_{\max}\}$. Previous uses of MML for mixtures do not strictly adhere to this perspective and end up using MML as a model-class selection criterion [42].
Rather than using EM to compute a set of candidate
models (with the drawbacks mentioned above), we will be
able to directly implement the MML criterion using a
variant of EM. The proposed algorithm turns out to be
much less initialization dependent than standard EM and
automatically avoids the boundary of the parameter space.
4.1 The Minimum Message Length Criterion
The rationale behind minimum encoding length criteria
(like MDL and MML) is: if you can build a short code for
your data, that means that you have a good data generation
model [49], [60], [61]. To formalize this idea, consider some
data-set $\mathcal{Y}$, known to have been generated according to $p(\mathcal{Y} \mid \theta)$, which is to be encoded and transmitted. Following Shannon theory [15], the shortest code length (measured in bits, if a base-2 logarithm is used, or in nats, if the natural logarithm is adopted [15]) for $\mathcal{Y}$ is $\lceil -\log p(\mathcal{Y} \mid \theta) \rceil$, where $\lceil a \rceil$ denotes "the smallest integer no less than $a$." Since even for moderately large data-sets $-\log p(\mathcal{Y} \mid \theta) \gg 1$, the $\lceil \cdot \rceil$ operator is usually dropped. If $p(\mathcal{Y} \mid \theta)$ is fully known to both the transmitter and the receiver, they can both build the same code and communication can proceed. However, if $\theta$ is a priori unknown, the transmitter has to start by estimating and transmitting $\theta$. This leads to a two-part message, whose total length is given by
\[
\mathrm{Length}(\theta, \mathcal{Y}) = \mathrm{Length}(\theta) + \mathrm{Length}(\mathcal{Y} \mid \theta). \qquad (8)
\]
All minimum encoding length criteria (like MDL and MML) state that the parameter estimate is the one minimizing $\mathrm{Length}(\theta, \mathcal{Y})$.
A key issue of this approach, which the several flavors of
the minimum encoding length principle (e.g., MDL and
MML) address differently, is that since $\theta$ is a vector of real parameters, a finite code-length can only be obtained by quantizing it to finite precision. The central idea involves the following trade-off. Let $\tilde{\theta}$ be a quantized version of $\theta$. If a fine precision is used, $\mathrm{Length}(\tilde{\theta})$ is large, but $\mathrm{Length}(\mathcal{Y} \mid \tilde{\theta})$ can be made small because $\tilde{\theta}$ can come close to the optimal value. Conversely, with a coarse precision, $\mathrm{Length}(\tilde{\theta})$ is small, but $\mathrm{Length}(\mathcal{Y} \mid \tilde{\theta})$ can be very far from optimal. There are several ways to formalize and solve this trade-off; see [32] for a comprehensive review and pointers to the literature.

The fact that the data itself may also be real-valued does not cause any difficulty: simply truncate $\mathcal{Y}$ to some arbitrarily fine precision $\delta$ and replace the density $p(\mathcal{Y} \mid \theta)$ by the probability $p(\mathcal{Y} \mid \theta)\,\delta^{d}$ ($d$ being the dimensionality of $\mathcal{Y}$). The resulting code-length is $-\log p(\mathcal{Y} \mid \theta) - d \log \delta$, but $d \log \delta$ is an irrelevant additive constant.
The particular form of the MML approach herein adopted is derived in Appendix A and leads to the following criterion (where the minimization with respect to $\theta$ is to be understood as being simultaneously in $\theta$ and $c$, the dimension of $\theta$):
\[
\hat{\theta} = \arg\min_{\theta} \left\{ -\log p(\theta) - \log p(\mathcal{Y} \mid \theta) + \frac{1}{2}\log |\mathbf{I}(\theta)| + \frac{c}{2}\left(1 + \log \frac{1}{12}\right) \right\}, \qquad (9)
\]
where $\mathbf{I}(\theta) \equiv -E\big[D^2_{\theta} \log p(\mathcal{Y} \mid \theta)\big]$ is the (expected) Fisher information matrix,$^1$ and $|\mathbf{I}(\theta)|$ denotes its determinant.
The MDL criterion (which formally, though not conceptually, coincides with BIC) can be obtained as an approximation to (9). Start by assuming a flat prior $p(\theta)$ and dropping it. Then, since $\mathbf{I}(\theta) = n\,\mathbf{I}^{(1)}(\theta)$ (where $\mathbf{I}^{(1)}(\theta)$ is the Fisher information corresponding to a single observation), $\log |\mathbf{I}(\theta)| = c \log n + \log |\mathbf{I}^{(1)}(\theta)|$. For large $n$, drop the order-1 terms $\log |\mathbf{I}^{(1)}(\theta)|$ and $\frac{c}{2}(1 + \log\frac{1}{12})$. Finally, for a given $c$, take $\log p(\mathcal{Y} \mid \theta) \simeq \log p(\mathcal{Y} \mid \hat{\theta}_c)$, where $\hat{\theta}_c$ is the corresponding ML estimate. The result is the well-known MDL criterion,
\[
\hat{c}_{\mathrm{MDL}} = \arg\min_{c} \left\{ -\log p(\mathcal{Y} \mid \hat{\theta}_c) + \frac{c}{2}\log n \right\}, \qquad (10)
\]
whose two-part code interpretation is clear: the data code-length is $-\log p(\mathcal{Y} \mid \hat{\theta}_c)$, while each of the $c$ components of $\hat{\theta}_c$ requires a code-length proportional to $\frac{1}{2}\log n$. Intuitively, this means that the encoding precision of the parameter estimates is made inversely proportional to the estimation error standard deviation, which, under regularity conditions, decreases with $\sqrt{n}$, leading to the $\frac{1}{2}\log n$ term [49].
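For concreteness, the sketch below evaluates the two-part code length of (10) for a fitted $k$-component, $d$-variate Gaussian mixture with full covariance matrices. The parameter count $c = (k - 1) + k\,(d + d(d+1)/2)$ is the standard count for that family (it is not spelled out in this excerpt), and the log-likelihood can be taken, for instance, from the EM sketch in Section 2.2.

```python
import numpy as np


def mdl_code_length(log_lik, n, k, d):
    """Two-part code length of (10) for a full-covariance, d-variate, k-component Gaussian mixture.

    c counts the free parameters: (k - 1) mixing probabilities plus, per component,
    d mean entries and d * (d + 1) / 2 distinct covariance entries.
    """
    c = (k - 1) + k * (d + d * (d + 1) // 2)
    return -log_lik + 0.5 * c * np.log(n)
```

Evaluating this quantity for each candidate $k$ and keeping the minimizer is exactly the deterministic recipe of (7) with $\mathcal{P}(k) = \frac{c}{2}\log n$.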
4.2 The Proposed Criterion for Mixtures
For mixtures, $\mathbf{I}(\theta)$ cannot, in general, be obtained analytically [37], [42], [57]. To side-step this difficulty, we replace $\mathbf{I}(\theta)$ by the complete-data Fisher information matrix $\mathbf{I}_c(\theta) \equiv -E\big[D^2_{\theta} \log p(\mathcal{Y}, \mathcal{Z} \mid \theta)\big]$, which upper-bounds $\mathbf{I}(\theta)$ [57]. $\mathbf{I}_c(\theta)$ has the block-diagonal structure
\[
\mathbf{I}_c(\theta) = n\, \mathrm{block\text{-}diag}\!\left\{ \alpha_1 \mathbf{I}^{(1)}(\theta_1), \ldots, \alpha_k \mathbf{I}^{(1)}(\theta_k), \mathbf{M} \right\},
\]
where $\mathbf{I}^{(1)}(\theta_m)$ is the Fisher matrix for a single observation known to have been produced by the $m$th component, and $\mathbf{M}$ is the Fisher matrix of a multinomial distribution (recall that $|\mathbf{M}| = (\alpha_1 \alpha_2 \cdots \alpha_k)^{-1}$) [57]. The approximation of $\mathbf{I}(\theta)$ by $\mathbf{I}_c(\theta)$ becomes exact in the limit of nonoverlapping components.

We adopt a prior expressing lack of knowledge about the mixture parameters. Naturally, we model the parameters of
1. Here, $D^2$ denotes the matrix of second derivatives, or Hessian.