Unsupervised Learning of Finite Mixture Models
Mario A.T. Figueiredo, Senior Member, IEEE, and Anil K. Jain, Fellow, IEEE
Abstract: This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify to the good performance of our approach.

Index Terms: Finite mixtures, unsupervised learning, model selection, minimum message length criterion, Bayesian methods, expectation-maximization algorithm, clustering.
1 INTRODUCTION

Finite mixtures are a flexible and powerful probabilistic modeling tool for univariate and multivariate data. The usefulness of mixture models in any area which involves the statistical modeling of data (such as pattern recognition, computer vision, signal and image analysis, machine learning) is currently widely acknowledged.
In statistical pattern recognition, finite mixtures allow a
formal (probabilistic model-based) approach to unsuper-
vised learning (i.e., clustering) [28], [29], [35], [37], [57]. In
fact, finite mixtures naturally model observations which are
assumed to have been produced by one (randomly selected
and unknown) of a set of alternative random sources.
Inferring (the parameters of) these sources and identifying
which source produced each observation leads to a
clustering of the set of observations. With this model-based
approach to clustering (as opposed to heuristic methods
like k-means or hierarchical agglomerative methods [28]),
issues like the selection of the number of clusters or the
assessment of the validity of a given model can be
addressed in a principled and formal way.
The usefulness of mixture models is not limited to
unsupervised learning applications. Mixture models are
able to represent arbitrarily complex probability density
functions (pdf's). This fact makes them an excellent choice
for representing complex class-conditional pdf's (i.e., like-
lihood functions) in (Bayesian) supervised learning scenar-
ios [25], [26], [55], or priors for Bayesian parameter
estimation [16]. Mixture models can also be used to perform
feature selection [43].
The standard method used to fit finite mixture models to
observed data is the expectation-maximization (EM) algorithm
[18], [36], [37], which converges to a maximum likelihood (ML)
estimate of the mixture parameters. However, the EM
algorithm for finite mixture fitting has several drawbacks: it
is a local (greedy) method, thus sensitive to initialization
because the likelihood function of a mixture model is not
unimodal; for certain types of mixtures, it may converge to the
boundary of the parameter space (where the likelihood is
unbounded) leading to meaningless estimates.
An important issue in mixture modeling is the selection
of the number of components. The usual trade-off in model
order selection problems arises: With too many compo-
nents, the mixture may over-fit the data, while a mixture
with too few components may not be flexible enough to
approximate the true underlying model.
In this paper, we deal simultaneously with the above
mentioned problems. We propose an inference criterion for
mixture models and an algorithm to implement it which:
1) automatically selects the number of components, 2) is less
sensitive to initialization than EM, and 3) avoids the
boundary of the parameter space.
Although most of the literature on finite mixtures focuses
on mixtures of Gaussian densities, many other types of
probability density functions have also been considered. The
approach proposed in this paper can be applied to any type of
parametric mixture model for which it is possible to write an
EM algorithm.
The rest of the paper is organized as follows: In Section 2, we
review finite mixture models and the EM algorithm; this is
standard material and our purpose is to introduce the
problem and define notation. In Section 3, we review
previous work on the problem of learning mixtures with an
unknown number of components and dealing with the
drawbacks of the EM algorithm. In Section 4, we describe
the proposed inference criterion, while the algorithm which
implements it is presented in Section 5. Section 6 reports
experimental results and Section 7 ends the paper by
presenting some concluding remarks.
• M.A.T. Figueiredo is with the Institute of Telecommunications and the Department of Electrical and Computer Engineering, Instituto Superior Técnico, 1049-001 Lisboa, Portugal. E-mail: mtf@lx.it.pt.
• A.K. Jain is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824. E-mail: jain@cse.msu.edu.
Manuscript received 5 July 2000; revised 8 Feb. 2001; accepted 30 July 2001.
Recommended for acceptance by W.T. Freeman.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number 112382.
0162-8828/02/$17.00 © 2002 IEEE
2 LEARNING FINITE MIXTURE MODELS
2.1 Finite Mixture Models
Let $\mathbf{Y} = [Y_1, \ldots, Y_d]^T$ be a $d$-dimensional random variable, with $\mathbf{y} = [y_1, \ldots, y_d]^T$ representing one particular outcome of $\mathbf{Y}$. It is said that $\mathbf{Y}$ follows a $k$-component finite mixture distribution if its probability density function can be written as
\[
p(\mathbf{y} \mid \theta) = \sum_{m=1}^{k} \alpha_m \, p(\mathbf{y} \mid \theta_m), \qquad (1)
\]
where $\alpha_1, \ldots, \alpha_k$ are the mixing probabilities, each $\theta_m$ is the set of parameters defining the $m$th component, and $\theta \equiv \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}$ is the complete set of parameters needed to specify the mixture. Of course, being probabilities, the $\alpha_m$ must satisfy
\[
\alpha_m \ge 0, \quad m = 1, \ldots, k, \quad \text{and} \quad \sum_{m=1}^{k} \alpha_m = 1. \qquad (2)
\]
In this paper, we assume that all the components have the
same functional form (for example, they are all d-variate
Gaussian), each one being thus fully characterized by the
parameter vector $\theta_m$. For detailed and comprehensive accounts on mixture models, see [35], [37], [57]; here, we simply review the fundamental ideas and define our notation.
Given a set of $n$ independent and identically distributed samples $\mathcal{Y} = \{\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(n)}\}$, the log-likelihood corresponding to a $k$-component mixture is
\[
\log p(\mathcal{Y} \mid \theta) = \log \prod_{i=1}^{n} p(\mathbf{y}^{(i)} \mid \theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{k} \alpha_m \, p(\mathbf{y}^{(i)} \mid \theta_m). \qquad (3)
\]
It is well known that the maximum likelihood (ML) estimate
\[
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \, \{\log p(\mathcal{Y} \mid \theta)\}
\]
cannot be found analytically. The same is true for the Bayesian maximum a posteriori (MAP) criterion,
\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, \{\log p(\mathcal{Y} \mid \theta) + \log p(\theta)\},
\]
given some prior $p(\theta)$ on the parameters. Of course, the maximizations defining the ML or MAP estimates are under the constraints in (2).
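For concreteness, the following Python sketch (not part of the original paper) evaluates the mixture log-likelihood (3) for the case of $d$-variate Gaussian components; the use of scipy.stats.multivariate_normal for the component densities $p(\mathbf{y} \mid \theta_m)$ is an illustrative assumption made here, not something prescribed by the text.

```python
import numpy as np
from scipy.stats import multivariate_normal


def mixture_log_likelihood(Y, alphas, means, covs):
    """Log-likelihood (3) of samples Y under a k-component Gaussian mixture.

    Y      : (n, d) array of samples y^(i)
    alphas : (k,) mixing probabilities satisfying the constraints in (2)
    means  : (k, d) component means (the theta_m, Gaussian case)
    covs   : (k, d, d) component covariance matrices
    """
    k = len(alphas)
    # p(y^(i) | theta_m) for every sample/component pair: an (n, k) array
    dens = np.column_stack([
        multivariate_normal.pdf(Y, mean=means[m], cov=covs[m])
        for m in range(k)
    ])
    # sum over components inside the log (Eq. (1)), then over the n samples
    return np.sum(np.log(dens @ alphas))
```

For a single component ($k = 1$), this reduces to the usual Gaussian log-likelihood, which provides a simple sanity check.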
2.2 The EM Algorithm
The usual choice for obtaining ML or MAP estimates of the
mixture parameters is the EM algorithm [18], [35], [36], [37].
EM is an iterative procedure which finds local maxima of
$\log p(\mathcal{Y} \mid \theta)$ or $\log p(\mathcal{Y} \mid \theta) + \log p(\theta)$. For the case of Gaussian
mixtures, the convergence behavior of EM is well studied
[37], [63]. It was recently shown that EM belongs to a class
of iterative methods called proximal point algorithms (PPA;
for an introduction to PPA and a comprehensive set of
references see [4], chapter 5) [13]. Seeing EM under this new
light opens the door to several extensions and general-
izations. An earlier related result, although without
identifying EM as a PPA, appeared in [41].
The EM algorithm is based on the interpretation of $\mathcal{Y}$ as incomplete data. For finite mixtures, the missing part is a set of $n$ labels $\mathcal{Z} = \{\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(n)}\}$ associated with the $n$ samples, indicating which component produced each sample. Each label is a binary vector $\mathbf{z}^{(i)} = [z_1^{(i)}, \ldots, z_k^{(i)}]$, where $z_m^{(i)} = 1$ and $z_p^{(i)} = 0$, for $p \ne m$, means that sample $\mathbf{y}^{(i)}$ was produced by the $m$th component. The complete log-likelihood (i.e., the one from which we could estimate $\theta$ if the complete data $\mathcal{X} = \{\mathcal{Y}, \mathcal{Z}\}$ was observed [36]) is
\[
\log p(\mathcal{Y}, \mathcal{Z} \mid \theta) = \sum_{i=1}^{n} \sum_{m=1}^{k} z_m^{(i)} \log\!\left[\alpha_m \, p(\mathbf{y}^{(i)} \mid \theta_m)\right]. \qquad (4)
\]
The EM algorithm produces a sequence of estimates $\{\hat{\theta}(t),\ t = 0, 1, 2, \ldots\}$ by alternatingly applying two steps (until some convergence criterion is met):

• E-step: Computes the conditional expectation of the complete log-likelihood, given $\mathcal{Y}$ and the current estimate $\hat{\theta}(t)$. Since $\log p(\mathcal{Y}, \mathcal{Z} \mid \theta)$ is linear with respect to the missing $\mathcal{Z}$, we simply have to compute the conditional expectation $\mathcal{W} \equiv E[\mathcal{Z} \mid \mathcal{Y}, \hat{\theta}(t)]$ and plug it into $\log p(\mathcal{Y}, \mathcal{Z} \mid \theta)$. The result is the so-called Q-function:
\[
Q(\theta, \hat{\theta}(t)) \equiv E\!\left[\log p(\mathcal{Y}, \mathcal{Z} \mid \theta) \,\middle|\, \mathcal{Y}, \hat{\theta}(t)\right] = \log p(\mathcal{Y}, \mathcal{W} \mid \theta). \qquad (5)
\]
Since the elements of $\mathcal{Z}$ are binary, their conditional expectations are given by
\[
w_m^{(i)} \equiv E\!\left[z_m^{(i)} \,\middle|\, \mathcal{Y}, \hat{\theta}(t)\right] = \Pr\!\left[z_m^{(i)} = 1 \,\middle|\, \mathbf{y}^{(i)}, \hat{\theta}(t)\right] = \frac{\hat{\alpha}_m(t)\, p(\mathbf{y}^{(i)} \mid \hat{\theta}_m(t))}{\sum_{j=1}^{k} \hat{\alpha}_j(t)\, p(\mathbf{y}^{(i)} \mid \hat{\theta}_j(t))}, \qquad (6)
\]
where the last equality is simply Bayes law ($\alpha_m$ is the a priori probability that $z_m^{(i)} = 1$, while $w_m^{(i)}$ is the a posteriori probability that $z_m^{(i)} = 1$, after observing $\mathbf{y}^{(i)}$).
• M-step: Updates the parameter estimates according to
\[
\hat{\theta}(t+1) = \arg\max_{\theta} \, \{Q(\theta, \hat{\theta}(t)) + \log p(\theta)\},
\]
in the case of MAP estimation, or
\[
\hat{\theta}(t+1) = \arg\max_{\theta} \, Q(\theta, \hat{\theta}(t)),
\]
for the ML criterion; in both cases, under the constraints in (2).
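As a minimal sketch of the two steps above for $d$-variate Gaussian mixtures under the ML criterion (no prior term), the following Python routine alternates the E-step (6) with the standard closed-form M-step updates for the mixing probabilities, means, and covariances. The initialization strategy and the small ridge added to the covariances are pragmatic choices made here for illustration, not part of the algorithm as presented in the paper (Section 3.2 discusses exactly these weak points of plain EM).

```python
import numpy as np
from scipy.stats import multivariate_normal


def em_gaussian_mixture(Y, k, n_iter=200, tol=1e-6, seed=0):
    """Plain ML EM for a k-component Gaussian mixture: E-step (6) + closed-form M-step."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    # crude initialization: uniform weights, k random samples as means, pooled covariance
    alphas = np.full(k, 1.0 / k)
    means = Y[rng.choice(n, size=k, replace=False)].copy()
    covs = np.stack([np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)] * k)
    log_lik_trace = []
    for _ in range(n_iter):
        # E-step: responsibilities w_m^(i) of Eq. (6), via Bayes law
        dens = np.column_stack([
            multivariate_normal.pdf(Y, mean=means[m], cov=covs[m])
            for m in range(k)
        ])                                      # (n, k): p(y^(i) | theta_m)
        weighted = dens * alphas                # alpha_m * p(y^(i) | theta_m)
        log_lik = np.sum(np.log(weighted.sum(axis=1)))
        W = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: maximize Q(theta, theta_hat(t)); standard Gaussian-mixture updates
        Nm = W.sum(axis=0)                      # effective number of samples per component
        alphas = Nm / n
        means = (W.T @ Y) / Nm[:, None]
        for m in range(k):
            diff = Y - means[m]
            covs[m] = (W[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(d)
        log_lik_trace.append(log_lik)
        if len(log_lik_trace) > 1 and abs(log_lik_trace[-1] - log_lik_trace[-2]) < tol:
            break                               # convergence criterion on the log-likelihood
    return alphas, means, covs, log_lik_trace
```

On synthetic data, one can verify that the sequence of log-likelihood values is nondecreasing, which is the basic monotonicity guarantee EM provides.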
3 PREVIOUS WORK
3.1 Estimating the Number of Components
Let us start by defining $\mathcal{M}_k$ as the class of all possible $k$-component mixtures built from a certain type of pdf's (e.g., all $d$-variate Gaussian mixtures with unconstrained covariance matrices). The ML criterion cannot be used to estimate $k$, the number of mixture components, because $\mathcal{M}_k \subseteq \mathcal{M}_{k+1}$, that is, these classes are nested. As an illustration, let $\theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_{k-1}, \alpha_k\}$ define a mixture in $\mathcal{M}_k$, and $\theta' = \{\theta_1, \ldots, \theta_k, \theta_{k+1}, \alpha_1, \ldots, \alpha_{k-1}, \alpha_k', \alpha_{k+1}'\}$ define a mixture in $\mathcal{M}_{k+1}$. If $\theta_{k+1} = \theta_k$ and $\alpha_k = \alpha_k' + \alpha_{k+1}'$, then $\theta$ and $\theta'$ represent the same probability density function. Consequently, the maximized likelihood $p(\mathcal{Y} \mid \hat{\theta}_{\mathrm{ML}})$ is a nondecreasing function of $k$ and is thus useless as a criterion to estimate the number of components.
Several model selection methods have been proposed to
estimate the number of components of a mixture. The vast
majority of these methods can be classified, from a computa-
tional point of view, into two classes: deterministic and
stochastic.
3.1.1 Deterministic Methods
The methods in this class start by obtaining a set of candidate models (usually by EM) for a range of values of $k$ (from $k_{\min}$ to $k_{\max}$) which is assumed to contain the true/optimal $k$. The number of components is then selected according to
\[
\hat{k} = \arg\min_{k} \left\{ \mathcal{C}(\hat{\theta}(k), k),\ k = k_{\min}, \ldots, k_{\max} \right\}, \qquad (7)
\]
where $\mathcal{C}(\hat{\theta}(k), k)$ is some model selection criterion and $\hat{\theta}(k)$ is an estimate of the mixture parameters assuming that the mixture has $k$ components. Usually, these criteria have the form
\[
\mathcal{C}(\hat{\theta}(k), k) = -\log p(\mathcal{Y} \mid \hat{\theta}(k)) + \mathcal{P}(k),
\]
where $\mathcal{P}(k)$ is an increasing function penalizing higher values of $k$. Examples of such criteria that have been used for mixtures include:
• Approximate Bayesian criteria, like the one in [50] (termed the Laplace-empirical criterion, LEC, in [37]), and Schwarz's Bayesian inference criterion (BIC) [10], [17], [22], [53].
• Approaches based on information/coding theory concepts, such as Rissanen's minimum description length (MDL) [49], which formally coincides with BIC, the minimum message length (MML) criterion [42], [60], [61], Akaike's information criterion (AIC) [62], and the informational complexity criterion (ICOMP) [8].
• Methods based on the complete likelihood (4) (also called classification likelihood), such as the approximate weight of evidence (AWE) [1], the classification likelihood criterion (CLC) [7], the normalized entropy criterion (NEC) [6], [12], and the integrated classification likelihood (ICL) criterion [5].
A more detailed review of these methods is found in [37]
(Chapter 6), which also includes a comparative study where
ICL and LEC are found to outperform the other criteria.
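A minimal sketch of this deterministic recipe, i.e., of (7), follows; it uses BIC as the criterion $\mathcal{C}(\hat{\theta}(k), k)$ and relies on scikit-learn's GaussianMixture merely to produce the candidate fits $\hat{\theta}(k)$, which is an implementation convenience assumed here rather than anything used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_k_by_bic(Y, k_min=1, k_max=10, n_restarts=5, seed=0):
    """Deterministic model selection as in (7): fit candidates for each k, keep the best score."""
    scores = {}
    for k in range(k_min, k_max + 1):
        # several random restarts partially mitigate EM's sensitivity to initialization
        gm = GaussianMixture(n_components=k, n_init=n_restarts, random_state=seed)
        gm.fit(Y)
        scores[k] = gm.bic(Y)   # plays the role of C(theta_hat(k), k) = -log-lik + penalty
    best_k = min(scores, key=scores.get)
    return best_k, scores
```

Any of the criteria listed above could be substituted for BIC by replacing the score with the corresponding penalized log-likelihood.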
3.1.2 Stochastic and Resampling Methods
Markov chain Monte Carlo (MCMC) methods can be used
in two different ways for mixture inference: to implement
model selection criteria (e.g., [2], [39], [51]); or, in a fully Bayesian way, to sample from the full a posteriori
distribution with k considered unknown [40], [45], [48].
Despite their formal appeal, we think that MCMC-based
techniques are still far too computationally demanding to
be useful in pattern recognition applications.
Resampling-based schemes [33] and cross-validation
approaches [54] have also been used to estimate the number
of mixture components. In terms of computational load,
these methods are closer to stochastic techniques than to
deterministic ones.
3.2 The Drawbacks of EM-Based Methods
Basically, all deterministic algorithms for fitting mixtures
with unknown numbers of components use the
EM algorithm. Although some of these methods perform well, a major drawback remains: a whole set of candidate
models has to be obtained, and the following well-known
problems associated with EM emerge.
3.2.1 The Initialization Issue
EM is highly dependent on initialization. Common (time-
consuming) solutions include one (or a combination of
several) of the following strategies: using multiple random
starts and choosing the final estimate with the highest
likelihood [25], [36], [37], [50], and initialization by clustering
algorithms [25], [36], [37]. Recently, a modified EM algorithm
using split and merge operations to escape from local maxima
of the log-likelihood has been proposed [59].
Deterministic annealing (DA) has been used with success
to avoid the initialization dependence of k-means type
algorithms for hard-clustering [27], [38], [52]. The resulting
algorithm is similar to EM for Gaussian mixtures under the
constraint of covariance matrices of the form T I, where T is
called the temperature and I is the identity matrix.
DA clustering algorithms begin at a high temperature (corresponding to $w_m^{(i)} \simeq 1/k$, a high-entropy, uninformative initialization); $T$ is then lowered according to some cooling schedule until $T \simeq 0$. The heuristic behind DA is that forcing the entropy of the assignments to decrease slowly avoids premature (hard) decisions that may correspond to poor local minima. The constraint on the covariance matrices makes DA clustering inapplicable to mixture model fitting, when seen as a density estimation problem. It is also not clear how it could be applied to non-Gaussian mixtures. However, it turns out that it is possible to obtain deterministic annealing versions of EM for mixtures, without constraining the covariance matrices, by modifying the E-step [31], [58].

Recently, we have shown (see [20]) that the EM algorithm exhibits a self-annealing behavior [44], that is, it works like a DA algorithm without a prespecified cooling schedule. Basically, all that is necessary is an uninformative (high-entropy) initialization of the type $w_m^{(i)} \simeq 1/k$ (called random starting in [37]), and EM will automatically anneal without the need for externally imposing a cooling schedule. This fact explains the good performance of the random starting method, recently reported in [37].
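A short sketch of such a random-starting initialization is given below. The small symmetry-breaking jitter and its scale are illustrative choices (the text only requires $w_m^{(i)} \simeq 1/k$); the resulting responsibilities would be fed to one M-step, such as the one in the EM sketch of Section 2.2, to obtain the initial parameter estimates.

```python
import numpy as np


def high_entropy_init(n, k, jitter=1e-2, seed=0):
    """Near-uniform responsibilities w_m^(i) ~ 1/k (the 'random starting' strategy).

    A tiny random perturbation breaks the symmetry between the k components;
    applying one M-step to these responsibilities yields the initial parameters.
    """
    rng = np.random.default_rng(seed)
    W = np.full((n, k), 1.0 / k) + jitter * rng.random((n, k))
    return W / W.sum(axis=1, keepdims=True)   # each row sums to one, entropy close to log k
```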
3.2.2 The Boundary of the Parameter Space
EM may converge to the boundary of the parameter space.
For example, when fitting a Gaussian mixture with uncon-
strained covariance matrices, one of the $\alpha_m$'s may approach
zero and the corresponding covariance matrix may become
arbitrarily close to singular. When the number of components
assumed is larger than the optimal/true one, this tends to
happen frequently, thus being a serious problem for methods
that require mixture estimates for various values of k. This
problem can be avoided through the use of soft constraints on
the covariance matrices, as suggested in [31].
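One common way of realizing such a constraint is a shrinkage of each covariance estimate toward a scaled identity; the sketch below shows this form only as an illustration (it is not necessarily the specific formulation of [31]) and would be applied to each covariance update inside the M-step.

```python
import numpy as np


def regularize_covariance(cov, strength=0.05):
    """Shrink a covariance estimate toward (average variance) * I to keep it nonsingular.

    strength in [0, 1]: 0 returns the raw ML estimate, larger values pull it more
    strongly toward a spherical matrix with the same average variance.
    """
    d = cov.shape[0]
    target = (np.trace(cov) / d) * np.eye(d)
    return (1.0 - strength) * cov + strength * target
```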
4 THE PROPOSED CRITERION
The well-known deterministic methods (see (7)) are model-class selection criteria: They select a model-class ($\mathcal{M}_k$) based on its "best" representative ($\hat{\theta}(k)$). However, in mixture models, the distinction between model-class selection and model estimation is unclear; e.g., a 3-component mixture in which one of the mixing probabilities is zero is indistinguishable from a 2-component mixture. These observations suggest a shift of approach: Let $k$ be some arbitrarily large value and infer the structure of the mixture by letting the estimates of some of the mixing probabilities be zero. This approach coincides with the MML philosophy [61], [60], which does not adopt the "model-class/model" hierarchy, but directly aims at finding the "best" overall model in the entire set of available models,
\[
\bigcup_{k = k_{\min}}^{k_{\max}} \mathcal{M}_k,
\]
rather than selecting one among a set of candidate models $\{\hat{\theta}(k),\ k = k_{\min}, \ldots, k_{\max}\}$. Previous uses of MML for mixtures do not strictly adhere to this perspective and end up using MML as a model-class selection criterion [42].
Rather than using EM to compute a set of candidate
models (with the drawbacks mentioned above), we will be
able to directly implement the MML criterion using a
variant of EM. The proposed algorithm turns out to be
much less initialization dependent than standard EM and
automatically avoids the boundary of the parameter space.
4.1 The Minimum Message Length Criterion
The rationale behind minimum encoding length criteria
(like MDL and MML) is: if you can build a short code for
your data, that means that you have a good data generation
model [49], [60], [61]. To formalize this idea, consider some
data-set $\mathcal{Y}$, known to have been generated according to $p(\mathcal{Y} \mid \theta)$, which is to be encoded and transmitted. Following Shannon theory [15], the shortest code length (measured in bits, if a base-2 logarithm is used, or in nats, if the natural logarithm is adopted [15]) for $\mathcal{Y}$ is $\lceil -\log p(\mathcal{Y} \mid \theta) \rceil$, where $\lceil a \rceil$ denotes "the smallest integer no less than $a$." Since even for moderately large data-sets $-\log p(\mathcal{Y} \mid \theta) \gg 1$, the $\lceil \cdot \rceil$ operator is usually dropped. If $p(\mathcal{Y} \mid \theta)$ is fully known to both the transmitter and the receiver, they can both build the same code and communication can proceed. However, if $\theta$ is a priori unknown, the transmitter has to start by estimating and transmitting $\theta$. This leads to a two-part message, whose total length is given by
\[
\mathrm{Length}(\theta, \mathcal{Y}) = \mathrm{Length}(\theta) + \mathrm{Length}(\mathcal{Y} \mid \theta). \qquad (8)
\]
All minimum encoding length criteria (like MDL and MML) state that the parameter estimate is the one minimizing $\mathrm{Length}(\theta, \mathcal{Y})$.
A key issue of this approach, which the several flavors of
the minimum encoding length principle (e.g., MDL and
MML) address differently, is that since $\theta$ is a vector of real parameters, a finite code-length can only be obtained by quantizing it to finite precision. The central idea involves the following trade-off. Let $\tilde{\theta}$ be a quantized version of $\theta$. If a fine precision is used, $\mathrm{Length}(\tilde{\theta})$ is large, but $\mathrm{Length}(\mathcal{Y} \mid \tilde{\theta})$ can be made small because $\tilde{\theta}$ can come close to the optimal value. Conversely, with a coarse precision, $\mathrm{Length}(\tilde{\theta})$ is small, but $\mathrm{Length}(\mathcal{Y} \mid \tilde{\theta})$ can be very far from optimal. There are several ways to formalize and solve this trade-off; see [32] for a comprehensive review and pointers to the literature.

The fact that the data itself may also be real-valued does not cause any difficulty: simply truncate $\mathcal{Y}$ to some arbitrarily fine precision $\delta$ and replace the density $p(\mathcal{Y} \mid \theta)$ by the probability $p(\mathcal{Y} \mid \theta)\,\delta^{d}$ ($d$ being the dimensionality of $\mathcal{Y}$). The resulting code-length is $-\log p(\mathcal{Y} \mid \theta) - d \log \delta$, but $d \log \delta$ is an irrelevant additive constant.
The particular form of the MML approach herein adopted is derived in Appendix A and leads to the following criterion (where the minimization with respect to $\theta$ is to be understood as being simultaneously in $\theta$ and $c$, the dimension of $\theta$):
\[
\hat{\theta} = \arg\min_{\theta} \left\{ -\log p(\theta) - \log p(\mathcal{Y} \mid \theta) + \frac{1}{2}\log |\mathbf{I}(\theta)| + \frac{c}{2}\left(1 + \log \frac{1}{12}\right) \right\}, \qquad (9)
\]
where $\mathbf{I}(\theta) \equiv -E\big[D^2_{\theta} \log p(\mathcal{Y} \mid \theta)\big]$ is the (expected) Fisher information matrix,$^1$ and $|\mathbf{I}(\theta)|$ denotes its determinant.
The MDL criterion (which formally, though not conceptually, coincides with BIC) can be obtained as an approximation to (9). Start by assuming a flat prior $p(\theta)$ and dropping it. Then, since $\mathbf{I}(\theta) = n\,\mathbf{I}^{(1)}(\theta)$ (where $\mathbf{I}^{(1)}(\theta)$ is the Fisher information corresponding to a single observation), $\log |\mathbf{I}(\theta)| = c \log n + \log |\mathbf{I}^{(1)}(\theta)|$. For large $n$, drop the order-1 terms $\log |\mathbf{I}^{(1)}(\theta)|$ and $\frac{c}{2}(1 + \log\frac{1}{12})$. Finally, for a given $c$, take $\log p(\mathcal{Y} \mid \theta) \simeq \log p(\mathcal{Y} \mid \hat{\theta}_c)$, where $\hat{\theta}_c$ is the corresponding ML estimate. The result is the well-known MDL criterion,
\[
\hat{c}_{\mathrm{MDL}} = \arg\min_{c} \left\{ -\log p(\mathcal{Y} \mid \hat{\theta}_c) + \frac{c}{2}\log n \right\}, \qquad (10)
\]
whose two-part code interpretation is clear: the data code-length is $-\log p(\mathcal{Y} \mid \hat{\theta}_c)$, while each of the $c$ components of $\hat{\theta}_c$ requires a code-length proportional to $\frac{1}{2}\log n$. Intuitively, this means that the encoding precision of the parameter estimates is made inversely proportional to the estimation error standard deviation, which, under regularity conditions, decreases with $\sqrt{n}$, leading to the $\frac{1}{2}\log n$ term [49].
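For concreteness, the sketch below evaluates the two-part code length of (10) for a fitted $k$-component, $d$-variate Gaussian mixture with full covariance matrices. The parameter count $c = (k - 1) + k\,(d + d(d+1)/2)$ is the standard count for that family (it is not spelled out in this excerpt), and the log-likelihood can be taken, for instance, from the EM sketch in Section 2.2.

```python
import numpy as np


def mdl_code_length(log_lik, n, k, d):
    """Two-part code length of (10) for a full-covariance, d-variate, k-component Gaussian mixture.

    c counts the free parameters: (k - 1) mixing probabilities plus, per component,
    d mean entries and d * (d + 1) / 2 distinct covariance entries.
    """
    c = (k - 1) + k * (d + d * (d + 1) // 2)
    return -log_lik + 0.5 * c * np.log(n)
```

Evaluating this quantity for each candidate $k$ and keeping the minimizer is exactly the deterministic recipe of (7) with $\mathcal{P}(k) = \frac{c}{2}\log n$.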
4.2 The Proposed Criterion for Mixtures
For mixtures, $\mathbf{I}(\theta)$ cannot, in general, be obtained analytically [37], [42], [57]. To side-step this difficulty, we replace $\mathbf{I}(\theta)$ by the complete-data Fisher information matrix $\mathbf{I}_c(\theta) \equiv -E\big[D^2_{\theta} \log p(\mathcal{Y}, \mathcal{Z} \mid \theta)\big]$, which upper-bounds $\mathbf{I}(\theta)$ [57]. $\mathbf{I}_c(\theta)$ has the block-diagonal structure
\[
\mathbf{I}_c(\theta) = n\, \mathrm{block\text{-}diag}\!\left\{ \alpha_1 \mathbf{I}^{(1)}(\theta_1), \ldots, \alpha_k \mathbf{I}^{(1)}(\theta_k), \mathbf{M} \right\},
\]
where $\mathbf{I}^{(1)}(\theta_m)$ is the Fisher matrix for a single observation known to have been produced by the $m$th component, and $\mathbf{M}$ is the Fisher matrix of a multinomial distribution (recall that $|\mathbf{M}| = (\alpha_1 \alpha_2 \cdots \alpha_k)^{-1}$) [57]. The approximation of $\mathbf{I}(\theta)$ by $\mathbf{I}_c(\theta)$ becomes exact in the limit of nonoverlapping components.

We adopt a prior expressing lack of knowledge about the mixture parameters. Naturally, we model the parameters of
1. Here, $D^2$ denotes the matrix of second derivatives, or Hessian.