A Gentle Tutorial of the EM Algorithm
and its Application to Parameter
Estimation for Gaussian Mixture and
Hidden Markov Models
Jeff A. Bilmes (bilmes@cs.berkeley.edu)
International Computer Science Institute
Berkeley, CA 94704
and
Computer Science Division
Department of Electrical Engineering and Computer Science
U.C. Berkeley
TR-97-021
April 1998
Abstract
We describe the maximum-likelihood parameter estimation problem and how the Expectation-
Maximization (EM) algorithm can be used for its solution. We first describe the abstract
form of the EM algorithm as it is often given in the literature. We then develop the EM pa-
rameter estimation procedure for two applications: 1) finding the parameters of a mixture of
Gaussian densities, and 2) finding the parameters of a hidden Markov model (HMM) (i.e.,
the Baum-Welch algorithm) for both discrete and Gaussian mixture observation models.
We derive the update equations in fairly explicit detail but we do not prove any conver-
gence properties. We try to emphasize intuition rather than mathematical rigor.
1 Maximum-likelihood
Recall the definition of the maximum-likelihood estimation problem. We have a density function $p(\mathbf{x}|\Theta)$ that is governed by the set of parameters $\Theta$ (e.g., $p$ might be a set of Gaussians and $\Theta$ could be the means and covariances). We also have a data set of size $N$, supposedly drawn from this distribution, i.e., $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. That is, we assume that these data vectors are independent and identically distributed (i.i.d.) with distribution $p$. Therefore, the resulting density for the samples is
$$p(\mathcal{X}|\Theta) = \prod_{i=1}^{N} p(\mathbf{x}_i|\Theta) = \mathcal{L}(\Theta|\mathcal{X}).$$
This function $\mathcal{L}(\Theta|\mathcal{X})$ is called the likelihood of the parameters given the data, or just the likelihood function. The likelihood is thought of as a function of the parameters $\Theta$ where the data $\mathcal{X}$ is fixed. In the maximum-likelihood problem, our goal is to find the $\Theta$ that maximizes $\mathcal{L}$. That is, we wish to find $\Theta^*$ where
$$\Theta^* = \operatorname*{argmax}_{\Theta}\; \mathcal{L}(\Theta|\mathcal{X}).$$
Often we maximize $\log \mathcal{L}(\Theta|\mathcal{X})$ instead because it is analytically easier.
Depending on the form of $p(\mathbf{x}|\Theta)$, this problem can be easy or hard. For example, if $p(\mathbf{x}|\Theta)$ is simply a single Gaussian distribution where $\Theta = (\mu, \sigma^2)$, then we can set the derivative of $\log \mathcal{L}(\Theta|\mathcal{X})$ to zero and solve directly for $\mu$ and $\sigma^2$ (this, in fact, results in the standard formulas for the mean and variance of a data set). For many problems, however, it is not possible to find such analytical expressions, and we must resort to more elaborate techniques.
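To make the easy case concrete, carrying out this derivative calculation for a single one-dimensional Gaussian yields the familiar closed-form estimates
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2,$$
i.e., the sample mean and the (biased) maximum-likelihood sample variance of the data set.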
2 Basic EM
The EM algorithm is one such elaborate technique. The EM algorithm [ALR77, RW84, GJ95, JJ94,
Bis95, Wu83] is a general method of finding the maximum-likelihood estimate of the parameters of
an underlying distribution from a given data set when the data is incomplete or has missing values.
There are two main applications of the EM algorithm. The first occurs when the data indeed
has missing values, due to problems with or limitations of the observation process. The second
occurs when optimizing the likelihood function is analytically intractable but when the likelihood
function can be simplified by assuming the existence of and values for additional but missing (or
hidden) parameters. The latter application is more common in the computational pattern recognition
community.
As before, we assume that data $\mathcal{X}$ is observed and is generated by some distribution. We call $\mathcal{X}$ the incomplete data. We assume that a complete data set $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ exists and also assume (or specify) a joint density function:
$$p(\mathbf{z}|\Theta) = p(\mathbf{x}, \mathbf{y}|\Theta) = p(\mathbf{y}|\mathbf{x}, \Theta)\, p(\mathbf{x}|\Theta).$$
Where does this joint density come from? Often it “arises” from the marginal density function $p(\mathbf{x}|\Theta)$ and the assumption of hidden variables and parameter value guesses (e.g., our two examples, mixture densities and Baum-Welch). In other cases (e.g., missing data values in samples of a distribution), we must assume a joint relationship between the missing and observed values.
With this new density function, we can define a new likelihood function, $\mathcal{L}(\Theta|\mathcal{Z}) = \mathcal{L}(\Theta|\mathcal{X}, \mathcal{Y}) = p(\mathcal{X}, \mathcal{Y}|\Theta)$, called the complete-data likelihood. Note that this function is in fact a random variable since the missing information $\mathcal{Y}$ is unknown, random, and presumably governed by an underlying distribution. That is, we can think of $\mathcal{L}(\Theta|\mathcal{X}, \mathcal{Y}) = h_{\mathcal{X},\Theta}(\mathcal{Y})$ for some function $h_{\mathcal{X},\Theta}(\cdot)$ where $\mathcal{X}$ and $\Theta$ are constant and $\mathcal{Y}$ is a random variable. The original likelihood $\mathcal{L}(\Theta|\mathcal{X})$ is referred to as the incomplete-data likelihood function.
The EM algorithm first finds the expected value of the complete-data log-likelihood $\log p(\mathcal{X}, \mathcal{Y}|\Theta)$ with respect to the unknown data $\mathcal{Y}$, given the observed data $\mathcal{X}$ and the current parameter estimates. That is, we define:
$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(\mathcal{X}, \mathcal{Y}|\Theta) \,\middle|\, \mathcal{X}, \Theta^{(i-1)}\right] \qquad (1)$$
where $\Theta^{(i-1)}$ are the current parameter estimates that we used to evaluate the expectation and $\Theta$ are the new parameters that we optimize to increase $Q$.
This expression probably requires some explanation. The key thing to understand is that $\mathcal{X}$ and $\Theta^{(i-1)}$ are constants, $\Theta$ is a normal variable that we wish to adjust, and $\mathcal{Y}$ is a random variable governed by the distribution $f(\mathbf{y}|\mathcal{X}, \Theta^{(i-1)})$. The right side of Equation 1 can therefore be re-written as:
$$E\left[\log p(\mathcal{X}, \mathcal{Y}|\Theta) \,\middle|\, \mathcal{X}, \Theta^{(i-1)}\right] = \int_{\mathbf{y} \in \Upsilon} \log p(\mathcal{X}, \mathbf{y}|\Theta)\, f(\mathbf{y}|\mathcal{X}, \Theta^{(i-1)})\, d\mathbf{y} \qquad (2)$$
Note that $f(\mathbf{y}|\mathcal{X}, \Theta^{(i-1)})$ is the marginal distribution of the unobserved data and is dependent on both the observed data $\mathcal{X}$ and on the current parameters, and $\Upsilon$ is the space of values $\mathbf{y}$ can take on. In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameters $\Theta^{(i-1)}$ and perhaps the data. In the worst of cases, this density might be very hard to obtain. Sometimes, in fact, the density actually used is $f(\mathbf{y}, \mathcal{X}|\Theta^{(i-1)}) = f(\mathbf{y}|\mathcal{X}, \Theta^{(i-1)})\, f(\mathcal{X}|\Theta^{(i-1)})$, but this doesn't affect subsequent steps since the extra factor, $f(\mathcal{X}|\Theta^{(i-1)})$, is not dependent on $\Theta$.
As an analogy, suppose we have a function $h(\theta, \mathbf{y})$ of two variables. Consider $h(\theta, Y)$ where $\theta$ is a constant and $Y$ is a random variable governed by some distribution $f_Y(\mathbf{y})$. Then $q(\theta) = E_Y[h(\theta, Y)] = \int_{\mathbf{y}} h(\theta, \mathbf{y})\, f_Y(\mathbf{y})\, d\mathbf{y}$ is now a deterministic function of $\theta$ that could be maximized if desired.
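Since this analogy is the conceptual heart of the E-step, a minimal numerical sketch may help; the particular choice of $h$ and the small discrete distribution for $Y$ below are arbitrary illustrations, not anything from the derivation above.

import numpy as np

# h(theta, y) is a function of two variables; averaging over a random Y
# leaves a deterministic function q(theta) that can be maximized directly.
def h(theta, y):
    return -(theta - y) ** 2  # an arbitrary illustrative choice

y_values = np.array([0.0, 1.0, 2.0])  # support of Y
f_Y = np.array([0.5, 0.3, 0.2])       # distribution of Y (sums to 1)

def q(theta):
    # q(theta) = E_Y[h(theta, Y)] = sum over y of h(theta, y) * f_Y(y)
    return np.sum(h(theta, y_values) * f_Y)

# q is an ordinary function of theta; here we maximize it by a grid scan.
thetas = np.linspace(-1.0, 3.0, 401)
best = thetas[np.argmax([q(t) for t in thetas])]
print(best)  # 0.7, the mean of Y, as expected for this choice of h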
The evaluation of this expectation is called the E-step of the algorithm. Notice the meaning of the two arguments in the function $Q(\Theta, \Theta^{(i-1)})$. The first argument $\Theta$ corresponds to the parameters that ultimately will be optimized in an attempt to maximize the likelihood. The second argument $\Theta^{(i-1)}$ corresponds to the parameters that we use to evaluate the expectation.
The second step (the M-step) of the EM algorithm is to maximize the expectation we computed in the first step. That is, we find:
$$\Theta^{(i)} = \operatorname*{argmax}_{\Theta}\; Q(\Theta, \Theta^{(i-1)}).$$
These two steps are repeated as necessary. Each iteration is guaranteed to increase the log-
likelihood and the algorithm is guaranteed to converge to a local maximum of the likelihood func-
tion. There are many rate-of-convergence papers (e.g., [ALR77, RW84, Wu83, JX96, XJ96]) but
we will not discuss them here.
Recall that $E[h(\theta, Y)|X = \mathbf{x}] = \int_{\mathbf{y}} h(\theta, \mathbf{y})\, f_{Y|X}(\mathbf{y}|\mathbf{x})\, d\mathbf{y}$. In the following discussion, we drop the subscripts from different density functions since argument usage should disambiguate different ones.
A modified form of the M-step is, instead of maximizing $Q(\Theta, \Theta^{(i-1)})$, to find some $\Theta^{(i)}$ such that $Q(\Theta^{(i)}, \Theta^{(i-1)}) > Q(\Theta^{(i-1)}, \Theta^{(i-1)})$. This form of the algorithm is called Generalized EM (GEM) and is also guaranteed to converge.
As presented above, it’s not clear how exactly to “code up” the algorithm. This is the way,
however, that the algorithm is presented in its most general form. The details of the steps required
to compute the given quantities are very dependent on the particular application so they are not
discussed when the algorithm is presented in this abstract form.
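Even so, the overall control flow can be sketched. The following skeleton assumes the application supplies an e_step that computes whatever statistics define $Q(\cdot, \Theta^{(i-1)})$ and an m_step that maximizes it; these function names, their interface, and the convergence test are illustrative assumptions, not part of the algorithm's general statement.

import numpy as np

def em(X, theta0, e_step, m_step, tol=1e-6, max_iter=200):
    # Generic EM loop: the application-specific work lives in e_step and
    # m_step. e_step(X, theta) returns (stats, log_likelihood), where stats
    # are the expected sufficient statistics defining Q(., theta);
    # m_step(X, stats) returns the theta maximizing that Q.
    theta = theta0
    prev_ll = -np.inf
    for _ in range(max_iter):
        stats, ll = e_step(X, theta)   # E-step under the current estimate
        theta = m_step(X, stats)       # M-step: maximize the expected
                                       # complete-data log-likelihood
        # EM never decreases the incomplete-data log-likelihood, so a
        # small improvement is a reasonable (local) stopping criterion.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta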
3 Finding Maximum Likelihood Mixture Densities Parameters via EM
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community. In this case, we assume the following probabilistic model:
$$p(\mathbf{x}|\Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\mathbf{x}|\theta_i)$$
where the parameters are $\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$ such that $\sum_{i=1}^{M} \alpha_i = 1$ and each $p_i$ is a density function parameterized by $\theta_i$. In other words, we assume we have $M$ component densities mixed together with $M$ mixing coefficients $\alpha_i$.
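The generative reading of this model, and of the hidden labels introduced just below, can be sketched in a few lines; the one-dimensional Gaussian components and all variable names here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

alphas = np.array([0.6, 0.4])  # mixing coefficients (sum to 1)
means  = np.array([0.0, 5.0])  # illustrative Gaussian component parameters
sigmas = np.array([1.0, 0.5])

N = 1000
# Each sample first picks a component y_i with probability alpha_j ...
y = rng.choice(len(alphas), size=N, p=alphas)
# ... then draws x_i from the chosen component density p_{y_i}.
x = rng.normal(means[y], sigmas[y])
# In the estimation problem we observe only x; the labels y are hidden.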
The incomplete-data log-likelihood expression for this density from the data $\mathcal{X}$ is given by:
$$\log \mathcal{L}(\Theta|\mathcal{X}) = \log \prod_{i=1}^{N} p(\mathbf{x}_i|\Theta) = \sum_{i=1}^{N} \log\left(\sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x}_i|\theta_j)\right)$$
which is difficult to optimize because it contains the log of the sum. If we consider $\mathcal{X}$ as incomplete, however, and posit the existence of unobserved data items $\mathcal{Y} = \{y_i\}_{i=1}^{N}$ whose values inform us which component density “generated” each data item, the likelihood expression is significantly simplified. That is, we assume that $y_i \in \{1, \ldots, M\}$ for each $i$, and $y_i = k$ if the $i$th sample was generated by the $k$th mixture component. If we know the values of $\mathcal{Y}$, the log-likelihood becomes:
$$\log \mathcal{L}(\Theta|\mathcal{X}, \mathcal{Y}) = \sum_{i=1}^{N} \log\left(P(\mathbf{x}_i|y_i)\, P(y_i)\right) = \sum_{i=1}^{N} \log\left(\alpha_{y_i}\, p_{y_i}(\mathbf{x}_i|\theta_{y_i})\right)$$
which, given a particular form of the component densities, can be optimized using a variety of techniques.
The problem, of course, is that we do not know the values of $\mathcal{Y}$. If we assume $\mathcal{Y}$ is a random vector, however, we can proceed.
We first must derive an expression for the distribution of the unobserved data. Let's first guess at parameters for the mixture density, i.e., we guess that $\Theta^g = (\alpha_1^g, \ldots, \alpha_M^g, \theta_1^g, \ldots, \theta_M^g)$ are the appropriate parameters for the likelihood $\mathcal{L}(\Theta^g|\mathcal{X}, \mathcal{Y})$. Given $\Theta^g$, we can easily compute $p_j(\mathbf{x}_i|\theta_j^g)$ for each $i$ and $j$. In addition, the mixing parameters $\alpha_j$ can be thought of as prior probabilities of each mixture component, that is, $\alpha_j = p(\text{component } j)$. Therefore, using Bayes's rule, we can compute:
$$p(y_i = j \mid \mathbf{x}_i, \Theta^g) = \frac{\alpha_j^g\, p_j(\mathbf{x}_i|\theta_j^g)}{p(\mathbf{x}_i|\Theta^g)} = \frac{\alpha_j^g\, p_j(\mathbf{x}_i|\theta_j^g)}{\sum_{k=1}^{M} \alpha_k^g\, p_k(\mathbf{x}_i|\theta_k^g)}.$$
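This posterior is exactly what the E-step of the mixture EM computes in practice. A minimal sketch for Gaussian components follows; the function name and the use of scipy's multivariate normal density are illustrative choices, not prescribed by the derivation.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, alphas, means, covs):
    # Computes p(y_i = j | x_i, Theta^g) for every sample i and component j.
    # X: (N, d) data; alphas: (M,) mixing coefficients;
    # means, covs: per-component Gaussian mean vectors and covariances.
    N, M = X.shape[0], len(alphas)
    joint = np.empty((N, M))
    for j in range(M):
        # Numerator of Bayes's rule: alpha_j * p_j(x_i | theta_j^g)
        joint[:, j] = alphas[j] * multivariate_normal.pdf(X, means[j], covs[j])
    # Denominator: the mixture density p(x_i | Theta^g); rows then sum to 1.
    return joint / joint.sum(axis=1, keepdims=True)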