PRIMER

What is the expectation maximization algorithm?

Chuong B Do & Serafim Batzoglou

The expectation maximization algorithm arises in many computational biology applications that involve probabilistic models. What is it good for, and how does it work?

Chuong B. Do and Serafim Batzoglou are in the Computer Science Department, Stanford University, 318 Campus Drive, Stanford, California 94305-5428, USA. e-mail: [email protected]
Probabilistic models, such as hidden Markov models or Bayesian networks, are commonly used to model biological data. Much of their popularity can be attributed to the existence of efficient and robust procedures for learning parameters from observations. Often, however, the only data available for training a probabilistic model are incomplete. Missing values can occur, for example, in medical diagnosis, where patient histories generally include results from a limited battery of tests. Alternatively, in gene expression clustering, incomplete data arise from the intentional omission of gene-to-cluster assignments in the probabilistic model. The expectation maximization algorithm enables parameter estimation in probabilistic models with incomplete data.

A coin-flipping experiment

As an example, consider a simple coin-flipping experiment in which we are given a pair of coins A and B of unknown biases, $\theta_A$ and $\theta_B$, respectively (that is, on any given flip, coin A will land on heads with probability $\theta_A$ and tails with probability $1 - \theta_A$, and similarly for coin B). Our goal is to estimate $\theta = (\theta_A, \theta_B)$ by repeating the following procedure five times: randomly choose one of the two coins (with equal probability), and perform ten independent coin tosses with the selected coin. Thus, the entire procedure involves a total of 50 coin tosses (Fig. 1a).

During our experiment, suppose that we keep track of two vectors $x = (x_1, x_2, \ldots, x_5)$ and $z = (z_1, z_2, \ldots, z_5)$, where $x_i \in \{0, 1, \ldots, 10\}$ is the number of heads observed during the $i$th set of tosses, and $z_i \in \{A, B\}$ is the identity of the coin used during the $i$th set of tosses. Parameter estimation in this setting is known as the complete data case, in that the values of all relevant random variables in our model (that is, the result of each coin flip and the type of coin used for each flip) are known.
Here, a simple way to estimate $\theta_A$ and $\theta_B$ is to return the observed proportions of heads for each coin:

$$\hat{\theta}_A = \frac{\text{\# of heads using coin A}}{\text{total \# of flips using coin A}} \quad \text{and} \quad \hat{\theta}_B = \frac{\text{\# of heads using coin B}}{\text{total \# of flips using coin B}} \tag{1}$$

This intuitive guess is, in fact, known in the statistical literature as maximum likelihood estimation (roughly speaking, the maximum likelihood method assesses the quality of a statistical model based on the probability it assigns to the observed data). If $\log P(x, z; \theta)$ is the logarithm of the joint probability (or log-likelihood) of obtaining any particular vector of observed head counts $x$ and coin types $z$, then the formulas in (1) solve for the parameters $\hat{\theta} = (\hat{\theta}_A, \hat{\theta}_B)$ that maximize $\log P(x, z; \theta)$.
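To make equation (1) concrete, here is a minimal Python sketch of complete-data maximum likelihood estimation. The head counts and coin assignments below are made up for illustration (they are not taken from the figure), and the variable names are our own.

```python
import numpy as np

# Hypothetical complete data for the five sets of ten tosses:
# x[i] = number of heads in set i, z[i] = coin used for set i.
x = np.array([5, 9, 8, 4, 7])            # observed head counts (illustrative)
z = np.array(["B", "A", "A", "B", "A"])  # coin identities (illustrative)
n = 10                                   # tosses per set

# Equation (1): observed proportion of heads among all flips with each coin.
theta_A_hat = x[z == "A"].sum() / (n * (z == "A").sum())
theta_B_hat = x[z == "B"].sum() / (n * (z == "B").sum())

print(theta_A_hat, theta_B_hat)  # 0.8 and 0.45 for these counts
```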
Now consider a more challenging variant of the parameter estimation problem in which we are given the recorded head counts $x$ but not the identities $z$ of the coins used for each set of tosses. We refer to $z$ as hidden variables or latent factors. Parameter estimation in this new setting is known as the incomplete data case. This time, computing proportions of heads for each coin is no longer possible, because we don't know the coin used for each set of tosses. However, if we had some way of completing the data (in our case, guessing correctly which coin was used in each of the five sets), then we could reduce parameter estimation for this problem with incomplete data to maximum likelihood estimation with complete data.
One iterative scheme for obtaining completions could work as follows: starting from some initial parameters, $\hat{\theta}^{(t)} = (\hat{\theta}_A^{(t)}, \hat{\theta}_B^{(t)})$, determine for each of the five sets whether coin A or coin B was more likely to have generated the observed flips (using the current parameter estimates). Then, assume these completions (that is, guessed coin assignments) to be correct, and apply the regular maximum likelihood estimation procedure to get $\hat{\theta}^{(t+1)}$. Finally, repeat these two steps until convergence. As the estimated model improves, so too will the quality of the resulting completions.
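One way to picture this hard-assignment scheme in code: each round assigns every set of tosses to whichever coin makes its head count more likely under a binomial model, then re-estimates each bias from its assigned sets via equation (1). The function below is a hypothetical sketch reusing the illustrative data from the previous snippet; names and starting values are our own.

```python
from scipy.stats import binom

def hard_em_step(x, n, theta_A, theta_B):
    """One round: guess each set's coin, then redo the estimate of equation (1)."""
    # Completion step: assign each set to the coin that makes its head
    # count more likely under a binomial(n, theta) model.
    z_guess = ["A" if binom.pmf(k, n, theta_A) >= binom.pmf(k, n, theta_B)
               else "B" for k in x]
    # Maximum likelihood step on the guessed completions (this sketch
    # assumes each coin gets at least one set, so no division by zero).
    heads_A = sum(k for k, c in zip(x, z_guess) if c == "A")
    heads_B = sum(k for k, c in zip(x, z_guess) if c == "B")
    flips_A, flips_B = n * z_guess.count("A"), n * z_guess.count("B")
    return heads_A / flips_A, heads_B / flips_B

# Repeat the two steps until the estimates stop changing.
theta_A, theta_B = 0.6, 0.5  # arbitrary initial parameters
for _ in range(20):
    theta_A, theta_B = hard_em_step([5, 9, 8, 4, 7], 10, theta_A, theta_B)
```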
The expectation maximization algorithm is a refinement on this basic idea. Rather than picking the single most likely completion of the missing coin assignments on each iteration, the expectation maximization algorithm computes probabilities for each possible completion of the missing data, using the current parameters $\hat{\theta}^{(t)}$. These probabilities are used to create a weighted training set consisting of all possible completions of the data. Finally, a modified version of maximum likelihood estimation that deals with weighted training examples provides new parameter estimates, $\hat{\theta}^{(t+1)}$. By using weighted training examples rather than choosing the single best completion, the expectation maximization algorithm accounts for the confidence of the model in each completion of the data (Fig. 1b).
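A minimal sketch of one full expectation maximization iteration for this example, under the stated assumption that each coin is chosen with equal probability. The completion-probability step weights each set by the posterior probability that it came from coin A, which is proportional to $\binom{10}{x_i} \hat{\theta}_A^{x_i} (1-\hat{\theta}_A)^{10-x_i}$ (and analogously for coin B); the re-estimation step then applies equation (1) to the expected head and tail counts. Function names, data and starting values are again our own.

```python
from scipy.stats import binom

def em_step(x, n, theta_A, theta_B):
    """One expectation maximization iteration for the two-coin example."""
    heads_A = tails_A = heads_B = tails_B = 0.0
    for k in x:
        # Posterior probability that this set used coin A, given an equal
        # prior probability of choosing either coin (the shared 1/2 cancels).
        like_A = binom.pmf(k, n, theta_A)
        like_B = binom.pmf(k, n, theta_B)
        w_A = like_A / (like_A + like_B)
        w_B = 1.0 - w_A
        # Accumulate expected head/tail counts for the weighted training set.
        heads_A += w_A * k
        tails_A += w_A * (n - k)
        heads_B += w_B * k
        tails_B += w_B * (n - k)
    # Re-estimation step: maximum likelihood estimates from expected counts.
    return heads_A / (heads_A + tails_A), heads_B / (heads_B + tails_B)

# Iterate from an arbitrary starting point until the estimates stabilize.
theta_A, theta_B = 0.6, 0.5
for _ in range(50):
    theta_A, theta_B = em_step([5, 9, 8, 4, 7], 10, theta_A, theta_B)
```

Unlike the hard-assignment sketch above, every set contributes fractionally to both coins' estimates, in proportion to the model's confidence in each completion.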
In summary, the expectation maximization algorithm alternates between the steps of guessing a probability distribution over completions of the missing data given the current model (the expectation step, or E-step) and then re-estimating the model parameters using these completions (the maximization step, or M-step).