Foundations and Trends® in Machine Learning
Vol. 4, No. 4 (2011) 267–373
© 2012 C. Sutton and A. McCallum
DOI: 10.1561/2200000013
An Introduction to Conditional Random Fields
By Charles Sutton and Andrew McCallum
Contents
1 Introduction
1.1 Implementation Details
2 Modeling
2.1 Graphical Modeling
2.2 Generative versus Discriminative Models
2.3 Linear-chain CRFs
2.4 General CRFs
2.5 Feature Engineering
2.6 Examples
2.7 Applications of CRFs
2.8 Notes on Terminology
3 Overview of Algorithms
4 Inference
4.1 Linear-Chain CRFs
4.2 Inference in Graphical Models
4.3 Implementation Concerns
5 Parameter Estimation
5.1 Maximum Likelihood
5.2 Stochastic Gradient Methods
5.3 Parallelism
5.4 Approximate Training
5.5 Implementation Concerns
6 Related Work and Future Directions
6.1 Related Work
6.2 Frontier Areas
Acknowledgments
References
An Introduction to Conditional Random Fields
Charles Sutton¹ and Andrew McCallum²
¹ School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK, csutton@inf.ed.ac.uk
² Department of Computer Science, University of Massachusetts, Amherst, MA, 01003, USA, mccallum@cs.umass.edu
Abstract
Many tasks involve predicting a large number of variables that depend on each other as well as on other observed variables. Structured prediction methods are essentially a combination of classification and graphical modeling. They combine the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. This survey describes conditional random fields (CRFs), a popular probabilistic method for structured prediction. CRFs have seen wide application in many areas, including natural language processing, computer vision, and bioinformatics. We describe methods for inference and parameter estimation for CRFs, including practical issues for implementing large-scale CRFs. We do not assume previous knowledge of graphical modeling, so this survey is intended to be useful to practitioners in a wide variety of fields.
1 Introduction
Fundamental to many applications is the ability to predict multiple variables that depend on each other. Such applications are as diverse as classifying regions of an image [49, 61, 69], estimating the score in a game of Go [130], segmenting genes in a strand of DNA [7], and syntactic parsing of natural-language text [144]. In such applications, we wish to predict an output vector y = {y_0, y_1, ..., y_T} of random variables given an observed feature vector x. A relatively simple example from natural-language processing is part-of-speech tagging, in which each variable y_s is the part-of-speech tag of the word at position s, and the input x is divided into feature vectors {x_0, x_1, ..., x_T}. Each x_s contains various information about the word at position s, such as its identity, orthographic features such as prefixes and suffixes, membership in domain-specific lexicons, and information in semantic databases such as WordNet.
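To make the per-position feature vectors x_s concrete, the following is a minimal sketch of how such features might be computed for a single token. It is illustrative only: the token_features helper, the feature names, and the lexicon argument are hypothetical, and a real tagger would typically use many more features (capitalization patterns, neighboring words, WordNet lookups, and so on).

def token_features(words, s, lexicon=frozenset()):
    """Return a sparse feature map x_s for the word at position s."""
    w = words[s]
    feats = {
        "identity=" + w.lower(): 1.0,           # word identity
        "prefix3=" + w[:3].lower(): 1.0,        # orthographic prefix
        "suffix3=" + w[-3:].lower(): 1.0,       # orthographic suffix
        "is_capitalized": float(w[0].isupper()),
    }
    if w.lower() in lexicon:                    # domain-specific lexicon membership
        feats["in_lexicon"] = 1.0
    return feats

# Example: the feature map for the word at position 1 of a short sentence.
print(token_features(["The", "cat", "sat"], 1))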
One approach to this multivariate prediction problem, especially if our goal is to maximize the number of labels y_s that are correctly classified, is to learn an independent per-position classifier that maps x → y_s for each s. The difficulty, however, is that the output variables have complex dependencies. For example, in English adjectives do not usually follow nouns, and in computer vision, neighboring regions in an image tend to have similar labels. Another difficulty is that the output variables may represent a complex structure such as a parse tree, in which a choice of what grammar rule to use near the top of the tree can have a large effect on the rest of the tree.
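As a point of reference, here is a minimal sketch of what the independent per-position baseline looks like in practice: one classifier is trained over per-token features, and each tag y_s is predicted separately. The toy data, the simple_features helper, and the use of scikit-learn's logistic regression are assumptions made purely for illustration; any per-token classifier could play the same role.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def simple_features(words, s):
    # Deliberately tiny per-position feature map; see the earlier sketch for richer features.
    w = words[s]
    return {"identity=" + w.lower(): 1.0, "suffix2=" + w[-2:].lower(): 1.0}

train = [(["The", "cat", "sat"], ["DET", "NOUN", "VERB"]),
         (["A", "dog", "barked"], ["DET", "NOUN", "VERB"])]

# Flatten the sentences: every position s becomes an independent example (x_s, y_s).
X = [simple_features(words, s) for words, tags in train for s in range(len(words))]
y = [tags[s] for words, tags in train for s in range(len(words))]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=200).fit(vec.fit_transform(X), y)

# Tags are predicted one position at a time; the model cannot express the
# dependencies between neighboring tags discussed above.
test = ["The", "dog", "sat"]
print(clf.predict(vec.transform([simple_features(test, s) for s in range(len(test))])))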
A natural way to represent the manner in which output variables depend on each other is provided by graphical models. Graphical models — which include such diverse model families as Bayesian networks, neural networks, factor graphs, Markov random fields, Ising models, and others — represent a complex distribution over many variables as a product of local factors on smaller subsets of variables. It is then possible to describe how a given factorization of the probability density corresponds to a particular set of conditional independence relationships satisfied by the distribution. This correspondence makes modeling much more convenient because often our knowledge of the domain suggests reasonable conditional independence assumptions, which then determine our choice of factors.
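To make the phrase "product of local factors" concrete, the factorization idea can be written as the display below. The index set F, the variable subsets y_a, and the factors Ψ_a are generic notation introduced here for illustration rather than definitions taken from a particular model in the text:

\[
  p(y_1, \ldots, y_T) \;=\; \frac{1}{Z} \prod_{a \in F} \Psi_a(\mathbf{y}_a),
  \qquad
  Z \;=\; \sum_{y_1, \ldots, y_T} \prod_{a \in F} \Psi_a(\mathbf{y}_a),
\]

where each Ψ_a ≥ 0 is a local factor that depends only on the small subset of variables y_a, and Z is the normalizing constant that makes the product a proper distribution.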
Much work in learning with graphical models, especially in statistical natural-language processing, has focused on generative models that explicitly attempt to model a joint probability distribution p(y, x) over inputs and outputs. Although this approach has advantages, it also has important limitations. Not only can the dimensionality of x be very large, but the features may have complex dependencies, so constructing a probability distribution over them is difficult. Modeling the dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance.
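The trade-off can be summarized by the standard chain-rule identity below, which is added here for clarity and is not an equation from the original text. A generative model must fit both factors on the right-hand side, including a distribution over the inputs themselves:

\[
  p(\mathbf{y}, \mathbf{x}) \;=\; p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x}),
\]

so a method that fits only p(y | x) never has to construct a probability distribution over the high-dimensional, possibly dependent features x.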
A solution to this problem is a discriminative approach, similar to that taken in classifiers such as logistic regression. Here we model the conditional distribution p(y|x) directly, which is all that is needed for classification. This is the approach taken by conditional random fields (CRFs). CRFs are essentially a way of combining the advantages of discriminative classification and graphical modeling, combining the ability to compactly model multivariate outputs y with the ability to leverage a large number of input features x for prediction. The advantage to a conditional model is that dependencies that involve only variables in x play no role in the conditional model, so that an accurate conditional