A Maximum Entropy Approach
to Natural Language Processing
Adam L. Berger †
Columbia University
Vincent J. Della Pietra ‡
Renaissance Technologies
Stephen A. Della Pietra ‡
Renaissance Technologies
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only
recently, however, have computers become powerful enough to permit the wide-scale application
of this concept to real-world problems in statistical estimation and pattern recognition. In this
paper, we describe a method for statistical modeling based on maximum entropy. We present
a maximum-likelihood approach for automatically constructing maximum entropy models and
describe how to implement this approach efficiently, using as examples several problems in natural
language processing.
1. Introduction
Statistical modeling addresses the problem of constructing a stochastic model to predict
the behavior of a random process. In constructing this model, we typically have at our
disposal a sample of output from the process. Given this sample, which represents an
incomplete state of knowledge about the process, the modeling problem is to parlay
this knowledge into a representation of the process. We can then use this representation
to make predictions about the future behavior of the process.
Baseball managers (who rank among the better paid statistical modelers) employ
batting averages, compiled from a history of at-bats, to gauge the likelihood that a
player will succeed in his next appearance at the plate. Thus informed, they manipulate
their lineups accordingly. Wall Street speculators (who rank among the best paid
statistical modelers) build models based on past stock price movements to predict
tomorrow's fluctuations and alter their portfolios to capitalize on the predicted future.
At the other end of the pay scale reside natural language researchers, who design
language and acoustic models for use in speech recognition systems and related ap-
plications.
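The batting-average example can be made concrete: the relative frequency of an outcome in the observed sample serves as a simple (maximum-likelihood) estimate of its probability. The following sketch is purely illustrative; the function name and toy data are assumptions, not from the paper.

```python
def empirical_estimate(sample, outcome):
    """Estimate P(outcome) as its relative frequency in the observed sample."""
    return sample.count(outcome) / len(sample)

# A toy history of at-bats: 'H' = hit, 'O' = out.
at_bats = ['H', 'O', 'O', 'H', 'O', 'O', 'O', 'H', 'O', 'O']
print(empirical_estimate(at_bats, 'H'))  # 3 hits in 10 at-bats -> 0.3
```

A manager using this estimate is, in effect, summarizing an incomplete sample of the process; the rest of the paper develops a principled way (maximum entropy) to build such models when many partial statistics must be combined.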
The past few decades have witnessed significant progress toward increasing the
predictive capacity of statistical models of natural language. In language modeling, for
instance, Bahl et al. (1989) have used decision tree models and Della Pietra et al. (1994)
have used automatically inferred link grammars to model long-range correlations in
language. In parsing, Black et al. (1992) have described how to extract grammatical
* This research, supported in part by ARPA under grant ONR N00014-91-C-0135, was conducted while
the authors were at the IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598.
† Now at Computer Science Department, Columbia University.
‡ Now at Renaissance Technologies, Stony Brook, NY.
© 1996 Association for Computational Linguistics