Foundations of Statistical Natural Language Processing
Christopher D. Manning
Hinrich Schütze
The MIT Press
Cambridge, Massachusetts
London, England
Second printing, 1999
© 1999 Massachusetts Institute of Technology
Second printing with corrections, 2000
All rights reserved. No part of this book may be reproduced in any form by any
electronic or mechanical means (including photocopying, recording, or information
storage and retrieval) without permission in writing from the publisher.
Typeset in 10/13 Lucida Bright by the authors using LaTeX2e.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Information
Manning, Christopher D.
Foundations of statistical natural language processing / Christopher D.
Manning, Hinrich Schütze.
p. cm.
Includes bibliographical references (p. ) and index.
ISBN 0-262-13360-1
1. Computational linguistics-Statistical methods. I. Schütze, Hinrich.
II. Title.
P98.5.S83M36 1999
410’.285-dc21 99-21137
CIP
Brief Contents
I Preliminaries 1
1 Introduction 3
2 Mathematical Foundations 39
3 Linguistic Essentials 81
4 Corpus-Based Work 117
II Words 149
5 Collocations 151
6 Statistical Inference: n-gram Models over Sparse Data 191
7 Word Sense Disambiguation 229
8 Lexical Acquisition 265
III Grammar 315
9 Markov Models 317
10 Part-of-Speech Tagging 341
11 Probabilistic Context Free Grammars 381
12 Probabilistic Parsing 407
IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14 Clustering 495
15 Topics in Information Retrieval 529
16 Text Categorization 575
Contents
List of Tables xv
List of Figures xxi
Table of Notations xxv
Preface xxix
Road Map xxxv
I Preliminaries 1
1 Introduction 3
1.1 Rationalist and Empiricist Approaches to Language 4
1.2 Scientific Content 7
1.2.1 Questions that linguistics should answer 8
1.2.2 Non-categorical phenomena in language 11
1.2.3 Language and cognition as probabilistic phenomena 15
1.3 The Ambiguity of Language: Why NLP Is Difficult 17
1.4 Dirty Hands 19
1.4.1 Lexical resources 19
1.4.2 Word counts 20
1.4.3 Zipf’s laws 23
1.4.4 Collocations 29
1.4.5 Concordances 31
1.5 Further Reading 34
1.6 Exercises 35
2 Mathematical Foundations 39
2.1 Elementary Probability Theory 40
2.1.1 Probability spaces 40
2.1.2 Conditional probability and independence 42
2.1.3 Bayes’ theorem 43
2.1.4 Random variables 45
2.1.5 Expectation and variance 46
2.1.6 Notation 47
2.1.7 Joint and conditional distributions 48
2.1.8 Determining P 48
2.1.9 Standard distributions 50
2.1.10 Bayesian statistics 54
2.1.11 Exercises 59
2.2 Essential Information Theory 60
2.2.1 Entropy 61
2.2.2 Joint entropy and conditional entropy 63
2.2.3 Mutual information 66
2.2.4 The noisy channel model 68
2.2.5 Relative entropy or Kullback-Leibler divergence 72
2.2.6 The relation to language: Cross entropy 73
2.2.7 The entropy of English 76
2.2.8 Perplexity 78
2.2.9 Exercises 78
2.3 Further Reading 79
3 Linguistic Essentials 81
3.1 Parts of Speech and Morphology 81
3.1.1 Nouns and pronouns 83
3.1.2 Words that accompany nouns: Determiners and adjectives 87
3.1.3 Verbs 88
3.1.4 Other parts of speech 91
3.2 Phrase Structure 93
3.2.1 Phrase structure grammars 96
3.2.2 Dependency: Arguments and adjuncts 101
3.2.3 X’ theory 106
3.2.4 Phrase structure ambiguity 107