A Maximum Entropy Approach
to Natural Language Processing
Adam L. Berger †
Columbia University
Vincent J. Della Pietra ‡
Renaissance Technologies
Stephen A. Della Pietra ‡
Renaissance Technologies
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only
recently, however, have computers become powerful enough to permit the wide-scale application
of this concept to real-world problems in statistical estimation and pattern recognition. In this
paper, we describe a method for statistical modeling based on maximum entropy. We present
a maximum-likelihood approach for automatically constructing maximum entropy models and
describe how to implement this approach efficiently, using as examples several problems in natural
language processing.
1. Introduction
Statistical modeling addresses the problem of constructing a stochastic model to predict
the behavior of a random process. In constructing this model, we typically have at our
disposal a sample of output from the process. Given this sample, which represents an
incomplete state of knowledge about the process, the modeling problem is to parlay
this knowledge into a representation of the process. We can then use this representation
to make predictions about the future behavior of the process.
Baseball managers (who rank among the better paid statistical modelers) employ
batting averages, compiled from a history of at-bats, to gauge the likelihood that a
player will succeed in his next appearance at the plate. Thus informed, they manipulate
their lineups accordingly. Wall Street speculators (who rank among the best paid
statistical modelers) build models based on past stock price movements to predict
tomorrow's fluctuations and alter their portfolios to capitalize on the predicted future.
At the other end of the pay scale reside natural language researchers, who design
language and acoustic models for use in speech recognition systems and related ap-
plications.
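The batting-average example can be made concrete: the relative frequency of an outcome in the observed sample serves as a simple (maximum-likelihood) estimate of its probability. The following sketch is purely illustrative; the function name and toy data are assumptions, not from the paper.

```python
def empirical_estimate(sample, outcome):
    """Estimate P(outcome) as its relative frequency in the observed sample."""
    return sample.count(outcome) / len(sample)

# A toy history of at-bats: 'H' = hit, 'O' = out.
at_bats = ['H', 'O', 'O', 'H', 'O', 'O', 'O', 'H', 'O', 'O']
print(empirical_estimate(at_bats, 'H'))  # 3 hits in 10 at-bats -> 0.3
```

A manager using this estimate is, in effect, summarizing an incomplete sample of the process; the rest of the paper develops a principled way (maximum entropy) to build such models when many partial statistics must be combined.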
The past few decades have witnessed significant progress toward increasing the
predictive capacity of statistical models of natural language. In language modeling, for
instance, Bahl et al. (1989) have used decision tree models and Della Pietra et al. (1994)
have used automatically inferred link grammars to model long-range correlations in
language. In parsing, Black et al. (1992) have described how to extract grammatical
* This research, supported in part by ARPA under grant ONR N00014-91-C-0135, was conducted while
the authors were at the IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598.
† Now at Computer Science Department, Columbia University.
‡ Now at Renaissance Technologies, Stony Brook, NY.
© 1996 Association for Computational Linguistics