Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2016. All
rights reserved. Draft of August 7, 2017.
CHAPTER 21
Information Extraction
I am the very model of a modern Major-General,
I’ve information vegetable, animal, and mineral,
I know the kings of England, and I quote the fights historical
From Marathon to Waterloo, in order categorical...
Gilbert and Sullivan, Pirates of Penzance
Imagine that you are an analyst with an investment firm that tracks airline stocks.
You’re given the task of determining the relationship (if any) between airline
announcements of fare increases and the behavior of their stocks the next day.
Historical data about stock prices is easy to come by, but what about the airline
announcements? You will need to know at least the name of the airline, the nature of
the proposed fare hike, the dates of the announcement, and possibly the response of
other airlines. Fortunately, these can all be found in news articles like this one:
Citing high fuel prices, United Airlines said Friday it has increased fares
by $6 per round trip on flights to some cities also served by lower-cost
carriers. American Airlines, a unit of AMR Corp., immediately
matched the move, spokesman Tim Wagner said. United, a unit of UAL
Corp., said the increase took effect Thursday and applies to most routes
where it competes against discount carriers, such as Chicago to Dallas
and Denver to San Francisco.
This chapter presents techniques for extracting limited kinds of semantic content
from text. This process of information extraction (IE) turns the unstructured
information embedded in texts into structured data, for example for populating a
relational database to enable further processing.
The first step in most IE tasks is to find the proper names or named entities
mentioned in a text. The task of named entity recognition (NER) is to find each
mention of a named entity in the text and label its type. What constitutes a named
entity type is application specific; these commonly include people, places, and
organizations but also more specific entities from the names of genes and proteins
(Cohen and Demner-Fushman, 2014) to the names of college courses (McCallum,
2005).
Having located all of the mentions of named entities in a text, it is useful to
link, or cluster, these mentions into sets that correspond to the entities behind the
mentions, for example inferring that mentions of United Airlines and United in the
sample text refer to the same real-world entity. We’ll defer discussion of this task of
coreference resolution until Chapter 23.
The task of relation extraction is to find and classify semantic relations among
the text entities, often binary relations like spouse-of, child-of, employment,
part-whole, membership, and geospatial relations. Relation extraction has close
links to populating a relational database.
The task of event extraction is to find events in which these entities participate,
like, in our sample text, the fare increases by United and American and the reporting
events said and cite. We’ll also need to perform event coreference to figure out
which of the many event mentions in a text refer to the same event; in our running
example the two instances of increase and the phrase the move all refer to the same
event.
To figure out when the events in a text happened we’ll do recognition of temporal
expressions like days of the week (Friday and Thursday), months, holidays, etc.;
relative expressions like two days from now or next year; and times such as 3:30
P.M. or noon. The problem of temporal expression normalization is to map these
temporal expressions onto specific calendar dates or times of day to situate events
in time. In our sample task, this will allow us to link Friday to the time of United’s
announcement, and Thursday to the previous day’s fare increase, and produce a
timeline in which United’s announcement follows the fare increase and American’s
announcement follows both of those events.
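As a rough illustration of the normalization step, the sketch below resolves a bare
weekday name to a calendar date relative to a document date. The function name
resolve_past_weekday, the rule "a weekday in a news report refers to the most recent
such day on or before the document date", and the document date of 2006-10-27 are all
assumptions made for this example; the chapter itself only gives 2006-10-26 as the
effective date in its filled template.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_past_weekday(name: str, doc_date: date) -> date:
    """Map a weekday name to the latest matching date on or before doc_date.

    Assumption for this sketch: news text uses bare weekday names for the
    recent past, so we never look forward in time.
    """
    target = WEEKDAYS.index(name.lower())        # Monday == 0, as in date.weekday()
    days_back = (doc_date.weekday() - target) % 7
    return doc_date - timedelta(days=days_back)


# Hypothetical document date (a Friday), chosen to be consistent with the
# 2006-10-26 effective date in the chapter's template.
doc_date = date(2006, 10, 27)
announcement = resolve_past_weekday("Friday", doc_date)
fare_increase = resolve_past_weekday("Thursday", doc_date)
print(announcement.isoformat(), fare_increase.isoformat())
```

A real normalizer would also need the relative expressions (two days from now, next
year) and times of day mentioned above; this sketch covers weekday names only.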
Finally, many texts describe recurring stereotypical situations. The task of
template filling is to find such situations in documents and fill the template slots
with appropriate material. These slot-fillers may consist of text segments extracted
directly from the text, or concepts like times, amounts, or ontology entities that
have been inferred from text elements through additional processing.
Our airline text is an example of this kind of stereotypical situation since airlines
often raise fares and then wait to see if competitors follow along. In this
situation, we can identify United as a lead airline that initially raised its fares,
$6 as the amount, Thursday as the increase date, and American as an airline that
followed along, leading to a filled template like the following.
FARE-RAISE ATTEMPT:
LEAD AIRLINE: UNITED AIRLINES
AMOUNT: $6
EFFECTIVE DATE: 2006-10-26
FOLLOWER: AMERICAN AIRLINES
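One possible way to hold such a template in code is a small record type with one
field per slot. This is only a sketch: the class name FareRaiseAttempt and the field
names are chosen here to mirror the slots above, and the choice to make the date and
follower optional is an assumption of this example, not something the chapter
specifies.

```python
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class FareRaiseAttempt:
    """One filled FARE-RAISE ATTEMPT template (field names are illustrative)."""
    lead_airline: str
    amount: str                            # kept as extracted text, e.g. "$6"
    effective_date: Optional[date] = None  # a normalized temporal expression
    follower: Optional[str] = None         # may be absent if no airline followed


template = FareRaiseAttempt(
    lead_airline="UNITED AIRLINES",
    amount="$6",
    effective_date=date(2006, 10, 26),
    follower="AMERICAN AIRLINES",
)
print(asdict(template))
```

Storing the date as a date object rather than the string Thursday reflects the point
above that some slot fillers are inferred through additional processing rather than
copied from the text.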
The following sections review current approaches to each of these problems.
21.1 Named Entity Recognition
The first step in information extraction is to detect the entities in the text. A named
entity is, roughly speaking, anything that can be referred to with a proper name:
a person, a location, an organization. The term is commonly extended to include
things that aren’t entities per se, including dates, times, and other kinds of temporal
expressions, and even numerical expressions like prices. Here’s the sample text
introduced earlier with the named entities marked:
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it
has increased fares by [MONEY $6] per round trip on flights to some
cities also served by lower-cost carriers. [ORG American Airlines], a
unit of [ORG AMR Corp.], immediately matched the move, spokesman
[PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.],
said the increase took effect [TIME Thursday] and applies to most
routes where it competes against discount carriers, such as [LOC Chicago]
to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
The text contains 13 mentions of named entities including 5 organizations, 4
locations, 2 times, 1 person, and 1 mention of money.
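The bracketed notation is regular enough to be read mechanically. The sketch below
pulls the mentions out of the annotated sample with a regular expression and tallies
them by type. The pattern and the helper name extract_mentions are assumptions of
this example, and the pattern only handles flat, non-nested brackets like the ones
used here.

```python
import re
from collections import Counter

ANNOTATED = (
    "Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it "
    "has increased fares by [MONEY $6] per round trip on flights to some "
    "cities also served by lower-cost carriers. [ORG American Airlines], a "
    "unit of [ORG AMR Corp.], immediately matched the move, spokesman "
    "[PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], "
    "said the increase took effect [TIME Thursday] and applies to most "
    "routes where it competes against discount carriers, such as "
    "[LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco]."
)

# "[" + upper-case type label + space + mention text (anything but "]") + "]"
MENTION = re.compile(r"\[([A-Z]+) ([^\]]+)\]")


def extract_mentions(annotated: str) -> list[tuple[str, str]]:
    """Return (mention text, entity type) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in MENTION.finditer(annotated)]


mentions = extract_mentions(ANNOTATED)
counts = Counter(label for _, label in mentions)
print(len(mentions), dict(counts))
```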
In addition to their use in extracting events and the relationship between
participants, named entities are useful for many other language processing tasks. In
sentiment analysis we might want to know a consumer’s sentiment toward a particular
entity. Entities are a useful first stage in question answering, or for linking text
to information in structured knowledge sources like Wikipedia.
Figure 21.1 shows typical generic named entity types. Many applications will
also need to use specific entity types like proteins, genes, commercial products, or
works of art.
Type                  Tag  Sample Categories             Example sentences
People                PER  people, characters            Turing is a giant of computer science.
Organization          ORG  companies, sports teams       The IPCC warned about the cyclone.
Location              LOC  regions, mountains, seas      The Mt. Sanitas loop is in Sunshine Canyon.
Geo-Political Entity  GPE  countries, states, provinces  Palo Alto is raising the fees for parking.
Facility              FAC  bridges, buildings, airports  Consider the Tappan Zee Bridge.
Vehicles              VEH  planes, trains, automobiles   It was a classic Ford Falcon.
Figure 21.1 A list of generic named entity types with the kinds of entities they refer to.
Named entity recognition means finding spans of text that constitute proper
names and then classifying the type of the entity. Recognition is difficult partly
because of the ambiguity of segmentation; we need to decide what’s an entity and what
isn’t, and where the boundaries are. Another difficulty is caused by type ambiguity.
The mention JFK can refer to a person, the airport in New York, or any number of
schools, bridges, and streets around the United States. Some examples of this kind
of cross-type confusion are given in Figures 21.2 and 21.3.
Name           Possible Categories
Washington     Person, Location, Political Entity, Organization, Vehicle
Downing St.    Location, Organization
IRA            Person, Organization, Monetary Instrument
Louis Vuitton  Person, Organization, Commercial Product
Figure 21.2 Common categorical ambiguities associated with various proper names.
[PER Washington] was born into slavery on the farm of James Burroughs.
[ORG Washington] went up 2 games to 1 in the four-game series.
Blair arrived in [LOC Washington] for what may well be his last state visit.
In June, [GPE Washington] passed a primary seatbelt law.
The [VEH Washington] had proved to be a leaky ship, every passage I made...
Figure 21.3 Examples of type ambiguities in the use of the name Washington.
21.1.1 NER as Sequence Labeling
The standard approach to named entity recognition is to treat it as a word-by-word
sequence labeling task, in which the assigned tags capture both the boundary and the
type. A sequence classifier like an MEMM or CRF is trained to label the tokens in a
text with tags that indicate the presence of particular kinds of named entities.
Consider the following simplified excerpt from our running example.
[ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched
the move, spokesman [PER Tim Wagner] said.
Figure 21.4 shows the same excerpt represented with IOB tagging. In IOB tagging
we introduce a tag for the beginning (B) and inside (I) of each entity type,
and one for tokens outside (O) any entity. The number of tags is thus 2n + 1,
where n is the number of entity types. IOB tagging can represent exactly the same
information as the bracketed notation.
Words        IOB Label  IO Label
American     B-ORG      I-ORG
Airlines     I-ORG      I-ORG
,            O          O
a            O          O
unit         O          O
of           O          O
AMR          B-ORG      I-ORG
Corp.        I-ORG      I-ORG
,            O          O
immediately  O          O
matched      O          O
the          O          O
move         O          O
,            O          O
spokesman    O          O
Tim          B-PER      I-PER
Wagner       I-PER      I-PER
said         O          O
.            O          O
Figure 21.4 Named entity tagging as a sequence model, showing IOB and IO encodings.
We’ve also shown IO tagging, which loses some information by eliminating the
B tag. Without the B tag, IO tagging is unable to distinguish between two entities of
the same type that are right next to each other. Since this situation doesn’t arise very
often (usually there is at least some punctuation or other delimiter), IO tagging
may be sufficient, and has the advantage of using only n + 1 tags.
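The two encodings can be sketched in a few lines. The code below converts entity
spans over tokens into IOB tags, derives IO tags from them, and decodes IOB tags back
into spans. The span representation (start token, end token exclusive, type) and all
function names are assumptions of this sketch rather than anything fixed by the
chapter.

```python
def spans_to_iob(n_tokens, spans):
    """Encode (start, end_exclusive, label) token spans as IOB tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def iob_to_io(tags):
    """Drop the B/I distinction: every entity token becomes I-<type>."""
    return ["I-" + t[2:] if t.startswith("B-") else t for t in tags]


def iob_to_spans(tags):
    """Decode IOB tags back to spans (an I- tag with no open entity is ignored)."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a final entity
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def tagset(labels):
    """All IOB tags for n entity types: 2n + 1 of them."""
    return ["O"] + [f"{p}-{lab}" for lab in labels for p in ("B", "I")]


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", "Corp.", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]
spans = [(0, 2, "ORG"), (6, 8, "ORG"), (15, 17, "PER")]

iob = spans_to_iob(len(tokens), spans)
io = iob_to_io(iob)
for word, iob_tag, io_tag in zip(tokens, iob, io):
    print(f"{word:12} {iob_tag:6} {io_tag}")
```

Decoding the IOB tags with iob_to_spans recovers the original spans, which is the
sense in which IOB carries the same information as the brackets. Two adjacent
entities of the same type, tagged B-PER B-PER, both become I-PER under IO and can no
longer be told apart from one two-token entity.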
Having encoded our training data with IOB tags, the next step is to select a set of
features to associate with each input word token. Figure 21.5 lists standard features
used in state-of-the-art systems.
We’ve seen many of these features before in the context of part-of-speech tagging,
particularly for tagging unknown words. This is not surprising, as many unknown
words are in fact named entities. Word shape features are thus particularly
important in the context of NER. Recall that word shape features are used to
represent the abstract letter pattern of the word by mapping lower-case letters to ‘x’,
upper-case letters to ‘X’, digits to ‘d’, and retaining punctuation. Thus for example
I.M.F would map to X.X.X, and DC10-30 would map to XXdd-dd. A second class
of shorter word shape features is also used. In these features runs of consecutive
characters of the same type are collapsed into one, so DC10-30 would be mapped to
Xd-d but I.M.F would still map to X.X.X. It turns out that this feature by itself
accounts for a considerable part of the success of NER systems for English news
text. Shape features are also particularly important in recognizing names of proteins
and genes in biological texts.
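A minimal sketch of the two shape features follows. The function names are
illustrative, and one detail is an assumption of this sketch: the short shape
collapses only runs of the type symbols x, X and d, leaving repeated punctuation
untouched, since the text says only that punctuation is retained.

```python
import re


def word_shape(word: str) -> str:
    """Map lower-case letters to x, upper-case to X, digits to d; keep the rest."""
    shape = []
    for ch in word:
        if ch.islower():
            shape.append("x")
        elif ch.isupper():
            shape.append("X")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # punctuation and other symbols are retained
    return "".join(shape)


def short_word_shape(word: str) -> str:
    """Collapse each run of the same type symbol (x, X or d) to one character."""
    return re.sub(r"([Xxd])\1+", r"\1", word_shape(word))


for w in ["I.M.F", "DC10-30", "L'Occitane"]:
    print(w, word_shape(w), short_word_shape(w))
```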
identity of w_i
identity of neighboring words
part of speech of w_i
part of speech of neighboring words
base-phrase syntactic chunk label of w_i and neighboring words
presence of w_i in a gazetteer
w_i contains a particular prefix (from all prefixes of length ≤ 4)
w_i contains a particular suffix (from all suffixes of length ≤ 4)
w_i is all upper case
word shape of w_i
word shape of neighboring words
short word shape of w_i
short word shape of neighboring words
presence of hyphen
Figure 21.5 Features commonly used in training named entity recognition systems.
For example, the named entity token L’Occitane would generate the following
non-zero feature values:
prefix(w_i) = L
prefix(w_i) = L’
prefix(w_i) = L’O
prefix(w_i) = L’Oc
suffix(w_i) = tane
suffix(w_i) = ane
suffix(w_i) = ne
suffix(w_i) = e
word-shape(w_i) = X’Xxxxxxxx
short-word-shape(w_i) = X’Xx
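The prefix and suffix values listed here, along with the all-upper-case and hyphen
features from Figure 21.5, can be generated as in the sketch below. The function
names and the string format of the features (for example prefix=L) are assumptions of
this example, and the token is written with a plain ASCII apostrophe.

```python
def affix_features(word: str, max_len: int = 4) -> list[str]:
    """All prefixes and suffixes of length 1..max_len, as feature strings."""
    longest = min(max_len, len(word))
    prefixes = [f"prefix={word[:k]}" for k in range(1, longest + 1)]
    suffixes = [f"suffix={word[-k:]}" for k in range(longest, 0, -1)]
    return prefixes + suffixes


def surface_features(word: str) -> list[str]:
    """Affix features plus two binary surface features, listed only when they fire."""
    feats = affix_features(word)
    if word.isupper():          # every cased character is upper case
        feats.append("all-upper")
    if "-" in word:
        feats.append("has-hyphen")
    return feats


print(surface_features("L'Occitane"))
print(surface_features("DC10-30"))
```

Listing only the features that fire matches the non-zero feature values shown above:
L'Occitane is neither all upper case nor hyphenated, so it yields only its eight
affix features here.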
A gazetteer is a list of place names; gazetteers can offer millions of entries for
all manner of locations along with detailed geographical, geological, and political
information.¹ In addition to gazetteers, the United States Census Bureau provides
extensive lists of first names and surnames derived from its decennial census in the
U.S.² Similar lists of corporations, commercial products, and all manner of things
biological and mineral are also available from a variety of sources. Gazetteer features
are typically implemented as a binary feature for each name list. Unfortunately, such
lists can be difficult to create and maintain, and their usefulness varies considerably
depending on the named entity class. It appears that gazetteers can be quite
effective, while extensive lists of persons and organizations are not nearly as
beneficial (Mikheev et al., 1999).
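The "one binary feature per name list" idea can be sketched as a set lookup. The
lists below are tiny hypothetical stand-ins, not real gazetteer or census data, and
the sketch matches single tokens only; multi-word entries such as San Francisco would
need matching over token sequences, which is omitted here.

```python
# Toy name lists for illustration only (hypothetical contents).
NAME_LISTS = {
    "locations": {"Chicago", "Dallas", "Denver", "Washington"},
    "surnames": {"Wagner", "Washington"},
    "first_names": {"Tim"},
}


def gazetteer_features(token: str, name_lists: dict[str, set[str]]) -> dict[str, bool]:
    """One binary feature per name list: is the token an entry in that list?"""
    return {f"in_{name}": token in entries for name, entries in name_lists.items()}


for tok in ["Chicago", "Wagner", "Washington", "matched"]:
    print(tok, gazetteer_features(tok, NAME_LISTS))
```

Washington fires both the location and the surname feature, so a list lookup alone
does not settle the type ambiguity illustrated in Figures 21.2 and 21.3.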
The relative usefulness of any of these features or combination of features
depends to a great extent on the application, genre, media, language, and text
encoding. For example, shape features, which are critical for English newswire texts,
are of little use with materials transcribed from spoken text by automatic speech
recognition, materials gleaned from informally edited sources such as blogs and
discussion forums, and for character-based languages like Chinese where case
information isn’t available. The set of features given in Fig. 21.5 should therefore
be thought of as only a starting point for any given application.
¹ www.geonames.org
² www.census.gov