Named Entity Recognition (Stanford)


Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2016. All rights reserved. Draft of August 7, 2017.

CHAPTER 21

Information Extraction

I am the very model of a modern Major-General,
I’ve information vegetable, animal, and mineral,
I know the kings of England, and I quote the fights historical
From Marathon to Waterloo, in order categorical...
Gilbert and Sullivan, Pirates of Penzance

Imagine that you are an analyst with an investment firm that tracks airline stocks. You’re given the task of determining the relationship (if any) between airline announcements of fare increases and the behavior of their stocks the next day. Historical data about stock prices is easy to come by, but what about the airline announcements? You will need to know at least the name of the airline, the nature of the proposed fare hike, the dates of the announcement, and possibly the response of other airlines. Fortunately, these can all be found in news articles like this one:

Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

This chapter presents techniques for extracting limited kinds of semantic content from text. This process of information extraction (IE) turns the unstructured information embedded in texts into structured data, for example for populating a relational database to enable further processing.

The first step in most IE tasks is to find the proper names or named entities mentioned in a text. The task of named entity recognition (NER) is to find each mention of a named entity in the text and label its type. What constitutes a named entity type is application specific; these commonly include people, places, and organizations but also more specific entities from the names of genes and proteins (Cohen and Demner-Fushman, 2014) to the names of college courses (McCallum, 2005).

Having located all of the mentions of named entities in a text, it is useful to link, or cluster, these mentions into sets that correspond to the entities behind the mentions, for example inferring that mentions of United Airlines and United in the sample text refer to the same real-world entity. We’ll defer discussion of this task of coreference resolution until Chapter 23.

The task of relation extraction is to find and classify semantic relations among the text entities, often binary relations like spouse-of, child-of, employment, part-whole, membership, and geospatial relations. Relation extraction has close links to populating a relational database.

2 CHAPTER 21 • INFORMATION EXTRACTION

The task of event extraction is to find events in which these entities participate, like, in our sample text, the fare increases by United and American and the reporting events said and cite. We’ll also need to perform event coreference to figure out which of the many event mentions in a text refer to the same event; in our running example the two instances of increase and the phrase the move all refer to the same event.

To figure out when the events in a text happened, we need to recognize temporal expressions such as days of the week (Friday and Thursday), months, and holidays; relative expressions like two days from now or next year; and times such as 3:30 P.M. or noon. The problem of temporal expression normalization is to map these temporal expressions onto specific calendar dates or times of day to situate events in time. In our sample task, this will allow us to link Friday to the time of United’s announcement, and Thursday to the previous day’s fare increase, and produce a timeline in which United’s announcement follows the fare increase and American’s announcement follows both of those events.
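As a rough illustration of temporal expression normalization, the sketch below maps a weekday name onto a calendar date relative to a document date. The document date of 2006-10-27 is an assumption (the article itself gives no date); it was chosen so that Thursday resolves to the 2006-10-26 effective date used in this chapter's filled template. The function name `resolve_past_weekday` and the "most recent occurrence on or before the anchor" rule are also illustrative assumptions, not the chapter's algorithm.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_past_weekday(name: str, anchor: date) -> date:
    """Return the most recent date on or before `anchor` that falls on `name`."""
    target = WEEKDAYS.index(name.lower())      # Monday == 0, matching date.weekday()
    days_back = (anchor.weekday() - target) % 7
    return anchor - timedelta(days=days_back)


# Assumed publication date of the article (a Friday); not stated in the text.
doc_date = date(2006, 10, 27)

announcement = resolve_past_weekday("Friday", doc_date)
fare_increase = resolve_past_weekday("Thursday", doc_date)

# The fare increase precedes the announcement on the resulting timeline.
timeline = sorted([("fare increase", fare_increase),
                   ("United announcement", announcement)],
                  key=lambda item: item[1])
```

News text usually reports past events, which is why this sketch looks backward from the anchor; future-oriented expressions such as next year would need a different rule.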

Finally, many texts describe recurring stereotypical situations. The task of template filling is to find such situations in documents and fill the template slots with appropriate material. These slot-fillers may consist of text segments extracted directly from the text, or concepts like times, amounts, or ontology entities that have been inferred from text elements through additional processing.

Our airline text is an example of this kind of stereotypical situation since airlines often raise fares and then wait to see if competitors follow along. In this situation, we can identify United as a lead airline that initially raised its fares, $6 as the amount, Thursday as the increase date, and American as an airline that followed along, leading to a filled template like the following.

FARE-RAISE ATTEMPT:
    LEAD AIRLINE:    UNITED AIRLINES
    AMOUNT:          $6
    EFFECTIVE DATE:  2006-10-26
    FOLLOWER:        AMERICAN AIRLINES
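One way to hold a filled template in a program is as a typed record whose fields mirror the slots, which can then be flattened into a row for a relational table. The sketch below is only an illustration: the class name `FareRaiseAttempt` and the field names are hypothetical, and storing the amount as a whole number of dollars is an assumption.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class FareRaiseAttempt:
    """Slots of the FARE-RAISE ATTEMPT template (field names are illustrative)."""
    lead_airline: str
    amount_usd: int          # assumed: whole dollars
    effective_date: date
    follower: str


filled = FareRaiseAttempt(
    lead_airline="UNITED AIRLINES",
    amount_usd=6,
    effective_date=date(2006, 10, 26),
    follower="AMERICAN AIRLINES",
)

# A plain dict is convenient for inserting the record into a database row.
record = asdict(filled)
```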







The following sections review current approaches to each of these problems.

21.1 Named Entity Recognition

The first step in information extraction is to detect the entities in the text. A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization. The term is commonly extended to include things that aren’t entities per se, including dates, times, and other kinds of temporal expressions, and even numerical expressions like prices. Here’s the sample text introduced earlier with the named entities marked:

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].


The text contains 13 mentions of named entities including 5 organizations, 4 locations, 2 times, 1 person, and 1 mention of money.
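The counts above can be checked mechanically. The following sketch parses the bracketed notation with a regular expression and tallies mentions by type; the pattern and the helper name `extract_mentions` are illustrative assumptions, and the pattern only handles the flat, non-nested brackets used in this example.

```python
import re
from collections import Counter

ANNOTATED = (
    "Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it "
    "has increased fares by [MONEY $6] per round trip on flights to some "
    "cities also served by lower-cost carriers. [ORG American Airlines], a "
    "unit of [ORG AMR Corp.], immediately matched the move, spokesman "
    "[PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], "
    "said the increase took effect [TIME Thursday] and applies to most "
    "routes where it competes against discount carriers, such as "
    "[LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco]."
)

# "[TYPE mention text]": an upper-case type tag, a space, then the mention.
MENTION = re.compile(r"\[([A-Z]+) ([^\]]+)\]")


def extract_mentions(text):
    """Return (type, mention) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in MENTION.finditer(text)]


mentions = extract_mentions(ANNOTATED)
counts = Counter(label for label, _ in mentions)
```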

In addition to their use in extracting events and the relationship between participants, named entities are useful for many other language processing tasks. In sentiment analysis we might want to know a consumer’s sentiment toward a particular entity. Entities are a useful first stage in question answering, or for linking text to information in structured knowledge sources like Wikipedia.

Figure 21.1 shows typical generic named entity types. Many applications will also need to use specific entity types like proteins, genes, commercial products, or works of art.

Type                  Tag  Sample Categories             Example sentences
People                PER  people, characters            Turing is a giant of computer science.
Organization          ORG  companies, sports teams       The IPCC warned about the cyclone.
Location              LOC  regions, mountains, seas      The Mt. Sanitas loop is in Sunshine Canyon.
Geo-Political Entity  GPE  countries, states, provinces  Palo Alto is raising the fees for parking.
Facility              FAC  bridges, buildings, airports  Consider the Tappan Zee Bridge.
Vehicles              VEH  planes, trains, automobiles   It was a classic Ford Falcon.

Figure 21.1 A list of generic named entity types with the kinds of entities they refer to.

Named entity recognition means finding spans of text that constitute proper names and then classifying the type of the entity. Recognition is difficult partly because of the ambiguity of segmentation; we need to decide what’s an entity and what isn’t, and where the boundaries are. Another difficulty is caused by type ambiguity. The mention JFK can refer to a person, the airport in New York, or any number of schools, bridges, and streets around the United States. Some examples of this kind of cross-type confusion are given in Figures 21.2 and 21.3.

Name           Possible Categories
Washington     Person, Location, Political Entity, Organization, Vehicle
Downing St.    Location, Organization
IRA            Person, Organization, Monetary Instrument
Louis Vuitton  Person, Organization, Commercial Product

Figure 21.2 Common categorical ambiguities associated with various proper names.

[PER Washington] was born into slavery on the farm of James Burroughs.
[ORG Washington] went up 2 games to 1 in the four-game series.
Blair arrived in [LOC Washington] for what may well be his last state visit.
In June, [GPE Washington] passed a primary seatbelt law.
The [VEH Washington] had proved to be a leaky ship, every passage I made...

Figure 21.3 Examples of type ambiguities in the use of the name Washington.

21.1.1 NER as Sequence Labeling

The standard approach to named entity recognition treats it as a word-by-word sequence labeling task, in which the assigned tags capture both the boundary and the type. A sequence classifier like an MEMM or CRF is trained to label the tokens in a text with tags that indicate the presence of particular kinds of named entities. Consider the following simplified excerpt from our running example.


[ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.

Figure 21.4 shows the same excerpt represented with IOB tagging. In IOB tagging we introduce a tag for the beginning (B) and inside (I) of each entity type, and one for tokens outside (O) any entity. The number of tags is thus 2n + 1, where n is the number of entity types. IOB tagging can represent exactly the same information as the bracketed notation.

Words        IOB Label  IO Label
American     B-ORG      I-ORG
Airlines     I-ORG      I-ORG
,            O          O
a            O          O
unit         O          O
of           O          O
AMR          B-ORG      I-ORG
Corp.        I-ORG      I-ORG
,            O          O
immediately  O          O
matched      O          O
the          O          O
move         O          O
,            O          O
spokesman    O          O
Tim          B-PER      I-PER
Wagner       I-PER      I-PER
said         O          O
.            O          O

Figure 21.4 Named entity tagging as a sequence model, showing IOB and IO encodings.

We’ve also shown IO tagging, which loses some information by eliminating the B tag. Without the B tag, IO tagging is unable to distinguish between two entities of the same type that are right next to each other. Since this situation doesn’t arise very often (usually there is at least some punctuation or other delimiter), IO tagging may be sufficient, and has the advantage of using only n + 1 tags.
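The two encodings can be sketched in a few lines. The code below converts entity spans into IOB or IO tags and shows the adjacency case where IO loses information. The span representation (start index, exclusive end index, type) and the function names are assumptions made for this illustration.

```python
def encode(tokens, spans, scheme="IOB"):
    """Tag tokens given spans of (start, end, type), with `end` exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        for i in range(start, end):
            # Only IOB marks the first token of an entity with B.
            prefix = "B" if (scheme == "IOB" and i == start) else "I"
            tags[i] = f"{prefix}-{etype}"
    return tags


def tagset_size(n_types, scheme="IOB"):
    """2n + 1 tags for IOB, n + 1 tags for IO."""
    return 2 * n_types + 1 if scheme == "IOB" else n_types + 1


tokens = ("American Airlines , a unit of AMR Corp. , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
spans = [(0, 2, "ORG"), (6, 8, "ORG"), (15, 17, "PER")]

iob_tags = encode(tokens, spans, "IOB")
io_tags = encode(tokens, spans, "IO")

# Two adjacent one-token PER entities versus a single two-token PER entity:
adjacent = [(0, 1, "PER"), (1, 2, "PER")]
merged = [(0, 2, "PER")]
io_is_ambiguous = encode(["A", "B"], adjacent, "IO") == encode(["A", "B"], merged, "IO")
iob_distinguishes = encode(["A", "B"], adjacent, "IOB") != encode(["A", "B"], merged, "IOB")
```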

Having encoded our training data with IOB tags, the next step is to select a set of features to associate with each input word token. Figure 21.5 lists standard features used in state-of-the-art systems.

We’ve seen many of these features before in the context of part-of-speech tagging, particularly for tagging unknown words. This is not surprising, as many unknown words are in fact named entities. Word shape features are thus particularly important in the context of NER. Recall that word shape features represent the abstract letter pattern of the word by mapping lower-case letters to ‘x’, upper-case letters to ‘X’, and digits to ‘d’, while retaining punctuation. Thus, for example, I.M.F would map to X.X.X and DC10-30 would map to XXdd-dd. A second class of shorter word shape features is also used. In these features, runs of consecutive characters of the same type are collapsed to a single character, so DC10-30 would be mapped to Xd-d but I.M.F would still map to X.X.X. It turns out that this feature by itself accounts for a considerable part of the success of NER systems for English news text. Shape features are also particularly important in recognizing names of proteins and genes in biological texts.
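A minimal sketch of the two shape mappings follows. The function names are illustrative, and one detail is an assumption the text does not settle: the short shape here collapses any run of identical shape characters, so repeated punctuation is collapsed as well.

```python
from itertools import groupby


def char_class(ch):
    """Map a character to its shape symbol; punctuation is kept as-is."""
    if ch.islower():
        return "x"
    if ch.isupper():
        return "X"
    if ch.isdigit():
        return "d"
    return ch


def word_shape(word):
    """Full shape: one symbol per character."""
    return "".join(char_class(c) for c in word)


def short_word_shape(word):
    """Short shape: collapse each run of identical shape symbols to one."""
    return "".join(symbol for symbol, _ in groupby(word_shape(word)))


examples = {w: (word_shape(w), short_word_shape(w))
            for w in ["I.M.F", "DC10-30", "L'Occitane"]}
```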


identity of w_i
identity of neighboring words
part of speech of w_i
part of speech of neighboring words
base-phrase syntactic chunk label of w_i and neighboring words
presence of w_i in a gazetteer
w_i contains a particular prefix (from all prefixes of length ≤ 4)
w_i contains a particular suffix (from all suffixes of length ≤ 4)
w_i is all upper case
word shape of w_i
word shape of neighboring words
short word shape of w_i
short word shape of neighboring words
presence of hyphen

Figure 21.5 Features commonly used in training named entity recognition systems.

For example, the named entity token L’Occitane would generate the following non-zero feature values:

prefix(w_i) = L
prefix(w_i) = L’
prefix(w_i) = L’O
prefix(w_i) = L’Oc
suffix(w_i) = tane
suffix(w_i) = ane
suffix(w_i) = ne
suffix(w_i) = e
word-shape(w_i) = X’Xxxxxxxx
short-word-shape(w_i) = X’Xx
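The prefix and suffix features, together with the all-upper-case and hyphen indicators from Figure 21.5, can be generated as sparse binary features. This is a sketch under assumptions: the feature-name strings and the dictionary representation are illustrative choices, and a straight apostrophe stands in for the typographic one in the token.

```python
def affix_features(word, max_len=4):
    """Return the non-zero binary features of `word` as a dict of name -> 1."""
    feats = {}
    # All prefixes and suffixes of length 1..max_len (or shorter for short words).
    for n in range(1, min(max_len, len(word)) + 1):
        feats[f"prefix={word[:n]}"] = 1
        feats[f"suffix={word[-n:]}"] = 1
    if word.isupper():          # every cased character is upper case
        feats["all_upper"] = 1
    if "-" in word:
        feats["has_hyphen"] = 1
    return feats


feats = affix_features("L'Occitane")
prefixes = sorted((k for k in feats if k.startswith("prefix=")), key=len)
suffixes = sorted((k for k in feats if k.startswith("suffix=")), key=len)
```

Only features that fire are stored, which matches the way the listing above shows non-zero values only.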

A gazetteer is a list of place names; gazetteers can offer millions of entries for all manner of locations along with detailed geographical, geologic, and political information.[1] In addition to gazetteers, the United States Census Bureau provides extensive lists of first names and surnames derived from its decennial census in the U.S.[2] Similar lists of corporations, commercial products, and all manner of things biological and mineral are also available from a variety of sources. Gazetteer features are typically implemented as a binary feature for each name list. Unfortunately, such lists can be difficult to create and maintain, and their usefulness varies considerably depending on the named entity class. It appears that place-name gazetteers can be quite effective, while extensive lists of persons and organizations are not nearly as beneficial (Mikheev et al., 1999).
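A binary feature per name list can be sketched as follows. The three tiny lists are made-up toy data standing in for real resources, and the feature names and case-insensitive lookup are assumptions of this illustration.

```python
# Toy name lists (hypothetical data; real gazetteers hold millions of entries).
NAME_LISTS = {
    "places": {"chicago", "dallas", "denver"},
    "first_names": {"tim"},
    "surnames": {"wagner"},
}


def gazetteer_features(token, name_lists):
    """One binary feature per list: 1 if the token appears in it, else 0."""
    key = token.lower()
    return {f"in_{name}": int(key in entries)
            for name, entries in name_lists.items()}


chicago = gazetteer_features("Chicago", NAME_LISTS)
wagner = gazetteer_features("Wagner", NAME_LISTS)
fares = gazetteer_features("fares", NAME_LISTS)
```

This looks up single tokens only; multiword names such as San Francisco would need matching over token spans rather than individual words.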

The relative usefulness of any of these features or combination of features depends to a great extent on the application, genre, media, language, and text encoding. For example, shape features, which are critical for English newswire texts, are of little use with materials transcribed from spoken text by automatic speech recognition, materials gleaned from informally edited sources such as blogs and discussion forums, and for character-based languages like Chinese where case information isn’t available. The set of features given in Fig. 21.5 should therefore be thought of as only a starting point for any given application.

[1] www.geonames.org
[2] www.census.gov
