A survey of named entity recognition and classification

David Nadeau, Satoshi Sekine

National Research Council Canada / New York University

Introduction

The term “Named Entity”, now widely used in Natural Language Processing, was coined

for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim

1996). At that time, MUC was focusing on Information Extraction (IE) tasks where

structured information on company activities and defense-related activities is extracted

from unstructured text, such as newspaper articles. In defining the task, people noticed

that it is essential to recognize information units like names, including person,

organization and location names, and numeric expressions including time, date, money

and percent expressions. Identifying references to these entities in text was recognized as

one of the important sub-tasks of IE and was called “Named Entity Recognition and

Classification (NERC)”.

We present here a survey of fifteen years of research in the NERC field, from 1991

to 2006. While early systems were making use of handcrafted rule-based algorithms,

modern systems most often resort to machine learning techniques. We survey these

techniques as well as other critical aspects of NERC such as features and evaluation

methods. It was indeed concluded in a recent conference that the choice of features is at

least as important as the choice of technique for obtaining a good NERC system (E.

Tjong Kim Sang & De Meulder 2003). Moreover, the way NERC systems are evaluated

and compared is essential to progress in the field. To the best of our knowledge, NERC

features, techniques, and evaluation methods have not been surveyed extensively yet.

The first section of this survey presents some observations on published work from

the point of view of activity per year, supported languages, preferred textual genre and

domain, and supported entity types. These observations were collected from a review of a hundred

English-language papers sampled from the major conferences and journals. We do not

claim this review to be exhaustive or representative of all the research in all languages,

but we believe it gives a good feel for the breadth and depth of previous work. Section 2

covers the algorithmic techniques that were proposed for addressing the NERC task.

Most techniques are borrowed from the Machine Learning (ML) field. Instead of

elaborating on techniques themselves, the third section lists and classifies the proposed

features, i.e., descriptions and characteristics of words for algorithmic consumption.

Section 4 presents some of the evaluation paradigms that were proposed throughout the

major forums. Finally, we present our conclusions.

1 Observations: 1991 to 2006

The computational research aiming at automatically identifying named entities in texts

forms a vast and heterogeneous pool of strategies, methods and representations. One of

the first research papers in the field was presented by Lisa F. Rau (1991) at the Seventh

IEEE Conference on Artificial Intelligence Applications. Rau’s paper describes a system

to “extract and recognize [company] names”. It relies on heuristics and handcrafted rules.

From 1991 (1 publication) to 1995 (we found 8 publications in English), the publication

rate remained relatively low. It accelerated in 1996, with the first major event dedicated

to the task: MUC-6 (R. Grishman & Sundheim 1996). It has never declined since then, with

steady research and numerous scientific events: HUB-4 (N. Chinchor et al. 1998), MUC-

7 and MET-2 (N. Chinchor 1999), IREX (S. Sekine & Isahara 2000), CONLL (E. Tjong

Kim Sang 2002, E. Tjong Kim Sang & De Meulder 2003), ACE (G. Doddington et al.

2004) and HAREM (D. Santos et al. 2006). The Language Resources and Evaluation

Conference (LREC) has also been staging workshops and main conference tracks on the

topic since 2000.

1.1 Language factor

A good proportion of work in NERC research is devoted to the study of English but a

possibly larger proportion addresses language independence and multilingualism

problems. German is well studied in CONLL-2003 and in earlier works. Similarly,

Spanish and Dutch are strongly represented, boosted by a major devoted conference:

CONLL-2002. Japanese has been studied in the MUC-6 conference, the IREX conference

and other work. Chinese is studied in an abundant literature (e.g., L.-J. Wang et al. 1992,

H.-H. Chen & Lee 1996, S. Yu et al. 1998) and so are French (G. Petasis et al. 2001,

Poibeau 2003), Greek (S. Boutsis et al. 2000) and Italian (W. Black et al. 1998, A.

Cucchiarelli & Velardi 2001). Many other languages received some attention as well:

Basque (C. Whitelaw & Patrick 2003), Bulgarian (J. Da Silva et al. 2004), Catalan (X.

Carreras et al. 2003), Cebuano (J. May et al. 2003), Danish (E. Bick 2004), Hindi (S.

Cucerzan & Yarowsky 1999, J. May et al. 2003), Korean (C. Whitelaw & Patrick 2003),

Polish (J. Piskorski 2004), Romanian (S. Cucerzan & Yarowsky 1999), Russian (B.

Popov et al. 2004), Swedish (D. Kokkinakis 1998) and Turkish (S. Cucerzan &

Yarowsky 1999). Portuguese was examined by D. Palmer & Day (1997) and, at the time

of writing this survey, the HAREM conference is revisiting that language. Finally, Arabic

(F. Huang 2005) has started to receive a lot of attention in large-scale projects such as

Global Autonomous Language Exploitation (GALE).

1.2 Textual genre or domain factor

The impact of textual genre (journalistic, scientific, informal, etc.) and domain

(gardening, sports, business, etc.) has been rather neglected in the NERC literature. Few

studies are specifically devoted to diverse genres and domains. D. Maynard et al. (2001)

designed a system for emails, scientific texts and religious texts. E. Minkov et al. (2005)

created a system specifically designed for email documents. Perhaps unsurprisingly, these

experiments demonstrated that although any domain can be reasonably supported, porting

a system to a new domain or textual genre remains a major challenge. T. Poibeau and

Kosseim (2001), for instance, tested some systems on both the MUC-6 collection

composed of newswire texts, and on a proprietary corpus made of manual translations of

phone conversations and technical emails. They report a drop in performance for every

system (some 20% to 40% of precision and recall).

LREC: http://www.lrec-conf.org/

GALE: http://projects.ldc.upenn.edu/gale/

1.3 Entity type factor

In the expression “Named Entity”, the word “Named” aims to restrict the task to only

those entities for which one or many rigid designators, as defined by S. Kripke (1982),

stand for the referent. For instance, the automotive company created by Henry Ford in

1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper

names as well as certain natural kind terms like biological species and substances. There

is a general agreement in the NERC community about the inclusion of temporal

expressions and some numerical expressions such as amounts of money and other types

of units. While some instances of these types are good examples of rigid designators

(e.g., the year 2001 is the 2001st year of the Gregorian calendar) there are also many

invalid ones (e.g., in June refers to the month of an undefined year – past June, this June,

June 2020, etc.). It is arguable that the NE definition is loosened in such cases for

practical reasons.

Early work formulates the NERC problem as recognizing “proper names” in

general (e.g., S. Coates-Stephens 1992, C. Thielen 1995). Overall, the most studied types

are three specializations of “proper names”: names of “persons”, “locations” and

“organizations”. These types are collectively known as “enamex” since the MUC-6

competition. The type “location” can in turn be divided into multiple subtypes of “fine-

grained locations”: city, state, country, etc. (M. Fleischman 2001, S. Lee & Geunbae Lee

2005). Similarly, “fine-grained person” sub-categories like “politician” and “entertainer”

appear in the work of M. Fleischman and Hovy (2002). The type “person” is quite

common and used at least once in an original way by O. Bodenreider and Zweigenbaum

(2000) who combine it with other cues for extracting medication and disease names

(e.g., “Parkinson disease”). In the ACE program, the type “facility” subsumes entities of

the types “location” and “organization”. The type “GPE” is used to represent a location

which has a government, such as a city or a country.

The type “miscellaneous” is used in the CONLL conferences and includes proper

names falling outside the classic “enamex”. The class is also sometimes augmented with

the type “product” (e.g., E. Bick 2004). The “timex” (another term coined in MUC) types

“date” and “time” and the “numex” types “money” and “percent” are also quite

predominant in the literature. Since 2003, a community named TIMEX2 (L. Ferro et al.

2005) has proposed an elaborate standard for the annotation and normalization of temporal

expressions. Finally, marginal types are sometimes handled for specific needs: “film” and

“scientist” (O. Etzioni et al. 2005), “email address” and “phone number” (I. Witten et al.

1999, D. Maynard et al. 2001), “research area” and “project name” (J. Zhu et al. 2005),

“book title” (S. Brin 1998, I. Witten et al. 1999), “job title” (W. Cohen & Sarawagi 2004)

and “brand” (E. Bick 2004).
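As an illustration of how the “enamex”, “timex” and “numex” families appear in annotated text, the sketch below reads MUC-style inline markup and lists the entities it contains. It assumes the SGML-like tag layout used in the MUC named entity annotations (an element name plus a TYPE attribute); the function name `extract_entities` and the sample sentence are made up for this example and do not come from any MUC corpus.

```python
import re

# Assumed MUC-style inline annotation, e.g. <ENAMEX TYPE="PERSON">...</ENAMEX>.
# The backreference \1 forces the closing tag to match the opening one.
TAG_PATTERN = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def extract_entities(annotated_text):
    """Return (tag family, entity type, surface string) triples."""
    return [
        (match.group(1), match.group(2), match.group(3))
        for match in TAG_PATTERN.finditer(annotated_text)
    ]


# Made-up sentence for illustration only.
SAMPLE = (
    '<ENAMEX TYPE="PERSON">Henry Ford</ENAMEX> founded '
    '<ENAMEX TYPE="ORGANIZATION">Ford Motor Company</ENAMEX> in '
    '<TIMEX TYPE="DATE">1903</TIMEX>.'
)

if __name__ == "__main__":
    for family, entity_type, surface in extract_entities(SAMPLE):
        print(family, entity_type, surface)
```

The pattern does not handle nested annotations or extra attributes; it is only meant to show how the three tag families partition the entity types discussed above.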

A recent interest in bioinformatics, and the availability of the GENIA corpus (T.

Ohta et al. 2002) led to many studies dedicated to types such as “protein”, “DNA”,

“RNA”, “cell line” and “cell type” (e.g., D. Shen et al. 2003, B. Settles 2004) as well as

studies targeted to “protein” recognition only (Y. Tsuruoka & Tsujii 2003). Related work

also includes “drug” (T. Rindfleisch et al. 2000) and “chemical” (M. Narayanaswamy et

al. 2003) names.

Some recent work does not limit the possible types to extract and is referred to as

“open domain” NERC (see E. Alfonseca & Manandhar 2002, R. Evans 2003). In this line

of research, S. Sekine and Nobata (2004) defined a named entity hierarchy which

includes many fine-grained subcategories, such as museum, river or airport, and adds a

wide range of categories, such as product and event, as well as substance, animal, religion

or color. It tries to cover the most frequent name types and rigid designators appearing in a

newspaper. The number of categories is about 200, and they are now defining popular

attributes for each category to make it an ontology.

1.4 What’s next?

Recent research in multimedia indexing, semi-supervised learning, complex linguistic

phenomena, and machine translation suggests some new directions for the field. On one

side, there is a growing interest in multimedia information processing (e.g., video,

speech) and particularly NE extraction from it (R. Basili et al. 2005). A lot of effort is also

invested toward semi-supervised and unsupervised approaches to NERC motivated by the

use of very large collections of texts (O. Etzioni et al. 2005) and the possibility of

handling multiple NE types (D. Nadeau et al. 2006). Complex linguistic phenomena (e.g.,

metonymy) that are common shortcomings of current systems are under investigation (T.

Poibeau 2006). Finally, large-scale projects such as GALE, discussed in section 1.1,

open the way to integration of NERC and Machine Translation for mutual improvement.

2 Learning methods

The ability to recognize previously unknown entities is an essential part of NERC

systems. Such ability hinges upon recognition and classification rules triggered by

distinctive features associated with positive and negative examples. While early studies

were mostly based on handcrafted rules, most recent ones use supervised machine

learning (SL) as a way to automatically induce rule-based systems or sequence labeling

algorithms starting from a collection of training examples. This is evidenced, in the

research community, by the fact that five systems out of eight were rule-based in the

MUC-7 competition while sixteen systems were presented at CONLL-2003, a forum

devoted to learning techniques. When training examples are not available, handcrafted

rules remain the preferred technique, as shown in S. Sekine and Nobata (2004) who

developed a NERC system for 200 entity types.
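To make the notion of a handcrafted rule triggered by distinctive features concrete, here is a minimal sketch of a single rule for company names. It is not a reconstruction of any system cited above: the suffix list `COMPANY_SUFFIXES`, the function `find_company_names` and the sample sentence are hypothetical and chosen only for illustration.

```python
# Hypothetical trigger words; a real rule-based system would rely on much
# larger lists and many more rules.
COMPANY_SUFFIXES = {"Inc.", "Corp.", "Ltd.", "Co.", "Company"}


def find_company_names(text):
    """Rule: a run of capitalized tokens ending in a company suffix is a company name."""
    names, run = [], []
    for token in text.split():
        word = token.rstrip(",;:")
        if not word[:1].isupper():
            run = []  # a lowercase token breaks the run
            continue
        run.append(word)
        # Accept both "Inc." (period is part of the suffix) and a
        # sentence-final "Company." (period is punctuation).
        suffix = word if word in COMPANY_SUFFIXES else word.rstrip(".")
        if suffix in COMPANY_SUFFIXES and len(run) > 1:
            run[-1] = suffix
            names.append(" ".join(run))
            run = []
        elif word != token:
            run = []  # trailing punctuation also breaks the run
    return names


if __name__ == "__main__":
    sample = "Shares of Acme Widgets Inc. rose after Ford Motor Company expanded."
    print(find_company_names(sample))
```

The rule fires on two features, capitalization and a trigger word, and it fails in predictable ways: a capitalized sentence-initial word directly before a company name would be absorbed into the name, which is the kind of brittleness that motivates learning such rules from examples.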

The idea of supervised learning is to study the features of positive and negative

examples of NE over a large collection of annotated documents and design rules that

capture instances of a given type. Section 2.1 explains SL approaches in more detail.

The main shortcoming of SL is the requirement of a large annotated corpus. The

unavailability of such resources and the prohibitive cost of creating them lead to two

alternative learning methods: semi-supervised learning (SSL) and unsupervised learning

(UL). These techniques are presented in sections 2.2 and 2.3 respectively.
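The supervised setting can be sketched with a toy example: features of annotated tokens are counted during training, and an unseen token receives the label its features most often co-occurred with. This is a deliberately simple frequency-voting classifier and not any of the models surveyed in this paper; the feature set, the labels PER, ORG and O, the function names and the two training sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up annotated corpus: (token, label) pairs, "O" meaning "not an entity".
TRAINING_DATA = [
    [("Mr.", "O"), ("Smith", "PER"), ("joined", "O"), ("Acme", "ORG"), ("Inc.", "ORG")],
    [("Mr.", "O"), ("Jones", "PER"), ("left", "O"), ("Globex", "ORG"), ("Inc.", "ORG")],
]


def token_features(tokens, i):
    """Describe token i by its capitalization and its neighboring words."""
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return [f"cap={tokens[i][:1].isupper()}", f"prev={prev_word}", f"next={next_word}"]


def train(annotated_sentences):
    """Count how often each feature co-occurs with each label."""
    counts = defaultdict(Counter)
    for sentence in annotated_sentences:
        tokens = [token for token, _ in sentence]
        for i, (_, label) in enumerate(sentence):
            for feature in token_features(tokens, i):
                counts[feature][label] += 1
    return counts


def predict(counts, tokens):
    """Each known feature votes with its relative label frequencies."""
    labels = []
    for i in range(len(tokens)):
        scores = Counter()
        for feature in token_features(tokens, i):
            seen = counts.get(feature)
            if seen:
                total = sum(seen.values())
                for label, n in seen.items():
                    scores[label] += n / total
        # Fall back to "O" when no feature was seen during training.
        labels.append(max(scores, key=scores.get) if scores else "O")
    return labels


if __name__ == "__main__":
    model = train(TRAINING_DATA)
    print(predict(model, ["Mr.", "Brown", "joined", "Initech", "Inc."]))
```

Neither "Brown" nor "Initech" occurs in the training data; they are labeled from contextual features alone, which is the behavior the text describes as recognizing previously unknown entities. The example also shows why annotated data is the bottleneck: every vote comes from a hand-labeled token.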

2.1 Supervised learning

The current dominant technique for addressing the NERC problem is supervised learning.

SL techniques include Hidden Markov Models (HMM) (D. Bikel et al. 1997), Decision

Trees (S. Sekine 1998), Maximum Entropy Models (ME) (A. Borthwick 1998), Support

Vector Machines (SVM) (M. Asahara & Matsumoto 2003), and Conditional Random

Fields (CRF) (A. McCallum & Li 2003). These are all variants of the SL approach that
