A survey of named entity recognition and classification
David Nadeau, Satoshi Sekine
National Research Council Canada / New York University
Introduction
The term “Named Entity”, now widely used in Natural Language Processing, was coined
for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim
1996). At that time, MUC was focusing on Information Extraction (IE) tasks where
structured information of company activities and defense related activities is extracted
from unstructured text, such as newspaper articles. In defining the task, people noticed
that it is essential to recognize information units like names, including person,
organization and location names, and numeric expressions including time, date, money
and percent expressions. Identifying references to these entities in text was recognized as
one of the important sub-tasks of IE and was called “Named Entity Recognition and
Classification (NERC)”.
We present here a survey of fifteen years of research in the NERC field, from 1991
to 2006. While early systems were making use of handcrafted rule-based algorithms,
modern systems most often resort to machine learning techniques. We survey these
techniques as well as other critical aspects of NERC such as features and evaluation
methods. It was indeed concluded in a recent conference that the choice of features is at
least as important as the choice of technique for obtaining a good NERC system (E.
Tjong Kim Sang & De Meulder 2003). Moreover, the way NERC systems are evaluated
and compared is essential to progress in the field. To the best of our knowledge, NERC
features, techniques, and evaluation methods have not been surveyed extensively yet.
The first section of this survey presents some observations on published work from
the point of view of activity per year, supported languages, preferred textual genre and
domain, and supported entity types. It was collected from the review of a hundred
English language papers sampled from the major conferences and journals. We do not
claim this review to be exhaustive or representative of all the research in all languages,
but we believe it gives a good feel for the breadth and depth of previous work. Section 2
covers the algorithmic techniques that were proposed for addressing the NERC task.
Most techniques are borrowed from the Machine Learning (ML) field. Instead of
elaborating on techniques themselves, the third section lists and classifies the proposed
features, i.e., descriptions and characteristics of words for algorithmic consumption.
Section 4 presents some of the evaluation paradigms that were proposed throughout the
major forums. Finally, we present our conclusions.
1 Observations: 1991 to 2006
The computational research aiming at automatically identifying named entities in texts
forms a vast and heterogeneous pool of strategies, methods and representations. One of
the first research papers in the field was presented by Lisa F. Rau (1991) at the Seventh
IEEE Conference on Artificial Intelligence Applications. Rau’s paper describes a system
to “extract and recognize [company] names”. It relies on heuristics and handcrafted rules.
From 1991 (1 publication) to 1995 (we found 8 publications in English), the publication
rate remained relatively low. It accelerated in 1996, with the first major event dedicated
to the task: MUC-6 (R. Grishman & Sundheim 1996). It has not declined since, with
steady research and numerous scientific events: HUB-4 (N. Chinchor et al. 1998), MUC-
7 and MET-2 (N. Chinchor 1999), IREX (S. Sekine & Isahara 2000), CONLL (E. Tjong
Kim Sang 2002, E. Tjong Kim Sang & De Meulder 2003), ACE (G. Doddington et al.
2004) and HAREM (D. Santos et al. 2006). The Language Resources and Evaluation
Conference (LREC)¹ has also been staging workshops and main conference tracks on the
topic since 2000.
1.1 Language factor
A good proportion of work in NERC research is devoted to the study of English but a
possibly larger proportion addresses language independence and multilingualism
problems. German is well studied in CONLL-2003 and in earlier works. Similarly,
Spanish and Dutch are strongly represented, boosted by a major devoted conference:
CONLL-2002. Japanese has been studied in the MUC-6 conference, the IREX conference
and other work. Chinese is studied in an abundant literature (e.g., L.-J. Wang et al. 1992,
H.-H. Chen & Lee 1996, S. Yu et al. 1998) and so are French (G. Petasis et al. 2001,
Poibeau 2003), Greek (S. Boutsis et al. 2000) and Italian (W. Black et al. 1998, A.
Cucchiarelli & Velardi 2001). Many other languages received some attention as well:
Basque (C. Whitelaw & Patrick 2003), Bulgarian (J. Da Silva et al. 2004), Catalan (X.
Carreras et al. 2003), Cebuano (J. May et al. 2003), Danish (E. Bick 2004), Hindi (S.
Cucerzan & Yarowsky 1999, J. May et al. 2003), Korean (C. Whitelaw & Patrick 2003),
Polish (J. Piskorski 2004), Romanian (S. Cucerzan & Yarowsky 1999), Russian (B.
Popov et al. 2004), Swedish (D. Kokkinakis 1998) and Turkish (S. Cucerzan &
Yarowsky 1999). Portuguese was examined by (D. Palmer & Day 1997) and, at the time
of writing this survey, the HAREM conference is revisiting that language. Finally, Arabic
(F. Huang 2005) has started to receive a lot of attention in large-scale projects such as
Global Autonomous Language Exploitation (GALE)².
1.2 Textual genre or domain factor
The impact of textual genre (journalistic, scientific, informal, etc.) and domain
(gardening, sports, business, etc.) has been rather neglected in the NERC literature. Few
studies are specifically devoted to diverse genres and domains. D. Maynard et al. (2001)
designed a system for emails, scientific texts and religious texts. E. Minkov et al. (2005)
created a system specifically designed for email documents. Perhaps unsurprisingly, these
experiments demonstrated that although any domain can be reasonably supported, porting
a system to a new domain or textual genre remains a major challenge. T. Poibeau and
Kosseim (2001), for instance, tested some systems on both the MUC-6 collection
composed of newswire texts, and on a proprietary corpus made of manual translations of
phone conversations and technical emails. They report a drop in performance for every
system (some 20% to 40% of precision and recall).
¹ http://www.lrec-conf.org/
² http://projects.ldc.upenn.edu/gale/
1.3 Entity type factor
In the expression “Named Entity”, the word “Named” aims to restrict the task to only
those entities for which one or many rigid designators, as defined by S. Kripke (1982),
stand for the referent. For instance, the automotive company created by Henry Ford in
1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper
names as well as certain natural kind terms like biological species and substances. There
is a general agreement in the NERC community about the inclusion of temporal
expressions and some numerical expressions such as amounts of money and other types
of units. While some instances of these types are good examples of rigid designators
(e.g., the year 2001 is the 2001st year of the Gregorian calendar) there are also many
invalid ones (e.g., in June refers to the month of an undefined year – past June, this June,
June 2020, etc.). It is arguable that the NE definition is loosened in such cases for
practical reasons.
Early work formulates the NERC problem as recognizing “proper names” in
general (e.g., S. Coates-Stephens 1992, C. Thielen 1995). Overall, the most studied types
are three specializations of “proper names”: names of “persons”, “locations” and
“organizations”. These types are collectively known as “enamex” since the MUC-6
competition. The type “location” can in turn be divided into multiple subtypes of “fine-
grained locations”: city, state, country, etc. (M. Fleischman 2001, S. Lee & Geunbae Lee
2005). Similarly, “fine-grained person” sub-categories like “politician” and “entertainer”
appear in the work of M. Fleischman and Hovy (2002). The type “person” is quite
common and used at least once in an original way by O. Bodenreider and Zweigenbaum
(2000) who combine it with other cues for extracting medication and disease names
(e.g., “Parkinson disease”). In the ACE program, the type “facility” subsumes entities of
the types “location” and “organization”. The type “GPE” is used to represent a location
which has a government, such as a city or a country.
The type “miscellaneous” is used in the CONLL conferences and includes proper
names falling outside the classic “enamex”. The class is also sometimes augmented with
the type “product” (e.g., E. Bick 2004). The “timex” (another term coined in MUC) types
“date” and “time” and the “numex” types “money” and “percent” are also quite
predominant in the literature. Since 2003, a community named TIMEX2 (L. Ferro et al.
2005) proposes an elaborated standard for the annotation and normalization of temporal
expressions. Finally, marginal types are sometimes handled for specific needs: “film” and
“scientist” (O. Etzioni et al. 2005), “email address” and “phone number” (I. Witten et al.
1999, D. Maynard et al. 2001), “research area” and “project name” (J. Zhu et al. 2005),
“book title” (S. Brin 1998, I. Witten et al. 1999), “job title” (W. Cohen & Sarawagi 2004)
and “brand” (E. Bick 2004).
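The grouping of the most common types can be written down as a small lookup table. The following is a minimal sketch: the three group names and their members come from the text above, while the dictionary name `MUC_TYPE_GROUPS` and the helper `group_of` are illustrative assumptions and not part of any MUC specification.

```python
# Entity type groups as described in the text above. The names of the
# dictionary and the helper are hypothetical, chosen for illustration.
MUC_TYPE_GROUPS = {
    "enamex": ("person", "location", "organization"),
    "timex": ("date", "time"),
    "numex": ("money", "percent"),
}


def group_of(entity_type):
    """Return the group name for an entity type, or None if it is not listed."""
    wanted = entity_type.lower()
    for group, members in MUC_TYPE_GROUPS.items():
        if wanted in members:
            return group
    return None


if __name__ == "__main__":
    for entity_type in ("Person", "money", "miscellaneous"):
        print(entity_type, "->", group_of(entity_type))
```

Types outside the three groups, such as “miscellaneous” or the marginal types listed above, are deliberately left unmapped in this sketch.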
A recent interest in bioinformatics, and the availability of the GENIA corpus (T.
Ohta et al. 2002) led to many studies dedicated to types such as “protein”, “DNA”,
“RNA”, “cell line” and “cell type” (e.g., D. Shen et al. 2003, B. Settles 2004) as well as
studies targeted to “protein” recognition only (Y. Tsuruoka & Tsujii 2003). Related work
also includes “drug” (T. Rindfleisch et al. 2000) and “chemical” (M. Narayanaswamy et
al. 2003) names.
Some recent work does not limit the possible types to extract and is referred to as
“open domain” NERC (See E. Alfonseca & Manandhar 2002, R. Evans 2003). In this line
of research, S. Sekine and Nobata (2004) defined a named entity hierarchy which
includes many fine grained subcategories, such as museum, river or airport, and adds a
wide range of categories, such as product and event, as well as substance, animal, religion
or color. It tries to cover most frequent name types and rigid designators appearing in a
newspaper. The number of categories is about 200, and they are now defining popular
attributes for each category to make it an ontology.
1.4 What’s next?
Recent research in multimedia indexing, semi-supervised learning, complex linguistic
phenomena, and machine translation suggest some new directions for the field. On one
side, there is a growing interest in multimedia information processing (e.g., video,
speech) and particularly NE extraction from it (R. Basili et al. 2005). A lot of effort is also
invested toward semi-supervised and unsupervised approaches to NERC motivated by the
use of very large collections of texts (O. Etzioni et al. 2005) and the possibility of
handling multiple NE types (D. Nadeau et al. 2006). Complex linguistic phenomena (e.g.,
metonymy) that are a common shortcoming of current systems are under investigation (T.
Poibeau 2006). Finally, large-scale projects such as GALE, discussed in section 1.1,
open the way to integration of NERC and Machine Translation for mutual improvement.
2 Learning methods
The ability to recognize previously unknown entities is an essential part of NERC
systems. Such ability hinges upon recognition and classification rules triggered by
distinctive features associated with positive and negative examples. While early studies
were mostly based on handcrafted rules, most recent ones use supervised machine
learning (SL) as a way to automatically induce rule-based systems or sequence labeling
algorithms starting from a collection of training examples. This is evidenced, in the
research community, by the fact that five systems out of eight were rule-based in the
MUC-7 competition while sixteen systems were presented at CONLL-2003, a forum
devoted to learning techniques. When training examples are not available, handcrafted
rules remain the preferred technique, as shown in S. Sekine and Nobata (2004) who
developed a NERC system for 200 entity types.
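To make the notion of a handcrafted rule concrete, the following is a minimal sketch of one such rule for company names. It is not taken from any of the surveyed systems: the designator list `COMPANY_SUFFIXES`, the pattern `COMPANY_RULE` and the function `find_companies` are all assumptions made for illustration, and a practical system would combine many such rules with larger word lists.

```python
import re

# Hypothetical list of corporate designators (an assumption of this sketch).
COMPANY_SUFFIXES = ("Inc", "Corp", "Ltd", "Company")

# Rule: one or more capitalized words followed by a corporate designator,
# e.g. "Ford Motor Company".
COMPANY_RULE = re.compile(
    r"\b((?:[A-Z][a-z]+\s+)+(?:" + "|".join(COMPANY_SUFFIXES) + r"))\b"
)


def find_companies(text):
    """Return every span of text that matches the company-name rule."""
    return [match.group(1) for match in COMPANY_RULE.finditer(text)]


if __name__ == "__main__":
    print(find_companies("Henry Ford created Ford Motor Company in 1903."))
```

The rule fires only on names that carry an explicit designator, so it would miss the shorter form Ford. Gaps of this kind are one reason why rule sets grow large and why learning approaches became attractive.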
The idea of supervised learning is to study the features of positive and negative
examples of NE over a large collection of annotated documents and design rules that
capture instances of a given type. Section 2.1 explains SL approaches in more detail.
The main shortcoming of SL is the requirement of a large annotated corpus. The
unavailability of such resources and the prohibitive cost of creating them have led to two
alternative learning methods: semi-supervised learning (SSL) and unsupervised learning
(UL). These techniques are presented in sections 2.2 and 2.3 respectively.
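The supervised idea of learning from the features of positive and negative examples can be sketched with a toy classifier. The sketch below is an assumption-laden illustration and not one of the surveyed systems: it uses a naive-Bayes-style scorer over add-one smoothed counts, three hypothetical word-level features (capitalization, presence of a digit, previous word), binary labels "NE" and "O", and a tiny invented training set `TRAIN`.

```python
import math
from collections import Counter, defaultdict


def features(tokens, i):
    """Hypothetical word-level features for the token at position i."""
    word = tokens[i]
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return [
        "cap=" + str(word[0].isupper()),
        "digit=" + str(any(c.isdigit() for c in word)),
        "prev=" + prev,
    ]


def train(sentences):
    """Count labels and feature/label co-occurrences over annotated sentences."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    for tokens, labels in sentences:
        for i, label in enumerate(labels):
            label_counts[label] += 1
            for feat in features(tokens, i):
                feat_counts[label][feat] += 1
    return label_counts, feat_counts


def predict(model, tokens, i):
    """Return the label with the highest smoothed log score for token i."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    vocab = {f for counts in feat_counts.values() for f in counts}
    best, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for feat in features(tokens, i):
            # Add-one smoothing so unseen features do not zero out a label.
            score += math.log((feat_counts[label][feat] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best


# Tiny invented training set: "NE" marks positive examples, "O" negative ones.
TRAIN = [
    (["Mr.", "Ford", "founded", "a", "company"], ["O", "NE", "O", "O", "O"]),
    (["Mr.", "Sekine", "visited", "Canada"], ["O", "NE", "O", "NE"]),
    (["she", "moved", "to", "Canada"], ["O", "O", "O", "NE"]),
]

if __name__ == "__main__":
    model = train(TRAIN)
    sentence = ["Mr.", "Nadeau", "arrived"]
    print([predict(model, sentence, i) for i in range(len(sentence))])
```

The word Nadeau never occurs in the training set, yet its capitalization and the preceding word "Mr." are enough for the scorer to label it as an entity. With only three training sentences the behaviour is fragile, which illustrates why SL depends on a large annotated corpus.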
2.1 Supervised learning
The current dominant technique for addressing the NERC problem is supervised learning.
SL techniques include Hidden Markov Models (HMM) (D. Bikel et al. 1997), Decision
Trees (S. Sekine 1998), Maximum Entropy Models (ME) (A. Borthwick 1998), Support
Vector Machines (SVM) (M. Asahara & Matsumoto 2003), and Conditional Random
Fields (CRF) (A. McCallum & Li 2003). These are all variants of the SL approach that