
1.3 Entity type factor

In the expression "Named Entity", the word "Named" aims to restrict the task to only those entities for which one or many rigid designators, as defined by S. Kripke (1982), stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances. There is a general agreement in the NERC community about the inclusion of temporal expressions and some numerical expressions such as amounts of money and other types of units. While some instances of these types are good examples of rigid designators (e.g., the year 2001 is the 2001st year of the Gregorian calendar), there are also many invalid ones (e.g., in June refers to the month of an undefined year: past June, this June, June 2020, etc.). It is arguable that the NE definition is loosened in such cases for practical reasons.

Early work formulates the NERC problem as recognizing "proper names" in general (e.g., S. Coates-Stephens 1992, C. Thielen 1995). Overall, the most studied types are three specializations of "proper names": names of "persons", "locations" and "organizations". These types are collectively known as "enamex" since the MUC-6 competition. The type "location" can in turn be divided into multiple subtypes of "fine-grained locations": city, state, country, etc. (M. Fleischman 2001, S. Lee & Geunbae Lee 2005). Similarly, "fine-grained person" sub-categories like "politician" and "entertainer" appear in the work of M. Fleischman and Hovy (2002). The type "person" is quite common and is used at least once in an original way by O. Bodenreider and Zweigenbaum (2000), who combine it with other cues for extracting medication and disease names (e.g., "Parkinson disease"). In the ACE program, the type "facility" subsumes entities of the types "location" and "organization".
The type "GPE" is used to represent a location which has a government, such as a city or a country.

The type "miscellaneous" is used in the CONLL conferences and includes proper names falling outside the classic "enamex". The class is also sometimes augmented with the type "product" (e.g., E. Bick 2004). The "timex" (another term coined in MUC) types "date" and "time" and the "numex" types "money" and "percent" are also quite predominant in the literature. Since 2003, a community named TIMEX2 (L. Ferro et al. 2005) proposes an elaborated standard for the annotation and normalization of temporal expressions. Finally, marginal types are sometimes handled for specific needs: "film" and "scientist" (O. Etzioni et al. 2005), "email address" and "phone number" (I. Witten et al. 1999, D. Maynard et al. 2001), "research area" and "project name" (J. Zhu et al. 2005), "book title" (S. Brin 1998, I. Witten et al. 1999), "job title" (W. Cohen & Sarawagi 2004) and "brand" (E. Bick 2004).

A recent interest in bioinformatics, and the availability of the GENIA corpus (Ohta et al. 2002), led to many studies dedicated to types such as "protein", "DNA", "RNA", "cell line" and "cell type" (e.g., D. Shen et al. 2003, B. Settles 2004) as well as studies targeted to "protein" recognition only (Y. Tsuruoka & Tsujii 2003). Related work also includes "drug" (T. Rindflesch et al. 2000) and "chemical" (M. Narayanaswamy et al. 2003) names.

Some recent work does not limit the possible types to extract and is referred to as "open domain" NERC (see E. Alfonseca & Manandhar 2002, R. Evans 2003). In this line of research, S. Sekine and Nobata (2004) defined a named entity hierarchy which includes many fine-grained subcategories, such as museum, river or airport, and adds a wide range of categories, such as product and event, as well as substance, animal, religion or color. It tries to cover the most frequent name types and rigid designators appearing in a newspaper.
The number of categories is about 200, and they are now defining popular attributes for each category to make it an ontology.

1.4 What's next

Recent research in multimedia indexing, semi-supervised learning, complex linguistic phenomena, and machine translation suggests some new directions for the field. On one side, there is a growing interest in multimedia information processing (e.g., video, speech) and particularly NE extraction from it (R. Basili et al. 2005). A lot of effort is also invested toward semi-supervised and unsupervised approaches to NERC, motivated by the use of very large collections of texts (O. Etzioni et al. 2005) and the possibility of handling multiple NE types (D. Nadeau et al. 2006). Complex linguistic phenomena (e.g., metonymy) that are common shortcomings of current systems are under investigation (T. Poibeau 2006). Finally, large-scale projects such as GALE, discussed in section 1.1, open the way to integration of NERC and machine translation for mutual improvement.

2 Learning methods

The ability to recognize previously unknown entities is an essential part of NERC systems. Such ability hinges upon recognition and classification rules triggered by distinctive features associated with positive and negative examples. While early studies were mostly based on handcrafted rules, most recent ones use supervised machine learning (SL) as a way to automatically induce rule-based systems or sequence labeling algorithms starting from a collection of training examples. This is evidenced, in the research community, by the fact that five systems out of eight were rule-based in the MUC-7 competition while sixteen systems were presented at CONLL-2003, a forum devoted to learning techniques. When training examples are not available, handcrafted rules remain the preferred technique, as shown in S. Sekine and Nobata (2004), who developed a NERC system for 200 entity types.

The idea of supervised learning is to study the features of positive and negative examples of NE over a large collection of annotated documents and design rules that capture instances of a given type. Section 2.1 explains SL approaches in more detail. The main shortcoming of SL is the requirement of a large annotated corpus. The unavailability of such resources and the prohibitive cost of creating them lead to two alternative learning methods: semi-supervised learning (SSL) and unsupervised learning (UL). These techniques are presented in sections 2.2 and 2.3 respectively.

2.1 Supervised learning

The current dominant technique for addressing the NERC problem is supervised learning. SL techniques include Hidden Markov Models (HMM) (D. Bikel et al. 1997), Decision Trees (S. Sekine 1998), Maximum Entropy Models (ME) (A. Borthwick 1998), Support Vector Machines (SVM) (M. Asahara & Matsumoto 2003), and Conditional Random Fields (CRF) (A. McCallum & Li 2003). These are all variants of the SL approach that typically consist of a system that reads a large annotated corpus, memorizes lists of entities, and creates disambiguation rules based on discriminative features.

A baseline SL method that is often proposed consists of tagging words of a test corpus when they are annotated as entities in the training corpus. The performance of the baseline system depends on the vocabulary transfer, which is the proportion of words, without repetitions, appearing in both training and testing corpus. D. Palmer and Day (1997) calculated the vocabulary transfer on the MUC-6 training data. They report a transfer of 21%, with as much as 42% of location names being repeated but only 17% of organizations and 13% of person names.
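The baseline tagger and the vocabulary transfer measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy entity lists and the choice of the distinct test names as the denominator are ours, not taken from the cited studies.

```python
def vocabulary_transfer(train_names, test_names):
    """Share of distinct test names that were also seen in training.

    Assumption: the denominator is the set of distinct test names;
    the text above does not fix this choice.
    """
    train_set, test_set = set(train_names), set(test_names)
    if not test_set:
        return 0.0
    return len(train_set & test_set) / len(test_set)


def baseline_tag(tokens, memorized):
    """Tag a token with its memorized training type, or "O" if unseen."""
    return [(tok, memorized.get(tok, "O")) for tok in tokens]


# Hypothetical training annotations: entity string -> type.
train = {"Paris": "LOC", "Ford": "ORG", "Kripke": "PER"}
# Hypothetical entity mentions found in a test corpus (with repetitions).
test_names = ["Paris", "Paris", "Montreal", "Ford", "Sekine"]

transfer = vocabulary_transfer(train, test_names)
tagged = baseline_tag("Ford opened offices in Montreal".split(), train)
print(transfer)
print(tagged)
```

In this toy run, "Montreal" is missed because it never appears in the training annotations, which is exactly the limitation that vocabulary transfer quantifies.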
Vocabulary transfer is a good indicator of the recall (number of entities identified over the total number of entities) of the baseline system but is a pessimistic measure since some entities are frequently repeated in documents. A. Mikheev et al. (1999) precisely calculated the recall of the baseline system on the MUC-7 corpus. They report a recall of 76% for locations, 49% for organizations and 26% for persons, with precision ranging from 70% to 90%. Whitelaw and Patrick (2003) report consistent results on MUC-7 for the aggregated enamex class. For the three enamex types together, the precision of recognition is 76% and the recall is 48%.

2.2 Semi-supervised learning

The term "semi-supervised" (or "weakly supervised") is relatively recent. The main technique for SSL is called "bootstrapping" and involves a small degree of supervision, such as a set of seeds, for starting the learning process. For example, a system aimed at "disease names" might ask the user to provide a small number of example names. Then the system searches for sentences that contain these names and tries to identify some contextual clues common to these examples. Then, the system tries to find other instances of disease names that appear in similar contexts. The learning process is then reapplied to the newly found examples, so as to discover new relevant contexts. By repeating this process, a large number of disease names and a large number of contexts will eventually be gathered. Recent experiments in semi-supervised NERC (Nadeau et al. 2006) report performances that rival baseline supervised approaches. Here are some examples of SSL approaches.

S. Brin (1998) uses lexical features implemented by regular expressions in order to generate lists of book titles paired with book authors.
It starts with seed examples such as {Isaac Asimov, The Robots of Dawn} and uses some fixed lexical control rules such as the following regular expression, [A-Z][A-Za-z .,&]{5,30}[A-Za-z.], used to describe a title. The main idea of his algorithm, however, is that many web sites conform to a reasonably uniform format across the site. When a given web site is found to contain seed examples, new pairs can often be identified using simple constraints such as the presence of identical text before, between or after the elements of an interesting pair. For example, the passage "The Robots of Dawn, by Isaac Asimov (Paperback)" would allow finding, on the same web site, "The Ants, by Bernard Werber (Paperback)".

M. Collins and Singer (1999) parse a complete corpus in search of candidate NE patterns. A pattern is, for instance, a proper name (as identified by a part-of-speech tagger) followed by a noun phrase in apposition (e.g., Maury Cooper, a vice president at S&P). Patterns are kept in pairs {spelling, context}, where spelling refers to the proper name and context refers to the noun phrase in its context. Starting with an initial seed of spelling rules (e.g., rule 1: if the spelling is "New York" then it is a Location; rule 2: if the spelling contains "Mr." then it is a Person; rule 3: if the spelling is all capitalized then it is an Organization), the candidates are examined. Candidates that satisfy a spelling rule are classified accordingly and their contexts are accumulated. The most frequent contexts found are turned into a set of contextual rules. Following the steps above, contextual rules can be used to find further spelling rules, and so on. M. Collins and Singer and R. Yangarber et al. (2002) demonstrate the idea that learning several types of NE simultaneously allows the finding of negative evidence (one type against all) and reduces over-generation. S. Cucerzan and Yarowsky (1999) also use a similar technique and apply it to many languages.

E. Riloff and Jones (1999) introduce mutual bootstrapping, which consists of growing a set of entities and a set of contexts in turn. Instead of working with predefined candidate NEs (found using a fixed syntactic construct), they start with a handful of seed entity examples of a given type (e.g., Bolivia, Guatemala, Honduras are entities of type country) and accumulate all patterns found around these seeds in a large corpus. Contexts (e.g., offices in X, facilities in X, ...) are ranked and used to find new examples. Riloff and Jones note that the performance of that algorithm can deteriorate rapidly when noise is introduced in the entity list or pattern list. While they report relatively low precision and recall in their experiments, their work proved to be highly influential.

A. Cucchiarelli and Velardi (2001) use syntactic relations (e.g., subject-object) to discover more accurate contextual evidence around the entities. Again, this is a variant of E. Riloff and Jones' mutual bootstrapping (1999). Interestingly, instead of using human-generated seeds, they rely on existing NER systems (called early NE classifier) for initial NE examples.

M. Pasca et al. (2006) also use techniques inspired by mutual bootstrapping. However, they innovate through the use of D. Lin's (1998) distributional similarity to generate synonyms (or, more generally, words which are members of the same semantic class), allowing pattern generalization. For instance, for the pattern X was born in November, Lin's synonyms for November are {March, October, April, Mar, Aug, February, Jul, Nov., ...}, thus allowing the induction of new patterns such as X was born in March. One of the contributions of Pasca et al. is to apply the technique to very large corpora (100 million web documents) and demonstrate that starting from a seed of 10 example facts (defined as entities of type person paired with entities of type year, standing for the person's year of birth) it is possible to generate one million facts with a precision of about 88%.

The problem of unlabelled data selection is addressed by J. Heng and Grishman (2006). They show how an existing NE classifier can be improved using bootstrapping methods. The main lesson they report is that relying upon a large collection of documents is not sufficient by itself. Selection of documents using information retrieval-like relevance measures and selection of specific contexts that are rich in proper names and coreferences bring the best results in their experiments.

2.3 Unsupervised learning

The typical approach in unsupervised learning is clustering. For example, one can try to gather named entities from clustered groups based on the similarity of context. There are other unsupervised methods too. Basically, the techniques rely on lexical resources (e.g., WordNet), on lexical patterns and on statistics computed on a large unannotated corpus. Here are some examples.

E. Alfonseca and Manandhar (2002) study the problem of labeling an input word with an appropriate NE type. NE types are taken from WordNet (e.g., location>country, animate>person, animate>animal, etc.). The approach is to assign a topic signature to each WordNet synset by merely listing words that frequently co-occur with it in a large corpus. Then, given an input word in a given document, the word context (words appearing in a fixed-size window around the input word) is compared to type signatures and classified under the most similar one.

In R. Evans (2003), the method for identification of hyponyms/hypernyms described in the work of M. Hearst (1992) is applied in order to identify potential hypernyms of sequences of capitalized words appearing in a document.
For instance, when X is a capitalized sequence, the query "such as X" is searched on the web and, in the retrieved documents, the noun that immediately precedes the query can be chosen as the hypernym of X. Similarly, in P. Cimiano and Volker (2005), Hearst patterns are used but this time, the feature consists of counting the number of occurrences of passages like "city such as", "organization such as", etc.

Y. Shinyama and Sekine (2004) used an observation that named entities often appear synchronously in several news articles, whereas common nouns do not. They found a strong correlation between being a named entity and appearing punctually (in time) and simultaneously in multiple news sources. This technique allows identifying rare named entities in an unsupervised manner and can be useful in combination with other NERC methods.

In O. Etzioni et al. (2005), Pointwise Mutual Information and Information Retrieval (PMI-IR) is used as a feature to assess that a named entity can be classified under a given type. PMI-IR, developed by P. Turney (2001), measures the dependence between two expressions using web queries. A high PMI-IR means that expressions tend to co-occur. O. Etzioni et al. create features for each candidate entity (e.g., London) and a large number of automatically generated discriminator phrases like "is a city", "nation of", etc.

3 Feature space for NERC

Features are descriptors or characteristic attributes of words designed for algorithmic consumption. An example of a feature is a boolean variable with the value true if a word is capitalized and false otherwise. Feature vector representation is an abstraction over text where typically each word is represented by one or many boolean, numeric and nominal values.
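As a rough illustration of such a representation, the Python sketch below turns each word of a sentence into a small vector holding one value of each kind (boolean, numeric, nominal). The attribute names, the WordFeatures type and the sample sentence are assumptions chosen for the example, not part of any system cited here.

```python
from typing import NamedTuple


class WordFeatures(NamedTuple):
    capitalized: bool  # boolean: does the word start with an uppercase letter?
    length: int        # numeric: length of the word in characters
    lowercased: str    # nominal: lowercased form of the word


def featurize(sentence):
    """Split on whitespace, drop surrounding punctuation, build one vector per word."""
    words = [w.strip(".,;:!?") for w in sentence.split()]
    return [WordFeatures(w[:1].isupper(), len(w), w.lower()) for w in words if w]


vectors = featurize("Henry Ford created Ford Motor Company in 1903.")
for vector in vectors:
    print(vector)
```

A rule system or a learning algorithm would then operate on these vectors rather than on the raw text.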
For example, a hypothetical NERC system may represent each word of a text with 3 attributes:

1) a boolean attribute with the value true if the word is capitalized and false otherwise;
2) a numeric attribute corresponding to the length, in characters, of the word;
3) a nominal attribute corresponding to the lowercased version of the word.

In this scenario, the sentence "The president of Apple eats an apple.", excluding the punctuation, would be represented by the following feature vectors:

<true, 3, "the">, <false, 9, "president">, <false, 2, "of">, <true, 5, "apple">, <false, 4, "eats">, <false, 2, "an">, <false, 5, "apple">

Usually, the NERC problem is resolved by applying a rule system over the features. For instance, a system might have two rules, a recognition rule: "capitalized words are candidate entities" and a classification rule: "the type of candidate entities of length greater than 3 characters is organization". These rules work well for the exemplar sentence above. However, real systems tend to be much more complex and their rules are often created by automatic learning techniques.

In this section, we present the features most often used for the recognition and classification of named entities. We organize them along three different axes: word-level features, list lookup features, and document and corpus features.

3.1 Word-level features

Word-level features are related to the character makeup of words. They specifically describe word case, punctuation, numerical value and special characters. Table 1 lists subcategories of word-level features.

Table 1: Word-level features

Features         Examples
Case             Starts with a capital letter
                 Word is all uppercased
                 The word is mixed case (e.g., ProSys, eBay)
Punctuation      Ends with period, has internal period (e.g., St., I.B.M.)
                 Internal apostrophe, hyphen or ampersand (e.g., O'Connor)
Digit            Digit pattern (see section 3.1.1)
                 Cardinal and ordinal
                 Word with digits (e.g., W3C, 3M)
Character        Possessive mark, first person pronoun
                 Greek letter
Morphology       Singular version, stem
                 Common ending (see section 3.1.2)
Part-of-speech   Proper name, verb, noun, foreign word
Function         Alpha, non-alpha, n-gram (see section 3.1.3)
                 Lowercase, uppercase version
                 Pattern, summarized pattern (see section 3.1.4)
                 Token length, phrase length

3.1.1 Digit pattern

Digits can express a wide range of useful information such as dates, percentages, intervals, identifiers, etc. Special attention must be given to some particular patterns of digits. For example, two-digit and four-digit numbers can stand for years (D. Bikel et al. 1997) and when followed by an "s", they can stand for a decade; one and two digits may stand for a day or a month (S. Yu et al. 1998).

3.1.2 Common word ending

Morphological features are essentially related to word affixes and roots. For instance, a system may learn that a human profession often ends in "ist" (journalist, cyclist) or that nationalities and languages often end in "ish" and "an" (Spanish, Danish, Romanian). Another example of common word ending is organization names that often end in "ex", "tech", and "soft" (E. Bick 2004).

3.1.3 Functions over words

Features can be extracted by applying functions over words. An example is given by M. Collins and Singer (1999), who create a feature by isolating the non-alphabetic characters of a word (e.g., nonalpha(A.T.&T.) = ..&.). Another example is given by J. Patrick et al. (2002), who use character n-grams as features.

3.1.4 Patterns and summarized patterns

Pattern features were introduced by M. Collins (2002) and then used by others (W. Cohen & Sarawagi 2004 and B. Settles 2004). Their role is to map words onto a small set of patterns over character types. For instance, a pattern feature might map all uppercase letters to "A", all lowercase letters to "a", all digits to "0" and all punctuation to "-":

x = "G.M.": GetPattern(x) = "A-A-"
x = "Machine-223": GetPattern(x) = "Aaaaaaa-000"

The summarized pattern feature is a condensed form of the above in which consecutive character types are not repeated in the mapped string. For instance, the preceding examples become:

x = "G.M.": GetSummarizedPattern(x) = "A-A-"
x = "Machine-223": GetSummarizedPattern(x) = "Aa-0"

3.2 List lookup features

Lists are the privileged features in NERC. The terms "gazetteer", "lexicon" and "dictionary" are often used interchangeably with the term "list". List inclusion is a way to express the relation "is a" (e.g., Paris is a city). It may appear obvious that if a word (Paris) is an element of a list of cities, then the probability of this word being a city, in a given text, is high. However, because of word polysemy, the probability is almost never 1 (e.g., the probability of "Fast" representing a company is low because of the common adjective "fast" that is much more frequent).

Table 2: List lookup features

Features              Examples
General list          General dictionary (see section 3.2.1)
                      Stop words (function words)
                      Capitalized nouns (e.g., January, Monday)
                      Common abbreviations
List of entities      Organization, government, airline, educational
                      First name, last name, celebrity
                      Astral body, continent, country, state, city
List of entity cues   Typical words in organization (see 3.2.2)
                      Person title, name prefix, post-nominal letters
                      Location typical word, cardinal point

In Table 2, we present three significant categories of lists used in the literature. We could enumerate many more list examples but we decided to concentrate on those aimed at recognizing enamex types.

3.2.1 General dictionary

Common nouns listed in a dictionary are useful, for instance, in the disambiguation of capitalized words in ambiguous positions (e.g., sentence beginning).
A. Mikheev (1999) reports that from 2677 words in ambiguous position in a given corpus, a general dictionary lookup allows identifying 1841 common nouns out of 1851 (99.4%) while only discarding 171 named entities out of 826 (20.7%). In other words, 20.7% of named entities are ambiguous with common nouns in that corpus.

3.2.2 Words that are typical of organization names

Many authors propose to recognize organizations by identifying words that are frequently used in their names. For instance, knowing that "associates" is frequently used in organization names could lead to the recognition of "Computer Associates" and "BioMedia Associates" (D. McDonald 1993, R. Gaizauskas et al. 1995). The same rule applies to frequent first words ("American", "General") of an organization (L. Rau 1991). Some authors also exploit the fact that organizations often include the name of a person (F. Wolinski et al. 1995, Y. Ravin & Wacholder 1996) as in "Alfred P. Sloan Foundation". Similarly, geographic names can be good indicators of an organization name (F. Wolinski et al. 1995) as in "France Telecom". Organization designators such as "inc" and "corp" (L. Rau 1991) are also useful features.

3.2.3 On the list lookup techniques

Most approaches implicitly require candidate words to exactly match at least one element of a pre-existing list. However, we may want to allow some flexibility in the match conditions. At least three alternate lookup strategies are used in the NERC field.

First, words can be stemmed (stripping off both inflectional and derivational suffixes) or lemmatized (normalizing for inflections only) before they are matched (S. Coates-Stephens 1992). For instance, if a list of cue words contains "technology", the inflected form "technologies" will be considered as a successful match. For some languages (M. Jansche 2002), diacritics can be replaced by their canonical equivalent (e.g., "é" replaced by "e").
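The first lookup strategy can be sketched as follows in Python. This is a minimal sketch under loud assumptions: toy_lemma is a deliberately naive stand-in for a real stemmer or lemmatizer, and the lists are invented for the example. Only the diacritic folding relies on a standard facility, the unicodedata module of the Python standard library.

```python
import unicodedata


def strip_diacritics(word):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_lemma(word):
    """Hypothetical plural-stripping rule; a real system would use a proper lemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(word):
    return toy_lemma(strip_diacritics(word).lower())


def in_list(word, entries):
    """Flexible lookup: compare normalized forms instead of exact strings."""
    return normalize(word) in {normalize(entry) for entry in entries}


cue_words = ["technology", "associates"]  # invented cue-word list
cities = ["Montréal", "Paris"]            # invented city list

print(in_list("technologies", cue_words))
print(in_list("Montreal", cities))
print(in_list("fast", cities))
```

Both sides of the comparison go through the same normalization, so the list itself does not need to be stored in normalized form.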
