Neural Architectures for Named Entity Recognition
Guillaume Lample♠  Miguel Ballesteros♣♠  Sandeep Subramanian♠  Kazuya Kawakami♠  Chris Dyer♠
♠Carnegie Mellon University
♣NLP Group, Pompeu Fabra University
{glample,sandeeps,kkawakam,cdyer}@cs.cmu.edu,
miguel.ballesteros@upf.edu
Abstract
State-of-the-art named entity recognition sys-
tems rely heavily on hand-crafted features and
domain-specific knowledge in order to learn
effectively from the small, supervised training
corpora that are available. In this paper, we
introduce two new neural architectures—one
based on bidirectional LSTMs and conditional
random fields, and the other that constructs
and labels segments using a transition-based
approach inspired by shift-reduce parsers.
Our models rely on two sources of infor-
mation about words: character-based word
representations learned from the supervised
corpus and unsupervised word representa-
tions learned from unannotated corpora. Our
models obtain state-of-the-art performance in
NER in four languages without resorting to
any language-specific knowledge or resources
such as gazetteers.¹
1 Introduction
Named entity recognition (NER) is a challenging
learning problem. On the one hand, in most lan-
guages and domains, there is only a very small
amount of supervised training data available. On the
other, there are few constraints on the kinds of words
that can be names, so generalizing from this small
sample of data is difficult. As a result, carefully con-
structed orthographic features and language-specific
knowledge resources, such as gazetteers, are widely
used for solving this task. Unfortunately, language-
specific resources and features are costly to de-
velop in new languages and new domains, making
NER a challenge to adapt. Unsupervised learning
¹The code of the LSTM-CRF and Stack-LSTM NER
systems is available at https://github.com/
glample/tagger and https://github.com/clab/
stack-lstm-ner
from unannotated corpora offers an alternative strat-
egy for obtaining better generalization from small
amounts of supervision. However, even systems
that have relied extensively on unsupervised fea-
tures (Collobert et al., 2011; Turian et al., 2010;
Lin and Wu, 2009; Ando and Zhang, 2005b, in-
ter alia) have used these to augment, rather than
replace, hand-engineered features (e.g., knowledge
about capitalization patterns and character classes in
a particular language) and specialized knowledge re-
sources (e.g., gazetteers).
In this paper, we present neural architectures
for NER that use no language-specific resources
or features beyond a small amount of supervised
training data and unlabeled corpora. Our mod-
els are designed to capture two intuitions. First,
since names often consist of multiple tokens, rea-
soning jointly over tagging decisions for each to-
ken is important. We compare two models here,
(i) a bidirectional LSTM with a sequential condi-
tional random field layer above it (LSTM-CRF; §2), and
(ii) a new model that constructs and labels chunks
of input sentences using an algorithm inspired by
transition-based parsing with states represented by
stack LSTMs (S-LSTM; §3). Second, token-level
evidence for “being a name” includes both ortho-
graphic evidence (what does the word being tagged
as a name look like?) and distributional evidence
(where does the word being tagged tend to oc-
cur in a corpus?). To capture orthographic sen-
sitivity, we use a character-based word representa-
tion model (Ling et al., 2015b); to capture distribu-
tional sensitivity, we combine these representations
with distributional representations (Mikolov et al.,
2013b). Our word representations combine both of
these, and dropout training is used to encourage the
model to learn to trust both sources of evidence (§4).
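The sketch below illustrates this word-representation layer as a minimal PyTorch-style module; it is not the authors' released code (their implementations use Theano and C++), and the class name, layer dimensions, and padding convention are illustrative assumptions. A character-level bidirectional LSTM supplies orthographic evidence, a lookup table initialized from pretrained embeddings supplies distributional evidence, and dropout is applied to their concatenation so the model learns to rely on both sources.

import torch
import torch.nn as nn


class WordRepresentation(nn.Module):
    """Illustrative sketch: character-based + pretrained word embeddings with dropout."""

    def __init__(self, n_chars, n_words, char_dim=25, char_lstm_dim=25,
                 word_dim=100, dropout=0.5, pretrained=None):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Bidirectional character LSTM: the final forward and backward states
        # form the character-based (orthographic) word representation.
        self.char_lstm = nn.LSTM(char_dim, char_lstm_dim,
                                 bidirectional=True, batch_first=True)
        # Word lookup table, optionally initialized from pretrained
        # (distributional) embeddings such as those of Mikolov et al. (2013b).
        self.word_embed = nn.Embedding(n_words, word_dim)
        if pretrained is not None:  # pretrained: tensor of shape (n_words, word_dim)
            self.word_embed.weight.data.copy_(pretrained)
        self.dropout = nn.Dropout(dropout)

    def forward(self, char_ids, word_ids):
        # char_ids: (n_tokens, max_word_len), word_ids: (n_tokens,)
        char_vecs = self.char_embed(char_ids)
        _, (h_n, _) = self.char_lstm(char_vecs)          # h_n: (2, n_tokens, char_lstm_dim)
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)  # (n_tokens, 2 * char_lstm_dim)
        word_repr = self.word_embed(word_ids)            # (n_tokens, word_dim)
        # Concatenate orthographic and distributional evidence, then apply
        # dropout so neither source can be relied on exclusively.
        return self.dropout(torch.cat([char_repr, word_repr], dim=-1))

The resulting per-token vectors would then be fed to the sentence-level BiLSTM-CRF or stack-LSTM model described in the following sections.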
Experiments in English, Dutch, German, and
Spanish show that we are able to obtain state-