Enriching Word Vectors with Subword Information
Piotr Bojanowski* and Edouard Grave* and Armand Joulin and Tomas Mikolov
Facebook AI Research
{bojanowski,egrave,ajoulin,tmikolov}@fb.com

*The two first authors contributed equally.
Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
1 Introduction
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on the left and two words on the right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character-level information.
In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.
2 Related work
Morphological word representations. In recent years, many methods have been proposed to incorporate morphological information into word representations. To model rare words better, Alexandrescu and Kirchhoff (2006) introduced factored neural language models, where words are represented as sets of features. These features might include morphological information, and this technique was successfully applied to morphologically rich languages, such as Turkish (Sak et al., 2010). Recently, several works have proposed different composition functions to derive representations of words from morphemes (Lazaridou et al., 2013; Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014). These different approaches rely on a morphological decomposition of words, while ours does not. Similarly, Chen et al. (2015) introduced a method to jointly learn embeddings for Chinese words and characters. Cui et al. (2015) proposed to constrain morphologically similar words to have similar representations. Soricut and Och (2015) described a method to learn vector representations of morphological transformations, allowing representations for unseen words to be obtained by applying these rules. Word representations trained on morphologically annotated data were introduced by Cotterell and Schütze (2015). Closest to our approach, Schütze (1993) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations. Very recently, Wieting et al. (2016) also proposed to represent words using character n-gram count vectors. However, the objective function used to learn these representations is based on paraphrase pairs, while our model can be trained on any text corpus.
Character-level features for NLP. Another area of research closely related to our work is character-level models for natural language processing. These models discard the segmentation into words and aim at learning language representations directly from characters. A first class of such models are recurrent neural networks, applied to language modeling (Mikolov et al., 2012; Sutskever et al., 2011; Graves, 2013; Bojanowski et al., 2015), text normalization (Chrupała, 2014), part-of-speech tagging (Ling et al., 2015) and parsing (Ballesteros et al., 2015). Another family of models are convolutional neural networks trained on characters, which were applied to part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (dos Santos and Gatti, 2014), text classification (Zhang et al., 2015) and language modeling (Kim et al., 2016). Sperr et al. (2013) introduced a language model based on restricted Boltzmann machines, in which words are encoded as a set of character n-grams. Finally, recent works in machine translation have proposed using subword units to obtain representations of rare words (Sennrich et al., 2016; Luong and Manning, 2016).
3 Model
In this section, we propose our model to learn word representations while taking into account morphology. We model morphology by considering subword units, and representing words as the sum of their character n-grams. We will begin by presenting the general framework that we use to train word vectors, then present our subword model and eventually describe how we handle the dictionary of character n-grams.
3.1 General model
We start by briefly reviewing the continuous skipgram model introduced by Mikolov et al. (2013b), from which our model is derived. Given a word vocabulary of size W, where a word is identified by its index w ∈ {1, ..., W}, the goal is to learn a vectorial representation for each word w. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context. More formally, given a large training corpus represented as a sequence of words w_1, ..., w_T, the objective of the skipgram model is to maximize the following log-likelihood:

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t),$$
where the context C_t is the set of indices of words surrounding word w_t. The probability of observing a context word w_c given w_t will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function s which maps pairs of (word, context) to scores in R. One possible choice to define the probability of a context word is the softmax:

$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.$$

However, such a model is not adapted to our case as it implies that, given a word w_t, we only predict one context word w_c.
The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:

$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),$$
where N_{t,c} is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function ℓ : x ↦ log(1 + e^{-x}), we can re-write the objective as:

$$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big({-s(w_t, n)}\big) \right].$$
A natural parameterization for the scoring function s between a word w_t and a context word w_c is to use word vectors. Let us define for each word w in the vocabulary two vectors u_w and v_w in R^d. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors u_{w_t} and v_{w_c}, corresponding, respectively, to words w_t and w_c. Then the score can be computed as the scalar product between word and context vectors:

$$s(w_t, w_c) = u_{w_t}^\top v_{w_c}.$$

The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
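To make this concrete, here is a minimal NumPy sketch of the negative sampling loss for a single (target, context) pair, following the equations above. The array names u and v, the toy dimensions, and the way negatives are sampled are illustrative assumptions, not details from the paper.

    import numpy as np

    def logistic_loss(x):
        # l(x) = log(1 + exp(-x)), computed stably via logaddexp
        return np.logaddexp(0.0, -x)

    def skipgram_ns_loss(u, v, t, c, negatives):
        # u: (W, d) input vectors, v: (W, d) output vectors
        # t, c: target and context word indices; negatives: sampled word indices
        s_pos = u[t] @ v[c]                        # s(w_t, w_c) = u_{w_t} . v_{w_c}
        loss = logistic_loss(s_pos)                # l(s(w_t, w_c))
        for n in negatives:
            loss += logistic_loss(-(u[t] @ v[n]))  # l(-s(w_t, n))
        return loss

    # Toy usage: vocabulary of 10 words, 5-dimensional vectors, 5 negatives.
    rng = np.random.default_rng(0)
    W, d = 10, 5
    u, v = rng.normal(size=(W, d)), rng.normal(size=(W, d))
    print(skipgram_ns_loss(u, v, t=3, c=7, negatives=rng.integers(0, W, size=5)))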
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function s, in order to take this information into account.

Each word w is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, allowing us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams:

<wh, whe, her, ere, re>

and the special sequence

<where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater than or equal to 3 and smaller than or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
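As an illustration, the following is a minimal sketch of this extraction for a single word; the function name and the use of a set are choices made for this example, not the paper's implementation.

    def char_ngrams(word, n_min=3, n_max=6):
        # Add boundary symbols so prefixes/suffixes differ from
        # word-internal sequences, then collect all n-grams.
        bounded = "<" + word + ">"
        ngrams = set()
        for n in range(n_min, n_max + 1):
            for i in range(len(bounded) - n + 1):
                ngrams.add(bounded[i:i + n])
        ngrams.add(bounded)  # also keep the special sequence <word> itself
        return ngrams

    # For "where" with n = 3 only: <wh, whe, her, ere, re> plus <where>.
    print(sorted(char_ngrams("where", n_min=3, n_max=3)))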
Suppose that you are given a dictionary of n-grams of size G. Given a word w, let us denote by G_w ⊂ {1, ..., G} the set of n-grams appearing in w. We associate a vector representation z_g to each n-gram g. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c.$$

This simple model allows sharing the representations across words, thus allowing reliable representations to be learned for rare words.
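Under the same illustrative conventions as the earlier sketch, the subword score can be written as follows; z and ngram_ids are hypothetical names for the n-gram vector table and the index set G_w.

    import numpy as np

    def subword_score(z, v, ngram_ids, c):
        # z: (G, d) n-gram vectors, v: (W, d) context (output) vectors
        # ngram_ids: indices in the n-gram dictionary of the n-grams of w
        w_vec = z[ngram_ids].sum(axis=0)  # word vector = sum of its n-gram vectors
        return w_vec @ v[c]               # s(w, c) = sum over g of z_g . v_c

    # Toy usage consistent with the earlier dimensions.
    rng = np.random.default_rng(0)
    z, v = rng.normal(size=(100, 5)), rng.normal(size=(10, 5))
    print(subword_score(z, v, ngram_ids=[4, 17, 42], c=7))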
In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant) [1]. We set K = 2·10^6 below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
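As a sketch of this step: the 32-bit FNV-1a variant below, with modulo bucketing into [0, K), is an assumption for illustration; the paper states only that FNV-1a is used with K = 2·10^6.

    FNV_PRIME_32 = 0x01000193   # 16777619
    FNV_OFFSET_32 = 0x811C9DC5  # 2166136261

    def ngram_bucket(ngram, K=2_000_000):
        # Hash a character n-gram with 32-bit FNV-1a, then bucket into [0, K).
        h = FNV_OFFSET_32
        for byte in ngram.encode("utf-8"):
            h ^= byte                            # FNV-1a: XOR the byte first...
            h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # ...then multiply, mod 2^32
        return h % K

    print(ngram_bucket("<wh"))  # bucket index for one n-gram of "where"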
4 Experimental setup
4.1 Baseline
In most experiments (except in Sec. 5.3), we compare our model to the C implementation

[1] http://www.isthe.com/chongo/tech/comp/fnv