Enriching Word Vectors with Subword Information
Piotr Bojanowski∗ and Edouard Grave∗ and Armand Joulin and Tomas Mikolov
Facebook AI Research
{bojanowski,egrave,ajoulin,tmikolov}@fb.com
∗ The first two authors contributed equally.
Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
1 Introduction
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on its left and the two words on its right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character-level information.
In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.
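To make the idea concrete, below is a minimal sketch of how a word vector can be built as the sum of its character n-gram vectors. It is not the authors' released implementation: the boundary markers < and > and the 3- to 6-character n-gram range follow the paper, while the n-gram vector store, its dimensionality, and the function names are hypothetical placeholders.

# Minimal sketch: represent a word as the sum of its character n-gram vectors.
# Hypothetical names and random vectors; only the decomposition into n-grams
# and the summation reflect the approach described above.
import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > marking word boundaries."""
    marked = "<" + word + ">"
    grams = {marked}  # also keep the full (marked) word as a special sequence
    for n in range(n_min, n_max + 1):
        grams.update(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def word_vector(word, ngram_vectors, dim=100):
    """Sum the vectors of the word's n-grams that exist in the store.

    Words unseen at training time still receive a vector, as long as some
    of their character n-grams were seen."""
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        if gram in ngram_vectors:
            vec += ngram_vectors[gram]
    return vec


# Toy usage with randomly initialized n-gram vectors (placeholder data).
rng = np.random.default_rng(0)
store = {g: rng.normal(size=100) for g in char_ngrams("where")}
print(word_vector("where", store)[:5])

Because the vector of a word is assembled from shared n-gram vectors rather than looked up in a word table, rare and out-of-vocabulary word forms can still be assigned representations.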