2 Related work
Morphological word representations. In recent
years, many methods have been proposed to incor-
porate morphological information into word repre-
sentations. To model rare words better, Alexan-
drescu and Kirchhoff (2006) introduced factored
neural language models, where words are repre-
sented as sets of features. These features might in-
clude morphological information, and this technique
was successfully applied to morphologically rich lan-
guages, such as Turkish (Sak et al., 2010). Re-
cently, several works have proposed different com-
position functions to derive representations of words
from morphemes (Lazaridou et al., 2013; Luong
et al., 2013; Botha and Blunsom, 2014; Qiu et
al., 2014). These different approaches rely on a
morphological decomposition of words, while ours
does not. Similarly, Chen et al. (2015) introduced
a method to jointly learn embeddings for Chinese
words and characters. Cui et al. (2015) proposed
to constrain morphologically similar words to have
similar representations. Soricut and Och (2015)
described a method to learn vector representations
of morphological transformations, which makes it
possible to obtain representations for unseen words by applying
these rules. Word representations trained on mor-
phologically annotated data were introduced by Cot-
terell and Schütze (2015). Closest to our approach,
Schütze (1993) learned representations of character
four-grams through singular value decomposition,
and derived representations for words by summing
the four-gram representations. Very recently, Wi-
eting et al. (2016) also proposed to represent words
using character n-gram count vectors. However, the
objective function used to learn these representa-
tions is based on paraphrase pairs, while our model
can be trained on any text corpus.
Character-level features for NLP. Another area
of research closely related to our work is character-
level models for natural language processing. These
models discard the segmentation into words and aim
at learning language representations directly from
characters. A first class of such models is recur-
rent neural networks, applied to language model-
ing (Mikolov et al., 2012; Sutskever et al., 2011;
Graves, 2013; Bojanowski et al., 2015), text nor-
malization (Chrupała, 2014), part-of-speech tag-
ging (Ling et al., 2015) and parsing (Ballesteros et
al., 2015). Another family of models is convolu-
tional neural networks trained on characters, which
were applied to part-of-speech tagging (dos San-
tos and Zadrozny, 2014), sentiment analysis (dos
Santos and Gatti, 2014), text classification (Zhang
et al., 2015) and language modeling (Kim et al.,
2016). Sperr et al. (2013) introduced a language
model based on restricted Boltzmann machines, in
which words are encoded as a set of character n-
grams. Finally, recent works in machine translation
have proposed using subword units to obtain repre-
sentations of rare words (Sennrich et al., 2016; Lu-
ong and Manning, 2016).
3 Model
In this section, we propose our model to learn word representations while taking morphology into account. We model morphology by considering subword units, representing each word as the sum of its character n-gram vectors. We begin by presenting the general framework that we use to train word vectors, then present our subword model, and finally describe how we handle the dictionary of character n-grams.
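To make the subword representation concrete, the following is a minimal sketch of character n-gram extraction. The boundary markers '<' and '>' and the n-gram lengths used here are illustrative choices for this sketch, not a specification of the final model.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract the character n-grams of a word.

    '<' and '>' mark the beginning and end of the word so that
    prefixes and suffixes can be distinguished from inner n-grams
    (illustrative convention; lengths n_min..n_max are parameters).
    """
    w = "<" + word + ">"
    ngrams = {w}  # keep the full word (with boundaries) as a unit
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            ngrams.add(w[i:i + n])
    return ngrams

# The vector of a word is then the sum of the vectors z[g]
# associated with its n-grams, e.g.:
#   v("where") = sum(z[g] for g in char_ngrams("where"))
```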
3.1 General model
We start by briefly reviewing the continuous skip-
gram model introduced by Mikolov et al. (2013b),
from which our model is derived. Given a word vocabulary of size $W$, where a word is identified by its index $w \in \{1, \ldots, W\}$, the goal is to learn a vectorial representation for each word $w$. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context. More formally, given a large training corpus represented as a sequence of words
$w_1, \ldots, w_T$, the objective of the skipgram model is to maximize the following log-likelihood:
$$\sum_{t=1}^{T} \; \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t),$$
where the context $\mathcal{C}_t$ is the set of indices of words surrounding word $w_t$. The probability of observing a context word $w_c$ given $w_t$ will be parameterized
using the aforementioned word vectors. For now, let
us consider that we are given a scoring function $s$ which maps pairs of (word, context) to scores in $\mathbb{R}$.
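As an illustration of this framework, the sketch below evaluates the skipgram log-likelihood for a generic scoring function $s$, using a full softmax over the vocabulary to define $p(w_c \mid w_t)$. The function and variable names are hypothetical, and in practice the full softmax is typically replaced by a cheaper surrogate such as negative sampling.

```python
import numpy as np

def skipgram_log_likelihood(corpus, contexts, score, vocab_size):
    """Skipgram log-likelihood for a scoring function `score(w_t, w_c)`.

    `corpus` is a sequence of word indices, and `contexts[t]` contains
    the positions of the words surrounding position t (the set C_t).
    p(w_c | w_t) is defined here by a softmax over the whole vocabulary
    (illustrative; too costly for large vocabularies in practice).
    """
    total = 0.0
    for t, w_t in enumerate(corpus):
        # scores of w_t against every word in the vocabulary
        all_scores = np.array([score(w_t, j) for j in range(vocab_size)])
        # numerically stable log partition function
        m = all_scores.max()
        log_z = m + np.log(np.exp(all_scores - m).sum())
        for c in contexts[t]:
            w_c = corpus[c]
            total += score(w_t, w_c) - log_z  # log p(w_c | w_t)
    return total
```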