2 Static representation
The development of static word representations can be roughly divided into two stages. In the first stage, words were represented by sparse, high-dimensional vectors. Such embeddings suffer from data sparsity, and their dimensionality, usually as large as the vocabulary size, makes them unwieldy to use. To cope with these problems, in the second stage, dense low-dimensional vectors were trained on large textual data to replace them. In this section, we first introduce the word representation models of both stages, and then describe the polysemy problem as well as several works that try to solve it with static embeddings.
2.1 One-hot and distributional representations
In the early days of natural language processing, words were represented with high-dimensional zero-one vectors, so-called one-hot word vectors, in which all entries are zero except the single entry corresponding to the word, which is one. With this approach, all vectors are orthogonal to each other, so it is impossible to measure the semantic distance between words. For instance, the words apple, orange and book are equally similar to each other under the one-hot representation.
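As a minimal sketch of this limitation (the three-word vocabulary and its indices are illustrative assumptions, not taken from the text), the cosine similarity between any two distinct one-hot vectors is exactly zero:

```python
import numpy as np

# Hypothetical toy vocabulary; the indices are arbitrary.
vocab = {"apple": 0, "orange": 1, "book": 2}

def one_hot(word, vocab):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

apple, orange, book = (one_hot(w, vocab) for w in ("apple", "orange", "book"))
print(cosine(apple, orange))  # 0.0 -- orthogonal, no notion of similarity
print(cosine(apple, book))    # 0.0 -- apple is "equally similar" to book and orange
```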
In order to model the syntactic and semantic similarity between words, additional features were leveraged to represent words, including morphology (suffixes, prefixes), part-of-speech tags, dictionary features such as word senses from WordNet², and Brown word clustering [11]. Further, new methods were proposed under the distributional semantic hypothesis: you shall know a word by the company it keeps [33]. Here, a word is represented by its context, or specifically by a vector whose entries are the counts of the words that appear in its context, which makes it possible to identify words that are semantically similar to each other. For instance, by observing a large text corpus, we can find that the contexts of apple are much more similar to those of orange than to those of book.
Formally, we denote by $V_W$ the vocabulary of words and by $V_C$ the vocabulary of predefined context words (hence $|V_W|$ and $|V_C|$ are the vocabulary sizes of words and contexts, respectively). Both of them are indexed, where $w_i$ stands for the ith word in the word vocabulary and $c_j$ for the jth word in the context vocabulary. A matrix $\boldsymbol{M} \in \mathbb{R}^{|V_W| \times |V_C|}$ is used to quantify the correlation of words and their contexts, where $M_{i,j}$ represents the correlation between word $w_i$ and context $c_j$, and $n(w_i, c_j)$ denotes the number of times $w_i$ occurs in the context of $c_j$ in a corpus D. The size of D is denoted by $|D| = \sum_{w \in V_W,\, c \in V_C} n(w, c)$.
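To make the notation concrete, the following sketch collects the counts $n(w, c)$ and the corpus size $|D|$; the toy corpus and the symmetric window of size 2 are assumptions of this illustration, not definitions from the text:

```python
from collections import Counter

# Toy corpus; a symmetric window of size 2 defines the "context" of a word.
corpus = [
    "i ate a red apple".split(),
    "she peeled an orange and an apple".split(),
    "he read a good book".split(),
]

WINDOW = 2
counts = Counter()  # counts[(w, c)] = n(w, c)

for sentence in corpus:
    for i, w in enumerate(sentence):
        lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, sentence[j])] += 1

D = sum(counts.values())  # |D| = sum of n(w, c) over all word-context pairs
print(counts[("apple", "red")], D)
```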
With such a distributional representation, the semantic similarity between words can easily be quantified by measuring the distance between their vectors, for example with cosine similarity or Euclidean distance. The distributional representation therefore provides a way to obtain the semantic similarity between words.
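For example, with made-up co-occurrence counts over a hypothetical set of contexts, cosine similarity recovers the intuition that apple is closer to orange than to book:

```python
import numpy as np

# Illustrative (made-up) co-occurrence counts over the contexts
# ["eat", "juice", "tree", "read", "page"].
apple  = np.array([10.0, 7.0, 5.0, 0.0, 0.0])
orange = np.array([ 8.0, 9.0, 4.0, 0.0, 0.0])
book   = np.array([ 0.0, 0.0, 1.0, 9.0, 6.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(apple, orange))  # close to 1: the two words share contexts
print(cosine(apple, book))    # close to 0: the contexts barely overlap
```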
However, it is not always appropriate to measure the correlation of words solely by their co-occurrence counts, since overly high weights might be assigned to word-context pairs containing common contexts. Consider the word apple: word-context pairs such as an apple and the apple would be observed much more frequently than red apple and apple tree, even though the latter are more informative.
An intuitive solution to this problem is to apply weighting factors such as tf-idf, which reduces the weights of word-context pairs in proportion to their frequency in the corpus [51], so that informative pairs receive relatively high weights.
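As a sketch of such reweighting, the snippet below treats each target word's row of context counts as a "document" and applies one common tf-idf variant; the counts, the vocabulary and the exact idf formula are illustrative assumptions rather than the scheme of [51]:

```python
import numpy as np

# Rows: words, columns: contexts; made-up co-occurrence counts.
# The contexts "an"/"the" occur with every word, "red"/"tree" only with apple.
#                    an   the  red  tree
counts = np.array([[50., 40.,  3.,  2.],   # apple
                   [45., 38.,  0.,  0.],   # orange
                   [30., 35.,  0.,  0.]])  # book

n_words = counts.shape[0]
df = (counts > 0).sum(axis=0)   # number of words each context co-occurs with
idf = np.log(n_words / df)      # frequent contexts receive a small idf factor
tfidf = counts * idf            # informative pairs keep relatively high weight

print(tfidf.round(2))
# The "an" and "the" columns are zeroed out (idf = log(3/3) = 0),
# while "red" and "tree" retain weight for apple.
```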
An alternative approach is to quantify the correlation of words and contexts with the pointwise mutual information (PMI) metric [23, 111], which measures the association between a pair of discrete outcomes x and y, defined as:
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

In this case, the association between a word $w$ and a context $c$ is measured by $\mathrm{PMI}(w, c)$, which can be estimated with the actual observed counts in the corpus. Hence the correlation between word $w_i$ and context $c_j$ in the word-context matrix changes to:

$$M_{i,j} = \mathrm{PMI}(w_i, c_j) = \log \frac{n(w_i, c_j) \cdot |D|}{n(w_i)\, n(c_j)}$$

where $n(w_i) = \sum_{c \in V_C} n(w_i, c)$ is the frequency of $w_i$ in the corpus D and $n(c_j) = \sum_{w \in V_W} n(w, c_j)$ is that of $c_j$ in D.
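A minimal sketch of estimating this matrix from raw counts follows; the counts are made up, and clipping unobserved pairs to zero (the positive PMI variant) is a common practical choice that the text does not prescribe:

```python
import numpy as np

# Made-up word-context counts n(w_i, c_j); rows are words, columns are contexts.
n = np.array([[10., 0., 3.],
              [ 8., 1., 0.],
              [ 0., 7., 5.]])

D = n.sum()                           # |D|
n_w = n.sum(axis=1, keepdims=True)    # n(w_i) = sum over c of n(w_i, c)
n_c = n.sum(axis=0, keepdims=True)    # n(c_j) = sum over w of n(w, c_j)

with np.errstate(divide="ignore"):
    # M_ij = log( n(w_i, c_j) * |D| / (n(w_i) * n(c_j)) )
    pmi = np.log(n * D / (n_w * n_c))

# Pairs never observed together yield -inf; clip them to zero (positive PMI).
ppmi = np.maximum(pmi, 0.0)
print(ppmi.round(3))
```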
Table 1  Notations used throughout the paper

Notation     Description
w            A single word
V            Vocabulary of words
E            An embedding matrix
E_w          Embedding of word w fetched from matrix E
D            Corpus
L            Objective function
softmax(·)   Softmax function

Uppercase bold-italic symbols denote matrices; lowercase bold-italic symbols denote vectors.
2  https://wordnet.princeton.edu/.