following work in sparse vector-space models
(Lin, 1998; Padó and Lapata, 2007; Baroni and
Lenci, 2010), we experiment with syntactic con-
texts that are derived from automatically produced
dependency parse-trees.
The different kinds of contexts produce no-
ticeably different embeddings, and induce differ-
ent word similarities. In particular, the bag-of-
words nature of the contexts in the “original”
SKIPGRAM model yields broad topical similari-
ties, while the dependency-based contexts yield
more functional similarities of a cohyponym na-
ture. This effect is demonstrated using both quali-
tative and quantitative analysis (Section 4).
The neural word-embeddings are considered
opaque, in the sense that it is hard to assign mean-
ings to the dimensions of the induced represen-
tation. In Section 5 we show that the SKIP-
GRAM model does allow for some introspection
by querying it for contexts that are “activated by” a
target word. This allows us to peek into the learned
representation and explore the contexts that are
found by the learning process to be most discrim-
inative of particular words (or groups of words).
To the best of our knowledge, this is the first work
to suggest such an analysis of discriminatively-
trained word-embedding models.
2 The Skip-Gram Model
Our departure point is the skip-gram neural em-
bedding model introduced in (Mikolov et al.,
2013a) trained using the negative-sampling pro-
cedure presented in (Mikolov et al., 2013b). In
this section we summarize the model and train-
ing objective following the derivation presented by
Goldberg and Levy (2014), and highlight the ease
of incorporating arbitrary contexts in the model.
In the skip-gram model, each word $w \in W$ is
associated with a vector $v_w \in \mathbb{R}^d$ and similarly
each context $c \in C$ is represented as a vector
$v_c \in \mathbb{R}^d$, where $W$ is the words vocabulary, $C$
is the contexts vocabulary, and $d$ is the embedding
dimensionality. The entries in the vectors are latent,
and treated as parameters to be learned.
Loosely speaking, we seek parameter values (that
is, vector representations for both words and
contexts) such that the dot product $v_w \cdot v_c$ associated
with “good” word-context pairs is maximized.
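As a concrete illustration, the learned parameters are simply two lookup tables of $d$-dimensional vectors, one for words and one for contexts. The following Python/NumPy sketch sets them up; the vocabulary sizes, dimensionality, and initialization scheme are illustrative assumptions, not details specified by the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the model description):
W_SIZE, C_SIZE, DIM = 10_000, 10_000, 300

# One d-dimensional vector per word and per context; both tables are learned parameters.
word_vecs = (rng.random((W_SIZE, DIM)) - 0.5) / DIM   # small random initialization (assumed)
ctx_vecs = np.zeros((C_SIZE, DIM))                    # zero initialization for contexts (assumed)
```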
More specifically, the negative-sampling objec-
tive assumes a dataset D of observed (w, c) pairs
of words w and their contexts c, as they appeared in
a large body of text. Consider a word-context pair
(w, c). Did this pair come from the data? We de-
note by p(D = 1|w, c) the probability that (w, c)
came from the data, and by p(D = 0|w, c) =
1 − p(D = 1|w, c) the probability that (w, c) did
not. The distribution is modeled as:
$$p(D = 1 \mid w, c) = \frac{1}{1 + e^{-v_w \cdot v_c}}$$

where $v_w$ and $v_c$ (each a $d$-dimensional vector) are
the model parameters to be learned. We seek to
maximize the log-probability of the observed pairs
belonging to the data, leading to the objective:
$$\arg\max_{v_w, v_c} \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}}$$
This objective admits a trivial solution in which
$p(D = 1 \mid w, c) = 1$ for every pair $(w, c)$. This can
be easily achieved by setting $v_c = v_w$ and $v_c \cdot v_w = K$
for all $c, w$, where $K$ is a large enough number.
In order to prevent the trivial solution, the
objective is extended with (w, c) pairs for which
$p(D = 1 \mid w, c)$ must be low, i.e. pairs which are
not in the data, by generating the set $D'$ of random
(w, c) pairs (assuming they are all incorrect),
yielding the negative-sampling training objective:

$$\arg\max_{v_w, v_c} \prod_{(w,c) \in D} p(D = 1 \mid c, w) \prod_{(w,c) \in D'} p(D = 0 \mid c, w)$$
which can be rewritten as:

$$\arg\max_{v_w, v_c} \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)$$

where $\sigma(x) = 1/(1 + e^{-x})$. The objective is trained
in an online fashion using stochastic-gradient
updates over the corpus $D \cup D'$.
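To make the online training concrete, the sketch below shows a single stochastic-gradient update for one (w, c) pair with a given label (1 for pairs from $D$, 0 for pairs from $D'$). The gradient follows from differentiating the corresponding $\log \sigma$ term above; the learning rate value is an illustrative assumption.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_w: np.ndarray, v_c: np.ndarray, label: int, lr: float = 0.025) -> None:
    """One gradient-ascent update on log sigma(v_c . v_w) (label=1)
    or log sigma(-v_c . v_w) (label=0), updating both vectors in place."""
    # Derivative of the log-likelihood term w.r.t. the dot product is (label - sigma(dot)).
    g = lr * (label - sigmoid(np.dot(v_c, v_w)))
    v_w_old = v_w.copy()
    v_w += g * v_c
    v_c += g * v_w_old
```

For an observed pair one would call sgd_step(word_vecs[w_id], ctx_vecs[c_id], label=1), and label=0 for each of its negative samples.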
The negative samples $D'$ can be constructed in
various ways. We follow the method proposed by
Mikolov et al.: for each $(w, c) \in D$ we construct
$n$ samples $(w, c_1), \ldots, (w, c_n)$, where $n$ is a
hyperparameter and each $c_j$ is drawn according to
its unigram distribution raised to the 3/4 power.
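A minimal sketch of this sampling scheme, assuming raw context counts are available as a NumPy array, could look as follows; the function names are hypothetical.

```python
import numpy as np

def build_noise_distribution(context_counts: np.ndarray) -> np.ndarray:
    """Unigram distribution over contexts, smoothed by raising counts to the 3/4 power."""
    smoothed = context_counts.astype(np.float64) ** 0.75
    return smoothed / smoothed.sum()

def draw_negatives(noise_dist: np.ndarray, n: int, rng: np.random.Generator) -> np.ndarray:
    """Draw n negative context ids for one observed (w, c) pair."""
    return rng.choice(len(noise_dist), size=n, p=noise_dist)
```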
Optimizing this objective makes observed
word-context pairs have similar embeddings,
while scattering unobserved pairs. Intuitively,
words that appear in similar contexts should have
similar embeddings, though we have not yet found
a formal proof that SKIPGRAM does indeed max-
imize the dot product of similar words.
3 Embedding with Arbitrary Contexts
In the SKIPGRAM embedding algorithm, the con-
texts of a word w are the words surrounding it