following work in sparse vector-space models
(Lin, 1998; Padó and Lapata, 2007; Baroni and
Lenci, 2010), we experiment with syntactic con-
texts that are derived from automatically produced
dependency parse-trees.
The different kinds of contexts produce no-
ticeably different embeddings, and induce differ-
ent word similarities. In particular, the bag-of-
words nature of the contexts in the “original”
SKIPGRAM model yields broad topical similari-
ties, while the dependency-based contexts yield
more functional similarities of a cohyponym na-
ture. This effect is demonstrated using both quali-
tative and quantitative analysis (Section 4).
The neural word-embeddings are considered
opaque, in the sense that it is hard to assign mean-
ings to the dimensions of the induced represen-
tation. In Section 5 we show that the SKIP-
GRAM model does allow for some introspection
by querying it for contexts that are “activated by” a
target word. This allows us to peek into the learned
representation and explore the contexts that are
found by the learning process to be most discrim-
inative of particular words (or groups of words).
To the best of our knowledge, this is the first work
to suggest such an analysis of discriminatively-
trained word-embedding models.
2 The Skip-Gram Model
Our departure point is the skip-gram neural em-
bedding model introduced in (Mikolov et al.,
2013a) trained using the negative-sampling pro-
cedure presented in (Mikolov et al., 2013b). In
this section we summarize the model and train-
ing objective following the derivation presented by
Goldberg and Levy (2014), and highlight the ease
of incorporating arbitrary contexts in the model.
In the skip-gram model, each word $w \in W$ is
associated with a vector $v_w \in \mathbb{R}^d$ and similarly
each context $c \in C$ is represented as a vector
$v_c \in \mathbb{R}^d$, where $W$ is the words vocabulary, $C$
is the contexts vocabulary, and $d$ is the embedding
dimensionality. The entries in the vectors are latent,
and treated as parameters to be learned.
Loosely speaking, we seek parameter values (that
is, vector representations for both words and
contexts) such that the dot product $v_w \cdot v_c$ associated
with “good” word-context pairs is maximized.
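As a concrete illustration, the learned parameters are simply two lookup tables of $d$-dimensional vectors, one for words and one for contexts. The following Python/NumPy sketch sets them up; the vocabulary sizes, dimensionality, and initialization scheme are illustrative assumptions, not details specified by the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the model description):
W_SIZE, C_SIZE, DIM = 10_000, 10_000, 300

# One d-dimensional vector per word and per context; both tables are learned parameters.
word_vecs = (rng.random((W_SIZE, DIM)) - 0.5) / DIM   # small random initialization (assumed)
ctx_vecs = np.zeros((C_SIZE, DIM))                    # zero initialization for contexts (assumed)
```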
More specifically, the negative-sampling objec-
tive assumes a dataset D of observed (w, c) pairs
of words w and their contexts c, as they appeared in
a large body of text. Consider a word-context pair
(w, c). Did this pair come from the data? We de-
note by p(D = 1|w, c) the probability that (w, c)
came from the data, and by p(D = 0|w, c) =
1 − p(D = 1|w, c) the probability that (w, c) did
not. The distribution is modeled as:
$$p(D = 1 \mid w, c) = \frac{1}{1 + e^{-v_w \cdot v_c}}$$

where $v_w$ and $v_c$ (each a $d$-dimensional vector) are
the model parameters to be learned. We seek to
maximize the log-probability of the observed pairs
belonging to the data, leading to the objective:
$$\arg\max_{v_w, v_c} \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}}$$
This objective admits a trivial solution in which
$p(D = 1 \mid w, c) = 1$ for every pair $(w, c)$. This can
be easily achieved by setting $v_c = v_w$ and $v_c \cdot v_w = K$
for all $c, w$, where $K$ is a large enough number.
In order to prevent the trivial solution, the
objective is extended with (w, c) pairs for which
$p(D = 1 \mid w, c)$ must be low, i.e. pairs which are
not in the data, by generating the set $D'$ of random
(w, c) pairs (assuming they are all incorrect),
yielding the negative-sampling training objective:

$$\arg\max_{v_w, v_c} \prod_{(w,c) \in D} p(D = 1 \mid c, w) \prod_{(w,c) \in D'} p(D = 0 \mid c, w)$$
which can be rewritten as:

$$\arg\max_{v_w, v_c} \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)$$

where $\sigma(x) = 1/(1 + e^{-x})$. The objective is trained
in an online fashion using stochastic-gradient
updates over the corpus $D \cup D'$.
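To make the online training concrete, the sketch below shows a single stochastic-gradient update for one (w, c) pair with a given label (1 for pairs from $D$, 0 for pairs from $D'$). The gradient follows from differentiating the corresponding $\log \sigma$ term above; the learning rate value is an illustrative assumption.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_w: np.ndarray, v_c: np.ndarray, label: int, lr: float = 0.025) -> None:
    """One gradient-ascent update on log sigma(v_c . v_w) (label=1)
    or log sigma(-v_c . v_w) (label=0), updating both vectors in place."""
    # Derivative of the log-likelihood term w.r.t. the dot product is (label - sigma(dot)).
    g = lr * (label - sigmoid(np.dot(v_c, v_w)))
    v_w_old = v_w.copy()
    v_w += g * v_c
    v_c += g * v_w_old
```

For an observed pair one would call sgd_step(word_vecs[w_id], ctx_vecs[c_id], label=1), and label=0 for each of its negative samples.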
The negative samples $D'$ can be constructed in
various ways. We follow the method proposed by
Mikolov et al.: for each $(w, c) \in D$ we construct
$n$ samples $(w, c_1), \ldots, (w, c_n)$, where $n$ is a
hyperparameter and each $c_j$ is drawn according to
its unigram distribution raised to the 3/4 power.
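A minimal sketch of this sampling scheme, assuming raw context counts are available as a NumPy array, could look as follows; the function names are hypothetical.

```python
import numpy as np

def build_noise_distribution(context_counts: np.ndarray) -> np.ndarray:
    """Unigram distribution over contexts, smoothed by raising counts to the 3/4 power."""
    smoothed = context_counts.astype(np.float64) ** 0.75
    return smoothed / smoothed.sum()

def draw_negatives(noise_dist: np.ndarray, n: int, rng: np.random.Generator) -> np.ndarray:
    """Draw n negative context ids for one observed (w, c) pair."""
    return rng.choice(len(noise_dist), size=n, p=noise_dist)
```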
Optimizing this objective makes observed
word-context pairs have similar embeddings,
while scattering unobserved pairs. Intuitively,
words that appear in similar contexts should have
similar embeddings, though we have not yet found
a formal proof that SKIPGRAM does indeed max-
imize the dot product of similar words.
3 Embedding with Arbitrary Contexts
In the SKIPGRAM embedding algorithm, the con-
texts of a word w are the words surrounding it