Enriching Word Vectors with Subword Information
Piotr Bojanowski∗ and Edouard Grave∗ and Armand Joulin and Tomas Mikolov
Facebook AI Research
{bojanowski,egrave,ajoulin,tmikolov}@fb.com
∗ The first two authors contributed equally.
Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
1 Introduction
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on its left and the two words on its right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character-level information.
In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.
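To make the idea concrete, below is a minimal sketch of how a word vector can be built as the sum of its character n-gram vectors. It is not the authors' released implementation: the boundary markers < and > and the 3- to 6-character n-gram range follow the paper, while the n-gram vector store, its dimensionality, and the function names are hypothetical placeholders.

# Minimal sketch: represent a word as the sum of its character n-gram vectors.
# Hypothetical names and random vectors; only the decomposition into n-grams
# and the summation reflect the approach described above.
import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > marking word boundaries."""
    marked = "<" + word + ">"
    grams = {marked}  # also keep the full (marked) word as a special sequence
    for n in range(n_min, n_max + 1):
        grams.update(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def word_vector(word, ngram_vectors, dim=100):
    """Sum the vectors of the word's n-grams that exist in the store.

    Words unseen at training time still receive a vector, as long as some
    of their character n-grams were seen."""
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        if gram in ngram_vectors:
            vec += ngram_vectors[gram]
    return vec


# Toy usage with randomly initialized n-gram vectors (placeholder data).
rng = np.random.default_rng(0)
store = {g: rng.normal(size=100) for g in char_ngrams("where")}
print(word_vector("where", store)[:5])

Because the vector of a word is assembled from shared n-gram vectors rather than looked up in a word table, rare and out-of-vocabulary word forms can still be assigned representations.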