Enriching Word Vectors with Subword Information
Piotr Bojanowski* and Edouard Grave* and Armand Joulin and Tomas Mikolov
Facebook AI Research
{bojanowski,egrave,ajoulin,tmikolov}@fb.com

*The two first authors contributed equally.
Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
1 Introduction
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on the left and two words on the right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character-level information.
In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.
2 Related work
Morphological word representations. In recent years, many methods have been proposed to incorporate morphological information into word representations. To model rare words better, Alexandrescu and Kirchhoff (2006) introduced factored neural language models, where words are represented as sets of features. These features might include morphological information, and this technique was successfully applied to morphologically rich languages, such as Turkish (Sak et al., 2010). Recently, several works have proposed different composition functions to derive representations of words from morphemes (Lazaridou et al., 2013; Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014). These different approaches rely on a morphological decomposition of words, while ours does not. Similarly, Chen et al. (2015) introduced a method to jointly learn embeddings for Chinese words and characters. Cui et al. (2015) proposed to constrain morphologically similar words to have similar representations. Soricut and Och (2015) described a method to learn vector representations of morphological transformations, allowing representations for unseen words to be obtained by applying these rules. Word representations trained on morphologically annotated data were introduced by Cotterell and Schütze (2015). Closest to our approach, Schütze (1993) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations. Very recently, Wieting et al. (2016) also proposed to represent words using character n-gram count vectors. However, the objective function used to learn these representations is based on paraphrase pairs, while our model can be trained on any text corpus.
Character-level features for NLP. Another area of research closely related to our work is character-level models for natural language processing. These models discard the segmentation into words and aim at learning language representations directly from characters. A first class of such models are recurrent neural networks, applied to language modeling (Mikolov et al., 2012; Sutskever et al., 2011; Graves, 2013; Bojanowski et al., 2015), text normalization (Chrupała, 2014), part-of-speech tagging (Ling et al., 2015) and parsing (Ballesteros et al., 2015). Another family of models are convolutional neural networks trained on characters, which were applied to part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (dos Santos and Gatti, 2014), text classification (Zhang et al., 2015) and language modeling (Kim et al., 2016). Sperr et al. (2013) introduced a language model based on restricted Boltzmann machines, in which words are encoded as a set of character n-grams. Finally, recent works in machine translation have proposed using subword units to obtain representations of rare words (Sennrich et al., 2016; Luong and Manning, 2016).
3 Model
In this section, we propose our model to learn word representations while taking into account morphology. We model morphology by considering subword units, and representing words as the sum of their character n-grams. We will begin by presenting the general framework that we use to train word vectors, then present our subword model and eventually describe how we handle the dictionary of character n-grams.
3.1 General model
We start by briefly reviewing the continuous skipgram model introduced by Mikolov et al. (2013b), from which our model is derived. Given a word vocabulary of size W, where a word is identified by its index w ∈ {1, ..., W}, the goal is to learn a vectorial representation for each word w. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context. More formally, given a large training corpus represented as a sequence of words w_1, ..., w_T, the objective of the skipgram model is to maximize the following log-likelihood:

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t),$$
where the context C_t is the set of indices of words surrounding word w_t. The probability of observing a context word w_c given w_t will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function s which maps pairs of (word, context) to scores in R. One possible choice to define the probability of a context word is the softmax:

$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.$$

However, such a model is not adapted to our case as it implies that, given a word w_t, we only predict one context word w_c.
The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:

$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),$$
where N_{t,c} is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function ℓ : x ↦ log(1 + e^{-x}), we can re-write the objective as:

$$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big({-s(w_t, n)}\big) \right].$$
A natural parameterization for the scoring function s between a word w_t and a context word w_c is to use word vectors. Let us define for each word w in the vocabulary two vectors u_w and v_w in R^d. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors u_{w_t} and v_{w_c}, corresponding, respectively, to words w_t and w_c. Then the score can be computed as the scalar product between word and context vectors:

$$s(w_t, w_c) = u_{w_t}^\top v_{w_c}.$$

The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
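To make this concrete, here is a minimal NumPy sketch of the negative sampling loss for a single (target, context) pair, following the equations above. The array names u and v, the toy dimensions, and the way negatives are sampled are illustrative assumptions, not details from the paper.

    import numpy as np

    def logistic_loss(x):
        # l(x) = log(1 + exp(-x)), computed stably via logaddexp
        return np.logaddexp(0.0, -x)

    def skipgram_ns_loss(u, v, t, c, negatives):
        # u: (W, d) input vectors, v: (W, d) output vectors
        # t, c: target and context word indices; negatives: sampled word indices
        s_pos = u[t] @ v[c]                        # s(w_t, w_c) = u_{w_t} . v_{w_c}
        loss = logistic_loss(s_pos)                # l(s(w_t, w_c))
        for n in negatives:
            loss += logistic_loss(-(u[t] @ v[n]))  # l(-s(w_t, n))
        return loss

    # Toy usage: vocabulary of 10 words, 5-dimensional vectors, 5 negatives.
    rng = np.random.default_rng(0)
    W, d = 10, 5
    u, v = rng.normal(size=(W, d)), rng.normal(size=(W, d))
    print(skipgram_ns_loss(u, v, t=3, c=7, negatives=rng.integers(0, W, size=5)))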
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function s, in order to take this information into account.

Each word w is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, allowing us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams:

<wh, whe, her, ere, re>

and the special sequence

<where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater than or equal to 3 and smaller than or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
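As an illustration, the following is a minimal sketch of this extraction for a single word; the function name and the use of a set are choices made for this example, not the paper's implementation.

    def char_ngrams(word, n_min=3, n_max=6):
        # Add boundary symbols so prefixes/suffixes differ from
        # word-internal sequences, then collect all n-grams.
        bounded = "<" + word + ">"
        ngrams = set()
        for n in range(n_min, n_max + 1):
            for i in range(len(bounded) - n + 1):
                ngrams.add(bounded[i:i + n])
        ngrams.add(bounded)  # also keep the special sequence <word> itself
        return ngrams

    # For "where" with n = 3 only: <wh, whe, her, ere, re> plus <where>.
    print(sorted(char_ngrams("where", n_min=3, n_max=3)))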
Suppose that you are given a dictionary of n-grams of size G. Given a word w, let us denote by G_w ⊂ {1, ..., G} the set of n-grams appearing in w. We associate a vector representation z_g to each n-gram g. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c.$$

This simple model allows sharing the representations across words, thus allowing reliable representations to be learned for rare words.
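Under the same illustrative conventions as the earlier sketch, the subword score can be written as follows; z and ngram_ids are hypothetical names for the n-gram vector table and the index set G_w.

    import numpy as np

    def subword_score(z, v, ngram_ids, c):
        # z: (G, d) n-gram vectors, v: (W, d) context (output) vectors
        # ngram_ids: indices in the n-gram dictionary of the n-grams of w
        w_vec = z[ngram_ids].sum(axis=0)  # word vector = sum of its n-gram vectors
        return w_vec @ v[c]               # s(w, c) = sum over g of z_g . v_c

    # Toy usage consistent with the earlier dimensions.
    rng = np.random.default_rng(0)
    z, v = rng.normal(size=(100, 5)), rng.normal(size=(10, 5))
    print(subword_score(z, v, ngram_ids=[4, 17, 42], c=7))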
In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant) [1]. We set K = 2·10^6 below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
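As a sketch of this step: the 32-bit FNV-1a variant below, with modulo bucketing into [0, K), is an assumption for illustration; the paper states only that FNV-1a is used with K = 2·10^6.

    FNV_PRIME_32 = 0x01000193   # 16777619
    FNV_OFFSET_32 = 0x811C9DC5  # 2166136261

    def ngram_bucket(ngram, K=2_000_000):
        # Hash a character n-gram with 32-bit FNV-1a, then bucket into [0, K).
        h = FNV_OFFSET_32
        for byte in ngram.encode("utf-8"):
            h ^= byte                            # FNV-1a: XOR the byte first...
            h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # ...then multiply, mod 2^32
        return h % K

    print(ngram_bucket("<wh"))  # bucket index for one n-gram of "where"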
4 Experimental setup
4.1 Baseline
In most experiments (except in Sec. 5.3), we compare our model to the C implementation

[1] http://www.isthe.com/chongo/tech/comp/fnv