2 Related work
Morphological word representations. In recent
years, many methods have been proposed to incor-
porate morphological information into word repre-
sentations. To model rare words better, Alexan-
drescu and Kirchhoff (2006) introduced factored
neural language models, where words are repre-
sented as sets of features. These features might in-
clude morphological information, and this technique
was successfully applied to morphologically rich lan-
guages, such as Turkish (Sak et al., 2010). Re-
cently, several works have proposed different com-
position functions to derive representations of words
from morphemes (Lazaridou et al., 2013; Luong
et al., 2013; Botha and Blunsom, 2014; Qiu et
al., 2014). These different approaches rely on a
morphological decomposition of words, while ours
does not. Similarly, Chen et al. (2015) introduced
a method to jointly learn embeddings for Chinese
words and characters. Cui et al. (2015) proposed
to constrain morphologically similar words to have
similar representations. Soricut and Och (2015)
described a method to learn vector representations
of morphological transformations, which makes it
possible to obtain representations for unseen words by applying
these rules. Word representations trained on mor-
phologically annotated data were introduced by Cot-
terell and Schütze (2015). Closest to our approach,
Schütze (1993) learned representations of character
four-grams through singular value decomposition,
and derived representations for words by summing
the four-gram representations. Very recently, Wi-
eting et al. (2016) also proposed to represent words
using character n-gram count vectors. However, the
objective function used to learn these representa-
tions is based on paraphrase pairs, while our model
can be trained on any text corpus.
Character-level features for NLP. Another area
of research closely related to our work is character-
level models for natural language processing. These
models discard the segmentation into words and aim
at learning language representations directly from
characters. A first class of such models is recur-
rent neural networks, applied to language model-
ing (Mikolov et al., 2012; Sutskever et al., 2011;
Graves, 2013; Bojanowski et al., 2015), text nor-
malization (Chrupała, 2014), part-of-speech tag-
ging (Ling et al., 2015) and parsing (Ballesteros et
al., 2015). Another family of models is convolu-
tional neural networks trained on characters, which
were applied to part-of-speech tagging (dos San-
tos and Zadrozny, 2014), sentiment analysis (dos
Santos and Gatti, 2014), text classification (Zhang
et al., 2015) and language modeling (Kim et al.,
2016). Sperr et al. (2013) introduced a language
model based on restricted Boltzmann machines, in
which words are encoded as a set of character n-
grams. Finally, recent works in machine translation
have proposed using subword units to obtain repre-
sentations of rare words (Sennrich et al., 2016; Lu-
ong and Manning, 2016).
3 Model
In this section, we propose our model to learn word representations while taking morphology into account. We model morphology by considering subword units, representing each word as the sum of its character n-gram vectors. We begin by presenting the general framework that we use to train word vectors, then present our subword model, and finally describe how we handle the dictionary of character n-grams.
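To make the subword representation concrete, the following is a minimal sketch of character n-gram extraction. The boundary markers '<' and '>' and the n-gram lengths used here are illustrative choices for this sketch, not a specification of the final model.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract the character n-grams of a word.

    '<' and '>' mark the beginning and end of the word so that
    prefixes and suffixes can be distinguished from inner n-grams
    (illustrative convention; lengths n_min..n_max are parameters).
    """
    w = "<" + word + ">"
    ngrams = {w}  # keep the full word (with boundaries) as a unit
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            ngrams.add(w[i:i + n])
    return ngrams

# The vector of a word is then the sum of the vectors z[g]
# associated with its n-grams, e.g.:
#   v("where") = sum(z[g] for g in char_ngrams("where"))
```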
3.1 General model
We start by briefly reviewing the continuous skip-
gram model introduced by Mikolov et al. (2013b),
from which our model is derived. Given a word vocabulary of size $W$, where a word is identified by its index $w \in \{1, \ldots, W\}$, the goal is to learn a vectorial representation for each word $w$. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context. More formally, given a large training corpus represented as a sequence of words
$w_1, \ldots, w_T$, the objective of the skipgram model is to maximize the following log-likelihood:
$$\sum_{t=1}^{T} \; \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t),$$
where the context $\mathcal{C}_t$ is the set of indices of words surrounding word $w_t$. The probability of observing a context word $w_c$ given $w_t$ will be parameterized
using the aforementioned word vectors. For now, let
us consider that we are given a scoring function $s$ which maps pairs of (word, context) to scores in $\mathbb{R}$.
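As an illustration of this framework, the sketch below evaluates the skipgram log-likelihood for a generic scoring function $s$, using a full softmax over the vocabulary to define $p(w_c \mid w_t)$. The function and variable names are hypothetical, and in practice the full softmax is typically replaced by a cheaper surrogate such as negative sampling.

```python
import numpy as np

def skipgram_log_likelihood(corpus, contexts, score, vocab_size):
    """Skipgram log-likelihood for a scoring function `score(w_t, w_c)`.

    `corpus` is a sequence of word indices, and `contexts[t]` contains
    the positions of the words surrounding position t (the set C_t).
    p(w_c | w_t) is defined here by a softmax over the whole vocabulary
    (illustrative; too costly for large vocabularies in practice).
    """
    total = 0.0
    for t, w_t in enumerate(corpus):
        # scores of w_t against every word in the vocabulary
        all_scores = np.array([score(w_t, j) for j in range(vocab_size)])
        # numerically stable log partition function
        m = all_scores.max()
        log_z = m + np.log(np.exp(all_scores - m).sum())
        for c in contexts[t]:
            w_c = corpus[c]
            total += score(w_t, w_c) - log_z  # log p(w_c | w_t)
    return total
```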