2 Static representation
The development of static word representations can be roughly divided into two stages. In the first stage, words were represented by sparse, high-dimensional vectors. Such embeddings suffer from data sparsity, and their dimensionality, usually as large as the vocabulary size, makes them unwieldy to use. To cope with these problems, in the second stage, dense low-dimensional vectors were trained on large textual data to replace them. In this section, we first introduce the word representation models of both stages, and then describe the polysemy problem as well as several works that try to solve it with static embeddings.
2.1 One-hot and distributional representations
In the early days of natural language processing, words were represented with high-dimensional zero-one vectors, so-called one-hot word vectors, in which all entries are zero except the single entry corresponding to the word, which is one. With this approach, all vectors are orthogonal to each other, so it is impossible to measure the semantic distance between words. For instance, the words apple, orange and book are equally similar to each other under the one-hot representation.
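As a minimal sketch of this limitation (the three-word vocabulary and its indices are illustrative assumptions, not taken from the text), the cosine similarity between any two distinct one-hot vectors is exactly zero:

```python
import numpy as np

# Hypothetical toy vocabulary; the indices are arbitrary.
vocab = {"apple": 0, "orange": 1, "book": 2}

def one_hot(word, vocab):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

apple, orange, book = (one_hot(w, vocab) for w in ("apple", "orange", "book"))
print(cosine(apple, orange))  # 0.0 -- orthogonal, no notion of similarity
print(cosine(apple, book))    # 0.0 -- apple is "equally similar" to book and orange
```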
In order to model the syntactic and semantic similarity between words, additional features were leveraged to represent words, including morphology (suffixes, prefixes), part-of-speech tags, dictionary features such as word senses from WordNet², and Brown word clustering [11]. Further, new methods were proposed under the distributional semantic hypothesis: you shall know a word by the company it keeps [33]. Here, a word is represented by its context, or specifically by a vector whose entries are the counts of the words that appear in its context, which makes it possible to identify words that are semantically similar to each other. For instance, by observing a large text corpus, we can find that the contexts of apple are much more similar to those of orange than to those of book.
Formally, we denote by $V_W$ the vocabulary of words and by $V_C$ the vocabulary of predefined context words (hence $|V_W|$ and $|V_C|$ are the vocabulary sizes of words and contexts, respectively). Both of them are indexed, where $w_i$ stands for the ith word in the word vocabulary and $c_j$ for the jth word in the context vocabulary. A matrix $\boldsymbol{M} \in \mathbb{R}^{|V_W| \times |V_C|}$ is used to quantify the correlation of words and their contexts, where $M_{i,j}$ represents the correlation between word $w_i$ and context $c_j$, and $n(w_i, c_j)$ denotes the number of times $w_i$ occurs in the context of $c_j$ in a corpus D. The size of D is denoted by $|D| = \sum_{w \in V_W,\, c \in V_C} n(w, c)$.
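To make the notation concrete, the following sketch collects the counts $n(w, c)$ and the corpus size $|D|$; the toy corpus and the symmetric window of size 2 are assumptions of this illustration, not definitions from the text:

```python
from collections import Counter

# Toy corpus; a symmetric window of size 2 defines the "context" of a word.
corpus = [
    "i ate a red apple".split(),
    "she peeled an orange and an apple".split(),
    "he read a good book".split(),
]

WINDOW = 2
counts = Counter()  # counts[(w, c)] = n(w, c)

for sentence in corpus:
    for i, w in enumerate(sentence):
        lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, sentence[j])] += 1

D = sum(counts.values())  # |D| = sum of n(w, c) over all word-context pairs
print(counts[("apple", "red")], D)
```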
With such a distributional representation, the semantic similarity between words can easily be quantified by measuring the distance between their vectors, for example with cosine similarity or Euclidean distance. The distributional representation therefore provides a way to obtain the semantic similarity between words.
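For example, with made-up co-occurrence counts over a hypothetical set of contexts, cosine similarity recovers the intuition that apple is closer to orange than to book:

```python
import numpy as np

# Illustrative (made-up) co-occurrence counts over the contexts
# ["eat", "juice", "tree", "read", "page"].
apple  = np.array([10.0, 7.0, 5.0, 0.0, 0.0])
orange = np.array([ 8.0, 9.0, 4.0, 0.0, 0.0])
book   = np.array([ 0.0, 0.0, 1.0, 9.0, 6.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(apple, orange))  # close to 1: the two words share contexts
print(cosine(apple, book))    # close to 0: the contexts barely overlap
```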
However, it is not always appropriate to measure the correlation of words solely by their co-occurrence counts, since overly high weights might be assigned to word-context pairs containing common contexts. Consider the word apple: word-context pairs such as an apple and the apple would be observed much more frequently than red apple and apple tree, even though the latter are more informative.
An intuitive solution to this problem is to apply weighting factors such as tf-idf, which reduces the weights of word-context pairs in proportion to their frequency in the corpus [51], so that informative pairs receive relatively high weights.
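As a sketch of such reweighting, the snippet below treats each target word's row of context counts as a "document" and applies one common tf-idf variant; the counts, the vocabulary and the exact idf formula are illustrative assumptions rather than the scheme of [51]:

```python
import numpy as np

# Rows: words, columns: contexts; made-up co-occurrence counts.
# The contexts "an"/"the" occur with every word, "red"/"tree" only with apple.
#                    an   the  red  tree
counts = np.array([[50., 40.,  3.,  2.],   # apple
                   [45., 38.,  0.,  0.],   # orange
                   [30., 35.,  0.,  0.]])  # book

n_words = counts.shape[0]
df = (counts > 0).sum(axis=0)   # number of words each context co-occurs with
idf = np.log(n_words / df)      # frequent contexts receive a small idf factor
tfidf = counts * idf            # informative pairs keep relatively high weight

print(tfidf.round(2))
# The "an" and "the" columns are zeroed out (idf = log(3/3) = 0),
# while "red" and "tree" retain weight for apple.
```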
An alternative approach is to quantify the correlation of words and contexts with the pointwise mutual information (PMI) metric [23, 111], which measures the association between a pair of discrete outcomes x and y, defined as:
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

In this case, the association between a word $w$ and a context $c$ is measured by $\mathrm{PMI}(w, c)$, which can be estimated with the actual observed counts in the corpus. Hence the correlation between word $w_i$ and context $c_j$ in the word-context matrix changes to:

$$M_{i,j} = \mathrm{PMI}(w_i, c_j) = \log \frac{n(w_i, c_j) \cdot |D|}{n(w_i)\, n(c_j)}$$

where $n(w_i) = \sum_{c \in V_C} n(w_i, c)$ is the frequency of $w_i$ in the corpus D and $n(c_j) = \sum_{w \in V_W} n(w, c_j)$ is that of $c_j$ in D.
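A minimal sketch of estimating this matrix from raw counts follows; the counts are made up, and clipping unobserved pairs to zero (the positive PMI variant) is a common practical choice that the text does not prescribe:

```python
import numpy as np

# Made-up word-context counts n(w_i, c_j); rows are words, columns are contexts.
n = np.array([[10., 0., 3.],
              [ 8., 1., 0.],
              [ 0., 7., 5.]])

D = n.sum()                           # |D|
n_w = n.sum(axis=1, keepdims=True)    # n(w_i) = sum over c of n(w_i, c)
n_c = n.sum(axis=0, keepdims=True)    # n(c_j) = sum over w of n(w, c_j)

with np.errstate(divide="ignore"):
    # M_ij = log( n(w_i, c_j) * |D| / (n(w_i) * n(c_j)) )
    pmi = np.log(n * D / (n_w * n_c))

# Pairs never observed together yield -inf; clip them to zero (positive PMI).
ppmi = np.maximum(pmi, 0.0)
print(ppmi.round(3))
```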
Table 1  Notations used throughout the paper

Notation     Description
w            A single word
V            Vocabulary of words
E            An embedding matrix
E_w          Embedding of word w fetched from matrix E
D            Corpus
L            Objective function
softmax(·)   Softmax function

Uppercase bold-italic symbols denote matrices; lowercase bold-italic symbols denote vectors.
2  https://wordnet.princeton.edu/.