Distributed Representations of Sentences and Documents
Quoc Le QVL@GOOGLE.COM
Tomas Mikolov TMIKOLOV@GOOGLE.COM
Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043
Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
1. Introduction

Text classification and clustering play an important role in many applications, e.g., document retrieval, web search, and spam filtering. At the heart of these applications are machine learning algorithms such as logistic regression or K-means. These algorithms typically require the text input to be represented as a fixed-length vector. Perhaps the most common fixed-length vector representation for texts is the bag-of-words or bag-of-n-grams (Harris, 1954), due to its simplicity, efficiency and often surprising accuracy.
Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

However, the bag-of-words (BOW) has many disadvantages. The word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used. Even though bag-of-n-grams considers the word order in short contexts, it suffers from data sparsity and high dimensionality. Bag-of-words and bag-of-n-grams have very little sense of the semantics of the words or, more formally, of the distances between the words. This means that the words “powerful,” “strong” and “Paris” are equally distant despite the fact that semantically, “powerful” should be closer to “strong” than to “Paris.”
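The order-loss weakness can be seen directly: under a bag-of-words encoding, any two sentences with the same word counts map to the same vector. A minimal sketch (the vocabulary and sentences are illustrative, not from the paper):

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Map a token list to a fixed-length count vector over vocab."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["dog", "bites", "man"]
a = bag_of_words("man bites dog".split(), vocab)
b = bag_of_words("dog bites man".split(), vocab)
assert a == b  # word order is lost: both sentences get the same vector
```

The two sentences have opposite meanings, yet any classifier over this representation cannot distinguish them.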
In this paper, we propose Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of texts. The texts can be of variable length, ranging from sentences to documents. The name Paragraph Vector emphasizes the fact that the method can be applied to variable-length pieces of texts, anything from a phrase or sentence to a large document.

In our model, the vector representation is trained to be useful for predicting words in a paragraph. More precisely, we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by stochastic gradient descent and backpropagation (Rumelhart et al., 1986). While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence.
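The training step described above can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: the dimensions are arbitrary, and it uses a plain softmax classifier where the paper's experiments use a hierarchical softmax for efficiency.

```python
import numpy as np

rng = np.random.default_rng(0)

V, P, d, k = 50, 3, 8, 2          # vocab size, #paragraphs, embedding dim, context words
D = rng.normal(0, 0.1, (P, d))    # one vector per paragraph (unique among paragraphs)
W = rng.normal(0, 0.1, (V, d))    # word vectors (shared across all paragraphs)
U = rng.normal(0, 0.1, (V, (k + 1) * d))  # softmax weights over the concatenated input

def train_step(pid, context, target, lr=0.1):
    """One SGD step: concatenate paragraph + context word vectors, predict the next word."""
    x = np.concatenate([D[pid]] + [W[c] for c in context])  # ((k+1)*d,) input
    logits = U @ x
    p = np.exp(logits - logits.max()); p /= p.sum()         # softmax
    err = p.copy(); err[target] -= 1.0                      # dL/dlogits for cross-entropy
    grad_x = U.T @ err                                      # gradient w.r.t. the input (old U)
    U[:] -= lr * np.outer(err, x)                           # in-place update of module-level U
    D[pid] -= lr * grad_x[:d]                               # update the paragraph vector
    for j, c in enumerate(context):                         # and each context word vector
        W[c] -= lr * grad_x[(j + 1) * d:(j + 2) * d]
    return -np.log(p[target])

losses = [train_step(0, context=[4, 7], target=9) for _ in range(5)]
assert losses[-1] < losses[0]  # repeated steps on an example reduce its loss
```

Inference on an unseen paragraph follows the same loop with `W` and `U` frozen: only a freshly initialized row of `D` is updated until convergence.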
Our technique is inspired by the recent work in learning vector representations of words using neural networks (Bengio et al., 2006; Collobert & Weston, 2008; Mnih & Hinton, 2008; Turian et al., 2010; Mikolov et al., 2013a;c). In their formulation, each word is represented by a vector which is concatenated or averaged with other word vectors in a context, and the resulting vector is used to predict other words in the context. For example, the neural network language model proposed in (Bengio et al., 2006) uses the concatenation of several previous word vectors to form the input of a neural network, and tries to predict the next word. The outcome is that after the model is trained, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations.
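The architecture the citation describes can be sketched schematically: look up a vector for each of the k previous words, concatenate them, and score every candidate next word through a hidden layer. All dimensions and weight names here are illustrative, not taken from (Bengio et al., 2006).

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, k, h = 30, 6, 3, 12              # vocab size, embedding dim, context length, hidden units
C = rng.normal(0, 0.1, (V, d))         # word-vector lookup table
H = rng.normal(0, 0.1, (h, k * d))     # hidden-layer weights
Uo = rng.normal(0, 0.1, (V, h))        # output-layer weights

def next_word_probs(context):
    """Concatenate the k previous word vectors and score every candidate next word."""
    x = np.concatenate([C[w] for w in context])  # (k*d,) input
    hidden = np.tanh(H @ x)                      # hidden representation
    logits = Uo @ hidden
    p = np.exp(logits - logits.max())            # numerically stable softmax
    return p / p.sum()

p = next_word_probs([2, 5, 11])
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-9
```

Training such a model by gradient descent on the next-word cross-entropy is what shapes the rows of `C` into semantically meaningful word vectors; Paragraph Vector extends this input with one extra learned vector per paragraph.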