Distributed Representations of Sentences and Documents
Quoc Le QVL@GOOGLE.COM
Tomas Mikolov TMIKOLOV@GOOGLE.COM
Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043
Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
1. Introduction

Text classification and clustering play an important role in many applications, e.g., document retrieval, web search, and spam filtering. At the heart of these applications are machine learning algorithms such as logistic regression or K-means. These algorithms typically require the text input to be represented as a fixed-length vector. Perhaps the most common fixed-length vector representation for texts is the bag-of-words or bag-of-n-grams (Harris, 1954), due to its simplicity, efficiency and often surprising accuracy.
Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

However, the bag-of-words (BOW) has many disadvantages. The word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used. Even though bag-of-n-grams considers the word order in short contexts, it suffers from data sparsity and high dimensionality. Bag-of-words and bag-of-n-grams have very little sense of the semantics of the words or, more formally, of the distances between the words. This means that the words “powerful,” “strong” and “Paris” are equally distant despite the fact that semantically, “powerful” should be closer to “strong” than to “Paris.”
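The order-loss weakness can be seen directly: under a bag-of-words encoding, any two sentences with the same word counts map to the same vector. A minimal sketch (the vocabulary and sentences are illustrative, not from the paper):

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Map a token list to a fixed-length count vector over vocab."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["dog", "bites", "man"]
a = bag_of_words("man bites dog".split(), vocab)
b = bag_of_words("dog bites man".split(), vocab)
assert a == b  # word order is lost: both sentences get the same vector
```

The two sentences have opposite meanings, yet any classifier over this representation cannot distinguish them.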
In this paper, we propose Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of texts. The texts can be of variable length, ranging from sentences to documents. The name Paragraph Vector emphasizes the fact that the method can be applied to variable-length pieces of texts, anything from a phrase or sentence to a large document.

In our model, the vector representation is trained to be useful for predicting words in a paragraph. More precisely, we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by stochastic gradient descent and backpropagation (Rumelhart et al., 1986). While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence.
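The training step described above can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: the dimensions are arbitrary, and it uses a plain softmax classifier where the paper's experiments use a hierarchical softmax for efficiency.

```python
import numpy as np

rng = np.random.default_rng(0)

V, P, d, k = 50, 3, 8, 2          # vocab size, #paragraphs, embedding dim, context words
D = rng.normal(0, 0.1, (P, d))    # one vector per paragraph (unique among paragraphs)
W = rng.normal(0, 0.1, (V, d))    # word vectors (shared across all paragraphs)
U = rng.normal(0, 0.1, (V, (k + 1) * d))  # softmax weights over the concatenated input

def train_step(pid, context, target, lr=0.1):
    """One SGD step: concatenate paragraph + context word vectors, predict the next word."""
    x = np.concatenate([D[pid]] + [W[c] for c in context])  # ((k+1)*d,) input
    logits = U @ x
    p = np.exp(logits - logits.max()); p /= p.sum()         # softmax
    err = p.copy(); err[target] -= 1.0                      # dL/dlogits for cross-entropy
    grad_x = U.T @ err                                      # gradient w.r.t. the input (old U)
    U[:] -= lr * np.outer(err, x)                           # in-place update of module-level U
    D[pid] -= lr * grad_x[:d]                               # update the paragraph vector
    for j, c in enumerate(context):                         # and each context word vector
        W[c] -= lr * grad_x[(j + 1) * d:(j + 2) * d]
    return -np.log(p[target])

losses = [train_step(0, context=[4, 7], target=9) for _ in range(5)]
assert losses[-1] < losses[0]  # repeated steps on an example reduce its loss
```

Inference on an unseen paragraph follows the same loop with `W` and `U` frozen: only a freshly initialized row of `D` is updated until convergence.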
Our technique is inspired by the recent work in learning vector representations of words using neural networks (Bengio et al., 2006; Collobert & Weston, 2008; Mnih & Hinton, 2008; Turian et al., 2010; Mikolov et al., 2013a;c). In their formulation, each word is represented by a vector which is concatenated or averaged with other word vectors in a context, and the resulting vector is used to predict other words in the context. For example, the neural network language model proposed in (Bengio et al., 2006) uses the concatenation of several previous word vectors to form the input of a neural network, and tries to predict the next word. The outcome is that after the model is trained, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations.
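The architecture the citation describes can be sketched schematically: look up a vector for each of the k previous words, concatenate them, and score every candidate next word through a hidden layer. All dimensions and weight names here are illustrative, not taken from (Bengio et al., 2006).

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, k, h = 30, 6, 3, 12              # vocab size, embedding dim, context length, hidden units
C = rng.normal(0, 0.1, (V, d))         # word-vector lookup table
H = rng.normal(0, 0.1, (h, k * d))     # hidden-layer weights
Uo = rng.normal(0, 0.1, (V, h))        # output-layer weights

def next_word_probs(context):
    """Concatenate the k previous word vectors and score every candidate next word."""
    x = np.concatenate([C[w] for w in context])  # (k*d,) input
    hidden = np.tanh(H @ x)                      # hidden representation
    logits = Uo @ hidden
    p = np.exp(logits - logits.max())            # numerically stable softmax
    return p / p.sum()

p = next_word_probs([2, 5, 11])
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-9
```

Training such a model by gradient descent on the next-word cross-entropy is what shapes the rows of `C` into semantically meaningful word vectors; Paragraph Vector extends this input with one extra learned vector per paragraph.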