http://www.paper.edu.cn
Text Keyword Extraction Based on Word2vec Word Vectors#
LI Qing1, ZHU Wenhao1, LU Zhiguo2**
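Funding: National Natural Science Foundation of China (61303097); Doctoral Program Foundation of the Ministry of Education of China (20123108120026)
About the author: LI Qing (born 1991), female, student; main research interests: information extraction, digital libraries
Corresponding author: LU Zhiguo, male, associate research fellow; main research interest: digital libraries. E-mail: luzg@staff.shu.edu.cn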
(1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444;
2. Shanghai University Library, Shanghai 200444)
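Abstract: The continuing development of information technology has caused information in many
fields to grow explosively, and obtaining the required information quickly and accurately from
large-scale text has become a major challenge. Keyword extraction is an effective means of
addressing this problem and is one of the core techniques in text mining research. At present,
the vast majority of text does not come with keywords, and existing keyword extraction
algorithms still handle two cases poorly: phrase keywords, and terms that do not appear in a
document but could nevertheless serve as its keywords. This paper therefore proposes a keyword
extraction method based on word vectors. Word vectors are trained with the word2vec algorithm,
and through the word-vector representation the conceptual space of the text is transformed into
a computable space. The method converts all words that appear in the training texts, together
with the keyword set, into a set of word vectors through word2vec training. The words of a test
text are then represented as word vectors, the Euclidean distance between the word vectors of
the test text and the keyword vectors is computed, and the TOP N keywords with the smallest
distance are taken as the automatically extracted keywords of the text. The experiments use a
collection of papers as training text, and the results show that the method improves the
precision of phrase keyword extraction and can find keywords that are not contained in the text.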
Keywords: natural language processing; information extraction; keyword extraction; word vectors
CLC number: TP301.6
Using Word2vec to Extract Abstract Keywords
LI Qing1, ZHU Wenhao1, LU Zhiguo2
(1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444;
2. Shanghai University Libraries, Shanghai 200444)
Abstract: The continuous development of information technology has caused information in many
domains to grow explosively, and obtaining the required information from large-scale text quickly
and accurately has become a great challenge. Keyword extraction is an effective way to address
this problem and is one of the core technologies in text mining research. Currently, the majority
of text does not come with keywords, and existing keyword extraction algorithms still perform
poorly on phrase keywords and on keywords that do not appear in the article itself. To solve this
problem, this paper proposes a keyword extraction method based on word vectors. By training word
vectors with the word2vec algorithm, the conceptual space of the text is turned into a computable
space. The method uses word2vec training to convert all the words and keywords that appear in the
training texts into a set of vectors. The words of a test text are then represented by their word
vectors, the Euclidean distance between these vectors and the keyword vectors is calculated, and
the TOP N keywords with the smallest distance are taken as the automatically extracted keywords.
The experiment used papers from the computer science field as training text, and the results show
that the method improves the accuracy of phrase keyword extraction and can find keywords that
are not contained in the text.
Key words: Natural Language Processing; Information Extraction; Keyword Extraction; Word Vectors
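
The distance-based selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `top_n_keywords`, the toy two-dimensional vectors, and the choice to score each keyword by its smallest distance to any word of the text are all assumptions made for illustration, since the abstract does not specify how per-word distances are aggregated or how phrase keywords obtain their vectors. In practice the vectors would come from a trained word2vec model rather than a hand-written dictionary.

```python
import math


def top_n_keywords(text_words, word_vectors, keyword_vectors, n=3):
    """Return the n candidate keywords closest (Euclidean) to the text's words.

    Hypothetical helper: the min-over-words scoring is an assumption,
    not something stated in the paper's abstract.
    """
    # Keep only the words of the text that have a trained vector.
    vectors = [word_vectors[w] for w in text_words if w in word_vectors]
    if not vectors:
        return []

    scored = []
    for keyword, kvec in keyword_vectors.items():
        # Score a keyword by its smallest distance to any word in the text.
        score = min(math.dist(v, kvec) for v in vectors)
        scored.append((score, keyword))

    # Smaller distance means a better candidate.
    scored.sort()
    return [keyword for _, keyword in scored[:n]]


# Toy 2-D vectors standing in for trained word2vec embeddings (made up).
word_vectors = {
    "neural": (1.0, 0.0),
    "network": (0.9, 0.1),
    "library": (0.0, 1.0),
}
# Phrase keywords are treated as single tokens with their own vectors (assumption).
keyword_vectors = {
    "deep learning": (0.95, 0.05),
    "digital library": (0.1, 0.9),
    "information extraction": (0.5, 0.5),
}

result = top_n_keywords(
    ["neural", "network", "training"], word_vectors, keyword_vectors, n=2
)
print(result)
```

In this toy example the top-ranked phrase does not occur in the input text, which mirrors the abstract's claim that the method can propose keywords that are not contained in the text.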