Cloud Computing - Improvement and Application of Term Weight Calculation Methods in Text Classification.pdf
Uploaded 2022-07-06 15:38:24. PDF, 619 KB, 67 pages.
Abstract
With the rapid development of network and information technology, people can access more and more knowledge. However, when they need a specific piece of knowledge, it is difficult to find it quickly in the vast world of information. Search engines have addressed this problem to some degree, but they match only a few keywords and return a very large number of results, which makes it hard for people to find the specific information they need. Automatic document classification, an efficient way to organize information, has therefore become a valuable technology. In recent years, many statistical theories and machine learning methods have been applied to automatic document classification, and it has become an active research topic.
One of the main difficulties in automatic document classification is the high dimensionality of the feature space and the sparseness of the text representation vectors. To reduce the dimensionality of the feature space and improve the efficiency and precision of classification, the first problem to solve is finding an effective algorithm for calculating word weights. In the course of research on Chinese text classification, this paper focuses on improving the word weight calculation algorithm and completes the following work:
① This paper studies the traditional word weight calculation algorithm and finds three limitations: 1) it does not take into account the distribution of feature terms among categories; 2) it does not take into account the distribution of feature terms within a category; 3) it does not take into account the partial classification of feature terms. Considering the frequency degree, integration degree, and distribution degree of terms, this paper proposes the word weight calculation algorithm TF-IDF-DI-WFDB.
② This paper introduces a measure that describes the inter-category and intra-category distribution of feature terms through their inter-category and intra-category distribution degrees, which yields the improved word weight algorithm TF-IDF-DI. Because the traditional algorithm also ignores the partial classification of feature terms, this paper further introduces word frequency differentia based (WFDB) to make up for this shortcoming, which yields the improved word weight algorithm of this paper: TF-IDF-DI-WFDB.
③ To verify that the improved word weight algorithm TF-IDF-DI-WFDB outperforms the traditional word weight calculation algorithm, this paper first runs a classification experiment with the KNN algorithm. Judged by the overall confusion matrix, the overall recall and precision, and the recall and precision of each class, the classification results obtained with the improved word weight algorithm are better than those obtained with the traditional algorithm.
④ On the basis of the improved word weight algorithm TF-IDF-DI-WFDB, this paper uses a genetic algorithm to train a classifier. The classification results of the genetic algorithm agree with those of KNN classification and are, to some extent, better. This indicates that the improved word weight calculation algorithm of this paper is correct and practical.
Keywords: Text representation, Feature vector, Vector space model, TFIDF, Genetic algorithm
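The first limitation named in the abstract, that traditional TF-IDF ignores how a term is distributed among categories, can be illustrated with a minimal sketch. The toy corpus, the class labels, and the helper names `tf_idf`, `idf`, and `class_spread` below are hypothetical and chosen only for illustration. The code implements plain TF-IDF, with TF as the relative frequency in a document and IDF as log(N / df); it does not reproduce the paper's TF-IDF-DI-WFDB formulas, which are not given in this excerpt.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return (weights, df): one {term: TF-IDF weight} dict per document,
    plus the document-frequency counter. Plain TF-IDF, no class information."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights, df

# Hypothetical toy corpus: (class label, tokenized document).
corpus = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["match", "goal", "report"]),
    ("finance", ["stock", "bank", "report"]),
    ("finance", ["stock", "bank", "market"]),
]
docs = [tokens for _, tokens in corpus]
weights, df = tf_idf(docs)

def idf(term):
    """IDF of a term over the toy corpus."""
    return math.log(len(docs) / df[term])

def class_spread(term):
    """Sorted list of classes in which the term occurs."""
    return sorted({label for label, tokens in corpus if term in tokens})

# "goal" occurs only in sports documents; "report" occurs once in each class.
# Both appear in two documents, so plain IDF cannot tell them apart.
print(idf("goal"), class_spread("goal"))
print(idf("report"), class_spread("report"))
```

In this sketch "goal" is concentrated in one class and "report" is spread evenly over both, yet they receive the same IDF. This is the kind of case that a weighting scheme using inter-category distribution information is meant to separate.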
Contents
Abstract (Chinese)
Abstract (English)
1 Introduction
1.1 Research background and practical significance
1.2 Overview of research abroad
1.3 Overview of research in China
1.4 Main research work of this paper
1.5 Organization of this paper
2 Related techniques for text classification
2.1 Text information retrieval models
2.1.1 Boolean Model
2.1.2 Probabilistic Model
2.1.3 Vector Space Model (VSM)
2.2 Common Chinese word segmentation methods
2.2.1 Introduction
2.2.2 Difficulties in Chinese word segmentation
2.2.3 Mechanical word segmentation methods
2.2.4 N-gram word segmentation method
2.2.5 Word segmentation method used in this paper [10]
2.3 Common feature term extraction methods
2.3.1 Document Frequency (DF)
2.3.2 Information Gain (IG)
2.3.3 Mutual Information (MI)
2.3.4 χ² statistic (CHI)
2.3.5 Weight of Evidence for Text
2.4 Common classification methods
2.4.1 Class-center classification
2.4.2 Naive Bayes
2.4.3 Support vector machine
2.4.4 k-Nearest Neighbor
2.5 Evaluation metrics for text classification results
2.6 Chapter summary
3 Improvement of the word weight calculation method
3.1 Shortcomings of the traditional word weight calculation method
3.1.1 Term Frequency (TF)
3.1.2 Inverse Document Frequency (IDF)
3.1.3 Shortcomings of TFIDF
3.2 Improved word weight calculation method
3.2.1 Inter-category dispersion of feature terms
3.2.2 Intra-category dispersion of feature terms
3.2.3 Word frequency difference of incompletely classified feature terms
3.3 Summary
4 Application of genetic algorithms in text classification
4.1 Biological basis of genetic algorithms
4.1.1 Heredity and variation
4.1.2 Evolution
4.1.3 A systems view of heredity and evolution
4.2 Introduction to genetic algorithms
4.2.1 Overview of genetic algorithms
4.2.2 Operation process of genetic algorithms
4.3 Basic implementation techniques of genetic algorithms and their application in this paper
4.3.1 Encoding method
4.3.2 Fitness function
4.3.3 Selection operator
4.3.4 Crossover operator
4.3.5 Mutation operator
4.3.6 Parameters used in this paper
4.4 Chapter summary
5 Experiments and analysis
5.1 Experiment description
5.2 Experimental results and analysis
5.2.1 Confusion matrix
5.2.2 Overall recall, precision, and F1 value
5.2.3 Recall and precision of each class
5.2.4 Graphical display of classification results for each class
5.3 Summary
6 Conclusion
6.1 Summary
6.2 Future work