Cloud Computing - Improvement and Application of Term Weight Calculation Methods in Text Classification.pdf


Uploaded 2022-07-06 15:38:24, 619 KB, PDF


Abstract

The rapid development of networks and information technology gives people access to ever more knowledge. However, when a specific piece of knowledge is needed, it is difficult to locate it quickly in the vast world of information. Search engines have solved this problem to some degree, but they match only a few keywords and return an enormous number of results, which makes it hard for people to find the specific information they need. Automatic document classification, an efficient way to address this, has therefore become a valuable technology. In recent years, many statistical theories and machine learning methods have been applied to automatic document classification, which has become an active research topic.

One of the main difficulties in automatic document classification is the high dimensionality of the feature space and the sparseness of the text representation vectors. To reduce the dimensionality of the feature space and improve the efficiency and precision of classification, the first problem to solve is finding an effective algorithm for calculating term weights. In the course of research on Chinese text classification, this paper focuses on improving the term weight calculation algorithm and completes the following work:

① This paper studies the traditional term weight calculation algorithm and finds that it has three limitations: 1) it does not take into account the distribution of feature terms among categories; 2) it does not take into account the intra-category distribution of feature terms; 3) it does not take into account the partial classification of feature terms. From the perspectives of the frequency degree, integration degree, and distribution degree of terms, this paper proposes the term weight calculation algorithm TF-IDF-DI-WFDB.

② This paper introduces a measure that describes the inter-category and intra-category distribution of feature terms by means of their inter-category and intra-category distribution degrees, which yields the improved term weight algorithm TF-IDF-DI. Because the traditional term weight calculation algorithm also does not take into account the partial classification of feature terms, this paper introduces a word frequency difference based (WFDB) factor to make up for the …
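The first limitation listed in ① (ignoring how a feature term is distributed among categories) can be illustrated with a small sketch. The code below uses the classic textbook form of TF-IDF, relative term frequency times log(N / df); it is not the TF-IDF-DI or TF-IDF-DI-WFDB formula of this paper, whose definitions are not shown in the abstract. The toy corpus, the category names, and the helper names `idf`, `tf_idf`, and `category_df` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus of (category, tokenized document) pairs.
# Invented purely for illustration; not data from the paper.
corpus = [
    ("sports",  ["match", "team", "report"]),
    ("sports",  ["match", "goal", "team"]),
    ("finance", ["stock", "market", "report"]),
    ("finance", ["bank", "market", "stock"]),
]

def idf(term, docs):
    """Classic inverse document frequency: log(N / df)."""
    df = sum(1 for _, tokens in docs if term in tokens)
    return math.log(len(docs) / df)

def tf_idf(term, tokens, docs):
    """Classic TF-IDF: relative term frequency in one document times IDF."""
    tf = tokens.count(term) / len(tokens)
    return tf * idf(term, docs)

def category_df(term, docs):
    """Number of documents containing the term, counted per category."""
    counts = Counter()
    for category, tokens in docs:
        if term in tokens:
            counts[category] += 1
    return dict(counts)

first_doc = corpus[0][1]
for term in ("match", "report"):
    # Both terms receive the same TF-IDF weight in the first document...
    print(term, tf_idf(term, first_doc, corpus))
    # ...although their distributions over categories differ.
    print(term, category_df(term, corpus))
```

In this toy corpus "match" occurs only in sports documents while "report" is spread evenly over both categories, yet classic TF-IDF assigns them the same weight because it only sees the total document frequency. A weighting scheme that also looks at the per-category counts could separate the two terms.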

Abstract (Chinese)

Abstract (English)

1 Introduction
1.1 Research background and practical significance
1.2 Overview of research abroad
1.3 Overview of research in China
1.4 Main research work of this paper
1.5 Organization of this paper

2 Related techniques for text classification
2.1 Text information retrieval models
2.1.1 Boolean Model
2.1.2 Probabilistic Model
2.1.3 Vector Space Model (VSM)
2.2 Common Chinese word segmentation methods
2.2.1 Introduction
2.2.2 Difficulties in Chinese word segmentation
2.2.3 Mechanical word segmentation
2.2.4 N-gram word segmentation
2.2.5 Word segmentation method used in this paper [10]
2.3 Common feature selection methods
2.3.1 Document Frequency (DF)
2.3.2 Information Gain (IG)
2.3.3 Mutual Information (MI)
2.3.4 χ² statistic (CHI)
2.3.5 Weight of Evidence for Text
2.4 Common classification methods
2.4.1 Class-centroid classification
2.4.2 Naive Bayes
2.4.3 Support vector machines
2.4.4 k-Nearest Neighbor
2.5 Evaluation metrics for text classification results
2.6 Chapter summary

3 Improvement of the term weight calculation method
3.1 Shortcomings of traditional term weight calculation
3.1.1 Term Frequency (TF)
3.1.2 Inverse Document Frequency (IDF)
3.1.3 Shortcomings of TFIDF
3.2 Improved term weight calculation method
3.2.1 Inter-category dispersion of feature terms
3.2.2 Intra-category dispersion of feature terms
3.2.3 Word frequency difference under incomplete classification of feature terms
3.3 Summary

4 Application of genetic algorithms to text classification
4.1 Biological basis of genetic algorithms
4.1.1 Heredity and variation
4.1.2 Evolution
4.1.3 A systems view of heredity and evolution
4.2 Introduction to genetic algorithms
4.2.1 Overview of genetic algorithms
4.2.2 Operation process of a genetic algorithm
4.3 Basic implementation techniques of genetic algorithms and their application in this paper
4.3.1 Encoding method
4.3.2 Fitness function
4.3.3 Selection operator
4.3.4 Crossover operator
4.3.5 Mutation operator
4.3.6 Parameters used in this paper
4.4 Chapter summary

5 Experiments and analysis
5.1 Introduction to the experiments
5.2 Experimental results and analysis
5.2.1 Confusion matrix
5.2.2 Overall recall, precision, and F value
5.2.3 Recall and precision for each category
5.2.4 Graphical display of classification results for each category
5.3 Summary

6 Conclusion
6.1 Summary
6.2 Future work
