TC.rar: Text Classification Resource (CSDN Library)

9 files in total (py: 5, pyc: 4)

Uploaded 2022-09-14 23:53:47 · 8 KB RAR · 173 views

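Text classification is a core task in natural language processing: given a text, assign it to one of a set of predefined categories. The "TC.rar" text classification project may contain a study or tutorial that compares traditional vector space models for this task.

The vector space model (VSM) turns text into numeric vectors so that documents can be compared and classified with ordinary arithmetic. Each document is represented as a vector in which every dimension corresponds to one word of the vocabulary, weighted by its term frequency or its TF-IDF value. TF-IDF is a common measure of word importance. It combines term frequency (TF) with inverse document frequency (IDF), which suppresses words that appear in most documents and emphasizes words that occur often in only a few.

Text classification usually involves the following steps:

1. **Preprocessing**: remove stop words, punctuation, and other non-letter characters; apply stemming or lemmatization; and segment the text into words. These steps reduce noise and keep the meaningful tokens.
2. **Feature selection**: decide which words are informative enough to enter the vector representation. Common criteria include document frequency, mutual information, and information gain, the three measures implemented in featureExtracttion.py. TF-IDF weights can also be used to down-weight uninformative terms.
3. **Vectorization**: convert each document into a vector whose dimension equals the size of the vocabulary (or of the selected feature set).
4. **Model training**: train a classifier on labeled samples with a supervised learning algorithm such as naive Bayes, a support vector machine, or logistic regression.
5. **Evaluation and tuning**: measure performance with cross-validation or a held-out test set, then adjust parameters or switch algorithms based on metrics such as accuracy, recall, and F1 score.

"TC.rar" may contain several experiments that compare vector space representations such as bag-of-words, TF-IDF, and N-gram models, and it may also examine how different machine learning algorithms perform on those representations. It may further touch on how to capture word-order information (for example with n-grams, or with word embeddings such as Word2Vec) and on ways to handle sparsity.

In summary, the project focuses on applying and comparing traditional vector space models for text classification, which is useful for understanding how text is represented and how an effective classifier is built. Working through the analysis and experiments shows how to tune feature selection and model choice to improve classification accuracy and efficiency.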
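The TF-IDF weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the archive: the function name `tfidf_vectors`, the unsmoothed `log(N / df)` form of IDF, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, dense TF-IDF vectors) for a list of token lists.

    Hypothetical helper for illustration; uses raw term counts as TF and
    the unsmoothed idf = log(N / df).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


# Toy corpus (made up): each document is already tokenized.
docs = [
    ["stock", "market", "stock"],
    ["football", "match"],
    ["stock", "match"],
]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

With this form of IDF, a term that occurs in every document gets weight 0, which is the "suppress common words" effect mentioned above. Library implementations often add smoothing, so their numbers will differ.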

TC.rar (9 files)

labels.py 255B

__pycache__

featureExtracttion.cpython-36.pyc 3KB

trainModel.cpython-36.pyc 2KB

labels.cpython-36.pyc 453B

__init__.cpython-36.pyc 114B

trainModel.py 4KB

__init__.py 0B

preprocess.py 2KB

featureExtracttion.py 6KB

import math
import os
import time  # used only by the commented-out timing harness at the bottom

import TC.labels as labels

PREPATH = r"../trainTemp/"


def _tokenSets(label):
    """Yield the set of distinct lines of every existing training file of a class.

    Each line of a preprocessed file is treated as one token.
    """
    for fileIndex in range(10, 1900):
        path = PREPATH + label.value + "/" + str(fileIndex) + ".txt"
        if os.path.exists(path):
            with open(path, encoding="utf-8", errors="ignore") as inputfile:
                yield {line.strip("\n") for line in inputfile}


def documentfrequency():
    """Select words whose document frequency within some class exceeds 107."""
    featureTable = []
    for label in labels.labels:
        wordDocumentFrequency = dict()
        for textSet in _tokenSets(label):
            for it in textSet:
                wordDocumentFrequency[it] = wordDocumentFrequency.get(it, 0) + 1
        for k, v in wordDocumentFrequency.items():
            if v > 107 and k not in featureTable:
                featureTable.append(k)
    return featureTable


def mutualInformation():
    """Select words whose mutual information with some class exceeds 0.0084."""
    featureTable = []
    documentNumInClass = dict()   # number of documents in each class
    wordFrequentInClass = dict()  # per class: number of documents containing each word
    wordDocumentNum = dict()      # number of documents containing each word, all classes
    for label in labels.labels:
        documentNumInClass[label.name] = 0
        wordFrequent = dict()
        for wordSet in _tokenSets(label):
            documentNumInClass[label.name] += 1
            for w in wordSet:
                # Fix: the original set a word's first occurrence to 0, so every
                # count was one too low. The threshold below was not re-tuned.
                wordFrequent[w] = wordFrequent.get(w, 0) + 1
                wordDocumentNum[w] = wordDocumentNum.get(w, 0) + 1
        wordFrequentInClass[label.name] = wordFrequent

    # compute mutual information
    N = sum(documentNumInClass.values())
    for k, v in wordFrequentInClass.items():
        wordMutualInformation = dict()
        N1dot = documentNumInClass[k]  # documents in class k
        N0dot = N - N1dot              # documents outside class k
        for k1, v1 in v.items():
            N11 = v1                   # in class, contains the word
            Ndot1 = wordDocumentNum[k1]
            Ndot0 = N - Ndot1
            N01 = Ndot1 - N11          # outside class, contains the word
            N10 = N1dot - N11          # in class, lacks the word
            N00 = Ndot0 - N10          # outside class, lacks the word
            # Clamp zero counts to 1 so math.log never receives 0 and no
            # denominator is 0 (crude smoothing kept from the original).
            N11, N01, N10, N00 = (max(x, 1) for x in (N11, N01, N10, N00))
            Ndot0, Ndot1 = max(Ndot0, 1), max(Ndot1, 1)
            N1dot, N0dot = max(N1dot, 1), max(N0dot, 1)
            mi = ((N11 / N) * math.log(N * N11 / (N1dot * Ndot1))
                  + (N01 / N) * math.log(N * N01 / (N0dot * Ndot1))
                  + (N10 / N) * math.log(N * N10 / (N1dot * Ndot0))
                  + (N00 / N) * math.log(N * N00 / (Ndot0 * N0dot)))
            wordMutualInformation[k1] = mi
        for kt, vt in wordMutualInformation.items():
            if vt > 0.0084 and kt not in featureTable:
                featureTable.append(kt)
    return featureTable


def informationGain():
    """Select words whose information gain over all classes exceeds 0.0083."""
    featureTable = []
    documentNumInClass = dict()   # number of documents in each class
    wordFrequentInClass = dict()  # per class: number of documents containing each word
    wordDocumentNum = dict()      # number of documents containing each word, all classes
    for label in labels.labels:
        documentNumInClass[label.name] = 0
        tempWordFrequent = dict()
        for wordSet in _tokenSets(label):
            documentNumInClass[label.name] += 1
            for s in wordSet:
                tempWordFrequent[s] = tempWordFrequent.get(s, 0) + 1
                wordDocumentNum[s] = wordDocumentNum.get(s, 0) + 1
        wordFrequentInClass[label.name] = tempWordFrequent

    N = sum(documentNumInClass.values())
    # Entropy of the class distribution before observing any word.
    beginEntropy = 0
    for label in labels.labels:
        P = documentNumInClass[label.name] / N
        if P > 0:  # guard added: an empty class would make math.log(0) fail
            beginEntropy += -P * math.log(P)

    for k, v in wordDocumentNum.items():
        endEntropy_t = 0      # sum of P(c|t) * log P(c|t)
        endEntropy_not_t = 0  # sum of P(c|not t) * log P(c|not t)
        P_t = v / N
        P_not_t = 1 - P_t
        for labelName, wordFrequent in wordFrequentInClass.items():
            c_and_t = wordFrequent.get(k, 0)
            c_and_not_t = documentNumInClass[labelName] - c_and_t
            P_c_In_t = c_and_t / v
            # Guard added: a word present in every document leaves no "not t" documents.
            P_c_In_not_t = c_and_not_t / (N - v) if N != v else 0
            if P_c_In_t != 0:
                endEntropy_t += P_c_In_t * math.log(P_c_In_t)
            if P_c_In_not_t != 0:
                endEntropy_not_t += P_c_In_not_t * math.log(P_c_In_not_t)
        IG = endEntropy_t * P_t + endEntropy_not_t * P_not_t + beginEntropy
        if IG > 0.0083:  # 0.006
            featureTable.append(k)
    return featureTable


# print(len(informationGain()))
# print(len(mutualInformation()))
# print(len(documentfrequency()))
# documentfrequency()
# start = time.time()
# t = mutualInformation()
# print("length:" + str(len(t)))
# end = time.time()
# print(end - start)
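To make the contingency-table arithmetic in mutualInformation() easier to check by hand, here is a minimal sketch that computes the same four-term sum for a single word and a single class. The function name `mutual_information_2x2` and the toy counts are illustrative and do not come from the archive. The index convention follows the code above (first index: document is in the class; second index: document contains the word). One deliberate difference: an empty cell contributes 0 here, instead of having its count clamped to 1.

```python
import math


def mutual_information_2x2(n11, n10, n01, n00):
    """Expected mutual information (in nats) between a word and a class.

    n11: in class, contains the word     n10: in class, lacks the word
    n01: outside class, contains word    n00: outside class, lacks the word
    """
    n = n11 + n10 + n01 + n00
    n1dot, n0dot = n11 + n10, n01 + n00  # in class / outside class
    ndot1, ndot0 = n11 + n01, n10 + n00  # with word / without word

    def cell(count, row_total, col_total):
        # An empty cell contributes 0 (the limit of x * log x as x -> 0).
        if count == 0:
            return 0.0
        return (count / n) * math.log(n * count / (row_total * col_total))

    return (cell(n11, n1dot, ndot1) + cell(n01, n0dot, ndot1)
            + cell(n10, n1dot, ndot0) + cell(n00, n0dot, ndot0))


# Toy counts (hypothetical): word and class are independent.
independent = mutual_information_2x2(25, 25, 25, 25)
# Toy counts (hypothetical): the word occurs in exactly the class's documents.
perfect = mutual_information_2x2(50, 0, 0, 50)
print(independent, perfect)
```

An independent word scores 0 and a perfectly predictive word scores log 2 (about 0.693 nats) for two equally sized groups, which gives a sense of scale for the 0.0084 threshold used in the archive's code.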
