Dictionary-Based or Statistical Word Segmentation Implemented in Python (基于Python实现的词典分词方法或统计分词方法.zip)

72 files in total

png: 22

csv: 19

py: 15


Python

Course project

Uploaded 2022-06-26 19:22:32 · 60.3MB ZIP


基于Python实现的词典分词方法或统计分词方法.zip (72 files)

README.md 22KB

code

tongjifenci.py 11KB

test.csv 2.37MB

renmincixing.csv 9.25MB

renmin.csv 6.28MB

renmincixing.txt 10.77MB

chulicixing.py 373B

chuli.py 316B

fenci.py 4KB

renmin.txt 7.35MB

pythonProject

__MACOSX

pythonProject

._tongjifenci.py 276B

._cws.csv 580B

._cixing_tongji.py 276B

._renmin.txt 840B

._renmincixing.txt 366B

._chuli.py 276B

._my_cws_corpus.csv 235B

._renmincixing.csv 233B

._fenci.py 276B

._temp.csv 233B

._chulicixing.py 276B

._CoreNatureDictionary.csv 542B

._.DS_Store 120B

._danju.csv 177B

._renmin.csv 177B

._test.csv 580B

pythonProject

CoreNatureDictionary.csv 1.08MB

tongjifenci.py 11KB

test.csv 2.45MB

temp.csv 10.19MB

renmincixing.csv 9.27MB

renmin.csv 6.29MB

main.py 418B

renmincixing.txt 10.79MB

my_cws_corpus.csv 62B

chulicixing.py 373B

chuli.py 316B

train.lm 14.75MB

cws.csv 70B

.idea

workspace.xml 10KB

pythonProject.iml 318B

misc.xml 186B

modules.xml 278B

inspectionProfiles

profiles_settings.xml 174B

fenci.py 4KB

cixing_tongji.py 4KB

renmin.txt 7.37MB

.DS_Store 6KB

danju.csv 6.29MB

pic

截屏2021-10-20 下午1.57.11.png 1.1MB

截屏2021-10-20 下午2.32.24.png 1.06MB

截屏2021-10-20 下午12.56.50.png 1019KB

截屏2021-10-20 下午2.31.34.png 1.05MB

截屏2021-10-20 下午2.20.49.png 1.04MB

截屏2021-10-20 下午3.19.04.png 1.09MB

截屏2021-10-20 下午3.36.56.png 1.07MB

截屏2021-10-20 下午2.51.18.png 1.79MB

截屏2021-10-20 下午2.10.53.png 1.03MB

截屏2021-10-20 下午2.50.53.png 1.1MB

截屏2021-10-20 下午2.49.00.png 912KB

截屏2021-10-20 下午2.43.09.png 1.21MB

截屏2021-10-20 下午2.26.55.png 893KB

截屏2021-10-20 下午2.43.55.png 1.06MB

截屏2021-10-20 下午1.54.30.png 1010KB

截屏2021-10-20 下午2.52.01.png 1.83MB

截屏2021-10-20 下午2.52.35.png 1.13MB

截屏2021-10-20 下午2.32.55.png 1.22MB

截屏2021-10-20 下午3.37.03.png 1.08MB

截屏2021-10-20 下午2.19.00.png 1.03MB

截屏2021-10-20 下午2.39.38.png 1.08MB

截屏2021-10-20 下午2.32.43.png 1.15MB

基于词典的分词方法或统计分词方法.docx 6.95MB

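# Natural Language Processing Assignment

# Experiment content:

1. Implement a dictionary-based segmentation method and a statistical segmentation method; one method from each category is sufficient.
2. Perform part-of-speech (POS) tagging on the segmentation results; tagging may also be done at the same time as segmentation.
3. Evaluate the segmentation and POS tagging results on four metrics: precision, recall, F1, and efficiency.

# Platform

MacBook Air M1; all experiments are implemented in Python.

# Experiment process

1. For dictionary-based segmentation we implemented four methods: full segmentation, forward longest matching, backward longest matching, and bidirectional longest matching. The code is in Appendix 1. As the reference dictionary I used a ready-made word list compiled by Peking University.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/545e6ba9e4b22cf380c175a4ace1210f.writebug)

![](https://www.writebug.com/myres/static/uploads/2022/5/26/1962892a5eb614b3b502e8179ac8af47.writebug)

![](https://www.writebug.com/myres/static/uploads/2022/5/26/fa90432223a2d7407efc8c05dee14d61.writebug)

We entered a few arbitrary sentences and printed the results; judging from the output, each method segments reasonably well. Next we ran a prediction over the whole pre-segmented People's Daily (人民日报) corpus.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/74821f8932c55b8c4bb3772621bee68d.writebug)

First, forward matching: the screenshot shows the P, R, and F1 values, with a running time of about 26.9 s.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/7e491868536ef6b556c0295a963534c3.writebug)

Next, backward matching: precision rises slightly, and the running time is 27.51 s.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/09a138955c527c11ea5c2001dc360f38.writebug)

Finally, bidirectional matching: precision is almost unchanged, but the run takes about 52 s, nearly twice as long.

| Method | P | R | F1 | Time (s) |
|----|----|----|----|----|
| Forward | 0.8413 | 0.8864 | 0.8633 | 26.91 |
| Backward | 0.8482 | 0.8934 | 0.8702 | 27.51 |
| Bidirectional | 0.8489 | 0.8939 | 0.8708 | 52.73 |

F1 here is the harmonic mean 2PR/(P+R) of the P and R columns. With dictionary-based segmentation, whichever matching method is used, precision stays at roughly 0.84 to 0.85, which is an acceptable result.

2. Corpus-based statistical segmentation. A bigram language model is built from the corpus, each sentence is expanded into a word lattice, and the Viterbi algorithm finds the best path through it; add-one (+1) smoothing is applied to the probabilities between adjacent words. The code is in Appendix 2.

First we preprocess the raw file, converting the space-delimited People's Daily corpus file into a CSV table that is easier to work with for segmentation.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/576f374fb0a7df7e56cf96a6331c109c.writebug)

![](https://www.writebug.com/myres/static/uploads/2022/5/26/9d79cb0c6ed56c1275d19d3d6b7f8052.writebug)

![](https://www.writebug.com/myres/static/uploads/2022/5/26/c38fb5abc4b27f328c58b4a7e39ec605.writebug)

Core code for the bigram model:

![](https://www.writebug.com/myres/static/uploads/2022/5/26/71976c13a4d9c3cc03f1b5d10b767e3c.writebug)

Core code for building the sentence word lattice:

![](https://www.writebug.com/myres/static/uploads/2022/5/26/59c9152c869ad9aca565d8a67f26abee.writebug)

Core code for add-one smoothing: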
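The bigram counting and add-one smoothing described in this step can be sketched as follows. This is a minimal illustration under assumptions, not the author's code (which is shown only as screenshots): the helper names `train_bigram` and `add_one_prob` and the toy corpus are hypothetical, and the vocabulary size in the denominator is assumed to be the number of distinct tokens, including the boundary markers `#始始#` and `#末末#`.

```python
from collections import Counter

# Sentence boundary markers, as named in the write-up.
BOS, EOS = "#始始#", "#末末#"


def train_bigram(sentences):
    """Count unigrams and bigrams over pre-segmented sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = [BOS] + list(words) + [EOS]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def add_one_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(unigrams)  # assumption: V = number of distinct tokens
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Hypothetical toy corpus, for illustration only.
corpus = [["商品", "和", "服务"], ["商品", "和服"]]
unigrams, bigrams = train_bigram(corpus)

print(add_one_prob("商品", "和", unigrams, bigrams))    # seen bigram
print(add_one_prob("商品", "服务", unigrams, bigrams))  # unseen bigram, still non-zero
```

Because every bigram count is incremented by one, an unseen word pair still receives a small non-zero probability, so a path through the word lattice is never ruled out solely because one transition was absent from the training corpus.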
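![](https://www.writebug.com/myres/static/uploads/2022/5/26/bb5f6137ffb5d02795927ed09d1ec5ac.writebug)

Core code for the Viterbi algorithm:

![](https://www.writebug.com/myres/static/uploads/2022/5/26/3ba41fc7639e9c10dc8a7f6b6178cd78.writebug)

Both the test set and the training set are the People's Daily corpus; the final results follow. The P, R, and F1 values show a precision of nearly 99%, and training plus prediction took 307 in total (the source gives no unit; presumably seconds). Because the model is evaluated on the same corpus it was trained on, these figures measure fit to the training data rather than performance on unseen text.

```text
P = 0.9905
R = 0.9843
F1 = 0.9873
```

![](https://www.writebug.com/myres/static/uploads/2022/5/26/88ec41bda6d31fb995cfcce65b56f91e.writebug)

3. POS tagging of the segmentation results, using a statistical method. The code is the same as in Appendix 2.

We first convert the POS-annotated People's Daily txt file into a CSV file that is easier for me to work with.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/271f0d6782f967f6a367cf794222908e.writebug)

![](https://www.writebug.com/myres/static/uploads/2022/5/26/a672a33c9cf450278cd693d2df322d8e.writebug)

![](https://www.writebug.com/myres/static/uploads/2022/5/26/e7e7ea4f1b6aa401050760d9661c961e.writebug)

Below is the core training code for POS tagging: it counts every tag and its frequency, builds the corresponding tag transition matrix, and then computes the tag probabilities of each word.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/63ee9fe9d1bdbaf9878d1678da7b2d61.writebug)

Core code for inferring the tags:

![](https://www.writebug.com/myres/static/uploads/2022/5/26/79020c0ca01e1b3b0c35c4c30eed0aa0.writebug)

Next comes the evaluation of tagging correctness. Following the method used in the textbook, I compute only Accuracy as the metric, and I evaluate tags only on words that were segmented correctly, so that segmentation errors do not leak into the tagging score.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/d7ba42d102e40bb7cebd3b90d8541445.writebug)

The results show a tagging accuracy of about 95%, with a total running time of 315 s, which is a reasonably good result.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/856fdcecf9833b80454fe1fd801271aa.writebug)

# Problems encountered during the experiment

1. The first difficulty was processing the People's Daily txt file. At first I could not think of a convenient way to turn it into list data; converting it to CSV and then loading it directly into a list simplified things considerably.
2. While building the bigram model, handling the `#始始#` (sentence start) and `#末末#` (sentence end) markers took a lot of effort; entries kept going missing here, and debugging took a long time.
3. In the Viterbi algorithm, how to fill in values missing from the dictionary, and how to fill in transition probabilities that do not exist.
4. 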
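   This one I had originally left unaddressed. My predictions still used the line breaks of the original People's Daily file, treating each line as one sentence, but that is clearly not a sentence in the everyday sense, where a sentence ends with "。", "！", or "？". I re-cleaned the data so that sentences end at these three punctuation marks, and ran the prediction again.

![](https://www.writebug.com/myres/static/uploads/2022/5/26/e22628a9b7e64a49962b8401dfd1a125.writebug)

In the final results the precision is unchanged, which shows that punctuation is separated correctly, while the running time drops somewhat, which shows that shorter sentences do reduce the computation spent on large word lattices.

# Appendix

## Appendix 1: Dictionary-based segmentation

```python
import csv
import time

start_time = time.time()


# Load the dictionary: the first column of test.csv holds the words.
def load_dictionary():
    word_list = set()
    with open("test.csv", "r") as csvFile:
        reader = csv.reader(csvFile)
        for item in reader:
            word_list.add(item[0])
    return word_list


# Full segmentation of Chinese text:
# every substring found in the dictionary counts as a word.
def fully_segment(text, dic):
    word_list = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in dic:
                word_list.append(word)
    return word_list


# Forward longest matching:
# among all possible end positions for the character at the current
# scan position, take the longest dictionary word.
def forward_segment(text, dic):
    word_list = []
    i = 0
    while i < len(text):
        longest_word = text[i]
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in dic:
                if len(word) > len(longest_word):
                    longest_word = word
        word_list.append(longest_word)
        i += len(longest_word)
    return word_list


# Backward longest matching
def back_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:
        longest_word = text[i]
        for j in range(0, i):
            word = text[j:i + 1]
            if word in dic:
                if len(word) > len(longest_word):
                    longest_word = word
        # The source listing breaks off here, mid-statement:
        # word_list.app
```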
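The README lists bidirectional longest matching among the four dictionary-based methods. As a rough, self-contained sketch of how such a combination is commonly built (not the author's implementation): run forward and backward longest matching, prefer the result with fewer words, then the one with fewer single-character words, and fall back to the backward result on a tie. The tie-breaking rule, the function names, and the toy dictionary below are all assumptions made for illustration.

```python
def forward_segment(text, dic):
    """Forward longest matching: scan left to right, take the longest word."""
    words, i = [], 0
    while i < len(text):
        longest = text[i]  # fall back to the single character
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in dic and len(word) > len(longest):
                longest = word
        words.append(longest)
        i += len(longest)
    return words


def backward_segment(text, dic):
    """Backward longest matching: scan right to left, take the longest word."""
    words, i = [], len(text) - 1
    while i >= 0:
        longest = text[i]  # fall back to the single character
        for j in range(0, i):
            word = text[j:i + 1]
            if word in dic and len(word) > len(longest):
                longest = word
        words.insert(0, longest)  # keep the words in reading order
        i -= len(longest)
    return words


def count_single_char(words):
    """Number of one-character words in a segmentation."""
    return sum(1 for w in words if len(w) == 1)


def bidirectional_segment(text, dic):
    """Choose between the forward and backward results with simple heuristics."""
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) != len(b):
        return f if len(f) < len(b) else b  # fewer words wins
    # Same word count: fewer single characters wins, ties go to backward.
    return f if count_single_char(f) < count_single_char(b) else b


# Hypothetical toy dictionary, for illustration only.
dic = {"研究", "研究生", "生命", "起源"}
print(forward_segment("研究生命起源", dic))        # ['研究生', '命', '起源']
print(bidirectional_segment("研究生命起源", dic))  # ['研究', '生命', '起源']
```

Since both directions must be run before a result can be chosen, the roughly doubled running time reported for the bidirectional method in the results table is what one would expect from this design.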
