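Computer science graduation project: Machine Learning-based Segmentation, Tagging and Corpus Building for Ancient Chinese (includes complete code + thesis + PPT), guaranteed to run reliably, with bonus thesis-defense PPT resources - CSDN文库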

120 files in total

txt: 42

py: 23

png: 21

Tags: graduation project, machine learning

Uploaded 2024-03-16 16:20:23, 434.35MB ZIP

Resource package contents (120 sub-files)

短版-基于LSTM的多标签自动分词及标注模型.doc 1.96MB

原版-基于LSTM的多标签自动分词及标注模型 .doc 1.94MB

断代中文.doc 807KB

定版6页版_基于LSTM的多标签自动分词及标注模型.doc 646KB

于学金_最终版本评审表.doc 35KB

盲审.docx 2.37MB

毕设.docx 2.37MB

查重.docx 2.25MB

A-盲审-基于机器学习的古代汉语切分标注算法及语料库研究.docx 1.9MB

毕设参考文献.docx 1.87MB

李美薇意见.docx 1.72MB

老师意见版.docx 1.72MB

论文参考.docx 1.37MB

开题报告-于学金.docx 459KB

断代前言.docx 20KB

.gitignore 40B

LICENSE 1KB

log.log 2.09MB

onehot-e2.log 313KB

onehot-e3.log 313KB

onehot-e4.log 312KB

twohotlog-e2.log 191KB

twohotlog-e3.log 190KB

twohotlog-e4.log 190KB

sanguozhi.log 25KB

log.log 15KB

README.md 125KB

基于机器学习的古代汉语切分标注算法及语料库研究.pdf 2.7MB

A Machine Learning Model on Ancient Chinese Text Dating.pdf 2.53MB

毕设.pdf 2.21MB

data.pkl 33.25MB

参数选择.png 139KB

分词onetwohot.png 128KB

Figure_1.png 114KB

断句不同阈值训精确率.png 106KB

hiden32-cost.png 98KB

分词onetwohot-cost.png 94KB

hiden32-163264.png 89KB

hiden64-cost.png 87KB

Hiden64-163264.png 85KB

断句不同阈值=代价.png 70KB

断代5book_heatmap.png 53KB

分词LSTM-dic.png 44KB

不同阈值训练下测试集争取率.png 41KB

热力图.png 40KB

标注lstm-hmm.png 39KB

断代5book.png 33KB

字典分词效果.png 33KB

前三标签.png 30KB

0-7下微调阈值.png 29KB

左传年代.png 25KB

句长分布.png 20KB

TESTWithDic.py 20KB

TESTwithoutDic.py 19KB

vitebi.py 18KB

outWithDic.py 18KB

output.py 17KB

databaseset.py 13KB

readparameter.py 10KB

importData.py 5KB

TESTonlyDic.py 5KB

makediction.py 4KB

plotdifparameter.py 4KB

onlyDic.py 3KB

getHTML.py 2KB

plot5book.py 1KB

plot.py 894B

readHTML.py 693B

cleanTXT.py 615B

cal_dic_rate.py 564B

try.py 449B

plot0_7youhua.py 417B

p.py 378B

a.txt 50.78MB

raw_train400-800.txt 28.13MB

jindaiAndShanggu_train_all.txt 24.71MB

raw_train800-1160.txt 24.05MB

jindai_train_0-800.txt 17.51MB

cleaned_train400-800.txt 13.08MB

raw_shanggu_train1600-1971.txt 12.98MB

raw_train0-400.txt 12.08MB

cleaned_train800-1160.txt 11.03MB

raw_shanggu_train800-1200.txt 10.97MB

raw_shanggu_train400-800.txt 10.51MB

a.txt 10.46MB

shanggu_train_0-1200.txt 7.2MB

cleaned_test400-800.txt 6.07MB

raw_shanggu_train1200-1600.txt 6.06MB

cleaned_test800-1160.txt 5.01MB

cleaned_train0-400.txt 4.43MB

cleaned_shanggu_train1600-1971.txt 4.03MB

cleaned_shanggu_train800-1200.txt 3.51MB

cleaned_shanggu_train400-800.txt 3.35MB

cleaned_test0-400.txt 2.73MB

cleaned_shanggu_test1600-1971.txt 2.39MB

cleaned_shanggu_test800-1200.txt 2.06MB

cleaned_shanggu_test400-800.txt 2.01MB

cleaned_shanggu_train1200-1600.txt 1.93MB

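# Machine Learning-based Segmentation, Tagging and Corpus Building for Ancient Chinese

# 1. Acknowledgements

My two and a half years as a graduate student are coming to an end. While working on this project I had moments of doubt and self-denial, but I also gained a sense of achievement that is entirely my own. I am glad to have contributed a little to natural language processing for ancient Chinese, and grateful for the help and care I received from so many people.

First, I thank my advisor, 皇甫伟. From the thesis proposal to the final draft, you gave me countless guidance and encouragement. Thank you for giving up your precious rest time to revise my thesis again and again, for encouraging and comforting me when I felt lost about the project, and for pushing me on whenever I slacked off. I am fortunate and proud to be your student.

I thank my three roommates, 熊健, 徐海祥 and 王梓楠. Fate brought us together in one dormitory, and I was lucky to have your company during my graduate studies. I thank my classmate 乌尼日其其格 and my senior 秦运慧; I will always remember your help. Thank you, 秦运慧, for helping me rework the line of argument of the thesis, and thank you, 乌尼日, for helping me adjust the details of the thesis and many other things I had overlooked. Thank you both for everything you did for me.

I thank my other labmates, 刘娅汐, 王欢, 雷铠僖, 安玮 and 沈一佳, as well as the seniors who have already graduated, 王浩彬, 黄鹤林 and 李佳轩. Your company filled our lab with color and joy. You not only helped and guided me in my studies but also looked after me in daily life, and I hope we will have the chance to meet again after graduation.

Finally, special thanks to my family, my father and mother. Thank you for your admonitions when I was restless and for your encouragement when I was at a low point. With your support I am now about to graduate and enter society. I hope to live up to your expectations and make you proud.

# 2. Abstract

In recent years the wave of deep learning has reached every corner of research and daily life. This thesis studies the application of deep learning to natural language processing, in particular natural language processing for ancient Chinese. It aims to use computers to help researchers of ancient texts with specialized and tedious tasks: dating, sentence breaking, word segmentation and part-of-speech tagging. Sentence breaking and word segmentation are tasks specific to Chinese natural language processing that English does not require, and sentence breaking in particular is specific to ancient Chinese. Handling these tasks by computer improves the efficiency of language workers and avoids errors caused by subjective human judgment. It frees them from the heavy basic tasks of ancient Chinese so that they can devote more energy to subsequent research on matters such as textual transmission and the interpretation of meaning.

This thesis uses long short-term memory (LSTM) neural networks as the backbone and designs a different input and output structure for each ancient Chinese natural language processing task to build the specific models. The training set consists of ancient Chinese corpora that are publicly downloadable from the Internet, and we manually annotated part of the Old Chinese texts among them. The models designed in this thesis can perform dating, sentence breaking, word segmentation and part-of-speech tagging on ancient Chinese text.

The main work and contributions of this thesis are as follows.

An ancient text dating model is built with an LSTM network as the backbone. In the dating model, each character of the text is converted into a high-dimensional vector, and all the vectors of the text are then fed into the model, which analyzes the nonlinear relationships among them. Finally, the model outputs an era category label for the passage. Experimental results show that a model built on a Bi-LSTM (Bi-directional Long Short-Term Memory) network completes the dating task well, with a dating accuracy of 80% or higher. The dating model provides an efficient and accurate method for dating ancient texts, which saves researchers time in the dating process.

A sentence-breaking model is proposed to address the lack of punctuation in the original editions of some ancient Chinese books. In this part, a deep neural network learns from a large amount of ancient Chinese text that has already been punctuated, so that the model automatically learns the sentence-breaking rules of a given period and genre. In later digitization of ancient Chinese documents, the sentence-breaking work can then be handed to the computer, reducing part of the workload of ancient Chinese researchers.

An integrated model for automatic word segmentation and part-of-speech tagging is proposed. Because no public ancient Chinese corpus with word segmentation and part-of-speech annotation is currently available, we obtained a small dataset by manually annotating part of the corpus and stored it in a database as the training set. Experiments show that the proposed model completes the ancient Chinese segmentation and tagging task reasonably well. The database can also be expanded further by combining model output with manual correction.

With the Bi-LSTM network as the main structure, this thesis builds a series of models for different tasks on ancient Chinese text. Experiments show that the proposed models already perform well on the limited ancient Chinese corpora available today and can be applied to the construction of larger corpora, serving as an auxiliary tool that helps ancient Chinese researchers annotate text. The newly produced corpora can in turn be used to train the models and improve their accuracy, so that corpus and model improve each other, which promotes the digitization of ancient Chinese and the construction of large ancient Chinese corpora.

Keywords: ancient Chinese, natural language processing, dating, sentence breaking, word segmentation, part-of-speech tagging

Machine Learning-based Segmentation, Tagging and Corpus Building for Ancient Chinese

# 3. Abstract (English)

In recent years, deep learning has penetrated every aspect of research and daily life.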
This paper mainly studies the application of deep learning in natural language processing, especially natural language processing for ancient Chinese. It aims to use computers to help researchers of ancient Chinese complete specialized and cumbersome tasks such as dating, sentence breaking, word segmentation and part-of-speech tagging. Sentence breaking and word segmentation are tasks unique to Chinese natural language processing, and sentence breaking in particular is unique to ancient Chinese. Using computers for these tasks improves the efficiency of language workers and avoids errors caused by subjective human factors, freeing them from the heavy basic tasks of ancient Chinese so that they can put more energy into other aspects of research. In this paper, we use long short-term memory (LSTM) neural networks as the backbone and design different input and output structures to build specific models for the different ancient Chinese natural language processing tasks. The training set is an ancient Chinese corpus that is publicly downloadable from the Internet, and we manually annotated some of its texts. The models designed in this paper can complete tasks such as dating ancient Chinese text, sentence breaking, word segmentation and part-of-speech tagging. The main work and innovations of this paper are as follows. A Bi-LSTM is used as the backbone of the ancient text dating model. In the dating model, each character in the text is converted into a high-dimensional vector, and all the vectors of the text are then fed to the model, which analyzes the nonlinear relationships among them. Finally, the model outputs an era category label for the passage.
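The pipeline described for the dating model (character, then vector, then Bi-LSTM, then era label) can be illustrated with a minimal forward-pass sketch. It is not the thesis code: the weights are random and untrained, the `ERAS` labels, the class names `LSTMCell` and `BiLSTMDater`, and the dimensions are all hypothetical, and pure Python is used only to keep the example dependency-free.

```python
import math
import random

ERAS = ["era_A", "era_B", "era_C"]  # hypothetical era labels, not from the thesis


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LSTMCell:
    """One LSTM direction with random, untrained weights (illustration only)."""

    def __init__(self, rng, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate (input, forget, output, candidate),
        # applied to the concatenation [input; previous hidden state].
        self.w = {gate: rand_matrix(rng, hidden_dim, input_dim + hidden_dim)
                  for gate in "ifog"}
        self.b = {gate: [0.0] * hidden_dim for gate in "ifog"}

    def run(self, vectors):
        """Consume a sequence of vectors and return the final hidden state."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in vectors:
            z = x + h  # list concatenation: [input; previous hidden state]
            pre = {gate: [a + b for a, b in zip(matvec(self.w[gate], z), self.b[gate])]
                   for gate in "ifog"}
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            o = [sigmoid(v) for v in pre["o"]]
            cand = [math.tanh(v) for v in pre["g"]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


class BiLSTMDater:
    """Character embedding -> forward and backward LSTM -> softmax over eras."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=6, seed=0):
        rng = random.Random(seed)
        self.index = {ch: k for k, ch in enumerate(vocab)}
        # The extra last row is shared by all out-of-vocabulary characters.
        self.embed = rand_matrix(rng, len(vocab) + 1, embed_dim)
        self.fwd = LSTMCell(rng, embed_dim, hidden_dim)
        self.bwd = LSTMCell(rng, embed_dim, hidden_dim)
        self.out = rand_matrix(rng, len(ERAS), 2 * hidden_dim)

    def predict_proba(self, text):
        unk = len(self.embed) - 1
        vectors = [self.embed[self.index.get(ch, unk)] for ch in text]
        # Read the text left-to-right and right-to-left, then join both summaries.
        features = self.fwd.run(vectors) + self.bwd.run(vectors[::-1])
        logits = matvec(self.out, features)
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def predict(self, text):
        probs = self.predict_proba(text)
        return ERAS[probs.index(max(probs))]


if __name__ == "__main__":
    model = BiLSTMDater(vocab="學而時習之不亦說乎")
    print(model.predict("學而時習之"), model.predict_proba("學而時習之"))
```

Because the weights are untrained, the predicted label is meaningless; the sketch only shows how one label per passage falls out of the character sequence. A real model would learn the embedding, gate and output weights from dated text.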
Experiments show that the model built on the Bi-LSTM performs the dating task well, and the prediction accuracy can reach 80%. The model in this part provides an efficient and accurate method for dating ancient Chinese texts, which will save ancient Chinese researchers time in the dating process. In view of the lack of punctuation in the original editions of some ancient Chinese books, this paper proposes a sentence-breaking model. In this part, we use a deep neural network to learn from a large number of ancient Chinese texts that have already been punctuated, so that the sentence-breaking model automatically learns the rules of senten
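Learning sentence breaks from already punctuated text, as described above, presupposes turning that text into per-character training labels. The sketch below shows one way this could be done. The binary "a break follows this character" scheme, the `PUNCTUATION` set, and the helper names `to_labels` and `restore` are assumptions for illustration, not the labeling scheme used in the thesis.

```python
# Hypothetical set of marks treated as sentence or clause breaks.
PUNCTUATION = set("，。！？；：、")


def to_labels(punctuated):
    """Strip punctuation and return (character, label) pairs.

    Label 1 means a break followed this character in the source text,
    label 0 means the next character continued the same clause.
    """
    pairs = []
    for ch in punctuated:
        if ch in PUNCTUATION:
            if pairs:  # ignore punctuation before the first character
                pairs[-1] = (pairs[-1][0], 1)
        else:
            pairs.append((ch, 0))
    return pairs


def restore(chars, labels, mark="。"):
    """Re-insert a break mark after every character labeled 1."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label:
            out.append(mark)
    return "".join(out)


if __name__ == "__main__":
    pairs = to_labels("學而時習之，不亦說乎？")
    chars = [ch for ch, _ in pairs]
    labels = [label for _, label in pairs]
    print(labels)
    print(restore(chars, labels, mark="/"))
```

Under this scheme a sequence model only has to predict one label per character of unpunctuated input, and `restore` turns those predictions back into readable text. The scheme discards which mark was used, so a model trained on it could place breaks but not choose between a comma and a full stop.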
