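Nanjing University of Chinese Medicine
Undergraduate Thesis
School of Information Technology, Computer Science and Technology, 2015 cohort
Student ID: 084015125
Student name: Zhao Zhuang (赵壮)
Thesis title: Design and Implementation of a Quantitative Diagnosis System for TCM Zang-Xiang
Syndrome Differentiation
Internship unit: Nanjing University of Chinese Medicine
Advisor: Yang Tao (杨涛)
Duration: 2019-2-20 ~ 2019-6-10
May 7, 2019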
School of Information Technology, Nanjing University of Chinese Medicine, 2019 Undergraduate Thesis
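Abstract (translated from the Chinese original)
Objective: This project studies machine learning theory, including deep learning and ensemble
learning, and applies the corresponding algorithms to quantitative diagnosis for zang-xiang
(visceral manifestation) syndrome differentiation in traditional Chinese medicine (TCM). Building
on implementations of AdaBoost, random forest, convolutional neural network, and spectral
clustering, four widely used and well-performing machine learning algorithms, it builds an
integrated platform that is centered on these algorithms and combines data collection, data
cleaning, quantitative diagnosis, and algorithm evaluation modules.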
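Methods: First, drawing on the university's resources as a TCM institution, 7964 raw medical
records were collected from TCM hospitals. Second, scripts were written to perform basic cleaning
of the records; TCM students were then organized to normalize and standardize data items such as
symptoms, tongue manifestations, pulse manifestations, and syndrome types, and the resulting
data-item dictionary was used for batch replacement to produce the final standardized medical
record samples. Models were then built for AdaBoost, random forest, convolutional neural network,
and spectral clustering; the samples were converted into well-formed feature vectors and fed into
the models, the model parameters were tuned, and the best-performing parameters were evaluated on
the test samples. Finally, the best models were integrated using the Flask framework, a MySQL
database, and Echarts.js, following the MVC pattern and object-oriented development, to implement
the TCM zang-xiang syndrome differentiation quantitative diagnosis system.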
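Results: Basic cleaning yielded 7518 valid medical records, of which 700 have been standardized;
1871 symptom features, 32 tongue features, 16 pulse features, and 50 syndrome-type labels were
extracted. On this cleaned data, the best accuracies of the AdaBoost, random forest, convolutional
neural network, and spectral clustering models were 44.62%, 47.59%, 52.47%, and 39.28%,
respectively.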
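To put these figures in context, the following minimal sketch works through the arithmetic they imply. Only the counts are taken from the text above. The variable names are illustrative, the input dimension assumes that the symptom, tongue, and pulse features are simply concatenated into one vector, and the chance baseline assumes the 50 labels are equally likely; neither assumption is stated in the abstract.

```python
# Counts reported in the abstract.
raw_records = 7964           # raw medical records collected
valid_records = 7518         # records remaining after basic cleaning
standardized_records = 700   # records standardized so far
n_symptom, n_tongue, n_pulse = 1871, 32, 16
n_labels = 50                # syndrome-type labels

# Share of raw records that survived basic cleaning.
retention = valid_records / raw_records

# Share of valid records that have been standardized.
standardized_share = standardized_records / valid_records

# ASSUMPTION: one flat input vector formed by concatenating all feature groups.
input_dim = n_symptom + n_tongue + n_pulse

# ASSUMPTION: uniform random guessing over the labels.
chance_accuracy = 1 / n_labels

print(f"retention after cleaning: {retention:.1%}")
print(f"standardized share:       {standardized_share:.1%}")
print(f"assumed input dimension:  {input_dim}")
print(f"uniform chance accuracy:  {chance_accuracy:.1%}")
```

Under these assumptions the reported accuracies sit well above uniform guessing. Because the sample distribution is described as uneven, a majority-class baseline would be higher than this figure and is the fairer point of comparison.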
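Conclusion: With the medical record data not yet fully standardized, an uneven sample distribution,
and model size limited by hardware, the convolutional neural network performed best on the test
set, and its performance improved markedly as network complexity increased. The ensemble learning
algorithms fluctuated between 40% and 50% as parameters were adjusted, and spectral clustering, as
an unsupervised method, gave only mediocre classification performance. With higher-quality samples,
all models are expected to improve markedly.
Keywords: quantitative diagnosis; AdaBoost; random forest; convolutional neural network; spectral
clustering; integrated platform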
Abstract
Purpose: This project studies machine learning theory, in particular deep learning and ensemble
learning, and applies it to quantitative diagnosis based on zang-xiang (visceral manifestation)
syndrome differentiation in traditional Chinese medicine (TCM). AdaBoost, random forest,
convolutional neural network, and spectral clustering, all widely used and well-performing
machine learning algorithms, are designed and implemented. On top of them, an integrated platform
for quantitative zang-xiang diagnosis is built, with the algorithms at its core and with
functional modules for data acquisition, data cleaning, quantitative diagnosis, and algorithm
evaluation.
Methods: First, 7964 raw medical records were collected from TCM hospitals, drawing on the
university's resources as a TCM institution. Second, scripts were written to perform basic
cleaning of the records, and TCM students were organized to normalize and standardize the
symptom, tongue, pulse, and syndrome-type data items. The resulting data-item dictionary was then
used for batch replacement to obtain the final standardized medical record samples. Models were
built for the AdaBoost, random forest, convolutional neural network, and spectral clustering
algorithms; the samples were converted into well-formed feature vectors and fed into the models,
the model parameters were tuned, and the best-performing parameters were evaluated on the test
samples. Finally, the Flask framework, a MySQL database, and Echarts.js were used to integrate the
best models, following the MVC pattern and object-oriented development, to implement the TCM
zang-xiang syndrome differentiation quantitative diagnosis system.
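The dictionary-based batch replacement and the conversion of a record into a feature vector can be pictured with a small sketch. This is only an illustration under stated assumptions: the dictionary entries, the vocabulary, and the helper names `standardize` and `to_feature_vector` are hypothetical, English placeholder terms stand in for the Chinese terms in the real records, and multi-hot encoding is one plausible vectorization rather than the scheme confirmed by the abstract.

```python
# Hypothetical data-item dictionary: raw wording -> standardized term.
TERM_DICT = {
    "head pain": "headache",
    "pain in the head": "headache",
    "cannot sleep": "insomnia",
}

# Hypothetical, tiny feature vocabulary (the thesis reports 1871 symptom features).
VOCAB = ["headache", "insomnia", "fatigue"]


def standardize(terms, term_dict):
    """Replace each raw term with its standardized form; keep unknown terms unchanged."""
    return [term_dict.get(term, term) for term in terms]


def to_feature_vector(terms, vocab):
    """Multi-hot encoding: 1 if the vocabulary term occurs in the record, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocab]


# One illustrative record, given as a list of raw symptom phrases.
record = ["head pain", "cannot sleep", "pain in the head"]
clean = standardize(record, TERM_DICT)
vector = to_feature_vector(clean, VOCAB)

print(clean)
print(vector)
```

Keeping unknown terms unchanged, rather than dropping them, makes unmapped wording easy to spot and add to the dictionary in a later standardization pass.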
Results: Basic cleaning yielded 7518 valid medical records, of which 700 have been standardized,
and 1871 symptom features, 32 tongue features, 16 pulse features, and 50 syndrome-type labels
were extracted. On this cleaned data, the best accuracies of the four models (AdaBoost, random
forest, convolutional neural network, and spectral clustering) were 44.62%, 47.59%, 52.47%, and
39.28%, respectively.
Conclusion: With the medical record data not yet fully standardized, an uneven sample
distribution, and model size constrained by hardware, the convolutional neural network performed
best on the test set, and its accuracy improved markedly as the complexity of the network
increased. The ensemble learning algorithms fluctuated between 40% and 50% as parameters were
adjusted, and spectral clustering, as an unsupervised method, gave only mediocre classification
performance. With higher-quality samples, the performance of each model is expected to
be significantly improved.
Keywords: quantitative diagnosis; AdaBoost; random forest; convolutional neural network;
spectral clustering; integrated platform
Contents
1. Introduction
1.1 Project Background
1.2 Purpose and Significance
1.3 Task Overview
1.3.1 Design Goals
1.3.2 Algorithm Requirements
1.3.3 Platform Features
2. Overview of Related Technologies
2.1 Related Algorithms
2.1.1 AdaBoost
2.1.2 Random Forest
2.1.3 Convolutional Neural Network
2.1.4 Spectral Clustering
2.2 Development Technologies
2.2.1 Front-End Technologies: HTML+CSS+JavaScript
2.2.2 Back-End Technologies: MySQL+Flask
2.2.3 Algorithm Frameworks: TensorFlow+Scikit-learn
3. Research on Quantitative Diagnosis Methods for TCM Zang-Xiang Syndrome Differentiation
3.1 Data Preparation
3.1.1 Medical Record Cleaning
3.1.2 Medical Record Standardization
3.2 AdaBoost Modeling
3.2.1 Constructing Feature Vectors
3.2.2 Building Decision Trees
3.2.3 Boosting Ensemble Learning
3.2.4 Algorithm Evaluation
3.3 Random Forest Modeling
3.3.1 Constructing Feature Vectors
3.3.2 Building Decision Trees