Python Decision Tree Code Resource - CSDN文库

6 files in total: 4 py, 2 data

Rating: 5 stars (above 95% of resources) · Points required: 43 · Views: 142 · Uploaded: 2017-01-16 18:19:27 · Comments: 25 · 8KB ZIP

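Decision trees are a widely used data mining and machine learning method that uses a tree-shaped model for classification and regression. This resource contains two datasets ("西瓜3.data" and "西瓜2.data") and the Python scripts "dtree.py", "dtreeplot.py", and "treeplot.py", which implement and visualize a decision tree model.

1. **Basic concept of a decision tree**: A decision tree is an intuitive model that represents the relationship between input features and outputs as a tree. Each internal node splits the samples on one feature, and each leaf node holds the final decision or prediction.
2. **ID3, C4.5, and CART**: ID3 chooses splits by information gain, which is based on information entropy. C4.5 improves on ID3, notably in how it handles continuous features and missing values. CART builds binary trees and can be used for both classification and regression. In Python, decision trees are commonly built with the `sklearn` library, whose tree module implements an optimized version of CART; it does not implement ID3 or C4.5 directly, although `criterion="entropy"` selects an entropy-based split criterion.
3. **`dtree.py`**: Judging from the code preview, this file implements the tree itself rather than calling `sklearn`. It reads the data, selects the split attribute (and, for continuous attributes, a threshold) with an entropy criterion, builds the tree recursively, and passes the result to `dtreeplot` for drawing.
4. **`dtreeplot.py`**: This file probably contains the code that draws the tree; `dtree.py` imports a `dtreeplot` function from it. For models trained with `sklearn`, the comparable tool is `export_graphviz`, which together with the `graphviz` library renders a tree so that its decision process is easier to follow.
5. **`treeplot.py`**: Like `dtreeplot.py`, this file is probably also used for visualization, possibly as the author's own plotting functions that provide a different presentation.
6. **Datasets ("西瓜3.data" and "西瓜2.data")**: These datasets most likely describe watermelons through attributes such as weight, color, and texture, and are used to train and test the model. A dataset of this kind normally holds the features plus a class label, for example whether the melon is good or bad.
7. **`__init__.py`**: A Python package initialization file (empty here). It marks the directory as a Python package so that the modules in it can be imported.
8. **Steps of building a decision tree**:
   - **Data preprocessing**: clean the data, handle missing values, and convert non-numeric features.
   - **Feature selection**: choose the best split feature according to the algorithm's criterion.
   - **Tree construction**: split nodes on the chosen features until a stopping condition is met (such as maximum depth or minimum number of samples).
   - **Pruning**: prevent overfitting with pre-pruning or post-pruning.
   - **Model evaluation**: assess the model with cross-validation and metrics such as accuracy, precision, and recall.
9. **Applications**: Decision trees are used in many fields, such as medical diagnosis, market analysis, and credit assessment. In this example the goal is probably to judge watermelon quality, predicting from the attributes whether a melon is good.

Taken together, the project appears to cover the full workflow from data loading and model training to visualization and interpretation of the result, and is meant to help users understand and apply decision trees. Running the code is a hands-on way to see how a decision tree works.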
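The description above notes that ID3 splits on information entropy. As a minimal sketch of that idea (not the code shipped in this resource), the following computes the information gain of a discrete attribute using only the standard library. The function names `entropy` and `information_gain` and the four-row toy table are assumptions made up for illustration; the toy rows are not taken from 西瓜2.data or 西瓜3.data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of `labels` minus the size-weighted entropy after splitting on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data, invented for this example only.
rows = [
    {"color": "green", "texture": "clear"},
    {"color": "green", "texture": "blurry"},
    {"color": "black", "texture": "clear"},
    {"color": "black", "texture": "blurry"},
]
labels = ["good", "bad", "good", "bad"]

# ID3 picks the attribute with the largest information gain.
gains = {attr: information_gain(rows, labels, attr) for attr in ("color", "texture")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy table `texture` separates the two classes perfectly while `color` does not separate them at all, so an ID3-style learner would split on `texture` first.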


dtree.zip (6 files)

dtreeplot.py 7KB

dtree.py 11KB

__init__.py 0B

treeplot.py 4KB

西瓜2.data 605B

西瓜3.data 908B

    # -*- coding: utf-8 -*-
    """
    Created on Mon Jan 16 12:01:08 2017

    @author: icefire
    """

    from dtreeplot import dtreeplot
    import math


    # Attribute descriptor (note: the name shadows Python's built-in `property`)
    class property:
        def __init__(self, idnum, attribute):
            self.is_continuity = False  # True for a continuous attribute
            self.attribute = attribute  # attribute name
            self.subattributes = []     # possible values of the attribute
            self.id = idnum             # column position of the attribute in the input file
            self.index = {}             # maps each attribute value to its index


    # Decision tree builder
    class dtree():
        def __init__(self, filename, haveID, property_set):
            '''
            Constructor
            filename: input file name
            haveID: whether the input rows carry a sequence number
            property_set: if empty, use all attributes; otherwise use only the
                          attributes whose indices are listed
            '''
            self.data = []
            self.data_property = []
            # Read the data
            self.__dataread(filename, haveID)
            # Determine the set of attributes to use
            if len(property_set) > 0:
                tmp_data_property = []
                for i in property_set:
                    tmp_data_property.append(self.data_property[i])
                tmp_data_property.append(self.data_property[-1])
            else:
                tmp_data_property = self.data_property
            # The tree, stored as a flat list of nodes
            self.treelink = []
            # Main recursion
            self.__TreeGenerate(list(range(len(self.data[-1]))), tmp_data_property, 0, [], [])
            # Draw the tree
            dtreeplot(self.treelink, 6, 1, -6)

        def __TreeGenerate(self, data_set, property_set, father, attribute, threshold):
            '''
            Main recursion of the decision tree
            data_set: indices of the current samples
            property_set: current attribute set (the class attribute is last)
            father: index of the parent node
            attribute: attribute value on the edge from the parent to this node
            threshold: the threshold for a continuous attribute, otherwise empty
            '''
            # Add a new node and remember its position
            self.treelink.append([])
            curnode = len(self.treelink) - 1
            # Record the parent of the new node
            self.treelink[curnode].append(father)
            # Stop condition 1: all samples belong to the same class
            current_data_class = self.__count(data_set, property_set[-1])
            if len(current_data_class) == 1:
                self.treelink[curnode].append(self.data[-1][data_set[0]])
                self.treelink[curnode].append(attribute)
                self.treelink[curnode].append(threshold)
                return
            # Stop condition 2: all samples agree on every attribute;
            # label the node with the majority class
            if all(len(self.__count(data_set, property_set[i])) == 1
                   for i in range(0, len(property_set) - 1)):
                max_count = -1
                for dataclass in property_set[-1].subattributes:
                    if current_data_class[dataclass] > max_count:
                        max_attribute = dataclass
                        max_count = current_data_class[dataclass]
                self.treelink[curnode].append(max_attribute)
                self.treelink[curnode].append(attribute)
                self.treelink[curnode].append(threshold)
                return
            # Choose the best attribute and threshold by information gain
            prop, threshold = self.__entropy_paraselect(data_set, property_set)
            # Record the chosen attribute and the edge value from the parent
            self.treelink[curnode].append(prop.attribute)
            self.treelink[curnode].append(attribute)
            # Remove the chosen attribute from the attribute set
            property_set.remove(prop)
            if prop.is_continuity:
                # A continuous attribute yields two branches: <= threshold and > threshold
                tmp_data_set = [[], []]
                for i in data_set:
                    tmp_data_set[int(self.data[prop.id][i] > threshold)].append(i)
                for i in [0, 1]:
                    self.__TreeGenerate(tmp_data_set[i], property_set[:], curnode,
                                        prop.subattributes[i], threshold)
            else:
                # A discrete attribute yields one branch per value
                tmp_data_set = [[] for _ in range(0, len(prop.subattributes))]
                for i in data_set:
                    tmp_data_set[prop.index[self.data[prop.id][i]]].append(i)
                for i in range(0, len(prop.subattributes)):
                    if len(tmp_data_set[i]) > 0:
                        self.__TreeGenerate(tmp_data_set[i], property_set[:], curnode,
                                            prop.subattributes[i], [])
                    else:
                        # No sample has this value: create a leaf labeled with
                        # the majority class of the parent node
                        self.treelink.append([])
                        max_count = -1
                        tnode = len(self.treelink) - 1
                        for dataclass in property_set[-1].subattributes:
                            if current_data_class[dataclass] > max_count:
                                max_attribute = dataclass
                                max_count = current_data_class[dataclass]
                        self.treelink[tnode].append(curnode)
                        self.treelink[tnode].append(max_attribute)
                        self.treelink[tnode].append(prop.subattributes[i])
                        self.treelink[tnode].append(threshold)
            # Pad nodes that have fewer than 4 entries with empty lists
            for i in range(len(self.treelink[curnode]), 4):
                self.treelink[curnode].append([])

        def __entropy_paraselect(self, data_set, property_set):
            '''
            Select the best attribute by information gain
            data_set: indices of the current samples
            property_set: current attribute set
            '''
            # Score discrete and continuous attributes separately and keep the best.
            # The score is the negated weighted entropy, so larger is better.
            max_ent = -10000
            for i in range(0, len(property_set) - 1):
                prop_id = property_set[i].id
                if property_set[i].is_continuity:
                    tmax_ent = -10000
                    xlist = self.data[prop_id][:]
                    xlist.sort()
                    # Candidate thresholds: midpoints of adjacent sorted values
                    for j in range(0, len(xlist) - 1):
                        xlist[j] = (xlist[j + 1] + xlist[j]) / 2
                    for j in range(0, len(xlist) - 1):
                        # Skip duplicate candidates (the original tested i > 0 here)
                        if j > 0 and xlist[j] == xlist[j - 1]:
                            continue
                        cur_ent = 0
                        nums = [[0, 0], [0, 0]]
                        for k in data_set:
                            nums[int(self.data[prop_id][k] > xlist[j])][property_set[-1].index[self.data[-1][k]]] += 1
                        for k in [0, 1]:
                            subattribute_sum = nums[k][0] + nums[k][1]
                            if subattribute_sum > 0:
                                p = nums[k][0] / subattribute_sum
                                cur_ent += (p * math.log(p + 0.00001, 2) + (1 - p) * math.log(1 - p + 0.00001, 2)) * subattribute_sum / len(data_set)
                        if cur_ent > tmax_ent:
                            tmax_ent = cur_ent
                            tmp_threshold = xlist[j]
                    if tmax_ent > max_ent:
                        max_ent = tmax_ent
                        bestprop = property_set[i]
                        best_threshold = tmp_threshold
                else:
                    # Discrete attribute: count directly and compute the score
                    cur_ent = 0
                    nums = [[0, 0] for _ in range(0, len(property_set[i].subattributes))]
                    for j in data_set:
                        nums[property_set[i].index[self.data[prop_id][j]]][property_set[-1].index[self.data[-1][j]]] += 1
                    for j in range(0, len(property_set[i].subattributes)):
                        subattribute_sum = nums[j][0] + nums[j][1]
                        if subattribute_sum > 0:
                            p = nums[j][0] / subattribute_sum
                            cur_ent += (p * math.log(p + 0.00001, 2) + (1 - p) * math.log(1 - p + 0.00001, 2)) * subattribute_sum / len(data_set)
                    if cur_ent > max_ent:
                        max_ent = cur_ent
                        bestprop = p  # the source preview is cut off at this point
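The continuous-attribute branch of `__entropy_paraselect` above takes the midpoints of adjacent sorted values as candidate thresholds and keeps the one with the best entropy score. The following is a standalone sketch of that bi-partition idea, not a copy of the author's method: the names `weighted_entropy` and `best_threshold` and the sample values are hypothetical, and the sketch minimizes weighted entropy directly, which is equivalent to maximizing the negated score used in the preview.

```python
import math


def weighted_entropy(groups):
    """Size-weighted Shannon entropy (base 2) over several groups of labels."""
    total = sum(len(g) for g in groups)
    result = 0.0
    for g in groups:
        for cls in set(g):
            p = g.count(cls) / len(g)
            result -= len(g) / total * p * math.log2(p)
    return result


def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split,
    or None if all values are equal and no split is possible."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for t in candidates:
        low = [y for x, y in zip(values, labels) if x <= t]
        high = [y for x, y in zip(values, labels) if x > t]
        score = weighted_entropy([low, high])
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical measurements and labels, invented for this example only.
values = [1.0, 2.0, 4.0, 5.0]
labels = ["bad", "bad", "good", "good"]
threshold, score = best_threshold(values, labels)
print(threshold, score)
```

Deduplicating the values before taking midpoints avoids evaluating the same candidate twice, which is what the duplicate check in the preview is meant to achieve.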


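xuecaoyi

2018-02-06

Good, good, good, good, good.

大威天龙世尊地藏

2017-12-09

This is the best download for us lazy people. After downloading and unzipping it, just open dtree.py and press F5 to get the result: a plot of the finished decision tree.

QINGYULi

2019-03-11

A must-have for lazy people!

肥肥很菜

2018-12-21

It runs as-is.

StuDog

2018-02-27

The code is good and it runs. Thanks.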