Adult Dataset Analysis: Adult dataset analysis and clustering resources - CSDN Library

7 files in total

csv: 4

py: 3

Adult dataset

Data mining

Machine learning

python

Points required: 45, 52 views, uploaded 2017-12-15 15:42:56, 2 comments, 313KB RAR

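"In-Depth Analysis: The Adult Dataset and Python Decision Trees in Data Mining"

The Adult dataset, drawn from the 1994 US Census, is a classic benchmark for classification. The task is to predict whether a person's annual income exceeds 50,000 US dollars, which makes it a natural way to study how age, sex, education, occupation, and similar factors relate to income. The training split contains 32,561 records with 14 feature variables, so it is good practice material for beginners in data mining and machine learning.

For data mining in Python, Pandas usually handles preprocessing: cleaning missing values and outliers, converting data types, and so on. Scikit-learn then provides the machine learning algorithms, including decision trees. A decision tree is an intuitive, easy-to-read classifier that models a sequence of decisions as a tree. On the Adult dataset it can be used to find the factors that most strongly determine income level, and the fitted tree can be drawn to show how each feature is used. Building the tree involves choosing the best feature to split on, partitioning the data, and deciding when to stop growing.

A typical workflow looks like this. Import the required libraries (pandas, numpy, matplotlib, and sklearn). Read the dataset and inspect its basic information, distributions, and missing values. Apply feature engineering, for example encoding categorical columns as numbers so the tree can process them. Once preprocessing is done, split the data into training and test sets.

To train the model, Scikit-learn provides the DecisionTreeClassifier class. Parameters such as the maximum depth (max_depth) and the minimum number of samples per leaf (min_samples_leaf) control the complexity of the tree and help avoid overfitting or underfitting. After training, evaluate the model on the test set; common metrics include accuracy, recall, and the F1 score.

To understand the decision logic, the plot_tree function can draw the fitted tree. The feature_importances_ attribute reports how much each feature contributes to the model's predictions, which is useful for feature selection and for interpreting the results.

In practice it is also worth trying other models, such as random forests and support vector machines, and comparing them. Cross-validation is an important way to estimate how well a model generalizes, and it helps detect overfitting.

In short, the Adult dataset is a practical teaching case that covers data preprocessing, feature engineering, decision tree construction, and model evaluation. Python's tooling makes it possible to run the whole analysis efficiently, and repeated practice and tuning build fluency in applying data mining and machine learning to real problems.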
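The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in the archive: the data frame is a small synthetic stand-in for the Adult data, the column names and the rule that generates the `income` label are assumptions made up for the example, and `export_text` is used in place of `plot_tree` so the sketch runs without a display.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the Adult data (assumption: NOT the real file).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "hours_per_week": rng.integers(10, 70, size=n),
    "education": rng.choice(["Primary", "Secondary", "Tertiary"], size=n),
})
# Toy labelling rule, purely illustrative.
high_income = (df["education"] == "Tertiary") & (df["hours_per_week"] >= 40)
df["income"] = np.where(high_income, ">50K", "<=50K")

# Feature engineering: encode the categorical column as integer codes.
X = df[["age", "hours_per_week", "education"]].copy()
X["education"] = X["education"].astype("category").cat.codes
y = (df["income"] == ">50K").astype(int)

# Hold out a test set, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# max_depth and min_samples_leaf limit the complexity of the tree.
model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = accuracy_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
print("accuracy=%.3f recall=%.3f f1=%.3f" % (accuracy, recall, f1))

# Contribution of each feature to the fitted tree.
importances = dict(zip(X.columns, model.feature_importances_))
print(importances)

# Text rendering of the tree (plot_tree would draw the same structure).
print(export_text(model, feature_names=list(X.columns)))

# 5-fold cross-validation as a check on generalization.
cv_scores = cross_val_score(model, X, y, cv=5)
print("mean CV accuracy: %.3f" % cv_scores.mean())
```

On the real data, the same steps would start from `pd.read_csv` on the Adult files, and the parameter values shown here would need tuning rather than being taken as recommendations.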


adult数据挖掘.rar (7 files)

adult数据挖掘

new_data_value_test.csv 429KB

new_data_value.csv 859KB

new_data.csv 1.93MB

treePlotter.py 4KB

new_data_test.csv 987KB

trees.py 5KB

project1.py 9KB

# -*- coding: utf-8 -*-
"""
Created on Wed Dec 13 10:27:08 2017

@author: ChenYing
"""
import csv

import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree

import trees
import treePlotter

# Numeric codes for the discretized attributes.
age = {'0-25': 0, '25-50': 1, '50-75': 2, '75-100': 3}
capital_gain = {'=0': 0, '>0': 1}  # 10
capital_loss = {'=0': 0, '>0': 1}  # 11
hours_per_week = {'=40': 0, '>40': 1, '<40': 2}  # 12
native_country = {'USA': 0, 'not USA': 1}  # 13
workclass = {'Freelance': 1, 'Other': 3, 'Proprietor': 4, 'Private': 2,
             'Government': 0}
education = {'Primary': 2, 'Tertiary': 0, 'Secondary': 1}
marital_status = {'1': 1, '0': 0}
occupation = {'High': 1, 'Med': 2, 'Low': 0}
relationship = {'Other': 0, 'Husband': 1, 'Wife': 2}
race = {'1': 0, '0': 1}
sex = {'Male': 0, 'Female': 1}
income = {'<=50K': 0, '>50K': 1}


def read_rows(filename):
    """Read every row of a CSV file into a list of lists."""
    with open(filename, newline='') as csvfile:
        return list(csv.reader(csvfile))


def age_bin(agenum):
    """Map an age in years to the code of its 25-year bucket."""
    if agenum < 25:
        return age['0-25']
    if agenum < 50:
        return age['25-50']
    if agenum < 75:
        return age['50-75']
    return age['75-100']


def discretize(row):
    """Return the textual buckets for gain, loss, hours, country and income."""
    gain = '>0' if int(row[8]) > 0 else '=0'
    loss = '>0' if int(row[9]) > 0 else '=0'
    hour = int(row[10])
    hours = '=40' if hour == 40 else ('>40' if hour > 40 else '<40')
    country = 'USA' if row[11] == 'United-States' else 'not USA'
    label = '<=50K' if row[12] == '<=50K' else '>50K'
    return gain, loss, hours, country, label


def translate(filename):
    """Discretize the numeric attributes but keep category names as text."""
    new_data = []
    for i, row in enumerate(read_rows(filename)):
        if i == 0:
            new_data.append(row)  # header row is copied unchanged
            continue
        new_data.append([age_bin(int(row[0]))] + row[1:8] + list(discretize(row)))
    return new_data


def translateToValue(filename):
    """Convert the dataset to purely numeric codes."""
    new_data = []
    for i, row in enumerate(read_rows(filename)):
        if i == 0:
            new_data.append(row)  # header row is copied unchanged
            continue
        gain, loss, hours, country, label = discretize(row)
        new_data.append([
            age_bin(int(row[0])),
            workclass[row[1]],
            education[row[2]],
            marital_status[row[3]],
            occupation[row[4]],
            relationship[row[5]],
            race[row[6]],
            sex[row[7]],
            capital_gain[gain],
            capital_loss[loss],
            hours_per_week[hours],
            native_country[country],
            income[label],
        ])
    return new_data


def write_new_data():
    # adult_data_all / adult_test_all are the raw data after some attributes
    # were merged or otherwise modified.
    jobs = [
        (translateToValue, 'adult_data_all.csv', './new_data_value.csv'),
        (translateToValue, 'adult_test_all.csv', './new_data_value_test.csv'),
        (translate, 'adult_data_all.csv', './new_data.csv'),
        (translate, 'adult_test_all.csv', './new_data_test.csv'),
    ]
    for convert, source, target in jobs:
        with open(target, 'w', newline='') as f:
            csv.writer(f).writerows(convert(source))


def readData(filename):
    """Return (all rows, feature columns, label column), skipping the header."""
    rows = read_rows(filename)
    if not rows:
        return [], [], []
    feature_len = len(rows[0]) - 1
    data_all = rows[1:]  # full data rows
    data_feature = [row[:feature_len] for row in data_all]  # feature columns
    data_label = [row[-1] for row in data_all]  # label column
    return data_all, data_feature, data_label


# Use sklearn's decision tree.
def use_sklearn_tree():
    train_data, trainX, trainY = readData('new_data_value.csv')
    test_data, testX, testY = readData('new_data_value_test.csv')
    # The CSV reader yields strings, so convert the features to numbers.
    trainX = np.array(trainX, dtype=float)
    testX = np.array(testX, dtype=float)
    model = tree.DecisionTreeClassifier(max_depth=8, min_samples_split=9)
    model.fit(trainX, trainY)
    predict = model.predict(testX)
    # Count the test rows whose predicted label matches the true label.
    correct = sum(1 for p, t in zip(predict, testY) if p == t)
    total = len(predict)
    print("Using the sklearn decision tree")
    importances = model.feature_importances_
    accuracy = float(correct) / total
    print('accuracy: %.2f%%' % (100 * accuracy))
    return model.tree_


def use_myTree():
    adultLabels = ['age', 'workclass', 'education', 'marital-status',
                   'occupation', 'relationship', 'race', 'sex', 'capital-gain',
                   'capital-loss', 'hours-per-week', 'native-country']
    adultLabels_test = list(adultLabels)
    adult = readData('new_data.csv')[0]
    adult_test = readData('new_data_test.csv')[1]
    adult_test_label = readData('new_data_test.csv')[2]
    adultTree = trees.createTree(adult, adultLabels)  # build the decision tree
    treePlotter.createPlot(adultTree)  # draw the decision tree
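The preview above calls `trees.createTree`, but the source of `trees.py` is not shown. Assuming it chooses the split feature by information gain in the style of ID3 (an assumption, not something the preview confirms), the "choose the best split feature" step could look like the sketch below. The function names `entropy`, `information_gain`, and `best_feature` are hypothetical, and the rows are toy data shaped like `new_data.csv` (feature columns followed by the income label).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, feature_index):
    """Entropy reduction from splitting on one feature (label is the last column)."""
    base = entropy([row[-1] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_feature(rows):
    """Index of the feature with the largest information gain."""
    n_features = len(rows[0]) - 1
    return max(range(n_features), key=lambda i: information_gain(rows, i))


# Toy rows (hypothetical): hours-per-week bucket, native country, income label.
rows = [
    ['=40', 'USA', '<=50K'],
    ['>40', 'USA', '>50K'],
    ['>40', 'not USA', '>50K'],
    ['<40', 'not USA', '<=50K'],
]

for i in range(len(rows[0]) - 1):
    print(i, information_gain(rows, i))
print("best feature index:", best_feature(rows))
```

In this toy data the hours bucket separates the two income classes completely while the country column does not separate them at all, so the first column is selected.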
