[Free] Data and code for random forest regression and classification (random forest regression prediction model resource)

4 files in total: 2 xlsx, 2 py

Updated 2023-08-11. 22 KB ZIP.

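Random forest is a powerful machine learning algorithm that is widely used for regression and classification. This package of data and code is a chance to practice building random forest models, so the notes below look at what a random forest is and how it applies to both tasks.

A random forest is an ensemble learning model made up of many decision trees. Each tree makes its own prediction for a sample, and the final prediction is the average of the trees' outputs (regression) or their majority vote (classification). Two properties define the "forest": randomness and diversity. Randomness comes from how each tree is built: rows are drawn from the original data set with replacement (bootstrap sampling), and only a random subset of the features is considered when splitting a node. Diversity follows from this: different samples and different feature subsets make the trees differ from one another, which lowers the risk of overfitting.

In a regression task, the goal is to predict the value of a continuous variable. For example, "regression.xlsx" may contain a data set in which the columns are features and the last column is the continuous target. "randomForest.regression.py" is probably the Python code that implements random forest regression, possibly with `sklearn.ensemble.RandomForestRegressor`. It likely covers data preprocessing, model training, parameter tuning, and evaluation of predictive performance.

A classification task predicts a discrete class. For example, "DataRFL.xlsx" may contain the features used for classification, and "regressionForest_tree.py" may have been adapted to a classification task, possibly with `sklearn.ensemble.RandomForestClassifier`. A classification forest works much like a regression forest, except that the leaves of each tree store class probabilities and the final prediction is the class with the highest probability.

In practice, random forests have several advantages: training parallelizes well, feature importance is built in, and nonlinear relationships are handled without extra work. Tolerance of missing data is often listed too, although it depends on the implementation. The drawbacks are that the model can become overly complex and that a forest as a whole is much harder to interpret than a single decision tree. Model complexity and generalization can be controlled by tuning the number of trees, the maximum depth, the fraction of features sampled, and similar parameters.

In Python, the `RandomForestRegressor` and `RandomForestClassifier` classes in `sklearn` are used to build the model. They provide a complete interface for training, prediction, and feature importance: `fit()` trains the model, `predict()` generates predictions, and the `feature_importances_` attribute returns the importance score of each feature.

This archive is a good hands-on exercise in random forest regression and classification. By analyzing the data and writing and running the code, you can get a better understanding of how random forests work and how to tune and apply them to real problems, while also practicing data preprocessing, model selection, and performance evaluation.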
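The bootstrap sampling and majority vote described above can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions: the data is synthetic (`make_classification` stands in for the spreadsheets, whose contents are not shown), and the names `trees`, `votes`, and `forest_pred` are made up for this example. scikit-learn's own forest averages class probabilities rather than counting hard votes, so this is not a reimplementation of it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: stands in for the real spreadsheets).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt": each split considers a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[rows], y_train[rows])
    trees.append(tree)

# Majority vote: every tree predicts 0 or 1; the forest takes the more common label.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean(trees[0].predict(X_test) == y_test)
forest_acc = np.mean(forest_pred == y_test)
print(f"single tree accuracy:  {single_acc:.3f}")
print(f"voted forest accuracy: {forest_acc:.3f}")
```

Comparing the two printed accuracies gives a feel for how much the vote helps on a given data set; the gap varies with the data and the random seed.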


shumo代码文件.zip (4 files)

regression.xlsx 11KB

randomForest.regression.py 4KB

DataRFL.xlsx 10KB

regressionForest_tree.py 6KB
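The description mentions `RandomForestRegressor`, but the contents of randomForest.regression.py are not visible on this page. The following is a minimal sketch of what a random forest regression workflow could look like, assuming synthetic data from `make_regression` in place of the spreadsheet; it is not the code from the archive.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: the real spreadsheet is not shown).
X, y = make_regression(n_samples=500, n_features=6, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2_score(y_test, y_pred):.3f}")

# The forest's prediction is the mean of the individual trees' predictions.
tree_mean = np.mean([t.predict(X_test) for t in reg.estimators_], axis=0)
print("forest == mean of trees:", np.allclose(tree_mean, y_pred))

# Importance scores, one per feature, summing to 1.
print(reg.feature_importances_)
```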


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier  # ensemble learning
from sklearn.model_selection import train_test_split, GridSearchCV

df = pd.read_excel('regression.xlsx')  # broadband customer data
# print(df.head())  # preview the data
df.info()

# Lower-case all column names first
df.rename(str.lower, axis='columns', inplace=True)

# Check the distribution of the target variable broadband for imbalance
print('Broadband: ', Counter(df['broadband']))
# Broadband: Counter({0: 908, 1: 206}), which is fairly imbalanced.
# The original notes regard random forests as well suited to imbalanced data.

y = df['broadband']
X = df.iloc[:, 0:4]
future_labels = X.columns[0:]
print(future_labels)
# print(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=12345)

# Feature importance from a baseline forest
forest = RandomForestClassifier(n_estimators=200, random_state=1)  # n_estimators: number of trees
forest.fit(X_train, y_train)
importances = forest.feature_importances_
print(len(importances))
# np.argsort returns the indices that sort the array in ascending order;
# [::-1] reverses them so the most important feature comes first.
indices = np.argsort(importances)[::-1]
for i in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (i + 1, 30, future_labels[indices[i]], importances[indices[i]]))

# As with a single decision tree, the forest is tuned with a cross-validated grid search.
# The grid contains the usual decision tree parameters (split criterion, tree depth,
# minimum samples per split) plus the forest-specific ones.
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 6, 7, 8],  # depth of each tree in the forest
    'n_estimators': [11, 13, 15],  # number of trees (forest-specific)
    'max_features': [0.3, 0.4, 0.5],  # fraction of features considered at each split (forest-specific)
    'min_samples_split': [2, 3, 4, 5, 6, 8]  # minimum number of samples required to split a node
}

# Main RandomForestClassifier parameters:
#   n_estimators: number of trees (default 100 since scikit-learn 0.22, 10 before);
#       candidate values: 120, 200, 300, 500, 800, 1200
#   criterion: measure of split quality, default "gini"
#   max_depth: int or None (default None), maximum tree depth; candidate values: 5, 8, 15, 25, 30
#   max_features: number of features considered at each split.
#       "sqrt": sqrt(n_features); "log2": log2(n_features); None: n_features.
#       ("auto" meant the same as "sqrt" and was removed in scikit-learn 1.3.)
#   bootstrap: bool (default True), whether to sample with replacement when building trees
#   min_samples_split: minimum number of samples required to split a node
#   min_samples_leaf: minimum number of samples at a leaf
# Hyperparameters worth tuning by grid search: n_estimators, max_depth,
# min_samples_split, min_samples_leaf
rfc = RandomForestClassifier()

# Pass in the model, the parameter grid, the scoring metric and the number of CV folds.
# This only defines the search; nothing is trained until fit() is called.
rfc_cv = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc', cv=4)
rfc_cv.fit(X_train, y_train)

# Predict on the test set with the tuned forest
test_est = rfc_cv.predict(X_test)
test_proba = rfc_cv.predict_proba(X_test)[:, 1]  # predicted probability of class 1
print('Random forest classification report...')
print(metrics.classification_report(y_test, test_est))  # true labels come first
print('Random forest AUC...')
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_proba)  # build the ROC curve from scores
print('AUC = %.4f' % metrics.auc(fpr_test, tpr_test))

best = rfc_cv.best_params_
print(best)  # best parameters found by the search

# Rebuild the grid with a wider range for max_features
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 6, 7, 8],
    'n_estimators': [11, 13, 15],
    'max_features': [0.1, 0.2, 0.3, 0.4, 0.5],
    'min_samples_split': [2, 3, 4, 5, 6, 8]
}
rfc_cv = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc', cv=4)
rfc_cv.fit(X_train, y_train)

# Predict on the test set and evaluate again
test_est = rfc_cv.predict(X_test)
test_proba = rfc_cv.predict_proba(X_test)[:, 1]
print('Random forest classification report...')
print(metrics.classification_report(y_test, test_est))
print('Random forest AUC...')
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_proba)
print('AUC = %.4f' % metrics.auc(fpr_test, tpr_test))

# Feature importance. GridSearchCV fits clones, so rfc itself is never fitted;
# the refitted best model is available as best_estimator_.
best_rf = rfc_cv.best_estimator_
importances = best_rf.feature_importances_
print(len(importances))

# Rank the features by importance, largest first
indices = np.argsort(importances)[::-1]
print("Feature importance ranking:")
for i in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (i + 1, 30, future_labels[indices[i]], importances[indices[i]]))

# Bar chart of feature importance
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), future_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
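Two details of the script are worth checking in isolation: a ROC curve is normally built from scores such as `predict_proba` rather than from hard 0/1 predictions, and after a grid search the fitted model is `best_estimator_`, not the estimator object that was passed in. The sketch below illustrates both on synthetic data. The roughly 80/20 class split only loosely mimics the imbalance noted in the script, the small grid is chosen to keep the run short, and `class_weight="balanced"` is an optional setting that the original script does not use.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (assumption: stands in for the broadband data).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=12345)

param_grid = {"max_depth": [5, 8], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=1),
    param_grid=param_grid, scoring="roc_auc", cv=4)
search.fit(X_train, y_train)

best = search.best_estimator_              # refitted on the full training set
proba = best.predict_proba(X_test)[:, 1]   # P(class == 1) for each test row
labels = best.predict(X_test)              # hard 0/1 predictions

print("best params:", search.best_params_)
print(f"AUC from probabilities: {roc_auc_score(y_test, proba):.4f}")
print(f"AUC from hard labels:   {roc_auc_score(y_test, labels):.4f}")
```

The two AUC values generally differ because hard labels collapse the ranking information to a single threshold; the probability-based figure is the one that matches the `roc_auc` metric used during the search.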