决策树、随机森林用泰坦尼克号事件的数据集练习一下决策树和随机森林的API资源-CSDN文库

共3个文件

md：1个

py：1个

csv：1个

需积分: 5 34 浏览量 2024-05-17 23:27:57 上传评论收藏 25KB ZIP 举报

决策树和随机森林是两种广泛应用于机器学习领域的分类与回归方法。它们在处理复杂数据集时表现出色，尤其适用于特征间存在非线性关系的情况。接下来，我们将详细探讨这两种算法及其在泰坦尼克号数据集上的应用。决策树（Decision Tree）是一种基于树状结构进行预测的模型。它通过构建一系列的规则来决定样本的类别或数值。在每个内部节点，决策树会根据一个特征值进行分割，然后将数据流向下一层，直到达到叶子节点，即最终的决策结果。决策树的学习过程包括特征选择、树的构建和剪枝等步骤。在Python的scikit-learn库中，`DecisionTreeClassifier`和`DecisionTreeRegressor`分别用于分类和回归任务。接下来是随机森林（Random Forest），这是一种集成学习方法，通过构建并结合多个决策树来进行预测。每棵树都是独立训练的，且在训练过程中引入了随机性，比如随机选择一部分特征和样本来构建树，以增加多样性，降低过拟合风险。在scikit-learn中，我们可以使用`RandomForestClassifier`和`RandomForestRegressor`类来实现随机森林。泰坦尼克号事件的数据集是一个经典的机器学习示例，其中包含乘客的个人信息，如性别、年龄、票价等，目标是预测乘客是否能幸存。在这个数据集中，我们可以利用决策树和随机森林来预测幸存率。我们需要对数据进行预处理，包括填充缺失值、转换类别变量为数值等。然后，我们可以使用这些特征构建模型，通过训练集调整模型参数，如决策树的深度、随机森林中的树数量等。我们使用测试集评估模型的性能，通常使用准确率、精确率、召回率、F1分数和AUC-ROC曲线等指标。在scikit-learn中，构建决策树模型的步骤如下： 1. 导入必要的库：`from sklearn.tree import DecisionTreeClassifier` 2. 加载并预处理数据 3. 划分数据集：`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)` 4. 创建模型：`clf = DecisionTreeClassifier(max_depth=3)` 5. 训练模型：`clf.fit(X_train, y_train)` 6. 预测：`y_pred = clf.predict(X_test)` 7. 评估模型：`accuracy_score(y_test, y_pred)` 对于随机森林，步骤类似，只需替换模型为`RandomForestClassifier`： 1. 导入必要的库：`from sklearn.ensemble import RandomForestClassifier` 2. 创建模型：`rf = RandomForestClassifier(n_estimators=100)` 3. 其他步骤同上通过对比决策树和随机森林在泰坦尼克号数据集上的表现，我们可以更深入地理解这两种模型的优缺点。例如，随机森林往往比单棵决策树有更高的泛化能力，但计算成本也会相应增加。在实际应用中，我们需要根据问题的具体情况来选择合适的模型。

资源推荐

资源详情

资源评论

收起资源包目录

sklearn-DecisionTree-master.zip （3个子文件）

sklearn-DecisionTree-master

demo_tree.py 2KB

titanic.csv 60KB

README.md 5KB

# sklearn-DecisionTree 决策树、随机森林用泰坦尼克号事件的数据集练习一下决策树和随机森林的API。 # 分析数据集信息先读入数据集，看看有哪些特征： ``` import pandas as pd titanic = pd.read_csv("titanic.csv") # 分析数据集信息 print("*" * 30 + " info " + "*" * 30) print(titanic.info()) print(titanic.head()) ``` 输出： ``` ****************************** info ****************************** <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.6+ KB None PassengerId Survived Pclass ... Fare Cabin Embarked 0 1 0 3 ... 7.2500 NaN S 1 2 1 1 ... 71.2833 C85 C 2 3 1 3 ... 7.9250 NaN S 3 4 1 1 ... 53.1000 C123 S 4 5 0 3 ... 8.0500 NaN S [5 rows x 12 columns] ``` `survived`字段表示是否生存，我们以此作为预测目标。 ``` y = titanic['survived'] print(y.head()) ``` 输出： ``` 0 0 1 1 2 1 3 1 4 0 Name: survived, dtype: int64 ``` 我们取其中三个特征做分析演示，分别是： - pclass：1-一等舱，2-二等舱，3-三等舱 - age年龄 - sex性别 ``` x = titanic[['pclass', 'age', 'sex']] print(x.info()) print(x.head()) ``` 输出： ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 3 columns): Pclass 891 non-null int64 Age 714 non-null float64 Sex 891 non-null object dtypes: float64(1), int64(1), object(1) memory usage: 21.0+ KB None Pclass Age Sex 0 3 22.0 male 1 1 38.0 female 2 3 26.0 female 3 1 35.0 female 4 3 35.0 male ``` # 缺失值处理 age字段存在缺失，用均值填充： ``` age_mean = x['age'].mean() print("*" * 30 + " age_mean " + "*" * 30) print(age_mean) x['age'].fillna(age_mean, inplace=True) print("*" * 30 + " 处理age缺失值后 " + "*" * 30) print(x.info()) ``` 输出： ``` ****************************** age_mean ****************************** 29.69911764705882 ****************************** 处理age缺失值后 ****************************** <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 3 columns): Pclass 891 non-null int64 Age 891 non-null float64 Sex 891 non-null object dtypes: float64(1), int64(1), object(1) memory usage: 21.0+ KB ``` # 特征抽取 - onehot编码为了方便使用字典特征抽取，构造字典列表： ``` x_dict_list = x.to_dict(orient='records') print("*" * 30 + " train_dict " + "*" * 30) print(pd.Series(x_dict_list[:5])) dict_vec = DictVectorizer(sparse=False) x = dict_vec.fit_transform(x_dict_list) print("*" * 30 + " onehot编码 " + "*" * 30) print(dict_vec.get_feature_names()) print(x[:5]) ``` ``` ****************************** train_dict ****************************** 0 {'Pclass': 3, 'Age': 22.0, 'Sex': 'male'} 1 {'Pclass': 1, 'Age': 38.0, 'Sex': 'female'} 2 {'Pclass': 3, 'Age': 26.0, 'Sex': 'female'} 3 {'Pclass': 1, 'Age': 35.0, 'Sex': 'female'} 4 {'Pclass': 3, 'Age': 35.0, 'Sex': 'male'} dtype: object ****************************** onehot编码 ****************************** ['Age', 'Pclass', 'Sex=female', 'Sex=male'] [[22. 3. 0. 1.] [38. 1. 1. 0.] [26. 3. 1. 0.] [35. 1. 1. 0.] [35. 3. 0. 1.]] ``` # 划分训练集和测试集 ``` x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25) ``` # 决策树分类器 ``` dec_tree = DecisionTreeClassifier() dec_tree.fit(x_train, y_train) print("*" * 30 + " 准确率 " + "*" * 30) print(dec_tree.score(x_test, y_test)) ``` 输出： ``` ****************************** 准确率 ****************************** 0.7892376681614349 ``` # 随机森林分类器 - n_jobs: -1表示设置为核心数量 - n_estimators: 决策树数目 - max_depth: 树最大深度同时使用网格搜索最优超参数： ``` rf = RandomForestClassifier(n_jobs=-1) param = { "n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30] } # 2折交叉验证 search = GridSearchCV(rf, param_grid=param, cv=2) print("*" * 30 + " 超参数网格搜索 " + "*" * 30) start_time = time.time() search.fit(x_train, y_train) print("耗时：{}".format(time.time() - start_time)) print("最优参数：{}".format(search.best_params_)) print("*" * 30 + " 准确率 " + "*" * 30) print(search.score(x_test, y_test)) ``` 输出： ``` ****************************** 超参数网格搜索 ****************************** 耗时：66.85670185089111 最优参数：{'max_depth': 5, 'n_estimators': 120} ****************************** 准确率 ****************************** 0.7847533632286996 ``` 最优的参数是`{'max_depth': 5, 'n_estimators': 120}`。在我的2015款MacBookPro上，仅2折的交叉验证就跑了66秒。- -|

评论收藏

内容反馈