Demonstrating the Yandex CatBoost gradient boosting classifier on the fictional IBM HR dataset obtained from Kaggle (.zip resource) - CSDN Library

8 files in total, including 3 png, 1 py, and 1 ipynb

Uploaded 2023-03-31 23:12:13, 725KB ZIP, 181 views

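Demonstrating the Yandex CatBoost gradient boosting classifier on the fictional IBM HR dataset obtained from Kaggle. Performs data exploration, cleaning, preprocessing, and model tuning on the dataset.zip (8 files)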

YandexCatBoost-Python-Demo-master
    IBM-HR-Employee-Attrition.csv 223KB
    LICENSE 1KB
    YandexCatBoost-Demo.ipynb 576KB
    YandexCatBoost-Demo.py 11KB
    images
        output_69_0.png 28KB
        output_30_0.png 62KB
        output_24_0.png 221KB
    README.md 22KB

## Exploration of Yandex CatBoost in Python ##

This demo provides a brief introduction to:
- performing data exploration and preprocessing
- feature subset selection: low variance filter
- feature subset selection: high correlation filter
- CatBoost model tuning
- importance of data preprocessing: data normalization
- exploration of CatBoost's feature importance ranking

## Getting started

Open `YandexCatBoost-Demo.ipynb` in a Jupyter notebook environment or in Google Colab. The notebook contains further technical details.

## Future Improvements ##

Results from the feature importance ranking show that the attribute 'MaritalStatus' has minimal impact on class label prediction and could potentially be a noise attribute. Removing it might increase the model's accuracy.

## Codes Walkthrough

Installing the open source Yandex CatBoost package

```shell
pip install catboost
```

Importing the required packages: Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn and CatBoost

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# plt.style.use('ggplot')
import seaborn as sns
from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
```

Loading the [IBM HR Dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/data) into a pandas dataframe

```python
ibm_hr_df = pd.read_csv("IBM-HR-Employee-Attrition.csv")
```

### Part 1a: Data Exploration - Summary Statistics ###

Getting the summary statistics of the IBM HR dataset

```python
ibm_hr_df.describe()
```

<div>
<table class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>Age</th> <th>DailyRate</th> <th>DistanceFromHome</th> <th>Education</th> <th>EmployeeCount</th> <th>EmployeeNumber</th> <th>EnvironmentSatisfaction</th> <th>HourlyRate</th> <th>JobInvolvement</th> <th>JobLevel</th> <th>...</th> <th>RelationshipSatisfaction</th> <th>StandardHours</th> <th>StockOptionLevel</th> <th>TotalWorkingYears</th> <th>TrainingTimesLastYear</th> <th>WorkLifeBalance</th> <th>YearsAtCompany</th> <th>YearsInCurrentRole</th> <th>YearsSinceLastPromotion</th> <th>YearsWithCurrManager</th> </tr>
</thead>
<tbody>
<tr> <th>count</th> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.0</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>...</td> <td>1470.000000</td> <td>1470.0</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> <td>1470.000000</td> </tr>
<tr> <th>mean</th> <td>36.923810</td> <td>802.485714</td> <td>9.192517</td> <td>2.912925</td> <td>1.0</td> <td>1024.865306</td> <td>2.721769</td> <td>65.891156</td> <td>2.729932</td> <td>2.063946</td> <td>...</td> <td>2.712245</td> <td>80.0</td> <td>0.793878</td> <td>11.279592</td> <td>2.799320</td> <td>2.761224</td> <td>7.008163</td> <td>4.229252</td> <td>2.187755</td> <td>4.123129</td> </tr>
<tr> <th>std</th> <td>9.135373</td> <td>403.509100</td> <td>8.106864</td> <td>1.024165</td> <td>0.0</td> <td>602.024335</td> <td>1.093082</td> <td>20.329428</td> <td>0.711561</td> <td>1.106940</td> <td>...</td> <td>1.081209</td> <td>0.0</td> <td>0.852077</td> <td>7.780782</td> <td>1.289271</td> <td>0.706476</td> <td>6.126525</td> <td>3.623137</td> <td>3.222430</td> <td>3.568136</td> </tr>
<tr> <th>min</th> <td>18.000000</td> <td>102.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.0</td> <td>1.000000</td> <td>1.000000</td> <td>30.000000</td> <td>1.000000</td> <td>1.000000</td> <td>...</td> <td>1.000000</td> <td>80.0</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>1.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> </tr>
<tr> <th>25%</th> <td>30.000000</td> <td>465.000000</td> <td>2.000000</td> <td>2.000000</td> <td>1.0</td> <td>491.250000</td> <td>2.000000</td> <td>48.000000</td> <td>2.000000</td> <td>1.000000</td> <td>...</td> <td>2.000000</td> <td>80.0</td> <td>0.000000</td> <td>6.000000</td> <td>2.000000</td> <td>2.000000</td> <td>3.000000</td> <td>2.000000</td> <td>0.000000</td> <td>2.000000</td> </tr>
<tr> <th>50%</th> <td>36.000000</td> <td>802.000000</td> <td>7.000000</td> <td>3.000000</td> <td>1.0</td> <td>1020.500000</td> <td>3.000000</td> <td>66.000000</td> <td>3.000000</td> <td>2.000000</td> <td>...</td> <td>3.000000</td> <td>80.0</td> <td>1.000000</td> <td>10.000000</td> <td>3.000000</td> <td>3.000000</td> <td>5.000000</td> <td>3.000000</td> <td>1.000000</td> <td>3.000000</td> </tr>
<tr> <th>75%</th> <td>43.000000</td> <td>1157.000000</td> <td>14.000000</td> <td>4.000000</td> <td>1.0</td> <td>1555.750000</td> <td>4.000000</td> <td>83.750000</td> <td>3.000000</td> <td>3.000000</td> <td>...</td> <td>4.000000</td> <td>80.0</td> <td>1.000000</td> <td>15.000000</td> <td>3.000000</td> <td>3.000000</td> <td>9.000000</td> <td>7.000000</td> <td>3.000000</td> <td>7.000000</td> </tr>
<tr> <th>max</th> <td>60.000000</td> <td>1499.000000</td> <td>29.000000</td> <td>5.000000</td> <td>1.0</td> <td>2068.000000</td> <td>4.000000</td> <td>100.000000</td> <td>4.000000</td> <td>5.000000</td> <td>...</td> <td>4.000000</td> <td>80.0</td> <td>3.000000</td> <td>40.000000</td> <td>6.000000</td> <td>4.000000</td> <td>40.000000</td> <td>18.000000</td> <td>15.000000</td> <td>17.000000</td> </tr>
</tbody>
</table>
<p>8 rows × 26 columns</p>
</div>

Zooming in on the summary statistics of the irrelevant attributes __*EmployeeCount*__ and __*StandardHours*__

```python
irrList = ['EmployeeCount', 'StandardHours']
ibm_hr_df[irrList].describe()
```

<div>
<table class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>EmployeeCount</th> <th>StandardHours</th> </tr>
</thead>
<tbody>
<tr> <th>count</th> <td>1470.0</td> <td>1470.0</td> </tr>
<tr> <th>mean</th> <td>1.0</td> <td>80.0</td> </tr>
<tr> <th>std</th> <td>0.0</td> <td>0.0</td> </tr>
<tr> <th>min</th> <td>1.0</td> <td>80.0</td> </tr>
<tr> <th>25%</th> <td>1.0</td> <td>80.0</td> </tr>
<tr> <th>50%</th> <td>1.0</td> <td>80.0</td> </tr>
<tr> <th>75%</th> <td>1.0</td> <td>80.0</td> </tr>
<tr> <th>max</th> <td>1.0</td> <td>80.0</td> </tr>
</tbody>
</table>
</div>

Zooming in on the summary statistics of the irrelevant attribute __*Over18*__

```python
ibm_hr_df["Over18"].value_counts()
```

Y
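
The summary statistics report a standard deviation of 0.0 for `EmployeeCount` and `StandardHours`, which is the kind of column a low variance filter is meant to remove. The block below is a minimal sketch of that idea and is not taken from the notebook: the helper `constant_columns` and the miniature `toy_df` table are hypothetical, and the values in `toy_df` are made up for illustration. It counts distinct values with pandas instead of using scikit-learn's `VarianceThreshold`, so that string columns such as `Over18` can be checked in the same pass.

```python
import pandas as pd

# Hypothetical miniature of the HR table; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Over18": ["Y", "Y", "Y", "Y"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})


def constant_columns(df):
    """Return the names of columns that hold at most one distinct value."""
    # nunique() works on numeric and string columns alike, so it also
    # catches constant categorical columns that have no variance to compute.
    distinct = df.nunique(dropna=False)
    return distinct[distinct <= 1].index.tolist()


to_drop = constant_columns(toy_df)
reduced_df = toy_df.drop(columns=to_drop)

print(to_drop)                   # columns flagged as constant
print(list(reduced_df.columns))  # columns that remain
```

On the real data, the same helper could be called on `ibm_hr_df` before model training. A strict "one distinct value" rule is the most conservative form of the filter; a variance threshold above zero would also drop near-constant numeric columns, at the cost of having to choose that threshold.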
