## Exploration of Yandex CatBoost in Python ##
This demo provides a brief introduction to:
- performing data exploration and preprocessing
- feature subset selection: low variance filter
- feature subset selection: high correlation filter
- CatBoost model tuning
- importance of data preprocessing: data normalization
- exploration of CatBoost's feature importance ranking
## Getting started
Open `YandexCatBoost-Demo.ipynb` in a Jupyter Notebook environment or in Google Colab. The notebook contains further technical details.
## Future Improvements ##
Results from the feature importance ranking show that the attribute ‘MaritalStatus’ has minimal impact on class label prediction and could potentially be a noise attribute. Removing it might increase the model’s accuracy.
## Code Walkthrough
Installing the open-source Yandex CatBoost package
```python
# Notebook magic; from a terminal, run `pip install catboost` instead
%pip install catboost
```
Importing the required packages: NumPy, pandas, Matplotlib, seaborn, scikit-learn and CatBoost
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# plt.style.use('ggplot')
import seaborn as sns
from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
```
Loading the [IBM HR Dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/data) into a pandas DataFrame
```python
ibm_hr_df = pd.read_csv("IBM-HR-Employee-Attrition.csv")
```
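Before exploring further, it can help to confirm that the file loaded as expected. The following is a minimal sketch, not part of the original notebook: it reads a hypothetical three-row extract from an in-memory string (the column names are taken from the dataset, the values are illustrative) and checks the shape and the number of missing values per column.

```python
import io

import pandas as pd

# Hypothetical three-row extract standing in for the real CSV file.
csv_text = """Age,DailyRate,Over18
41,1102,Y
49,279,Y
37,1373,Y
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns actually loaded.
n_rows, n_cols = sample_df.shape

# Count of missing values in each column.
missing_per_column = sample_df.isna().sum()

print(n_rows, n_cols)
print(missing_per_column)
```

The same two checks can be run on `ibm_hr_df` directly once the real file is loaded.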
### Part 1a: Data Exploration - Summary Statistics ###
Getting the summary statistics of the IBM HR dataset
```python
ibm_hr_df.describe()
```
<div>
<table class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Age</th>
<th>DailyRate</th>
<th>DistanceFromHome</th>
<th>Education</th>
<th>EmployeeCount</th>
<th>EmployeeNumber</th>
<th>EnvironmentSatisfaction</th>
<th>HourlyRate</th>
<th>JobInvolvement</th>
<th>JobLevel</th>
<th>...</th>
<th>RelationshipSatisfaction</th>
<th>StandardHours</th>
<th>StockOptionLevel</th>
<th>TotalWorkingYears</th>
<th>TrainingTimesLastYear</th>
<th>WorkLifeBalance</th>
<th>YearsAtCompany</th>
<th>YearsInCurrentRole</th>
<th>YearsSinceLastPromotion</th>
<th>YearsWithCurrManager</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.0</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>...</td>
<td>1470.000000</td>
<td>1470.0</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
<td>1470.000000</td>
</tr>
<tr>
<th>mean</th>
<td>36.923810</td>
<td>802.485714</td>
<td>9.192517</td>
<td>2.912925</td>
<td>1.0</td>
<td>1024.865306</td>
<td>2.721769</td>
<td>65.891156</td>
<td>2.729932</td>
<td>2.063946</td>
<td>...</td>
<td>2.712245</td>
<td>80.0</td>
<td>0.793878</td>
<td>11.279592</td>
<td>2.799320</td>
<td>2.761224</td>
<td>7.008163</td>
<td>4.229252</td>
<td>2.187755</td>
<td>4.123129</td>
</tr>
<tr>
<th>std</th>
<td>9.135373</td>
<td>403.509100</td>
<td>8.106864</td>
<td>1.024165</td>
<td>0.0</td>
<td>602.024335</td>
<td>1.093082</td>
<td>20.329428</td>
<td>0.711561</td>
<td>1.106940</td>
<td>...</td>
<td>1.081209</td>
<td>0.0</td>
<td>0.852077</td>
<td>7.780782</td>
<td>1.289271</td>
<td>0.706476</td>
<td>6.126525</td>
<td>3.623137</td>
<td>3.222430</td>
<td>3.568136</td>
</tr>
<tr>
<th>min</th>
<td>18.000000</td>
<td>102.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.0</td>
<td>1.000000</td>
<td>1.000000</td>
<td>30.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>...</td>
<td>1.000000</td>
<td>80.0</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>25%</th>
<td>30.000000</td>
<td>465.000000</td>
<td>2.000000</td>
<td>2.000000</td>
<td>1.0</td>
<td>491.250000</td>
<td>2.000000</td>
<td>48.000000</td>
<td>2.000000</td>
<td>1.000000</td>
<td>...</td>
<td>2.000000</td>
<td>80.0</td>
<td>0.000000</td>
<td>6.000000</td>
<td>2.000000</td>
<td>2.000000</td>
<td>3.000000</td>
<td>2.000000</td>
<td>0.000000</td>
<td>2.000000</td>
</tr>
<tr>
<th>50%</th>
<td>36.000000</td>
<td>802.000000</td>
<td>7.000000</td>
<td>3.000000</td>
<td>1.0</td>
<td>1020.500000</td>
<td>3.000000</td>
<td>66.000000</td>
<td>3.000000</td>
<td>2.000000</td>
<td>...</td>
<td>3.000000</td>
<td>80.0</td>
<td>1.000000</td>
<td>10.000000</td>
<td>3.000000</td>
<td>3.000000</td>
<td>5.000000</td>
<td>3.000000</td>
<td>1.000000</td>
<td>3.000000</td>
</tr>
<tr>
<th>75%</th>
<td>43.000000</td>
<td>1157.000000</td>
<td>14.000000</td>
<td>4.000000</td>
<td>1.0</td>
<td>1555.750000</td>
<td>4.000000</td>
<td>83.750000</td>
<td>3.000000</td>
<td>3.000000</td>
<td>...</td>
<td>4.000000</td>
<td>80.0</td>
<td>1.000000</td>
<td>15.000000</td>
<td>3.000000</td>
<td>3.000000</td>
<td>9.000000</td>
<td>7.000000</td>
<td>3.000000</td>
<td>7.000000</td>
</tr>
<tr>
<th>max</th>
<td>60.000000</td>
<td>1499.000000</td>
<td>29.000000</td>
<td>5.000000</td>
<td>1.0</td>
<td>2068.000000</td>
<td>4.000000</td>
<td>100.000000</td>
<td>4.000000</td>
<td>5.000000</td>
<td>...</td>
<td>4.000000</td>
<td>80.0</td>
<td>3.000000</td>
<td>40.000000</td>
<td>6.000000</td>
<td>4.000000</td>
<td>40.000000</td>
<td>18.000000</td>
<td>15.000000</td>
<td>17.000000</td>
</tr>
</tbody>
</table>
<p>8 rows × 26 columns</p>
</div>
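By default, `describe()` summarises only numeric columns, so string-valued attributes do not appear in the table above. A minimal sketch of how the remaining columns could be summarised is shown here, using a small hypothetical frame (`toy_df` and its values are illustrative, not taken from the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ibm_hr_df with one numeric and two text columns.
toy_df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Over18": ["Y", "Y", "Y", "Y"],
    "MaritalStatus": ["Single", "Married", "Single", "Married"],
})

# Default behaviour: numeric columns only.
numeric_summary = toy_df.describe()

# Excluding numeric dtypes summarises the remaining columns instead,
# reporting count, unique, top and freq for each of them.
non_numeric_summary = toy_df.describe(exclude=[np.number])

print(non_numeric_summary)
```

A `unique` value of 1 in this second summary flags a column that holds the same value in every row.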
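Zooming in on the summary statistics of the irrelevant attributes __*EmployeeCount*__ and __*StandardHours*__. Both are constant across all 1470 records (standard deviation 0), so they carry no information for prediction.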
```python
irrList = ['EmployeeCount', 'StandardHours']
ibm_hr_df[irrList].describe()
```
<div>
<table class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>EmployeeCount</th>
<th>StandardHours</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>1470.0</td>
<td>1470.0</td>
</tr>
<tr>
<th>mean</th>
<td>1.0</td>
<td>80.0</td>
</tr>
<tr>
<th>std</th>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>min</th>
<td>1.0</td>
<td>80.0</td>
</tr>
<tr>
<th>25%</th>
<td>1.0</td>
<td>80.0</td>
</tr>
<tr>
<th>50%</th>
<td>1.0</td>
<td>80.0</td>
</tr>
<tr>
<th>75%</th>
<td>1.0</td>
<td>80.0</td>
</tr>
<tr>
<th>max</th>
<td>1.0</td>
<td>80.0</td>
</tr>
</tbody>
</table>
</div>
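Constant columns like these can also be located programmatically rather than by reading the summary table. The sketch below is one possible approach and not the notebook's own method: `constant_columns` is a hypothetical helper, and `toy_df` is an illustrative stand-in for `ibm_hr_df`.

```python
import pandas as pd

# Hypothetical stand-in for ibm_hr_df; values are illustrative.
toy_df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})


def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    distinct_counts = df.nunique(dropna=False)
    return distinct_counts[distinct_counts <= 1].index.tolist()


constant_cols = constant_columns(toy_df)
print(constant_cols)

# Drop the constant columns, keeping the informative ones.
reduced_df = toy_df.drop(columns=constant_cols)
print(list(reduced_df.columns))
```

Counting distinct values works for text columns as well as numeric ones, which a variance-based check does not.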
Zooming in on the value counts of the irrelevant attribute __*Over18*__
```python
ibm_hr_df["Over18"].value_counts()
```
Y