# Boruta-Shap
BorutaShap is a wrapper feature selection method that combines the Boruta feature selection algorithm with Shapley values. This combination has proven to outperform the original permutation importance method in both speed and the quality of the feature subset produced. Not only does this algorithm provide a better subset of features, it can also simultaneously provide the most accurate and consistent global feature rankings, which can be used for model inference too. Unlike the original R package, which limits the user to a Random Forest model, BorutaShap allows the user to choose any tree-based learner as the base model in the feature selection process.
Despite BorutaShap's runtime improvements, the SHAP TreeExplainer scales linearly with the number of observations, making its use cumbersome for large datasets. To combat this, BorutaShap includes a sampling procedure which uses the smallest possible subsample of the data available at each iteration of the algorithm. It finds this sample by comparing the distributions produced by an isolation forest of the sample and of the full data using a KS test. In experiments, this procedure reduced the run time by up to 80% while still creating a valid approximation of the entire data set. Even with these improvements the user might still want a faster solution, so BorutaShap includes an option to use the mean decrease in Gini impurity. This importance measure is independent of the size of the dataset, as it uses the tree's structure to compute a global feature ranking, making it much faster than SHAP on larger datasets. Although this metric returns somewhat comparable feature subsets, it is not a reliable measure of global feature importance in spite of its widespread use. Thus, I would recommend using the SHAP metric whenever possible.
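
The sampling idea can be sketched in plain Python. This is not BorutaShap's implementation: the `scores` list stands in for the per-row isolation forest scores, and the function names, the `max_gap` cut-off on the KS statistic (used here instead of a p-value), and the `step` size are all assumptions made for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(sample, population):
    """Two-sample KS statistic: the largest gap between two empirical CDFs."""
    a, b = sorted(sample), sorted(population)
    gap = 0.0
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def smallest_matching_sample(scores, max_gap=0.1, step=0.05, seed=0):
    """Grow a random subsample until its score distribution matches the full data.

    Returns the row indices of the first subsample whose KS statistic against
    the full set of scores is at most max_gap (a hypothetical threshold).
    """
    rng = random.Random(seed)
    fraction = step
    while fraction < 1.0:
        size = max(1, int(len(scores) * fraction))
        idx = rng.sample(range(len(scores)), size)
        if ks_statistic([scores[i] for i in idx], scores) <= max_gap:
            return idx
        fraction += step
    # No subsample was close enough, so fall back to every row.
    return list(range(len(scores)))


# Simulated anomaly scores; in practice these would come from an isolation forest.
data_rng = random.Random(1)
scores = [data_rng.gauss(0, 1) for _ in range(2000)]
idx = smallest_matching_sample(scores)
print(f"kept {len(idx)} of {len(scores)} rows")
```

Comparing one-dimensional scores rather than the raw columns keeps the check cheap, since a single KS comparison covers every feature at once.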
### Algorithm
1. Start by creating new copies of all the features in the data set and name them shadow + feature_name. Shuffle these newly added features to remove their correlations with the response variable.
2. Run a classifier on the extended data with the random shadow features included. Then rank the features using a feature importance metric; the original algorithm used permutation importance as its metric of choice.
3. Create a threshold using the maximum importance score from the shadow features. Then assign a hit to any feature that exceeded this threshold.
4. For every unassigned feature, perform a two-sided T-test of equality.
5. Attributes which have an importance significantly lower than the threshold are deemed 'unimportant' and are removed from the process. Attributes which have an importance significantly higher than the threshold are deemed 'important'.
6. Remove all shadow attributes and repeat the procedure until an importance has been assigned for each feature, or the algorithm has reached the previously set limit of runs.
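
Steps 1 to 3 can be sketched with the standard library alone. This is an illustration under stated assumptions rather than BorutaShap's code: absolute Pearson correlation is a stand-in for the SHAP or permutation importance a real model would supply, and `abs_correlation`, `count_hits`, and the toy data are hypothetical. The statistical test of steps 4 and 5 is left out.

```python
import random
from statistics import mean


def abs_correlation(xs, ys):
    """Stand-in importance score: absolute Pearson correlation with the target."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def count_hits(features, target, n_trials=20, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_trials):
        # Step 1: shuffled copies keep each column's values but break the link to the target.
        shadows = {}
        for name, values in features.items():
            shuffled = list(values)
            rng.shuffle(shuffled)
            shadows["shadow_" + name] = shuffled
        # Step 2: score real and shadow features with the same metric.
        real_scores = {n: abs_correlation(v, target) for n, v in features.items()}
        shadow_scores = [abs_correlation(v, target) for v in shadows.values()]
        # Step 3: the strongest shadow feature sets the threshold for a hit.
        threshold = max(shadow_scores)
        for name, score in real_scores.items():
            if score > threshold:
                hits[name] += 1
    return hits


# Toy data: "signal" drives the target, "noise" is unrelated to it.
data_rng = random.Random(42)
signal = [data_rng.gauss(0, 1) for _ in range(200)]
noise = [data_rng.gauss(0, 1) for _ in range(200)]
target = [2 * s + data_rng.gauss(0, 0.1) for s in signal]
hits = count_hits({"signal": signal, "noise": noise}, target)
print(hits)
```

A feature that truly drives the target should collect a hit in nearly every trial, while an unrelated one only beats the shadow threshold by chance.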
If the algorithm has reached its set limit of runs and an importance has not been assigned to each feature, the user has two choices: either increase the number of runs, or use the tentative rough fix function, which compares the median importance values of the unassigned features with that of the maximum shadow feature to make the decision.
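
The median comparison behind the rough fix might look like the sketch below. The function name, the argument layout, and the importance values are all made up for illustration; they are not taken from the package.

```python
from statistics import median


def tentative_rough_fix(history, shadow_max_history, tentative):
    """Resolve undecided features by comparing median importances.

    history maps each feature name to its importance in every trial;
    shadow_max_history holds the maximum shadow importance per trial.
    """
    shadow_median = median(shadow_max_history)
    decisions = {}
    for name in tentative:
        feature_median = median(history[name])
        decisions[name] = "accepted" if feature_median > shadow_median else "rejected"
    return decisions


# Hypothetical importance histories over four trials.
history = {
    "age": [0.30, 0.28, 0.35, 0.31],
    "zip_code": [0.05, 0.12, 0.04, 0.06],
}
shadow_max_history = [0.10, 0.09, 0.11, 0.10]

decisions = tentative_rough_fix(history, shadow_max_history, ["age", "zip_code"])
print(decisions)  # prints {'age': 'accepted', 'zip_code': 'rejected'}
```

Using the median rather than the mean keeps a single unusually high or low trial from deciding the outcome.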
## Installation
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install BorutaShap.
```bash
pip install BorutaShap
```
## Usage
For more use cases, such as alternative models, sampling, or changing the importance metric, please view the notebooks in the repository's example folder.
### Using Shap and Basic Random Forest
```python
from BorutaShap import BorutaShap, load_data
X, y = load_data(data_type='regression')
X.head()
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/BostonHead.PNG?raw=true" height="203" width="722">
```python
# No model is passed, so the default Random Forest is used.
# classification=False marks this as a regression problem.
Feature_Selector = BorutaShap(importance_measure='shap',
                              classification=False)

Feature_Selector.fit(X=X, y=y, n_trials=100, random_state=0)
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/BostonOutput.PNG?raw=true">
```python
# Returns Boxplot of features
Feature_Selector.plot(which_features='all')
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/Bostonplot.PNG?raw=true" height="530" width="699">
```python
# Returns a subset of the original data with the selected features
subset = Feature_Selector.Subset()
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/bostonsubset.PNG?raw=true" height="194" width="465">
### Using BorutaShap with Another Model (XGBoost)
```python
from BorutaShap import BorutaShap, load_data
from xgboost import XGBClassifier
X, y = load_data(data_type='classification')
X.head()
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/binaryhead.PNG?raw=true">
```python
model = XGBClassifier()

# classification=True marks this as a classification problem;
# set it to False for regression.
Feature_Selector = BorutaShap(model=model,
                              importance_measure='shap',
                              classification=True)

Feature_Selector.fit(X=X, y=y, n_trials=100, random_state=0)
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/binaryoutput.PNG?raw=true">
```python
# Returns Boxplot of features
Feature_Selector.plot(which_features='all')
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/binaryplot.PNG?raw=true" height="565" width="671">
```python
# Returns a subset of the original data with the selected features
subset = Feature_Selector.Subset()
```
<img src="https://github.com/Ekeany/Boruta-Shap/blob/master/images/binarysubset.PNG?raw=true">
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
## License
[MIT](https://choosealicense.com/licenses/mit/)