# Problem Statement:
A large company has around 4000 employees. Every year, however, around 15% of its employees leave the company and need to be replaced from the talent pool available in the job market. The management believes that this level of attrition (employees leaving, either on their own or because they were fired) is bad for the company, for the following reasons:
1. The former employees’ projects get delayed, which makes it difficult to meet timelines, resulting in a reputation loss among consumers and partners
2. A sizeable department has to be maintained, for the purposes of recruiting new talent
3. More often than not, the new employees have to be trained for the job and/or given time to acclimatise themselves to the company
Hence, the management wants to understand what factors they should focus on, in order to curb attrition. In other words, they want to know what changes they should make to their workplace, in order to get most of their employees to stay. Also, they want to know which of these variables is most important and needs to be addressed right away.
# About Data:
The data is divided into five files:
1. Employee information (demographic information, education details, salary, etc.)
2. Employee survey details
3. Manager survey details
4. Employee in-time (check-in time)
5. Employee out-time (check-out time)
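The five files have to be brought down to one row per employee before modelling. The following is a minimal sketch of one way to do that in base R, not the code used in this project. It assumes a shared key column called `EmployeeID`, and that the in-time and out-time files hold one row per employee (in the same order) and one timestamp column per day. All column names and values below are made up for illustration.

```r
# Tiny synthetic stand-ins for the five files; every column name is an assumption.
general <- data.frame(EmployeeID = 1:3, Age = c(24, 41, 35),
                      Attrition = c("Yes", "No", "No"))
emp_survey <- data.frame(EmployeeID = 1:3, JobSatisfaction = c(1, 4, 3))
mgr_survey <- data.frame(EmployeeID = 1:3, PerformanceRating = c(3, 4, 3))

# Assumed layout: one row per employee, one column per working day.
in_time <- data.frame(EmployeeID = 1:3,
                      d1 = c("2015-01-02 09:30:00", "2015-01-02 10:00:00", NA),
                      d2 = c("2015-01-05 09:00:00", "2015-01-05 10:15:00",
                             "2015-01-05 09:45:00"))
out_time <- data.frame(EmployeeID = 1:3,
                       d1 = c("2015-01-02 19:30:00", "2015-01-02 17:00:00", NA),
                       d2 = c("2015-01-05 18:00:00", "2015-01-05 17:45:00",
                              "2015-01-05 17:45:00"))

# Convert every timestamp column (all but the key) to seconds since the epoch.
to_secs <- function(df) {
  sapply(df[-1], function(x) {
    as.numeric(as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"))
  })
}

# Hours worked per day, then the per-employee average (missing days are skipped).
hours <- (to_secs(out_time) - to_secs(in_time)) / 3600
avg_hours <- data.frame(EmployeeID = in_time$EmployeeID,
                        AvgHours = rowMeans(hours, na.rm = TRUE))

# Merge everything on the shared key, one data frame at a time.
hr <- Reduce(function(a, b) merge(a, b, by = "EmployeeID"),
             list(general, emp_survey, mgr_survey, avg_hours))
print(hr)
```

Collapsing the daily timestamps into a single average keeps the merged table at one row per employee, which is the shape a logistic regression needs.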
# Solution:
As part of the solution, I carried out the following steps:
1. Collated all the files into one dataset
2. Data preparation
- Cleaning the data
- Checking for missing or NA values
- Outlier detection/treatment
- Correcting the variable formats
3. Exploratory Data Analysis (by plotting different graphs)
4. Feature standardisation
- Normalised the continuous features
- Created dummy variables for the categorical features
5. Divided the data into training and test sets
6. Built a logistic regression model using the `glm()` function in R, with the `family` parameter set to `binomial`.
- First built a general model with all the variables
- Then used the `stepAIC` function (with direction set to "both") to remove the insignificant variables
- Checked the p-value (which should be below 0.05) and the VIF of each remaining variable to remove further variables. A VIF of around 2 or 3 is usually acceptable. A variable with a high VIF can be kept if its p-value is significant, or it can be dropped on a trial basis and left out for good if the accuracy does not decrease.
7. Predicted the probabilities of attrition for the test data (the output of a logistic regression model is a probability, not a class)
8. Tested various probability cutoffs (50%, 40%, etc.) to convert the probabilities into classes, and used `confusionMatrix` to check how many records were classified correctly.
9. Then found the optimal probability cutoff and calculated the accuracy, specificity and sensitivity
- **Accuracy -** Accuracy is the proportion of all records predicted correctly, i.e. yeses (or positives) predicted as yeses plus nos (or negatives) predicted as nos, divided by the total number of records.
- **Specificity -** Specificity is the proportion of nos (or negatives) correctly predicted by the model as nos (or negatives).
- **Sensitivity -** Sensitivity is the proportion of yeses (or positives) correctly predicted by the model as yeses (or positives).
10. Also calculated the KS statistic and plotted the gain and lift charts
* **KS statistic:** A high KS statistic means that the model not only ranks the churners at the top but also ranks the non-churners at the bottom. As a rule of thumb, a good model has a KS statistic above 40%, and the maximum falls in the top few deciles.
* **ROC curve:** I plotted the ROC curve, which shows the % of events captured against the % of non-events captured (in credit-scoring terms, % of bad vs. % of good). The curve of a perfect model, together with the diagonal, forms a right triangle, whereas a random model is the straight diagonal line. In short, a curve that rises steeply indicates a good model.
* **Gain and Lift Chart:** Gain and lift charts are mainly used to check how well the predicted probabilities rank-order the records.
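The modelling and evaluation steps above can be sketched end to end as follows. This is an illustration on synthetic data, not the project's actual code. The variable names, the 70/30 split and the rule for picking the cutoff (the point where sensitivity and specificity are closest) are all assumptions. The metrics are computed by hand in base R rather than with `confusionMatrix`, and the VIF check is left out.

```r
library(MASS)  # provides stepAIC(); MASS ships with standard R installations

set.seed(42)
n <- 1000
# Synthetic data: Age and Satisfaction drive attrition, Noise does not.
hr <- data.frame(Age = rnorm(n, 37, 9),
                 Satisfaction = sample(1:4, n, replace = TRUE),
                 Noise = rnorm(n))
p_true <- plogis(-1 - 0.08 * (hr$Age - 37) - 0.6 * (hr$Satisfaction - 2.5))
hr$Attrition <- rbinom(n, 1, p_true)

# Step 4: scale the continuous features.
hr$Age <- as.numeric(scale(hr$Age))
hr$Noise <- as.numeric(scale(hr$Noise))

# Step 5: train/test split (the 70/30 ratio is an assumption).
idx <- sample(seq_len(n), size = round(0.7 * n))
train <- hr[idx, ]
test <- hr[-idx, ]

# Step 6: full model, then stepwise selection in both directions.
full_model <- glm(Attrition ~ ., data = train, family = binomial)
final_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# Step 7: predicted probabilities (not classes) for the test data.
test_prob <- predict(final_model, newdata = test, type = "response")

# Steps 8-9: accuracy, sensitivity and specificity at a given cutoff.
metrics_at <- function(cutoff, prob, actual) {
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & actual == 1)
  tn <- sum(pred == 0 & actual == 0)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  c(cutoff = cutoff,
    accuracy = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
cutoffs <- seq(0.05, 0.95, by = 0.05)
results <- t(sapply(cutoffs, metrics_at, prob = test_prob,
                    actual = test$Attrition))

# Assumed selection rule: the cutoff where sensitivity and specificity are closest.
gap <- abs(results[, "sensitivity"] - results[, "specificity"])
best <- results[which.min(gap), ]
print(round(best, 3))

# Step 10: sort by predicted probability, highest first.
ord <- order(test_prob, decreasing = TRUE)
actual_sorted <- test$Attrition[ord]
cum_event <- cumsum(actual_sorted) / sum(actual_sorted)
cum_nonevent <- cumsum(1 - actual_sorted) / sum(1 - actual_sorted)

# KS statistic: largest gap between the two cumulative capture rates.
# The (cum_nonevent, cum_event) pairs also trace out the ROC curve.
ks <- max(abs(cum_event - cum_nonevent))
cat("KS statistic:", round(ks, 3), "\n")

# Gain and lift by decile of predicted probability.
decile <- ceiling(seq_along(ord) * 10 / length(ord))
gain <- cumsum(tapply(actual_sorted, decile, sum)) / sum(actual_sorted)
lift <- gain / (seq_along(gain) / 10)
print(round(cbind(gain = gain, lift = lift), 3))
```

Sweeping the cutoffs in one table makes the trade-off visible: lowering the cutoff raises sensitivity at the cost of specificity, so the choice depends on which kind of error is more costly.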
# Which variables are indicators/predictor of attrition?
1. Employees in their early 20s, those who are single, and those with less experience are more prone to attrition, so we should focus on them, for example through meetings and surveys.
2. Keep an eye on how long an employee stayed at previous companies. Frequent job changes are a warning sign.
3. Whether the employee was recently promoted
4. Working environment and job satisfaction