# Salary-Prediction
Predict salary based on job descriptions
## Defining the problem
Job salaries, and the differences between them, are determined by a number of factors, including skill set, experience and the job title itself.
Given the available datasets, we would like to estimate job salaries in order to understand the key features driving them, and to deploy a model that predicts salaries so that reasonable salaries can be gauged from these features.
## Approach
### 1. Data loading
- 'train_features': training dataset of features for each job ID: job title, company, degree, major, industry, years of experience and distance from a metropolis (miles).
- 'train_salaries': training dataset of salaries (target variable) for each job ID.
- 'test_features': test dataset with the same features as 'train_features'.
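
A minimal loading sketch with pandas, assuming the three datasets are stored as CSV files named after the datasets. The `.csv` file names, the `data_dir` argument and the `load_datasets` helper are assumptions for illustration, not taken from the project.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir):
    """Load the three datasets from CSV files in data_dir.

    The file names below are assumed for illustration.
    """
    data_dir = Path(data_dir)
    train_features = pd.read_csv(data_dir / "train_features.csv")
    train_salaries = pd.read_csv(data_dir / "train_salaries.csv")
    test_features = pd.read_csv(data_dir / "test_features.csv")
    return train_features, train_salaries, test_features
```

Calling `load_datasets("data")` would then return the three DataFrames in the order listed above.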
### 2. Data cleaning
As well as finding the data types and size of each dataset, data cleaning involved discovering and treating missing data, duplicates, invalid data (for example, salaries <= 0) and suspected outliers. Lower outliers are values below the 25th percentile minus 1.5 * the interquartile range (IQR); upper outliers are values above the 75th percentile plus 1.5 * IQR.
Below are the summary statistics for the categorical and numeric data.
#### Categorical data summary
| | jobId | companyId | jobType | degree | major | industry |
|:-------|:-----------------|:------------|:----------|:------------|:--------|:-----------|
| count | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 |
| unique | 1000000 | 63 | 8 | 5 | 9 | 7 |
| top | JOB1362685349471 | COMP39 | Senior | High School | None | Web |
| freq | 1 | 16193 | 125886 | 236976 | 532355 | 143206 |
#### Numeric data summary
| | yearsExperience | milesFromMetropolis | salary |
|:--------------------|------------------:|----------------------:|---------:|
| count | 1e+06 | 1e+06 | 1e+06 |
| mean | 11.9924 | 49.5293 | 116.062 |
| std | 7.21239 | 28.8777 | 38.7179 |
| min | 0 | 0 | 0 |
| 25% | 6 | 25 | 88 |
| 50% | 12 | 50 | 114 |
| 75% | 18 | 75 | 141 |
| max | 24 | 99 | 301 |
| lower_outlier_check | 0 | 0 | 1 |
| lower_outliers | -12 | -50 | 8.5 |
| upper_outlier_check | 0 | 0 | 1 |
| upper_outliers | 36 | 150 | 220.5 |
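
The outlier bounds in the table can be reproduced from the quartiles alone. The sketch below is only an illustration of the 1.5 * IQR rule; the `iqr_bounds` helper is a hypothetical name, not a function from the project.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier bounds from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Salary quartiles from the numeric summary table: 25% = 88, 75% = 141.
lower, upper = iqr_bounds(88, 141)
print(lower, upper)  # 8.5 220.5
```

The same rule gives -12 and 36 for 'yearsExperience' (quartiles 6 and 18), and -50 and 150 for 'milesFromMetropolis' (quartiles 25 and 75), matching the table.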
While no missing data, duplicates or invalid data were found, suspected outliers in terms of salary were found and explored further:
- Five observations had a salary of zero (table below). These are likely missing salary entries and were dropped, since salary is the target being predicted.
| | jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis | salary |
|-------:|:-----------------|:------------|:---------------|:------------|:------------|:-----------|------------------:|----------------------:|---------:|
| 30559 | JOB1362684438246 | COMP44 | Junior | Doctoral | Math | Auto | 11 | 7 | 0 |
| 495984 | JOB1362684903671 | COMP34 | Junior | None | None | Oil | 1 | 25 | 0 |
| 652076 | JOB1362685059763 | COMP25 | CTO | High School | None | Auto | 6 | 60 | 0 |
| 816129 | JOB1362685223816 | COMP42 | Manager | Doctoral | Engineering | Finance | 18 | 6 | 0 |
| 828156 | JOB1362685235843 | COMP40 | Vice President | Masters | Engineering | Web | 3 | 29 | 0 |
- 7,117 observations were suspected upper-end outliers (examples are in the table below). No action was taken on these observations, as their high salaries are plausible: they tend to be senior positions held by people with higher degrees. Note that these outliers were not tied to specific companies but were mostly found in the oil and finance industries, and most majored in Engineering or Business.
| | jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis | salary |
|----:|:-----------------|:------------|:---------------|:---------|:--------|:-----------|------------------:|----------------------:|---------:|
| 266 | JOB1362684407953 | COMP30 | CEO | Masters | Biology | Oil | 23 | 60 | 223 |
| 362 | JOB1362684408049 | COMP38 | CTO | Masters | None | Health | 24 | 3 | 223 |
| 560 | JOB1362684408247 | COMP53 | CEO | Masters | Biology | Web | 22 | 7 | 248 |
| 670 | JOB1362684408357 | COMP26 | CEO | Masters | Math | Auto | 23 | 9 | 240 |
| 719 | JOB1362684408406 | COMP54 | Vice President | Doctoral | Biology | Oil | 21 | 14 | 225 |
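
The two treatments above (dropping zero salaries, and flagging but keeping upper outliers) might look like the following in pandas. This is a sketch under assumptions: the rows are made up for illustration, only the column names and the 220.5 bound come from the tables above, and the project's own code may differ.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, only the column names
# follow the tables above.
train_data = pd.DataFrame(
    {
        "jobId": ["A", "B", "C", "D"],
        "industry": ["Oil", "Auto", "Finance", "Web"],
        "salary": [230, 0, 250, 110],
    }
)

# Drop rows with non-positive salaries, which cannot serve as a target.
train_data = train_data[train_data["salary"] > 0].reset_index(drop=True)

# Flag (but keep) rows above the upper bound from the numeric summary.
UPPER_BOUND = 220.5
upper_outliers = train_data[train_data["salary"] > UPPER_BOUND]

# Count suspected upper outliers per industry.
print(upper_outliers["industry"].value_counts())
```

Grouping the flagged rows by 'companyId', 'jobType', 'degree' or 'major' in the same way is one route to the kind of breakdown described above.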
!!! Insert line charts of upper outliers here !!!
### 3. Exploratory Data Analysis (EDA)
For convenience, 'train_data' was defined as the merge of the training datasets after data cleaning. EDA was performed to better understand and visualise the data. For EDA, the numeric and categorical variables were defined as follows:
- Numeric variables: i) Target variable ('salary') and ii) numeric features - 'yearsExperience' and 'milesFromMetropolis'.
- Categorical features: 'companyId','jobType','degree','major' and 'industry'.
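
A minimal sketch of building 'train_data' and the variable groups listed above. It assumes the two training datasets share the 'jobId' column and are combined with an inner join. The rows are made up, and the join type and the list names (`numeric_features`, `categorical_features`) are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the README.
train_features = pd.DataFrame(
    {
        "jobId": ["J1", "J2"],
        "companyId": ["COMP1", "COMP2"],
        "jobType": ["Junior", "CEO"],
        "degree": ["Masters", "Doctoral"],
        "major": ["Math", "Biology"],
        "industry": ["Auto", "Oil"],
        "yearsExperience": [3, 20],
        "milesFromMetropolis": [10, 50],
    }
)
train_salaries = pd.DataFrame({"jobId": ["J1", "J2"], "salary": [90, 200]})

# Join features and salaries on the shared job ID (inner join assumed).
train_data = train_features.merge(train_salaries, on="jobId", how="inner")

# Variable groups used during EDA.
target = "salary"
numeric_features = ["yearsExperience", "milesFromMetropolis"]
categorical_features = ["companyId", "jobType", "degree", "major", "industry"]

print(train_data[numeric_features + [target]].describe())
```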
#### Numeric features
![EDA_salary](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/numeric_target_plots.png)
![EDA_milesFromMetropolis](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/numeric_feature_plotsmilesFromMetropolis.png)
![EDA_yearsExperience](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/numeric_feature_plotsyearsExperience.png)
#### Categorical features
![EDA_companyId](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/categorical_feature_plotscompanyId.png)
![EDA_degree](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/categorical_feature_plotsdegree.png)
![EDA_industry](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/categorical_feature_plotsindustry.png)
![EDA_jobType](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/categorical_feature_plotsjobType.png)
![EDA_major](https://github.com/Bennett-Heung/Salary-Prediction/blob/main/images/categorical_feature_plotsmajor.png)
#### Correlations
The correlation heatmap below shows how the feature variables correlate with salary (the target variable) and with each other.
In terms of correlations with salary (target variable):
- Job position (jobType) is the variable most positively correlated with salary, followed by degree (degree), years of experience (yearsExperience) and major (major).
- There is also a slight positive correlation between job industry (industry) and salary.
- There is no significant cor