## <i> Project 2: Supervised Machine Learning </i>
### **_by Sebastian Sbirna_**
***
In this report, we will evaluate the performance and characteristics of various types of supervised learning models on our chosen Heart Disease dataset. For more information on the data dictionary and the properties of our observations and attributes, please refer to our previous project [1].
### I. Regression Models
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure, plot, xlabel, ylabel, legend, show, clim, semilogx, loglog, title, subplot, grid
import sklearn.linear_model as lm
from sklearn.linear_model import Ridge, LinearRegression, LogisticRegression
from sklearn import model_selection, tree
from scipy import stats
import torch
from toolbox_02450 import feature_selector_lr, bmplot, rlr_validate, train_neural_net
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
```
### Regression, part A
Our dataset was collected for classification purposes and has 14 attributes: 5 numerical and 9 categorical. One of the 14 is the ‘target’ variable, meant to be predicted in a classification setting. None of the variables were collected specifically for regression, so the model results in this section may be prone to larger errors.
Still, we will use one of the five numerical (_of ratio type_) attributes as our criterion (_i.e. dependent_) variable for a regression analysis, with the other 13 attributes (_which expand to 20 attributes after a one-out-of-K encoding_) all acting as predictor (_i.e. independent_) variables. We must now decide which variable is best suited to serve as the criterion.
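As a quick illustration of why 13 predictors become 20 after encoding: `pd.get_dummies` expands each k-level categorical attribute into k − 1 dummy columns when the first level is dropped (a toy sketch with hypothetical data):

```python
import pandas as pd

# A hypothetical 3-level categorical column expands into 2 dummy columns
# once the first level is dropped
toy = pd.DataFrame({'cp': [1, 2, 3, 2]})
encoded = pd.get_dummies(toy, columns=['cp'], drop_first=True)
print(encoded.columns.tolist())  # -> ['cp_2', 'cp_3']
```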
However, our dataset has one important peculiarity which makes regression particularly difficult: most of our attributes are uncorrelated or only weakly correlated with each other, so the dataset stores very little predictive power for any candidate criterion variable.
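On the full dataset this claim can be verified with `df[numerical_columns].corr()`; the sketch below only illustrates the check, using the five sample observations shown later in this notebook rather than the full data:

```python
import pandas as pd

# Pairwise correlations among the numerical attributes; the five rows here
# are just the first observations of the dataset, so the numbers printed are
# illustrative, not the full-dataset correlations
sample = pd.DataFrame({
    'age':      [63, 37, 41, 56, 57],
    'trestbps': [145, 130, 130, 120, 120],
    'chol':     [233, 250, 204, 236, 354],
    'thalach':  [150, 187, 172, 178, 163],
    'oldpeak':  [2.3, 3.5, 1.4, 0.8, 0.6],
})
print(sample.corr().round(2))
```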
#### Loading the dataset and performing data wrangling:
```python
df = pd.read_csv('heart.csv')
```
```python
df.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>sex</th>
<th>cp</th>
<th>trestbps</th>
<th>chol</th>
<th>fbs</th>
<th>restecg</th>
<th>thalach</th>
<th>exang</th>
<th>oldpeak</th>
<th>slope</th>
<th>ca</th>
<th>thal</th>
<th>target</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>63</td>
<td>1</td>
<td>3</td>
<td>145</td>
<td>233</td>
<td>1</td>
<td>0</td>
<td>150</td>
<td>0</td>
<td>2.3</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>37</td>
<td>1</td>
<td>2</td>
<td>130</td>
<td>250</td>
<td>0</td>
<td>1</td>
<td>187</td>
<td>0</td>
<td>3.5</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>41</td>
<td>0</td>
<td>1</td>
<td>130</td>
<td>204</td>
<td>0</td>
<td>0</td>
<td>172</td>
<td>0</td>
<td>1.4</td>
<td>2</td>
<td>0</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<th>3</th>
<td>56</td>
<td>1</td>
<td>1</td>
<td>120</td>
<td>236</td>
<td>0</td>
<td>1</td>
<td>178</td>
<td>0</td>
<td>0.8</td>
<td>2</td>
<td>0</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>57</td>
<td>0</td>
<td>0</td>
<td>120</td>
<td>354</td>
<td>0</td>
<td>1</td>
<td>163</td>
<td>1</td>
<td>0.6</td>
<td>2</td>
<td>0</td>
<td>2</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
```python
# Remove observations carrying invalid category codes
# (ca == 4 and thal == 0 are undefined in the data dictionary)
df.drop(index = (df[df.ca == 4]).index, inplace = True)
df.drop(index = (df[df.thal == 0]).index, inplace = True)

# Remap the 0-based codes of this dataset version back to the original UCI
# Cleveland encoding (thal: 3/6/7; cp: 1-4; slope: 1-3); temporary values
# (e.g. cp == 7) prevent already-remapped codes from being overwritten
df.loc[df.thal == 1, 'thal'] = 6
df.loc[df.thal == 3, 'thal'] = 7
df.loc[df.thal == 2, 'thal'] = 3

df.loc[df.cp == 0, 'cp'] = 4
df.loc[df.cp == 3, 'cp'] = 7
df.loc[df.cp == 2, 'cp'] = 3
df.loc[df.cp == 1, 'cp'] = 2
df.loc[df.cp == 7, 'cp'] = 1

df.loc[df.slope == 2, 'slope'] = 3
df.loc[df.slope == 1, 'slope'] = 2
df.loc[df.slope == 0, 'slope'] = 1
```
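The chained `.loc` reassignments above depend on their order; a minimal sketch on a hypothetical toy column shows how the `cp` permutation routes one code through the temporary value 7 so that it is not clobbered by later assignments:

```python
import pandas as pd

# Toy column carrying the four cp codes used by this dataset version
toy = pd.DataFrame({'cp': [0, 1, 2, 3]})

# Same remapping order as above: code 3 is parked at 7 until the end
toy.loc[toy.cp == 0, 'cp'] = 4
toy.loc[toy.cp == 3, 'cp'] = 7
toy.loc[toy.cp == 2, 'cp'] = 3
toy.loc[toy.cp == 1, 'cp'] = 2
toy.loc[toy.cp == 7, 'cp'] = 1

print(toy.cp.tolist())  # -> [4, 2, 3, 1]
```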
```python
numerical_columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
```
```python
# Rename the binary 'sex' attribute and one-out-of-K encode the multi-level
# categorical attributes; drop_first = True drops one dummy level from each
# to avoid perfect multicollinearity among the predictors
df['sex_male'] = df.sex
df.drop(columns = 'sex', inplace = True)

df = pd.get_dummies(data = df, columns = ['cp'], drop_first=True)
df.rename({'cp_2': 'cp_atypical', 'cp_3' : 'cp_non_anginal', 'cp_4': 'cp_asymptomatic'}, axis = 'columns', inplace = True)

df['fbs_true'] = df.fbs
df.drop(columns = 'fbs', inplace = True)

df = pd.get_dummies(data = df, columns = ['restecg'], drop_first=True)
df.rename({'restecg_1': 'restecg_st_t', 'restecg_2' : 'restecg_hypertrophy'}, axis = 'columns', inplace = True)

df['exang_yes'] = df.exang
df.drop(columns = 'exang', inplace = True)

df = pd.get_dummies(data = df, columns = ['slope'], drop_first=True)
df.rename({'slope_2': 'slope_flat', 'slope_3' : 'slope_downsloping'}, axis = 'columns', inplace = True)

df = pd.get_dummies(data = df, columns = ['ca'], drop_first=True)
df = pd.get_dummies(data = df, columns = ['thal'], drop_first=True)
df.rename({'thal_6': 'thal_fixed', 'thal_7' : 'thal_reversible'}, axis = 'columns', inplace = True)

# Move 'target' to the last column position by re-creating it under a
# temporary name
df['target_true'] = df.target
df.drop(columns = 'target', inplace = True)
df.rename({'target_true': 'target'}, axis = 'columns', inplace = True)
```
```python
df.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>trestbps</th>
<th>chol</th>
<th>thalach</th>
<th>oldpeak</th>
<th>sex_male</th>
<th>cp_atypical</th>
<th>cp_non_anginal</th>
<th>cp_asymptomatic</th>
<th>fbs_true</th>
<th>...</th>
<th>restecg_hypertrophy</th>
<th>exang_yes</th>
<th>slope_flat</th>
<th>slope_downsloping</th>
<th>ca_1</th>
<th>ca_2</th>
<th>ca_3</th>
<th>thal_fixed</th>
<th>thal_reversible</th>
<th>target</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>63</td>
<td>145</td>
<td>233</td>
<td>150</td>
<td>2.3</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>37</td>
<td>130</td>
<td>250</td>
<td>187</td>
<td>3.5</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>41</td>
<td>130</td>
<td>204</td>
<td>172</td>
<td>1.4</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>