
From Linear Regression to Logistic Regression

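In Chapter 2, Linear Regression, we introduced simple linear regression, multiple linear regression, and polynomial regression. These models are all special cases of the generalized linear model, a flexible framework that requires fewer assumptions than ordinary linear regression. In this chapter, we discuss another special case of the generalized linear model: logistic regression.

Unlike the models discussed so far, logistic regression is used for classification tasks. The goal of classification is to find a function that maps observations to their associated classes or labels. A learning algorithm must use pairs of feature vectors and their corresponding labels to estimate the parameters of this mapping function so that it classifies as well as possible. In binary classification, the classifier must assign each instance to one of two classes. Examples of binary classification include predicting whether a patient has a particular disease, whether an audio sample contains human speech, and whether the Duke men's basketball team will win or lose its first game in the NCAA tournament. In multiclass classification, the classifier must assign each instance one label out of a set of labels. In this chapter, we use logistic regression to work through several classification problems, study performance measures for classification tasks, and apply the feature extraction methods from the previous chapter.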

Binary Classification with Logistic Regression

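Ordinary linear regression assumes that the response variable is normally distributed. The normal distribution is also known as the Gaussian distribution or the bell curve. Normally distributed data are symmetric, and their mean, median, and mode are equal. Many natural phenomena are approximately normally distributed. Human height is one example: people as tall as Yao Ming are extremely rare and lie beyond the 99th percentile.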

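In some problems the response variable is not normally distributed. For example, the outcome of a coin flip, heads or tails, follows a Bernoulli distribution, also called the two-point distribution or 0-1 distribution. It describes an event that occurs with probability $P$ and does not occur with probability $1 - P$, where $P$ lies in the interval $[0, 1]$. Linear regression assumes that a constant change in an explanatory variable produces a constant change in the response variable. This assumption does not hold when the value of the response variable is a probability. Generalized linear models remove the assumption by using a link function to describe the relationship between the explanatory variables and the response variable. In fact, we already used a link function in Chapter 2: ordinary linear regression is the special case of the generalized linear model that uses the identity link function, which relates a linear combination of the explanatory variables directly to a normally distributed response variable. If the response variable is not normally distributed, a different link function is needed.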
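To see why the identity link is a poor fit for a probability, the following minimal sketch fits ordinary linear regression to a binary response. The data are made up for illustration. Away from the training range, the fitted line returns values below 0 and above 1, which cannot be interpreted as probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one explanatory variable and a binary (0/1) response
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LinearRegression().fit(X, y)

# Evaluate the fitted line below, inside, and above the training range
X_new = np.array([[-3.0], [2.5], [8.0]])
linear_predictions = model.predict(X_new)
print(linear_predictions)
```

The middle prediction falls inside [0, 1], but the two outer ones do not. This is the motivation for passing the linear combination through a function whose output is bounded.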

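In logistic regression, the response variable describes the probability that the outcome is the positive case, much like the probability that a coin lands heads. If the response variable equals or exceeds a chosen threshold, the positive class (heads) is predicted; otherwise, the negative class (tails) is predicted. As in linear regression, the response variable is modeled from a linear combination of the explanatory variables, but that combination is passed through the logistic function. The logistic function, which always returns a value between 0 and 1, is:

$$F(t) = \frac{1}{1 + e^{-t}}$$

The following plots $F(t)$ for $t$ in the interval $[-6, 6]$: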

In[27]:

%matplotlib inline

import matplotlib.pyplot as plt

from matplotlib.font_manager import FontProperties

font = FontProperties(fname=r"c:\windows\fonts\msyh.ttc", size=10)

In[18]:

import numpy as np

plt.figure()

plt.axis([-6, 6, 0, 1])

plt.grid(True)

X = np.arange(-6,6,0.1)

y = 1 / (1 + np.e ** (-X))

plt.plot(X, y, 'b-');
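The logistic function can also be checked numerically at a few points. The sketch below is illustrative only: the helper name `logistic` and the threshold of 0.5 are assumptions, not values taken from the text. It shows that the output stays strictly between 0 and 1, that F(0) = 0.5, and how a threshold turns the output into a class prediction.

```python
import numpy as np

def logistic(t):
    """Logistic function F(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

t = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
probabilities = logistic(t)

# Assumed threshold: predict the positive class when F(t) >= 0.5
threshold = 0.5
predicted_positive = probabilities >= threshold

for value, p, positive in zip(t, probabilities, predicted_positive):
    print(f"t = {value:+.1f}  F(t) = {p:.4f}  positive = {positive}")
```

With a threshold of 0.5, the positive class is predicted exactly when t is non-negative, because F is increasing and F(0) = 0.5.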

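In logistic regression, $t$ is a linear combination of the explanatory variables:

$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta x)}}$$

The logit function is the inverse of the logistic function. It maps $F(x)$ back to the linear combination of the explanatory variables:

$$g(x) = \ln \frac{F(x)}{1 - F(x)} = \beta_0 + \beta x$$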
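The inverse relationship between the logistic and logit functions can be verified numerically. In this sketch the coefficients `beta_0` and `beta` are hypothetical values chosen for illustration. Applying the logit to the logistic output recovers the linear combination.

```python
import numpy as np

def logistic(t):
    """F(t) = 1 / (1 + exp(-t))"""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """g = ln(p / (1 - p)), the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients for a single explanatory variable
beta_0, beta = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

linear_combination = beta_0 + beta * x
probability = logistic(linear_combination)
recovered = logit(probability)

print(probability)
print(recovered)
```

At x = 0.5 the linear combination is zero, so the modeled probability is exactly 0.5.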

With the logistic regression model defined, we can now use it for a classification task.

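Spam Classification

Spam classification is a classic binary classification problem. Here we classify spam SMS messages. We extract feature vectors from the messages with the TF-IDF weighting introduced in Chapter 3 and then classify them with logistic regression.

We use the SMS Spam Collection Data Set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). First, we again use pandas to compute some descriptive statistics: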

In[21]:

import pandas as pd

df = pd.read_csv('mlslpic/SMSSpamCollection', delimiter='\t', header=None

)

print(df.head())

print('含spam短信数量：', df[df[0] == 'spam'][0].count())

print('含ham短信数量：', df[df[0] == 'ham'][0].count())

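Every message already carries a label. The data set is documented as containing 5,574 messages, of which 4,827 are ham and 747 are spam. When the file is loaded with pandas as above, only 4,825 ham messages are counted, a small discrepancy that most likely comes from how quotation marks in the raw file are parsed. We treat ham as the negative class (0) and spam as the positive class (1), although the file itself stores the labels as the strings ham and spam. Looking at the data reveals more of what the model has to work with. The following messages are typical of the two classes:

Spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

Spam: WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

Ham: Sorry my roommates took forever, it ok if I come by now?

Ham: Finished class where are you.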

Let's make predictions with the LogisticRegression class:

In[1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

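First, we load the tab-separated data file with pandas. Then we use train_test_split to divide it into a training set (75%) and a test set (25%), which is the function's default split:

In[36]:
df = pd.read_csv('mlslpic/SMSSpamCollection', delimiter='\t', header=None)
# Column 1 holds the message text, column 0 the label
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])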
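The split can be made explicit and reproducible. The sketch below uses a made-up list of 16 messages rather than the SMS data. The `test_size`, `random_state`, and `stratify` arguments are optional parameters of `train_test_split` that the cell above leaves at their defaults. Stratifying keeps the proportion of spam the same in both parts, which can matter when one class is much rarer than the other.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 16 messages, every fourth one labeled spam
messages = [f"message {i}" for i in range(16)]
labels = ['spam' if i % 4 == 0 else 'ham' for i in range(16)]

train_messages, test_messages, train_labels, test_labels = train_test_split(
    messages, labels,
    test_size=0.25,     # the default, written out explicitly
    random_state=0,     # fixed seed so the split is repeatable
    stratify=labels)    # preserve the spam/ham ratio in both parts

print(len(train_messages), len(test_messages))
print(train_labels.count('spam'), test_labels.count('spam'))
```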

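For reference, the descriptive-statistics cell In[21] printed the following:

      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
Number of spam messages: 747
Number of ham messages: 4825

Next, we create a TfidfVectorizer instance to compute the TF-IDF weights:

In[3]:
vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training messages only
X_train = vectorizer.fit_transform(X_train_raw)
# Reuse the fitted vocabulary and weights for the test messages
X_test = vectorizer.transform(X_test_raw)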
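The difference between fit_transform() and transform() is easier to see on a tiny corpus. The messages below are invented for illustration. The vocabulary is learned from the training messages only, so words that appear only in the test messages are ignored, and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus
train_messages = ["win a free prize now",
                  "are you coming home now",
                  "free entry to win"]
test_messages = ["free prize inside", "coming home late"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_messages)  # learns vocabulary and IDF
X_test = vectorizer.transform(test_messages)        # reuses what was learned

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_train.shape, X_test.shape)
```

With the default tokenizer, single-character tokens such as "a" are dropped, and the unseen words "inside" and "late" contribute nothing to the test matrix.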

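Finally, we create a LogisticRegression instance and train the model. Like LinearRegression, LogisticRegression implements the fit() and predict() methods. We then print a few predictions to inspect them:

In[5]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

In[25]:
# Pair each of the first five test messages with its own prediction
for i, prediction in enumerate(predictions[:5]):
    print('Prediction: %s. Message: %s' % (prediction, X_test_raw.iloc[i]))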

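One recorded run printed the following. The split is random, so the messages and predictions differ from run to run:

Prediction: ham. Message: Are u coming to the funeral home
Prediction: ham. Message: Love isn't a decision, it's a feeling. If we could decide who to love, then, life would be much simpler, but then less magical
Prediction: ham. Message: Dont think so. It turns off like randomlly within 5min of opening
Prediction: spam. Message: Hey happy birthday...
Prediction: ham. Message: None of that's happening til you get here though

How well does the classifier perform? The performance measures we used for linear regression are less appropriate here. We care only about whether the predicted class is correct, as in the tumor prediction problem from Chapter 1, not about how far a prediction falls from the decision boundary. Next, we introduce performance measures for binary classification.

Performance Measures for Binary Classification

There are many performance measures for binary classification. Common ones include accuracy, precision, and recall, the three measures used for the tumor prediction problem in Chapter 1, as well as the F1 measure and the ROC AUC score (the area under the receiver operating characteristic curve). All of these measures are defined in terms of true positives, true negatives, false positives, and false negatives. Positive and negative refer to the predicted class; true and false indicate whether the prediction is correct.

In our spam SMS classifier, a true positive is a spam message that the classifier labels as spam. A true negative is a ham message that the classifier labels as ham. A false positive is a ham message that the classifier labels as spam. A false negative is a spam message that the classifier labels as ham. A confusion matrix, also called a contingency table, can be used to show the true and false positives and negatives. The rows of the matrix are the actual classes, and the columns are the predicted classes.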
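The four outcomes can be counted by hand before turning to a library routine. The labels below are made up for illustration, with 1 standing for spam (the positive class) and 0 for ham. The resulting counts are arranged with actual classes as rows and predicted classes as columns, following the layout described above.

```python
# Hypothetical labels: 1 = spam (positive class), 0 = ham (negative class)
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
true_positives  = sum(1 for a, p in pairs if a == 1 and p == 1)  # spam labeled spam
true_negatives  = sum(1 for a, p in pairs if a == 0 and p == 0)  # ham labeled ham
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)  # ham labeled spam
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)  # spam labeled ham

# Rows: actual class (ham, spam); columns: predicted class (ham, spam)
confusion = [[true_negatives, false_positives],
             [false_negatives, true_positives]]
print(confusion)
```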

In[30]:

from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]

confusion_matrix = confusion_matrix(y_test, y_pred)

print(confusion_matrix)

plt.matshow(confusion_matrix)

plt.title('混淆矩阵',fontproperties=font)

plt.colorbar()

plt.ylabel('实际类型',fontproperties=font)

plt.xlabel('预测类型',fontproperties=font)

plt.show()

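The confusion-matrix cell prints:

[[4 1]
 [2 3]]

Accuracy

Accuracy measures the fraction of the classifier's predictions that are correct. scikit-learn provides the accuracy_score function to compute it:

In[31]:
from sklearn.metrics import accuracy_score
y_pred, y_true = [0, 1, 1, 0], [1, 1, 1, 1]
print(accuracy_score(y_true, y_pred))

This prints:

0.5

LogisticRegression.score() computes the accuracy of the model's predictions on a labeled test set.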
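As a minimal sketch of how score() relates to accuracy_score, the example below trains a classifier on a small invented data set, not the SMS data, and checks that the two ways of computing accuracy agree. All data values here are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical one-feature data set
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5], [1.5], [3.5], [4.5]])
y_test = np.array([0, 1, 1, 1])

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# score() predicts on X_test and compares the result with y_test
score = classifier.score(X_test, y_test)
manual = accuracy_score(y_test, classifier.predict(X_test))
print(score, manual)
```

Calling score() is therefore a shortcut for predicting and then passing the predictions to accuracy_score.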
