CS229 Lecture notes
Andrew Ng
Part V
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) "off-the-shelf" supervised learning algorithms. To tell the SVM story, we'll need to first talk about margins and the idea of separating data with a large "gap." Next, we'll talk about the optimal margin classifier, which will lead us into a digression on Lagrange duality. We'll also see kernels, which give a way to apply SVMs efficiently in very high dimensional (such as infinite-dimensional) feature spaces, and finally, we'll close off the story with the SMO algorithm, which gives an efficient implementation of SVMs.
1 Margins: Intuition
We'll start our story on SVMs by talking about margins. This section will give the intuitions about margins and about the "confidence" of our predictions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability p(y = 1|x; θ) is modeled by h_θ(x) = g(θ^T x). We would then predict "1" on an input x if and only if h_θ(x) ≥ 0.5, or equivalently, if and only if θ^T x ≥ 0. Consider a positive training example (y = 1). The larger θ^T x is, the larger also is h_θ(x) = p(y = 1|x; θ), and thus also the higher our degree of "confidence" that the label is 1. Thus, informally we can think of our prediction as being a very confident one that y = 1 if θ^T x ≫ 0. Similarly, we think of logistic regression as making a very confident prediction of y = 0, if θ^T x ≪ 0. Given a training set, again informally it seems that we'd have found a good fit to the training data if we can find θ so that θ^T x^(i) ≫ 0 whenever y^(i) = 1, and θ^T x^(i) ≪ 0 whenever y^(i) = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we'll soon formalize this idea using the notion of functional margins.
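To make this "confidence" picture concrete, here is a minimal NumPy sketch (not from the notes; θ and the inputs are made-up values) that evaluates h_θ(x) = g(θ^T x) with g the logistic function, and shows that a score θ^T x far from 0 gives a probability far from 0.5:

    import numpy as np

    def sigmoid(z):
        # Logistic function g(z) = 1 / (1 + e^{-z}).
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical parameter vector and inputs, chosen only for illustration.
    theta = np.array([2.0, -1.0])
    x_near = np.array([0.3, 0.5])   # theta^T x close to 0 -> low confidence
    x_far  = np.array([4.0, -3.0])  # theta^T x >> 0       -> high confidence

    for x in (x_near, x_far):
        score = theta @ x                 # theta^T x
        prob = sigmoid(score)             # h_theta(x) = p(y = 1 | x; theta)
        label = 1 if score >= 0 else 0    # predict 1 iff theta^T x >= 0
        print(f"theta^T x = {score:+.2f}, p(y=1|x) = {prob:.3f}, predict {label}")

Running this prints a probability of about 0.52 for the near-boundary point and about 0.99998 for the far one, which is exactly the informal notion of confidence described above.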
For a different type of intuition, consider the following figure, in which x's represent positive training examples, o's denote negative training examples, a decision boundary (this is the line given by the equation θ^T x = 0, and is also called the separating hyperplane) is also shown, and three points have also been labeled A, B and C.
[Figure: positive (x) and negative (o) training examples separated by the decision boundary θ^T x = 0, with three labeled points A, B, and C.]
Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that y = 1 there. Conversely, the point C is very close to the decision boundary, and while it's on the side of the decision boundary on which we would predict y = 1, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be y = 0. Hence, we're much more confident about our prediction at A than at C. The point B lies in-between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions. Again, informally we think it'd be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples. We'll formalize this later using the notion of geometric margins.
2 Notation
To make our discussion of SVMs easier, we'll first need to introduce a new notation for talking about classification. We will be considering a linear classifier for a binary classification problem with labels y and features x. From now on, we'll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels. Also, rather than parameterizing our linear classifier with the vector θ, we will use parameters w, b, and write our classifier as

    h_{w,b}(x) = g(w^T x + b).

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. This "w, b" notation allows us to explicitly treat the intercept term b separately from the other parameters. (We also drop the convention we had previously of letting x_0 = 1 be an extra coordinate in the input feature vector.) Thus, b takes the role of what was previously θ_0, and w takes the role of [θ_1 . . . θ_n]^T.

Note also that, from our definition of g above, our classifier will directly predict either 1 or −1 (cf. the perceptron algorithm), without first going through the intermediate step of estimating the probability of y being 1 (which was what logistic regression did).
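As a small illustrative sketch (again not part of the notes; w, b, and x are invented), this classifier can be written directly as:

    import numpy as np

    def g(z):
        # Threshold function: returns 1 if z >= 0, and -1 otherwise.
        return 1 if z >= 0 else -1

    def h(w, b, x):
        # Linear classifier h_{w,b}(x) = g(w^T x + b); predicts a label in {-1, 1}.
        return g(w @ x + b)

    # Hypothetical parameters and a test point, for illustration only.
    w = np.array([1.0, 2.0])
    b = -1.0
    x = np.array([0.5, 1.0])
    print(h(w, b, x))   # prints 1, since w^T x + b = 0.5 + 2.0 - 1.0 = 1.5 >= 0

Note that, unlike the logistic regression sketch above, the output is a hard label in {−1, 1} with no intermediate probability.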
3 Functional and geometric margins
Let's formalize the notions of the functional and geometric margins. Given a training example (x^(i), y^(i)), we define the functional margin of (w, b) with respect to the training example as

    γ̂^(i) = y^(i) (w^T x^(i) + b).

Note that if y^(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^T x^(i) + b to be a large positive number. Conversely, if y^(i) = −1, then for the functional margin to be large, we need w^T x^(i) + b to be a large negative number. Moreover, if y^(i) (w^T x^(i) + b) > 0, then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and correct prediction.
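A direct translation of this definition into code might look like the following sketch (the parameters and examples are invented for illustration); note that the margin comes out positive exactly when the prediction is correct:

    import numpy as np

    def functional_margin(w, b, x_i, y_i):
        # Functional margin of (w, b) w.r.t. one example: y^(i) (w^T x^(i) + b).
        return y_i * (w @ x_i + b)

    # Hypothetical parameters and two training examples.
    w, b = np.array([1.0, -1.0]), 0.5
    x_pos, y_pos = np.array([3.0, 1.0]), 1    # w^T x + b = 2.5 > 0: correct, margin 2.5
    x_neg, y_neg = np.array([0.0, 2.0]), -1   # w^T x + b = -1.5 < 0: correct, margin 1.5

    for x_i, y_i in [(x_pos, y_pos), (x_neg, y_neg)]:
        m = functional_margin(w, b, x_i, y_i)
        print(f"functional margin = {m:+.2f}, prediction correct: {m > 0}")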
For a linear classifier with the choice of g given above (taking values in {−1, 1}), there's one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of g, we note that if we replace w with 2w and b with 2b, then since g(w^T x + b) = g(2w^T x + 2b), this would not change h_{w,b}(x) at all. I.e., g, and hence also h_{w,b}(x), depends only on the sign, but not on the magnitude, of w^T x + b. However, replacing (w, b) with (2w, 2b) also results in multiplying our functional margin by a factor of 2. Thus, it seems that by exploiting our freedom to scale w and b, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as that ||w||_2 = 1; i.e., we might replace (w, b) with (w/||w||_2, b/||w||_2), and instead consider the functional margin of (w/||w||_2, b/||w||_2). We'll come back to this later.
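The scaling issue is easy to check numerically. In the sketch below (with invented numbers), scaling (w, b) leaves the predicted sign unchanged and scales the functional margin by the same factor, while the margin of the normalized parameters (w/||w||_2, b/||w||_2) stays fixed:

    import numpy as np

    w, b = np.array([1.0, -1.0]), 0.5          # hypothetical parameters
    x, y = np.array([3.0, 1.0]), 1             # hypothetical training example

    def margin(w_, b_):
        return y * (w_ @ x + b_)

    for scale in (1, 2, 10):
        ws, bs = scale * w, scale * b
        norm = np.linalg.norm(ws)
        print(f"scale {scale:2d}: sign = {np.sign(ws @ x + bs):+.0f}, "
              f"functional margin = {margin(ws, bs):6.2f}, "
              f"normalized margin = {margin(ws / norm, bs / norm):.4f}")

The functional margin grows from 2.5 to 25.0 as the scale increases, but the normalized margin stays at about 1.77 throughout, which is the motivation for the normalization condition above.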
Given a training set S = {(x^(i), y^(i)); i = 1, . . . , m}, we also define the functional margin of (w, b) with respect to S to be the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can therefore be written:

    γ̂ = min_{i=1,...,m} γ̂^(i).
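In code, the functional margin with respect to a training set is just the minimum of the per-example margins; here is a short sketch with an invented data set:

    import numpy as np

    def functional_margin_of_set(w, b, X, y):
        # gamma_hat = min_i y^(i) (w^T x^(i) + b) over all training examples.
        return np.min(y * (X @ w + b))

    # Hypothetical training set: rows of X are x^(i), entries of y are y^(i) in {-1, 1}.
    X = np.array([[3.0, 1.0],
                  [0.0, 2.0],
                  [2.0, 2.0]])
    y = np.array([1, -1, 1])
    w, b = np.array([1.0, -1.0]), 0.5
    print(functional_margin_of_set(w, b, X, y))   # 0.5, the smallest per-example margin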
Next, let's talk about geometric margins. Consider the picture below:

[Figure: the decision boundary corresponding to (w, b), with the vector w drawn orthogonal to it; point A is a training example x^(i), and the segment AB of length γ^(i) is its distance to the boundary.]
The decision boundary corresponding to (w, b) is shown, along with the vector w. Note that w is orthogonal (at 90°) to the separating hyperplane. (You should convince yourself that this must be the case.) Consider the point at A, which represents the input x^(i) of some training example with label y^(i) = 1. Its distance to the decision boundary, γ^(i), is given by the line segment AB.

How can we find the value of γ^(i)? Well, w/||w|| is a unit-length vector pointing in the same direction as w. Since A represents x^(i), we therefore find that the point B is given by x^(i) − γ^(i) · w/||w||.
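Although the derivation continues beyond this excerpt, the geometric fact it is heading toward is standard: since w/||w|| is the unit normal to the hyperplane, the distance from x^(i) to w^T x + b = 0 is (w^T x^(i) + b)/||w|| for a point on the positive side. A small sketch (with invented numbers) checks this by stepping from A back to the boundary along w/||w||:

    import numpy as np

    w, b = np.array([1.0, -1.0]), 0.5           # hypothetical separating hyperplane
    x_A = np.array([3.0, 1.0])                  # hypothetical positive example at point A

    u = w / np.linalg.norm(w)                   # unit vector in the direction of w
    gamma = (w @ x_A + b) / np.linalg.norm(w)   # distance from A to the hyperplane
    x_B = x_A - gamma * u                       # point B: x^(i) - gamma^(i) * w/||w||

    print(gamma)                # geometric distance gamma^(i)
    print(w @ x_B + b)          # ~0: B lies on the hyperplane w^T x + b = 0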