Sequential Minimal Optimization for SVM
Contents
1 Introduction to Support Vector Machine (SVM) . . . . . . . . . . . . . 2
  1.1 Linear SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
  1.2 The dual problem . . . . . . . . . . . . . . . . . . . . . . . . . 3
  1.3 Non-linear SVM . . . . . . . . . . . . . . . . . . . . . . . . . . 4
  1.4 Imperfect separation . . . . . . . . . . . . . . . . . . . . . . . 5
  1.5 The KKT conditions . . . . . . . . . . . . . . . . . . . . . . . . 6
  1.6 Checking KKT condition without using threshold b . . . . . . . . . 7
2 SMO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
  2.1 Optimize two α_i's . . . . . . . . . . . . . . . . . . . . . . . . 9
  2.2 SMO Algorithm: Updating after a successful optimization step . . . 13
  2.3 SMO Algorithm: Pick two α_i's for optimization . . . . . . . . . . 14
3 C++ Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 15
  3.1 The main routine . . . . . . . . . . . . . . . . . . . . . . . . . 15
  3.2 The examineExample routine . . . . . . . . . . . . . . . . . . . . 18
  3.3 The takeStep routine . . . . . . . . . . . . . . . . . . . . . . . 20
  3.4 Evaluating classification function . . . . . . . . . . . . . . . . 24
  3.5 Functions to compute dot product . . . . . . . . . . . . . . . . . 26
  3.6 Kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . 28
  3.7 Input and output . . . . . . . . . . . . . . . . . . . . . . . . . 29
    3.7.1 Get parameters by command line . . . . . . . . . . . . . . . . 29
    3.7.2 Read in data . . . . . . . . . . . . . . . . . . . . . . . . . 31
    3.7.3 Saving and loading model parameters . . . . . . . . . . . . .  34
  3.8 Compute error rate . . . . . . . . . . . . . . . . . . . . . . . . 36
  3.9 Multiclass . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
  3.10 Makefiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
A The weight vectors of the parallel supporting planes . . . . . . . . . 41
B The objective function of the dual problem . . . . . . . . . . . . . . 42
Abstract

This is a C++ implementation of John C. Platt's sequential minimal optimization (SMO) for training a support vector machine (SVM). This program is based on the pseudocode in Platt (1998).

This is both the documentation and the C++ code. It is a NUWEB document from which both the LaTeX file and the C++ file can be generated. The documentation is essentially my notes when reading the papers (most of them being cut-and-paste from the papers).
1 Introduction to Support Vector Machine (SVM)
This introduction to the Support Vector Machine for binary classification is based on Burges (1998).
1.1 Linear SVM
First let us look at the linear support vector machine. It is based on the idea of a hyperplane classifier, or linear separability.
Suppose we have $N$ training data points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$. We would like to learn a linear separating hyperplane classifier:
$$f(x) = \mathrm{sgn}(w \cdot x - b).$$
Furthermore, we want this hyperplane to have the maximum separating margin with respect to the two classes. Specifically, we want to find this hyperplane $H: w \cdot x - b = 0$ and two hyperplanes parallel to it and with equal distances to it,
$$H_1: w \cdot x - b = +1,$$
$$H_2: w \cdot x - b = -1,$$
with the condition that there are no data points between $H_1$ and $H_2$, and the distance between $H_1$ and $H_2$ is maximized.

For any separating plane $H$ and the corresponding $H_1$ and $H_2$, we can always "normalize" the coefficient vector $w$ so that $H_1$ is $w \cdot x - b = +1$ and $H_2$ is $w \cdot x - b = -1$. See Appendix A for details.
We want to maximize the distance between $H_1$ and $H_2$. So there will be some positive examples on $H_1$ and some negative examples on $H_2$. These examples are called support vectors because only they participate in the definition of the separating hyperplane, and other examples can be removed and/or moved around as long as they do not cross the planes $H_1$ and $H_2$.
Recall that in 2-D, the distance from a point $(x_0, y_0)$ to a line $Ax + By + C = 0$ is $\frac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}}$. Similarly, the distance of a point on $H_1$ to $H: w \cdot x - b = 0$ is $\frac{|w \cdot x - b|}{\|w\|} = \frac{1}{\|w\|}$, and the distance between $H_1$ and $H_2$ is $\frac{2}{\|w\|}$. So, in order to maximize the distance, we should minimize $\|w\|$, or equivalently $\|w\|^2 = w^T w$, with the condition that there are no data points between $H_1$ and $H_2$:
$$w \cdot x_i - b \ge +1 \quad \text{for positive examples } y_i = +1,$$
$$w \cdot x_i - b \le -1 \quad \text{for negative examples } y_i = -1.$$
These two conditions can be combined into
$$y_i (w \cdot x_i - b) \ge 1.$$
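To make this combined condition concrete, here is a minimal C++ sketch, not taken from the program listed later, that checks whether a candidate $(w, b)$ places every training point on or outside the planes $H_1$ and $H_2$; the names dot and satisfiesMarginConstraints are illustrative, and the dot helper is reused by the later sketches in this section.

    #include <vector>
    #include <cstddef>

    // Dot product of two feature vectors of the same dimension.
    double dot(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
        return s;
    }

    // True if every training point satisfies y_i (w . x_i - b) >= 1,
    // i.e. no point lies strictly between the planes H_1 and H_2.
    bool satisfiesMarginConstraints(const std::vector<double>& w, double b,
                                    const std::vector<int>& y,
                                    const std::vector<std::vector<double>>& x) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            if (y[i] * (dot(w, x[i]) - b) < 1.0) return false;
        }
        return true;
    }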
So our problem can be formulated as
$$\min_{w,\, b} \ \frac{1}{2} w^T w \quad \text{subject to} \quad y_i (w \cdot x_i - b) \ge 1.$$
This is a convex, quadratic programming problem (in $w$, $b$), in a convex set.

Introducing Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_N \ge 0$, we have the following Lagrangian:
$$L(w, b, \alpha) \equiv \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i - b) + \sum_{i=1}^{N} \alpha_i.$$
1.2 The dual problem
We can solve the Wolfe dual instead: maximize $L(w, b, \alpha)$ with respect to $\alpha$, subject to the constraints that the gradient of $L(w, b, \alpha)$ with respect to the primal variables $w$ and $b$ vanish:
$$\frac{\partial L}{\partial w} = 0, \qquad (1)$$
$$\frac{\partial L}{\partial b} = 0, \qquad (2)$$
and that
$$\alpha \ge 0.$$
From Equations 1 and 2, we have
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i,$$
$$\sum_{i=1}^{N} \alpha_i y_i = 0.$$
Substituting them into $L(w, b, \alpha)$, we have
$$L_D \equiv \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j),$$
in which the primal variables are eliminated.
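As a small illustration, and not the program's own code, the dual objective $L_D$ can be evaluated for a candidate $\alpha$ as follows; dualObjective is an assumed name, and the sketch reuses the dot helper introduced above.

    // Dual objective L_D = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j).
    double dualObjective(const std::vector<double>& alpha,
                         const std::vector<int>& y,
                         const std::vector<std::vector<double>>& x) {
        double sumAlpha = 0.0, quad = 0.0;
        for (std::size_t i = 0; i < alpha.size(); ++i) {
            sumAlpha += alpha[i];
            for (std::size_t j = 0; j < alpha.size(); ++j)
                quad += alpha[i] * alpha[j] * y[i] * y[j] * dot(x[i], x[j]);
        }
        return sumAlpha - 0.5 * quad;
    }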
When we have solved for the $\alpha_i$, we can get $w = \sum_{i=1}^{N} \alpha_i y_i x_i$ (we will later show how to compute the threshold $b$), and we can classify a new object $x$ with
$$f(x) = \mathrm{sgn}(w \cdot x - b) = \mathrm{sgn}\left(\left(\sum_{i=1}^{N} \alpha_i y_i x_i\right) \cdot x - b\right) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) - b\right).$$
Please note that in the objective function and in the solution, the training vectors $x_i$ occur only in the form of dot products.
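Because only dot products appear, the classifier can be evaluated directly from the $\alpha_i$, the labels, and the training vectors. A hedged sketch of this evaluation (classify is an assumed name; the program's actual routine is given in Section 3.4), again reusing the dot helper from above:

    // Sketch of f(x) = sgn( sum_i alpha_i y_i (x_i . x) - b ).
    // Only training points with alpha_i > 0 (the support vectors) contribute.
    int classify(const std::vector<double>& xNew,
                 const std::vector<double>& alpha,
                 const std::vector<int>& y,
                 const std::vector<std::vector<double>>& x,
                 double b) {
        double s = -b;
        for (std::size_t i = 0; i < alpha.size(); ++i) {
            if (alpha[i] > 0.0)
                s += alpha[i] * y[i] * dot(x[i], xNew);
        }
        return s >= 0.0 ? +1 : -1;
    }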
Before going into the details of how to solve this quadratic programming problem, let's extend it in two directions.
1.3 Non-linear SVM
What if the surface separating the two classes is not linear? Well, we can transform the data points to another, high-dimensional space such that the data points become linearly separable. Let the transformation be $\Phi(\cdot)$. In the high-dimensional space, we solve
$$L_D \equiv \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \Phi(x_i) \cdot \Phi(x_j).$$
Suppose, in addition, that $\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j)$. That is, the dot product in that high-dimensional space is equivalent to a kernel function of the input space. So we need not be explicit about the transformation $\Phi(\cdot)$ as long as we know that the kernel function $K(x_i, x_j)$ is equivalent to the dot product of some other high-dimensional space. There are many kernel functions that can be used this way, for example, the radial basis function (Gaussian kernel)
$$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}.$$
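For instance, the Gaussian kernel above can be coded as a small helper. This is a sketch only; rbfKernel and its sigma parameter are illustrative names, and the program's own kernel functions appear in Section 3.6.

    #include <vector>
    #include <cstddef>
    #include <cmath>

    // K(x_i, x_j) = exp( -||x_i - x_j||^2 / (2 sigma^2) ).
    double rbfKernel(const std::vector<double>& xi,
                     const std::vector<double>& xj,
                     double sigma) {
        double sq = 0.0;
        for (std::size_t k = 0; k < xi.size(); ++k) {
            double d = xi[k] - xj[k];
            sq += d * d;                      // accumulate squared distance
        }
        return std::exp(-sq / (2.0 * sigma * sigma));
    }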
Mercer's condition can be used to determine whether a function can be used as a kernel function: there exists a mapping $\Phi$ and an expansion
$$K(x, y) = \sum_i \Phi(x)_i \, \Phi(y)_i$$
if and only if, for any $g(x)$ such that $\int g(x)^2 \, dx$ is finite, we have
$$\int K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0.$$
1.4 Imperfect separation
The other direction in which to extend the SVM is to allow for noise, or imperfect separation. That is, we do not strictly enforce that there be no data points between $H_1$ and $H_2$, but we definitely want to penalize the data points that cross the boundaries. The penalty $C$ will be finite. (If $C = \infty$, we come back to the original perfectly separating case.)
We introduce non-negative slack variables $\xi_i \ge 0$, so that
$$w \cdot x_i - b \ge +1 - \xi_i \quad \text{for } y_i = +1,$$
$$w \cdot x_i - b \le -1 + \xi_i \quad \text{for } y_i = -1,$$
$$\xi_i \ge 0, \quad \forall i,$$
and we add to the objective function a penalizing term:
$$\min_{w,\, b,\, \xi} \ \frac{1}{2} w^T w + C \left(\sum_i \xi_i\right)^m,$$
where $m$ is usually set to 1, which gives us
$$\min_{w,\, b,\, \xi_i} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i$$
$$\text{subject to} \quad y_i (w^T x_i - b) + \xi_i - 1 \ge 0, \quad 1 \le i \le N,$$
$$\xi_i \ge 0, \quad 1 \le i \le N.$$
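To see what the slack variables measure, here is a sketch, again with assumed names and reusing the dot helper from the earlier sketch rather than the program's own routines, that computes the smallest feasible slack $\xi_i = \max(0,\, 1 - y_i (w \cdot x_i - b))$ for each point and the penalized objective with $m = 1$:

    #include <algorithm>

    // Soft-margin objective (1/2) w^T w + C * sum_i xi_i, with
    // xi_i = max(0, 1 - y_i (w . x_i - b)) the smallest feasible slack.
    double softMarginObjective(const std::vector<double>& w, double b, double C,
                               const std::vector<int>& y,
                               const std::vector<std::vector<double>>& x) {
        double slackSum = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            double margin = y[i] * (dot(w, x[i]) - b);
            // Slack is zero when the point lies on the correct side of its plane.
            slackSum += std::max(0.0, 1.0 - margin);
        }
        return 0.5 * dot(w, w) + C * slackSum;
    }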