Notes and Solutions for: Pattern Recognition by
Sergios Theodoridis and Konstantinos Koutroumbas.
John L. Weatherwax∗

January 19, 2006
Introduction
Here you’ll find some notes that I wrote up as I worked through this excellent book. I’ve worked hard to make these notes as good as I can, but I have no illusions that they are perfect. If you feel that there is a better way to accomplish or explain an exercise or derivation presented in these notes, or that one or more of the explanations is unclear, incomplete, or misleading, please tell me. If you find an error of any kind – technical, grammatical, typographical, whatever – please tell me that, too. I’ll gladly add to the acknowledgments in later printings the name of the first person to bring each problem to my attention.
Acknowledgments
Special thanks to (most recent comments are listed first): Karonny F, Mohammad Heshajin
for helping improve these notes and solutions.
All comments (no matter how small) are much appreciated. In fact, if you find these notes useful I would appreciate a contribution in the form of a solution to a problem that I did not work, a mathematical derivation of a statement or comment made in the book that was unclear, a piece of code that implements one of the algorithms discussed, or a correction to a typo (spelling, grammar, etc.) in these notes. Sort of a “take a penny, leave a penny” type of approach. Remember: pay it forward.
∗ wax@alum.mit.edu
Classifiers Based on Bayes Decision Theory
Notes on the text
Minimizing the average risk
The symbol $r_k$ is the expected risk associated with observing an object from class $k$. This risk is divided up into parts that depend on what we then do when an object from class $k$ with feature vector $x$ is observed. Now we only observe the feature vector $x$ and not the true class label $k$. Since we must still perform an action when we observe $x$, let $\lambda_{ki}$ represent the loss associated with the event that the object is truly from class $k$ and we decided that it is from class $i$. Define $r_k$ as the expected loss when an object of type $k$ is presented to us. Then
\[
r_k = \sum_{i=1}^{M} \lambda_{ki} \, P(\text{we classify this object as a member of class } i)
    = \sum_{i=1}^{M} \lambda_{ki} \int_{R_i} p(x|\omega_k) \, dx \,,
\]
which is the book's equation 2.14. Thus the total risk $r$ is the expected value of the class-dependent risks $r_k$, taking into account how likely each class is, or
\[
r = \sum_{k=1}^{M} r_k P(\omega_k)
  = \sum_{k=1}^{M} \sum_{i=1}^{M} \lambda_{ki} \int_{R_i} p(x|\omega_k) P(\omega_k) \, dx
  = \sum_{i=1}^{M} \int_{R_i} \left( \sum_{k=1}^{M} \lambda_{ki} \, p(x|\omega_k) P(\omega_k) \right) dx \,. \tag{1}
\]
The decision rule that leads to the smallest total risk is obtained by selecting $R_i$ to be the region of feature space in which the integrand above is as small as possible. That is, $R_i$ should be defined as the values of $x$ such that for that value of $i$ we have
\[
\sum_{k=1}^{M} \lambda_{ki} \, p(x|\omega_k) P(\omega_k) < \sum_{k=1}^{M} \lambda_{kj} \, p(x|\omega_k) P(\omega_k) \quad \forall j \,.
\]
In words: the index $i$, when put in the sum above, gives the smallest value when compared to all other possible choices. For these values of $x$ we should select class $\omega_i$ as our classification decision.
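As a quick illustration of this rule, here is a minimal Python sketch (my own addition; the losses, priors, and class-conditional densities are made-up numbers, not from the book) that, for each observed $x$, picks the class $i$ minimizing $\sum_k \lambda_{ki} \, p(x|\omega_k) P(\omega_k)$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class, one-dimensional example (all numbers made up).
priors = np.array([0.6, 0.4])              # P(omega_k)
lam = np.array([[0.0, 1.0],                # lambda_{ki}: loss when the true class
                [2.0, 0.0]])               # is k and we decide class i
likelihoods = [norm(loc=0.0, scale=1.0),   # p(x | omega_1)
               norm(loc=2.0, scale=1.0)]   # p(x | omega_2)

def classify(x):
    # Weighted likelihoods p(x|omega_k) P(omega_k) for each true class k.
    p = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
    risks = lam.T @ p                      # risks[i] = sum_k lam[k, i] * p[k]
    return np.argmin(risks)                # the smallest integrand wins the region

print([classify(x) for x in (-1.0, 0.5, 1.0, 3.0)])
```

Note that with a zero-one loss, $\lambda_{ki} = 1 - \delta_{ki}$, this rule reduces to the usual maximum a posteriori classification.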
Bayesian classification with normal distributions
When the covariance matrices for two classes are the same and diagonal, i.e. $\Sigma_i = \Sigma_j = \sigma^2 I$, then the discriminant functions $g_{ij}(x)$ are given by
\[
g_{ij}(x) = w^T (x - x_0) = (\mu_i - \mu_j)^T (x - x_0) \,, \tag{2}
\]
since the vector $w$ is $w = \mu_i - \mu_j$ in this case. Note that the point $x_0$ is on the decision hyperplane, i.e. satisfies $g_{ij}(x) = 0$, since $g_{ij}(x_0) = w^T(x_0 - x_0) = 0$. Let $x$ be another point on the decision hyperplane; then $x - x_0$ is a vector in the decision hyperplane. Since $x$ is a point on the decision hyperplane it also must satisfy $g_{ij}(x) = 0$. From the functional form for $g_{ij}(\cdot)$ and the definition of $w$ this means that
\[
w^T (x - x_0) = (\mu_i - \mu_j)^T (x - x_0) = 0 \,.
\]
This is the statement that the line connecting $\mu_i$ and $\mu_j$ is orthogonal to the decision hyperplane. In the same way, when the covariance matrices of each class are not diagonal but are nevertheless equal, $\Sigma_i = \Sigma_j = \Sigma$, the same logic that we used above states that the decision hyperplane is again orthogonal to the vector $w$, which in this case is $\Sigma^{-1}(\mu_i - \mu_j)$.
The magnitude of $P(\omega_i)$ relative to $P(\omega_j)$ influences how close the decision hyperplane is to the respective class means $\mu_i$ or $\mu_j$, in the sense that the class with the larger a priori probability will have a “larger” region of $\mathbb{R}^l$ assigned to it for classification. For example, if $P(\omega_i) < P(\omega_j)$ then $\ln\frac{P(\omega_i)}{P(\omega_j)} < 0$, so the point $x_0$, which in the case $\Sigma_i = \Sigma_j = \Sigma$ is given by
\[
x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right) \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2} \,, \tag{3}
\]
we can write as
\[
x_0 = \frac{1}{2}(\mu_i + \mu_j) + \alpha (\mu_i - \mu_j) \,,
\]
with the value of $\alpha > 0$. Since $\mu_i - \mu_j$ is a vector from $\mu_j$ to $\mu_i$, the expression for $x_0$ above starts at the midpoint $\frac{1}{2}(\mu_i + \mu_j)$ and moves closer to $\mu_i$, meaning that the amount of $\mathbb{R}^l$ assigned to class $\omega_j$ is “larger” than the amount assigned to class $\omega_i$. This is expected since the prior probability of class $\omega_j$ is larger than that of $\omega_i$.
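A small numeric check of equation (3) is below (my own sketch; the means, shared covariance, and priors are made-up values). It verifies that $x_0$ slides from the midpoint toward the mean of the less probable class, and that both class means fall on the correct side of the hyperplane:

```python
import numpy as np

# Hypothetical parameters for two Gaussian classes sharing a covariance.
mu_i = np.array([0.0, 0.0])
mu_j = np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
P_i, P_j = 0.3, 0.7                      # P(omega_i) < P(omega_j)

Sigma_inv = np.linalg.inv(Sigma)
d = mu_i - mu_j
norm_sq = d @ Sigma_inv @ d              # ||mu_i - mu_j||^2 in the Sigma^{-1} norm
x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) * d / norm_sq

w = Sigma_inv @ d                        # normal vector of the decision hyperplane
g = lambda x: w @ (x - x0)               # discriminant g_ij(x) = w^T (x - x0)

midpoint = 0.5 * (mu_i + mu_j)
print("x0 shifted toward mu_i:",
      np.linalg.norm(x0 - mu_i) < np.linalg.norm(midpoint - mu_i))
print("g(mu_i) > 0 and g(mu_j) < 0:", g(mu_i) > 0, g(mu_j) < 0)
```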
Notes on Example 2.2
To see the final lengths of the principal axes we start with the transformed equation of constant Mahalanobis distance of $d_m = \sqrt{2.952}$, or
\[
\frac{(x_1' - \mu_{11}')^2}{\lambda_1} + \frac{(x_2' - \mu_{12}')^2}{\lambda_2} = (\sqrt{2.952})^2 = 2.952 \,.
\]
Since we want the principal axes about $(0, 0)$ we have $\mu_{11}' = \mu_{12}' = 0$, and $\lambda_1$ and $\lambda_2$ are the eigenvalues given by solving $|\Sigma - \lambda I| = 0$. In this case we get $\lambda_1 = 1$ (in direction $v_1$) and $\lambda_2 = 2$ (in direction $v_2$). Then the above becomes, in “standard form” for a conic section,
\[
\frac{(x_1')^2}{2.952\,\lambda_1} + \frac{(x_2')^2}{2.952\,\lambda_2} = 1 \,.
\]
From this expression we can read off the lengths of the principal axes:
\[
2\sqrt{2.952\,\lambda_1} = 2\sqrt{2.952} = 3.43627
\]
\[
2\sqrt{2.952\,\lambda_2} = 2\sqrt{2.952(2)} = 4.85962 \,.
\]
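These axis lengths are easy to verify numerically. The sketch below is my own; the covariance matrix is an assumed stand-in chosen to have eigenvalues 1 and 2, matching the example:

```python
import numpy as np

# A stand-in covariance matrix with eigenvalues 1 and 2, as in the example.
Sigma = np.array([[1.5, 0.5],
                  [0.5, 1.5]])
d_m_sq = 2.952                             # squared Mahalanobis distance of the ellipse

eigvals, eigvecs = np.linalg.eigh(Sigma)   # solves |Sigma - lambda I| = 0
axis_lengths = 2 * np.sqrt(d_m_sq * eigvals)
print(eigvals)                             # [1. 2.]
print(axis_lengths)                        # [3.43627... 4.85962...]
```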
Maximum A Posteriori (MAP) Estimation: Example 2.4
We will derive the MAP estimate of the population mean $\mu$ when given $N$ samples $x_k$ distributed as $p(x|\mu)$ and a normal prior on $\mu$, i.e. $N(\mu_0, \sigma_\mu^2 I)$. Then the estimate of the population mean $\mu$ given the sample $X \equiv \{x_k\}_{k=1}^{N}$ is proportional to
\[
p(\mu|X) \propto p(\mu) p(X|\mu) = p(\mu) \prod_{k=1}^{N} p(x_k|\mu) \,.
\]
Note that we have written $p(\mu)$ on the outside of the product terms since it should only appear once, and not $N$ times as might be inferred had we written the product as $\prod_{k=1}^{N} p(\mu) p(x_k|\mu)$. To find the value of $\mu$ that maximizes this we begin by taking the natural log of the expression above, taking the $\mu$ derivative, and setting the resulting expression equal to zero. We find the natural log of the above given by
\[
\ln(p(\mu)) + \sum_{k=1}^{N} \ln(p(x_k|\mu)) = -\frac{1}{2} \frac{\|\mu - \mu_0\|^2}{\sigma_\mu^2} - \frac{1}{2} \sum_{k=1}^{N} (x_k - \mu)^T \Sigma^{-1} (x_k - \mu) \,.
\]
Then taking the derivative with respect to $\mu$, setting the result equal to zero, and calling that solution $\hat{\mu}$ gives
\[
-\frac{1}{\sigma_\mu^2} (\hat{\mu} - \mu_0) + \frac{1}{\sigma^2} \sum_{k=1}^{N} (x_k - \hat{\mu}) = 0 \,,
\]
where we have assumed that the density $p(x|\mu)$ is $N(\mu, \Sigma)$ with $\Sigma = \sigma^2 I$. When we solve for $\hat{\mu}$ in the above we get
\[
\hat{\mu} = \frac{\frac{1}{\sigma_\mu^2} \mu_0 + \frac{1}{\sigma^2} \sum_{k=1}^{N} x_k}{\frac{N}{\sigma^2} + \frac{1}{\sigma_\mu^2}}
          = \frac{\mu_0 + \frac{\sigma_\mu^2}{\sigma^2} \sum_{k=1}^{N} x_k}{1 + \frac{\sigma_\mu^2}{\sigma^2} N} \,. \tag{4}
\]
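Equation (4) is easy to spot-check numerically. Here is a minimal sketch (my own; the scalar setup and all parameter values are made up) comparing the closed form against a brute-force grid maximization of the log posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar setup: p(x|mu) = N(mu, sigma^2), prior mu ~ N(mu0, sigma_mu^2).
mu_true, sigma, mu0, sigma_mu, N = 2.0, 1.0, 0.0, 0.5, 20
x = rng.normal(mu_true, sigma, size=N)

# Closed-form MAP estimate from equation (4).
ratio = sigma_mu**2 / sigma**2
mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * N)

# Brute-force check: maximize the log posterior on a fine grid.
grid = np.linspace(-1.0, 4.0, 100001)
log_post = (-0.5 * (grid - mu0)**2 / sigma_mu**2
            - 0.5 * ((x[:, None] - grid[None, :])**2).sum(axis=0) / sigma**2)
print(mu_map, grid[np.argmax(log_post)])   # agree to within the grid spacing
```

Note also the usual shrinkage behavior visible in equation (4): as $N \to \infty$ the estimate approaches the sample mean, while for small $N$ it is pulled toward the prior mean $\mu_0$.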
Maximum Entropy Estimation
As another method to determine distribution parameters we seek to maximize the entropy $H$, or
\[
H = -\int_{X} p(x) \ln(p(x)) \, dx \,. \tag{5}
\]
This is equivalent to minimizing its negative, $\int_{X} p(x) \ln(p(x)) \, dx$. To incorporate the constraint that the density must integrate to one, we form the entropy Lagrangian
\[
H_L = \int_{x_1}^{x_2} p(x) \ln(p(x)) \, dx - \lambda \left( \int_{x_1}^{x_2} p(x) \, dx - 1 \right) \,,
\]
where we have assumed that our density is non-zero only over $[x_1, x_2]$. The negative of the above is equivalent to
\[
-H_L = -\int_{x_1}^{x_2} p(x)(\ln(p(x)) - \lambda) \, dx - \lambda \,.
\]
Taking the $p(x)$ derivative and setting it equal to zero gives
\[
\frac{\partial(-H_L)}{\partial p} = -\int_{x_1}^{x_2} \left[ (\ln(p) - \lambda) + p \cdot \frac{1}{p} \right] dx
 = -\int_{x_1}^{x_2} [\ln(p) - \lambda + 1] \, dx = 0 \,.
\]
Solving for the integral of $\ln(p(x))$ we get
\[
\int_{x_1}^{x_2} \ln(p(x)) \, dx = (\lambda - 1)(x_2 - x_1) \,.
\]
Take the $x_2$ derivative of this expression and we find
\[
\ln(p(x_2)) = \lambda - 1 \;\Rightarrow\; p(x_2) = e^{\lambda - 1} \,.
\]
To find the value of $\lambda$ we put this expression into our constraint of $\int_{x_1}^{x_2} p(x) \, dx = 1$ to get
\[
e^{\lambda - 1} (x_2 - x_1) = 1 \,,
\]
or $\lambda - 1 = \ln\left(\frac{1}{x_2 - x_1}\right)$, thus
\[
p(x) = \exp\left( \ln\left( \frac{1}{x_2 - x_1} \right) \right) = \frac{1}{x_2 - x_1} \,,
\]
a uniform distribution.
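As a sanity check on this result, the following sketch (my own; it discretizes densities on an assumed interval $[0, 1]$) confirms that the uniform density has larger entropy than randomly chosen alternatives:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
dx = 1.0 / n                              # grid spacing over the interval [0, 1]

def entropy(p):
    # H = -integral p(x) ln(p(x)) dx, approximated on the grid.
    return -np.sum(p * np.log(p)) * dx

uniform = np.ones(n)                      # p(x) = 1 on [0, 1]; entropy exactly 0
for _ in range(5):
    q = rng.uniform(0.1, 2.0, size=n)     # a random positive density ...
    q /= q.sum() * dx                     # ... normalized to integrate to one
    print(entropy(q) <= entropy(uniform)) # True every time
```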
Problem Solutions
Problem 2.1 (the Bayes’ rule minimizes the probability of error)
Following the hint in the book, the probability of correct classification $P_c$ is given by
\[
P_c = \sum_{i=1}^{M} P(x \in R_i, \omega_i) \,,
\]
since in order to be correct when $x \in R_i$ the sample that generated $x$ must come from the class $\omega_i$. Now this joint probability is given by
\[
P(x \in R_i, \omega_i) = P(x \in R_i | \omega_i) P(\omega_i)
 = \left( \int_{R_i} p(x|\omega_i) \, dx \right) P(\omega_i) \,.
\]
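Although the rest of this solution continues from here, the claim itself is easy to spot-check by simulation. The sketch below is my own, with two made-up Gaussian classes and equal priors; the Bayes rule (thresholding at the likelihood crossover) beats any shifted threshold:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-class problem: equal priors, N(0, 1) versus N(2, 1).
N = 200_000
labels = rng.integers(0, 2, size=N)          # true class of each sample
x = rng.normal(2.0 * labels, 1.0)            # class 0 -> N(0,1), class 1 -> N(2,1)

def accuracy(threshold):
    # Decide class 1 whenever x exceeds the threshold.
    return np.mean((x > threshold) == labels)

# With equal priors the Bayes rule thresholds where the likelihoods cross: x = 1.
print(accuracy(1.0))                         # the largest empirical accuracy
print(accuracy(0.5), accuracy(1.5))          # any other threshold does worse
```

This is evidence rather than proof, of course; the analytical argument proceeds by summing the joint probabilities above over the regions $R_i$.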