Old and New Matrix Algebra Useful for Statistics
Thomas P. Minka
Contents

1 Derivatives
2 Kronecker product and vec
3 Vec-transpose
4 Multilinear forms
5 Hadamard product and diag
6 Inverting partitioned matrices
7 Polar decomposition
8 Hessians
Warning: This paper contains a large number of matrix identities which cannot be absorbed
by mere reading. The reader is encouraged to take time and check each equation by hand and
work out the examples. This is advanced material; see Searle (1982) for basic results.
1 Derivatives
Maximum-likelihood problems almost always require derivatives. There are six kinds of derivatives that can be expressed as matrices:
$$
\begin{array}{c|ccc}
 & \text{Scalar } y & \text{Vector } \mathbf{y} & \text{Matrix } Y \\
\hline
\text{Scalar } x & \dfrac{dy}{dx} & \dfrac{d\mathbf{y}}{dx} = \left[\dfrac{\partial y_i}{\partial x}\right] & \dfrac{dY}{dx} = \left[\dfrac{\partial y_{ij}}{\partial x}\right] \\[2ex]
\text{Vector } \mathbf{x} & \dfrac{dy}{d\mathbf{x}} = \left[\dfrac{\partial y}{\partial x_j}\right] & \dfrac{d\mathbf{y}}{d\mathbf{x}} = \left[\dfrac{\partial y_i}{\partial x_j}\right] & \\[2ex]
\text{Matrix } X & \dfrac{dy}{dX} = \left[\dfrac{\partial y}{\partial x_{ji}}\right] & &
\end{array}
$$
The partials with respect to the numerator are laid out according to the shape of $Y$ while the
partials with respect to the denominator are laid out according to the transpose of $X$. Each
of these derivatives can be tediously computed via partials, but this section shows how they
instead can be computed with matrix manipulations. The material is based on Magnus and
Neudecker (1988).
Define the differential $dy(x)$ to be that part of $y(x + dx) - y(x)$ which is linear in $dx$. Unlike the
classical definition in terms of limits, this definition applies even when $x$ or $y$ are not scalars.
For example, this equation:

$$y(x + dx) = y(x) + A\,dx + (\text{higher order terms}) \tag{1}$$
is well-defined for any $y$ satisfying certain continuity properties. The matrix $A$ is the derivative,
as you can check by setting all but one component of $dx$ to zero and making it small. The
matrix $A$ is also called the Jacobian matrix $J_{x \to y}$. Its transpose is the gradient of $y$, denoted
$\nabla y$. The Jacobian is useful in calculus while the gradient is useful in optimization.
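To see this concretely, here is a minimal numeric sketch (in NumPy, with a test function and step size of my own choosing, not from the text): for small $dx$ the residual $y(x + dx) - y(x) - A\,dx$ is quadratic in $\|dx\|$, confirming that the linear coefficient $A$ is the Jacobian.

```python
import numpy as np

# Test function y(x) = (x0*x1, sin(x0)), chosen only for illustration.
def y(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian(x):
    # A[i, j] = dy_i/dx_j, the matrix A in equation (1).
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([0.7, -1.3])
dx = 1e-6 * np.array([0.4, 0.9])
residual = y(x + dx) - y(x) - jacobian(x) @ dx
print(np.linalg.norm(residual))  # ~1e-13, i.e. second order in ||dx||
```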
Therefore, the derivative of any expression involving matrices can be computed in two steps:

1. compute the differential
2. massage the result into canonical form

after which the derivative is immediately read off as the coefficient of $dx$, $d\mathbf{x}$, or $dX$.
The differential of an expression can be computed by iteratively applying the following rules:

$$dA = 0 \quad (\text{for constant } A) \tag{2}$$
$$d(\alpha X) = \alpha\,dX \tag{3}$$
$$d(X + Y) = dX + dY \tag{4}$$
$$d(\operatorname{tr}(X)) = \operatorname{tr}(dX) \tag{5}$$
$$d(XY) = (dX)Y + X\,dY \tag{6}$$
$$d(X \otimes Y) = (dX) \otimes Y + X \otimes dY \quad (\text{see section 2}) \tag{7}$$
$$d(X \circ Y) = (dX) \circ Y + X \circ dY \quad (\text{see section 5}) \tag{8}$$
$$dX^{-1} = -X^{-1}(dX)X^{-1} \tag{9}$$
$$d|X| = |X|\operatorname{tr}(X^{-1}\,dX) \tag{10}$$
$$d\log|X| = \operatorname{tr}(X^{-1}\,dX) \tag{11}$$
$$dX^{\star} = (dX)^{\star} \tag{12}$$
where $\star$ is any operator that rearranges elements, e.g. transpose, vec, and vec-transpose (section 3). The rules can be iteratively applied because of the chain rule, e.g. $d(AX + Y) = d(AX) + dY = A\,dX + (dA)X + dY = A\,dX + dY$. Most of these rules can be derived by subtracting $F(X + dX) - F(X)$ and taking the linear part. For example,
$$(X + dX)(Y + dY) = XY + (dX)Y + X\,dY + (dX)(dY)$$

from which (6) follows.
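This can be verified numerically; in the sketch below (arbitrary random matrices, not from the text), the increment of $XY$ matches the differential $(dX)Y + X\,dY$ up to the second-order term $(dX)(dY)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
dX, dY = 1e-6 * rng.standard_normal((3, 3)), 1e-6 * rng.standard_normal((3, 3))

exact = (X + dX) @ (Y + dY) - X @ Y    # full increment of XY
linear = dX @ Y + X @ dY               # the differential from rule (6)
print(np.max(np.abs(exact - linear)))  # ~1e-12: only the (dX)(dY) term remains
```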
To derive $dX^{-1}$, note that

$$0 = dI = d(X^{-1}X) = (dX^{-1})X + X^{-1}\,dX$$

from which (9) follows.
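A similar sketch checks rule (9) (again with an arbitrary, well-conditioned $X$ of my own choosing): the increment of $X^{-1}$ matches $-X^{-1}(dX)X^{-1}$ to first order.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # diagonally shifted so X is invertible
dX = 1e-6 * rng.standard_normal((4, 4))

exact = np.linalg.inv(X + dX) - np.linalg.inv(X)
linear = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)  # rule (9)
print(np.max(np.abs(exact - linear)))               # ~1e-13: agreement to first order
```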
The next step is to massage the differential into one of the six canonical forms:

$$
\begin{array}{lll}
dy = a\,dx & d\mathbf{y} = \mathbf{a}\,dx & dY = A\,dx \\
dy = \mathbf{a}'\,d\mathbf{x} & d\mathbf{y} = A\,d\mathbf{x} & dy = \operatorname{tr}(A\,dX)
\end{array}
$$
This is where the operators and identities developed in the following sections are useful. For
example, since the derivative of $Y$ with respect to $X$ cannot be represented by a matrix, it
is customary to use $d\operatorname{vec}(Y)/d\operatorname{vec}(X)$ instead (vec is defined in section 2). If the purpose of
differentiation is to equate the derivative to zero, then this transformation doesn't affect the
result. So after expanding the differential, just take vec of both sides and use the identities in
sections 2 and 3 to get it into canonical form.
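For instance, for $Y = AXB$ the identity $\operatorname{vec}(AXB) = (B' \otimes A)\operatorname{vec}(X)$ of section 2 gives $d\operatorname{vec}(Y)/d\operatorname{vec}(X) = B' \otimes A$. A minimal sketch, assuming the column-stacking vec convention:

```python
import numpy as np

def vec(M):
    # Column-stacking vec, as in section 2.
    return M.reshape(-1, order='F')

rng = np.random.default_rng(2)
A, X, B = (rng.standard_normal((3, 3)) for _ in range(3))

# vec(AXB) = (B' kron A) vec(X), so d vec(Y)/d vec(X) = B' kron A.
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.max(np.abs(lhs - rhs)))  # ~1e-15
```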
One particularly helpful identity is:

$$\operatorname{tr}(AB) = \operatorname{tr}(BA) \tag{13}$$
Examples:

$$\frac{d}{dX}\operatorname{tr}(AXB) = BA \tag{14}$$

because $d\operatorname{tr}(AXB) = \operatorname{tr}(A(dX)B) = \operatorname{tr}(BA\,dX)$
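Identities like (14) are easy to check by finite differences. The sketch below uses a hypothetical helper `num_grad` (my own construction, not from the text) that returns the array of partials $[\partial y/\partial x_{ij}]$; by the layout convention above, this array is the transpose of $dy/dX$, so here it should equal $(BA)'$.

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference partials: G[i, j] = df/dX[i, j].
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
A, X, B = (rng.standard_normal((3, 3)) for _ in range(3))

P = num_grad(lambda X: np.trace(A @ X @ B), X)
print(np.max(np.abs(P - (B @ A).T)))  # ~1e-10: partials equal the transpose of BA
```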
$$\frac{d}{dX}\operatorname{tr}(AX'BXC) = CAX'B + A'C'X'B' \tag{15}$$

because

$$\begin{aligned}
d\operatorname{tr}(AX'BXC) &= \operatorname{tr}(AX'B(dX)C) + \operatorname{tr}(A(dX)'BXC) \\
&= \operatorname{tr}((CAX'B + A'C'X'B')\,dX)
\end{aligned}$$
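The same finite-difference check applies to (15) (again with the hypothetical `num_grad` helper; the array of partials comes out as the transpose of the stated derivative):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference partials: G[i, j] = df/dX[i, j].
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(4)
A, X, B, C = (rng.standard_normal((3, 3)) for _ in range(4))

P = num_grad(lambda X: np.trace(A @ X.T @ B @ X @ C), X)
D = C @ A @ X.T @ B + A.T @ C.T @ X.T @ B.T  # right-hand side of (15)
print(np.max(np.abs(P - D.T)))               # ~1e-9
```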
$$\frac{d}{dX}\operatorname{tr}(AX^{-1}B) = -X^{-1}BAX^{-1} \tag{16}$$

because

$$\begin{aligned}
d\operatorname{tr}(AX^{-1}B) &= -\operatorname{tr}(AX^{-1}(dX)X^{-1}B) \\
&= -\operatorname{tr}(X^{-1}BAX^{-1}\,dX)
\end{aligned}$$
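And to (16), with $X$ kept safely invertible (a sketch under the same assumptions):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference partials: G[i, j] = df/dX[i, j].
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(5)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3)) + 4 * np.eye(3)  # keep X well-conditioned

P = num_grad(lambda X: np.trace(A @ np.linalg.inv(X) @ B), X)
D = -np.linalg.inv(X) @ B @ A @ np.linalg.inv(X)  # right-hand side of (16)
print(np.max(np.abs(P - D.T)))                    # ~1e-8
```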
$$\frac{d}{dX}\operatorname{tr}(A(X\Sigma X')^{-1}B) = -\Sigma X'(X\Sigma X')^{-1}(BA + A'B')(X\Sigma X')^{-1} \quad (\text{for symmetric } \Sigma) \tag{17}$$

because

$$\begin{aligned}
d\operatorname{tr}(A(X\Sigma X')^{-1}B) &= -\operatorname{tr}(A(X\Sigma X')^{-1}((dX)\Sigma X' + X\Sigma(dX)')(X\Sigma X')^{-1}B) \\
&= -\operatorname{tr}(\Sigma X'(X\Sigma X')^{-1}(BA + A'B')(X\Sigma X')^{-1}\,dX)
\end{aligned}$$
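Equation (17) can be checked the same way; in this sketch $\Sigma$ is an arbitrary symmetric positive definite matrix and $X$ is rectangular (`num_grad` is the hypothetical helper from above):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference partials: G[i, j] = df/dX[i, j].
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(6)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
X = rng.standard_normal((2, 4))
S = rng.standard_normal((4, 4)); S = S @ S.T + np.eye(4)  # symmetric Sigma

M = np.linalg.inv(X @ S @ X.T)
D = -S @ X.T @ M @ (B @ A + A.T @ B.T) @ M  # right-hand side of (17)
P = num_grad(lambda X: np.trace(A @ np.linalg.inv(X @ S @ X.T) @ B), X)
print(np.max(np.abs(P - D.T)))              # ~1e-8
```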
$$\frac{d}{dX}|X| = |X|\,X^{-1} \tag{18}$$

$$\frac{d}{dX}|X'X| = 2|X'X|\,(X'X)^{-1}X' \tag{19}$$
because

$$\begin{aligned}
d|X'X| &= |X'X|\operatorname{tr}((X'X)^{-1}\,d(X'X)) \\
&= |X'X|\operatorname{tr}((X'X)^{-1}(X'\,dX + (dX)'X)) \\
&= 2|X'X|\operatorname{tr}((X'X)^{-1}X'\,dX)
\end{aligned}$$
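A finite-difference check of (19), with a tall $X$ so that $X'X$ is invertible (same hypothetical `num_grad` helper):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference partials: G[i, j] = df/dX[i, j].
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(7)
X = rng.standard_normal((5, 3))  # tall X: X'X is 3x3 and generically invertible

P = num_grad(lambda X: np.linalg.det(X.T @ X), X)
D = 2 * np.linalg.det(X.T @ X) * np.linalg.inv(X.T @ X) @ X.T  # RHS of (19)
print(np.max(np.abs(P - D.T)))                                 # ~1e-8
```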
$$\frac{d}{dX} f(Xz) = z\left(\frac{d}{dx}f(x)\Big|_{x=Xz}\right) \tag{20}$$

because

$$df(x) = \left(\frac{d}{dx}f(x)\right)dx \quad (\text{by definition})$$

$$df(Xz) = \left(\frac{d}{dx}f(x)\Big|_{x=Xz}\right)(dX)z = \operatorname{tr}\left(z\left(\frac{d}{dx}f(x)\Big|_{x=Xz}\right)dX\right)$$
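A check of (20), taking $f(x) = \sum_i \sin(x_i)$ so that $df/dx$ is the row vector $\cos(x)'$ (a sketch; the function choice is mine):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference partials: G[i, j] = df/dX[i, j].
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(8)
X, z = rng.standard_normal((3, 4)), rng.standard_normal(4)

P = num_grad(lambda X: np.sum(np.sin(X @ z)), X)
D = np.outer(z, np.cos(X @ z))  # z times the row vector df/dx, evaluated at x = Xz
print(np.max(np.abs(P - D.T)))  # ~1e-10
```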
Constraints

Sometimes we want to take the derivative of a function whose argument must
be symmetric. In this case, $dX$ must be symmetric, so we get

$$dy(X) = \operatorname{tr}(A\,dX) \;\Rightarrow\; \frac{dy(X)}{dX} = (A + A') - (A \circ I) \tag{21}$$

where $A \circ I$ is simply $A$ with off-diagonal elements set to zero. The reader can check this by
expanding $\operatorname{tr}(A\,dX)$ and merging identical elements of $dX$. An example of this rule is:

$$\frac{d}{d\Sigma}\log|\Sigma| = 2\Sigma^{-1} - (\Sigma^{-1} \circ I) \tag{22}$$

when $\Sigma$ must be symmetric. This is usually easier than taking an unconstrained derivative and
then using Lagrange multipliers to enforce symmetry.
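To check (22) numerically, treat each $\sigma_{ij}$ with $i \le j$ as one free parameter and perturb the $(i,j)$ and $(j,i)$ entries together (a sketch with an arbitrary symmetric positive definite $\Sigma$):

```python
import numpy as np

rng = np.random.default_rng(9)
S = rng.standard_normal((3, 3)); S = S @ S.T + np.eye(3)  # symmetric positive definite

G = 2 * np.linalg.inv(S) - np.diag(np.diag(np.linalg.inv(S)))  # RHS of (22)

eps, err = 1e-6, 0.0
for i in range(3):
    for j in range(i, 3):
        E = np.zeros((3, 3)); E[i, j] = E[j, i] = eps  # symmetric step (diagonal set once)
        fd = (np.log(np.linalg.det(S + E)) - np.log(np.linalg.det(S - E))) / (2 * eps)
        err = max(err, abs(fd - G[i, j]))
print(err)  # ~1e-9: matches the constrained derivative
```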
Similarly, if $X$ must be diagonal, then so must $dX$, and we get

$$dy(X) = \operatorname{tr}(A\,dX) \;\Rightarrow\; \frac{dy(X)}{dX} = (A \circ I) \tag{23}$$
Example: Principal Component Analysis

Suppose we want to represent the zero-mean
random vector $x$ as one random variable $a$ times a constant unit vector $v$. This is useful for
compression or noise removal. Once we choose $v$, the optimal choice for $a$ is $v'x$, but what is
the best $v$? In other words, what $v$ minimizes $\mathrm{E}[(x - av)'(x - av)]$, when $a$ is chosen optimally
for each $x$?

Let $\Sigma = \mathrm{E}[xx']$. We want to maximize

$$f(v) = v'\Sigma v - \lambda(v'v - 1)$$

where $\lambda$ is a Lagrange multiplier. Taking derivatives gives

$$\nabla f(v) = 2\Sigma v - 2\lambda v$$
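Setting $\nabla f(v) = 0$ gives $\Sigma v = \lambda v$, so $v$ must be an eigenvector of $\Sigma$; since $v'\Sigma v = \lambda$ at any such point, the best $v$ is the eigenvector with the largest eigenvalue. A minimal numeric sketch of this conclusion (synthetic data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(10)
# Synthetic zero-mean data with one elongated direction.
X = rng.standard_normal((1000, 3)) @ np.diag([3.0, 1.0, 0.3])
X -= X.mean(axis=0)
Sigma = X.T @ X / len(X)  # sample estimate of E[x x']

eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
for k in range(3):
    v = eigvecs[:, k]                     # unit eigenvector: Sigma v = lambda_k v
    a = X @ v                             # optimal coefficients a = v'x
    err = np.mean(np.sum((X - np.outer(a, v)) ** 2, axis=1))
    print(f"lambda={eigvals[k]:.3f}  error={err:.3f}")
# The largest eigenvalue gives the smallest reconstruction error.
```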