Convolutional neural networks
Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
wujx2001@gmail.com
February 11, 2020
Contents

1 Preliminaries
  1.1 Tensor and vectorization
  1.2 Vector calculus and the chain rule
2 CNN overview
  2.1 The architecture
  2.2 The forward run
  2.3 Stochastic gradient descent (SGD)
  2.4 Error back propagation
3 Layer input, output, and notations
4 The ReLU layer
5 The convolution layer
  5.1 What is a convolution?
  5.2 Why to convolve?
  5.3 Convolution as matrix product
  5.4 The Kronecker product
  5.5 Backward propagation: updating the parameters
  5.6 Even higher dimensional indicator matrices
  5.7 Backward propagation: preparing the supervision signal for the previous layer
  5.8 Fully connected layer as a convolution layer
6 The pooling layer
7 A case study: the VGG-16 net
  7.1 VGG-Verydeep-16
  7.2 Receptive field
8 Hands-on CNN experiences
Exercises
This chapter describes how a Convolutional Neural Network (CNN) operates from a mathematical perspective. It is self-contained, and the focus is to make it comprehensible for beginners to the CNN field.

The convolutional neural network (CNN) has shown excellent performance in many computer vision, machine learning, and pattern recognition problems. Many solid papers have been published on this topic, and quite a number of high quality open source CNN software packages have been made available. There are also well-written CNN tutorials and CNN software manuals. However, we believe that introductory CNN material specifically prepared for beginners is still needed. Research papers are usually very terse and lack details, so it might be difficult for beginners to read them. A tutorial targeting experienced researchers may not cover all the necessary details to understand how a CNN runs.
This chapter tries to present a document that

• is self-contained. It is expected that all required mathematical background knowledge is introduced in this chapter itself (or in other chapters in this book);

• has details for all the derivations. This chapter aims to explain all the necessary math in detail. We try not to ignore any important step in a derivation. Thus, it should be possible for a beginner to follow (although an expert may find this chapter a bit tautological);

• ignores implementation details. The purpose is for a reader to understand how a CNN runs at the mathematical level. In CNNs, making correct choices for various implementation details is one of the keys to high accuracy (that is, "the devil is in the details"). However, we intentionally leave this part out so that the reader can focus on the mathematics. After understanding the mathematical principles and details, it is more effective to learn these implementation and design details through hands-on experience with CNN programming. The exercise problems in this chapter provide opportunities for such hands-on experience.
CNNs are useful in many applications, especially in image related tasks. Applications of CNNs include image classification, image semantic segmentation, object detection in images, etc. We will focus on image classification (or categorization) in this chapter. In image categorization, every image has a major object that occupies a large portion of the image. An image is classified into one of the classes based on the identity of its main object—e.g., dog, airplane, bird, etc.
1 Preliminaries
We start with a discussion of some background knowledge that is necessary in
order to understand how a CNN runs. The reader can ignore this section if
he/she is familiar with these basics.
1.1 Tensor and vectorization

Everybody is familiar with vectors and matrices. We use a symbol shown in boldface to represent a vector—e.g., $\mathbf{x} \in \mathbb{R}^D$ is a column vector with $D$ elements. We use a capital letter to denote a matrix—e.g., $X \in \mathbb{R}^{H \times W}$ is a matrix with $H$ rows and $W$ columns. The vector $\mathbf{x}$ can also be viewed as a matrix with 1 column and $D$ rows.
These concepts can be generalized to higher-order matrices—i.e., tensors. For example, $\mathbf{x} \in \mathbb{R}^{H \times W \times D}$ is an order 3 (or third order) tensor. It contains $HWD$ elements, and each of them can be indexed by an index triplet $(i, j, d)$, with $0 \le i < H$, $0 \le j < W$, and $0 \le d < D$. Another way to view an order 3 tensor is to treat it as containing $D$ channels of matrices. Every channel is a matrix with size $H \times W$. The first channel contains all the numbers in the tensor that are indexed by $(i, j, 0)$. Note that in this chapter we assume the index starts from 0 rather than 1. When $D = 1$, an order 3 tensor reduces to a matrix.
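As a small illustration of this indexing scheme, here is a minimal Python/NumPy sketch (an assumption on our part; the tensor T and all names below are purely illustrative, not something this chapter prescribes):

import numpy as np

H, W, D = 4, 5, 3
T = np.zeros((H, W, D))

# The element indexed by the triplet (i, j, d), with indices starting at 0.
i, j, d = 1, 2, 0
T[i, j, d] = 7.0

# The first channel: all numbers indexed by (i, j, 0), an H x W matrix.
first_channel = T[:, :, 0]
print(first_channel.shape)   # (4, 5)
print(first_channel[i, j])   # 7.0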
We have interacted with tensors day-to-day. A scalar value is a zeroth-order (order 0) tensor; a vector is an order 1 tensor; and a matrix is a second order tensor. A color image is in fact an order 3 tensor. An image with $H$ rows and $W$ columns is a tensor with size $H \times W \times 3$: if a color image is stored in the RGB format, it has 3 channels (for R, G and B, respectively), and each channel is an $H \times W$ matrix (second order tensor) that contains the R (or G, or B) values of all pixels.
It is beneficial to represent images (or other types of raw data) as a tensor.
In early computer vision and pattern recognition, a color image (which is an
order 3 tensor) was often converted to the grayscale version (which is a matrix)
because we know how to handle matrices much better than tensors. The color
information is lost during this conversion. But color is very important in various
image (or video) based learning and recognition problems, and we do want to
process color information in a principled way—e.g., using a CNN.
Tensors are essential in CNNs. The input, intermediate representations, and parameters in a CNN are all tensors. Tensors with order higher than 3 are also widely used in CNNs. For example, we will soon see that the convolution kernels in a convolution layer of a CNN form an order 4 tensor.
Given a tensor, we can arrange all the numbers inside it into a long vector, following a pre-specified order. For example, in Matlab/Octave, the (:) operator converts a matrix into a column vector in the column-first order. An example is:

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad A(:) = (1, 3, 2, 4)^T = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}. \tag{1}$$
In mathematics, we use the notation "vec" to represent this vectorization operator. That is, $\mathrm{vec}(A) = (1, 3, 2, 4)^T$ in the example in Equation 1. In order to vectorize an order 3 tensor, we could vectorize its first channel (which is a matrix, and we already know how to vectorize it), then the second channel, . . . , till all channels are vectorized. The vectorization of the order 3 tensor is then the concatenation of the vectorizations of all the channels in this order.
The vectorization of an order 3 tensor is a recursive process, which utilizes
the vectorization of order 2 tensors. This recursive process can be applied to
vectorize an order 4 (or even higher order) tensor in the same manner.
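To make the column-first convention concrete, here is a minimal sketch in Python/NumPy (NumPy is our assumption; the chapter's own example uses Matlab/Octave's (:) operator, which NumPy's order='F' mimics):

import numpy as np

# The matrix A from Equation 1.
A = np.array([[1, 2],
              [3, 4]])

# Column-first (Fortran-order) vectorization, equivalent to Matlab's A(:).
print(A.reshape(-1, order='F'))  # [1 3 2 4]

# vec of an order 3 tensor: vectorize each H x W channel in turn,
# then concatenate the channel vectors in order.
H, W, D = 2, 2, 3
T = np.arange(H * W * D).reshape(H, W, D)
vec_T = np.concatenate([T[:, :, d].reshape(-1, order='F') for d in range(D)])
print(vec_T.shape)  # (12,) -- all HWD elements in one long vector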
1.2 Vector calculus and the chain rule

The CNN learning process depends on vector calculus and the chain rule. Suppose $z$ is a scalar (i.e., $z \in \mathbb{R}$) and $\mathbf{y} \in \mathbb{R}^H$ is a vector. If $z$ is a function of $\mathbf{y}$, then the partial derivative of $z$ with respect to $\mathbf{y}$ is a vector, defined as

$$\left( \frac{\partial z}{\partial \mathbf{y}} \right)_i = \frac{\partial z}{\partial y_i}. \tag{2}$$
In other words, $\frac{\partial z}{\partial \mathbf{y}}$ is a vector having the same size as $\mathbf{y}$, and its $i$-th element is $\frac{\partial z}{\partial y_i}$. Also note that $\frac{\partial z}{\partial \mathbf{y}^T} = \left( \frac{\partial z}{\partial \mathbf{y}} \right)^T$.
Furthermore, suppose $\mathbf{x} \in \mathbb{R}^W$ is another vector, and $\mathbf{y}$ is a function of $\mathbf{x}$. Then, the partial derivative of $\mathbf{y}$ with respect to $\mathbf{x}$ is defined as

$$\left( \frac{\partial \mathbf{y}}{\partial \mathbf{x}^T} \right)_{ij} = \frac{\partial y_i}{\partial x_j}. \tag{3}$$

This partial derivative is an $H \times W$ matrix, whose entry at the intersection of the $i$-th row and $j$-th column is $\frac{\partial y_i}{\partial x_j}$.
It is easy to see that $z$ is a function of $\mathbf{x}$ in a chain-like argument: a function maps $\mathbf{x}$ to $\mathbf{y}$, and another function maps $\mathbf{y}$ to $z$. The chain rule can be used to compute $\frac{\partial z}{\partial \mathbf{x}^T}$, as

$$\frac{\partial z}{\partial \mathbf{x}^T} = \frac{\partial z}{\partial \mathbf{y}^T} \frac{\partial \mathbf{y}}{\partial \mathbf{x}^T}. \tag{4}$$
A sanity check for Equation 4 is to check the matrix/vector dimensions. Note that $\frac{\partial z}{\partial \mathbf{y}^T}$ is a row vector with $H$ elements, or a $1 \times H$ matrix. (Be reminded that $\frac{\partial z}{\partial \mathbf{y}}$ is a column vector.) Since $\frac{\partial \mathbf{y}}{\partial \mathbf{x}^T}$ is an $H \times W$ matrix, the vector/matrix multiplication between them is valid, and the result should be a row vector with $W$ elements, which matches the dimensionality of $\frac{\partial z}{\partial \mathbf{x}^T}$.
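The dimension bookkeeping in Equation 4 can also be verified numerically. Below is a minimal NumPy sketch, with a hypothetical linear map chosen by us so the Jacobians are known in closed form:

import numpy as np

H, W = 3, 4
rng = np.random.default_rng(0)

# Choose y = Mx and z = a'y, so dy/dx' = M (H x W) and dz/dy = a (length H).
M = rng.standard_normal((H, W))
a = rng.standard_normal(H)

dz_dyT = a.reshape(1, H)   # row vector, 1 x H
dz_dxT = dz_dyT @ M        # Equation 4: (1 x H)(H x W) = 1 x W

# Sanity check against a finite-difference approximation.
x = rng.standard_normal(W)
z = lambda v: a @ (M @ v)
eps = 1e-6
fd = np.array([(z(x + eps * np.eye(W)[j]) - z(x)) / eps for j in range(W)])
print(np.allclose(dz_dxT.ravel(), fd, atol=1e-4))  # True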
For specific rules to calculate partial derivatives of vectors and matrices, please refer to Chapter 2 and the Matrix Cookbook.
2 CNN overview
In this section, we will see how a CNN trains and predicts at the abstract level,
with the details left for later sections.
2.1 The architecture
A CNN usually takes an order 3 tensor as its input—e.g., an image with $H$ rows, $W$ columns, and 3 channels (R, G, B color channels). Higher order tensor inputs, however, can be handled by CNNs in a similar fashion. The input then sequentially goes through a number of processes. One processing step is usually called a layer, which could be a convolution layer, a pooling layer, a normalization layer, a fully connected layer, a loss layer, etc.
We will introduce the details of these layers later in this chapter, with detailed introductions to three types of layers: convolution, pooling, and ReLU, which are the key parts of almost all CNN models. Proper normalization—e.g., batch normalization—is important in the optimization process for learning good parameters in a CNN. Although it is not introduced in this chapter, we will present some related resources in the exercise problems.
For now, let us give an abstract description of the CNN structure first.
$$\mathbf{x}^1 \longrightarrow \boxed{\mathbf{w}^1} \longrightarrow \mathbf{x}^2 \longrightarrow \cdots \longrightarrow \mathbf{x}^{L-1} \longrightarrow \boxed{\mathbf{w}^{L-1}} \longrightarrow \mathbf{x}^L \longrightarrow \boxed{\mathbf{w}^L} \longrightarrow z \tag{5}$$
The above Equation 5 illustrates how a CNN runs layer by layer in a forward pass. The input is $\mathbf{x}^1$, usually an image (order 3 tensor). It goes through the processing in the first layer, which is the first box. We denote the parameters involved in the first layer's processing collectively as a tensor $\mathbf{w}^1$. The output of the first layer is $\mathbf{x}^2$, which also acts as the input to the second layer's processing. This processing proceeds till all layers in the CNN have been finished, which outputs $\mathbf{x}^L$.
One additional layer, however, is added for backward error propagation, a method that learns good parameter values in the CNN. Let's suppose the problem at hand is an image classification problem with $C$ classes. A commonly used strategy is to output $\mathbf{x}^L$ as a $C$ dimensional vector, whose $i$-th entry encodes the prediction (the posterior probability of $\mathbf{x}^1$ coming from the $i$-th class). To make $\mathbf{x}^L$ a probability mass function, we can set the processing in the $(L-1)$-th layer as a softmax transformation of $\mathbf{x}^{L-1}$ (cf. Chapter 9). In other applications, the output $\mathbf{x}^L$ may have other forms and interpretations.