3 Convolutional Neural Networks
Typically, convolutional layers are interspersed with sub-sampling layers to reduce computation time
and to gradually build up further spatial and configural invariance. A small sub-sampling factor is
desirable, however, in order to maintain specificity at the same time. Of course, this idea is not new,
but the concept is both simple and powerful. The mammalian visual cortex and models thereof [12,
8, 7] draw heavily on these themes, and auditory neuroscience has revealed in the past ten years
or so that these same design paradigms can be found in the primary and belt auditory areas of the
cortex in a number of different animals [6, 11, 9]. Hierarchical analysis and learning architectures
may yet be the key to success in the auditory domain.
3.1 Convolution Layers
Let’s move forward with deriving the backpropagation updates for convolutional layers in a network.
At a convolution layer, the previous layer’s feature maps are convolved with learnable kernels and
put through the activation function to form the output feature map. Each output map may combine
convolutions with multiple input maps. In general, we have that
\[
x^{\ell}_j = f\!\left( \sum_{i \in M_j} x^{\ell-1}_i * k^{\ell}_{ij} + b^{\ell}_j \right),
\]
where $M_j$ represents a selection of input maps, and the convolution is of the “valid” border handling
type when implemented in MATLAB. Some common choices of input maps include all-pairs or all-
triplets, but we will discuss how one might learn combinations below. Each output map is given an
additive bias $b$; however, for a particular output map, the input maps will be convolved with distinct
kernels. That is to say, if output map $j$ and map $k$ both sum over input map $i$, then the kernels
applied to map $i$ are different for output maps $j$ and $k$.
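To make the forward pass concrete, here is a minimal NumPy/SciPy sketch of the equation above. The function and argument names (conv_layer_forward, prev_maps, input_sets, and so on) are hypothetical, and tanh merely stands in for whatever activation $f$ the network actually uses.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer_forward(prev_maps, kernels, biases, input_sets, f=np.tanh):
    """Forward pass for one convolutional layer (a sketch).

    prev_maps  : list of 2-D arrays, the feature maps x_i^(l-1) of the previous layer
    kernels    : dict mapping (i, j) -> 2-D kernel k_ij^l (a distinct kernel per pair)
    biases     : list of scalars b_j^l, one per output map
    input_sets : list of index sets M_j, one per output map j
    f          : activation function (tanh used here as a placeholder)
    """
    out_maps = []
    for j, (M_j, b_j) in enumerate(zip(input_sets, biases)):
        acc = 0.0
        for i in M_j:
            # "valid" convolution, matching MATLAB's conv2(..., 'valid')
            acc = acc + convolve2d(prev_maps[i], kernels[(i, j)], mode='valid')
        out_maps.append(f(acc + b_j))
    return out_maps
```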
3.1.1 Computing the Gradients
We assume that each convolution layer $\ell$ is followed by a downsampling layer $\ell+1$. The backpropagation
algorithm says that in order to compute the sensitivity for a unit at layer $\ell$, we should first sum
over the next layer's sensitivities corresponding to units that are connected to the node of interest in
the current layer $\ell$, and multiply each of those connections by the associated weights defined at layer
$\ell+1$. We then multiply this quantity by the derivative of the activation function evaluated at the
current layer's pre-activation inputs, $u^{\ell}$. In the case of a convolutional layer followed by a downsampling
layer, one pixel in the next layer's associated sensitivity map $\delta$ corresponds to a block of pixels
in the convolutional layer's output map. Thus each unit in a map at layer $\ell$ connects to only one unit
in the corresponding map at layer $\ell+1$. To compute the sensitivities at layer $\ell$ efficiently, we can
upsample the downsampling layer's sensitivity map to make it the same size as the convolutional
layer's map and then multiply the upsampled sensitivity map from layer $\ell+1$ with the activation
derivative map at layer $\ell$ element-wise. The "weights" defined at a downsampling layer map are all
equal to $\beta$ (a constant, see Section 3.2), so we just scale the previous step's result by $\beta$ to finish the
computation of $\delta^{\ell}$. We can repeat the same computation for each map $j$ in the convolutional layer,
pairing it with the corresponding map in the subsampling layer:
\[
\delta^{\ell}_j = \beta^{\ell+1}_j \, f'(u^{\ell}_j) \circ \mathrm{up}(\delta^{\ell+1}_j)
\]
where $\mathrm{up}(\cdot)$ denotes an upsampling operation that simply tiles each pixel in the input horizontally
and vertically $n$ times in the output if the subsampling layer subsamples by a factor of $n$. As we
will discuss below, one possible way to implement this function efficiently is to use the Kronecker
product:
\[
\mathrm{up}(x) \equiv x \otimes 1_{n \times n}.
\]
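A small sketch of how this might look in NumPy: upsample implements $\mathrm{up}(\cdot)$ with np.kron exactly as in the identity above, and conv_layer_deltas applies the sensitivity formula map by map. The names and the tanh derivative are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def upsample(x, n):
    # Tile each pixel of x into an n-by-n block: up(x) = x (Kronecker product) ones(n, n)
    return np.kron(x, np.ones((n, n)))

def conv_layer_deltas(u_maps, next_deltas, betas, n,
                      f_prime=lambda u: 1.0 - np.tanh(u) ** 2):
    """Sensitivity maps delta_j^l for a convolution layer followed by subsampling (a sketch).

    u_maps      : pre-activation inputs u_j^l, one 2-D array per map j
    next_deltas : sensitivity maps delta_j^(l+1) from the subsampling layer
    betas       : the subsampling layer's multiplicative weights beta_j^(l+1)
    n           : subsampling factor
    f_prime     : derivative of the activation (tanh assumed here)
    """
    return [beta * f_prime(u) * upsample(d, n)   # element-wise product with the upsampled map
            for u, d, beta in zip(u_maps, next_deltas, betas)]
```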
Now that we have the sensitivities for a given map, we can immediately compute the bias gradient
by simply summing over all the entries in $\delta^{\ell}_j$:
\[
\frac{\partial E}{\partial b_j} = \sum_{u,v} \left( \delta^{\ell}_j \right)_{uv}.
\]
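Continuing the sketch above, the bias gradients then reduce to a per-map sum over the sensitivity entries:

```python
# Bias gradient for each output map j: sum all entries of its sensitivity map.
# 'deltas' is the list returned by the hypothetical conv_layer_deltas sketch above.
bias_grads = [delta_j.sum() for delta_j in deltas]
```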