Visualizing and Understanding Convolutional Networks

Matthew D. Zeiler zeiler@cs.nyu.edu

Dept. of Computer Science, Courant Institute, New York University

Rob Fergus fergus@cs.nyu.edu

Dept. of Computer Science, Courant Institute, New York University

Abstract

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

1. Introduction

Since their introduction by (LeCun et al., 1989) in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record-beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical; and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.

Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, retraining only the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).

arXiv:1311.2901v3 [cs.CV] 28 Nov 2013

1.1. Related Work

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex and so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map.

(Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

2. Approach

We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image $x_i$, via a series of layers, to a probability vector $\hat{y}_i$ over the $C$ different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

We train these models using a large set of $N$ labeled images $\{x, y\}$, where label $y_i$ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare $\hat{y}_i$ and $y_i$. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by backpropagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.
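As a small illustration of the loss described here, the sketch below computes a softmax probability vector and the cross-entropy against a true class index. The function names and the example scores are hypothetical and chosen only for illustration; the paper does not give an implementation.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores of shape (C,) into a probability vector."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(y_hat, y):
    """Negative log-probability that y_hat assigns to the true class index y."""
    return -np.log(y_hat[y])

# Hypothetical scores for C = 3 classes; the true class is index 0.
logits = np.array([2.0, 0.5, -1.0])
y_hat = softmax(logits)
loss = cross_entropy(y_hat, y=0)
```

The loss is zero only when the model puts all probability mass on the true class, and it grows as that probability shrinks.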

2.1. Visualization with a Deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features are computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
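The first step of this procedure, keeping one activation and zeroing the rest of the layer, can be sketched as follows. The array layout (maps, rows, columns), the layer size and the helper name `isolate_activation` are assumptions made for the example.

```python
import numpy as np

def isolate_activation(feature_maps, channel, row, col):
    """Return a copy of the layer in which every activation except one is zero."""
    isolated = np.zeros_like(feature_maps)
    isolated[channel, row, col] = feature_maps[channel, row, col]
    return isolated

rng = np.random.default_rng(0)
feature_maps = rng.random((4, 6, 6))  # hypothetical layer: 4 feature maps of 6x6

# Pick the strongest activation in feature map 2 and keep only that one.
channel = 2
row, col = np.unravel_index(np.argmax(feature_maps[channel]),
                            feature_maps[channel].shape)
isolated = isolate_activation(feature_maps, channel, row, col)
```

The resulting array is what would be passed to the attached deconvnet layer.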

Unpooling: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.
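A minimal sketch of pooling with switches and the matching unpooling, for a single 2D feature map, is given below. It assumes non-overlapping 2x2 pooling regions to keep the bookkeeping simple, which is a simplification and may differ from the pooling used in the actual model; the function names are hypothetical.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling of a 2D map; also returns the switches.

    switches[i, j] is the flat index (0 .. size*size - 1) of the maximum
    inside pooling region (i, j).
    """
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "map must tile evenly"
    # Rearrange so the last axis enumerates the entries of one pooling region.
    regions = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    regions = regions.reshape(h // size, w // size, size * size)
    return regions.max(axis=-1), regions.argmax(axis=-1)

def unpool(pooled, switches, size=2):
    """Place each pooled value back at its recorded location; zeros elsewhere."""
    ph, pw = pooled.shape
    regions = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    i, j = np.indices((ph, pw))
    regions[i, j, switches] = pooled
    out = regions.reshape(ph, pw, size, size).transpose(0, 2, 1, 3)
    return out.reshape(ph * size, pw * size)

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 5.],
              [0., 0., 6., 0.],
              [7., 0., 0., 0.]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches)
```

Here `restored` keeps the four maxima (4, 5, 7, 6) at their original positions and is zero everywhere else, which is the sense in which unpooling is only an approximate inverse.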

Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
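The link between "transposed" and "flipped" filters can be checked numerically. The sketch below assumes a single input and output channel, stride 1 and a forward pass written as a valid cross-correlation (all assumptions for illustration, with hypothetical function names). It verifies that filtering a zero-padded map with the flipped filter is the transpose (adjoint) of the forward filtering.

```python
import numpy as np

def correlate_valid(x, f):
    """Forward filtering: slide f over x without padding (cross-correlation)."""
    kh, kw = f.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    return out

def transposed_filter(y, f):
    """Transpose of correlate_valid: zero-pad y, then filter with f flipped
    vertically and horizontally."""
    kh, kw = f.shape
    padded = np.pad(y, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return correlate_valid(padded, f[::-1, ::-1])

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))   # map in the layer beneath
f = rng.standard_normal((3, 3))   # learned filter
y = rng.standard_normal((4, 4))   # map in the layer above

# Adjoint identity: <A x, y> equals <x, A^T y> when A^T is the true transpose.
lhs = np.sum(correlate_valid(x, f) * y)
rhs = np.sum(x * transposed_filter(y, f))
```

With several feature maps, the transpose additionally swaps the roles of input and output channels, which this single-channel sketch does not show.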

Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

[Figure 1 diagram. Top, convnet layer: Layer Below Pooled Maps → Convolutional Filtering {F} → Feature Maps → Rectified Linear Function → Rectified Feature Maps → Max Pooling (recording Switches) → Pooled Maps. Top, deconvnet layer: Layer Above Reconstruction → Max Unpooling (using Switches) → Unpooled Maps → Rectified Linear Function → Rectified Unpooled Maps → Convolutional Filtering {F^T} → Reconstruction. Bottom: Pooling takes Rectified Feature Maps to Pooled Maps and records the Max Locations ("Switches"); Unpooling takes the Layer Above Reconstruction to Unpooled Maps.]

Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.

3. Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (the four corners and the center, each with and without a horizontal flip).
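The cropping steps can be sketched as below, assuming an image already resized so that its smallest side is 256 and stored as a (height, width, 3) array. The resizing itself is omitted, the constant `mean_image` is only a stand-in for the per-pixel mean over the training set, and the function names are hypothetical.

```python
import numpy as np

def center_crop(image, size=256):
    """Cut the central size x size region out of an (H, W, 3) image."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def ten_crops(image, crop=224):
    """Four corner crops plus the center crop, each with and without a
    horizontal flip, giving 10 sub-crops in total."""
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = []
    for top, left in offsets:
        patch = image[top:top + crop, left:left + crop]
        crops.append(patch)
        crops.append(patch[:, ::-1])  # horizontal flip
    return crops

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 340, 3)).astype(np.float64)
mean_image = np.full((256, 256, 3), 128.0)  # stand-in for the dataset mean
centered = center_crop(image) - mean_image
crops = ten_crops(centered)
```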

Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.
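For concreteness, one common form of the momentum update is sketched below with the stated learning rate and momentum. Several momentum conventions exist and the text does not say which one was used, so the exact update rule here is an assumption; the quadratic objective is a toy problem chosen only to exercise the step.

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=1e-2, momentum=0.9):
    """One classical momentum update; returns the new parameter and velocity."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# Toy problem: minimize 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
start_norm = np.linalg.norm(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, grad=w, velocity=v)
end_norm = np.linalg.norm(w)
```

In real training the gradient would come from backpropagation on a mini-batch of 128 images, and `lr` would be lowered by hand when the validation error plateaus.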

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
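The filter renormalization described in this paragraph can be sketched as follows: any filter whose RMS value exceeds the radius is rescaled so that its RMS equals the radius, and all other filters are left untouched. The array layout (filters along the first axis) and the function name are assumptions for the example.

```python
import numpy as np

def renormalize_filters(filters, radius=0.1):
    """Rescale each filter whose RMS exceeds `radius` back to RMS == radius.

    filters: array of shape (num_filters, ...); one filter per first-axis entry.
    """
    flat = filters.reshape(filters.shape[0], -1)
    rms = np.sqrt(np.mean(flat ** 2, axis=1))
    # np.maximum guards against division by zero for all-zero filters.
    scale = np.where(rms > radius, radius / np.maximum(rms, 1e-12), 1.0)
    return filters * scale.reshape(-1, *([1] * (filters.ndim - 1)))

def filter_rms(filters):
    """RMS value of each filter, for inspection."""
    flat = filters.reshape(filters.shape[0], -1)
    return np.sqrt(np.mean(flat ** 2, axis=1))

# Two hypothetical 3-channel 7x7 filters: one dominant, one small.
filters = np.stack([np.full((3, 7, 7), 0.5), np.full((3, 7, 7), 0.01)])
clipped = renormalize_filters(filters)
```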

4. Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.
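Selecting the top 9 activations of one feature map over a set of images might look like the sketch below. It assumes that each image contributes only its single strongest response, so the 9 results come from 9 different images; the text does not spell this out, so treat it as an assumption. The array layout and function name are hypothetical.

```python
import numpy as np

def top_k_activations(activations, k=9):
    """Find the k images with the strongest response in one feature map.

    activations: array of shape (num_images, H, W) holding that map's
    response to every image. Returns (image_index, row, col, value) tuples,
    strongest first.
    """
    n, h, w = activations.shape
    flat = activations.reshape(n, -1)
    best_pos = flat.argmax(axis=1)   # location of each image's maximum
    best_val = flat.max(axis=1)      # value of each image's maximum
    order = np.argsort(best_val)[::-1][:k]
    results = []
    for i in order:
        row, col = divmod(int(best_pos[i]), w)
        results.append((int(i), row, col, float(best_val[i])))
    return results

rng = np.random.default_rng(0)
acts = rng.random((50, 5, 5))  # hypothetical responses to 50 images
top9 = top_k_activations(acts)
```

Each returned location could then be isolated and projected down through the deconvnet to produce one of the 9 visualizations.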
