ECCV2014-Visualizing and Understanding Convolutional Networks


Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al. [18]). However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.
M.D. Zeiler and R. Fergus

... 2D input image $x_i$, via a series of layers, to a probability vector $\hat{y}_i$ over the $C$ different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function ($\mathrm{relu}(x) = \max(x, 0)$); (iii) [optionally] max pooling over local neighborhoods; and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see [18] and [16]. The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

We train these models using a large set of $N$ labeled images $\{x, y\}$, where label $y_i$ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare $\hat{y}_i$ and $y_i$. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Details of training are given in Section 3.

2.1 Visualization with a deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al. [29]). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In Zeiler et al. [29], deconvnets were proposed as a way of performing unsupervised learning.
Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1 (top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features are computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

Unpooling: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1 (bottom) for an illustration of the procedure.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity (footnote 1).

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

Note that we do not use any contrast normalization operations when in this reconstruction path.
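The unpooling and rectification steps described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single 2D feature map and non-overlapping 2x2 pooling regions for simplicity (the model in this paper pools over 3x3 regions with stride 2), and the names max_pool_with_switches, unpool and rectify are hypothetical. The forward pass records the position of each maximum as a switch; the reverse pass uses those switches to place values from the layer above back at the recorded positions.

```python
import numpy as np


def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    h, w = fmap.shape
    assert h % size == 0 and w % size == 0, "sketch assumes evenly divisible maps"
    # Rearrange into (h/size, w/size) regions, each flattened to size*size values.
    regions = (fmap.reshape(h // size, size, w // size, size)
                   .transpose(0, 2, 1, 3)
                   .reshape(h // size, w // size, size * size))
    switches = regions.argmax(axis=-1)  # flat index of the max inside each region
    pooled = regions.max(axis=-1)
    return pooled, switches


def unpool(pooled, switches, size=2):
    """Approximate inverse of max pooling: put each value back at its switch."""
    ph, pw = pooled.shape
    regions = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    regions[rows, cols, switches] = pooled  # every other position stays zero
    return (regions.reshape(ph, pw, size, size)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * size, pw * size))


def rectify(x):
    """relu applied to the reconstructed signal."""
    return np.maximum(x, 0)


fmap = np.array([[1., 2., 0., 0.],
                 [3., 4., 0., 5.],
                 [0., 0., 7., 0.],
                 [0., 6., 0., 0.]])
pooled, switches = max_pool_with_switches(fmap)
reconstruction = rectify(unpool(pooled, switches))
print(pooled)
print(reconstruction)
```

In the reconstruction only the four recorded maxima survive, each at its original location, which is the sense in which the switches preserve the structure of the stimulus.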
Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved. The whole procedure is similar to backpropping a single strong activation (rather than the usual gradients), i.e. computing $\partial h / \partial X_n$, where $h$ is the element of the feature map with the strong activation and $X_n$ is the input image. However, it differs in that (i) the relu is imposed independently and (ii) contrast normalization operations are not used. A general shortcoming of our approach is that it only visualizes a single activation, not the joint activity present in a layer. Nevertheless, as we show in Fig. 6, these visualizations are accurate representations of the input pattern that stimulates the given feature map in the model: when the parts of the original input image corresponding to the pattern are occluded, we see a distinct drop in activity within the feature map.

3 Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18] for ImageNet classification. One difference is that the sparse connections used in Krizhevsky's layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 5, as described in Section 4.1.
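The layer 1 sizes quoted for this architecture in Fig. 3 (a 224 by 224 input, 7 by 7 filters at stride 2, max pooling over 3x3 regions at stride 2, and 55 by 55 output maps) can be checked with the usual output-size formula. The sketch below is only a consistency check under stated assumptions: it assumes one pixel of zero padding on the convolution and ceiling rounding in the pooling step. The text states neither; they are chosen here because they reproduce the quoted sizes. The helper names conv_out and pool_out are hypothetical.

```python
import math


def conv_out(size, kernel, stride, pad=0):
    """Output width of a convolution (floor rounding)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride, ceil_mode=False):
    """Output width of a pooling step, with floor or ceiling rounding."""
    span = (size - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1


# Assumption: pad=1 and ceil_mode=True are not given in the text.
after_conv = conv_out(224, kernel=7, stride=2, pad=1)
after_pool = pool_out(after_conv, kernel=3, stride=2, ceil_mode=True)
print(after_conv, after_pool)

# Length of the vector fed to the first fully connected layer (6 x 6 x 256).
print(6 * 6 * 256)
```

Without the padding assumption the same formula gives 109 rather than 110 after the convolution, so the exact border handling matters when reproducing the quoted sizes.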
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) [6]. Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.

Footnote 1: We also tried rectifying using the binary mask imposed by the feed-forward relu operation, but the resulting visualizations were significantly less clear.

Fig. 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet. The black/white bars are negative/positive activations within the feature map.

Visualization of the first layer filters during training reveals that a few of them dominate. To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range.
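The filter renormalization described above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the function name clip_filter_rms is hypothetical, the filters are assumed to be stored with the filter index on the first axis, and the radius is left as a parameter whose default of 0.1 is an assumption made for this example. Filters whose RMS is already within the radius are left untouched; the rest are rescaled so that their RMS equals the radius.

```python
import numpy as np


def clip_filter_rms(filters, radius=0.1):
    """Rescale every filter whose RMS value exceeds `radius` back to `radius`.

    filters: array of shape (num_filters, ...), one filter per leading index.
    """
    flat = filters.reshape(filters.shape[0], -1)
    rms = np.sqrt(np.mean(flat ** 2, axis=1))
    # Scale factor is 1 for filters inside the radius, radius / rms otherwise.
    scale = np.where(rms > radius, radius / np.maximum(rms, 1e-12), 1.0)
    # Reshape so the per-filter factor broadcasts over the remaining axes.
    return filters * scale.reshape((-1,) + (1,) * (filters.ndim - 1))


filters = np.array([[3.0, 4.0],      # RMS well above the radius: rescaled
                    [0.03, 0.04]])   # RMS below the radius: unchanged
clipped = clip_filter_rms(filters)
print(np.sqrt(np.mean(clipped ** 2, axis=1)))
```

Applying such a step after each parameter update keeps any single filter from growing until it dominates the layer, which is the failure the first-layer visualization exposed.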
As in Krizhevsky et al. [18], we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on [18].

4 Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. For a given feature map, we show the top 9 activations, each projected separately down to pixel space, revealing the different structures that excite that map and showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations which solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2, C4)). Layer 4 shows significant variation, and is more class-specific: dog faces (R1, C1); bird's legs (R4, C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1, C11) and dogs (R4).

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates.
The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

4.1 Architecture selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 5(a) & (c)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride of 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 5(b) & (d). More importantly, it also improves the classification performance, as shown in Section 5.1.

4.2 Occlusion Sensitivity

With image classification approaches, a natural question is whether the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 6 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 6 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

Fig. 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, cols 1). Best viewed in electronic form. The compression artifacts are a consequence of the 30Mb submission limit, not the reconstruction algorithm itself.

Fig. 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2, 3, 4, 5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 x 6 x 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.

Fig. 4. Evolution of a randomly chosen subset of model features through training. Each layer's features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1, 2, 5, 10, 20, 30, 40, 64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.

Fig. 5. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from Krizhevsky et al. [18]. (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer "dead" features. (d): Visualizations of 2nd layer features from Krizhevsky et al. [18]. (e): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in (d).

5 Experiments

5.1 ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 1 shows our results on this dataset.

Using the exact architecture specified in Krizhevsky et al. [18], we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7x7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of Krizhevsky et al. [18], beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, an improvement of 1.6%.
This result is close to that produced by the data-augmentation approaches of Howard [15], which could easily be combined with our architecture. However, our model is some way short of the winner of the 2013 ImageNet classification competition [28].

Table 1. ImageNet 2012/2013 classification error rates. The * indicates models that were trained on both ImageNet 2011 and 2012 training sets.

| Error % | Val Top-1 | Val Top-5 | Test Top-5 |
|---|---|---|---|
| Gunji et al. | - | - | 26.2 |
| DeCAF [7] | - | - | 19.2 |
| Krizhevsky et al. [18], 1 convnet | 40.7 | 18.2 | - |
| Krizhevsky et al. [18], 5 convnets | 38.1 | 16.4 | 16.4 |
| Krizhevsky et al. [18]*, 1 convnet | 39.0 | 16.6 | - |
| Krizhevsky et al. [18]*, 7 convnets | 36.7 | 15.4 | 15.3 |
| Our replication of Krizhevsky et al., 1 convnet | 40.5 | 18.1 | - |
| 1 convnet as per Fig. 3 | 38.4 | 16.5 | - |
| 5 convnets as per Fig. 3 - (a) | 36.7 | 15.3 | 15.3 |
| 1 convnet as per Fig. 3 but with layers 3, 4, 5: 512, 1024, 512 maps - (b) | 37.5 | 16.0 | 16.1 |
| 6 convnets, (a) & (b) combined | 36.0 | 14.7 | 14.8 |
| Howard [15] | - | - | 13.5 |
| Clarifai [28] | - | - | 11.7 |

Varying ImageNet Model Sizes: In Table 2, we first explore the architecture of Krizhevsky et al. [18] by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6, 7) only gives a slight increase in error (in the following, we refer to top-5 validation error). This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. We then modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for the model of Krizhevsky et al. [18]). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.

Fig. 6. Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) change. True labels of the three examples: Pomeranian, Car Wheel, Afghan Hound. (b): for each position of the gray scale, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog's face. When this is covered up, the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog's face is obscured, the probability for "pomeranian" drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is "pomeranian", but if the dog's face is obscured but not the ball, then it predicts "tennis ball". In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.
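The occlusion experiment of Section 4.2 follows a simple loop, sketched below as an illustration rather than the authors' code. The function occlusion_map is hypothetical, as are its parameters: score_fn stands in for whatever quantity is being monitored (the classifier's probability for the correct class, or the summed activity of one feature map), and the patch size, stride and fill value are assumptions chosen for the toy example. The toy score function simply measures mean intensity inside a fixed region, so the expected dip is easy to see.

```python
import numpy as np


def occlusion_map(image, score_fn, patch=8, stride=8, fill=0.5):
    """Slide a constant-valued square over `image` and record score_fn each time.

    Returns a 2D array with one score per occluder position.
    """
    h, w = image.shape[:2]
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            occluded = image.copy()
            occluded[r:r + patch, c:c + patch] = fill  # the gray square
            heat[i, j] = score_fn(occluded)
    return heat


# Toy stand-in for a trained model: the "object" is a bright block, and the
# score is the mean intensity inside that block.
image = np.zeros((32, 32))
image[8:16, 16:24] = 1.0


def toy_score(img):
    return float(img[8:16, 16:24].mean())


heat = occlusion_map(image, toy_score)
print(heat)
```

The score drops only at the occluder position that covers the bright block, mirroring the drop in correct-class probability reported when the object itself is covered. With a real network, score_fn would wrap a forward pass.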
