Visualizing and Understanding Convolutional Networks (2013)


Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.
To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners and center, with and without horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate manually throughout training when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.

Visualization of the first layer filters during training reveals that a few of them dominate.
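The switch-based pooling/unpooling pair that the deconvnet relies on can be sketched in a few lines. This is an illustrative NumPy sketch under our own simplifications (2x2 pooling regions, a single feature map); it is not the authors' implementation:

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Max pooling that also records a 'switch' (the flat index of the
    max inside each k x k region), as the convnet does on the way up."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            region = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            switches[i, j] = int(region.argmax())
            pooled[i, j] = region.max()
    return pooled, switches

def unpool(pooled, switches, k=2):
    """Deconvnet unpooling: place each pooled value back at its recorded
    switch location; every other position stays zero."""
    h, w = pooled.shape
    out = np.zeros((h * k, w * k))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(int(switches[i, j]), k)
            out[i * k + di, j * k + dj] = pooled[i, j]
    return out
```

Because the switches are recorded per input image, a reconstruction routed through them lands in the same spatial positions as the original activations, which is what makes the projections resemble small pieces of the input.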
This domination is shown in Fig. 6(a). To combat it, we renormalize each filter whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost the training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX 580 GPU, using an implementation based on (Krizhevsky et al., 2012).

3. Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky's layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model.

Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.

4. Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than the visualizations, as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.

Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, col 1). Best viewed in electronic form.

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures, e.g. mesh patterns (Row 1, Col 1) and text (R2, C1). Layer 4 shows significant variation.

4.2. Occlusion Sensitivity

With image classification approaches, a natural question is whether the model is truly identifying the location of the object in the image, or just using the surrounding context.
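The probe used to answer this question is simple to state: slide a grey square over the image and re-run the model at every occluder position. A minimal NumPy sketch, where the `model` callable is a hypothetical stand-in for the classifier's correct-class probability:

```python
import numpy as np

def occlusion_sensitivity(image, model, patch=16, stride=8, fill=0.5):
    """Slide a grey square over the image and record the model's score
    at each occluder position; returns a heatmap of scores."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = image.copy()
            y, x = r * stride, c * stride
            occluded[y:y + patch, x:x + patch] = fill  # grey square
            heatmap[r, c] = model(occluded)
    return heatmap
```

Plotting the resulting heatmap gives panels like Fig. 7(b) and (d): the score dips wherever the occluder covers the evidence the model relies on.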
Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

(Completing the layer-by-layer observations above: layer 4 is also more class-specific, e.g. dog faces (R1, C1) and bird's legs (R4, C2), while layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1, C11) and dogs (R4, C2).)

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map, projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation and scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).

4.1. Architecture Selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions.

4.3. Correspondence Analysis

Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image i, we then compute ε_i = x_i − x̃_i, where x_i and x̃_i are the feature vectors at layer l for the original and occluded images respectively. We then measure the consistency of this difference vector ε between all related image pairs (i, j): Δ_l = Σ_{i,j} H(sign(ε_i), sign(ε_j)), where H is Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way).
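The consistency measure Δ_l can be written down directly. A NumPy sketch; normalizing by the number of image pairs is our own choice to keep the score in [0, 1], whereas the paper simply sums the Hamming distances:

```python
import numpy as np
from itertools import combinations

def correspondence_score(orig_feats, occl_feats):
    """Mean Hamming distance between sign(eps_i) and sign(eps_j) over all
    image pairs, where eps_i = x_i - x~_i is the feature change caused by
    masking the same object part in image i. Lower = tighter correspondence."""
    eps = [np.sign(o - m) for o, m in zip(orig_feats, occl_feats)]
    pairs = combinations(range(len(eps)), 2)
    return float(np.mean([np.mean(eps[i] != eps[j]) for i, j in pairs]))
```

If masking the same part perturbs every image's features in the same direction, the sign patterns agree and the score approaches zero; unrelated perturbations drive it up.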
In Table 1 we compare the Δ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layer l = 5 and l = 7. The lower score for these parts, relative to random object regions, for the layer 5 features shows the model does establish some degree of correspondence.

To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance, as shown in Section 5.1.

Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps, to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2, 3, 4, 5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6·6·256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.

Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer's features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1, 2, 5, 10, 20, 30, 40, 64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.

Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5 example images undergoing the transformations. Cols 2 & 3: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the image is transformed.

Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer "dead" features. (d): Visualizations of 2nd layer features from (Krizhevsky et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in (d).

(The three examples in Fig. 7 have true labels Pomeranian, Car Wheel and Afghan Hound.)
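The curves in Fig. 5 (cols 2 and 3) are just distances between the feature vector of an image and that of its transformed copies. A toy NumPy sketch for horizontal translation; `feature_fn` is a hypothetical stand-in for a network layer and `np.roll` for the translation:

```python
import numpy as np

def invariance_curve(image, feature_fn, shifts):
    """Euclidean distance between features of the original image and of
    horizontally translated copies, one value per shift."""
    base = feature_fn(image)
    return [float(np.linalg.norm(feature_fn(np.roll(image, s, axis=1)) - base))
            for s in shifts]
```

A translation-invariant feature (e.g. a global mean) gives a flat zero curve, while raw pixels give a rapidly growing one, mirroring the layer-7 vs layer-1 contrast in Fig. 5.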
Figure 7. Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) change. (b): for each position of the gray square, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog's face. When this is covered up, the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog's face is obscured, the probability for "pomeranian" drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is "pomeranian", but if the dog's face is obscured but not the ball, then it predicts "tennis ball". In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.

Figure 8. Images used for correspondence experiments. Col 1: original image. Cols 2, 3, 4: occlusion of the right eye, left eye, and nose respectively. Other columns show examples of random occlusions.

5. Experiments

5.1. ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 2 shows our results on this dataset.

Using the exact architecture specified in (Krizhevsky et al., 2012), we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7x7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of (Krizhevsky et al., 2012), beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset (despite only using the 2012 training set). We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

Error %                                                      Val Top-1   Val Top-5   Test Top-5
(Gunji et al., 2012)                                         -           -           26.2
(Krizhevsky et al., 2012), 1 convnet                         40.7        18.2        -
(Krizhevsky et al., 2012), 5 convnets                        38.1        16.4        16.4
(Krizhevsky et al., 2012)*, 1 convnet                        39.0        16.6        -
(Krizhevsky et al., 2012)*, 7 convnets                       36.7        15.4        15.3
Our replication of (Krizhevsky et al., 2012), 1 convnet      40.5        18.1        -
1 convnet as per Fig. 3                                      38.4        16.5        -
5 convnets as per Fig. 3 (a)                                 36.7        15.3        15.3
1 convnet as per Fig. 3 but with layers 3, 4, 5: 512, 1024, 512 maps (b)   37.5   16.0   16.1
6 convnets, (a) & (b) combined                               36.0        14.7        14.8

Table 2. ImageNet 2012 classification error rates. The * indicates models that were trained on both ImageNet 2011 and 2012 training sets.

Varying ImageNet Model Sizes: In Table 3, we first explore the architecture of (Krizhevsky et al., 2012) by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6, 7) only gives a slight increase in error. This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. In Table 3 we also modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (the same holds for the model of (Krizhevsky et al., 2012)). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.

5.2. Feature Generalization

The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 (Fei-fei et al., 2006), Caltech-256 (Griffin et al., 2006) and PASCAL VOC 2012.
To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes), using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets. The classifiers used by our model (a softmax) and by other approaches (typically a linear SVM) are of similar complexity, so the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. It is important to note that both our feature representation and the hand-crafted features are designed using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset (Dalal & Triggs, 2005). We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and training them, as well as the softmax, on the training images of the dataset.

One complication is that some of the Caltech datasets have images that are also in the ImageNet training data. Using normalized correlation, we identified these few "overlap" images, removed them from our ImageNet training set and then retrained our ImageNet models, so avoiding the possibility of train/test contamination. [For Caltech-101, we found 44 images in common (out of 9,144 total images), with a maximum overlap of 10 for any given class. For Caltech-256, we found 243 images in common (out of 30,607 total images), with a maximum overlap of 18 for any given class.]

[Footnote to Section 5.1: this performance has since been surpassed in the ImageNet 2013 competition (http://www.image-net.org/challenges/LSVRC/2013/results.php).]

Error %                                      Train Top-1   Val Top-1   Val Top-5
Our replication of (Krizhevsky et al., 2012), 1 convnet   35.1   40.5   18.1
Removed layers 3, 4                          -             -           -
Removed layer 7                              27.4          40.0        -
Removed layers 6, 7                          27.4          44.8        22.4
Removed layers 3, 4, 6, 7                    -             -           -
Adjust layers 6, 7: 2048 units               -             40.3        18.8
Adjust layers 6, 7: 8192 units               26.8          40.0        18.1
Our model (as per Fig. 3)                    33.1          38.4        16.5
Adjust layers 6, 7: 2048 units               -             38.2        17.6
Adjust layers 6, 7: 8192 units               22.0          -           -
Adjust layers 3, 4, 5: 512, 1024, 512 maps   18.8          37.5        16.0
Adjust layers 6, 7: 8192 units and layers 3, 4, 5: 512, 1024, 512 maps   10.0   -   16.9

Table 3. ImageNet 2012 classification error rates with various architectural changes to the model of (Krizhevsky et al., 2012) and our model (see Fig. 3).

Caltech-101: We follow the procedure of (Fei-fei et al., 2006) and randomly select 15 or 30 images per class for training, test on up to 50 images per class, and report the average of the per-class accuracies in Table 4, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch, however, does terribly, only achieving 46.5%.

Acc %                          15/class        30/class
(Bo et al., 2013)              -               81.4 ± 0.33
(Jianchao et al., 2009)        73.2            84.3
Non-pretrained convnet         22.8 ± 1.5      46.5 ± 1.7
ImageNet-pretrained convnet    83.8 ± 0.5      86.5 ± 0.5

Table 4. Caltech-101 classification accuracy for our convnet models, against two leading alternate approaches.

Caltech-256: We follow the procedure of (Griffin et al., 2006), selecting 15, 30, 45, or 60 training images per class, and report the average of the per-class accuracies in Table 5. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by Bo et al. (Bo et al., 2013) by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 9, we explore the "one-shot learning" (Fei-fei et al., 2006) regime: with our pre-trained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.

Acc %                          15/class      30/class      45/class      60/class
(Sohn et al., 2011)            35.1          42.1          45.7          47.9
(Bo et al., 2013)              40.5 ± 0.4    48.0 ± 0.2    51.9 ± 0.2    55.2 ± 0.3
Non-pretrained convnet         9.0 ± 1.4     22.5 ± 0.7    31.2 ± 0.5    38.8 ± 1.4
ImageNet-pretrained convnet    65.7 ± 0.2    70.6 ± 0.2    72.7 ± 0.4    74.2 ± 0.3

Table 5. Caltech-256 classification accuracies.

Figure 9. Caltech-256 classification performance as the number of training images per class is varied. Using only 6 training examples per class with our pre-trained feature extractor, we surpass the best reported result of (Bo et al., 2013).
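The transfer recipe used throughout this section, freezing layers 1-7 and fitting only a new softmax on their output features, reduces to multinomial logistic regression. A NumPy sketch with plain batch gradient descent (the solver and hyper-parameters are our own assumptions; the paper does not specify them):

```python
import numpy as np

def train_softmax(features, labels, n_classes, lr=0.1, epochs=200):
    """Train only a softmax classifier on top of fixed features.
    Returns the weight matrix W and bias vector b."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n          # gradient of mean cross-entropy
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

`features` here stands in for the fixed layer-7 activations of the training images; only `W` and `b` are learned, which is why so few labeled examples suffice.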
We also showed how these visualizatioN call be used to debug prob however we do beat them on 5 classes, sometimes by lems with the model to obtain better results, for ex- large margins ample improving on Krizhevsky et al.'s(Krizhevsky et aL., 2012) impressive Image Net 2012 result. We Irs Acc yoAB( Airplane92.0. 396.0 Dining tab.277.867.7 then demonstrated through a series of occlusion exper- 74.284.277 68.983.087.8 iments that the model, while trained for classification 73080.888.4lor 78.2 86.0 77.585385.5 Motorbike81.090.185.1 is highly sensitive to local structure in the image and is Bottle 85.8 Potted pl 55.9 57. 8 52.2 not just using broad scene context. An ablation study 54360.8558 Pcrson 91.605.90.9 81.9 69479. on the model revealed that having a minimum depth 91.2 Ch 652754650 T 6& 1 74.4 61. 1 to the network, rather than any individual section, is 6.1 vital to the Ilodel's performance 82.2 Finally, we showed how the lmage.et trained model Table 6. PASCAL 2012 classification results, comparing our Imagenet-pretrained convnet against the leading two can generalize well to other datasets. For Caltech-101 methods (A=(Sande et al., 2012)and B=(Yan et al and Caltech-256, the datasets are siInilar enough that 2012) we can beat the best report ed results, in the latter case by a significant margin. This result brings into ques tion to utility of benchmarks with small (i. e.< 10 5.3. Feature analvsis training sets. Our convnet model generalized less well We explore how discriminative the features in each to the PASCAL data, perhaps suffering from dataset layer of our Imagenet-pretrained model are. We do this bias (Torralba efros, 2011), although ill was stl y varying the number of layers retained from the Ima within 3. 2% of the best reported result, despite no tun geNet model and place either a linear SVM or softmax ing for the task. 
For example, our performance might classifier on top Table 7 shows results on Caltech improvc if a diffcrcnt loss function was uscd that pcr 101 and Caltech-256. For both datasets, a steady im- mitted multiple objects per image. This would natu provement can be seen as we ascend the model, with rally enable the networks to tackle the object, detection best results being obtained by using all layers. This as well supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful fea- Acknowledgments tures The authors are very grateful for support by Nsf grant Cal-101 Cal-256 IIS-1116923. Microsoft Research and d Sloan Fellow (30/ class)(60/class SVM(1448±0.7246=04 ship sVM(②)66.2±0.539.6=0.3 72.3±0.446.0=0.3 References SVM(4)766±0.4513=0.1 VM(5 862±0.8656=0.3 Bengio, Y, Lamblin, P, Popovici, D. and Larochelle, VM(7)855±0.4717士0.2 H. Greedy layer-wise training of deep networks. In Softmax(5)82.9±0.465.7=0.5 NIPS,pp.153-160,2007 Softmax(⑦)854±0.4726±0.1 Berkes, P. and Wiskott, L. On the analysis and in- Table 7. Analysis of the discriminative information con terpretation of inlhoinogeneous quadratic forMs as tained in each layer of feature maps within our ImageNet- receptive fields. Nearal Computation, 2006 pretrained convnet. We train either a linear SVM or soft max on features from different layers(as indicated in brack Bo, L, Ren, X, and Fox, D. multipath sparse coding ets) from the convnet. Higher layers generally produce using hierarchical matching pursuit. In CVPR, 2013 more discriminative features
