SSD: Single Shot MultiBox Detector


Fig. 1. SSD framework (each default box carries localization offsets loc: Δ(cx, cy, w, h) and confidences conf: (c1, c2, ..., cp)). (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8×8 and 4×4 in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories ((c1, c2, ..., cp)). At training time, we first match these default boxes to the ground truth boxes. For example, we have matched two default boxes with the cat and one with the dog, which are treated as positives and the rest as negatives. The model loss is a weighted sum between localization loss (e.g. Smooth L1 [6]) and confidence loss (e.g. Softmax).

These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs. accuracy trade-off. Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC, and are compared to a range of recent state-of-the-art approaches.

2 The Single Shot Detector (SSD)

This section describes our proposed SSD framework for detection (Sect. 2.1) and the associated training methodology (Sect. 2.2). Afterwards, Sect. 3 presents dataset-specific model details and experimental results.

2.1 Model

The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network.
We then add auxiliary structure to the network to produce detections with the following key features. We use the VGG-16 network as a base, but other networks should also produce good results.

Fig. 2. A comparison between two single shot detection models: SSD and YOLO [5]. Our SSD model adds several feature layers to the end of a base network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. SSD with a 300×300 input size significantly outperforms its 448×448 YOLO counterpart in accuracy on VOC2007 test while also improving the speed.

Multi-scale feature maps for detection. We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf. Overfeat [4] and YOLO [5] that operate on a single scale feature map).

Convolutional predictors for detection. Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size m×n with p channels, the basic element for predicting parameters of a potential detection is a 3×3×p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates.
At each of the m×n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf. the architecture of YOLO [5] that uses an intermediate fully connected layer instead of a convolutional filter for this step).

Default boxes and aspect ratios. We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c+4)k filters that are applied around each location in the feature map, yielding (c+4)kmn outputs for an m×n feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN [2], however we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.

2.2 Training

The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO [5] and for the region proposal stage of Faster R-CNN [2] and MultiBox [7]. Once this assignment is determined, the loss function and back propagation are applied end-to-end.
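The (c+4)k filters per location can be tallied with a few lines of arithmetic. A minimal sketch, where the feature-map sizes and boxes-per-location follow the SSD300 layout described in Sect. 3.1, and `predictor_outputs` is an illustrative name rather than anything from the released code:

```python
def predictor_outputs(m, n, k, c):
    """Outputs for an m x n feature map with k default boxes per location
    and c class scores per box: (c + 4) filters per box -> (c+4)*k*m*n."""
    return (c + 4) * k * m * n

# SSD300 on PASCAL VOC: c = 21 (20 classes + background).
feature_maps = [  # (m, n, k) per source layer
    (38, 38, 4),  # conv4_3
    (19, 19, 6),  # conv7 (fc7)
    (10, 10, 6),  # conv8_2
    (5, 5, 6),    # conv9_2
    (3, 3, 4),    # conv10_2
    (1, 1, 4),    # conv11_2
]

total_boxes = sum(k * m * n for m, n, k in feature_maps)
total_outputs = sum(predictor_outputs(m, n, k, 21) for m, n, k in feature_maps)
print(total_boxes)  # 8732 default boxes, the count that appears in Table 3
```
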
Training also involves choosing the set of default boxes and scales for detection, as well as the hard negative mining and data augmentation strategies.

Matching Strategy. During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box we are selecting from default boxes that vary over location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best Jaccard overlap (as in MultiBox [7]). Unlike MultiBox, we then match default boxes to any ground truth with Jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.

Training Objective. The SSD training objective is derived from the MultiBox objective [7,8] but is extended to handle multiple object categories. Let x^p_ij = {1, 0} be an indicator for matching the i-th default box to the j-th ground truth box of category p. In the matching strategy above, we can have Σ_i x^p_ij ≥ 1. The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):

    L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))    (1)

where N is the number of matched default boxes, and the localization loss is the Smooth L1 loss [6] between the predicted box (l) and the ground truth box (g) parameters. Similar to Faster R-CNN [2], we regress to offsets for the center of the bounding box and for its width and height. Our confidence loss is the softmax loss over multiple classes confidences (c), and the weight term α is set to 1 by cross validation.

Choosing Scales and Aspect Ratios for Default Boxes. To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards.
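The two-stage matching above (best overlap per ground truth, then a 0.5 threshold over the remaining default boxes) can be sketched as follows; `jaccard` and `match` are illustrative names, and boxes are plain (xmin, ymin, xmax, ymax) tuples rather than the paper's actual data structures:

```python
def jaccard(a, b):
    """Intersection-over-union (Jaccard overlap) of two corner-form boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def match(defaults, gts, threshold=0.5):
    """Return {default_box_index: ground_truth_index} for the positives."""
    matches = {}
    # Stage 1: each ground truth gets its best-overlapping default box.
    for j, g in enumerate(gts):
        best = max(range(len(defaults)), key=lambda i: jaccard(defaults[i], g))
        matches[best] = j
    # Stage 2: any remaining default box with overlap above the threshold
    # is also treated as a positive for that ground truth.
    for i, d in enumerate(defaults):
        if i in matches:
            continue
        for j, g in enumerate(gts):
            if jaccard(d, g) > threshold:
                matches[i] = j
    return matches
```

Unmatched default boxes become the negatives that hard negative mining later filters.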
However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality, because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) which are used in the framework. In practice, we can use many more with small computational overhead.

We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as:

    s_k = s_min + ((s_max - s_min) / (m - 1)) (k - 1),  k ∈ [1, m]    (2)

where s_min is 0.2 and s_max is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced. We impose different aspect ratios for the default boxes, and denote them as a_r ∈ {1, 2, 3, 1/2, 1/3}. We can compute the width (w^a_k = s_k √a_r) and height (h^a_k = s_k / √a_r) for each default box. For the aspect ratio of 1, we also add a default box whose scale is s'_k = √(s_k s_{k+1}), resulting in 6 default boxes per feature map location. We set the center of each default box to ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), where |f_k| is the size of the k-th square feature map, i, j ∈ [0, |f_k|). In practice, one can also design a distribution of default boxes to best fit a specific dataset.

By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes. For example, in Fig.
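Equation (2) and the tiling rules above can be turned into a short generator. A sketch under the paper's stated s_min = 0.2 and s_max = 0.9; `scales` and `default_boxes` are illustrative names, not the released implementation:

```python
import math

def scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k for each of the m feature maps, per Eq. (2)."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

def default_boxes(fk, sk, sk_next, ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """All (cx, cy, w, h) default boxes for a square |f_k| x |f_k| feature map,
    in coordinates normalized to [0, 1]."""
    boxes = []
    for i in range(fk):
        for j in range(fk):
            cx, cy = (i + 0.5) / fk, (j + 0.5) / fk
            for ar in ratios:
                boxes.append((cx, cy, sk * math.sqrt(ar), sk / math.sqrt(ar)))
            # Extra box for aspect ratio 1, with scale s'_k = sqrt(s_k * s_{k+1}).
            sp = math.sqrt(sk * sk_next)
            boxes.append((cx, cy, sp, sp))
    return boxes

s = scales(6)            # lowest layer 0.2, highest 0.9
boxes = default_boxes(4, s[2], s[3])
print(len(boxes))        # 4 * 4 * 6 = 96 boxes for a 4x4 map
```
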
1, the dog is matched to a default box in the 4×4 feature map, but not to any default boxes in the 8×8 feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.

Hard Negative Mining. After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and a more stable training.

Data Augmentation. To make the model more robust to various input object sizes and shapes, each training image is randomly sampled by one of the following options:

- Use the entire original input image.
- Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
- Randomly sample a patch.

The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio is between 1/2 and 2. We keep the overlapped part of the ground truth box if the center of it is in the sampled patch. After the aforementioned sampling step, each sampled patch is resized to a fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [13].

3 Experimental Results

Base Network. Our experiments are all based on VGG16 [14], which is pre-trained on the ILSVRC CLS-LOC dataset [15]. Similar to DeepLab-LargeFOV [16], we convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, change pool5 from 2×2-s2 to 3×3-s1, and use the atrous algorithm to fill the "holes". We remove all the dropout layers and the fc8 layer.
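The 3:1 hard negative mining rule amounts to a sort-and-truncate over per-box confidence losses. A minimal sketch with illustrative names (not the released implementation):

```python
def mine_hard_negatives(conf_losses, is_positive, neg_pos_ratio=3):
    """conf_losses: per-default-box confidence loss; is_positive: match mask.
    Returns indices of the negatives kept for the loss, hardest first,
    so that negatives : positives <= neg_pos_ratio : 1."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)  # hardest first
    return negatives[: neg_pos_ratio * num_pos]

losses = [0.1, 2.3, 0.4, 1.7, 0.2, 0.9]
positive = [True, False, False, False, False, False]
print(mine_hard_negatives(losses, positive))  # the 3 hardest negatives: [1, 3, 5]
```
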
We fine-tune the resulting model using SGD with initial learning rate 10^-3, 0.9 momentum, 0.0005 weight decay, and batch size 32. The learning rate decay policy is slightly different for each dataset, and we will describe details later. The full training and testing code is built on Caffe [17] and is open source.

3.1 PASCAL VOC2007

On this dataset, we compare against Fast R-CNN [6] and Faster R-CNN [2] on VOC2007 test (4952 images). All methods use the same pre-trained VGG16 network.

Figure 2 shows the architecture details of the SSD300 model. We use conv4_3, conv7 (fc7), conv8_2, conv9_2, conv10_2, and conv11_2 to predict both locations and confidences. (For the SSD512 model, we add an extra conv12_2 for prediction.) We initialize the parameters for all the newly added convolutional layers with the "xavier" method [18]. For conv4_3, conv10_2 and conv11_2, we only associate 4 default boxes at each feature map location, omitting aspect ratios of 1/3 and 3. For all other layers, we put 6 default boxes as described in Sect. 2.2. Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20, and learn the scale during back propagation. We use the 10^-3 learning rate for 40k

Table 1. PASCAL VOC2007 test detection results. Both Fast and Faster R-CNN use input images whose minimum dimension is 600. The two SSD models have exactly the same settings except that they have different input sizes (300×300 vs. 512×512).
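The conv4_3 L2 normalization described above rescales the feature vector at each spatial location to a fixed norm (20), with a scale that is then learned during back propagation. A NumPy sketch of the forward rescaling only, with the fixed initial scale as an assumption about initialization:

```python
import numpy as np

def l2_normalize(feat, init_scale=20.0):
    """feat: (channels, h, w) feature map. Rescale each spatial location's
    channel vector to unit norm, then multiply by a per-channel scale
    (initialized to 20 here; learned during training in the real model)."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True)) + 1e-10
    scale = np.full((feat.shape[0], 1, 1), init_scale)
    return scale * feat / norm

x = np.random.randn(512, 38, 38)   # conv4_3-shaped activations for SSD300
y = l2_normalize(x)
# every spatial location now has feature norm ~= 20
print(np.allclose(np.sqrt((y ** 2).sum(axis=0)), 20.0))
```
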
It is obvious that larger input size leads to better results, and more data always helps. Data: "07": VOC2007 trainval; "07+12": union of VOC2007 and VOC2012 trainval; "07+12+COCO": first train on COCO trainval35k, then fine-tune on 07+12.

Method        data         mAP
Fast R-CNN    07           66.9
Fast R-CNN    07+12        70.0
Faster R-CNN  07           69.9
Faster R-CNN  07+12        73.2
Faster R-CNN  07+12+COCO   78.8
SSD300        07           68.0
SSD300        07+12        74.1
SSD300        07+12+COCO   79.6
SSD512        07           71.6
SSD512        07+12        76.8
SSD512        07+12+COCO   81.5

(Per-category AP columns are omitted.)

iterations, then we continue training for 10k iterations with 10^-4 and 10^-5. When training on VOC2007 trainval, Table 1 shows that our low resolution SSD300 model is already more accurate than Fast R-CNN. When we train SSD on a larger 512×512 input image, it is even more accurate, surpassing Faster R-CNN by 1.7% mAP. If we train SSD with more (i.e. 07+12) data, we observe that SSD300 is already better than Faster R-CNN by 0.9% and that SSD512 is 3.6% better. If we take models trained on COCO trainval35k as described in Sect. 3.4 and fine-tune them on the 07+12 dataset with SSD512, we achieve the best results: 81.5% mAP.

To understand the performance of our two SSD models in more detail, we used the detection analysis tool from [19].
Figure 3 shows that SSD can detect various object categories with high quality (large white area). The majority of its confident detections are correct. The recall is around 85-90%, and is much higher with the "weak" (0.1 Jaccard overlap) criteria. Compared to R-CNN [20], SSD has less localization error, indicating that SSD can localize objects better because it directly learns to regress the object shape and classify object categories instead of using two decoupled steps. However, SSD has more confusions with similar object categories (especially for animals), partly because we share locations for multiple categories.

Figure 4 shows that SSD is very sensitive to the bounding box size. In other words, it has much worse performance on smaller objects than bigger objects. This is not surprising because those small objects may not even have any information at the very top layers. Increasing the input size (e.g. from 300×300 to 512×512) can help improve detecting small objects, but there is still a lot of room to improve. On the positive side, we can clearly see that SSD performs really well on large objects. And it is very robust to different object aspect ratios, because we use default boxes of various aspect ratios per feature map location.

Fig. 3. Visualization of performance for SSD512 on animals, vehicles, and furniture from VOC2007 test using [19].
The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). The bottom row shows the distribution of top-ranked false positive types.

Fig. 4. Sensitivity and impact of different object characteristics on VOC2007 test set using [19]. The plot on the left shows the effects of BBox Area per category, and the right plot shows the effect of Aspect Ratio.

3.2 Model Analysis

To understand SSD better, we carried out controlled experiments to examine how each component affects performance. For all the experiments, we use the same settings and input size (300×300), except for specified changes to the settings or component(s).

Table 2. Effects of various design choices and components on SSD performance (VOC2007 test mAP: 65.5 without the extensive data augmentation, 71.6 and 73.7 with fewer default box shapes, 74.2 without atrous, 74.3 for the full SSD300).

Table 3. Effects of using multiple output layers. With all six source layers (conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2), SSD300 reaches 74.3 mAP with 8732 boxes.

Data Augmentation is Crucial. Fast and Faster R-CNN use the original image and the horizontal flip to train. We use a more extensive sampling strategy, similar to YOLO [5]. Table 2 shows that we can improve 8.8% mAP with this sampling strategy. We do not know how much our sampling strategy will benefit Fast and Faster R-CNN, but they are likely to benefit less because they use a feature pooling step during classification that is relatively robust to object translation by design.

More Default Box Shapes is Better. As described in Sect.
2.2, by default we use 6 default boxes per location. If we remove the boxes with 1/3 and 3 aspect ratios, the performance drops by 0.6%. By further removing the boxes with 1/2 and 2 aspect ratios, the performance drops another 2.1%. Using a variety of default box shapes seems to make the task of predicting boxes easier for the network.

Atrous is Faster. As described in Sect. 3, we used the atrous version of a subsampled VGG16, following DeepLab-LargeFOV [16]. If we use the full VGG16, keeping pool5 with 2×2-s2, not subsampling parameters from fc6 and fc7, and adding conv5_3 for prediction, the result is about the same while the speed is about 20% slower.

Multiple Output Layers at Different Resolutions is Better. A major contribution of SSD is using default boxes of different scales on different output layers. To measure the advantage gained, we progressively remove layers and
