ThunderNet: Towards Real-time Generic Object Detection


Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which hinders them from real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks in previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representation, we design two efficient architecture blocks, Context Enhancement Module and Spatial Attention Module. Finally, we investigate the balance between the input resolution, the backbone, and the detection head. Compared with lightweight one-stage detectors, ThunderNet achieves superior performance with only 40% of the computational cost on PASCAL VOC and COCO benchmarks. Without bells and whistles, our model runs at 24.1 fps on an ARM-based device. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Code will be released for paper reproduction.
Region Proposal Network (RPN) to generate region proposals instead of pre-handled proposals. R-FCN [4] designs a fully convolutional architecture which shares computation on the entire image. On the other hand, one-stage detectors such as SSD [19] and YOLO [24, 25, 26] achieve real-time inference on GPU with very competitive accuracy. RetinaNet [17] proposes focal loss to address the foreground-background class imbalance and achieves significant accuracy improvements. In this work, we present a two-stage detector which focuses on efficiency.

Real-time generic object detection. Real-time object detection is another important problem for CNN-based detectors. Commonly, one-stage detectors are regarded as the key to real-time detection. For instance, YOLO [24, 25, 26] and SSD [19] run in real time on GPU. When coupled with small backbone networks, lightweight one-stage detectors, such as MobileNet-SSD [11], MobileNetV2-SSDLite [28], Pelee [31] and Tiny-DSOD [13], achieve inference on mobile devices at low frame rates. For two-stage detectors, Light-Head R-CNN [14] utilizes a light detection head and runs at over 100 fps on GPU. This raises a question: are two-stage detectors better than one-stage detectors in real-time detection? In this paper, we present the effectiveness of two-stage detectors in real-time detection. Compared with prior lightweight one-stage detectors, ThunderNet achieves a better balance between accuracy and efficiency.

Backbone networks for detection. Modern CNN-based detectors typically adopt image classification networks [30, 10, 12] as the backbones. FPN [16] exploits the inherent multi-scale, pyramidal hierarchy of CNNs to construct feature pyramids. Lightweight detectors also benefit from the recent progress in small networks, such as MobileNet [11, 28] and ShuffleNet [33, 20]. However, image classification and object detection require different properties of networks. Therefore, simply transferring classification networks to object detection is not optimal. For this reason, DetNet [15] designs a backbone specifically for object detection. Recent lightweight detectors [31, 13] also design specialized backbones. However, this area is still not well studied. In this work, we investigate the drawbacks of prior lightweight backbones and present a lightweight backbone for real-time detection task.

3. ThunderNet

In this section, we present the details of ThunderNet. Our design mainly focuses on efficiency, but our model still achieves superior accuracy.

3.1. Backbone Part

Input Resolution. The input resolution of two-stage detectors is usually very large, e.g., FPN [16] uses input images of 800× pixels. It brings several advantages but involves enormous computational cost as well. To improve the inference speed, ThunderNet utilizes the input resolution of 320×320 pixels. Moreover, in practice, we observe that the input resolution should match the capability of the backbone. A small backbone with large inputs and a large backbone with small inputs are both not optimal. Details are discussed in Sec. 4.4.1.

Layer    Output size   SNet49           SNet146          SNet535
Input    224×224
Conv1    112×112       3×3, 24, s2      3×3, 24, s2      3×3, 48, s2
Pool     56×56         3×3 maxpool, s2  (all models)
Stage2   28×28         [60, s2]         [132, s2]        [248, s2]
                       [60, s1] ×3      [132, s1] ×3     [248, s1] ×3
Stage3   14×14         [120, s2]        [264, s2]        [496, s2]
                       [120, s1] ×7     [264, s1] ×7     [496, s1] ×7
Stage4   7×7           [240, s2]        [528, s2]        [992, s2]
                       [240, s1] ×3     [528, s1] ×3     [992, s1] ×3
Conv5    7×7           1×1, 512         -                -
Pool     1×1           global avg pool  (all models)
FC                     1000-d           (all models)
FLOPs                  49M              146M             535M

Table 1. Architecture of the SNet backbone networks. SNet uses ShuffleNetV2 basic blocks but replaces all 3×3 depthwise convolutions with 5×5 depthwise convolutions.

Backbone Networks. Backbone networks provide basic feature representation of the input image and have great influence on both accuracy and efficiency. CNN-based detectors usually use classification networks transferred from ImageNet classification as the backbone. However, as image classification and object detection require different properties from the backbone, simply transferring classification networks to object detection is not optimal.

Receptive field: The receptive field size plays an important role in CNN models. CNNs can only capture information inside the receptive field. Thus, a large receptive field can leverage more context information and encode long-range relationship between pixels more effectively. This is crucial for the localization subtask, especially for the localization of large objects. Previous works [23, 14] have also demonstrated the effectiveness of the large receptive field in semantic segmentation and object detection.

Early-stage and late-stage features: In the backbone, early-stage feature maps are larger with low-level features which describe spatial details, while late-stage feature maps are smaller with high-level features which are more discriminative. Generally, localization is sensitive to low-level features while high-level features are crucial for classification. In practice, we observe that localization is more difficult than classification for larger backbones, which indicates that early-stage features are more important. And the weak representation power restricts the accuracy in both subtasks for extremely tiny backbones, suggesting that both early-stage and late-stage features are crucial at this level.

The designs of prior lightweight backbones violate the aforementioned factors: ShuffleNetV1/V2 [33, 20] have restricted receptive field (121 pixels vs. 320 pixels of input), ShuffleNetV2 [20] and MobileNetV2 [28] lack early-stage features, and Xception [3] suffer from the insufficient high-level features under small computational budgets.

Based on these insights, we start from ShuffleNetV2, and build a lightweight backbone named SNet for real-time detection.
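The receptive-field argument above can be made concrete with the standard recurrence for a plain stack of convolutions: each layer widens the field by (kernel - 1) times the current spacing between output positions, and each stride multiplies that spacing. The sketch below is only an illustration under stated assumptions: the layer lists `STEM`, `BODY_3X3` and `BODY_5X5` are hypothetical toy stacks, not the ShuffleNet or SNet architectures, and the code is not meant to reproduce the 121-pixel figure quoted above.

```python
from typing import List, Tuple

Layer = Tuple[int, int]  # (kernel_size, stride); hypothetical toy description of a layer


def receptive_field(layers: List[Layer]) -> int:
    """Theoretical receptive field, in input pixels, of a sequential conv/pool stack."""
    rf = 1    # one output position initially covers one input pixel
    jump = 1  # distance in input pixels between neighbouring output positions
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # a k-wide kernel adds (k - 1) steps of the current spacing
        jump *= stride             # striding spreads later positions further apart
    return rf


# Toy stacks (assumptions for illustration only): two stride-2 stem layers
# followed by four stride-1 layers with either 3x3 or 5x5 kernels.
STEM: List[Layer] = [(3, 2), (3, 2)]
BODY_3X3: List[Layer] = [(3, 1)] * 4
BODY_5X5: List[Layer] = [(5, 1)] * 4

if __name__ == "__main__":
    print("3x3 body:", receptive_field(STEM + BODY_3X3))
    print("5x5 body:", receptive_field(STEM + BODY_5X5))
```

With the same depth and strides, swapping 3×3 for 5×5 kernels in the body doubles the contribution of every body layer to the receptive field, which is the effect the backbone discussion relies on. A real network with branches and channel splits needs a more careful analysis than this sequential model.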
We present three SNet backbones: SNet49 for faster inference, SNet535 for better accuracy, and SNet146 for a better speed/accuracy trade-off. First, we replace all 3×3 depthwise convolutions in ShuffleNetV2 with 5×5 depthwise convolutions. In practice, 5×5 depthwise convolutions provide similar runtime speed to 3×3 counterparts while effectively enlarging the receptive field (from 121 to 193 pixels). In SNet146 and SNet535, we remove Conv5 and add more channels in early stages. This design generates more low-level features without additional computational cost. In SNet49, we compress Conv5 to 512 channels instead of removing it and increase the channels in the early stages for a better balance between low-level and high-level features. If we remove Conv5, the backbone cannot encode adequate information. But if the 1024-d Conv5 layer is preserved, the backbone suffers from limited low-level features. Table 1 shows the overall architecture of the backbones. Besides, the last output feature maps of Stage3 and Stage4 (Conv5 for SNet49) are denoted as $C_4$ and $C_5$.

3.2. Detection Part

Compressing RPN and Detection Head. Two-stage detectors usually adopt large RPN and a heavy detection head. Although Light-Head R-CNN [14] uses a lightweight detection head, it is still too heavy when coupled with small backbones and induces imbalance between the backbone and the detection part. This imbalance not only leads to redundant computation but increases the risk of overfitting.

To address this issue, we compress RPN by replacing the original 256-channel 3×3 convolution with a 5×5 depthwise convolution and a 256-channel 1×1 convolution. We increase the kernel size to enlarge the receptive field and encode more context information. Five scales {32², 64², 128², 256², 512²} and five aspect ratios {1:2, 3:4, 1:1, 4:3, 2:1} are used to generate anchor boxes. Other hyperparameters remain the same as in [14].

In the detection head, Light-Head R-CNN generates a thin feature map with $\alpha \times p \times p$ channels before RoI warping, where $p = 7$ is the pooling size and $\alpha = 10$. As the backbones and the input images are smaller in ThunderNet, we further narrow the feature map by halving $\alpha$ to 5 to eliminate redundant computation. For RoI warping, we opt for PSRoI align as it squeezes the number of channels to $\alpha$.

As the RoI feature from PSRoI align is merely 245-d, we apply a 1024-d fully-connected (fc) layer in R-CNN subnet. As demonstrated in Sec. 4.4.3, this design further reduces the computational cost of R-CNN subnet without sacrificing accuracy. Besides, due to the small feature maps, we reduce the number of RoIs for testing as discussed in Sec. 4.1.

Context Enhancement Module. Light-Head R-CNN applies Global Convolutional Network (GCN) [23] to generate the thin feature map. It significantly increases the receptive field but involves enormous computational cost. Coupled with SNet146, GCN requires 2× the FLOPs needed by the backbone (596M vs. 298M). For this reason, we decide to abandon this design in ThunderNet.

However, the network suffers from the small receptive field and fails to encode sufficient context information without GCN. A common technique to address this issue is Feature Pyramid Network (FPN) [16]. However, prior FPN structures [16, 6, 13, 26] involve many extra convolutions and multiple detection branches, which increases the computational cost and induces enormous runtime latency.

For this reason, we design an efficient Context Enhancement Module (CEM) to enlarge the receptive field. The key idea of CEM is to aggregate multi-scale local context information and global context information to generate more discriminative features. In CEM, the feature maps from three scales are merged: $C_4$, $C_5$ and $C_{glb}$. $C_{glb}$ is the global context feature vector obtained by applying a global average pooling on $C_5$. We then apply a 1×1 convolution on each feature map to squeeze the number of channels to $\alpha \times p \times p = 245$. Afterwards, $C_5$ is upsampled by 2× and $C_{glb}$ is broadcast so that the spatial dimensions of the three feature maps are equal. At last, the three generated feature maps are aggregated. By leveraging both local and global context, CEM effectively enlarges the receptive field and refines the representation ability of the thin feature map. Compared with prior FPN structures, CEM involves only two 1×1 convolutions and a fc layer, which is more computation-friendly. Fig. 3 illustrates the structure of this module.

Figure 3. Structure of Context Enhancement Module (CEM). CEM combines feature maps from three scales and encodes more discriminative features.

Spatial Attention Module. During RoI warping, we expect the features in the background regions to be small and the foreground counterparts to be high. However, compared with large models, as ThunderNet utilizes lightweight backbones and small input images, it is more difficult for the network itself to learn a proper feature distribution.

For this reason, we design a computation-friendly Spatial Attention Module (SAM) to explicitly re-weight the feature map before RoI warping over the spatial dimensions. The key idea of SAM is to use the knowledge from RPN to refine the feature distribution of the feature map. RPN is trained to recognize foreground regions under the supervision of ground truths. Therefore, the intermediate features in RPN can be used to distinguish foreground features from background features. SAM accepts two inputs: the intermediate feature map from RPN $F^{RPN}$ and the thin feature map from CEM $F^{CEM}$. The output of SAM $F^{SAM}$ is defined as:

$$F^{SAM} = F^{CEM} \cdot \mathrm{sigmoid}(\theta(F^{RPN})). \qquad (1)$$

Here $\theta(\cdot)$ is a dimension transformation to match the number of channels in both feature maps. The sigmoid function is used to constrain the values within [0, 1]. At last, $F^{CEM}$ is re-weighted by the generated feature map for better feature distribution. For computational efficiency, we simply apply a 1×1 convolution as $\theta(\cdot)$, so the computational cost of SAM is negligible. Fig. 4 shows the structure of SAM.

Figure 4. Structure of Spatial Attention Module (SAM). SAM leverages the information learned in RPN to refine the feature distribution of the feature map from Context Enhancement Module. The feature map is then used for RoI warping.

SAM has two functions. The first one is to refine the feature distribution by strengthening foreground features and suppressing background features. The second one is to stabilize the training of RPN, as SAM enables extra gradient flow from R-CNN subnet to RPN:

$$\frac{\partial L}{\partial F_i^{RPN}} = \frac{\partial L_{RPN}}{\partial F_i^{RPN}} + \sum_{\forall j} \frac{\partial L_{R\text{-}CNN}}{\partial F_j^{SAM}} \cdot \frac{\partial F_j^{SAM}}{\partial F_i^{RPN}}. \qquad (2)$$

As a result, RPN receives additional supervision from R-CNN subnet, which helps the training of RPN.

4. Experiments

In this section, we evaluate the effectiveness of ThunderNet on PASCAL VOC [5] and COCO [18] benchmarks. Then we conduct ablation studies to evaluate our design.

4.1. Implementation Details

Our detectors are trained end-to-end on 4 GPUs using synchronized SGD with a weight decay of 0.0001 and a momentum of 0.9. The batch size is set to 16 images per GPU. Each image has 2000/200 RoIs for training/testing. For efficiency, the input resolution of 320×320 pixels is used instead of 600× or 800× pixels in common large two-stage detectors. Multi-scale training with {240, 320, 480} pixels is adopted. As the input resolution is small, we use heavy data augmentation [19]. The networks are trained for 62.5K iterations on VOC dataset and 375K iterations on COCO dataset. The learning rate starts from 0.01 and decays by a factor of 0.1 at 50% and 75% of the total iterations. Online hard example mining [29] is adopted and Soft-NMS [1] is used for post-processing. Cross-GPU Batch Normalization (CGBN) [22] is used to learn batch normalization statistics.

4.2. Results on PASCAL VOC

PASCAL VOC dataset consists of natural images drawn from 20 classes. The networks are trained on the union set of VOC 2007 trainval and VOC 2012 trainval, and we report single-model results on VOC 2007 test. The results are exhibited in Table 2.

ThunderNet surpasses prior state-of-the-art lightweight one-stage detectors. ThunderNet with SNet49 outperforms MobileNet-SSD with merely 21% of the FLOPs, while the SNet146-based model surpasses Tiny-DSOD by 2.9 mAP with about 43% of the FLOPs. Moreover, ThunderNet with SNet535 performs better than Tiny-DSOD by 6.5 mAP under similar computational cost.

Furthermore, ThunderNet achieves superior results to state-of-the-art large object detectors such as YOLOv2 [25], SSD300* [19], SSD321 [19] and R-FCN [4], and is on a par with DSSD321 [6], but reduces the computational cost by orders of magnitude. We note that the backbone of ThunderNet is significantly weaker and smaller than those of the large detectors. It demonstrates that ThunderNet achieves a much better trade-off between accuracy and efficiency.

4.3. Results on MS COCO

MS COCO dataset consists of natural images from 80 object categories. Following common practice [16, 14], we use trainval35k for training, minival for validation and report single-model results on test-dev.

As shown in Table 3, ThunderNet with SNet49 achieves MobileNet-SSD level accuracy with 22% of the FLOPs. ThunderNet with SNet146 surpasses MobileNet-SSD [11], MobileNet-SSDLite [28] and Pelee [31] with less than 40% of the computational cost. It is noteworthy that our approach achieves considerably better AP75, which suggests our model performs better in localization. This is consistent with our initial motivation to design two-stage real-time detectors. Compared with Tiny-DSOD [13], ThunderNet achieves better AP but worse AP50 with 42% of the FLOPs. We conjecture that deep supervision and feature pyramid in Tiny-DSOD contribute to better classification accuracy.
However, ThunderNet is still better in localization.

Model               Backbone           Input      MFLOPs   mAP
YOLOv2 [25]         Darknet-19         416×416    17400    76.8
SSD300* [19]        VGG-16             300×300    31750    77.5
DSSD321 [6]         ResNet-101 + FPN   321×321    21200    78.6
R-FCN [4]           -                  600×1000   -        -
Tiny-YOLO [25]      Tiny Darknet       416×416    -        57.1
D-YOLO [21]         Tiny Darknet       416×416    2090     67.6
MobileNet-SSD [11]  -                  300×300    -        68.0
Pelee [31]          PeleeNet           304×304    1210     70.9
Tiny-DSOD [13]      DDB-Net + D-FPN    300×300    1060     -
ThunderNet (ours)   SNet49             320×320    -        70.1
ThunderNet (ours)   SNet146            320×320    -        75.1
ThunderNet (ours)   SNet535            320×320    -        78.6

Table 2. Evaluation results on VOC 2007 test. ThunderNet surpasses competing models with significantly less computational cost.

Model                     Backbone            Input      MFLOPs   AP     AP50   AP75
YOLOv2 [25]               Darknet-19          -          -        -      -      -
SSD300* [19]              VGG-16              300×300    35200    -      43.1   25.8
SSD321 [6]                ResNet-101          321×321    -        -      45.4   -
DSSD321 [6]               ResNet-101 + FPN    -          -        28.0   -      -
Light-Head R-CNN [14]     ShuffleNetV2* [20]  800×1200   -        -      -      -
MobileNet-SSD [11]        MobileNet           300×300    -        19.3   -      -
MobileNet-SSDLite [28]    MobileNet           320×320    1300     22.2   -      -
MobileNetV2-SSDLite [28]  MobileNetV2         320×320    800      -      -      -
Pelee [31]                PeleeNet            -          -        -      -      22.9
Tiny-DSOD [13]            DDB-Net + D-FPN     300×300    1120     23.2   40.4   22.8
ThunderNet (ours)         SNet49              320×320    -        19.1   33.7   19.6
ThunderNet (ours)         SNet146             320×320    473      23.6   40.2   24.5
ThunderNet (ours)         SNet535             320×320    -        28.0   -      -

Table 3. Evaluation results on COCO test-dev. ThunderNet with SNet49 achieves MobileNet-SSD level accuracy with 22% of the FLOPs. ThunderNet with SNet146 achieves superior accuracy to prior lightweight one-stage detectors with merely 40% of the FLOPs. ThunderNet with SNet535 rivals large detectors with significantly less computational cost.

Figure 5. Examples visualization on COCO test-dev: (a) ThunderNet with SNet49, (b) ThunderNet with SNet146, (c) ThunderNet with SNet535.

ThunderNet with SNet535 achieves significantly better detection accuracy under comparable computational cost. As shown in Table 3, ThunderNet surpasses other one-stage counterparts by at least 4.8 AP, 5.8 AP50 and 6.7 AP75. The gap in AP75 is larger than the gap in AP50, which means our model provides more accurate bounding boxes than other detectors. This further demonstrates that two-stage detectors are superior to one-stage detectors in real-time detection task. Fig. 5 visualizes several examples on COCO test-dev.

We also compare ThunderNet with large one-stage detectors. ThunderNet with SNet146 surpasses YOLOv2 with 37× fewer FLOPs. And ThunderNet with SNet535 significantly outperforms YOLOv2 and SSD300 [19], and rivals SSD321 [6] and DSSD321 [6]. It suggests that ThunderNet is not only efficient but highly accurate.

4.4. Ablation Experiments

4.4.1 Input Resolution

We first explore the relationship between the input resolution and the backbone. Table 4 reveals that large backbones with small images and small backbones with large images are both not optimal. There is a trade-off between the two factors. On the one hand, small images lead to low-resolution feature maps and induce severe loss of detail features. It is hard to be remedied by simply increasing the capacity of the backbones. On the other hand, small backbones are too weak to encode sufficient information from large images. The backbone and the input images should match for a better balance between the representation ability and the resolution of the feature maps.

Backbone   Input      MFLOPs   AP
SNet146    224×224    267      18.7
-          128×128    -        -
SNet49     480×480    506      22.0
SNet146    320×320    -        23.6
SNet535    192×192    -        -

Table 4. Evaluation of different input resolutions on COCO test-dev. Large backbones with small images and small backbones with large images are both not optimal.

4.4.2 Backbone Networks

We then evaluate the design of the backbones. SNet146 and SNet49 are used as the baselines. SNet146 achieves 32.5% top-1 error on ImageNet classification and 23.6 AP on COCO test-dev (Table 5(a)), while SNet49 achieves 39.7% top-1 error and 19.1 AP (Table 5(e)).

     Backbone                        MFLOPs   Top-1 Err.   AP
(a)  SNet146                         146      32.5         23.6
(b)  SNet146 + 3×3 DWConv            145      32.7         22.7
(c)  SNet146 + double 3×3 DWConv     -        -            -
(d)  SNet146 + 1024-d Conv5          -        32.3         23.2
(e)  SNet49                          -        39.7         19.1
(f)  SNet49 + No Conv5               -        40.8         18.2
(g)  SNet49 + 1024-d Conv5           -        38.9         18.8

Table 5. Evaluation of different backbones on ImageNet classification and COCO test-dev. DWConv: depthwise convolution.

5×5 Depthwise Convolutions. We evaluate the effectiveness of 5×5 depthwise convolutions on SNet146. We first replace all 5×5 depthwise convolutions with 3×3 depthwise convolutions. For fair comparison, the channels from Stage2 to Stage4 are slightly increased to maintain the computational cost unchanged. This model performs worse on both image classification (by 0.2%) and object detection (by 0.9 AP) (Table 5(b)). Compared with 3×3 depthwise convolutions, 5×5 depthwise convolutions considerably increase the receptive fields, which helps in both tasks.

We then add another 3×3 depthwise convolution before the first 1×1 convolution in all building blocks as in ShuffleNetV2* [20]. The number of channels is kept unchanged as the baseline. This model is comparable on image classification, but slightly worse on object detection (by 0.3 AP) (Table 5(c)). As this model and SNet146 have the same receptive fields theoretically, we conjecture that 5×5 depthwise convolutions can provide larger valid receptive fields, which is especially crucial in object detection.

Early-Stage and Late-stage Features. To investigate the trade-off between early-stage and late-stage features, we first add a 1024-channel Conv5 in SNet146. The channels in the early stages are reduced accordingly. This model slightly improves the top-1 error, but reduces AP by 0.4 (Table 5(d)). A wide Conv5 generates more discriminative features, which improves the classification accuracy. However, object detection focuses on both classification and localization. Increasing the channels in early stages encodes more detail information, which is beneficial for localization.

For SNet49, we first remove Conv5 in SNet49 and increase the channels from Stage2 to Stage4. Table 5(f) shows that both the classification and the detection performance suffer from severe degradation. Removing Conv5 cuts the output channels of the backbone by half, which hinders the model from learning adequate information.

We then extend Conv5 to 1024 channels as in the original ShuffleNetV2. The early-stage channels are compressed to maintain the same overall computational cost. This model surpasses SNet49 on image classification by 0.8%, but performs worse on object detection (Table 5(g)). By leveraging a wide Conv5, this model benefits from more high-level features in image classification. However, it suffers from the lack of low-level features in object detection. It further demonstrates the differences between image classification and object detection.

Comparison with Lightweight Backbones. At last, we further compare SNet with other lightweight backbones. Table 6 shows that SNet146 outperforms Xception [3], MobileNetV2 [28], and ShuffleNetV1/V2/V2* [33, 20] on object detection under similar computational cost. These results further demonstrate the effectiveness of our design.

Backbone             MFLOPs   Top-1 Err.   AP
ShuffleNetV1 [33]    137      -            20.8
ShuffleNetV2 [20]    -        31.4         22.7
ShuffleNetV2* [20]   145      -            -
Xception [3]         145      34.1         23.0
MobileNetV2 [28]     145      32.9         22.7
SNet146              146      32.5         23.6

Table 6. Evaluation of lightweight backbones on COCO test-dev. SNet146 achieves better detection results though the classification accuracy is lower.

4.4.3 Detection Part

We also investigate the effectiveness of the design of the detection part in ThunderNet. Table 7 describes the comparison of the model variants in the experiments.

     BL   SRPN   SRCN   CEM   SAM   AP     AP50   AP75   MFLOPs
(a)  √                              21.9   -      -      714
(b)  √    √                         21.8   37.5   22.4   516
(c)  √    √      √                  -      -      -      449
(d)  √    √      √      √           23.3   39.9   24.0   -
(e)  √    √      √            √     22.9   39.0   23.8   -
(f)  √    √      √      √     √     23.6   40.2   24.5   473

Table 7. Ablation studies on the detection part on COCO test-dev. We use a compressed Light-Head R-CNN with SNet146 as the baseline (BL), and gradually add small RPN (SRPN), small R-CNN (SRCN), Context Enhancement Module (CEM) and Spatial Attention Module (SAM) for ablation studies.

Baseline. We choose a compressed Light-Head R-CNN [14] with SNet146 as the baseline. $C_5$ is upsampled by 2× to obtain the same downsampling rate. $C_4$ and $C_5$ are then squeezed to 245 channels and sent to RPN and RoI warping respectively. We use a 256-channel 3×3 convolution in RPN and a 2048-d fc layer in R-CNN subnet. This model requires 703 MFLOPs and achieves 21.9 AP (Table 7(a)). Besides, we would mention that multi-scale training, CGBN [22], and Soft-NMS [1] gradually improve the baseline by 1.4 AP (from 20.5 to 21.9 AP).

RPN and R-CNN subnet. We first replace the 3×3 convolution in RPN with a 5×5 depthwise convolution and a 1×1 convolution. The number of output channels remains unchanged. This design reduces the computational cost by 28% without harming the accuracy (Table 7(b)). We then halve the number of outputs of the fc layer in R-CNN subnet to 1024, which achieves a further 13% compression on the FLOPs with a marginal decrease of 0.2 AP (Table 7(c)). These results demonstrate that heavy RPN and R-CNN subnet introduce great redundancy for lightweight detectors. More details will be discussed in Sec. 4.4.4.

Context Enhancement Module. We then insert Context Enhancement Module (CEM) after the backbone. The output feature map of CEM is used for both RPN and RoI warping. CEM achieves thorough improvements of 1.7 AP, 2.5 AP50 and 1.8 AP75 with negligible increase on FLOPs (Table 7(d)). The combination of the multi-scale feature maps introduces semantic and context information of different levels, which improves the representation ability.

Spatial Attention Module. Adopting Spatial Attention Module (SAM) without CEM (Table 7(e)) improves AP by 1.3 with merely 5% extra computational cost compared with Table 7(c). Fig. 6 visualizes the feature maps before RoI warping in Table 7(c) and Table 7(e). It is clear that SAM effectively refines the feature distribution with foreground features enhanced and background features weakened.

Figure 6. Visualization of the feature map before RoI warping (panels: input, without SAM, with SAM, ground truth). Spatial Attention Module (SAM) enhances the features in the foreground regions and weakens those in the background regions.

At last, we adopt both CEM and SAM to compose the complete ThunderNet (Table 7(f)). This setting improves AP by 1.7, AP50 by 2.6, and AP75 by 2.0 over the baseline while reducing the computational cost by 34%. These results have demonstrated the effectiveness of our design.

4.4.4 Balance between Backbone and Detection Head

We further explore the relationship between the backbone and the detection head. Two models are used in the experiments: a large-backbone-small-head model and a small-backbone-large-head model. The large-backbone-small-head model is ThunderNet with SNet146, while the small-backbone-large-head model uses SNet49 and a heavier head: $\alpha$ in the thin feature map is 10, and a 2048-d fc layer is used in R-CNN subnet. As shown in Table 8, the large-backbone-small-head model outperforms the small-backbone-large-head one by 3.4 AP even with less computational cost. It suggests that the large-backbone-small-head design is better than the small-backbone-large-head design for lightweight two-stage detectors. We conjecture that the capability of the backbone and the detection head should match. In the small-backbone-large-head design, the features from the backbone are relatively weak, which makes the powerful detection head redundant.

Model                       Backbone   RPN   Head   Total   AP
large-backbone-small-head   338        -     -      473     23.6
small-backbone-large-head   154        -     286    510     20.2

Table 8. MFLOPs and AP of different detection head designs on COCO test-dev. The large-backbone-small-head model outperforms the small-backbone-large-head model with less FLOPs.

4.5. Inference Speed

At last, we evaluate the inference speed of ThunderNet on Snapdragon 845 (ARM), Xeon E5-2682v4 (CPU) and GeForce 1080Ti (GPU). On ARM and CPU, the inference is executed with a single thread. The batch normalization layers are merged with the preceding convolutions for faster inference speed. The results are shown in Table 9. ThunderNet with SNet49 achieves real-time detection on both ARM and CPU at 24.1 and 47.3 fps, respectively. To the best of our knowledge, this is the first real-time detector and the fastest single-thread speed on ARM platforms ever reported. ThunderNet with SNet146 runs at 13.8 fps on ARM and runs in real-time on CPU at 32.3 fps. All three models run at over 200 fps on GPU. These results suggest that ThunderNet is highly efficient in real-world applications.

Model                 ARM    CPU    GPU
Thunder w/ SNet49     24.1   47.3   267
Thunder w/ SNet146    13.8   32.3   248
Thunder w/ SNet535    -      -      214

Table 9. Inference speed in fps on Snapdragon 845 (ARM), Xeon E5-2682v4 (CPU) and GeForce 1080Ti (GPU).

5. Conclusion

We investigate the effectiveness of two-stage detectors in real-time generic object detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks in prior lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we adopt an extremely efficient design in the detection head and RPN. Context Enhancement Module and Spatial Attention Module are designed to improve the feature representation. At last, we investigate the balance between the input resolution, the backbone, and the detection head. ThunderNet achieves superior detection accuracy to prior one-stage detectors with significantly less computational cost. To the best of our knowledge, ThunderNet achieves the first real-time detector and the fastest single-thread speed reported on ARM platforms.

References

[1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS - improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561-5569, 2017.
[2] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726, 2017.
[3] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251-1258, 2017.
[4] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379-387, 2016.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[6] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[7] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346-361. Springer, 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132-7141, 2018.
[13] Y. Li, J. Li, W. Lin, and J. Li. Tiny-DSOD: Lightweight object detection for resource-restricted usages. arXiv preprint arXiv:1807.11013, 2018.
[14] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
[15] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. DetNet: Design backbone for object detection. In The European Conference on Computer Vision (ECCV), 2018.
[16] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117-2125, 2017.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980-2988, 2017.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[20] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116-131, 2018.
[21] R. Mehta and C. Ozturk. Object detection at 200 frames per second. arXiv preprint arXiv:1805.06361, 2018.
[22] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6181-6189, 2018.
[23] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters - improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353-4361, 2017.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[25] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263-7271, 2017.
[26] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[27] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[28] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510-4520, 2018.
[29] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761-769, 2016.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[31] R. J. Wang, X. Li, and C. X. Ling. Pelee: A real-time object detection system on mobile devices. In Advances in Neural Information Processing Systems, pages 1963-1972, 2018.
[32] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492-1500, 2017.
[33] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848-6856, 201
