Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation
Algorithm 1: Weakly-Supervised EM (fixed bias version)
Input: Initial CNN parameters θ', potential parameters b_l, l ∈ {0, ..., L}, image x, image-level label set z.
E-Step: For each image position m:
  1: f̂_m(l) = f_m(l | x; θ') + b_l, if z_l = 1
  2: f̂_m(l) = f_m(l | x; θ'), if z_l = 0
  3: ŷ_m = argmax_l f̂_m(l)
M-Step:
  4: Q(θ; θ') = log P(ŷ | x; θ) = Σ_{m=1}^{M} log P(ŷ_m | x; θ)
  5: Compute ∇_θ Q(θ; θ') and use SGD to update θ'

Figure 2. DeepLab model training using image-level labels. The deep convolutional neural network produces per-class score maps; in the weakly-supervised E-step these are combined with a foreground/background bias derived from the image-level annotations, and an argmax yields the latent segmentation used to compute the loss.

When only image-level annotations are available, the pixel-level segmentations y are latent variables, and we have the following probabilistic graphical model:

P(x, y, z; θ) = P(x) ( ∏_{m=1}^{M} P(y_m | x; θ) ) P(z | y).   (3)

We pursue an EM approach in order to learn the model parameters θ from training data. If we ignore terms that do not depend on θ, the expected complete-data log-likelihood given the previous parameter estimate θ' is

Q(θ; θ') = Σ_y P(y | x, z; θ') log P(y | x; θ) ≈ log P(ŷ | x; θ),   (4)

where we adopt a hard-EM approximation, estimating in the E-step of the algorithm the latent segmentation by

ŷ = argmax_y P(y | x; θ') P(z | y)   (5)
  = argmax_y [ log P(y | x; θ') + log P(z | y) ]   (6)
  = argmax_y [ Σ_{m=1}^{M} f_m(y_m | x; θ') + log P(z | y) ].   (7)

In the M-step of the algorithm, we optimize Q(θ; θ') ≈ log P(ŷ | x; θ) by mini-batch SGD similarly to (1), treating ŷ as the ground-truth segmentation.

To completely identify the E-step (7), we need to specify the observation model P(z | y). We have experimented with two variants, EM-Fixed and EM-Adapt.

EM-Fixed. In this variant, we assume that log P(z | y) factorizes over pixel positions as

log P(z | y) = Σ_{m=1}^{M} φ(y_m, z) + (const),   (8)

allowing us to estimate the E-step segmentation at each pixel separately:

ŷ_m = argmax_{y_m} f̂_m(y_m) ≐ f_m(y_m | x; θ') + φ(y_m, z).   (9)

We assume that φ(y_m = l, z) = b_l if z_l = 1 and 0 if z_l = 0. We set the parameters b_l = b_fg if l > 0 and b_0 = b_bg, with b_fg > b_bg > 0. Intuitively, this potential encourages a pixel to be assigned to one of the image-level labels z. Choosing b_fg > b_bg boosts the present foreground classes more than the background, to encourage full object coverage and avoid assigning the whole image to the background. The procedure is summarized in Algorithm 1 and illustrated in Fig. 2 (a code sketch of this E-step is given at the end of this section).

EM-Adapt. In this method, we assume that log P(z | y) = φ(y, z) + (const), where φ(y, z) takes the form of a cardinality potential [23, 33, 36]. In particular, we encourage at least a ρ_l portion of the image area to be assigned to class l, if z_l = 1, and enforce that no pixel is assigned to class l, if z_l = 0. We set the parameters ρ_l = ρ_fg if l > 0 and ρ_0 = ρ_bg. Similar constraints appear in [10, 20].

In practice, we employ a variant of Algorithm 1. We adaptively set the image- and class-dependent biases b_l so that the prescribed proportion of the image area is assigned to the background or to the foreground object classes. This acts as a powerful constraint that explicitly prevents the background score from prevailing over the whole image, and it also promotes higher foreground object coverage. The detailed algorithm is described in the supplementary material.

EM vs. MIL. It is instructive to compare our EM-based approach with two recent Multiple Instance Learning (MIL) methods for learning semantic image segmentation models [31, 32]. The method in [31] defines an MIL classification objective based on the per-class spatial maximum of the local label distributions of (2), P̂(l | x; θ) ≐ max_m P(y_m = l | x; θ), while [32] adopts a softmax function. While this approach has worked well for image classification tasks [29, 30], it is less suited for segmentation as it does not promote full object coverage: the DCNN becomes tuned to focus on the most distinctive object parts (e.g., human face) instead of capturing the whole object (e.g., human body).
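For concreteness, the fixed-bias E-step of Algorithm 1 (Eq. 9) reduces to a per-pixel argmax over biased score maps. The NumPy sketch below illustrates the idea; the array layout, helper name, and example values are our own illustration rather than the authors' released Caffe code.

```python
import numpy as np

def em_fixed_e_step(scores, image_labels, b_fg=5.0, b_bg=3.0):
    """Hard E-step of the EM-Fixed variant (Algorithm 1, Eq. 9).

    scores:       (L+1, H, W) array of DCNN scores f_m(l | x; theta'),
                  channel 0 being background.
    image_labels: iterable of foreground class indices present in the image
                  (the weak label set z); background is assumed present in
                  every image.  Bias values follow Sec. 4.3.
    Returns the estimated latent segmentation y_hat of shape (H, W).
    """
    biased = scores.copy()
    present = {0} | set(image_labels)
    for l in present:
        # Classes listed in z are boosted; absent classes are left unchanged
        # (they are not boosted, but not forbidden either, cf. Algorithm 1).
        biased[l] += b_bg if l == 0 else b_fg
    return biased.argmax(axis=0)

# Example: 21-class score maps for an image containing classes 12 and 15.
scores = np.random.randn(21, 321, 321)
y_hat = em_fixed_e_step(scores, image_labels=[12, 15])
```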
3.3. Bounding box annotations

Figure 3. DeepLab model training from bounding boxes: segmentation maps are first estimated from the bounding box annotations, then used as targets for the deep convolutional neural network.
Figure 4. Estimated segmentation from bounding box annotations, comparing the image with its boxes, the ground truth, the Bbox-Rect estimate, and the CRF-based Bbox-Seg estimate.

We explore three alternative methods for training our segmentation model from labeled bounding boxes.

The first, Bbox-Rect, amounts to simply considering each pixel within the bounding box as a positive example for the respective object class. Ambiguities are resolved by assigning pixels that belong to multiple bounding boxes to the box that has the smallest area.

The bounding boxes fully surround objects but also contain background pixels that contaminate the training set with false positive examples for the respective object classes. To filter out these background pixels, we have also explored a second method, Bbox-Seg, in which we perform automatic foreground/background segmentation. To perform this segmentation, we use the same CRF as in DeepLab. More specifically, we constrain the center area of the bounding box (α% of the pixels within the box) to be foreground, while we constrain pixels outside the bounding box to be background. We implement this by appropriately setting the unary terms of the CRF, and we then infer the labels for the pixels in between. We cross-validate the CRF parameters to maximize segmentation accuracy in a small held-out set of fully-annotated images. This approach is similar to the grabcut method of [34]. Examples of estimated segmentations with the two methods are shown in Fig. 4.

The two methods above, illustrated in Fig. 3, estimate segmentation maps from the bounding box annotations as a pre-processing step, then employ the training procedure of Sec. 3.1, treating these estimated labels as ground truth.

Our third method, Bbox-EM-Fixed, is an EM algorithm that allows us to refine the estimated segmentation maps throughout training. The method is a variant of the EM-Fixed algorithm in Sec. 3.2, in which we boost the present foreground object scores only within the bounding box area.

3.4. Mixed strong and weak annotations

Figure 5. DeepLab model training on a union of full (strong labels) and image-level (weak labels) annotations.

In practice, we often have access to a large number of weakly image-level annotated images and can only afford to procure detailed pixel-level annotations for a small fraction of these images. We handle this hybrid training scenario by combining the methods presented in the previous sections, as illustrated in Figure 5. In SGD training of our deep CNN models, we bundle into each mini-batch a fixed proportion of strongly and weakly annotated images, and we employ our EM algorithm to estimate at each iteration the latent semantic segmentations for the weakly annotated images (a sketch of this mini-batch composition follows below).
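A minimal sketch of how such a mixed mini-batch loop could be organized is given below. The `network` object and its `forward`, `cross_entropy_loss`, and `sgd_update` methods are hypothetical placeholders, and the batch composition numbers are illustrative; the actual implementation is Caffe-based.

```python
import random

def training_step(network, strong_set, weak_set, strong_per_batch=10, weak_per_batch=10):
    """One SGD step on a mini-batch mixing strongly and weakly annotated images.

    strong_set: list of (image, pixel_label_map) pairs.
    weak_set:   list of (image, image_level_labels) pairs.
    """
    batch_images, batch_targets = [], []

    # Strongly annotated images: use the pixel-level ground truth directly.
    for image, labels in random.sample(strong_set, strong_per_batch):
        batch_images.append(image)
        batch_targets.append(labels)

    # Weakly annotated images: run the E-step to impute latent segmentations.
    for image, z in random.sample(weak_set, weak_per_batch):
        scores = network.forward(image)         # (L+1, H, W) score maps
        y_hat = em_fixed_e_step(scores, z)      # E-step from the previous sketch
        batch_images.append(image)
        batch_targets.append(y_hat)

    # M-step: treat the imputed segmentations as ground truth for this batch.
    loss = network.cross_entropy_loss(batch_images, batch_targets)
    network.sgd_update(loss)
    return loss
```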
4. Experimental Evaluation

4.1. Experimental Protocol

Datasets. The proposed training methods are evaluated on the PASCAL VOC 2012 segmentation benchmark [13], consisting of 20 foreground object classes and one background class. The segmentation part of the original PASCAL VOC 2012 dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) images for training, validation, and testing, respectively. We also use the extra annotations provided by [16], resulting in augmented sets of 10,582 (train_aug) and 12,031 (trainval_aug) images. We have also experimented with the large MS-COCO 2014 dataset [24], which contains 123,287 images in its trainval set. The MS-COCO 2014 dataset has 80 foreground object classes and one background class, and is also annotated at the pixel level.

The performance is measured in terms of pixel intersection-over-union (IoU) averaged across the 21 classes. We first evaluate our proposed methods on the PASCAL VOC 2012 val set. We then report our results on the official PASCAL VOC 2012 benchmark test set (whose annotations are not released). We also compare our test set results with other competing methods.

Reproducibility. We have implemented the proposed methods by extending the excellent Caffe framework [18]. We share our source code, configuration files, and trained models that allow reproducing the results in this paper at a companion website: https://bitbucket.org/deeplab/deeplab-public.

Weak annotations. In order to simulate the situations where only weak annotations are available, and to have fair comparisons (e.g., use the same images for all settings), we generate the weak annotations from the pixel-level annotations. The image-level labels are easily generated by summarizing the pixel-level annotations, while the bounding box annotations are produced by drawing rectangles tightly containing each object instance in the dataset (PASCAL VOC 2012 also provides instance-level annotations).

Network architectures. We have experimented with the two DCNN architectures of [5], with parameters initialized from the VGG-16 ImageNet [11] pretrained model of [35]. They differ in the size of their receptive field of view (FOV). We have found that the large FOV (224 x 224) performs best when at least some training images are annotated at the pixel level, whereas the small FOV (128 x 128) performs better when only image-level annotations are available. In the main paper we report the results of the best architecture for each setup and defer the full comparison between the two FOVs to the supplementary material.

Training. We employ our proposed training methods to learn the DCNN component of the DeepLab-CRF model of [5]. For SGD, we use a mini-batch of 20-30 images and an initial learning rate of 0.001 (0.01 for the final classifier layer), multiplying the learning rate by 0.1 after a fixed number of iterations. We use a momentum of 0.9 and a weight decay of 0.0005. Fine-tuning our network on PASCAL VOC 2012 takes about 12 hours on an NVIDIA Tesla K40 GPU. Similarly to [5], we decouple the DCNN and dense CRF training stages and learn the CRF parameters by cross-validation, maximizing IoU segmentation accuracy on a held-out set of 100 fully-annotated PASCAL val images. We use 10 mean-field iterations for dense CRF inference [19]. Note that the IoU scores are typically 3-5% worse if we do not use the CRF for post-processing of the results.
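As a plain-Python illustration of the learning-rate policy just described (not the authors' Caffe solver configuration), the step schedule and the per-layer multiplier could be expressed as follows; the step size of 2,000 iterations is an assumed value.

```python
def learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=2000):
    """Step policy: multiply the base rate by `gamma` every `step_size` iterations."""
    return base_lr * (gamma ** (iteration // step_size))

# Per-group settings matching the reported protocol: the final classifier
# layer trains with a 10x larger learning rate than the pretrained backbone.
solver = {
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "lr_mult": {"backbone": 1.0, "classifier": 10.0},
}

# Effective learning rates for the classifier layer at a few iterations.
for it in (0, 1000, 2500, 4500):
    print(it, learning_rate(it) * solver["lr_mult"]["classifier"])
```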
4.2. Pixel-level annotations

We have first reproduced the results of [5]. Training the DeepLab-CRF model with strong pixel-level annotations on PASCAL VOC 2012, we achieve a mean IoU score of 67.6% on val and 70.3% on test; see method DeepLab-CRF-LargeFOV in [5, Table 1].

4.3. Image-level annotations

Validation results. We evaluate our proposed methods in training the DeepLab-CRF model using image-level weak annotations from the 10,582 PASCAL VOC 2012 train_aug images, generated as described in Sec. 4.1 above. We report the val performance of the two weakly-supervised EM variants described in Sec. 3.2. In the EM-Fixed variant we use b_fg = 5 and b_bg = 3 as fixed foreground and background biases. We found the results to be quite sensitive to the difference b_fg - b_bg but not very sensitive to their absolute values. In the adaptive EM-Adapt variant we constrain at least ρ_bg = 40% of the image area to be assigned to background and at least ρ_fg = 20% of the image area to be assigned to foreground (as specified by the weak label set).

We also examine using weak image-level annotations in addition to a varying number of pixel-level annotations, within the semi-supervised learning scheme of Sec. 3.4. In this Semi setting we employ strong annotations of a subset of the PASCAL VOC 2012 train set and use the weak image-level labels from another, non-overlapping subset of the train_aug set. We perform segmentation inference for the images that only have image-level labels by means of EM-Fixed, which we have found to perform better than EM-Adapt in the semi-supervised training setting.

The results are summarized in Table 1. We see that the EM-Adapt algorithm works much better than the EM-Fixed algorithm when we only have access to image-level annotations (38.2% vs. 20.8% validation IoU). Using 1,464 pixel-level and 9,118 image-level annotations in the EM-Fixed semi-supervised setting significantly improves performance, yielding 64.6%. Note that the image-level annotations are helpful, as training with the 1,464 pixel-level annotations alone yields only 62.5%.

Table 1. VOC 2012 val performance for varying numbers of pixel-level (strong) and image-level (weak) annotations (Sec. 4.3).
Method           | #Strong | #Weak  | val IOU
EM-Fixed (Weak)  | -       | 10,582 | 20.8
EM-Adapt (Weak)  | -       | 10,582 | 38.2
EM-Fixed (Semi)  | 200     | 10,382 | 47.6
EM-Fixed (Semi)  | 750     | 9,832  | 59.8
EM-Fixed (Semi)  | 1,000   | 9,582  | 62.0
EM-Fixed (Semi)  | 1,464   | 5,000  | 63.2
EM-Fixed (Semi)  | 1,464   | 9,118  | 64.6
Strong           | 1,464   | -      | 62.5
Strong           | 10,582  | -      | 67.6
Test results. In Table 2 we report our test results. We compare the proposed methods with the recent MIL-based approaches of [31, 32], which also report results obtained with image-level annotations on the VOC benchmark. Our EM-Adapt method yields 39.6%, which improves over MIL-FCN [31] by a large 13.9% margin. As [32] shows, MIL can become more competitive if additional segmentation information is introduced: using low-level superpixels, MIL-sppxl [32] yields 35.8% and is still inferior to our EM algorithm. Only if augmented with BING [7] or MCG [1] can MIL obtain results comparable to ours (MIL-obj: 37.0%, MIL-seg: 40.6%) [32]. Note, however, that both BING and MCG have been trained with bounding box or pixel-annotated data on the PASCAL train set, so both MIL-obj and MIL-seg indirectly rely on bounding box or pixel-level PASCAL annotations.

The more interesting finding of this experiment is that including very few strongly annotated images in the semi-supervised setting significantly improves the performance compared to the pure weakly-supervised baseline. For example, using 2.9k pixel-level annotations along with 9k image-level annotations in the semi-supervised setting yields 68.5%. We would like to highlight that this result surpasses all techniques which are not based on the DCNN+CRF pipeline of [5] (see Table 6), even if they are trained with all available pixel-level annotations.

Table 2. VOC 2012 test performance for varying numbers of pixel-level (strong) and image-level (weak) annotations (Sec. 4.3).
Method           | #Strong | #Weak | test IOU
MIL-FCN [31]     | -       | 10k   | 25.7
MIL-sppxl [32]   | -       | 760k  | 35.8
MIL-obj [32]     | BING    | 760k  | 37.0
MIL-seg [32]     | MCG     | 760k  | 40.6
EM-Adapt (Weak)  | -       | 12k   | 39.6
EM-Fixed (Semi)  | 1.4k    | 10k   | 66.2
EM-Fixed (Semi)  | 2.9k    | 9k    | 68.5

4.4. Bounding box annotations

Validation results. In this experiment, we train the DeepLab-CRF model using bounding box annotations from the train_aug set. We estimate the training set segmentations in a pre-processing step using the Bbox-Rect and Bbox-Seg methods described in Sec. 3.3. We assume that we also have access to 100 fully-annotated PASCAL VOC 2012 val images, which we use to cross-validate the value of the single Bbox-Seg parameter α (the percentage of the center bounding box area constrained to be foreground). We varied α from 20% to 80%, finding that α = 20% maximizes accuracy, in terms of IoU, in recovering the ground-truth foreground from the bounding box. We also examine the effect of combining these weak bounding box annotations with strong pixel-level annotations, using the semi-supervised learning methods of Sec. 3.4.

The results are summarized in Table 3. When using only bounding box annotations, we see that Bbox-Seg improves over Bbox-Rect by 8.1%, and gets within 7.0% of the strong pixel-level annotation result. We observe that combining 1,464 strong pixel-level annotations with weak bounding box annotations yields 65.1%, only 2.5% worse than the strong pixel-level annotation result. In the semi-supervised learning settings with 1,464 strong annotations, Semi-Bbox-EM-Fixed and Semi-Bbox-Seg perform similarly.

Table 3. VOC 2012 val performance for varying numbers of pixel-level (strong) and bounding box (weak) annotations (Sec. 4.4).
Method                | #Strong | #Box   | val IOU
Bbox-Rect (Weak)      | -       | 10,582 | 52.5
Bbox-EM-Fixed (Weak)  | -       | 10,582 | 54.1
Bbox-Seg (Weak)       | -       | 10,582 | 60.6
Bbox-EM-Fixed (Semi)  | 1,464   | 9,118  | 64.8
Bbox-Seg (Semi)       | 1,464   | 9,118  | 65.1
Strong                | 1,464   | -      | 62.5
Strong                | 10,582  | -      | 67.6

Test results. In Table 4 we report our test results. We compare the proposed methods with the very recent BoxSup approach of [9], which also uses bounding box annotations on the VOC 2012 segmentation benchmark. Comparing our alternative Bbox-Rect (54.2%) and Bbox-Seg (62.2%) methods, we see that simple foreground/background segmentation provides much better segmentation masks for DCNN training than using the raw bounding boxes. BoxSup does 2.4% better; however, it employs the MCG segmentation proposal mechanism [1], which has been trained with pixel-annotated data on the PASCAL train set, and thus it indirectly relies on pixel-level annotations.

When we also have access to pixel-level annotated images, our performance improves to 66.6% (1.4k strong annotations) or 69.0% (2.9k strong annotations). In this semi-supervised setting we outperform BoxSup (66.6% vs. 66.2% with 1.4k strong annotations), although we do not use MCG. Interestingly, Bbox-EM-Fixed improves over Bbox-Seg as we add more strong annotations, performing 1.0% better (69.0% vs. 68.0%) with 2.9k strong annotations. This shows that the E-step of our EM algorithm can estimate the object masks better than the foreground/background segmentation pre-processing step when enough pixel-level annotated images are available.

Table 4. VOC 2012 test performance for varying numbers of pixel-level (strong) and bounding box (weak) annotations (Sec. 4.4).
Method                | #Strong     | #Box | test IOU
BoxSup [9]            | MCG         | 10k  | 64.6
BoxSup [9]            | 1.4k (+MCG) | 9k   | 66.2
Bbox-Rect (Weak)      | -           | 12k  | 54.2
Bbox-Seg (Weak)       | -           | 12k  | 62.2
Bbox-Seg (Semi)       | 1.4k        | 10k  | 66.6
Bbox-EM-Fixed (Semi)  | 1.4k        | 10k  | 66.6
Bbox-Seg (Semi)       | 2.9k        | 9k   | 68.0
Bbox-EM-Fixed (Semi)  | 2.9k        | 9k   | 69.0

Comparing with Sec. 4.3, note that 2.9k strong + 9k image-level annotations yield 68.5% (Table 2), while 2.9k strong + 9k bounding box annotations yield 69.0% (Table 4). This finding suggests that bounding box annotations add little value over image-level annotations when a sufficient number of pixel-level annotations is also available.
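As a concrete illustration of the Bbox-Rect and Bbox-Seg pre-processing evaluated above, the sketch below builds per-pixel training targets from a list of boxes. It is a simplified stand-in: the actual Bbox-Seg step runs the DeepLab dense CRF over the unconstrained band between the clamped regions, which is represented here only by an ignore label.

```python
import numpy as np

IGNORE = 255  # pixels left to the CRF (or excluded from the loss) in Bbox-Seg

def bbox_rect_labels(boxes, height, width):
    """Bbox-Rect: every pixel inside a box is a positive for that class.
    boxes: list of (class_id, x0, y0, x1, y1); overlaps go to the smallest box."""
    labels = np.zeros((height, width), dtype=np.uint8)   # 0 = background
    order = sorted(boxes, key=lambda b: (b[3] - b[1]) * (b[4] - b[2]), reverse=True)
    for cls, x0, y0, x1, y1 in order:                     # paint large boxes first,
        labels[y0:y1, x0:x1] = cls                        # smaller boxes overwrite
    return labels

def bbox_seg_constraints(boxes, height, width, alpha=0.2):
    """Bbox-Seg unary constraints: the centered alpha-fraction of each box area
    is clamped to its class, pixels outside all boxes stay background, and the
    band in between is left undecided (marked IGNORE) for CRF inference."""
    labels = np.zeros((height, width), dtype=np.uint8)
    order = sorted(boxes, key=lambda b: (b[3] - b[1]) * (b[4] - b[2]), reverse=True)
    for cls, x0, y0, x1, y1 in order:
        labels[y0:y1, x0:x1] = IGNORE                     # undecided band
        w, h = x1 - x0, y1 - y0
        mx = int(w * (1 - alpha ** 0.5) / 2)              # margins so the clamped
        my = int(h * (1 - alpha ** 0.5) / 2)              # center covers ~alpha of the box
        labels[y0 + my:y1 - my, x0 + mx:x1 - mx] = cls
    return labels

boxes = [(15, 30, 40, 200, 260), (12, 120, 150, 180, 240)]   # (class, x0, y0, x1, y1)
rect_targets = bbox_rect_labels(boxes, 321, 321)
seg_constraints = bbox_seg_constraints(boxes, 321, 321, alpha=0.2)
```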
4.5. Exploiting Annotations Across Datasets

Validation results. We present experiments leveraging the 81-label MS-COCO dataset as an additional source of data in learning the DeepLab model for the 21-label PASCAL VOC 2012 segmentation task. We consider three scenarios:

- Cross-Pretrain (Strong): Pre-train DeepLab on MS-COCO, then replace the top-level network weights and fine-tune on PASCAL VOC 2012, using pixel-level annotations in both datasets.
- Cross-Joint (Strong): Jointly train DeepLab on PASCAL VOC 2012 and MS-COCO, sharing the top-level network weights for the common classes, using pixel-level annotations in both datasets.
- Cross-Joint (Semi): Jointly train DeepLab on PASCAL VOC 2012 and MS-COCO, sharing the top-level network weights for the common classes, using the pixel-level labels from PASCAL and varying the number of pixel- and image-level labels from MS-COCO.

In all cases we use strong pixel-level annotations for all 10,582 train_aug PASCAL images.

We report our results on the PASCAL VOC 2012 val set in Table 5, also including for comparison our best PASCAL-only result, which exploits all 10,582 strong annotations, as a baseline. When we employ the weak MS-COCO annotations (EM-Fixed (Semi)) we obtain 67.7% IoU, which does not improve over the PASCAL-only baseline. However, using strong labels from 5,000 MS-COCO images (4.0% of the MS-COCO dataset) and weak labels from the remaining MS-COCO images in the Cross-Joint (Semi) semi-supervised scenario yields 70.0%, a significant 2.4% boost over the baseline. This Cross-Joint (Semi) result is also 1.3% better than the 68.7% performance obtained using only the 5,000 strong and no weak annotations from MS-COCO. As expected, our best results are obtained by using all 123,287 strong MS-COCO annotations: 71.0% for Cross-Pretrain (Strong) and 71.7% for Cross-Joint (Strong). We observe that cross-dataset augmentation improves by 4.1% over the best PASCAL-only result. Using only a small portion of pixel-level annotations and a large portion of image-level annotations in the semi-supervised setting reaps about half of this benefit.

Table 5. VOC 2012 val performance using strong annotations for all 10,582 train_aug PASCAL images and a varying number of strong and weak MS-COCO annotations (Sec. 4.5).
Method                  | #Strong COCO | #Weak COCO | val IOU
PASCAL-only             | -            | -          | 67.6
EM-Fixed (Semi)         | -            | 123,287    | 67.7
Cross-Joint (Semi)      | 5,000        | 118,287    | 70.0
Cross-Joint (Strong)    | 5,000        | -          | 68.7
Cross-Pretrain (Strong) | 123,287      | -          | 71.0
Cross-Joint (Strong)    | 123,287      | -          | 71.7

Test results. We report our PASCAL VOC 2012 test results in Table 6, including results of other leading models from the PASCAL leaderboard. All our models have been trained with pixel-level annotated images on the PASCAL trainval_aug and the MS-COCO 2014 trainval datasets. Methods based on the DCNN+CRF pipeline of DeepLab-CRF [5] are the most competitive, with performance surpassing 70%, even when only trained on PASCAL data. Leveraging the MS-COCO annotations brings about a 2% improvement. Our top model yields 73.9%, using the multi-scale network architecture of [5]. Also see [42], which likewise uses joint PASCAL and MS-COCO training and further improves performance (74.7%) by end-to-end learning of the DCNN and CRF parameters.

Table 6. VOC 2012 test performance using PASCAL and MS-COCO annotations (Sec. 4.5).
Method                                     | test IOU
MSRA-CFM [8]                               | 61.8
FCN-8s [25]                                | 62.2
Hypercolumn [17]                           | 62.6
TTI-Zoomout-16 [28]                        | 64.4
DeepLab-CRF-LargeFOV [5]                   | 70.3
BoxSup (Semi, with weak COCO) [9]          | 71.0
DeepLab-CRF-LargeFOV (Multi-scale net) [5] | 71.6
Oxford_TVG_CRF_RNN_VOC [42]                | 72.0
Oxford_TVG_CRF_RNN_COCO [42]               | 74.7
Cross-Pretrain (Strong)                    | 72.7
Cross-Joint (Strong)                       | 73.0
Cross-Pretrain (Strong, Multi-scale net)   | 73.6
Cross-Joint (Strong, Multi-scale net)      | 73.9

4.6. Qualitative Segmentation Results

In Fig. 6 we provide visual comparisons of the results obtained by the DeepLab-CRF model learned with some of the proposed training methods.
Figure 6. Qualitative DeepLab-CRF segmentation results on the PASCAL VOC 2012 val set (columns: image, EM-Adapt (Weak), EM-Fixed (Semi), Bbox-EM-Fixed (Semi), Cross-Joint (Strong)). The last two rows show failure modes.

5. Conclusions

The paper has explored the use of weak or partial annotation in training a state-of-the-art semantic image segmentation model. Extensive experiments on the challenging PASCAL VOC 2012 dataset have shown that: (1) using weak annotation solely at the image level seems insufficient to train a high-quality segmentation model; (2) using weak bounding-box annotation in conjunction with careful segmentation inference for the images in the training set suffices to train a competitive model; (3) excellent performance is obtained when combining a small number of pixel-level annotated images with a large number of weakly annotated images in a semi-supervised setting, nearly matching the results achieved when all training images have pixel-level annotations; and (4) exploiting extra weak or strong annotations from other datasets can lead to large improvements.

Acknowledgments

This work was partly supported by ARO 62250-CS and NIH 5R01EY022247-03. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research.

Supplementary Material

We include as an appendix: (1) details of the proposed EM-Adapt algorithm; (2) more experimental evaluations of the effect of the model's Field-Of-View; (3) more detailed results of the proposed training methods on the PASCAL VOC 2012 test set.

A. E-Step with Cardinality Constraints: Details of our EM-Adapt Algorithm

Herein we provide more background and a detailed description of our EM-Adapt algorithm for weakly-supervised training with image-level annotations.

As a reminder, y is the latent segmentation map, with y_m ∈ {0, ..., L} denoting the label at position m ∈ {1, ..., M}.
The image-level annotation is encoded in z, with z_l = 1 if the l-th label is present anywhere in the image.

We assume that log P(z | y) = φ(y, z) + (const). We employ a cardinality potential φ(y, z) which encourages at least a ρ_l portion of the image area to be assigned to class l, if z_l = 1, and enforces that no pixel is assigned to class l, if z_l = 0. We set the parameters ρ_l = ρ_fg, if l > 0, and ρ_0 = ρ_bg.

While dedicated algorithms exist for optimizing energy functions under such cardinality potentials [36, 33, 23], we opt for a simpler alternative that approximately enforces these area constraints and works well in practice. We use a variant of the EM-Fixed algorithm described in the main paper, updating the segmentations in the E-step by ŷ_m = argmax_l f̂_m(l) = f_m(l | x; θ') + b_l. The key difference in the EM-Adapt variant is that the biases b_l are adaptively set so that the prescribed proportion of the image area is assigned to the background or to the foreground object classes that are present in the image.

When only one label l is present (i.e., z_l = 1 and z_l' = 0 for l' ≠ l), one can easily enforce the constraint that at least a ρ_l portion of the image area is assigned to label l as follows: (1) set b_l' = 0 for l' ≠ l; (2) compute the maximum score at each position, f_m^max = max_l' f_m(l' | x; θ'); (3) set b_l equal to the ρ_l-th percentile of the score differences d_m = f_m^max - f_m(l | x; θ'). The cost of this algorithm is O(M), i.e., linear in the number of pixels.

When more than one label is present in the image (i.e., Σ_l z_l > 1), we employ the procedure above sequentially for each label l that has z_l = 1; we first visit the background label, and then in random order each of the foreground labels which are present in the image. We set b_l = -∞ if z_l = 0, to suppress the labels that are not present in the image.

An implementation of this algorithm will become publicly available after this paper is published.
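A minimal NumPy sketch of this percentile-based adaptive bias selection follows (single image, hard E-step). It is an illustration of the procedure described above, not the authors' implementation; the proportions and the random visiting order follow the text, while the array layout and example values are our own.

```python
import numpy as np

def em_adapt_e_step(scores, image_labels, rho_fg=0.2, rho_bg=0.4, rng=None):
    """Adaptive-bias E-step (EM-Adapt): choose per-class biases so that roughly
    a rho_l fraction of the pixels is assigned to each present class l.

    scores:       (L+1, H, W) DCNN scores, channel 0 = background.
    image_labels: foreground class indices present in the image (z).
    """
    rng = rng or np.random.default_rng()
    num_classes, h, w = scores.shape
    biased = scores.reshape(num_classes, -1).copy()        # (L+1, M)

    present = {0} | set(image_labels)
    absent = [l for l in range(num_classes) if l not in present]
    biased[absent] = -np.inf                                # b_l = -inf suppresses absent labels

    # Background first, then the present foreground labels in random order.
    order = [0] + list(rng.permutation(sorted(image_labels)))
    for l in order:
        rho = rho_bg if l == 0 else rho_fg
        f_max = biased.max(axis=0)                          # current best score per pixel
        d = f_max - biased[l]                               # margin needed for pixel to flip to l
        b_l = np.percentile(d, 100.0 * rho)                 # rho-th percentile; d >= 0, so b_l only boosts
        biased[l] += b_l                                    # now ~rho of pixels prefer label l
    return biased.argmax(axis=0).reshape(h, w)

scores = np.random.randn(21, 321, 321)
y_hat = em_adapt_e_step(scores, image_labels=[12, 15])
```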
We set a varying number of pixel-level annotations are available if 2=0, to suppress the labels that are not present in the Similar to the results in Table 7, the DeepLab-CRF with Image OV performs better than that with large FOV when An implementation of this algorithm will become pub a amount of supervision is leveraged. Interestingly licly available after this paper gets published when there are more than 750 pixel-level annotations are kernel size input stride receptive field val IOU (%) #strong #Box w Small FoV w Large FOV 128×128 Bbox-Rect (Weak) 10,582 64×64 Bbox-EM-Fixed(Weak) 10582 54.1 50.2 10,58 60.6 128×128 Bbox-Rect (Semi G1.1 160×160 38.1 Bbox-EM-Fixed(Semi) 1, 4649, 1 18 64.8 192×19 Bbox- Seg(Semi) 1,4649,118 618 65.1 224×224 Stron 57.6 Table 7. Erect of Field-or-View. The validation performance Table 9. Effect of Field-Of-View. VOC 2012 val performance for obtained by deepLab-CRF trained with the method EM-Adapt (Weak) as the value of Fov varies varying number of pixel-level (strong) and bounding box(weak) annotations(Sec. 4.4 of main paper) #Strong #Weak w Small FoV w Large FOV EM-Fixed (Weak #Strong #Weak w Small fov w Large FOV EM-Adapt(Weak) 10,58 38.2 PASCAL-only 6 LM-Fixed(Semi) 123.287 644 67.7 10,382 10.082 56 54.2 700 9,832 58.8 59.8 Cross-Joint (strong) 5,000 EM-Fixed (Semi) 1009 Cross-Prctrain(Strong)[ 123, 287 710 1,4645,000 Coss-Joint(Strong) 123, 287 680 717 1,4649.8 61.9 64.6 Table 10. Effect of Field-Of-View. vOC 2012 val performance Strong 1,464 57.6 using strong annotations for all 10, 582 train aug PASCAL, images and a varying number of strong and weak Ms-cocO annotations Table 8. Effect of Field-Of-View. VOC 2012 val performance for (Sec. 4.5 of main paper varying number of pixel-level(strong)and image-level(weak) an notations(Sec. 4.3 of main paper) References FoV yields better performance than using small FOr arge available in the semi-supervised setting, employing la [I P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.1.2,6 [2S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material Bounding box annotations In Table 9, we report the re recognition in the wild with the materials in context database sults when weak bounding box annotations in addition to a arXiv∵:l42.0623,2014.1 varying number of pixel-level annotations are exploited. we 3] J. Carreira and C Sminchisescu. CPMC: Automatic object found that DeepLab-CRF with small FoV attains better per segmentation using constrained parametric min-cuts. PAMI, formance when trained with the three methods bbox -Rect 34(7):1312-1328,2012.2 Weak), Bbox-EM-Fixed(Weak), and Bbox-Rect (Semi 44 L -C. Chen, s. Fidler, A. L. Yuille, and R. Urtasun. Beat the 1464 strong), whereas the model DeepLab-CRF with large mturkers: Automatic image laheling from weak 3d supervi FOV is better in all the other cases sion. In CVPR. 2014. 2 5 L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep con- Annotations across datasets In Table 10. we show the volutional nels and fully connected crls. In ICLR 2015. 1 2,5,6.7,9,1 results when training the models with the strong pixel-level annotations from pascal voc 2012 train _aug set in con [6] L.C. Chen, A Schwing, A. Yuille, andR. Urtasun Learning junction with the extra annotations from MS-COCO [24 deep structured models. In ICML, 2015. 2 dataset (in the form of either weak image-level annotations [7 M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. h. s. 
Main paper. Note that in the main paper we report the results of the best architecture for each setup.

C. Detailed test results

In Table 11, Table 12, and Table 13, we show more detailed results on the PASCAL VOC 2012 test set for all the methods reported in the main paper.

References

[1] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. arXiv:1412.0623, 2014.
[3] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7):1312-1328, 2012.
[4] L.-C. Chen, S. Fidler, A. L. Yuille, and R. Urtasun. Beat the MTurkers: Automatic image labeling from weak 3D supervision. In CVPR, 2014.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[6] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In ICML, 2015.
[7] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
[8] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv:1412.1283, 2014.
[9] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. arXiv:1503.01640, 2015.
[10] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov. Fast approximate energy minimization with label costs. IJCV, 2012.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
