Object detection via a multi-region & semantic segmentation-aware CNN model

2015-05-14 23:04:09, 8.24MB, PDF

A method for image recognition with convolutional neural networks; the model has multiple input parts.
Figure 2: Multi-Region CNN architecture. For clarity we present only four of the regions that participate in the initial Multi-Region CNN architecture. An "adaptive max pooling" layer uses spatially adaptive pooling as in [12] (but with a one-level pyramid).

To conclude, the architecture of the Multi-Region CNN model that we just described is composed of:

The activation maps module. It extracts convolutional feature maps from the entire image. Its architecture consists of the convolutional part of the 16-layer VGG-Net [24], which outputs feature maps of 512 channels. The max-pooling layer right after the last convolutional layer is omitted in this module.

The region adaptation modules R_i. Given a candidate detection box and the activation maps of the image, each of them extracts a high level feature from its assigned region. They consist of a spatially adaptive max-pooling layer [12] that outputs fixed size features of 512 channels on a 7 × 7 grid, and the two fully connected layers of VGG-Net [24] that have 4096 output channels each.

Despite the specific architecture choices that we have made so far, the activation maps module followed by region adaptation modules is a general architecture abstraction that can be easily adjusted, as we will see, in other realizations.

2.1. Region components and their role on detection

Here we describe the regions included in the Multi-Region CNN model and discuss their role in object detection.

Original candidate box: this is the candidate detection box itself, as used in R-CNN [10] (figure 3a). A network trained on this type of region is guided to capture the appearance information of the entire object. When this region is used alone, it constitutes the baseline of our work.

Half boxes: those are the left/right/up/bottom half parts of a candidate detection box (figures 3b, 3c, 3d, 3e). Networks trained on each of those regions are guided to learn the appearance characteristics present only on each half part of an object or on each side of the object's borders, aiming also to make the representation more robust with respect to occlusions.

Central regions: there are two types of central regions included in our model. The first is the box obtained by shrinking the candidate box by a factor of 0.5 (figure 3f). The second one is a rectangular ring where the inner box is obtained by shrinking the candidate box by a factor of 0.3 and the outer box by shrinking it by 0.8 (figure 3g). The networks trained on them are guided to capture the pure appearance characteristics of the central part of an object, which probably suffers less interference from other objects next to it or from its background.

Border regions: in our model we include two regions dedicated to focusing on the borders of an object. Those regions have the form of rectangular rings. For the first region, we obtain the inner box by shrinking the candidate box by a factor of 0.5, while the outer box has the same size as the candidate box (figure 3h). For the second region, its inner box is obtained by shrinking the candidate box by a factor of 0.8 and the outer box is obtained by enlarging the candidate box by a factor of 1.5 (figure 3i). With those regions, we expect to guide the networks dedicated to them to focus on the joint appearance characteristics on both sides of the object borders, also aiming to make the representation more sensitive to inaccurate localization.

Contextual region: this is again a rectangular ring, where its inner box is the candidate box itself and the outer box is obtained by enlarging the candidate box by a factor of 1.8 (figure 3j). The network dedicated to this region is driven to focus on the contextual appearance that surrounds an object, such as the appearance of its background or of other objects next to it.

Concerning the general role of the regions in object detection, we briefly focus below on two of the reasons why they are helpful for this task.

Discriminative feature diversification. Our hypothesis is that having regions that render visible to their network-components only a limited part of the object, or only its immediate surrounding, forces each network-component to discriminate image boxes solely based on the visual information that is apparent on them, thus diversifying the discriminative factors captured by our overall recognition model. For example, if the border region depicted on figure 3i is replaced with one that includes its whole inner content, then we would expect that the network-component dedicated to it will not pay the desired attention to the visual content that is concentrated around the borders of an object. We tested this hypothesis by conducting an experiment where we trained and tested two Multi-Region CNN models that consist of two regions each. Model A included the original box region and the border region depicted on figure 3i, which does not contain the central part of the object. On model B, we replaced the rectangular ring with a normal box of the same size as the outer box of the rectangular ring on figure 3i. Both of them were trained on the PASCAL VOC2007 [7] trainval set and tested on the test set of the same challenge. Model A achieved 64.1% mAP while Model B achieved 62.9% mAP, which is 1.2 points lower and validates our assumption.

Localization-aware representation. Moreover, we argue that the multi-region architecture of our model, as well as the type of regions included, address to a certain extent one of the major problems of the detection task, which is the inaccurate localization of object instances. We expect that having multiple regions with network-components dedicated to each of them imposes soft constraints regarding the visual content allowed on each type of region for a given candidate detection box. We provide experimental support for this argument by referring to sections 6.2 and 6.3.

Figure 3: Illustration of the regions components used. Panels: (a) Original box, (b) Half left, (c) Half right, (d) Half up, (e) Half bottom, (f) Central Region, (g) Central Region, (h) Border Region, (i) Border Region, (j) Contextual Region.

3. Semantic Segmentation-Aware CNN Features

To further diversify the features encoded by our representation, we extend the Multi-Region CNN model so that it can also incorporate semantic segmentation-aware CNN features (see figure 4). We were also motivated in this by the close connection between segmentation and detection, and by the fact that segmentation related cues have been empirically known to help object detection [6, 11, 22]. In the context of our multi-region CNN network, the incorporation of the semantic segmentation-aware features is done by properly adapting the two main modules of the network, i.e., the activation-maps module and the region-adaptation modules (see architecture in figure 4). We hereafter refer to the resulting modules as:

Activation maps module for semantic segmentation-aware features.
Region adaptation module for semantic segmentation-aware features.

It is important to note that the modules for the semantic segmentation-aware features are trained without the use of any additional annotation.
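The region definitions of section 2.1 can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions: "shrinking" or "enlarging" by a factor is taken to mean scaling width and height about the box centre, image coordinates are assumed to have y growing downwards, and the names `Box`, `scale_box` and `multi_region` are hypothetical helpers, not part of the described system.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


def scale_box(box: Box, factor: float) -> Box:
    """Scale width and height by `factor` about the box centre (assumed convention)."""
    cx = (box.x1 + box.x2) / 2.0
    cy = (box.y1 + box.y2) / 2.0
    half_w = (box.x2 - box.x1) * factor / 2.0
    half_h = (box.y2 - box.y1) * factor / 2.0
    return Box(cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# A region is (inner, outer): the area inside `outer` but outside `inner` is visible.
# `inner` is None for plain boxes and a Box for rectangular rings.
Region = Tuple[Optional[Box], Box]


def multi_region(box: Box) -> Dict[str, Region]:
    """Build the ten regions of figure 3 for one candidate detection box."""
    cx = (box.x1 + box.x2) / 2.0
    cy = (box.y1 + box.y2) / 2.0
    return {
        "original_3a": (None, box),
        "half_left_3b": (None, Box(box.x1, box.y1, cx, box.y2)),
        "half_right_3c": (None, Box(cx, box.y1, box.x2, box.y2)),
        "half_up_3d": (None, Box(box.x1, box.y1, box.x2, cy)),      # y grows downwards (assumption)
        "half_bottom_3e": (None, Box(box.x1, cy, box.x2, box.y2)),
        "central_3f": (None, scale_box(box, 0.5)),
        "central_ring_3g": (scale_box(box, 0.3), scale_box(box, 0.8)),
        "border_ring_3h": (scale_box(box, 0.5), box),
        "border_ring_3i": (scale_box(box, 0.8), scale_box(box, 1.5)),
        "context_ring_3j": (box, scale_box(box, 1.8)),
    }
```

Representing every region as an (inner, outer) pair keeps plain boxes and rectangular rings in one structure, so a pooling step could treat them uniformly by ignoring activations that fall inside `inner`.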
Instead, they are trained in a weakly supervised manner using only the provided bounding box annotations for detection. We combine the initial Multi-Region CNN model with the semantic segmentation-aware CNN model by simply concatenating the outputs of their last hidden layers (see figure 4).

3.1. Activation Maps Module for Semantic Segmentation-Aware Features

Fully Convolutional Nets. Our semantic segmentation-aware activation maps module has the architecture of a Fully Convolutional Net [20], abbreviated hereafter as FCN (we refer the interested reader to [20] for more details about FCN, where it is used for the task of semantic segmentation). In our work, we use as FCN the 16-layer VGG-Net [24], reshaping its last three fully connected layers (fc6, fc7, and fc8) to convolutional ones of kernel size 7 × 7, 1 × 1, and 1 × 1 correspondingly. The last classification layer of our FCN outputs as many channels as our classes.

Feature dimensionality reduction of the activation maps. Furthermore, in order to be able to efficiently store the activation maps of an image on the hard disk, we reduce the dimensionality of the last hidden layer, fc7, from 4096 channels to 512 channels. To accomplish that, after the fine-tuning of the FCN with 4096 channels on the fc7 layer has converged, we replace the fc7 layer with another one that has 512 output channels and is initialized from a Gaussian distribution. Then, the training of the FCN starts from the beginning and is continued until convergence again. Finally, for the activation maps module of the semantic segmentation-aware CNN, the new FCN with the 512 output channels is used.

Weakly Supervised Training. To train the activation maps module for the class-specific foreground segmentation task, we only use the annotations provided in object detection challenges (so as to make the training of our overall system independent of the availability of segmentation annotations). To that end, we follow a weakly supervised training strategy and we create artificial foreground class-specific segmentation masks using bounding box annotations. More specifically, the ground truth bounding boxes of an image are projected on the spatial domain of the last hidden layer, fc7, and the cells that lie inside the projected boxes are labelled as foreground while the rest are labelled as background (see left and middle column in figure 5). The aforementioned process is performed independently for each class and yields as many segmentation masks as the number of our classes. As can be seen in figure 5, despite the weakly supervised way of training, the resulting activations carry significant semantic segmentation information, enough even to delineate the boundaries of the object and separate the object from its background.

Figure 5: Left column: images with the ground truth bounding boxes drawn on them. The classes depicted, from top to bottom, are horse, human, and dog. Middle column: the segmentation masks artificially generated from the ground truth bounding boxes on the left column. We use blue color for the background and red color for the foreground. Right column: the foreground probabilities estimated from our trained FCN model. These clearly verify that, despite the weakly supervised training, our extracted features carry significant semantic segmentation information.

3.2. Region Adaptation Module for Semantic Segmentation-Aware Features

After the FCN has been trained on the auxiliary task of foreground segmentation, we drop the last classification layer and we use the rest of the FCN network in order to extract from images semantic segmentation-aware activation maps. We exploit those activation maps by treating them as mid-level features and adding on top of them a single region adaptation module trained for our primary task of object detection.

Architecture. The architecture of our adaptation module consists of a) a spatially adaptive max-pooling layer that outputs feature maps of 512 channels on a 9 × 9 grid, and b) a fully connected layer with 2096 channels.

Single region choice. For the semantic segmentation-aware CNN extension, we chose to use only one region, obtained by enlarging the candidate detection box by a factor of 1.5 (such a region also contains semantic information from the surrounding of a candidate detection box). The reason that we did not repeat the same regions that were used in the initial Multi-Region CNN architecture is efficiency, as these are already used for capturing the appearance cues of an object.

Figure 4: Multi-Region CNN architecture extended with the semantic segmentation-aware CNN features.

4. Object Localization

As we already explained, our Multi-Region CNN based recognition model exhibits the localization awareness property that is necessary for accurate object localization. However, by itself it is not enough. In order to make full use of it, our recognition model needs to be presented with well localized candidate boxes that in turn will be scored with high confidence by it. However, the selective search algorithm that we use proposes category independent box proposals that cover the objects of an image with high recall but without localizing them accurately enough. A different way to proceed would be to use our recognition model in a sliding window fashion and on multiple scales and aspect ratios in an image, but that would be a computationally expensive solution. Instead, the solution that we adopt is:

CNN-based regression. We introduce an extra multi-layer region sub-network that, instead of being used for object recognition, is trained to predict the actual object location. This bounding box regression region module is applied on top of the activation maps produced by the initial Multi-Region CNN model and consists of two hidden fully connected layers with 4096 channels each (as the fully connected layers of VGG-Net) and a regression layer that predicts 4 × C values, where C is the number of categories. In order to allow it to predict the location of object instances that are not in the close proximity of any of the initial candidate boxes, we use as region a box obtained by enlarging the initial candidate box by a factor of 1.3.

Iterative Localization: we use an iterative scheme that alternates between scoring a set of proposals with our recognition model and refining their locations with the CNN-based bounding box regression. With this scheme, we obtain candidate boxes that both exhibit high recall of the objects in an image and are well localized on them. We found that two iterations of our scheme are enough for convergence. For the first iteration, the box proposals come from the selective search algorithm [26]. After being scored, the proposals with very low confidence are rejected in order to reduce the computational burden of the subsequent iteration(s).

Bounding box voting. After the last iteration of our iterative scheme, the scored boxes produced on each step are merged together. Because of the multiple regression steps, the generated boxes will be highly concentrated around the actual objects of interest. We exploit this "by-product" of the iterative localization scheme by modifying the non-maximum-suppression step that is performed at post-process time. Specifically, after picking the box with the highest score in its neighbourhood, we predict the final object location by having each of the boxes that overlap with the picked one by more than 0.5 (on IoU) vote for the bounding box location using its score as weight.

In figure 6 we provide a visual illustration of the object localization.

Figure 6: Illustration of the object localization scheme for instances of the class car. We describe the images from left to right and top to bottom. Step 1: the initial box proposals of the image. For clarity we visualize only the box proposals that are not rejected after the first scoring step. Step 2: the new box locations obtained after performing CNN-based bounding box regression on the boxes of Step 1. Step 3: the boxes obtained after a second step of box scoring and regressing on the boxes of Step 2. Step 4: the boxes of Step 2 and Step 3 merged together. Step 5: the detected boxes after applying non-maximum-suppression and box voting on the boxes of Step 4. On the final detections we use blue color for the true positives and red color for the false positives. Also, the ground truth bounding boxes are drawn with green color. The false positive that we see after the last step is a duplicate detection that survived non-maximum-suppression.

5. Implementation details

For all the CNN models involved in our proposed system, we used the publicly available 16-layer VGG model [24] pre-trained on ImageNet [5] for the task of image classification. For simplicity, we fine-tuned only the fully connected layers (fc6 and fc7) of each model while we preserved the pre-trained weights for the convolutional layers, which are shared among all the models of our system.

Multi-Region CNN model. Each of its region components inherits the fully connected layers of the VGG-Net and is fine-tuned separately from the others. To train them we follow the guidelines of R-CNN [10]. The 1000-channel classification layer of the ImageNet classification challenge [5] is replaced with a 21-channel classification layer for the 20 classes of the PASCAL VOC detection challenge plus one for background. As an optimization objective we use the softmax loss, and the minimization is performed with stochastic gradient descent (SGD). The momentum is set to 0.9 and the learning rate is initially set to 0.001 and then reduced by a factor of 10 every 30k iterations. Our minibatch has 128 samples, of which 25% are foreground samples and 75% are background samples. The positive samples are defined as the selective search proposals [26] that overlap with a ground-truth bounding box by at least 0.5. As negative samples we use the proposals that overlap with a ground-truth bounding box in the range [0.1, 0.5). The labelling of the training samples is relative to the original candidate boxes and is the same across all the different regions.

Semantic segmentation-aware CNN. The training is performed in two phases:
1.
Training of activation maps module: We use as optimization objective a binary (foreground vs background) logistic loss applied on each spatial cell and for each class independently. The choice of having on each spatial cell a binary logistic loss per class, instead of a softmax loss, was dictated by the fact that the bounding box annotations of different classes for a given image can overlap with each other. Thus, in contrast to the segmentation annotations typically provided in the semantic segmentation challenges, the labels of each cell on our artificially generated segmentation masks are not mutually exclusive (see figure 5, middle column). For loss minimization we use SGD with a minibatch of size 10. The momentum is set to 0.9 and the learning rate is initialized to 0.01 and decreased by a factor of 10 every 20 epochs. This procedure is followed for fine-tuning the FCN with the 4096 channels on fc7 and then is restarted for fine-tuning the FCN with the 512 channels on fc7. For faster convergence during the second time, the learning rate of the randomly initialized fc7 layer (with the 512 channels) is multiplied by a factor of 10.

2. Training of region adaptation module: Here we follow the same procedure as for the region adaptation modules of the initial Multi-Region CNN model. During this phase, only the layers of the adaptation module are trained. The weights of the hidden fully connected layer are initialized randomly from a Gaussian distribution.

Footnote: https://gist.github.com/ksimonyan/

Classification SVMs. In order to train the SVMs we follow the same principles as in [10]. As positive samples we consider the ground truth bounding boxes, and as negative samples we consider the selective search proposals [26] that overlap with the ground truth boxes by less than 0.3. We use hard negative mining the same way as in [10, 9].

CNN-based bounding box regression. To create this model we replaced the last classification layer of VGG-Net with a regression layer that outputs 4 channels per class. As a loss function we use the euclidean distance between the target values and the network predictions. For training samples we use the box proposals [26] that overlap by at least 0.4 with the ground truth bounding boxes. The target values of a training sample are defined the same way as in the R-CNN [10] framework. The learning rate is initially set to 0.01 and reduced by a factor of 10 every 40k iterations. The momentum is set to 0.9 and the minibatch size is 128.

Multi-Scale Implementation. In our system we adopt a multi-scale implementation similar to that of SPP-Net [12]. More specifically, we apply the activation maps modules of our models on multiple scales of an image, and then a single scale is selected for each region adaptation module independently.

Multi-Region CNN model: The activation maps module is applied on 7 scales of an image, with their shorter dimension being in {480, 576, 688, 874, 1200, 1600, 2100}. For training, the region adaptation modules are applied on a random scale, and for testing, a single scale is used such that the area of the scaled region is closest to 224 × 224 pixels. In the case of rectangular ring regions, the scale is selected based on the area of the scaled outer box of the rectangular ring.

Semantic Segmentation-Aware CNN model: The activation maps module is applied on 3 scales of an image, with their shorter dimension being in {576, 874, 1200}. For training, the region adaptation module is applied on a random scale, and for testing, a single scale is selected such that the area of the scaled region is closest to 288 × 288 pixels.

Bounding Box Regression CNN model: The activation maps module is applied on 7 scales of an image, with their shorter dimension being in {480, 576, 688, 874, 1200, 1600, 2100}. Both during training and testing, a single scale is used such that the area of the scaled region is closest to 224 × 224 pixels.

6. Experimental Evaluation

We evaluate our detection system on PASCAL VOC2007 [7] and on PASCAL VOC2012 [8]. During the presentation of the results, we will use as baseline either the Original candidate box region alone (fig. 3a) and/or the R-CNN framework with VGG-Net [24]. We note that when the Original candidate box region alone is used, the resulting model is a realization of the SPP-Net [12] object detection framework with the 16-layer VGG-Net [24]. For all the PASCAL VOC2007 results we trained our models on the trainval set and tested them on the test set of the same year.

6.1. Results on PASCAL VOC2007

First, we assess the significance of each of the region adaptation modules alone on the object detection task. Results are reported in table 1. As we expected, the best performing component is the Original candidate box. What is surprising is the high detection performance of individual regions like the Border Region on figure 3i (54.8%) or the Contextual Region on figure 3j (47.2%). Despite the fact that the area visible to them includes only a limited portion of the object, or none at all, they outperform previous detection systems that were based on hand crafted features. Also interesting is the high detection performance of the semantic segmentation-aware region, 56.6%.

In table 2, we report the detection performance of our proposed modules. The Multi-Region CNN model without the semantic segmentation-aware CNN features (MR-CNN) achieves 66.2% mAP, which is 4.2 points higher than R-CNN with VGG-Net (62.0%) and 4.5 points higher than the Original candidate box region alone (61.7%) (which, as we said, is a realization of the SPP-Net framework with the 16-layer VGG-Net). Moreover, its detection performance slightly exceeds that of R-CNN with VGG-Net and bounding box regression (66.0%). Extending the Multi-Region CNN model with the semantic segmentation-aware CNN features (MR-CNN & S-CNN) boosts the performance of our recognition model by another 1.3 points and reaches a total of 67.5% mAP. Compared to the recently published method of Yuting et al. [28], our MR-CNN & S-CNN model scores 1 point higher than their best performing method, which includes generation of extra box proposals via Bayesian optimization and structured loss during the fine-tuning of the VGG-Net. Significant is also the improvement that we get when we couple our recognition model with the CNN model for bounding box regression under the proposed iterative scheme (MR-CNN & S-CNN & Loc. scheme). Specifically, the detection performance is raised from 67.5% to 74.9%, setting the new state-of-the-art on this dataset.

In table 3, we report the detection performance of our system when the IoU threshold for considering a detection positive is set to 0.7. This metric was proposed in [28] in order to reveal the localization capability of their method. From the table we observe that each of our modules exhibits very good localization capability, which was our goal when designing them, and that our overall system exceeds in that metric the approach of [28].

Table 1: Detection performance of individual region adaptation modules. Results on VOC 2007 test set. (The per-class AP columns are not legible in this copy; only the mAP column is listed.)

Adaptation module                      mAP
Original box (fig. 3a)                 0.617
Left half box (fig. 3b)                not legible
Right half box (fig. 3c)               0.502
Up half box (fig. 3d)                  0.522
Bottom half box (fig. 3e)              0.471
Central region (fig. 3f)               0.463
Central region (fig. 3g)               0.573
Border region (fig. 3h)                0.586
Border region (fig. 3i)                0.548
Contextual region (fig. 3j)            0.472
Semantic-aware region                  0.566

Table 2: Detection performance of our modules. Results on VOC 2007 test set. (The per-class AP columns are not legible in this copy; only the mAP column is listed.)

Approach                               mAP
R-CNN with VGG-Net                     0.620
R-CNN with VGG-Net & bbox reg.         0.660
Best approach of [28]                  not legible
Best approach of [28] & bbox reg.      0.685
Original box (fig. 3a)                 0.617
MR-CNN                                 0.662
MR-CNN & S-CNN                         0.675
MR-CNN & S-CNN & Loc. scheme           0.749

Table 3: Detection performance of our modules for IoU > 0.7. Results on VOC 2007 test set. (The per-class AP columns are not legible in this copy; only the mAP column is listed.)

Approach                               mAP
R-CNN with VGG-Net                     0.308
(row label not legible)                0.398
Best approach of [28] & bbox reg.      0.437
Original candidate box                 0.305
MR-CNN                                 0.356
MR-CNN & S-CNN                         0.383
MR-CNN & S-CNN & Loc. scheme           0.484

6.2. Detection error analysis

We use the tool of Hoiem et al. [15] to analyse the detection errors of our system. In figure 8, we plot pie charts with the percentage of detections that are false positives due to bad localization, confusion with a similar category, confusion with another category, and triggering on the background or an unlabelled object. In the first row is our baseline, which is the Original candidate box only model. This model is actually a realization of the SPP-Net framework with the 16-layer VGG-Net. In the middle row is the Multi-Region CNN model without the semantic segmentation-aware CNN features, and in the bottom row is our overall system, which includes the Multi-Region CNN model with the semantic segmentation-aware CNN features coupled with the CNN-based bounding box regression under the iterative localization scheme. We plot only the pie charts for the classes boat, bottle, chair, and pottedplant because of space limitations and the fact that they are the most difficult categories of the PASCAL VOC challenge.

We see from the pie charts that, by using the Multi-Region CNN model, a considerable reduction in the percentage of false positives due to bad localization is achieved. This validates our argument that focusing on multiple regions of an object increases the localization sensitivity of our model. Furthermore, when our recognition model is integrated with the localization module developed for it, the reduction of false positives due to bad localization is huge. A similar observation can be deduced from figure 7, where we plot the top-ranked false positive types of the baseline and of our overall proposed system.

Figure 7: Top ranked false positive types. Top row: our baseline, which is the original candidate box only model. This model is actually a realization of the SPP-Net framework with the 16-layer VGG-Net. Bottom row: our overall system, which includes the Multi-Region CNN model with the semantic segmentation-aware CNN features coupled with the CNN-based bounding box regression under the iterative localization scheme. We plot only the graphs for the classes boat, bottle, chair, and pottedplant because of space limitations and the fact that they are the most difficult categories of the PASCAL VOC challenge.

Table 4: Correlation between the IoU overlap of selective search box proposals [26] (with the closest ground truth bounding box) and the scores assigned to them. (The Original candidate box - Baseline row is not legible in this copy.)

Approach   aero    bike    bird    boat    bottle  bus     car     cat     chair   cow
MR-CNN     0.7938  0.7864  0.7180  0.6424  0.6222  0.7609  0.7918  0.7753  0.6186  0.7483

Approach   table   dog     horse   mbike   person  plant   sheep   sofa    train   tv
MR-CNN     0.6802  0.7448  0.7562  0.7569  0.7166  0.5753  0.7268  0.7148  0.7391  0.7555

Table 5: The Area-Under-Curve (AUC) measure for the well-localized box proposals against the mis-localized box proposals. (The Original candidate box - Baseline row is not legible in this copy.)

Approach   aero    bike    bird    boat    bottle  bus     car     cat     chair   cow
MR-CNN     0.9462  0.9479  0.9282  0.8843  0.8740  0.9498  0.9593  0.9355  0.8790  0.9338

Approach   table   dog     horse   mbike   person  plant   sheep   sofa    train   tv
MR-CNN     0.9127  0.9358  0.9393  0.9440  0.9341  0.8607  0.9120  0.9314  0.9413  0.9210

6.3. Localization awareness of Multi-Region CNN model

Two extra experiments are presented here that indicate the localization awareness of our Multi-Region CNN model without the semantic segmentation-aware CNN features (MR-CNN) against the model that uses only the original candidate box (Baseline).

Correlation between the scores and the IoU overlap of box proposals. In this experiment, we estimate the correlation between the IoU overlap of box proposals [26] (with the closest ground truth bounding box) and the score assigned to them by the two examined models. A high correlation coefficient means that better localized box proposals will tend to be scored higher than mis-localized ones. We report the correlation coefficients of the aforementioned quantities both for the Baseline and MR-CNN models in table 4. Because with this experiment we want to emphasize the localization aspect of the Multi-Region CNN model, we use proposals that overlap with the ground truth bounding boxes by at least 0.1 IoU.

Area-Under-the-Curve of well-localized proposals against mis-localized proposals. ROC curves are typically used to illustrate the capability of a classifier to distinguish between two classes. This discrimination capability can be measured by computing the Area-Under-the-Curve (AUC) metric. The higher the AUC measure is, the more discriminative the classifier is between the two classes. In our case, the set of well-localized box proposals is the positive class and the set of mis-localized box proposals is the negative class. As well-localized are considered the box proposals that overlap with a ground-truth
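The two measures of section 6.3 can be illustrated with a small sketch. It is written in plain Python under several assumptions: the text does not say which correlation coefficient is used, so Pearson's is assumed here; the IoU threshold that separates well-localized from mis-localized proposals is cut off in this copy, so it is left as the required, hypothetical parameter `well_localized_iou`; and the function names are illustrative only. The AUC is computed through its standard rank interpretation: the probability that a randomly drawn positive is scored above a randomly drawn negative, with ties counted as one half.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumed choice of coefficient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def split_by_iou(
    ious: Sequence[float], scores: Sequence[float], well_localized_iou: float
) -> Tuple[List[float], List[float]]:
    """Split proposal scores into well-localized (positive) and mis-localized (negative).

    `well_localized_iou` is a hypothetical parameter; its value is not given in the text above.
    """
    pos = [s for i, s in zip(ious, scores) if i >= well_localized_iou]
    neg = [s for i, s in zip(ious, scores) if i < well_localized_iou]
    return pos, neg
```

For the correlation experiment, one would first keep only the proposals with at least 0.1 IoU, as stated above, and then call `pearson` on their IoU values and scores.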

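The bounding box voting step of section 4 can also be sketched in a few lines. The following Python is a minimal illustration, not the authors' implementation: the greedy non-maximum-suppression loop and its threshold `nms_iou` are assumptions (the text above only gives the 0.5 IoU threshold for voting), scores are assumed to be positive so that they can serve directly as weights, boxes are assumed to have positive area, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms_with_box_voting(
    boxes: Sequence[BoxT],
    scores: Sequence[float],
    nms_iou: float,          # assumed suppression threshold, not stated in the text
    vote_iou: float = 0.5,   # voting threshold from section 4
) -> List[Tuple[BoxT, float]]:
    """Greedy NMS, then refine each picked box by a score-weighted vote."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    picked: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with an already picked one.
        if all(iou(boxes[i], boxes[j]) <= nms_iou for j in picked):
            picked.append(i)

    results: List[Tuple[BoxT, float]] = []
    for i in picked:
        # Every box (suppressed or not) overlapping the picked one by more than vote_iou votes.
        voters = [j for j in range(len(boxes)) if iou(boxes[i], boxes[j]) > vote_iou]
        total = sum(scores[j] for j in voters)
        voted = tuple(
            sum(scores[j] * boxes[j][k] for j in voters) / total for k in range(4)
        )
        results.append((voted, scores[i]))
    return results
```

Because the picked box always overlaps itself with IoU 1, it is always among its own voters, so an isolated detection is returned unchanged.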