Faster R-CNN

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a R
... into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the "single-box" fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.

Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. OverFeat [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

3 FASTER R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with "attention" [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the "attention" of this unified network. (Diagram labels: feature maps, Region Proposal Network, proposals, RoI pooling, classifier.)

3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

Footnote 3: "Region" is a generic term and in this paper we only consider rectangular regions, as is common for many methods (e.g., [27], [4], [6]). "Objectness" measures membership to a set of object classes vs. background.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios. (Left-panel labels: conv feature map, sliding window, intermediate layer, cls layer with 2k scores, reg layer with 4k coordinates, k anchor boxes.)

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate the probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W × H (typically about 2,400 positions), there are W H k anchors in total.

Footnote 4: For simplicity we implement the cls layer as a two-class softmax layer. Alternatively, one may use logistic regression to produce k scores.
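As an illustration of the anchor definition, the following is a minimal sketch that enumerates the k = 9 reference boxes at one sliding-window position and counts the W H k anchors on a feature map. The function names, the (x1, y1, x2, y2) box layout, and the reading of an aspect ratio as height divided by width are assumptions made for this example, not details taken from the released code. The scales 128, 256, 512 and ratios 1:1, 1:2, 2:1 are the defaults stated in the paper.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors centered at (cx, cy).

    Each anchor is (x1, y1, x2, y2). A scale s means a box area of s * s
    pixels. A ratio r is read here as height / width (assumed convention).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # width shrinks as the box gets taller
            h = s * math.sqrt(r)   # height grows so that w * h stays s * s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def count_anchors(feat_w, feat_h, k=9):
    """Total number of anchors on a feat_w x feat_h feature map (W * H * k)."""
    return feat_w * feat_h * k

anchors = make_anchors(0.0, 0.0)
```

A 60 × 40 feature map has 2,400 positions, which matches the "typically about 2,400" figure in the text, so `count_anchors(60, 40)` gives the total number of anchors for such a map.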
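Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method. As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.

Footnote 5: As is the case of FCNs [7], our network is translation invariant up to the network's total stride.

The translation-invariant property also reduces the model size. MultiBox has a (4 + 1) × 800-dimensional fully-connected output layer, whereas our method has a (4 + 2) × 9-dimensional convolutional output layer in the case of k = 9 anchors. As a result, our output layer has 2.8 × 10^4 parameters (512 × (4 + 2) × 9 for VGG-16), two orders of magnitude fewer than MultiBox's output layer that has 6.1 × 10^6 parameters (1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.

Footnote 6: Considering the feature projection layers, our proposal layers' parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10^6; MultiBox's proposal layers' parameter count is 7 × 7 × (64 + 96 + 64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 10^6.

Multi-Scale Anchors as Regression References

Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5 × 7 and 7 × 5). If this way is used to address multiple scales, it can be thought of as a "pyramid of filters" (Figure 1(b)). The second way is usually adopted jointly with the first way [8].

As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).

Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.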
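The parameter counts quoted for the translation-invariant design can be checked with a few lines of arithmetic. The sketch below only re-evaluates the products given in the text and in footnote 6; the function and variable names are illustrative and not part of any library.

```python
def rpn_output_params(feature_dim=512, k=9):
    """RPN output layer: (4 box coordinates + 2 class scores) per anchor."""
    return feature_dim * (4 + 2) * k

def multibox_output_params(feature_dim=1536, n_anchors=800):
    """MultiBox output layer: (4 box coordinates + 1 score) per anchor."""
    return feature_dim * (4 + 1) * n_anchors

rpn_out = rpn_output_params()        # 512 * 6 * 9
mb_out = multibox_output_params()    # 1536 * 5 * 800

# Adding the feature projection layers, as in footnote 6.
rpn_total = 3 * 3 * 512 * 512 + rpn_out
mb_total = 7 * 7 * (64 + 96 + 64 + 64) * 1536 + mb_out

print(rpn_out, mb_out, mb_out / rpn_out)
print(rpn_total, mb_total, mb_total / rpn_total)
```

The first ratio is a little over 200 and the second a little over 10, consistent with the "two orders of magnitude" and "an order of magnitude" statements.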
3.1.2 Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples, but we still adopt the first condition because in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
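The labeling rules can be sketched in plain Python as follows, together with the mini-batch sampling rule from Section 3.1.3 (256 anchors per image, at most half of them positive). This is a minimal sketch under stated assumptions: boxes are (x1, y1, x2, y2) tuples, the label -1 marks anchors that are ignored, a ground-truth box with zero overlap everywhere produces no positive under rule (i), and all function names are hypothetical.

```python
import random

def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = [-1] * len(anchors)
    for i, row in enumerate(overlaps):
        best = max(row) if row else 0.0
        if best > pos_thresh:        # rule (ii): IoU above 0.7 with any box
            labels[i] = 1
        elif best < neg_thresh:      # negative: IoU below 0.3 for all boxes
            labels[i] = 0
    for j in range(len(gt_boxes)):   # rule (i): best anchor(s) per box
        column = [row[j] for row in overlaps]
        top = max(column) if column else 0.0
        if top > 0:                  # assumed guard against zero overlap
            for i, v in enumerate(column):
                if v == top:
                    labels[i] = 1
    return labels

def sample_minibatch(labels, batch_size=256, max_pos_fraction=0.5, seed=0):
    """Sample up to batch_size anchors; pad with negatives if positives are few."""
    rng = random.Random(seed)
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    n_pos = min(len(pos), int(batch_size * max_pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)
```

Applying rule (i) after the thresholds means an anchor with a moderate overlap, say 0.6, still becomes positive when it is the best match for some ground-truth box, which is the rare case the text mentions.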
With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*). \tag{1}$$

Here, $i$ is the index of an anchor in a mini-batch and $p_i$ is the predicted probability of anchor $i$ being an object. The ground-truth label $p_i^*$ is 1 if the anchor is positive, and is 0 if the anchor is negative. $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is that of the ground-truth box associated with a positive anchor. The classification loss $L_{cls}$ is log loss over two classes (object vs. not object). For the regression loss, we use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$ where $R$ is the robust loss function (smooth L1) defined in [2]. The term $p_i^* L_{reg}$ means the regression loss is activated only for positive anchors ($p_i^* = 1$) and is disabled otherwise ($p_i^* = 0$). The outputs of the cls and reg layers consist of $\{p_i\}$ and $\{t_i\}$ respectively.

The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$. In our current implementation (as in the released code), the cls term in Eqn. (1) is normalized by the mini-batch size (i.e., $N_{cls} = 256$) and the reg term is normalized by the number of anchor locations (i.e., $N_{reg} \sim 2,400$). By default we set $\lambda = 10$, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of $\lambda$ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.

For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:

$$\begin{aligned}
t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, \\
t_w &= \log(w/w_a), & t_h &= \log(h/h_a), \\
t_x^* &= (x^* - x_a)/w_a, & t_y^* &= (y^* - y_a)/h_a, \\
t_w^* &= \log(w^*/w_a), & t_h^* &= \log(h^*/h_a),
\end{aligned} \tag{2}$$

where $x$, $y$, $w$, and $h$ denote the box's center coordinates and its width and height. Variables $x$, $x_a$, and $x^*$ are for the predicted box, anchor box, and ground-truth box respectively (likewise for $y$, $w$, $h$). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

Nevertheless, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.

3.1.3 Training RPNs

The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the "image-centric" sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].
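To make Eqn. (1) and Eqn. (2) concrete, here is a minimal pure-Python sketch of the coordinate parameterization, the smooth L1 function from [2], and the combined loss with the normalization constants quoted in the text (N_cls = 256, N_reg of about 2,400, lambda = 10). It assumes boxes are given as (center x, center y, width, height), that the inputs cover only the sampled anchors with labels 0 or 1, and that a small epsilon guards the logarithm. The function names are illustrative, and this is not the released Caffe implementation.

```python
import math

def encode(box, anchor):
    """Eqn. (2): parameterize a box relative to an anchor.

    Both arguments are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def smooth_l1(d):
    """Robust loss R from Fast R-CNN: quadratic near zero, linear elsewhere."""
    ad = abs(d)
    return 0.5 * d * d if ad < 1.0 else ad - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0, eps=1e-12):
    """Eqn. (1) for one mini-batch.

    p:      predicted object probabilities, one per sampled anchor
    p_star: ground-truth labels (1 positive, 0 negative)
    t:      predicted 4-vectors; t_star: target 4-vectors (used if positive)
    """
    cls_sum = 0.0
    reg_sum = 0.0
    for pi, ps, ti, ts in zip(p, p_star, t, t_star):
        prob_of_truth = pi if ps == 1 else 1.0 - pi
        cls_sum += -math.log(max(prob_of_truth, eps))      # two-class log loss
        if ps == 1:                                        # p* gates the reg term
            reg_sum += sum(smooth_l1(a - b) for a, b in zip(ti, ts))
    return cls_sum / n_cls + lam * reg_sum / n_reg
```

With these defaults the ratio lam / n_reg is 1/240, close to 1 / n_cls, which is one way to see why the two terms are described as roughly equally weighted.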
3.2 Sharing Features for RPN and Fast R-CNN

Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).

Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:

(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.

(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes' coordinates that are also network responses, so it is approximate. In our experiments, we have empirically found this solver produces close results, yet reduces the training time by about 25-50% compared with alternating training. This solver is included in our released Python code.

(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an "RoI warping" layer as developed in [15], which is beyond the scope of this paper.

4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.

3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is about 10 pixels on a typical PASCAL image before resizing (about 500 × 375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible: one may still roughly infer the extent of an object if only the middle of the object is visible.

Table 1: The learned average proposal size for each anchor using the ZF net (numbers for s = 600).

anchor       proposal
128², 2:1    188 × 111
128², 1:1    113 × 114
128², 1:2    70 × 92
256², 2:1    416 × 229
256², 1:1    261 × 284
256², 1:2    174 × 332
512², 2:1    768 × 437
512², 1:1    499 × 501
512², 1:2    355 × 715

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult-to-correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
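The handling of cross-boundary anchors can be sketched as follows: enumerate the anchors on a 60 × 40 grid with stride 16, keep only those that lie fully inside a 1000 × 600 image for training, and clip boxes to the image at test time. The placement of anchor centers at the middle of each stride cell, the height/width reading of the aspect ratio, and all function names are assumptions of this sketch, so the number of anchors it keeps will differ somewhat from the "about 6000" reported in the text.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Nine (x1, y1, x2, y2) anchors centered at (cx, cy); ratio = height / width."""
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s / math.sqrt(r), s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

def inside(box, img_w, img_h):
    """True if the box does not cross any image boundary."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h

def clip(box, img_w, img_h):
    """Clip a box to the image, as done for proposals at test time."""
    x1, y1, x2, y2 = box
    return (min(max(x1, 0), img_w), min(max(y1, 0), img_h),
            min(max(x2, 0), img_w), min(max(y2, 0), img_h))

def count_anchors(img_w=1000, img_h=600, grid_w=60, grid_h=40, stride=16):
    """Return (total anchors, anchors fully inside the image)."""
    total = kept = 0
    for gy in range(grid_h):
        for gx in range(grid_w):
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride  # assumed centers
            for box in anchors_at(cx, cy):
                total += 1
                kept += inside(box, img_w, img_h)
    return total, kept

total, kept = count_anchors()
```

Under these assumptions well under half of the anchors survive the boundary check, which is the qualitative point of the paragraph: most large anchors near the border are dropped during training.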
Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are Fast R-CNN with ZF, but using various proposal methods for training and testing.

train-time region proposals     test-time region proposals
method             # boxes      method              # proposals   mAP (%)
SS                 2000         SS                  2000          58.7
EB                 2000         EB                  2000          58.6
RPN+ZF, shared     2000         RPN+ZF, shared      300           59.9
(ablation experiments follow below)
RPN+ZF, unshared   2000         RPN+ZF, unshared    300           58.7
SS                 2000         RPN+ZF              100           55.1
SS                 2000         RPN+ZF              300           56.8
SS                 2000         RPN+ZF              1000          56.3
SS                 2000         RPN+ZF (no NMS)     6000          55.2
SS                 2000         RPN+ZF (no cls)     100           44.6
SS                 2000         RPN+ZF (no cls)     300           51.4
SS                 2000         RPN+ZF (no cls)     1000          55.8
SS                 2000         RPN+ZF (no reg)     300           52.1
SS                 2000         RPN+ZF (no reg)     1000          51.3
SS                 2000         RPN+VGG             300           59.2

Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.
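The NMS step on proposals can be illustrated with a short greedy implementation. This is a minimal sketch of standard greedy NMS with the 0.7 IoU threshold and top-N cut-off described in the text, assuming (x1, y1, x2, y2) boxes and hypothetical function names. It is quadratic in the number of boxes and is not the optimized routine used in the released code.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7, top_n=None):
    """Return indices of kept boxes, highest cls score first.

    A box is dropped when its IoU with an already kept, higher-scoring
    box exceeds iou_thresh. At most top_n indices are returned.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if top_n is not None and len(keep) == top_n:
                break
    return keep
```

For example, two boxes with IoU 0.9 collapse to the higher-scoring one, while a pair with IoU 0.6 both survive at the 0.7 threshold.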
4 EXPERIMENTS

4.1 Experiments on PASCAL VOC

We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the "fast" version of ZF net [32] that has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model [3] that has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).

Footnote 7: www.robots.ox.ac.uk/vgg/research/very_deep/

Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the "fast" mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals. Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers' cost (Table 5).

Footnote 8: For RPN, the number of proposals (e.g., 300) is the maximum number for an image. RPN may produce fewer proposals after NMS, and thus the average number of proposals is smaller.

Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 2). We observe that this is because in the third step, when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.

Next, we disentangle the RPN's influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.

Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.

Next, we separately investigate the roles of RPN's cls and reg outputs by turning off either of them at test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1000 (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account for the accuracy of the highest ranked proposals.

On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.

We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.

Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS, which is predefined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%, better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.

Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-16. Training data: "07": VOC 2007 trainval, "07+12": union set of VOC 2007 trainval and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. †: this number was reported in [2]; using the repository provided by this paper, this result is higher (68.1).

method               # proposals   data          mAP (%)
SS                   2000          07            66.9†
SS                   2000          07+12         70.0
RPN+VGG, unshared    300           07            68.5
RPN+VGG, shared      300           07            69.9
RPN+VGG, shared      300           07+12         73.2
RPN+VGG, shared      300           COCO+07+12    78.8

Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-16. Training data: "07": VOC 2007 trainval, "07++12": union set of VOC 2007 trainval+test and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. †: http://host.robots.ox.ac.uk:8080/anonymous/hzjtqa.html. ‡: http://host.robots.ox.ac.uk:8080/anonymous/ynplxb.html. §: http://host.robots.ox.ac.uk:8080/anonymous/xedh10.html.

method               # proposals   data           mAP (%)
SS                   2000          12             65.7
SS                   2000          07++12         68.4
RPN+VGG, shared†     300           12             67.0
RPN+VGG, shared‡     300           07++12         70.4
RPN+VGG, shared§     300           COCO+07++12    75.9

Table 5: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU. "Region-wise" includes NMS, pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.

model   system              conv   proposal   region-wise   total   rate
VGG     SS + Fast R-CNN     146    1510       174           1830    0.5 fps
VGG     RPN + Fast R-CNN    141    10         47            198     5 fps
ZF      RPN + Fast R-CNN    31     3          25            59      17 fps
Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000. RPN* denotes the unsharing feature version.

method   # box   data          mAP
SS       2000    07            66.9
SS       2000    07+12         70.0
RPN*     300     07            68.5
RPN      300     07            69.9
RPN      300     07+12         73.2
RPN      300     COCO+07+12    78.8

[Per-class AP columns: areo, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv]

Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000.

method   # box   data           mAP
SS       2000    12             65.7
SS       2000    07++12         68.4
RPN      300     12             67.0
RPN      300     07++12         70.4
RPN      300     COCO+07++12    75.9

[Per-class AP columns: areo, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv]

In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.

Sensitivities to Hyper-parameters. In Table 8 we investigate the settings of anchors. By default we use 3 scales and 3 aspect ratios (69.9% mAP in Table 8). If using just one anchor at each position, the mAP drops by a considerable margin of 3-4%. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio (69.8%) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.

Table 8: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different settings of anchors. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 3 scales and 3 aspect ratios (69.9%) is the same as that in Table 3.

settings              anchor scales          aspect ratios       mAP (%)
1 scale, 1 ratio      128²                   1:1                 65.8
1 scale, 1 ratio      256²                   1:1                 66.7
1 scale, 3 ratios     128²                   {2:1, 1:1, 1:2}     68.8
1 scale, 3 ratios     256²                   {2:1, 1:1, 1:2}     67.9
3 scales, 1 ratio     {128², 256², 512²}     1:1                 69.8
3 scales, 3 ratios    {128², 256², 512²}     {2:1, 1:1, 1:2}     69.9

In Table 9 we compare different values of λ in Equation (1). By default we use λ = 10, which makes the two terms in Equation (1) roughly equally weighted after normalization. Table 9 shows that our result is impacted just marginally (by about 1%) when λ is within a scale of about two orders of magnitude (1 to 100). This demonstrates that the result is insensitive to λ in a wide range.

Table 9: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different values of λ in Equation (1). The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using λ = 10 (69.9%) is the same as that in Table 3.

λ          0.1     1       10      100
mAP (%)    67.2    68.9    69.9    69.1

Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [19], [20], [21] related to the ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal method than to evaluate it.

In Figure 4, we show the results of using 300, 1000, and 2000 proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2000 to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than RPN when the proposals are fewer.

Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set. (Panels: 300 proposals, 1000 proposals, 2000 proposals; curves: SS, EB, RPN ZF, RPN VGG; horizontal axis: IoU.)
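The Recall-to-IoU measurement can be sketched as follows: a ground-truth box counts as recalled at a given threshold if at least one of the top-N proposals overlaps it with IoU at or above that threshold. This is a minimal single-image sketch with hypothetical function names and (x1, y1, x2, y2) boxes; the exact evaluation protocol behind Figure 4 is not specified in the text and may differ.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(proposals, gt_boxes, thresh):
    """Fraction of ground-truth boxes matched by some proposal at IoU >= thresh."""
    if not gt_boxes:
        return 0.0
    hits = sum(1 for g in gt_boxes if any(iou(p, g) >= thresh for p in proposals))
    return hits / len(gt_boxes)

def recall_curve(proposals, gt_boxes, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Recall at each IoU threshold, using the proposals in the order given."""
    return [recall_at_iou(proposals, gt_boxes, t) for t in thresholds]
```

To compare proposal budgets as in the figure, one would pass the top 300, 1000, or 2000 proposals (sorted by confidence) as `proposals`.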
The Over Feat paper [9] proposes a detection degradation in both papers. We also note that the one method that uses regressors and classifiers on sliding stage system is slower as it has considerably more windows over convolutional feature maps. OverFeat proposals to process is a onestage, classspecific detection pipeline, and ours is a twostage cascade consisting of classagnostic pro posals and classspecific detections In OverFeat, the 4.2 Experiments on MS COcO regionwise features come from a sliding window of We present more results on the Microsoft COcO one aspect ratio over a scale pyramid. These features object detection dataset [12]. This dataset involves 80 are used to simultaneously determine the location and object categories. We experiment with the 80k images category of objects. In RPN, the features are from on the training set, 40k images on the validation set, square(3x3)sliding windows and predict proposals and 20k images on the testdev set. We evaluate the relative to anchors with different scales and aspect mAP averaged for IOU E 0.5: 0.05: 0.95](COCO's ratios. Though both methods use sliding windowS, the standard metric, simply denoted as mAP@1.5, 95] region proposal task s only the first stage of Faster r and mAP@0. 5(PASCAL VOC'S metric CNNthe downstream Fast rcnn detector attends There are a few minor changes of our system made to the proposals to refine them In the second stage ot for this dataset. We train our models on an 8GPU our cascade, the regionwise features are adaptively implementation and the effective minibatch size be pooled [1], [2] from proposal boxes hat more faith comes 8 for RPN (1 per GPU) and 16 for Fast RCNN fully cover the features of the regions. We believe (2 per GPU). 
The rPn step and FastRCNN step are these features lead to more accurate detections both trained for 240k iterations with a learning rate To compare the onestage and twostage systems, of 0.003 and then for 80k iterations with 0.0003.We we emulate the Over Feat system (and thus also circum modify the learning rates(starting with 0.003 instead vent other differences of implementation details)by of 0.001)because the minibatch size is changed. For onestage Fast RCNN. In this system, the proposals" the anchors we use 3 aspect ratios and scales are dense sliding windows of 3 scales(128, 256,512)(adding 642) rated by handling small and 3 aspect ratios (1: 1, 1: 2, 2: 1 ). Fast RCnn is objects on this dataset In addition, in our Fast RCNN trained to predict classspecific scores and regress box step the negative samples are defined as those with locations from these sliding windows. Because the a maximum loU with ground truth in the interval of OverFeat system adopts an image pyramid, we also [0, 0.5), instead of [0. 1, 0.5 )used in [1], [2]. We note evaluate using convolutional features extracted from that in the SPPnet system [11, the negative samples 5 scales. We use those 5 scales as in [1, [21 in 0. 1, 0.5) are used for network finetuning but the Table 10 compares the twostage system and two negative samples in[0, 0.5)are still visited in the SVm variants of the onestage system. USing the ZF model, step with hardnegative mining But the Fast RCNN the onestage system has an mAP of 53.g%. This is system [2] abandons the SVM step, so the negative lower than the twostage system(58.7%)by 4.8%0. samples in 0, 0. 1)are never visited Including these This experiment justifies the effectiveness of cascaded 0, 0. 1)samples improves mAP@0. 5 on the COCO region proposals and object detection. Similar obser dataset for both Fast RCnn and Faster RCNN sys vations are reported in [2] ,[39], where replacing ss tems(but the impact is negligible on PASCAL VOC)
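The effect of the two negative-sample intervals can be illustrated with a minimal sketch. This is not the authors' code: the function names `iou` and `label_rois`, the `(x1, y1, x2, y2)` box format, and the positive threshold of IoU >= 0.5 (the usual Fast R-CNN convention, not stated in this section) are assumptions made for illustration only.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_rois(rois, gt_boxes, neg_lo=0.0, neg_hi=0.5, pos_thresh=0.5):
    """Label each RoI by its maximum IoU with any ground-truth box.

    neg_lo=0.0 corresponds to the [0, 0.5) interval used for COCO here;
    neg_lo=0.1 corresponds to the [0.1, 0.5) interval of [1], [2].
    pos_thresh=0.5 is an assumed positive threshold (hypothetical default).
    """
    labels = []
    for roi in rois:
        max_iou = max((iou(roi, gt) for gt in gt_boxes), default=0.0)
        if max_iou >= pos_thresh:
            labels.append("positive")
        elif neg_lo <= max_iou < neg_hi:
            labels.append("negative")
        else:
            labels.append("ignored")  # never visited during training
    return labels


# Toy data (made up for illustration): one ground-truth box and four RoIs.
gt_boxes = [(0, 0, 100, 100)]
rois = [
    (0, 0, 100, 100),      # identical to the ground truth
    (50, 0, 150, 100),     # partial overlap, IoU = 1/3
    (95, 95, 195, 195),    # tiny overlap, IoU well below 0.1
    (200, 200, 300, 300),  # no overlap, IoU = 0
]
labels_coco = label_rois(rois, gt_boxes, neg_lo=0.0)
labels_fast_rcnn = label_rois(rois, gt_boxes, neg_lo=0.1)
```

With `neg_lo=0.1` the last two RoIs fall outside the negative interval and are skipped, whereas `neg_lo=0.0` turns them into background samples, which is the change described above for the COCO experiments.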