FasterR-CNNTowardsReal-TimeObjectDetectionwithRegionProposal资源-CSDN文库

共1个文件

pdf：1个

Faster

R-CNN

需积分: 31 115 浏览量 2016-10-14 16:57:42 上传评论收藏 6.04MB RAR 举报

资源推荐

资源详情

资源评论

收起资源包目录

Faster R-CNN .rar （1个子文件）

Faster R-CNN .pdf 6.59MB

Faster R-CNN: Towards Real-Time Object

Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.

Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region

proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image

convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional

network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to

generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN

into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with

“attention” mechanisms, the RPN component tells the uniﬁed network where to look. For the very deep VGG-16 model [3],

our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection

accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO

2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been

made publicly available.

Index Terms—Object Detection, Region Proposal, Convolutional Neural Network.

1 INTRODUCTION

Recent advances in object detection are driven by

the success of region proposal methods (e.g., [4])

and region-based convolutional neural networks (R-

CNNs) [5]. Although region-based CNNs were com-

putationally expensive as originally developed in [5],

their cost has been drastically reduced thanks to shar-

ing convolutions across proposals [1], [2]. The latest

incarnation, Fast R-CNN [2], achieves near real-time

rates using very deep networks [3], when ignoring the

time spent on region proposals. Now, proposals are the

test-time computational bottleneck in state-of-the-art

detection systems.

Region proposal methods typically rely on inex-

pensive features and economical inference schemes.

Selective Search [4], one of the most popular meth-

ods, greedily merges superpixels based on engineered

low-level features. Yet when compared to efﬁcient

detection networks [2], Selective Search is an order of

magnitude slower, at 2 seconds per image in a CPU

implementation. EdgeBoxes [6] currently provides the

best tradeoff between proposal quality and speed,

at 0.2 seconds per image. Nevertheless, the region

proposal step still consumes as much running time

as the detection network.

• S. Ren is with University of Science and Technology of China, Hefei,

China. This work was done when S. Ren was an intern at Microsoft

Research. Email: sqren@mail.ustc.edu.cn

• K. He and J. Sun are with Visual Computing Group, Microsoft

Research. E-mail: {kahe,jiansun}@microsoft.com

• R. Girshick is with Facebook AI Research. The majority of this work

was done when R. Girshick was with Microsoft Research. E-mail:

rbg@fb.com

One may note that fast region-based CNNs take

advantage of GPUs, while the region proposal meth-

ods used in research are implemented on the CPU,

making such runtime comparisons inequitable. An ob-

vious way to accelerate proposal computation is to re-

implement it for the GPU. This may be an effective en-

gineering solution, but re-implementation ignores the

down-stream detection network and therefore misses

important opportunities for sharing computation.

In this paper, we show that an algorithmic change—

computing proposals with a deep convolutional neu-

ral network—leads to an elegant and effective solution

where proposal computation is nearly cost-free given

the detection network’s computation. To this end, we

introduce novel Region Proposal Networks (RPNs) that

share convolutional layers with state-of-the-art object

detection networks [1], [2]. By sharing convolutions at

test-time, the marginal cost for computing proposals

is small (e.g., 10ms per image).

Our observation is that the convolutional feature

maps used by region-based detectors, like Fast R-

CNN, can also be used for generating region pro-

posals. On top of these convolutional features, we

construct an RPN by adding a few additional con-

volutional layers that simultaneously regress region

bounds and objectness scores at each location on a

regular grid. The RPN is thus a kind of fully convo-

lutional network (FCN) [7] and can be trained end-to-

end speciﬁcally for the task for generating detection

proposals.

RPNs are designed to efﬁciently predict region pro-

posals with a wide range of scales and aspect ratios. In

contrast to prevalent methods [8], [9], [1], [2] that use

arXiv:1506.01497v3 [cs.CV] 6 Jan 2016

multiple scaled images

multiple filter sizes

multiple references

(a) (b) (c)

image

feature map

image

feature map

image

feature map

Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps

are built, and the classiﬁer is run at all scales. (b) Pyramids of ﬁlters with multiple scales/sizes are run on

the feature map. (c) We use pyramids of reference boxes in the regression functions.

pyramids of images (Figure 1, a) or pyramids of ﬁlters

(Figure 1, b), we introduce novel “anchor” boxes

that serve as references at multiple scales and aspect

ratios. Our scheme can be thought of as a pyramid

of regression references (Figure 1, c), which avoids

enumerating images or ﬁlters of multiple scales or

aspect ratios. This model performs well when trained

and tested using single-scale images and thus beneﬁts

running speed.

To unify RPNs with Fast R-CNN [2] object detec-

tion networks, we propose a training scheme that

alternates between ﬁne-tuning for the region proposal

task and then ﬁne-tuning for object detection, while

keeping the proposals ﬁxed. This scheme converges

quickly and produces a uniﬁed network with convo-

lutional features that are shared between both tasks.

We comprehensively evaluate our method on the

PASCAL VOC detection benchmarks [11] where RPNs

with Fast R-CNNs produce detection accuracy bet-

ter than the strong baseline of Selective Search with

Fast R-CNNs. Meanwhile, our method waives nearly

all computational burdens of Selective Search at

test-time—the effective running time for proposals

is just 10 milliseconds. Using the expensive very

deep models of [3], our detection method still has

a frame rate of 5fps (including all steps) on a GPU,

and thus is a practical object detection system in

terms of both speed and accuracy. We also report

results on the MS COCO dataset [12] and investi-

gate the improvements on PASCAL VOC using the

COCO data. Code has been made publicly available

at https://github.com/shaoqingren/faster_

rcnn (in MATLAB) and https://github.com/

rbgirshick/py-faster-rcnn (in Python).

A preliminary version of this manuscript was pub-

lished previously [10]. Since then, the frameworks of

RPN and Faster R-CNN have been adopted and gen-

eralized to other methods, such as 3D object detection

[13], part-based detection [14], instance segmentation

[15], and image captioning [16]. Our fast and effective

object detection system has also been built in com-

1. Since the publication of the conference version of this paper

[10], we have also found that RPNs can be trained jointly with Fast

R-CNN networks leading to less training time.

mercial systems such as at Pinterests [17], with user

engagement improvements reported.

In ILSVRC and COCO 2015 competitions, Faster

R-CNN and RPN are the basis of several 1st-place

entries [18] in the tracks of ImageNet detection, Ima-

geNet localization, COCO detection, and COCO seg-

mentation. RPNs completely learn to propose regions

from data, and thus can easily beneﬁt from deeper

and more expressive features (such as the 101-layer

residual nets adopted in [18]). Faster R-CNN and RPN

are also used by several other leading entries in these

competitions

. These results suggest that our method

is not only a cost-efﬁcient solution for practical usage,

but also an effective way of improving object detec-

tion accuracy.

2 RELATED WORK

Object Proposals. There is a large literature on object

proposal methods. Comprehensive surveys and com-

parisons of object proposal methods can be found in

[19], [20], [21]. Widely used object proposal methods

include those based on grouping super-pixels (e.g.,

Selective Search [4], CPMC [22], MCG [23]) and those

based on sliding windows (e.g., objectness in windows

[24], EdgeBoxes [6]). Object proposal methods were

adopted as external modules independent of the de-

tectors (e.g., Selective Search [4] object detectors, R-

CNN [5], and Fast R-CNN [2]).

Deep Networks for Object Detection. The R-CNN

method [5] trains CNNs end-to-end to classify the

proposal regions into object categories or background.

R-CNN mainly plays as a classiﬁer, and it does not

predict object bounds (except for reﬁning by bounding

box regression). Its accuracy depends on the perfor-

mance of the region proposal module (see compar-

isons in [20]). Several papers have proposed ways of

using deep networks for predicting object bounding

boxes [25], [9], [26], [27]. In the OverFeat method [9],

a fully-connected layer is trained to predict the box

coordinates for the localization task that assumes a

single object. The fully-connected layer is then turned

2. http://image-net.org/challenges/LSVRC/2015/results

image

conv layers

feature maps

Region Proposal Network

proposals

classifier

RoI pooling

Figure 2: Faster R-CNN is a single, uniﬁed network

for object detection. The RPN module serves as the

‘attention’ of this uniﬁed network.

into a convolutional layer for detecting multiple class-

speciﬁc objects. The MultiBox methods [26], [27] gen-

erate region proposals from a network whose last

fully-connected layer simultaneously predicts mul-

tiple class-agnostic boxes, generalizing the “single-

box” fashion of OverFeat. These class-agnostic boxes

are used as proposals for R-CNN [5]. The MultiBox

proposal network is applied on a single image crop or

multiple large image crops (e.g., 224×224), in contrast

to our fully convolutional scheme. MultiBox does not

share features between the proposal and detection

networks. We discuss OverFeat and MultiBox in more

depth later in context with our method. Concurrent

with our work, the DeepMask method [28] is devel-

oped for learning segmentation proposals.

Shared computation of convolutions [9], [1], [29],

[7], [2] has been attracting increasing attention for ef-

ﬁcient, yet accurate, visual recognition. The OverFeat

paper [9] computes convolutional features from an

image pyramid for classiﬁcation, localization, and de-

tection. Adaptively-sized pooling (SPP) [1] on shared

convolutional feature maps is developed for efﬁcient

region-based object detection [1], [30] and semantic

segmentation [29]. Fast R-CNN [2] enables end-to-end

detector training on shared convolutional features and

shows compelling accuracy and speed.

3 FASTER R-CNN

Our object detection system, called Faster R-CNN, is

composed of two modules. The ﬁrst module is a deep

fully convolutional network that proposes regions,

and the second module is the Fast R-CNN detector [2]

that uses the proposed regions. The entire system is a

single, uniﬁed network for object detection (Figure 2).

Using the recently popular terminology of neural

networks with ‘attention’ [31] mechanisms, the RPN

module tells the Fast R-CNN module where to look.

In Section 3.1 we introduce the designs and properties

of the network for region proposal. In Section 3.2 we

develop algorithms for training both modules with

features shared.

3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image

(of any size) as input and outputs a set of rectangular

object proposals, each with an objectness score.

model this process with a fully convolutional network

[7], which we describe in this section. Because our ulti-

mate goal is to share computation with a Fast R-CNN

object detection network [2], we assume that both nets

share a common set of convolutional layers. In our ex-

periments, we investigate the Zeiler and Fergus model

[32] (ZF), which has 5 shareable convolutional layers

and the Simonyan and Zisserman model [3] (VGG-16),

which has 13 shareable convolutional layers.

To generate region proposals, we slide a small

network over the convolutional feature map output

by the last shared convolutional layer. This small

network takes as input an n × n spatial window of

the input convolutional feature map. Each sliding

window is mapped to a lower-dimensional feature

(256-d for ZF and 512-d for VGG, with ReLU [33]

following). This feature is fed into two sibling fully-

connected layers—a box-regression layer (reg) and a

box-classiﬁcation layer (cls). We use n = 3 in this

paper, noting that the effective receptive ﬁeld on the

input image is large (171 and 228 pixels for ZF and

VGG, respectively). This mini-network is illustrated

at a single position in Figure 3 (left). Note that be-

cause the mini-network operates in a sliding-window

fashion, the fully-connected layers are shared across

all spatial locations. This architecture is naturally im-

plemented with an n×n convolutional layer followed

by two sibling 1 × 1 convolutional layers (for reg and

cls, respectively).

3.1.1 Anchors

At each sliding-window location, we simultaneously

predict multiple region proposals, where the number

of maximum possible proposals for each location is

denoted as k. So the reg layer has 4k outputs encoding

the coordinates of k boxes, and the cls layer outputs

2k scores that estimate probability of object or not

object for each proposal

. The k proposals are param-

eterized relative to k reference boxes, which we call

3. “Region” is a generic term and in this paper we only consider

rectangular regions, as is common for many methods (e.g., [27], [4],

[6]). “Objectness” measures membership to a set of object classes

vs. background.

4. For simplicity we implement the cls layer as a two-class

softmax layer. Alternatively, one may use logistic regression to

produce k scores.

car : 1.000

dog : 0.997

person : 0.992

person : 0.979

horse : 0.993

conv feature map

intermediate layer

256-d

2k scores 4k coordinates

sliding window

reg la yercls layer

bus : 0.996

person : 0.736

boat : 0.970

person : 0.989

person : 0.983

person : 0.925

cat : 0.982

dog : 0.994

Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL

VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.

anchors. An anchor is centered at the sliding window

in question, and is associated with a scale and aspect

ratio (Figure 3, left). By default we use 3 scales and

3 aspect ratios, yielding k = 9 anchors at each sliding

position. For a convolutional feature map of a size

W × H (typically ∼2,400), there are W Hk anchors in

total.

Translation-Invariant Anchors

An important property of our approach is that it

is translation invariant, both in terms of the anchors

and the functions that compute proposals relative to

the anchors. If one translates an object in an image,

the proposal should translate and the same function

should be able to predict the proposal in either lo-

cation. This translation-invariant property is guaran-

teed by our method

. As a comparison, the MultiBox

method [27] uses k-means to generate 800 anchors,

which are not translation invariant. So MultiBox does

not guarantee that the same proposal is generated if

an object is translated.

The translation-invariant property also reduces the

model size. MultiBox has a (4 + 1) × 800-dimensional

fully-connected output layer, whereas our method has

a (4 + 2) × 9-dimensional convolutional output layer

in the case of k = 9 anchors. As a result, our output

layer has 2.8 × 10

parameters (512 × (4 + 2) × 9

for VGG-16), two orders of magnitude fewer than

MultiBox’s output layer that has 6.1 × 10

parameters

(1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox

[27]). If considering the feature projection layers, our

proposal layers still have an order of magnitude fewer

parameters than MultiBox

. We expect our method

to have less risk of overﬁtting on small datasets, like

PASCAL VOC.

5. As is the case of FCNs [7], our network is translation invariant

up to the network’s total stride.

6. Considering the feature projection layers, our proposal layers’

parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10

;

MultiBox’s proposal layers’ parameter count is 7 × 7 × (64 + 96 +

64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 10

Multi-Scale Anchors as Regression References

Our design of anchors presents a novel scheme

for addressing multiple scales (and aspect ratios). As

shown in Figure 1, there have been two popular ways

for multi-scale predictions. The ﬁrst way is based on

image/feature pyramids, e.g., in DPM [8] and CNN-

based methods [9], [1], [2]. The images are resized at

multiple scales, and feature maps (HOG [8] or deep

convolutional features [9], [1], [2]) are computed for

each scale (Figure 1(a)). This way is often useful but

is time-consuming. The second way is to use sliding

windows of multiple scales (and/or aspect ratios) on

the feature maps. For example, in DPM [8], models

of different aspect ratios are trained separately using

different ﬁlter sizes (such as 5×7 and 7×5). If this way

is used to address multiple scales, it can be thought

of as a “pyramid of ﬁlters” (Figure 1(b)). The second

way is usually adopted jointly with the ﬁrst way [8].

As a comparison, our anchor-based method is built

on a pyramid of anchors, which is more cost-efﬁcient.

Our method classiﬁes and regresses bounding boxes

with reference to anchor boxes of multiple scales and

aspect ratios. It only relies on images and feature

maps of a single scale, and uses ﬁlters (sliding win-

dows on the feature map) of a single size. We show by

experiments the effects of this scheme for addressing

multiple scales and sizes (Table 8).

Because of this multi-scale design based on anchors,

we can simply use the convolutional features com-

puted on a single-scale image, as is also done by

the Fast R-CNN detector [2]. The design of multi-

scale anchors is a key component for sharing features

without extra cost for addressing scales.

3.1.2 Loss Function

For training RPNs, we assign a binary class label

(of being an object or not) to each anchor. We as-

sign a positive label to two kinds of anchors: (i) the

anchor/anchors with the highest Intersection-over-

Union (IoU) overlap with a ground-truth box, or (ii) an

anchor that has an IoU overlap higher than 0.7 with

any ground-truth box. Note that a single ground-truth

box may assign positive labels to multiple anchors.

Usually the second condition is sufﬁcient to determine

the positive samples; but we still adopt the ﬁrst

condition for the reason that in some rare cases the

second condition may ﬁnd no positive sample. We

assign a negative label to a non-positive anchor if its

IoU ratio is lower than 0.3 for all ground-truth boxes.

Anchors that are neither positive nor negative do not

contribute to the training objective.

With these deﬁnitions, we minimize an objective

function following the multi-task loss in Fast R-CNN

[2]. Our loss function for an image is deﬁned as:

L({p

}, {t

}) =

cls

, p

∗

)

+λ

reg

∗

reg

, t

∗

(1)

Here, i is the index of an anchor in a mini-batch and

is the predicted probability of anchor i being an

object. The ground-truth label p

∗

is 1 if the anchor

is positive, and is 0 if the anchor is negative. t

is a

vector representing the 4 parameterized coordinates

of the predicted bounding box, and t

∗

is that of the

ground-truth box associated with a positive anchor.

The classiﬁcation loss L

cls

is log loss over two classes

(object vs. not object). For the regression loss, we use

reg

, t

∗

) = R(t

− t

∗

) where R is the robust loss

function (smooth L

) deﬁned in [2]. The term p

∗

reg

means the regression loss is activated only for positive

anchors (p

∗

= 1) and is disabled otherwise (p

∗

= 0).

The outputs of the cls and reg layers consist of {p

}

and {t

} respectively.

The two terms are normalized by N

cls

and N

reg

and weighted by a balancing parameter λ. In our

current implementation (as in the released code), the

cls term in Eqn.(1) is normalized by the mini-batch

size (i.e., N

cls

= 256) and the reg term is normalized

by the number of anchor locations (i.e., N

reg

∼ 2, 400).

By default we set λ = 10, and thus both cls and

reg terms are roughly equally weighted. We show

by experiments that the results are insensitive to the

values of λ in a wide range (Table 9). We also note

that the normalization as above is not required and

could be simpliﬁed.

For bounding box regression, we adopt the param-

eterizations of the 4 coordinates following [5]:

= (x − x

)/w

, t

= (y − y

)/h

= log(w/w

), t

= log(h/h

∗

= (x

∗

− x

)/w

, t

∗

= (y

∗

− y

)/h

∗

= log(w

∗

), t

∗

= log(h

∗

(2)

where x, y, w, and h denote the box’s center coordi-

nates and its width and height. Variables x, x

, and

∗

are for the predicted box, anchor box, and ground-

truth box respectively (likewise for y, w, h). This can

be thought of as bounding-box regression from an

anchor box to a nearby ground-truth box.

Nevertheless, our method achieves bounding-box

regression by a different manner from previous RoI-

based (Region of Interest) methods [1], [2]. In [1],

[2], bounding-box regression is performed on features

pooled from arbitrarily sized RoIs, and the regression

weights are shared by all region sizes. In our formula-

tion, the features used for regression are of the same

spatial size (3 × 3) on the feature maps. To account

for varying sizes, a set of k bounding-box regressors

are learned. Each regressor is responsible for one scale

and one aspect ratio, and the k regressors do not share

weights. As such, it is still possible to predict boxes of

various sizes even though the features are of a ﬁxed

size/scale, thanks to the design of anchors.

3.1.3 Training RPNs

The RPN can be trained end-to-end by back-

propagation and stochastic gradient descent (SGD)

[35]. We follow the “image-centric” sampling strategy

from [2] to train this network. Each mini-batch arises

from a single image that contains many positive and

negative example anchors. It is possible to optimize

for the loss functions of all anchors, but this will

bias towards negative samples as they are dominate.

Instead, we randomly sample 256 anchors in an image

to compute the loss function of a mini-batch, where

the sampled positive and negative anchors have a

ratio of up to 1:1. If there are fewer than 128 positive

samples in an image, we pad the mini-batch with

negative ones.

We randomly initialize all new layers by drawing

weights from a zero-mean Gaussian distribution with

standard deviation 0.01. All other layers (i.e., the

shared convolutional layers) are initialized by pre-

training a model for ImageNet classiﬁcation [36], as

is standard practice [5]. We tune all layers of the

ZF net, and conv3 1 and up for the VGG net to

conserve memory [2]. We use a learning rate of 0.001

for 60k mini-batches, and 0.0001 for the next 20k

mini-batches on the PASCAL VOC dataset. We use a

momentum of 0.9 and a weight decay of 0.0005 [37].

Our implementation uses Caffe [38].

3.2 Sharing Features for RPN and Fast R-CNN

Thus far we have described how to train a network

for region proposal generation, without considering

the region-based object detection CNN that will utilize

these proposals. For the detection network, we adopt

Fast R-CNN [2]. Next we describe algorithms that

learn a uniﬁed network composed of RPN and Fast

R-CNN with shared convolutional layers (Figure 2).

Both RPN and Fast R-CNN, trained independently,

will modify their convolutional layers in different

ways. We therefore need to develop a technique that

allows for sharing convolutional layers between the

评论收藏

内容反馈

ture_dream

粉丝: 277
资源: 63

Faster R-CNN Towards Real-Time Object Detection with Region Prop...

最新资源

Faster R-CNN Towards Real-Time Object Detection with Region Prop...

Faster R-CNN: Towards Real-Time Object Detection with Region

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN Towards Real-Time Object Detection With Region Proposal Networks

Robust Real-time Object Detection

real-time concepts for operating system

NFC Software in a Mobile System

DIV教程

最新微信公众号二维码获取方式(2017.11)

image object detection

Faster R-CNN Towards Real-Time Object

【走读】Faster R-CNN Towards Real-Time Object Detection with Region

近邻传播聚类（affinity propagation clustering）MATLAB程序

网络编程实用教程(第二版)-源代码-叶树华

PCI EXPRESS体系结构导读 PDF

DADICC: Intelligent system for anomaly detection in a combined

Cadence17.2 HotfixS029补丁包

DHT22(AM2302)STM32f103程序

仿招行信用卡已出账单环形动画效果

Real-time multiple vehicle detection and tracking from a moving vehicle.pdf

Tikhonov正则化工具箱.zip

Development of a Fall Detection System with Microsoft Kinect

winqsb运筹学软件

object detection and tracking

fiddler中文免安装版

KepOPC工业互联网数据交换平台V2.4.9

Zabbix 模板 H3C交换机通用模板zbx_export_templates

Fiddler官方版下载

最新资源