Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances
like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation
as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the
detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts
object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which
are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional
features—using the recently popular terminology of neural networks with ‘attention’ mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300
proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning
entries in several tracks. Code has been made publicly available.
Index Terms—Object detection, region proposal, convolutional neural network
1 INTRODUCTION

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.
Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.
One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the downstream detection network and therefore misses important opportunities for sharing computation.
In this paper, we show that an algorithmic change—computing proposals with a deep convolutional neural network—leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10 ms per image).
Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.
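As a concrete illustration, the following is a minimal PyTorch sketch of such an RPN head—our reconstruction for exposition, not the authors' released code. The 512-d intermediate feature matches the VGG-16 setting; `num_anchors` anticipates the multi-scale reference boxes introduced next.

```python
# Minimal sketch of an RPN head (illustrative reconstruction, not the
# authors' released code). A small network slides over the shared
# convolutional feature map and, at every position, emits objectness
# scores and box regressions for a fixed set of reference boxes.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # 3x3 conv maps each sliding-window position to a 512-d feature
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # two sibling 1x1 convs: 2 objectness scores and 4 box-regression
        # offsets per reference box at each position
        self.cls_logits = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls_logits(h), self.bbox_deltas(h)
```

Applied to a feature map of spatial size H×W, a single forward pass of this head scores H·W·`num_anchors` candidate boxes at once, which is why proposal computation is nearly free once the shared trunk has run.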
RPNs are designed to efficiently predict region proposals
with a wide range of scales and aspect ratios. In contrast to
prevalent methods [1], [2], [8], [9] that use pyramids of
images (Fig. 1a) or pyramids of filters (Fig. 1b), we introduce
novel “anchor” boxes that serve as references at multiple
scales and aspect ratios. Our scheme can be thought of as a
pyramid of regression references (Fig. 1c), which avoids
enumerating images or filters of multiple scales or aspect
ratios. This model performs well when trained and tested
using single-scale images and thus benefits running speed.
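As a rough sketch of this scheme (values illustrative; the paper's default is 3 scales × 3 aspect ratios, i.e., 9 anchors per position), the reference boxes can be enumerated as follows:

```python
# Illustrative anchor enumeration (not the authors' code). Each position
# of the feature map corresponds to a stride-spaced location in the input
# image; at every location, one reference box is placed per
# (scale, aspect-ratio) pair.
import itertools
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512),   # box side length in pixels
                 ratios=(0.5, 1.0, 2.0)):  # height / width
    boxes = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx = (x + 0.5) * stride            # anchor center in image coordinates
        cy = (y + 0.5) * stride
        for s, r in itertools.product(scales, ratios):
            w = s / np.sqrt(r)             # keep area ~ s**2 while varying shape
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(boxes)               # (feat_h * feat_w * 9, 4) as x1, y1, x2, y2
```

Because the anchors—not the input image or the filters—carry the multi-scale structure, the network can operate on a single-scale image.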
To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.
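In outline, the alternation looks like the sketch below; every helper name here is a hypothetical stand-in, shown only to make the control flow concrete.

```python
# High-level sketch of the alternating training scheme; all helpers are
# hypothetical stand-ins, not the paper's released pipeline. Only the
# ordering of the steps is the point.

def train_rpn(rpn, images):
    """Fine-tune the RPN end-to-end for the proposal task."""

def generate_proposals(rpn, images):
    """Run the trained RPN; the resulting proposals are then held fixed."""
    return []

def train_detector(detector, images, proposals):
    """Fine-tune Fast R-CNN on the fixed RPN proposals."""

def alternate(rpn, detector, images, rounds=2):
    for _ in range(rounds):
        train_rpn(rpn, images)
        proposals = generate_proposals(rpn, images)
        train_detector(detector, images, proposals)
        # After this step the two networks share convolutional layers,
        # so subsequent rounds fine-tune only the layers unique to each.
```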
• S. Ren is with the University of Science and Technology of China, Hefei, Anhui 230026, China. E-mail: sqren@mail.ustc.edu.cn.
• K. He and J. Sun are with the Visual Computing Group, Microsoft Research, Beijing 100080, China. E-mail: {kahe, jiansun}@microsoft.com.
• R. Girshick is with Facebook AI Research, Seattle, WA 98109. E-mail: rbg@fb.com.