Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances
like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation
as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the
detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts
object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which
are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional
features—using the recently popular terminology of neural networks with ‘attention’ mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300
proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning
entries in several tracks. Code has been made publicly available.
Index Terms—Object detection, region proposal, convolutional neural network
1 INTRODUCTION

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.
Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.
One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the downstream detection network and therefore misses important opportunities for sharing computation.
In this paper, we show that an algorithmic change—computing proposals with a deep convolutional neural network—leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10 ms per image).
Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.
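As a concrete illustration, the following is a minimal PyTorch sketch of such an RPN head—our reconstruction for exposition, not the authors' released code. The 512-d intermediate feature matches the VGG-16 setting; `num_anchors` anticipates the multi-scale reference boxes introduced next.

```python
# Minimal sketch of an RPN head (illustrative reconstruction, not the
# authors' released code). A small network slides over the shared
# convolutional feature map and, at every position, emits objectness
# scores and box regressions for a fixed set of reference boxes.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # 3x3 conv maps each sliding-window position to a 512-d feature
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # two sibling 1x1 convs: 2 objectness scores and 4 box-regression
        # offsets per reference box at each position
        self.cls_logits = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls_logits(h), self.bbox_deltas(h)
```

Applied to a feature map of spatial size H×W, a single forward pass of this head scores H·W·`num_anchors` candidate boxes at once, which is why proposal computation is nearly free once the shared trunk has run.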
RPNs are designed to efficiently predict region proposals
with a wide range of scales and aspect ratios. In contrast to
prevalent methods [1], [2], [8], [9] that use pyramids of
images (Fig. 1a) or pyramids of filters (Fig. 1b), we introduce
novel “anchor” boxes that serve as references at multiple
scales and aspect ratios. Our scheme can be thought of as a
pyramid of regression references (Fig. 1c), which avoids
enumerating images or filters of multiple scales or aspect
ratios. This model performs well when trained and tested
using single-scale images and thus benefits running speed.
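As a rough sketch of this scheme (values illustrative; the paper's default is 3 scales × 3 aspect ratios, i.e., 9 anchors per position), the reference boxes can be enumerated as follows:

```python
# Illustrative anchor enumeration (not the authors' code). Each position
# of the feature map corresponds to a stride-spaced location in the input
# image; at every location, one reference box is placed per
# (scale, aspect-ratio) pair.
import itertools
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512),   # box side length in pixels
                 ratios=(0.5, 1.0, 2.0)):  # height / width
    boxes = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx = (x + 0.5) * stride            # anchor center in image coordinates
        cy = (y + 0.5) * stride
        for s, r in itertools.product(scales, ratios):
            w = s / np.sqrt(r)             # keep area ~ s**2 while varying shape
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(boxes)               # (feat_h * feat_w * 9, 4) as x1, y1, x2, y2
```

Because the anchors—not the input image or the filters—carry the multi-scale structure, the network can operate on a single-scale image.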
To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.
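In outline, the alternation looks like the sketch below; every helper name here is a hypothetical stand-in, shown only to make the control flow concrete.

```python
# High-level sketch of the alternating training scheme; all helpers are
# hypothetical stand-ins, not the paper's released pipeline. Only the
# ordering of the steps is the point.

def train_rpn(rpn, images):
    """Fine-tune the RPN end-to-end for the proposal task."""

def generate_proposals(rpn, images):
    """Run the trained RPN; the resulting proposals are then held fixed."""
    return []

def train_detector(detector, images, proposals):
    """Fine-tune Fast R-CNN on the fixed RPN proposals."""

def alternate(rpn, detector, images, rounds=2):
    for _ in range(rounds):
        train_rpn(rpn, images)
        proposals = generate_proposals(rpn, images)
        train_detector(detector, images, proposals)
        # After this step the two networks share convolutional layers,
        # so subsequent rounds fine-tune only the layers unique to each.
```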
• S. Ren is with the University of Science and Technology of China, Hefei, Anhui 230026, China. E-mail: sqren@mail.ustc.edu.cn.
• K. He and J. Sun are with the Visual Computing Group, Microsoft Research, Beijing 100080, China. E-mail: {kahe, jiansun}@microsoft.com.
• R. Girshick is with Facebook AI Research, Seattle, WA 98109. E-mail: rbg@fb.com.