Rich feature hierarchies for accurate object detection and semantic segmentation
Ross Girshick¹, Jeff Donahue¹,², Trevor Darrell¹,², Jitendra Malik¹
¹UC Berkeley and ²ICSI
{rbg,jdonahue,trevor,malik}@eecs.berkeley.edu
Abstract
Object detection performance, as measured on the canonical
PASCAL VOC dataset, has plateaued in the last few years. The
best-performing methods are complex ensemble systems that
typically combine multiple low-level image features with
high-level context. In this paper, we propose a simple and
scalable detection algorithm that improves mean average
precision (mAP) by more than 30% relative to the previous
best result on VOC 2012, achieving a mAP of 53.3%. Our
approach combines two key insights: (1) one can apply
high-capacity convolutional neural networks (CNNs) to
bottom-up region proposals in order to localize and segment
objects and (2) when labeled training data is scarce,
supervised pre-training for an auxiliary task, followed by
domain-specific fine-tuning, yields a significant
performance boost. Since we combine region proposals with
CNNs, we call our method R-CNN: Regions with CNN features.
We also present experiments that provide insight into what
the network learns, revealing a rich hierarchy of image
features. Source code for the complete system is available
at http://www.cs.berkeley.edu/~rbg/rcnn.
1. Introduction
Features matter. The last decade of progress on various
visual recognition tasks has been based considerably on the
use of SIFT [26] and HOG [7]. But if we look at performance
on the canonical visual recognition task, PASCAL VOC object
detection [12], it is generally acknowledged that progress
has been slow during 2010-2012, with small gains obtained by
building ensemble systems and employing minor variants of
successful methods.
SIFT and HOG are blockwise orientation histograms, a
representation we could associate roughly with complex cells
in V1, the first cortical area in the primate visual
pathway. But we also know that recognition occurs several
stages downstream, which suggests that there might be
hierarchical, multi-stage processes for computing features
that are even more informative for visual recognition.
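
To make "blockwise orientation histogram" concrete, here is a minimal sketch of a magnitude-weighted histogram of gradient orientations for a single image cell. It is not the full SIFT or HOG descriptor (no keypoint detection, block normalization, or interpolation between bins). The function name `orientation_histogram`, the 9-bin unsigned layout, and the 8x8 cell size are assumptions made only for illustration.

```python
import numpy as np

def orientation_histogram(cell: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x).
    gy, gx = np.gradient(cell.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in degrees, folded into [0, 180).
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    # Assign each pixel to a bin; clamp guards against rounding at the upper edge.
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

# A horizontal intensity ramp: every gradient points along x (orientation 0 degrees).
cell = np.tile(np.arange(8, dtype=float), (8, 1))
hist = orientation_histogram(cell)
```

For the ramp above, all gradient energy falls into the first bin, which is the behaviour one would expect from an orientation histogram on a pure vertical-edge pattern.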
Fukushima’s “neocognitron” [16], a biologically-inspired
hierarchical and shift-invariant model for pattern
recognition, was an early attempt at just such a process.
The neocognitron, however, lacked a supervised training
algorithm. LeCun et al. [23] provided the missing algorithm
by showing that stochastic gradient descent, via
backpropagation, can train convolutional neural networks
(CNNs), a class of models that extend the neocognitron.

[Figure: "R-CNN: Regions with CNN features". Panels: 1. Input
image; 2. Extract region proposals (~2k); 3. Compute CNN
features (each warped region is passed through the CNN);
4. Classify regions (aeroplane? no. ... person? yes.
tvmonitor? no.)]
Figure 1: Object detection system overview. Our system (1)
takes an input image, (2) extracts around 2000 bottom-up
region proposals, (3) computes features for each proposal
using a large convolutional neural network (CNN), and then
(4) classifies each region using class-specific linear SVMs.
R-CNN achieves a mean average precision (mAP) of 53.7% on
PASCAL VOC 2010. For comparison, [32] reports 35.1% mAP
using the same region proposals, but with a spatial pyramid
and bag-of-visual-words approach. The popular deformable
part models perform at 33.4%.
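
The four stages of the Figure 1 pipeline can be sketched roughly as follows. This shows only the data flow and is not the authors' implementation: `propose_regions` draws random boxes in place of a real bottom-up proposal method, `extract_features` is a fixed random linear map with a max(x, 0) non-linearity standing in for the CNN, and the per-class weights are random rather than trained SVMs. All function names, the 16x16 warp size, and the 32-dimensional feature size are hypothetical choices for illustration.

```python
import numpy as np

def propose_regions(image, n, rng):
    """Stand-in for a bottom-up proposal method: random boxes (x0, y0, x1, y1)."""
    h, w = image.shape[:2]
    x0 = rng.integers(0, w - 8, size=n)
    y0 = rng.integers(0, h - 8, size=n)
    x1 = np.minimum(x0 + rng.integers(8, w, size=n), w)
    y1 = np.minimum(y0 + rng.integers(8, h, size=n), h)
    return np.stack([x0, y0, x1, y1], axis=1)

def warp(image, box, size):
    """Nearest-neighbour resize of the boxed crop to size x size, ignoring aspect ratio."""
    x0, y0, x1, y1 = box
    rows = np.linspace(y0, y1 - 1, size).round().astype(int)
    cols = np.linspace(x0, x1 - 1, size).round().astype(int)
    return image[np.ix_(rows, cols)]

def extract_features(patch, projection):
    """Placeholder for the CNN: a fixed linear map followed by max(x, 0)."""
    return np.maximum(projection @ patch.ravel(), 0.0)

def classify(features, weights, biases):
    """Class-specific linear scoring: one weight vector and bias per class."""
    return features @ weights.T + biases

rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))                      # (1) input image (synthetic)
size, feat_dim = 16, 32                              # illustrative sizes only
classes = ["aeroplane", "person", "tvmonitor"]
projection = rng.standard_normal((feat_dim, size * size * 3))
weights = rng.standard_normal((len(classes), feat_dim))  # untrained stand-ins
biases = np.zeros(len(classes))

boxes = propose_regions(image, 2000, rng)            # (2) about 2000 proposals
features = np.stack([extract_features(warp(image, b, size), projection)
                     for b in boxes])                # (3) one feature vector per region
scores = classify(features, weights, biases)         # (4) one score per region and class
```

Each row of `scores` holds one score per class for one region, which mirrors the per-region "aeroplane? no. person? yes." decisions drawn in the figure.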
CNNs saw heavy use in the 1990s (e.g., [24]), but then fell
out of fashion, particularly in computer vision, with the
rise of support vector machines. In 2012, Krizhevsky et al.
[22] rekindled interest in CNNs by showing substantially
higher image classification accuracy on the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their
success resulted from training a large CNN on 1.2 million
labeled images, together with a few twists on LeCun’s CNN
(e.g., max(x, 0) rectifying non-linearities and “dropout”
regularization).
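
As a rough illustration of the two "twists" named above, the sketch below applies the max(x, 0) rectifier and a dropout mask to a small vector. The dropout shown is the common "inverted" variant, which rescales the surviving units during training; that choice, the function names, and the drop probability of 0.5 are assumptions of this sketch and are not taken from [22].

```python
import numpy as np

def relu(x):
    """Rectifying non-linearity: elementwise max(x, 0)."""
    return np.maximum(x, 0.0)

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return x
    keep = rng.random(x.shape) >= p_drop
    return x * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(x)                       # negative inputs become 0
dropped = dropout(activated, 0.5, rng)    # each unit is either 0 or doubled
unchanged = dropout(activated, 0.5, rng, training=False)
```

With `training=False` the input is returned unchanged, so no extra scaling is needed at evaluation time in this variant.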
The significance of the ImageNet result was vigorously
debated during the ILSVRC 2012 workshop. The central
issue can be distilled to the following: To what extent do
the CNN classification results on ImageNet generalize to
object detection results on the PASCAL VOC Challenge?
We answer this question decisively by bridging the
chasm between image classification and object detection.
This paper is the first to show that a CNN can lead to dra-
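Reader annotation on SIFT: detects local feature points in
an image and describes them; the description is invariant to
scale and rotation.
Reader annotation on HOG: focuses on gradient information in
the image, describing image features by computing histograms
of gradient orientations over local regions.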