3. Mask R-CNN
Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.
Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [34]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to [21] for the latest, comprehensive comparisons between Faster R-CNN and other frameworks.
Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [32, 10, 26]). Our approach follows the spirit of Fast R-CNN [12] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of the original R-CNN [13]).
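To make the parallel-head structure concrete, the following is a minimal sketch of a second stage whose class, box, and mask branches all operate on the same per-RoI features. Module names, channel sizes, and layer choices here are illustrative assumptions, not the implementation used in the paper.

```python
import torch.nn as nn

class SecondStageHeads(nn.Module):
    """Illustrative second stage: class, box, and mask predictions are
    produced in parallel from the same per-RoI features."""
    def __init__(self, in_channels=256, num_classes=80, roi_size=7):
        super().__init__()
        flat = in_channels * roi_size * roi_size
        self.cls_head = nn.Linear(flat, num_classes)      # class label
        self.box_head = nn.Linear(flat, num_classes * 4)  # per-class box offsets
        self.mask_head = nn.Sequential(                   # small FCN mask branch
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),               # K mask logit maps
        )

    def forward(self, roi_feats):  # roi_feats: (N, C, roi_size, roi_size)
        flat = roi_feats.flatten(start_dim=1)
        return self.cls_head(flat), self.box_head(flat), self.mask_head(roi_feats)
```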
Formally, during training, we define a multi-task loss on each sampled RoI as L = L_cls + L_box + L_mask. The classification loss L_cls and bounding-box loss L_box are identical to those defined in [12]. The mask branch has a Km^2-dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes.
To this we apply a per-pixel sigmoid, and define L_mask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_mask is only defined on the k-th mask (other mask outputs do not contribute to the loss).
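As a concrete sketch of this loss, the snippet below selects the ground-truth class's mask logits for each RoI and averages a per-pixel binary cross-entropy over them; the tensor names and shapes are our own assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """L_mask: average per-pixel sigmoid binary cross-entropy, evaluated
    only on the ground-truth class's mask.

    mask_logits: (N, K, m, m) raw outputs of the mask branch
    gt_masks:    (N, m, m)    binary ground-truth masks
    gt_classes:  (N,)         ground-truth class index k per RoI
    """
    n = mask_logits.shape[0]
    # Pick the k-th of the K predicted masks for each RoI; the other
    # K - 1 mask outputs receive no gradient from this loss.
    logits_k = mask_logits[torch.arange(n), gt_classes]  # (N, m, m)
    return F.binary_cross_entropy_with_logits(logits_k, gt_masks.float())
```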
Our definition of L_mask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
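A toy example of the difference, using made-up logits for a single pixel and K = 3 classes:

```python
import torch

# Hypothetical logits at one pixel for K = 3 classes.
logits = torch.tensor([2.0, 1.8, -1.0])

# Per-pixel softmax (common FCN practice): classes compete for probability
# mass, so two strong classes suppress each other.
print(torch.softmax(logits, dim=0))  # ~[0.53, 0.44, 0.03]

# Per-pixel sigmoid (our L_mask): each class is scored independently,
# so both strong classes can keep a high mask probability.
print(torch.sigmoid(logits))         # ~[0.88, 0.86, 0.27]
```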
Mask Representation: A mask encodes an input object's spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.
Specifically, we predict an m × m mask from each RoI using an FCN [29]. This allows each layer in the mask branch to maintain the explicit m × m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction [32, 33, 10], our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.
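A back-of-the-envelope comparison of the two representations; the feature and mask sizes below are illustrative choices, not the exact head configuration used in the paper.

```python
# Predicting K = 80 masks of resolution 28 x 28 from a 256 x 14 x 14
# RoI feature map (illustrative sizes).
C, H, K, m = 256, 14, 80, 28

# fc-style head: flatten the features and emit every mask pixel from one
# weight matrix; parameters scale with the full output resolution.
fc_params = (C * H * H) * (K * m * m)
print(f"fc head:   {fc_params / 1e6:,.0f}M weights")   # ~3,147M

# Fully convolutional head (3x3 conv -> 2x deconv -> 1x1 conv to K maps):
# parameters are independent of the spatial output size.
conv_params = 256 * C * 3 * 3 + 256 * 256 * 2 * 2 + K * 256
print(f"conv head: {conv_params / 1e6:.2f}M weights")  # ~0.87M
```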
This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
RoIAlign: RoIPool [12] is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-point RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is the feature map stride and [·] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.
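A small numeric illustration of the effect (the stride and coordinates are made up): two image-space coordinates 9 pixels apart snap to the same feature-map cell under RoIPool-style rounding, while the continuous coordinate keeps them distinct.

```python
STRIDE = 16  # feature map stride (illustrative)

def roipool_coord(x):
    return round(x / STRIDE)  # [x/16]: quantized, RoIPool-style

def roialign_coord(x):
    return x / STRIDE         # x/16: continuous, RoIAlign-style

for x in (110.0, 119.0):
    print(x, roipool_coord(x), roialign_coord(x))
# 110.0 -> bin 7, coordinate 6.875
# 119.0 -> bin 7, coordinate 7.4375  (same bin, despite being 9 px away)
```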
To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).²
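The following is a minimal NumPy sketch of this sampling scheme for a single RoI bin with continuous boundaries; the function names and the in-bounds assumption are ours.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at continuous (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align_bin(feat, top, left, bottom, right):
    """Aggregate one RoI bin: bilinear values at four regular sample points
    (the bin's quarter positions), averaged. Boundaries stay continuous and
    nothing is rounded. Assumes the bin lies inside the feature map."""
    bh, bw = bottom - top, right - left
    samples = [bilinear(feat, top + bh * fy, left + bw * fx)
               for fy in (0.25, 0.75) for fx in (0.25, 0.75)]
    return np.mean(samples)
```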
RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] as quantizing the RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.
² We sample four regular locations, so that we can evaluate either max or average pooling. In fact, interpolating only a single value at each bin center (without pooling) is nearly as effective. One could also sample more than four locations per bin, which we found to give diminishing returns.