MaskR-cnn_v2.pdf resource, CSDN Library

Uploaded 2018-01-17 15:15:51, PDF, 7.34 MB

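Mask R-CNN is a deep learning framework for instance segmentation. It extends Faster R-CNN with a branch that predicts object masks in parallel, which yields efficient instance detection together with high-quality segmentation masks. It was proposed by Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick of Facebook AI Research (FAIR) and is described in detail in their paper "Mask R-CNN". Instance segmentation is challenging because it requires detecting every object in an image while also precisely segmenting each instance. It therefore combines elements of two classical computer vision tasks: object detection, which classifies individual objects and localizes each with a bounding box, and semantic segmentation, which assigns every pixel to a category.

Mask R-CNN is simple to train, adds only a small computational overhead to Faster R-CNN, and runs at 5 fps. It also generalizes easily to other tasks, for example estimating human poses within the same framework. On all three tracks of the COCO challenges, Mask R-CNN outperforms all existing single-model entries, including the COCO 2016 challenge winners. The authors hope this simple and effective approach will serve as a solid baseline for instance-level recognition and ease future research.

Mask R-CNN builds on Faster R-CNN by adding a mask-prediction branch while keeping Faster R-CNN's bounding-box recognition branch. This design retains the original detection capability and adds pixel-level segmentation of each instance. The framework is also easy to implement and train, and it achieves good results without elaborate tricks.

One key innovation is the parallel architecture: one branch recognizes bounding boxes and another generates the mask for the corresponding instance. Mask R-CNN can therefore both detect the objects in an image and produce a high-quality mask for each one, which in practice improves the efficiency and accuracy of instance segmentation.

Mask R-CNN's success is attributed to its flexible, general design, which adapts readily to other computer vision tasks. For example, human keypoint detection can be implemented within the same framework, which shows the model's potential for multi-task learning. Its results on several benchmarks established it as a leading instance segmentation method at the time of publication, and the availability of open-source code gives computer vision research and applications a powerful tool.

The techniques underlying Mask R-CNN, deep learning and convolutional neural networks (CNNs), have driven significant progress in image recognition, detection, and segmentation in recent years. Mask R-CNN combines these techniques and extends Faster R-CNN with instance segmentation, making it a powerful tool for a wide range of computer vision problems. As deep learning continues to advance, models such as Mask R-CNN are expected to play an increasingly important role in research and industrial applications.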


Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

Facebook AI Research (FAIR)

Abstract

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.

1. Introduction

The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN [12, 34] and Fully Convolutional Network (FCN) [29] frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.

Figure 1. The Mask R-CNN framework for instance segmentation.

Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances.¹ Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.
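The distinction between the three tasks can be made concrete on a toy example. The sketch below is purely illustrative: the image size, the "sheep" label, and the helper names `bounding_box` and `semantic_map` are assumptions introduced here, not part of the paper. Starting from per-instance masks, it derives the bounding boxes that object detection would report and the per-pixel class map that semantic segmentation would report.

```python
# Toy 4x6 image containing two instances of the same (hypothetical) class.
# Each instance mask is stored as a set of (row, col) pixels.
instances = [
    {"label": "sheep", "pixels": {(0, 0), (0, 1), (1, 0), (1, 1)}},
    {"label": "sheep", "pixels": {(1, 2), (2, 2), (2, 3), (3, 3)}},
]


def bounding_box(pixels):
    """Detection view: tightest box as (row_min, col_min, row_max, col_max)."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))


def semantic_map(instances, height, width, background="bg"):
    """Semantic view: one class label per pixel, instance identity is lost."""
    grid = [[background] * width for _ in range(height)]
    for inst in instances:
        for r, c in inst["pixels"]:
            grid[r][c] = inst["label"]
    return grid


boxes = [bounding_box(inst["pixels"]) for inst in instances]
sem = semantic_map(instances, height=4, width=6)
sheep_pixels = sum(row.count("sheep") for row in sem)

print(boxes)         # one box per instance
print(sheep_pixels)  # total "sheep" pixels, with no way to tell the two animals apart
```

The semantic map merges both animals into one "sheep" region, and the boxes localize them only coarsely. Instance segmentation keeps what the `instances` list holds: a separate pixel mask for each object.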

Our method, called Mask R-CNN, extends Faster R-CNN [34] by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.

¹Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection.

arXiv:1703.06870v2 [cs.CV] 5 Apr 2017

Figure 2. Mask R-CNN results on the COCO test set. These results are based on ResNet-101 [19], achieving a mask AP of 35.7 and running at 5 fps. Masks are shown in color, and bounding box, category, and confidences are also shown.

In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.
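The effect of spatial quantization can be illustrated with a small sketch. It is not the paper's RoIAlign layer, whose details are not given in this excerpt. It only contrasts reading a feature map at a snapped integer cell (RoIPool-style) with reading it at the exact real-valued location by bilinear interpolation. The stride of 16, the flooring rule, the toy feature map, and the function names are all assumptions made for illustration.

```python
import math


def bilinear_sample(feature, y, x):
    """Read a 2D feature map at a real-valued (y, x) by bilinear interpolation.

    Assumes (y, x) lies inside the map; neighbours are clamped at the border.
    """
    h, w = len(feature), len(feature[0])
    y0 = min(max(int(math.floor(y)), 0), h - 1)
    x0 = min(max(int(math.floor(x)), 0), w - 1)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """Quantized lookup: snap the coordinate to an integer cell first."""
    return feature[int(math.floor(y))][int(math.floor(x))]


# Toy feature map whose value is 10 * row + col, so it varies linearly.
feature = [[10 * r + c for c in range(4)] for r in range(4)]

# Assumed stride of 16: image coordinate 20 px maps to 20 / 16 = 1.25 cells.
coord = 20 / 16
aligned = bilinear_sample(feature, coord, coord)
quantized = quantized_sample(feature, coord, coord)

print(aligned)    # value at the exact location (1.25, 1.25)
print(quantized)  # value at the snapped cell (1, 1)
```

The quantized lookup behaves as if the RoI corner sat a quarter of a cell away from where it really is. For box classification such an offset matters little, but for a pixel-accurate mask it shifts the prediction, which is consistent with the larger gains the authors report under stricter localization metrics.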

Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.

Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.
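The per-frame latency quoted here and the 5 fps figure quoted in the abstract describe the same speed. A one-line conversion, with a helper name chosen only for this illustration, makes the link explicit:

```python
def frames_per_second(ms_per_frame):
    """Convert per-frame latency in milliseconds to throughput in frames per second."""
    return 1000.0 / ms_per_frame


fps = frames_per_second(200)  # about 200 ms per frame, as stated in the text
print(fps)
```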

Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset [28]. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Without tricks, Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.
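Viewing a keypoint as a one-hot binary mask can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 8x8 grid, the box coordinates, the example keypoint, and the function name `keypoint_to_one_hot` are all hypothetical. The keypoint's image position is mapped into a grid laid over the instance's box, and exactly one cell is set to 1.

```python
def keypoint_to_one_hot(x, y, box, size):
    """Encode one keypoint as a size x size mask containing a single 1.

    box is (x_min, y_min, x_max, y_max) in image pixels. Returns None when
    the keypoint lies outside the box, since no cell can then represent it.
    """
    x_min, y_min, x_max, y_max = box
    if not (x_min <= x < x_max and y_min <= y < y_max):
        return None
    # Map the box-relative position to a grid cell.
    col = int((x - x_min) / (x_max - x_min) * size)
    row = int((y - y_min) / (y_max - y_min) * size)
    mask = [[0] * size for _ in range(size)]
    mask[row][col] = 1
    return mask


# Hypothetical person box and one keypoint (for instance a shoulder) inside it.
box = (100, 50, 180, 210)
mask = keypoint_to_one_hot(130, 90, box, size=8)

for row in mask:
    print(row)
```

With one such mask per keypoint type, predicting a pose reduces to predicting a set of masks for each detected person, which is why the mask branch needs only minimal modification.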

We will release code to facilitate future research.

2. Related Work

R-CNN: The Region-based CNN (R-CNN) approach [13] to bounding-box object detection is to attend to a manageable number of candidate object regions [38, 20] and evaluate convolutional networks [25, 24] independently on each RoI. R-CNN was extended [18, 12] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN [34] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [35, 27, 21]), and is the current leading framework in several benchmarks.

Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] resorted to bottom-up segments [38, 2]. DeepMask [32] and following works [33, 8] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.
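Parallel prediction of masks and class labels can be sketched for a single RoI as below. The numbers are invented, and the per-pixel sigmoid with a 0.5 threshold is an assumption about how independent binary masks would be produced; this excerpt does not specify it. The point is only the control flow: every class gets its own binary mask, the classification output picks the class, and the mask for that class is returned without any per-pixel competition between classes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def select_mask(mask_logits, class_scores, threshold=0.5):
    """Pick the predicted class, then binarize only that class's mask.

    mask_logits: dict mapping class name -> 2D list of per-pixel logits.
    class_scores: dict mapping class name -> classification score.
    """
    predicted = max(class_scores, key=class_scores.get)
    logits = mask_logits[predicted]
    mask = [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]
    return predicted, mask


# Hypothetical outputs for one RoI, two classes, and a 2x2 mask.
class_scores = {"person": 0.9, "car": 0.1}
mask_logits = {
    "person": [[2.0, -1.0], [3.0, 0.5]],
    "car": [[5.0, 5.0], [5.0, 5.0]],  # large logits, but never compared with "person"
}

label, mask = select_mask(mask_logits, class_scores)
print(label, mask)
```

A per-pixel softmax over classes would instead let the large "car" logits suppress the "person" mask, which is the coupling of segmentation and classification that the authors report working poorly for instance segmentation.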

Most recently, Li et al. [26] combined the segment proposal system in [8] and object detection system in [11] for "fully convolutional instance segmentation" (FCIS). The common idea in [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 5), showing that it is challenged by the fundamental difficulties of segmenting instances.

(Preview ends here; the remaining 9 pages are not shown.)
