3. Mask R-CNN
Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.
Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [34]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to [21] for the latest, comprehensive comparisons between Faster R-CNN and other frameworks.
Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [32, 10, 26]). Our approach follows the spirit of Fast R-CNN [12] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of the original R-CNN [13]).
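To make the parallel-head structure concrete, the following is a minimal sketch of a second stage whose class, box, and mask branches all operate on the same per-RoI features. Module names, channel sizes, and layer choices here are illustrative assumptions, not the implementation used in the paper.

```python
import torch.nn as nn

class SecondStageHeads(nn.Module):
    """Illustrative second stage: class, box, and mask predictions are
    produced in parallel from the same per-RoI features."""
    def __init__(self, in_channels=256, num_classes=80, roi_size=7):
        super().__init__()
        flat = in_channels * roi_size * roi_size
        self.cls_head = nn.Linear(flat, num_classes)      # class label
        self.box_head = nn.Linear(flat, num_classes * 4)  # per-class box offsets
        self.mask_head = nn.Sequential(                   # small FCN mask branch
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),               # K mask logit maps
        )

    def forward(self, roi_feats):  # roi_feats: (N, C, roi_size, roi_size)
        flat = roi_feats.flatten(start_dim=1)
        return self.cls_head(flat), self.box_head(flat), self.mask_head(roi_feats)
```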
Formally, during training, we define a multi-task loss on each sampled RoI as L = L_cls + L_box + L_mask. The classification loss L_cls and bounding-box loss L_box are identical to those defined in [12]. The mask branch has a Km^2-dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes.
To this we apply a per-pixel sigmoid, and define L_mask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_mask is only defined on the k-th mask (other mask outputs do not contribute to the loss).
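As a concrete sketch of this loss, the snippet below selects the ground-truth class's mask logits for each RoI and averages a per-pixel binary cross-entropy over them; the tensor names and shapes are our own assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """L_mask: average per-pixel sigmoid binary cross-entropy, evaluated
    only on the ground-truth class's mask.

    mask_logits: (N, K, m, m) raw outputs of the mask branch
    gt_masks:    (N, m, m)    binary ground-truth masks
    gt_classes:  (N,)         ground-truth class index k per RoI
    """
    n = mask_logits.shape[0]
    # Pick the k-th of the K predicted masks for each RoI; the other
    # K - 1 mask outputs receive no gradient from this loss.
    logits_k = mask_logits[torch.arange(n), gt_classes]  # (N, m, m)
    return F.binary_cross_entropy_with_logits(logits_k, gt_masks.float())
```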
Our definition of L_mask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
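A toy example of the difference, using made-up logits for a single pixel and K = 3 classes:

```python
import torch

# Hypothetical logits at one pixel for K = 3 classes.
logits = torch.tensor([2.0, 1.8, -1.0])

# Per-pixel softmax (common FCN practice): classes compete for probability
# mass, so two strong classes suppress each other.
print(torch.softmax(logits, dim=0))  # ~[0.53, 0.44, 0.03]

# Per-pixel sigmoid (our L_mask): each class is scored independently,
# so both strong classes can keep a high mask probability.
print(torch.sigmoid(logits))         # ~[0.88, 0.86, 0.27]
```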
Mask Representation: A mask encodes an input object's spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.
Specifically, we predict an m × m mask from each RoI using an FCN [29]. This allows each layer in the mask branch to maintain the explicit m × m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction [32, 33, 10], our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.
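A back-of-the-envelope comparison of the two representations; the feature and mask sizes below are illustrative choices, not the exact head configuration used in the paper.

```python
# Predicting K = 80 masks of resolution 28 x 28 from a 256 x 14 x 14
# RoI feature map (illustrative sizes).
C, H, K, m = 256, 14, 80, 28

# fc-style head: flatten the features and emit every mask pixel from one
# weight matrix; parameters scale with the full output resolution.
fc_params = (C * H * H) * (K * m * m)
print(f"fc head:   {fc_params / 1e6:,.0f}M weights")   # ~3,147M

# Fully convolutional head (3x3 conv -> 2x deconv -> 1x1 conv to K maps):
# parameters are independent of the spatial output size.
conv_params = 256 * C * 3 * 3 + 256 * 256 * 2 * 2 + K * 256
print(f"conv head: {conv_params / 1e6:.2f}M weights")  # ~0.87M
```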
This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
RoIAlign: RoIPool [12] is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-point RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is the feature map stride and [·] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.
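A small numeric illustration of the effect (the stride and coordinates are made up): two image-space coordinates 9 pixels apart snap to the same feature-map cell under RoIPool-style rounding, while the continuous coordinate keeps them distinct.

```python
STRIDE = 16  # feature map stride (illustrative)

def roipool_coord(x):
    return round(x / STRIDE)  # [x/16]: quantized, RoIPool-style

def roialign_coord(x):
    return x / STRIDE         # x/16: continuous, RoIAlign-style

for x in (110.0, 119.0):
    print(x, roipool_coord(x), roialign_coord(x))
# 110.0 -> bin 7, coordinate 6.875
# 119.0 -> bin 7, coordinate 7.4375  (same bin, despite being 9 px away)
```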
To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).²
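The following is a minimal NumPy sketch of this sampling scheme for a single RoI bin with continuous boundaries; the function names and the in-bounds assumption are ours.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at continuous (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align_bin(feat, top, left, bottom, right):
    """Aggregate one RoI bin: bilinear values at four regular sample points
    (the bin's quarter positions), averaged. Boundaries stay continuous and
    nothing is rounded. Assumes the bin lies inside the feature map."""
    bh, bw = bottom - top, right - left
    samples = [bilinear(feat, top + bh * fy, left + bw * fx)
               for fy in (0.25, 0.75) for fx in (0.25, 0.75)]
    return np.mean(samples)
```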
RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] as quantizing the RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.
² We sample four regular locations, so that we can evaluate either max or average pooling. In fact, interpolating only a single value at each bin center (without pooling) is nearly as effective. One could also sample more than four locations per bin, which we found to give diminishing returns.