SPPnet also has notable drawbacks. Like R-CNN, train-
ing is a multi-stage pipeline that involves extracting fea-
tures, fine-tuning a network with log loss, training SVMs,
and finally fitting bounding-box regressors. Features are
also written to disk. But unlike R-CNN, the fine-tuning al-
gorithm proposed in [11] cannot update the convolutional
layers that precede the spatial pyramid pooling. Unsurprisingly,
this restriction (fixed convolutional layers) limits the
accuracy of very deep networks.
1.2. Contributions
We propose a new training algorithm that fixes the disad-
vantages of R-CNN and SPPnet, while improving on their
speed and accuracy. We call this method Fast R-CNN be-
cause it’s comparatively fast to train and test. The Fast R-
CNN method has several advantages:
1. Higher detection quality (mAP) than R-CNN and SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching
Fast R-CNN is written in Python and C++ (Caffe
[13]) and is available under the open-source MIT Li-
cense at https://github.com/rbgirshick/
2. Fast R-CNN architecture and training
Fig. 1 illustrates the Fast R-CNN architecture. A Fast
R-CNN network takes as input an entire image and a set
of object proposals. The network first processes the whole
image with several convolutional (conv) and max pooling
layers to produce a conv feature map. Then, for each object
proposal, a region of interest (RoI) pooling layer extracts
a fixed-length feature vector from the feature map.
Each feature vector is fed into a sequence of fully connected
(fc) layers that finally branch into two sibling output lay-
ers: one that produces softmax probability estimates over
K object classes plus a catch-all “background” class and
another layer that outputs four real-valued numbers for each
of the K object classes. Each set of 4 values encodes refined
bounding-box positions for one of the K classes.
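
To make the shapes of the two sibling outputs concrete, the following is a minimal sketch in plain Python, not the released Caffe implementation. The helper names (`linear`, `softmax`, `sibling_outputs`), the toy sizes K = 3 and D = 8, and the random weights are all assumptions made for illustration; only the output sizes (K + 1 probabilities and K sets of 4 values) come from the description above.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sibling_outputs(feature, cls_w, cls_b, box_w, box_b, num_classes):
    """Map one RoI feature vector to the two sibling outputs.

    Returns K + 1 class probabilities (K object classes plus background)
    and K groups of 4 bounding-box values, one group per object class.
    """
    probs = softmax(linear(feature, cls_w, cls_b))
    flat = linear(feature, box_w, box_b)
    boxes = [flat[4 * k:4 * k + 4] for k in range(num_classes)]
    return probs, boxes


# Toy usage: K and D are illustrative values, not taken from the paper.
K, D = 3, 8
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


feature = [rng.uniform(0.0, 1.0) for _ in range(D)]
probs, boxes = sibling_outputs(
    feature,
    rand_matrix(K + 1, D), [0.0] * (K + 1),   # classification branch
    rand_matrix(4 * K, D), [0.0] * (4 * K),   # box regression branch
    K,
)
```

In this sketch `probs` has K + 1 entries that sum to one, and `boxes[k]` holds the four regression values for object class k.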
2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the
features inside any valid region of interest into a small feature
map with a fixed spatial extent of H × W (e.g., 7 × 7),
where H and W are layer hyper-parameters that are independent
of any particular RoI. In this paper, an RoI is a
rectangular window into a conv feature map. Each RoI is
defined by a four-tuple (r, c, h, w) that specifies its top-left
corner (r, c) and its height and width (h, w).
Figure 1. Fast R-CNN architecture. An input image and multi-
ple regions of interest (RoIs) are input into a fully convolutional
network. Each RoI is pooled into a fixed-size feature map and
then mapped to a feature vector by fully connected layers (FCs).
The network has two output vectors per RoI: softmax probabilities
and per-class bounding-box regression offsets. The architecture is
trained end-to-end with a multi-task loss.
RoI max pooling works by dividing the h × w RoI window
into an H × W grid of sub-windows of approximate
size h/H × w/W and then max-pooling the values in each
sub-window into the corresponding output grid cell. Pooling
is applied independently to each feature map channel,
as in standard max pooling. The RoI layer is simply the
special case of the spatial pyramid pooling layer used in
SPPnets [11] in which there is only one pyramid level. We
use the pooling sub-window calculation given in [11].
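
The pooling step can be sketched for a single channel as follows. This is an illustrative sketch rather than the layer's actual implementation: the function name `roi_max_pool` is hypothetical, and the floor/ceil rule used for sub-window boundaries is an assumption (one common choice that handles RoI sizes not divisible by H or W); the exact calculation is the one given in [11].

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature_map: list of rows (one channel of the conv feature map).
    roi: (r, c, h, w), top-left corner (r, c) and height/width (h, w),
         with h >= 1 and w >= 1 and the window inside the feature map.
    """
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        # Assumed boundary rule: floor for the start, ceil for the end,
        # so every sub-window is non-empty and neighbours may overlap.
        r0 = r + (i * h) // out_h
        r1 = r + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = c + (j * w) // out_w
            c1 = c + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled


def roi_max_pool_channels(channels, roi, out_h, out_w):
    """Apply the same pooling independently to each channel."""
    return [roi_max_pool(ch, roi, out_h, out_w) for ch in channels]


# Toy 6 x 6 feature map whose value at (y, x) is 6 * y + x.
fm = [[6 * y + x for x in range(6)] for y in range(6)]
pooled = roi_max_pool(fm, (1, 1, 4, 4), 2, 2)
print(pooled)  # [[14, 16], [26, 28]]
```

Because the output size depends only on `out_h` and `out_w`, RoIs of different shapes all yield feature maps of the same fixed extent, which is what allows them to feed the same fully connected layers.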
2.2. Initializing from pre-trained networks
We experiment with three pre-trained ImageNet [4] net-
works, each with five max pooling layers and between five
and thirteen conv layers (see Section 4.1 for network de-
tails). When a pre-trained network initializes a Fast R-CNN
network, it undergoes three transformations.
First, the last max pooling layer is replaced by an RoI
pooling layer that is configured by setting H and W to be
compatible with the net’s first fully connected layer (e.g.,
H = W = 7 for VGG16).
Second, the network’s last fully connected layer and soft-
max (which were trained for 1000-way ImageNet classifi-
cation) are replaced with the two sibling layers described
earlier (a fully connected layer and softmax over K + 1 categories
and category-specific bounding-box regressors).
Third, the network is modified to take two data inputs: a
list of images and a list of RoIs in those images.
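
The three transformations can be summarized with a toy layer list. This sketch is purely illustrative: the dictionary-based layer description, the function name `adapt_for_detection`, and the small example network are hypothetical and do not correspond to Caffe's network definition format. Representing the box regressors as a single fully connected layer with 4K outputs is also an assumption.

```python
def adapt_for_detection(layers, num_classes, pool_h, pool_w):
    """Apply the three transformations to a toy classification network.

    layers: list of dicts with a "type" key, ending in an fc layer
            followed by a softmax (the ImageNet classifier).
    """
    layers = [dict(layer) for layer in layers]  # leave the input untouched

    # 1. Replace the last max pooling layer with an RoI pooling layer.
    last_pool = max(i for i, layer in enumerate(layers)
                    if layer["type"] == "max_pool")
    layers[last_pool] = {"type": "roi_pool", "H": pool_h, "W": pool_w}

    # 2. Drop the final fc + softmax and attach the two sibling branches.
    assert [layer["type"] for layer in layers[-2:]] == ["fc", "softmax"]
    trunk = layers[:-2]
    heads = {
        "cls": [{"type": "fc", "out": num_classes + 1}, {"type": "softmax"}],
        "bbox": [{"type": "fc", "out": 4 * num_classes}],
    }

    # 3. The network now takes two data inputs.
    return {"inputs": ["images", "rois"], "trunk": trunk, "heads": heads}


# Hypothetical, heavily shortened classification network.
toy_net = [
    {"type": "conv"}, {"type": "max_pool"},
    {"type": "conv"}, {"type": "max_pool"},
    {"type": "fc", "out": 4096}, {"type": "fc", "out": 4096},
    {"type": "fc", "out": 1000}, {"type": "softmax"},
]
detector = adapt_for_detection(toy_net, num_classes=20, pool_h=7, pool_w=7)
```

With 20 object classes, the classification branch in this sketch has 21 outputs and the box branch has 80, while all earlier layers keep their pre-trained configuration.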
2.3. Fine-tuning for detection
Training all network weights with back-propagation is an
important capability of Fast R-CNN. First, let’s elucidate
why SPPnet is unable to update weights below the spatial
pyramid pooling layer.
The root cause is that back-propagation through the SPP
layer is highly inefficient when each training sample (i.e.,
RoI) comes from a different image, which is exactly how
R-CNN and SPPnet networks are trained. The inefficiency