making predictions. Unlike sliding window and region
proposal-based techniques, YOLO sees the entire image
during training and test time so it implicitly encodes contex-
tual information about classes as well as their appearance.
Fast R-CNN, a top detection method [14], mistakes back-
ground patches in an image for objects because it can’t see
the larger context. YOLO makes less than half the number
of background errors compared to Fast R-CNN.
Third, YOLO learns generalizable representations of ob-
jects. When trained on natural images and tested on art-
work, YOLO outperforms top detection methods like DPM
and R-CNN by a wide margin. Since YOLO is highly gen-
eralizable it is less likely to break down when applied to
new domains or unexpected inputs.
YOLO still lags behind state-of-the-art detection systems
in accuracy. While it can quickly identify objects in images, it struggles to precisely localize some objects, espe-
cially small ones. We examine these tradeoffs further in our
experiments.
All of our training and testing code is open source. A
variety of pretrained models are also available to download.
2. Unified Detection
We unify the separate components of object detection
into a single neural network. Our network uses features
from the entire image to predict each bounding box. It also
predicts all bounding boxes across all classes for an im-
age simultaneously. This means our network reasons glob-
ally about the full image and all the objects in the image.
The YOLO design enables end-to-end training and real-
time speeds while maintaining high average precision.
Our system divides the input image into an S × S grid.
If the center of an object falls into a grid cell, that grid cell
is responsible for detecting that object.
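As a concrete illustration, this responsibility assignment reduces to an index computation. The following is a minimal Python sketch of our own (the helper name and the 448 × 448 example resolution are ours, chosen for illustration):

```python
def responsible_cell(box_center, image_size, S=7):
    """Return (row, col) of the grid cell containing the object's center.
    box_center is (x, y) in pixels; image_size is (width, height)."""
    x, y = box_center
    w, h = image_size
    col = min(int(x / w * S), S - 1)  # clamp centers lying on the far edge
    row = min(int(y / h * S), S - 1)
    return row, col

# An object centered at (224, 112) in a 448 x 448 image lands in cell
# (row=1, col=3), so that cell is responsible for detecting it.
print(responsible_cell((224, 112), (448, 448)))  # (1, 3)
```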
Each grid cell predicts B bounding boxes and confidence
scores for those boxes. These confidence scores reflect how
confident the model is that the box contains an object and
also how accurate it thinks the predicted box is. Formally, we define confidence as $\Pr(\text{Object}) * \mathrm{IOU}^{\text{truth}}_{\text{pred}}$. If no
object exists in that cell, the confidence scores should be
zero. Otherwise we want the confidence score to equal the
intersection over union (IOU) between the predicted box
and the ground truth.
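To make this definition concrete, here is a small Python sketch of the confidence target (our own code, not from the paper; the corner-format box representation is an assumption for illustration):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def confidence_target(pred_box, truth_box, object_in_cell):
    """Pr(Object) * IOU: zero when the cell holds no object, otherwise the
    IOU between the predicted box and the ground truth."""
    return iou(pred_box, truth_box) if object_in_cell else 0.0
```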
Each bounding box consists of 5 predictions: x, y, w, h,
and confidence. The (x, y) coordinates represent the center
of the box relative to the bounds of the grid cell. The width
and height are predicted relative to the whole image. Finally,
the confidence prediction represents the IOU between the
predicted box and any ground truth box.
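A sketch of this parameterization in Python, mirroring the definitions above (our own encoding helper; the exact training-time encoding is not spelled out in this section):

```python
def encode_box(box_center, box_size, image_size, S=7):
    """Encode a ground-truth box the way predictions are parameterized:
    (x, y) is the center's offset within its grid cell, in [0, 1);
    (w, h) is the box size relative to the whole image."""
    cx, cy = box_center
    bw, bh = box_size
    iw, ih = image_size
    col = min(int(cx / iw * S), S - 1)
    row = min(int(cy / ih * S), S - 1)
    x = cx / iw * S - col   # offset inside the responsible cell
    y = cy / ih * S - row
    return row, col, (x, y, bw / iw, bh / ih)
```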
Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. We only predict
one set of class probabilities per grid cell, regardless of the
number of boxes B.
At test time we multiply the conditional class probabili-
ties and the individual box confidence predictions,
$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \mathrm{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) * \mathrm{IOU}^{\text{truth}}_{\text{pred}} \quad (1)$$
which gives us class-specific confidence scores for each
box. These scores encode both the probability of that class
appearing in the box and how well the predicted box fits the
object.
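Eq. 1 is a per-box, per-class multiply, which vectorizes to a single broadcast in numpy. A minimal sketch, assuming score tensors shaped as described below:

```python
import numpy as np

def class_specific_scores(class_probs, box_conf):
    """Apply Eq. 1 with a broadcast multiply.
    class_probs: (S, S, C) conditional class probabilities per cell.
    box_conf:    (S, S, B) Pr(Object) * IOU per predicted box.
    Returns:     (S, S, B, C) class-specific confidence scores."""
    return box_conf[..., :, None] * class_probs[..., None, :]

scores = class_specific_scores(np.random.rand(7, 7, 20), np.random.rand(7, 7, 2))
print(scores.shape)  # (7, 7, 2, 20)
```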
[Figure 2 image: the S × S grid on the input, the predicted bounding boxes + confidence, the class probability map, and the final detections.]
Figure 2: The Model. Our system models detection as a regres-
sion problem. It divides the image into an S × S grid and for each
grid cell predicts B bounding boxes, confidence for those boxes,
and C class probabilities. These predictions are encoded as an
S × S × (B ∗ 5 + C) tensor.
For evaluating YOLO on PASCAL VOC, we use S = 7,
B = 2. PASCAL VOC has 20 labelled classes so C = 20.
Our final prediction is a 7 × 7 × 30 tensor.
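One way to unpack that output tensor in Python; note that the exact channel ordering within each cell is our assumption for illustration, since the text only fixes the total size $S \times S \times (B * 5 + C)$:

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.zeros((S, S, B * 5 + C))   # one 7 x 7 x 30 network output

# Assumed per-cell layout: B boxes of (x, y, w, h, confidence),
# followed by the C class probabilities.
boxes       = pred[..., :B * 5].reshape(S, S, B, 5)
box_coords  = boxes[..., :4]         # (7, 7, 2, 4)
box_conf    = boxes[..., 4]          # (7, 7, 2)
class_probs = pred[..., B * 5:]      # (7, 7, 20)
```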
2.1. Network Design
We implement this model as a convolutional neural net-
work and evaluate it on the PASCAL VOC detection dataset
[9]. The initial convolutional layers of the network extract
features from the image while the fully connected layers
predict the output probabilities and coordinates.
Our network architecture is inspired by the GoogLeNet
model for image classification [33]. Our network has 24
convolutional layers followed by 2 fully connected layers.
Instead of the inception modules used by GoogLeNet, we
simply use 1 × 1 reduction layers followed by 3 × 3 convo-
lutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.
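As a hedged PyTorch sketch of the pattern just described (this is not the original Darknet implementation; the channel widths and the leaky-ReLU slope here are illustrative placeholders), one such reduction block might look like:

```python
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """A 1x1 'reduction' convolution that shrinks the channel count,
    followed by a 3x3 convolution, used in place of inception modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.1),
    )
```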
We also train a fast version of YOLO designed to push
the boundaries of fast object detection. Fast YOLO uses a
neural network with fewer convolutional layers (9 instead
of 24) and fewer filters in those layers. Other than the size
of the network, all training and testing parameters are the
same between YOLO and Fast YOLO.