A COMPREHENSIVE REVIEW OF YOLO: FROM YOLOV1 TO
YOLOV8 AND BEYOND
UNDER REVIEW IN ACM COMPUTING SURVEYS
Juan R. Terven
CICATA-Qro
Instituto Politecnico Nacional
Mexico
jrtervens@ipn.mx
Diana M. Cordova-Esparaza
Facultad de Informática
Universidad Autónoma de Querétaro
Mexico
diana.cordova@uaq.mx
April 4, 2023
ABSTRACT
YOLO has become a central real-time object detection system for robotics, driverless cars, and video
monitoring applications. We present a comprehensive analysis of YOLO’s evolution, examining the
innovations and contributions in each iteration from the original YOLO to YOLOv8. We start by
describing the standard metrics and postprocessing; then, we discuss the major changes in network
architecture and training tricks for each model. Finally, we summarize the essential lessons from
YOLO’s development and provide a perspective on its future, highlighting potential research directions
to enhance real-time object detection systems.
Keywords YOLO · Object detection · Deep Learning · Computer Vision
1 Introduction
Real-time object detection has emerged as a critical component in numerous applications, spanning various fields
such as autonomous vehicles, robotics, video surveillance, and augmented reality. Among the various object detection
algorithms, the YOLO (You Only Look Once) framework has stood out for its remarkable balance of speed and accuracy,
enabling the rapid and reliable identification of objects in images. Since its inception, the YOLO family has evolved
through multiple iterations, each building upon the previous versions to address limitations and enhance performance
(see Figure 1). This paper aims to provide a comprehensive review of the YOLO framework’s development, from the
original YOLOv1 to the latest YOLOv8, elucidating the key innovations, differences, and improvements across each
version.
The paper begins by exploring the foundational concepts and architecture of the original YOLO model, which set the
stage for the subsequent advances in the YOLO family. Following this, we delve into the refinements and enhancements
introduced in each version, ranging from YOLOv2 to YOLOv8. These improvements encompass various aspects such
as network design, loss function modifications, anchor box adaptations, and input resolution scaling. By examining
these developments, we aim to offer a holistic understanding of the YOLO framework’s evolution and its implications
for object detection.
In addition to discussing the specific advancements of each YOLO version, the paper highlights the trade-offs between
speed and accuracy that have emerged throughout the framework’s development. This underscores the importance of
considering the context and requirements of specific applications when selecting the most appropriate YOLO model.
Finally, we envision the future directions of the YOLO framework, touching upon potential avenues for further research
and development that will shape the ongoing progress of real-time object detection systems.
Figure 1: A timeline of YOLO versions. [The figure charts the releases from 2015 to 2023: YOLOv1 (2015); YOLOv2/YOLO9000 (2016); YOLOv3 (2018); YOLOv4, Scaled-YOLOv4, PP-YOLO, and YOLOv5 (2020); PP-YOLOv2, YOLOR, and YOLOX (2021); YOLOv6, YOLOv7, DAMO-YOLO, and PP-YOLOE (2022); and YOLOv8 (2023).]
2 YOLO Applications Across Diverse Fields
YOLO’s real-time object detection capabilities have been invaluable in autonomous vehicle systems, enabling quick identification and tracking of various objects such as vehicles, pedestrians [1, 2], bicycles, and other obstacles [3, 4, 5, 6]. These capabilities have been applied in numerous fields, including action recognition [7] in video sequences for surveillance [8], sports analysis [9], and human-computer interaction [10].
YOLO models have been used in agriculture to detect and classify crops [11, 12], pests, and diseases [13], assisting in precision agriculture techniques and automating farming processes. They have also been adapted for face detection tasks in biometrics, security, and facial recognition systems [14, 15].
In the medical field, YOLO has been employed for cancer detection [16, 17], skin segmentation [18], and pill identification [19], leading to improved diagnostic accuracy and more efficient treatment processes. In remote sensing, it has been used for object detection and classification in satellite and aerial imagery, aiding in land use mapping, urban planning, and environmental monitoring [20, 21, 22, 23].
Security systems have integrated YOLO models for real-time monitoring and analysis of video feeds, allowing rapid detection of suspicious activities [24], social distancing, and face mask detection [25]. The models have also been applied in surface inspection to detect defects and anomalies, enhancing quality control in manufacturing and production processes [26, 27, 28].
In traffic applications, YOLO models have been utilized for tasks such as license plate detection [29] and traffic sign recognition [30], contributing to the development of intelligent transportation systems and traffic management solutions. They have been employed in wildlife detection and monitoring to identify endangered species for biodiversity conservation and ecosystem management [31]. Lastly, YOLO has been widely used in robotic applications [32, 33] and object detection from drones [34, 35].
3 Object Detection Metrics and Non-Maximum Suppression (NMS)
The Average Precision (AP), traditionally called Mean Average Precision (mAP), is the commonly used metric for evaluating the performance of object detection models. It measures the average precision across all categories, providing a single value to compare different models. The COCO dataset makes no distinction between AP and mAP. In the rest of this paper, we will refer to this metric as AP.
In YOLOv1 and YOLOv2, the datasets utilized for training and benchmarking were PASCAL VOC 2007 and VOC 2012 [36]. However, from YOLOv3 onwards, the dataset used is Microsoft COCO (Common Objects in Context) [37]. The AP is calculated differently for these datasets. The following sections will discuss the rationale behind AP and explain how it is computed.
3.1 How AP Works
The AP metric is based on precision-recall metrics, handling multiple object categories, and defining a positive
prediction using Intersection over Union (IoU).
Precision and Recall: Precision measures the accuracy of the model’s positive predictions, while recall measures the proportion of actual positive cases that the model correctly identifies. There is often a trade-off between precision and recall; for example, increasing the number of detected objects (higher recall) can result in more false positives (lower precision). To account for this trade-off, the AP metric incorporates the precision-recall curve, which plots precision against recall for different confidence thresholds. The metric provides a balanced assessment of precision and recall by considering the area under the precision-recall curve.
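To make the trade-off concrete, here is a minimal Python sketch that computes both quantities from true-positive, false-positive, and false-negative counts; the counts in the usage lines are hypothetical, not taken from any benchmark.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Relaxing the confidence threshold admits more detections:
# recall rises, but precision typically drops.
print(precision_recall(tp=80, fp=10, fn=40))   # strict threshold
print(precision_recall(tp=100, fp=60, fn=20))  # loose threshold
```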
Handling multiple object categories: Object detection models must identify and localize multiple object categories in an image. The AP metric addresses this by calculating each category’s average precision (AP) separately and then taking the mean of these APs across all categories (which is why it is also called mean average precision). This approach ensures that the model’s performance is evaluated for each category individually, providing a more comprehensive assessment of the model’s overall performance.
Intersection over Union: Object detection aims to accurately localize objects in images by predicting bounding boxes. The AP metric incorporates the Intersection over Union (IoU) measure to assess the quality of the predicted bounding boxes. IoU is the ratio of the intersection area to the union area of the predicted bounding box and the ground-truth bounding box (see Figure 2). It measures the overlap between the ground truth and predicted bounding boxes. The COCO benchmark considers multiple IoU thresholds to evaluate the model’s performance at different levels of localization accuracy.
Figure 2: Intersection over Union (IoU). a) The IoU is calculated by dividing the intersection of the two boxes by the
union of the boxes; b) examples of three different IoU values for different box locations.
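As an illustration, a straightforward Python implementation of IoU for axis-aligned boxes in (x1, y1, x2, y2) format could look as follows; the coordinates in the usage line are made up.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```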
3.2 Computing AP
The AP is computed differently in the VOC and in the COCO datasets. In this section, we describe how it is computed
on each dataset.
VOC Dataset
This dataset includes 20 object categories. To compute the AP in VOC, we follow these steps:
1. For each category, calculate the precision-recall curve by varying the confidence threshold of the model’s predictions.
2. Calculate each category’s average precision (AP) using an interpolated 11-point sampling of the precision-recall curve, as sketched in the code after this list.
3. Compute the final average precision (AP) by taking the mean of the APs across all 20 categories.
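The 11-point interpolation in step 2 admits a compact NumPy sketch: for each recall level 0, 0.1, ..., 1, take the maximum precision achieved at or above that recall and average the eleven values. The function name and array layout are our own choices.

```python
import numpy as np

def voc_ap_11point(recall: np.ndarray, precision: np.ndarray) -> float:
    """11-point interpolated AP (VOC 2007 style). `recall` and
    `precision` trace the PR curve, sorted by increasing recall."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # Interpolated precision: max precision at recall >= r.
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```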
Microsoft COCO Dataset
This dataset includes 80 object categories and uses a more complex method for calculating AP. Instead of using an 11-point interpolation, it uses a 101-point interpolation, i.e., it computes the precision for 101 recall thresholds from 0 to 1 in increments of 0.01. Also, the AP is obtained by averaging over multiple IoU values instead of just one, except for a common AP metric called AP50, which is the AP for a single IoU threshold of 0.5. The steps for computing AP in COCO are the following:
1. For each category, calculate the precision-recall curve by varying the confidence threshold of the model’s predictions.
2. Compute each category’s average precision (AP) using 101 recall thresholds (see the sketch after this list).
3. Calculate AP at different Intersection over Union (IoU) thresholds, typically from 0.5 to 0.95 with a step size of 0.05. A higher IoU threshold requires a more accurate prediction to be considered a true positive.
4. For each IoU threshold, take the mean of the APs across all 80 categories.
5. Finally, compute the overall AP by averaging the AP values calculated at each IoU threshold.
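A possible sketch of the per-category computation in step 2 follows; it mirrors the 11-point version above but samples 101 recall levels. The final COCO AP (steps 4 and 5) is then simply the mean over a hypothetical table of per-IoU-threshold, per-category APs.

```python
import numpy as np

def ap_101point(recall: np.ndarray, precision: np.ndarray) -> float:
    """COCO-style AP for one category at one IoU threshold:
    precision sampled at the 101 recall levels 0.00, 0.01, ..., 1.00."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        mask = recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / 101.0
    return ap

# ap_table: hypothetical array of shape (10, 80) holding one AP per
# IoU threshold (0.50:0.05:0.95) and per category. Averaging over
# categories and then over thresholds equals a plain mean of the table:
# coco_ap = ap_table.mean()
```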
The differences in AP calculation make it hard to directly compare the performance of object detection models across
the two datasets. The current standard uses the COCO AP due to its more fine-grained evaluation of how well a model
performs at different IoU thresholds.
3.3 Non-Maximum Suppression (NMS)
Non-Maximum Suppression (NMS) is a post-processing technique used in object detection algorithms to reduce the
number of overlapping bounding boxes and improve the overall detection quality. Object detection algorithms typically
generate multiple bounding boxes around the same object with different confidence scores. NMS filters out redundant
and irrelevant bounding boxes, keeping only the most accurate ones. Algorithm 1 describes the procedure. Figure 3
shows the typical output of an object detection model containing multiple overlapping bounding boxes and the output
after NMS.
Algorithm 1 Non-Maximum Suppression Algorithm
Require: Set of predicted bounding boxes B, confidence scores S, IoU threshold τ, confidence threshold T
Ensure: Set of filtered bounding boxes F
1: F ← ∅
2: Filter the boxes: B ← {b ∈ B | S(b) ≥ T }
3: Sort the boxes B by their confidence scores in descending order
4: while B ≠ ∅ do
5: Select the box b with the highest confidence score
6: Add b to the set of final boxes F : F ← F ∪ {b}
7: Remove b from the set of boxes B: B ← B − {b}
8: for all remaining boxes r in B do
9: Calculate the IoU between b and r: iou ← IoU(b, r)
10: if iou ≥ τ then
11: Remove r from the set of boxes B: B ← B − {r}
12: end if
13: end for
14: end while
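A compact NumPy rendering of Algorithm 1, reusing the iou helper sketched in Section 3.1; the default thresholds are illustrative, not values prescribed by any particular YOLO release.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray,
        iou_thresh: float = 0.5, score_thresh: float = 0.25) -> list:
    """Greedy NMS following Algorithm 1. `boxes` is (N, 4) in
    (x1, y1, x2, y2) format; returns indices of the kept boxes."""
    # Step 2: discard low-confidence boxes.
    idxs = np.where(scores >= score_thresh)[0]
    # Step 3: sort the remaining boxes by score, descending.
    idxs = idxs[np.argsort(-scores[idxs])]
    kept = []
    while idxs.size > 0:
        b = idxs[0]              # highest-scoring remaining box
        kept.append(int(b))
        rest = idxs[1:]
        # Steps 8-12: drop boxes overlapping b by IoU >= threshold.
        ious = np.array([iou(boxes[b], boxes[r]) for r in rest])
        idxs = rest[ious < iou_thresh]
    return kept
```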
Figure 3: Non-Maximum Suppression (NMS). a) Shows the typical output of an object detection model containing multiple overlapping boxes. b) Shows the output after NMS.

We are now ready to describe the different YOLO models.

4 YOLO: You Only Look Once

YOLO, by Joseph Redmon et al., was published in CVPR 2016 [38]. It presented for the first time a real-time, end-to-end approach for object detection. The name YOLO stands for "You Only Look Once," referring to the fact that it accomplishes the detection task with a single pass of the network, as opposed to previous approaches that either used sliding windows followed by a classifier that had to run hundreds or thousands of times per image, or more advanced methods that divided the task into two steps, where the first step detects possible object regions (region proposals) and the second step runs a classifier on the proposals. YOLO also used a more straightforward output based only on regression to predict the detection outputs, as opposed to Fast R-CNN [39], which used two separate outputs: a classification for the probabilities and a regression for the box coordinates.
4.1 How YOLOv1 Works
YOLOv1 unified the object detection steps by detecting all the bounding boxes simultaneously. To accomplish this, YOLO divides the input image into an S × S grid and predicts B bounding boxes of the same class, along with its confidence for C different classes per grid element. Each bounding box prediction consists of five values: Pc, bx, by, bh, bw, where Pc is the confidence score for the box, reflecting how confident the model is that the box contains an object and how accurate the box is. The bx and by coordinates are the center of the box relative to the grid cell, and bh and bw are the height and width of the box relative to the full image. The output of YOLO is a tensor of S × S × (B × 5 + C), optionally followed by non-maximum suppression (NMS) to remove duplicate detections.
In the original YOLO paper, the authors used the PASCAL VOC dataset [36], which contains 20 classes (C = 20), a 7 × 7 grid (S = 7), and at most 2 bounding boxes per grid element (B = 2), giving a 7 × 7 × 30 output prediction.
Figure 4 shows a simplified output vector considering a three-by-three grid, three classes, and a single bounding box per grid element, giving eight values per cell. In this simplified case, the output of YOLO would be 3 × 3 × 8.
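Both output sizes follow directly from the S × S × (B × 5 + C) formula; the trivial sketch below just evaluates it for the two configurations mentioned above.

```python
def yolo_output_size(S: int, B: int, C: int) -> int:
    """Number of values in YOLOv1's output tensor: S x S x (B*5 + C)."""
    return S * S * (B * 5 + C)

print(yolo_output_size(S=7, B=2, C=20))  # 7 x 7 x 30 = 1470 (PASCAL VOC)
print(yolo_output_size(S=3, B=1, C=3))   # 3 x 3 x 8 = 72 (Figure 4 example)
```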
YOLOv1 achieved an average precision (AP) of 63.4 on the PASCAL VOC2007 dataset.
4.2 YOLOv1 Architecture
The YOLOv1 architecture comprises 24 convolutional layers followed by two fully connected layers that predict the bounding box coordinates and probabilities. All layers used leaky rectified linear unit activations [40], except for the last one, which used a linear activation function. Inspired by GoogLeNet [41] and Network in Network [42], YOLO uses 1 × 1 convolutional layers to reduce the number of feature maps and keep the number of parameters relatively low. Table 1 describes the YOLOv1 architecture. The authors also introduced a lighter model called Fast YOLO, composed of nine convolutional layers.
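The fragment below is a hypothetical PyTorch sketch, not the paper's exact network: it illustrates the patterns just described, a 1 × 1 convolution shrinking the feature maps before a 3 × 3 convolution, leaky ReLU on every hidden layer, and a linear final layer emitting the 7 × 7 × 30 prediction tensor. The channel counts are representative assumptions.

```python
import torch.nn as nn

# Hypothetical block in the spirit of YOLOv1 (not the full 24-layer
# network): a 1x1 convolution reduces the number of feature maps
# before the 3x3 convolution; hidden layers use leaky ReLU.
conv_block = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),             # 1x1 reduction
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),
    nn.LeakyReLU(0.1),
)

# Detection head: two fully connected layers; the last one is linear
# and emits the S x S x (B*5 + C) = 7 x 7 x 30 prediction tensor.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 1024, 4096),
    nn.LeakyReLU(0.1),
    nn.Linear(4096, 7 * 7 * 30),
)
```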
4.3 YOLOv1 Training
The authors pre-trained the first 20 layers of YOLO at a resolution of 224 × 224 using the ImageNet dataset [43]. Then, they added the last four layers with randomly initialized weights and fine-tuned the model with the PASCAL VOC 2007 and VOC 2012 datasets [36] at a resolution of 448 × 448 to increase the details for more accurate object detection.
For augmentations, the authors used random scaling and translations of at most 20% of the input image size, as well as
random exposure and saturation with an upper-end factor of 1.5 in the HSV color space.
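A rough OpenCV sketch of these augmentations, assuming an image-only pipeline (the corresponding bounding boxes would need the same geometric transform) and scaling about the image origin for simplicity; the original implementation's exact sampling scheme may differ.

```python
import random
import cv2
import numpy as np

def augment(image: np.ndarray) -> np.ndarray:
    h, w = image.shape[:2]
    # Random scaling and translation of up to 20% of the image size.
    scale = random.uniform(0.8, 1.2)
    tx = random.uniform(-0.2, 0.2) * w
    ty = random.uniform(-0.2, 0.2) * h
    m = np.float32([[scale, 0, tx], [0, scale, ty]])
    image = cv2.warpAffine(image, m, (w, h))
    # Random saturation (S) and exposure (V) in HSV, factor up to 1.5.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= random.uniform(1 / 1.5, 1.5)  # saturation
    hsv[..., 2] *= random.uniform(1 / 1.5, 1.5)  # exposure
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```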