A COMPREHENSIVE REVIEW OF YOLO: FROM YOLOV1 TO
YOLOV8 AND BEYOND
UNDER REVIEW IN ACM COMPUTING SURVEYS
Juan R. Terven
CICATA-Qro
Instituto Politecnico Nacional
Mexico
jrtervens@ipn.mx
Diana M. Cordova-Esparaza
Facultad de Informática
Universidad Autónoma de Querétaro
Mexico
diana.cordova@uaq.mx
April 4, 2023
ABSTRACT
YOLO has become a central real-time object detection system for robotics, driverless cars, and video
monitoring applications. We present a comprehensive analysis of YOLO’s evolution, examining the
innovations and contributions in each iteration from the original YOLO to YOLOv8. We start by
describing the standard metrics and postprocessing; then, we discuss the major changes in network
architecture and training tricks for each model. Finally, we summarize the essential lessons from
YOLO’s development and provide a perspective on its future, highlighting potential research directions
to enhance real-time object detection systems.
Keywords YOLO · Object detection · Deep Learning · Computer Vision
1 Introduction
Real-time object detection has emerged as a critical component in numerous applications, spanning various fields
such as autonomous vehicles, robotics, video surveillance, and augmented reality. Among the various object detection
algorithms, the YOLO (You Only Look Once) framework has stood out for its remarkable balance of speed and accuracy,
enabling the rapid and reliable identification of objects in images. Since its inception, the YOLO family has evolved
through multiple iterations, each building upon the previous versions to address limitations and enhance performance
(see Figure 1). This paper aims to provide a comprehensive review of the YOLO framework’s development, from the
original YOLOv1 to the latest YOLOv8, elucidating the key innovations, differences, and improvements across each
version.
The paper begins by exploring the foundational concepts and architecture of the original YOLO model, which set the
stage for the subsequent advances in the YOLO family. Following this, we delve into the refinements and enhancements
introduced in each version, ranging from YOLOv2 to YOLOv8. These improvements encompass various aspects such
as network design, loss function modifications, anchor box adaptations, and input resolution scaling. By examining
these developments, we aim to offer a holistic understanding of the YOLO framework’s evolution and its implications
for object detection.
In addition to discussing the specific advancements of each YOLO version, the paper highlights the trade-offs between
speed and accuracy that have emerged throughout the framework’s development. This underscores the importance of
considering the context and requirements of specific applications when selecting the most appropriate YOLO model.
Finally, we envision the future directions of the YOLO framework, touching upon potential avenues for further research
and development that will shape the ongoing progress of real-time object detection systems.
Figure 1: A timeline of YOLO versions. [The figure charts the releases from 2015 to 2023: YOLOv1 (2015); YOLOv2/YOLO9000 (2016); YOLOv3 (2018); YOLOv4, Scaled-YOLOv4, PP-YOLO, and YOLOv5 (2020); PP-YOLOv2, YOLOR, and YOLOX (2021); YOLOv6, YOLOv7, DAMO-YOLO, and PP-YOLOE (2022); and YOLOv8 (2023).]
2 YOLO Applications Across Diverse Fields
YOLO’s real-time object detection capabilities have been invaluable in autonomous vehicle systems, enabling quick identification and tracking of various objects such as vehicles, pedestrians [1, 2], bicycles, and other obstacles [3, 4, 5, 6]. These capabilities have been applied in numerous fields, including action recognition [7] in video sequences for surveillance [8], sports analysis [9], and human-computer interaction [10].
YOLO models have been used in agriculture to detect and classify crops [11, 12], pests, and diseases [13], assisting in precision agriculture techniques and automating farming processes. They have also been adapted for face detection tasks in biometrics, security, and facial recognition systems [14, 15].
In the medical field, YOLO has been employed for cancer detection [16, 17], skin segmentation [18], and pill identification [19], leading to improved diagnostic accuracy and more efficient treatment processes. In remote sensing, it has been used for object detection and classification in satellite and aerial imagery, aiding in land use mapping, urban planning, and environmental monitoring [20, 21, 22, 23].
Security systems have integrated YOLO models for real-time monitoring and analysis of video feeds, allowing rapid detection of suspicious activities [24], social distancing, and face mask detection [25]. The models have also been applied in surface inspection to detect defects and anomalies, enhancing quality control in manufacturing and production processes [26, 27, 28].
In traffic applications, YOLO models have been utilized for tasks such as license plate detection [29] and traffic sign recognition [30], contributing to the development of intelligent transportation systems and traffic management solutions. They have been employed in wildlife detection and monitoring to identify endangered species for biodiversity conservation and ecosystem management [31]. Lastly, YOLO has been widely used in robotic applications [32, 33] and object detection from drones [34, 35].
3 Object Detection Metrics and Non-Maximum Suppression (NMS)
The Average Precision (AP), traditionally called Mean Average Precision (mAP), is the commonly used metric for evaluating the performance of object detection models. It measures the average precision across all categories, providing a single value to compare different models. The COCO dataset makes no distinction between AP and mAP. In the rest of this paper, we will refer to this metric as AP.
In YOLOv1 and YOLOv2, the datasets utilized for training and benchmarking were PASCAL VOC 2007 and VOC 2012 [36]. However, from YOLOv3 onwards, the dataset used is Microsoft COCO (Common Objects in Context) [37]. The AP is calculated differently for these datasets. The following sections will discuss the rationale behind AP and explain how it is computed.
3.1 How AP Works
The AP metric is based on precision-recall metrics, handling multiple object categories, and defining a positive
prediction using Intersection over Union (IoU).
Precision and Recall: Precision measures the accuracy of the model’s positive predictions, while recall measures the proportion of actual positive cases that the model correctly identifies. There is often a trade-off between precision and recall; for example, increasing the number of detected objects (higher recall) can result in more false positives (lower precision). To account for this trade-off, the AP metric incorporates the precision-recall curve, which plots precision against recall for different confidence thresholds. The metric provides a balanced assessment of precision and recall by considering the area under the precision-recall curve.
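To make the trade-off concrete, here is a minimal Python sketch that computes both quantities from true-positive, false-positive, and false-negative counts; the counts in the usage lines are hypothetical, not taken from any benchmark.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Relaxing the confidence threshold admits more detections:
# recall rises, but precision typically drops.
print(precision_recall(tp=80, fp=10, fn=40))   # strict threshold
print(precision_recall(tp=100, fp=60, fn=20))  # loose threshold
```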
Handling multiple object categories: Object detection models must identify and localize multiple object categories in an image. The AP metric addresses this by calculating each category’s average precision (AP) separately and then taking the mean of these APs across all categories (which is why it is also called mean average precision). This approach ensures that the model’s performance is evaluated for each category individually, providing a more comprehensive assessment of the model’s overall performance.
Intersection over Union: Object detection aims to accurately localize objects in images by predicting bounding boxes. The AP metric incorporates the Intersection over Union (IoU) measure to assess the quality of the predicted bounding boxes. IoU is the ratio of the intersection area to the union area of the predicted bounding box and the ground-truth bounding box (see Figure 2). It measures the overlap between the ground truth and predicted bounding boxes. The COCO benchmark considers multiple IoU thresholds to evaluate the model’s performance at different levels of localization accuracy.
Figure 2: Intersection over Union (IoU). a) The IoU is calculated by dividing the intersection of the two boxes by the
union of the boxes; b) examples of three different IoU values for different box locations.
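As an illustration, a straightforward Python implementation of IoU for axis-aligned boxes in (x1, y1, x2, y2) format could look as follows; the coordinates in the usage line are made up.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```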
3.2 Computing AP
The AP is computed differently in the VOC and in the COCO datasets. In this section, we describe how it is computed
on each dataset.
VOC Dataset
This dataset includes 20 object categories. To compute the AP in VOC, we follow these steps:
1. For each category, calculate the precision-recall curve by varying the confidence threshold of the model’s predictions.
2. Calculate each category’s average precision (AP) using an interpolated 11-point sampling of the precision-recall curve, as sketched in the code after this list.
3. Compute the final average precision (AP) by taking the mean of the APs across all 20 categories.
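The 11-point interpolation in step 2 admits a compact NumPy sketch: for each recall level 0, 0.1, ..., 1, take the maximum precision achieved at or above that recall and average the eleven values. The function name and array layout are our own choices.

```python
import numpy as np

def voc_ap_11point(recall: np.ndarray, precision: np.ndarray) -> float:
    """11-point interpolated AP (VOC 2007 style). `recall` and
    `precision` trace the PR curve, sorted by increasing recall."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # Interpolated precision: max precision at recall >= r.
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```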
Microsoft COCO Dataset
This dataset includes 80 object categories and uses a more complex method for calculating AP. Instead of using an 11-point interpolation, it uses a 101-point interpolation, i.e., it computes the precision for 101 recall thresholds from 0 to 1 in increments of 0.01. Also, the AP is obtained by averaging over multiple IoU values instead of just one, except for a common AP metric called AP50, which is the AP for a single IoU threshold of 0.5. The steps for computing AP in COCO are the following:
1. For each category, calculate the precision-recall curve by varying the confidence threshold of the model’s predictions.
2. Compute each category’s average precision (AP) using 101 recall thresholds (see the sketch after this list).
3. Calculate AP at different Intersection over Union (IoU) thresholds, typically from 0.5 to 0.95 with a step size of 0.05. A higher IoU threshold requires a more accurate prediction to be considered a true positive.
4. For each IoU threshold, take the mean of the APs across all 80 categories.
5. Finally, compute the overall AP by averaging the AP values calculated at each IoU threshold.
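A possible sketch of the per-category computation in step 2 follows; it mirrors the 11-point version above but samples 101 recall levels. The final COCO AP (steps 4 and 5) is then simply the mean over a hypothetical table of per-IoU-threshold, per-category APs.

```python
import numpy as np

def ap_101point(recall: np.ndarray, precision: np.ndarray) -> float:
    """COCO-style AP for one category at one IoU threshold:
    precision sampled at the 101 recall levels 0.00, 0.01, ..., 1.00."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        mask = recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / 101.0
    return ap

# ap_table: hypothetical array of shape (10, 80) holding one AP per
# IoU threshold (0.50:0.05:0.95) and per category. Averaging over
# categories and then over thresholds equals a plain mean of the table:
# coco_ap = ap_table.mean()
```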
The differences in AP calculation make it hard to directly compare the performance of object detection models across
the two datasets. The current standard uses the COCO AP due to its more fine-grained evaluation of how well a model
performs at different IoU thresholds.
3.3 Non-Maximum Suppression (NMS)
Non-Maximum Suppression (NMS) is a post-processing technique used in object detection algorithms to reduce the
number of overlapping bounding boxes and improve the overall detection quality. Object detection algorithms typically
generate multiple bounding boxes around the same object with different confidence scores. NMS filters out redundant
and irrelevant bounding boxes, keeping only the most accurate ones. Algorithm 1 describes the procedure. Figure 3
shows the typical output of an object detection model containing multiple overlapping bounding boxes and the output
after NMS.
Algorithm 1 Non-Maximum Suppression Algorithm
Require: Set of predicted bounding boxes B, confidence scores S, IoU threshold τ, confidence threshold T
Ensure: Set of filtered bounding boxes F
1: F ← ∅
2: Filter the boxes: B ← {b ∈ B | S(b) ≥ T }
3: Sort the boxes B by their confidence scores in descending order
4: while B ≠ ∅ do
5: Select the box b with the highest confidence score
6: Add b to the set of final boxes F : F ← F ∪ {b}
7: Remove b from the set of boxes B: B ← B − {b}
8: for all remaining boxes r in B do
9: Calculate the IoU between b and r: iou ← IoU(b, r)
10: if iou ≥ τ then
11: Remove r from the set of boxes B: B ← B − {r}
12: end if
13: end for
14: end while
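A compact NumPy rendering of Algorithm 1, reusing the iou helper sketched in Section 3.1; the default thresholds are illustrative, not values prescribed by any particular YOLO release.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray,
        iou_thresh: float = 0.5, score_thresh: float = 0.25) -> list:
    """Greedy NMS following Algorithm 1. `boxes` is (N, 4) in
    (x1, y1, x2, y2) format; returns indices of the kept boxes."""
    # Step 2: discard low-confidence boxes.
    idxs = np.where(scores >= score_thresh)[0]
    # Step 3: sort the remaining boxes by score, descending.
    idxs = idxs[np.argsort(-scores[idxs])]
    kept = []
    while idxs.size > 0:
        b = idxs[0]              # highest-scoring remaining box
        kept.append(int(b))
        rest = idxs[1:]
        # Steps 8-12: drop boxes overlapping b by IoU >= threshold.
        ious = np.array([iou(boxes[b], boxes[r]) for r in rest])
        idxs = rest[ious < iou_thresh]
    return kept
```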
Figure 3: Non-Maximum Suppression (NMS). a) Shows the typical output of an object detection model containing multiple overlapping boxes. b) Shows the output after NMS.

We are now ready to describe the different YOLO models.

4 YOLO: You Only Look Once

YOLO, by Joseph Redmon et al., was published in CVPR 2016 [38]. It presented for the first time a real-time, end-to-end approach for object detection. The name YOLO stands for "You Only Look Once," referring to the fact that it accomplishes the detection task with a single pass of the network, as opposed to previous approaches that either used sliding windows followed by a classifier that had to run hundreds or thousands of times per image, or more advanced methods that divided the task into two steps, where the first step detects possible object regions (region proposals) and the second step runs a classifier on the proposals. YOLO also used a more straightforward output based only on regression to predict the detection outputs, as opposed to Fast R-CNN [39], which used two separate outputs: a classification for the probabilities and a regression for the box coordinates.
4.1 How YOLOv1 Works
YOLOv1 unified the object detection steps by detecting all the bounding boxes simultaneously. To accomplish this, YOLO divides the input image into an S × S grid and predicts B bounding boxes of the same class, along with its confidence for C different classes per grid element. Each bounding box prediction consists of five values: Pc, bx, by, bh, bw, where Pc is the confidence score for the box, reflecting how confident the model is that the box contains an object and how accurate the box is. The bx and by coordinates are the center of the box relative to the grid cell, and bh and bw are the height and width of the box relative to the full image. The output of YOLO is a tensor of S × S × (B × 5 + C), optionally followed by non-maximum suppression (NMS) to remove duplicate detections.
In the original YOLO paper, the authors used the PASCAL VOC dataset [36], which contains 20 classes (C = 20), a 7 × 7 grid (S = 7), and at most 2 bounding boxes per grid element (B = 2), giving a 7 × 7 × 30 output prediction.
Figure 4 shows a simplified output vector considering a three-by-three grid, three classes, and a single bounding box per grid element, giving eight values per cell. In this simplified case, the output of YOLO would be 3 × 3 × 8.
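Both output sizes follow directly from the S × S × (B × 5 + C) formula; the trivial sketch below just evaluates it for the two configurations mentioned above.

```python
def yolo_output_size(S: int, B: int, C: int) -> int:
    """Number of values in YOLOv1's output tensor: S x S x (B*5 + C)."""
    return S * S * (B * 5 + C)

print(yolo_output_size(S=7, B=2, C=20))  # 7 x 7 x 30 = 1470 (PASCAL VOC)
print(yolo_output_size(S=3, B=1, C=3))   # 3 x 3 x 8 = 72 (Figure 4 example)
```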
YOLOv1 achieved an average precision (AP) of 63.4 on the PASCAL VOC2007 dataset.
4.2 YOLOv1 Architecture
The YOLOv1 architecture comprises 24 convolutional layers followed by two fully connected layers that predict the bounding box coordinates and probabilities. All layers used leaky rectified linear unit activations [40], except for the last one, which used a linear activation function. Inspired by GoogLeNet [41] and Network in Network [42], YOLO uses 1 × 1 convolutional layers to reduce the number of feature maps and keep the number of parameters relatively low. Table 1 describes the YOLOv1 architecture. The authors also introduced a lighter model called Fast YOLO, composed of nine convolutional layers.
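The fragment below is a hypothetical PyTorch sketch, not the paper's exact network: it illustrates the patterns just described, a 1 × 1 convolution shrinking the feature maps before a 3 × 3 convolution, leaky ReLU on every hidden layer, and a linear final layer emitting the 7 × 7 × 30 prediction tensor. The channel counts are representative assumptions.

```python
import torch.nn as nn

# Hypothetical block in the spirit of YOLOv1 (not the full 24-layer
# network): a 1x1 convolution reduces the number of feature maps
# before the 3x3 convolution; hidden layers use leaky ReLU.
conv_block = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),             # 1x1 reduction
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),
    nn.LeakyReLU(0.1),
)

# Detection head: two fully connected layers; the last one is linear
# and emits the S x S x (B*5 + C) = 7 x 7 x 30 prediction tensor.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 1024, 4096),
    nn.LeakyReLU(0.1),
    nn.Linear(4096, 7 * 7 * 30),
)
```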
4.3 YOLOv1 Training
The authors pre-trained the first 20 layers of YOLO at a resolution of 224 × 224 using the ImageNet dataset [43]. Then, they added the last four layers with randomly initialized weights and fine-tuned the model with the PASCAL VOC 2007 and VOC 2012 datasets [36] at a resolution of 448 × 448 to increase the details for more accurate object detection.
For augmentations, the authors used random scaling and translations of at most 20% of the input image size, as well as
random exposure and saturation with an upper-end factor of 1.5 in the HSV color space.
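A rough OpenCV sketch of these augmentations, assuming an image-only pipeline (the corresponding bounding boxes would need the same geometric transform) and scaling about the image origin for simplicity; the original implementation's exact sampling scheme may differ.

```python
import random
import cv2
import numpy as np

def augment(image: np.ndarray) -> np.ndarray:
    h, w = image.shape[:2]
    # Random scaling and translation of up to 20% of the image size.
    scale = random.uniform(0.8, 1.2)
    tx = random.uniform(-0.2, 0.2) * w
    ty = random.uniform(-0.2, 0.2) * h
    m = np.float32([[scale, 0, tx], [0, scale, ty]])
    image = cv2.warpAffine(image, m, (w, h))
    # Random saturation (S) and exposure (V) in HSV, factor up to 1.5.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= random.uniform(1 / 1.5, 1.5)  # saturation
    hsv[..., 2] *= random.uniform(1 / 1.5, 1.5)  # exposure
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```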