making predictions. Unlike sliding window and region
proposal-based techniques, YOLO sees the entire image
during training and test time so it implicitly encodes contex-
tual information about classes as well as their appearance.
Fast R-CNN, a top detection method [14], mistakes back-
ground patches in an image for objects because it can’t see
the larger context. YOLO makes less than half the number
of background errors compared to Fast R-CNN.
Third, YOLO learns generalizable representations of ob-
jects. When trained on natural images and tested on art-
work, YOLO outperforms top detection methods like DPM
and R-CNN by a wide margin. Since YOLO is highly gen-
eralizable it is less likely to break down when applied to
new domains or unexpected inputs.
YOLO still lags behind state-of-the-art detection systems
in accuracy. While it can quickly identify objects in images, it struggles to precisely localize some objects, espe-
cially small ones. We examine these tradeoffs further in our
experiments.
All of our training and testing code is open source. A
variety of pretrained models are also available to download.
2. Unified Detection
We unify the separate components of object detection
into a single neural network. Our network uses features
from the entire image to predict each bounding box. It also
predicts all bounding boxes across all classes for an im-
age simultaneously. This means our network reasons glob-
ally about the full image and all the objects in the image.
The YOLO design enables end-to-end training and real-
time speeds while maintaining high average precision.
Our system divides the input image into an S × S grid.
If the center of an object falls into a grid cell, that grid cell
is responsible for detecting that object.
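As a concrete illustration, this responsibility assignment reduces to an index computation. The following is a minimal Python sketch of our own (the helper name and the 448 × 448 example resolution are ours, chosen for illustration):

```python
def responsible_cell(box_center, image_size, S=7):
    """Return (row, col) of the grid cell containing the object's center.
    box_center is (x, y) in pixels; image_size is (width, height)."""
    x, y = box_center
    w, h = image_size
    col = min(int(x / w * S), S - 1)  # clamp centers lying on the far edge
    row = min(int(y / h * S), S - 1)
    return row, col

# An object centered at (224, 112) in a 448 x 448 image lands in cell
# (row=1, col=3), so that cell is responsible for detecting it.
print(responsible_cell((224, 112), (448, 448)))  # (1, 3)
```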
Each grid cell predicts B bounding boxes and confidence
scores for those boxes. These confidence scores reflect how
confident the model is that the box contains an object and
also how accurate it thinks the predicted box is. Formally, we define confidence as $\Pr(\text{Object}) * \mathrm{IOU}^{\text{truth}}_{\text{pred}}$. If no
object exists in that cell, the confidence scores should be
zero. Otherwise we want the confidence score to equal the
intersection over union (IOU) between the predicted box
and the ground truth.
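To make this definition concrete, here is a small Python sketch of the confidence target (our own code, not from the paper; the corner-format box representation is an assumption for illustration):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def confidence_target(pred_box, truth_box, object_in_cell):
    """Pr(Object) * IOU: zero when the cell holds no object, otherwise the
    IOU between the predicted box and the ground truth."""
    return iou(pred_box, truth_box) if object_in_cell else 0.0
```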
Each bounding box consists of 5 predictions: x, y, w, h,
and confidence. The (x, y) coordinates represent the center
of the box relative to the bounds of the grid cell. The width
and height are predicted relative to the whole image. Finally,
the confidence prediction represents the IOU between the
predicted box and any ground truth box.
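A sketch of this parameterization in Python, mirroring the definitions above (our own encoding helper; the exact training-time encoding is not spelled out in this section):

```python
def encode_box(box_center, box_size, image_size, S=7):
    """Encode a ground-truth box the way predictions are parameterized:
    (x, y) is the center's offset within its grid cell, in [0, 1);
    (w, h) is the box size relative to the whole image."""
    cx, cy = box_center
    bw, bh = box_size
    iw, ih = image_size
    col = min(int(cx / iw * S), S - 1)
    row = min(int(cy / ih * S), S - 1)
    x = cx / iw * S - col   # offset inside the responsible cell
    y = cy / ih * S - row
    return row, col, (x, y, bw / iw, bh / ih)
```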
Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. We only predict
one set of class probabilities per grid cell, regardless of the
number of boxes B.
At test time we multiply the conditional class probabili-
ties and the individual box confidence predictions,
$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \mathrm{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) * \mathrm{IOU}^{\text{truth}}_{\text{pred}} \quad (1)$$
which gives us class-specific confidence scores for each
box. These scores encode both the probability of that class
appearing in the box and how well the predicted box fits the
object.
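Eq. 1 is a per-box, per-class multiply, which vectorizes to a single broadcast in numpy. A minimal sketch, assuming score tensors shaped as described below:

```python
import numpy as np

def class_specific_scores(class_probs, box_conf):
    """Apply Eq. 1 with a broadcast multiply.
    class_probs: (S, S, C) conditional class probabilities per cell.
    box_conf:    (S, S, B) Pr(Object) * IOU per predicted box.
    Returns:     (S, S, B, C) class-specific confidence scores."""
    return box_conf[..., :, None] * class_probs[..., None, :]

scores = class_specific_scores(np.random.rand(7, 7, 20), np.random.rand(7, 7, 2))
print(scores.shape)  # (7, 7, 2, 20)
```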
[Figure 2 image: the S × S grid on the input, the predicted bounding boxes + confidence, the class probability map, and the final detections.]
Figure 2: The Model. Our system models detection as a regres-
sion problem. It divides the image into an S × S grid and for each
grid cell predicts B bounding boxes, confidence for those boxes,
and C class probabilities. These predictions are encoded as an
S × S × (B ∗ 5 + C) tensor.
For evaluating YOLO on PASCAL VOC, we use S = 7,
B = 2. PASCAL VOC has 20 labelled classes so C = 20.
Our final prediction is a 7 × 7 × 30 tensor.
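One way to unpack that output tensor in Python; note that the exact channel ordering within each cell is our assumption for illustration, since the text only fixes the total size $S \times S \times (B * 5 + C)$:

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.zeros((S, S, B * 5 + C))   # one 7 x 7 x 30 network output

# Assumed per-cell layout: B boxes of (x, y, w, h, confidence),
# followed by the C class probabilities.
boxes       = pred[..., :B * 5].reshape(S, S, B, 5)
box_coords  = boxes[..., :4]         # (7, 7, 2, 4)
box_conf    = boxes[..., 4]          # (7, 7, 2)
class_probs = pred[..., B * 5:]      # (7, 7, 20)
```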
2.1. Network Design
We implement this model as a convolutional neural net-
work and evaluate it on the PASCAL VOC detection dataset
[9]. The initial convolutional layers of the network extract
features from the image while the fully connected layers
predict the output probabilities and coordinates.
Our network architecture is inspired by the GoogLeNet
model for image classification [33]. Our network has 24
convolutional layers followed by 2 fully connected layers.
Instead of the inception modules used by GoogLeNet, we
simply use 1 × 1 reduction layers followed by 3 × 3 convo-
lutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.
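As a hedged PyTorch sketch of the pattern just described (this is not the original Darknet implementation; the channel widths and the leaky-ReLU slope here are illustrative placeholders), one such reduction block might look like:

```python
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """A 1x1 'reduction' convolution that shrinks the channel count,
    followed by a 3x3 convolution, used in place of inception modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.1),
    )
```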
We also train a fast version of YOLO designed to push
the boundaries of fast object detection. Fast YOLO uses a
neural network with fewer convolutional layers (9 instead
of 24) and fewer filters in those layers. Other than the size
of the network, all training and testing parameters are the
same between YOLO and Fast YOLO.