POLY-YOLO: HIGHER SPEED, MORE PRECISE DETECTION AND INSTANCE SEGMENTATION FOR YOLOV3
A PREPRINT
Petr Hurtik*, Vojtech Molek*, Jan Hula*, Marek Vajgl*, Pavel Vlasanek*, and Tomas Nejezchleba†
June 1, 2020
ABSTRACT
We present a new version of YOLO with better performance and extended with instance segmentation
called Poly-YOLO. Poly-YOLO builds on the original ideas of YOLOv3 and removes two of its
weaknesses: a large number of rewritten labels and an inefficient distribution of anchors. Poly-YOLO
reduces the issues by aggregating features from a light SE-Darknet-53 backbone with a hypercolumn
technique, using stairstep upsampling, and produces a single scale output with high resolution. In
comparison with YOLOv3, Poly-YOLO has only 60% of its trainable parameters but improves
mAP by a relative 40%. We also present Poly-YOLO lite with fewer parameters and a lower
output resolution. It has the same precision as YOLOv3, but it is three times smaller and twice
as fast, thus suitable for embedded devices. Finally, Poly-YOLO performs instance segmentation
using bounding polygons. The network is trained to detect size-independent polygons defined on
a polar grid. The vertices of each polygon are predicted together with their confidence, and therefore
Poly-YOLO produces polygons with a varying number of vertices. Source code is available at
https://gitlab.com/irafm-ai/poly-yolo.
Keywords Object detection · Instance segmentation · YOLOv3 · Bounding box · Bounding polygon · Realtime detection
Figure 1: The figure shows the instance segmentation performance of the proposed Poly-YOLO algorithm applied to the Cityscapes dataset, running at 22 FPS on a mid-tier graphics card. The image was cropped for visibility.
* University of Ostrava, Centre of Excellence IT4Innovations, Institute for Research and Applications of Fuzzy Modeling, 30. dubna 22, Ostrava, Czech Republic
† Varroc Lighting Systems, Suvorovova 195, Šenov u Nového Jičína, Czech Republic.
arXiv:2005.13243v2 [cs.CV] 29 May 2020
A PREPRINT - JUNE 1, 2020
1 Problem statement
Object detection is a process where all important areas containing objects of interest are bounded while the background
is ignored. Usually, the object is bounded by a box that is expressed in terms of spatial coordinates of its top-left corner
and its width and height. The disadvantage of this approach is that for the objects of complex shapes, the bounding
box also includes background, which can occupy a significant part of the area as the bounding box does not wrap the
object tightly. Such behavior can decrease the performance of a classifier applied over the bounding box [1] or may not fulfill the requirements of precise detection [2]. To avoid this problem, classical detectors such as Faster R-CNN [3] or RetinaNet [4] were modified into Mask R-CNN [5] or RetinaMask [6], respectively. These methods also infer instance segmentation, i.e., each pixel in the bounding box is classified into object/background classes. The limitation of these methods is their computation speed: they are unable to reach real-time performance on non-high-tier hardware. The problem we focus on is to create a precise detector with instance segmentation and the ability of real-time processing on mid-tier graphics cards.
In this study, we start with YOLOv3 [7], which excels in processing speed, and therefore it is a good candidate for real-time applications running on computers [8] or mobile devices [9]. On the other hand, the precision of YOLOv3 lags behind detectors such as RetinaNet [4], EfficientDet [10], or CornerNet [11]. We analyze YOLO's performance and identify two of its drawbacks. The first drawback is the low precision of the detection of big boxes [7], caused by inappropriate handling of anchors in the output layers. The second one is the rewriting of labels by one another due to the coarse output resolution. To solve these issues, we design a new approach, dubbed Poly-YOLO, that significantly pushes forward the original abilities of YOLOv3. To tackle the problem of instance segmentation, we propose a way to detect a tight polygon-based contour. The contributions and benefits of our approach are as follows:
• We propose Poly-YOLO, which increases the detection accuracy of the previous version, YOLOv3. Poly-YOLO has a brand-new feature decoder with a single output tensor that goes to a head with a higher resolution, which solves two principal issues of YOLO: label rewriting and the incorrect distribution of anchors.
• We produce a single output tensor by a hypercolumn composition of multi-resolution feature maps produced by a feature extractor. To unify the resolutions of the feature maps, we utilize stairstep upscaling, which allows us to obtain a slightly lower loss in comparison with direct upscaling while the computation speed is preserved.
• We design an extension that realizes instance segmentation using a bounding-polygon representation. The maximal number of polygon vertices can be adjusted according to the required precision.
• The bounding polygon is detected within a polar grid with relative coordinates, which allows the network to learn general, size-independent shapes. The network produces a dynamic number of vertices per bounding polygon.
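The stairstep composition mentioned above can be illustrated with a toy numpy sketch. This is our simplification for illustration only: it assumes nearest-neighbor upsampling and feature maps already projected to a common channel count, whereas the actual network uses learned convolutions.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of an (H, W, C) feature map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def stairstep_hypercolumn(feature_maps):
    """Combine feature maps ordered from coarsest to finest resolution.

    Instead of upsampling every map directly to the output resolution
    and summing, the running sum is upsampled by 2x at each step and
    the next (finer) map is added -- the 'stairstep' composition.
    """
    acc = feature_maps[0]
    for fm in feature_maps[1:]:
        acc = upsample2x(acc) + fm
    return acc

# Toy maps with a matching channel count: 4x4, 8x8, 16x16.
maps = [np.ones((4, 4, 8)), np.ones((8, 8, 8)), np.ones((16, 16, 8))]
out = stairstep_hypercolumn(maps)
print(out.shape)  # (16, 16, 8)
```

Both compositions produce the same output resolution; the stairstep variant merely changes the order in which the coarse maps are enlarged and accumulated.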
Figure 2: Examples of Poly-YOLO inference on the Cityscapes testing dataset.
Figure 3: Examples of Poly-YOLO inference on the India driving testing dataset.
2 Current state and related work
2.1 Object detection
Models for object detection can be divided into two groups: two-stage and one-stage detectors. Two-stage detectors split the process as follows: in the first phase, regions of interest (RoI) are proposed, and in the subsequent stage, bounding box regression and classification are done inside these proposed regions. One-stage detectors predict the bounding boxes and their classes at once. Two-stage detectors are usually more precise in terms of localization and classification accuracy, but they are slower than one-stage detectors in terms of processing. Both of these types contain a backbone network for feature extraction and head networks for classification and regression. Typically, the backbone is some SOTA network such as ResNet [5] or ResNext [12], pre-trained on ImageNet or OpenImages, even though some approaches [13, 14] also experiment with training from scratch.
2.1.1 Two-stage detectors
The prototypical example of a two-stage architecture is Faster R-CNN [3], an improvement of its predecessor Fast R-CNN [15]. The main improvement lies in the use of the Region Proposal Network (RPN), which replaced a much slower selective search of RoIs. It also introduced the usage of multi-scale anchors to detect objects of different sizes. Faster R-CNN is, in a way, a meta-algorithm that can have many different incarnations depending on the type of the backbone and its heads. One of the frequently used backbones, called Feature Pyramid Network (FPN) [16], allows predicting RoIs from multiple feature maps, each with a different resolution. This is beneficial for the recognition of objects at different scales.
2.1.2 One-stage detectors
The two best-known examples of one-stage detectors are YOLO [7] and SSD [17]. The architecture of YOLO will be thoroughly described in Section 3. Usually, one-stage detectors divide the image into a grid and predict bounding boxes and their classes inside them, all at once. Most of them also use the concept of anchors, which are predefined typical dimensions of bounding boxes that serve as a priori knowledge. One of the major improvements in the area of one-stage detectors was a novel loss function called Focal Loss [4]. Because two-stage detectors produce a sparse set of region proposals in the first step, most of the negative locations are filtered out before the second stage. One-stage detectors, on the other hand, produce a dense set of region proposals which they need to classify as containing objects or not. This creates a problem with the disproportionate frequency of negative examples. Focal Loss solves this problem by adjusting the importance of negative and positive examples within the loss function. Another interesting idea was proposed in an architecture called RefineDet [18], which performs a two-step regression of the bounding boxes. The second step refines the bounding boxes proposed in the first step, which produces more accurate detections, especially for small objects. Recently, there has been a surge of interest in approaches that do not use anchor boxes. The main representative of this trend is the FCOS framework [19], which works by predicting four coordinates of a bounding box for every foreground pixel. These four coordinates represent the distances to the four boundary edges of the bounding box in which the pixel is enclosed. The predicted bounding boxes of every pixel are subsequently filtered by NMS. A similar anchor-free approach was proposed in CornerNet [11], where the objects are detected as a pair of the top-left and bottom-right corners of a bounding box.
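The FCOS-style per-pixel regression just described can be sketched in a few lines; this is a minimal illustration of the decoding step only, not the full framework:

```python
def decode_fcos_box(px, py, l, t, r, b):
    """Decode FCOS-style per-pixel regression targets.

    A foreground pixel at (px, py) predicts its distances (l, t, r, b)
    to the left, top, right, and bottom edges of the enclosing box,
    from which the box corners are recovered directly.
    """
    return (px - l, py - t, px + r, py + b)

# A pixel at (100, 80) that is 30 px from the left edge, 20 px from
# the top, 50 px from the right, and 40 px from the bottom.
box = decode_fcos_box(100, 80, 30, 20, 50, 40)
print(box)  # (70, 60, 150, 120)
```

Every foreground pixel yields one such box, and overlapping predictions are then suppressed by NMS as noted above.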
2.2 Instance Segmentation
In many applications, a boundary given by a rectangle may be too crude, and we may instead require a boundary framing the object tightly. In the literature, this task is called instance segmentation, and the main approaches also fit into the one-stage/two-stage taxonomy. The prototypical example of a two-stage method is an architecture called Mask R-CNN [5], which extended Faster R-CNN by adding a separate fully-convolutional head that predicts masks of objects. Note that the same principle is also applied to RetinaNet, and the improved network is called RetinaMask [6]. One of Mask R-CNN's innovations is a novel way of extracting features from RoIs using the RoIAlign layer, which avoids the problem of misalignment of the RoI due to its quantization to the grid of the feature map. One-stage methods for instance segmentation can be further divided into top-down methods, bottom-up methods, and direct methods. Top-down methods [20, 21] work by first detecting an object and then segmenting this object within the bounding box. Prediction of bounding boxes either uses anchors or is anchor-free, following the FCOS framework [19]. Bottom-up methods [22, 23], on the other hand, work by first embedding each pixel into a metric space in which these pixels are subsequently clustered. As the name suggests, direct methods work by directly predicting the segmentation mask without bounding boxes or pixel embeddings [24]. We also mention that, independently of our instance segmentation, PolarMask [25] introduces instance segmentation using polygons, which are also predicted in polar coordinates. In comparison with PolarMask, Poly-YOLO learns general, size-independent shapes due to the use of the relative size of a bounding polygon with respect to the particular bounding box. The second difference is that Poly-YOLO produces a dynamic number of vertices per polygon, according to the shape complexity of various objects.
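The idea of a relative polar representation can be illustrated with a small sketch. Note that the exact encoding used by Poly-YOLO is defined later in the paper; the normalization by the box half-diagonal below is our own illustrative choice, not the authors' definition:

```python
import math

def to_relative_polar(vertex, box):
    """Convert a polygon vertex to polar coordinates relative to its
    bounding box (hypothetical encoding: the angle is measured from
    the box center and the distance is normalized by the box
    half-diagonal, making the representation size-independent)."""
    x1, y1, x2, y2 = box
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    dx, dy = vertex[0] - cx, vertex[1] - cy
    half_diag = 0.5 * math.hypot(x2 - x1, y2 - y1)
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return angle, math.hypot(dx, dy) / half_diag

# A corner vertex of its own box sits at relative distance 1.0,
# regardless of the absolute box size.
a, d = to_relative_polar((10, 10), (0, 0, 10, 10))
print(round(d, 3))  # 1.0
```

Because the distance is expressed relative to the box, the same polygon shape yields the same targets for a small and a large instance, which is what makes the learned shapes size-independent.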
3 Fast and precise object detection with Poly-YOLO
Here, we first recall the fundamental ideas of YOLOv3, describe the issues that prevent it from reaching higher performance, and propose our solution that removes them.
3.1 YOLO history
The first version of YOLO (You Only Look Once) was introduced in 2016 [26]. The motivation behind YOLO is to create a fast object detector with an emphasis on speed. The detector is made of two essential parts: a convolutional neural network (CNN) and a specially designed loss function. The CNN backbone is inspired by GoogleNet [27] and has 24 convolutional layers followed by 2 fully connected layers. The network output is reshaped into a two-dimensional grid with the shape of $G_h \times G_w$, where $G_h$ is the number of cells along the vertical side and $G_w$ along the horizontal side. Each grid cell occupies a part of the image, as depicted in Fig. 4. Every object in the image has its center in one of the cells, and
Figure 4: The left image illustrates the YOLO grid over the input image, and yellow dots represent centers of detected
objects. The right image illustrates detections.
that particular cell is responsible for detecting and classifying said object. More precisely, the responsible cell outputs $N_B$ bounding boxes. Each box is given as a tuple $(x, y, w, h)$ and a confidence measure. Here, $(x, y)$ is the center of the predicted box relative to the cell boundary, and $(w, h)$ are the width and height of the bounding box relative to the image size. The confidence measures how confident the cell is that it contains an object. Finally, each cell outputs $N_c$ conditional class probabilities, i.e., probabilities that the detected object belongs to certain class(es). In other words, the cell confidence tells us that there is an object in the predicted box, and the conditional class probabilities tell us that the box contains, e.g., vehicle – car. The final output of the model is a tensor with dimensions $G_h \times G_w \times (5N_B + N_c)$, where the constant five comes from $(x, y, w, h)$ and a confidence.
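The output dimensions follow directly from the description above; as a quick check, instantiated with the original YOLOv1 configuration (a 7×7 grid, 2 boxes per cell, and the 20 PASCAL VOC classes):

```python
def yolo_v1_output_shape(grid_h, grid_w, num_boxes, num_classes):
    """Output tensor dimensions G_h x G_w x (5*N_B + N_c).

    Each of the N_B boxes carries (x, y, w, h) plus a confidence,
    hence the factor of five; the N_c class probabilities are
    shared by the whole cell.
    """
    return (grid_h, grid_w, 5 * num_boxes + num_classes)

# Original YOLO setting: 7x7 grid, 2 boxes per cell, 20 classes.
print(yolo_v1_output_shape(7, 7, 2, 20))  # (7, 7, 30)
```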
YOLOv2 [28] brought a couple of improvements. Firstly, the architecture of the convolutional neural network was updated to Darknet-19, a fully convolutional network with 19 convolutional layers containing batch normalization and five max-pooling layers. The cells no longer predict plain $(x, y, w, h)$ directly, but instead scale and translate anchor boxes. The parameters $(a_w, a_h)$, i.e., the width and height of an anchor box, are extracted for all anchor boxes from a training dataset with the usage of the $k$-means algorithm, where the clustering criterion is IoU. Lastly, YOLOv2 uses skip connections to concatenate features from different parts of the CNN to create the final tensor of feature maps, including features across different scales and levels of abstraction.
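The anchor extraction step can be sketched as a k-means variant whose distance is 1 − IoU between (w, h) pairs anchored at a common origin. This is a simplified illustration, not the authors' implementation; the deterministic initialization is our own choice:

```python
import numpy as np

def iou_wh(wh, centroids):
    # IoU between one (w, h) box and each centroid; both are anchored
    # at the origin, so only widths and heights matter.
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k, iters=50):
    # k-means over (w, h) pairs with distance d = 1 - IoU: each box is
    # assigned to the centroid it overlaps most, then centroids are
    # recomputed as the mean of their members.
    idx = np.linspace(0, len(boxes_wh) - 1, k).astype(int)
    centroids = boxes_wh[idx].astype(float)
    for _ in range(iters):
        assign = np.array([np.argmax(iou_wh(b, centroids)) for b in boxes_wh])
        for j in range(k):
            members = boxes_wh[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

# Toy dataset with two obvious size clusters.
boxes = np.array([[10., 12.], [11., 13.], [50., 60.], [55., 58.]])
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

Using IoU rather than Euclidean distance as the criterion prevents large boxes from dominating the clustering simply because their absolute coordinate errors are larger.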
The most recent version of YOLO [7] mainly introduces three output scales and a deeper architecture, Darknet-53. Each output scale/feature map has its own set of anchors – three per output scale. Compared with v2, YOLOv3 reaches higher accuracy, but due to the heavier backbone, its inference speed is decreased.
3.2 YOLOv3 issues blocking better performance
YOLOv3, as it is designed, suffers from two issues that we discovered and that are not described in the original papers: label rewriting and an imbalanced distribution of anchors across the output scales. Solving these issues is crucial for improving YOLO's performance.
Figure 5: The image illustrates the label rewriting problem for the detection of cars. A label is rewritten by another if the centers of two boxes (with the same anchor box) belong to the same cell. In this illustrative example, blue denotes the grid, red the rewritten labels, and green the preserved labels. Here, 10 labels out of 27 are rewritten, and the detector is not trained to detect them.
Table 1: Amount of rewritten labels [%] for various datasets

Dataset        Resolution    YOLOv3    Poly-YOLO    Poly-YOLO lite
Simulator      416×416       16.36     0.22          2.31
Simulator      608×800       12.55     0.00          0.61
Cityscapes     416×416        9.51     2.79          9.50
Cityscapes     608×832        3.92     0.97          2.75
Cityscapes     640×1280       2.56     0.59          1.44
India Driving  416×416       23.07     5.80         13.78
India Driving  448×800       13.54     1.92          4.96
India Driving  704×1280       9.16     1.12          2.44
3.2.1 Label rewriting problem
Here, we discuss the situation when a bounding box given by its label in the ground-truth dataset can be rewritten by another box, and therefore the network is not trained to detect it. For the sake of simplicity of explanation, we avoid the usage of the anchor notation in the text below. Let us suppose an input image with a resolution of $r \times r$ pixels. Furthermore, let $s_k$ be the scale ratio of the $k$-th output to the input, where YOLOv3 uses the following ratios: $s_1 = 1/8$, $s_2 = 1/16$, $s_3 = 1/32$. These scales are given by the YOLOv3 architecture, namely by its strided convolutions. Finally, let $B = \{b_1, \ldots, b_n\}$ be the set of boxes present in an image. Each box $b_i$ is represented as a tuple $(b_i^{x_1}, b_i^{y_1}, b_i^{x_2}, b_i^{y_2})$ that defines its top-left and bottom-right corners. For simplicity, we also derive the centers $C = \{c_1, \ldots, c_n\}$, where $c_i = (c_i^x, c_i^y)$ is defined by $c_i^x = 0.5(b_i^{x_1} + b_i^{x_2})$ and analogously for $c_i^y$. With this notation, a label is rewritten if the following holds:

$$\exists (c_i, c_j \in C) : \xi(c_i^x, c_j^x, s_k) + \xi(c_i^y, c_j^y, s_k) = 2, \qquad (1)$$

where

$$\xi(x, y, z) = \begin{cases} 1, & \lfloor xz \rfloor = \lfloor yz \rfloor \\ 0, & \text{else,} \end{cases} \qquad (2)$$

and $\lfloor \cdot \rfloor$ denotes the floor of the term. The purpose of the function $\xi$ is to check whether both boxes are assigned to the same cell of the grid at scale $s_k$. In simple words, if two boxes at the same scale are assigned to the same cell, then one of them will be rewritten. When anchors are introduced, both boxes must in addition belong to the same anchor. As a consequence, the network is trained to ignore some objects, which leads to a low number of positive detections. According to Equations (1) and (2), the scale $s_k$ plays a crucial role because it directly affects the number and the resolution of the cells. Considering the standard YOLO resolution of $r = 416$, for $s_3$ (the coarsest scale) we obtain a grid of $13 \times 13$ cells with a size of $32 \times 32$ pixels each. Note that the absolute size of the boxes does not affect the label rewriting problem; the important indicator is the box center. A practical illustration of such a setting and its consequence for the labels is shown in Figure 5. The ratio of rewritten labels in the datasets used in the benchmark is shown in Table 1.
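The rewriting condition in Equations (1) and (2) can be checked directly; a minimal sketch (box centers in pixels, anchors ignored as in the text above):

```python
import math

def xi(x, y, z):
    # Eq. (2): 1 if the two coordinates fall into the same grid cell
    # at scale z, else 0.
    return 1 if math.floor(x * z) == math.floor(y * z) else 0

def is_rewritten(ci, cj, s):
    # Eq. (1): two box centers collide (one label rewrites the other)
    # when they share a cell in both axes at scale s.
    return xi(ci[0], cj[0], s) + xi(ci[1], cj[1], s) == 2

# Two car centers 20 px apart, coarsest YOLOv3 scale s3 = 1/32:
# both fall into the same 32x32 cell, so one label is lost.
print(is_rewritten((100, 100), (120, 110), 1 / 32))  # True
# At the finest scale s1 = 1/8, the 8x8 cells separate them.
print(is_rewritten((100, 100), (120, 110), 1 / 8))   # False
```

This also makes concrete why a finer output resolution reduces label rewriting: shrinking the cell size makes it less likely that two centers share a cell.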