YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
Chien-Yao Wang^{1,2}, I-Hau Yeh^{2}, and Hong-Yuan Mark Liao^{1,2,3}
^1 Institute of Information Science, Academia Sinica, Taiwan
^2 National Taipei University of Technology, Taiwan
^3 Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan
kinyiu@iis.sinica.edu.tw, ihyeh@emc.com.tw, and liao@iis.sinica.edu.tw
Abstract
Today’s deep learning methods focus on designing the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate the acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning is designed. GELAN’s architecture confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on object detection based on the MS COCO dataset. The results show that GELAN uses only conventional convolution operators yet achieves better parameter utilization than state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large. It can be used to obtain complete information, so that models trained from scratch can achieve better results than state-of-the-art models pre-trained on large datasets; the comparison results are shown in Figure 1. The source code is at: https://github.com/WongKinYiu/yolov9.
1. Introduction
Figure 1. Comparisons of real-time object detectors on the MS COCO dataset. The GELAN- and PGI-based object detection method surpasses all previous train-from-scratch methods in object detection performance. In terms of accuracy, the new method outperforms RT DETR [43] pre-trained on a large dataset, and it also outperforms the depth-wise convolution-based design YOLO MS [7] in terms of parameter utilization.

Deep learning-based models have demonstrated far better performance than past artificial intelligence systems in various fields, such as computer vision, language processing, and speech recognition. In recent years, researchers in the field of deep learning have mainly focused on how to develop more powerful system architectures and learning methods, such as CNNs [21–23, 42, 55, 71, 72], Transformers [8, 9, 40, 41, 60, 69, 70], Perceivers [26, 32, 52, 56, 81], and Mambas [17, 38, 80]. In addition, some researchers have tried to develop more general objective functions, such as loss functions [5, 45, 46, 50, 77, 78], label assignment [10, 12, 33, 67, 79], and auxiliary supervision [18, 20, 24, 28, 29, 51, 54, 68, 76]. The above studies all try to precisely find the mapping between input and target tasks. However, most past approaches have ignored that input data may suffer a non-negligible amount of information loss during the feedforward process. This loss of information can lead to biased gradient flows, which are subsequently used to update the model. These problems can cause deep networks to establish incorrect associations between targets and inputs, causing the trained model to produce incorrect predictions.
Figure 2. Visualization results of random initial weight output feature maps for different network architectures: (a) input image, (b) PlainNet, (c) ResNet, (d) CSPNet, and (e) proposed GELAN. From the figure, we can see that in different architectures, the information provided to the objective function to calculate the loss is lost to varying degrees, and our architecture can retain the most complete information and provide the most reliable gradient information for calculating the objective function.
In deep networks, the phenomenon of input data losing information during the feedforward process is commonly known as the information bottleneck [59], and its schematic diagram is shown in Figure 2. At present, the main methods that can alleviate this phenomenon are as follows: (1) the use of reversible architectures [3, 16, 19]: this method mainly uses repeated input data and maintains the information of the input data in an explicit way; (2) the use of masked modeling [1, 6, 9, 27, 71, 73]: it mainly uses a reconstruction loss and adopts an implicit way to maximize the extracted features and retain the input information; and (3) the introduction of the deep supervision concept [28, 51, 54, 68]: it uses shallow features that have not lost too much important information to pre-establish a mapping from features to targets, ensuring that important information can be transferred to deeper layers. However, the above methods have different drawbacks in the training and inference processes. For example, a reversible architecture requires additional layers to combine repeatedly fed input data, which significantly increases the inference cost. In addition, since the path from the input layer to the output layer cannot be too deep, this limitation makes it difficult to model high-order semantic information during the training process. As for masked modeling, its reconstruction loss sometimes conflicts with the target loss. In addition, most mask mechanisms also produce incorrect associations with data. For the deep supervision mechanism, it produces error accumulation, and if the shallow supervision loses information during the training process, the subsequent layers will not be able to retrieve the required information. The above phenomena are more significant on difficult tasks and small models.
To address the above-mentioned issues, we propose a new concept: programmable gradient information (PGI). The concept is to generate reliable gradients through an auxiliary reversible branch, so that deep features can still maintain the key characteristics needed to execute the target task. The design of the auxiliary reversible branch avoids the semantic loss that may be caused by a traditional deep supervision process that integrates multi-path features. In other words, we are programming gradient information propagation at different semantic levels, and thereby achieving the best training results. The reversible architecture of PGI is built on the auxiliary branch, so there is no additional cost. Since PGI can freely select a loss function suitable for the target task, it also overcomes the problems encountered by masked modeling. The proposed PGI mechanism can be applied to deep neural networks of various sizes and is more general than the deep supervision mechanism, which is only suitable for very deep neural networks.
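To make the training-only auxiliary branch idea concrete, the following is a minimal PyTorch sketch. It is a sketch of the general pattern, not YOLOv9's actual modules: the names Backbone, MainHead, and AuxHead, and the 0.25 loss weight, are hypothetical placeholders.

```python
# Minimal sketch of a training-only auxiliary branch.
# All module names and the loss weight are illustrative assumptions,
# not YOLOv9's actual components or hyperparameters.
import torch.nn as nn

class PGIStyleModel(nn.Module):
    def __init__(self, backbone: nn.Module, main_head: nn.Module, aux_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.main_head = main_head
        self.aux_head = aux_head  # supplies extra gradients during training only

    def forward(self, x):
        feats = self.backbone(x)
        main_out = self.main_head(feats)
        if self.training:
            # The auxiliary branch sees the same features and produces its own
            # prediction, giving shallow layers a second, more direct gradient path.
            return main_out, self.aux_head(feats)
        return main_out  # auxiliary branch is dropped at inference: no extra cost

# Training step (sketch): combine the main and auxiliary losses.
# main_out, aux_out = model(images)
# loss = criterion(main_out, targets) + 0.25 * criterion(aux_out, targets)
# loss.backward()
```

Because the auxiliary branch only exists on the training graph, the deployed network pays no latency or parameter penalty for it.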
In this paper, we also design generalized ELAN (GELAN) based on ELAN [65]. The design of GELAN simultaneously takes into account the number of parameters, computational complexity, accuracy, and inference speed. This design allows users to arbitrarily choose appropriate computational blocks for different inference devices. We combined the proposed PGI and GELAN to design a new generation of the YOLO series object detection system, which we call YOLOv9. We used the MS COCO dataset to conduct experiments, and the experimental results verified that our proposed YOLOv9 achieves top performance in all comparisons.
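As an illustration of the "pluggable computational block" idea, here is a schematic sketch of a generalized-ELAN-style block in PyTorch. The topology, channel widths, and block count are illustrative assumptions, not the exact GELAN architecture.

```python
# Illustrative sketch of a generalized-ELAN-style block: a CSP-like channel
# split whose inner computational block is pluggable. Not the exact GELAN topology.
import torch
import torch.nn as nn

class GELANStyleBlock(nn.Module):
    def __init__(self, channels: int, block_factory, n_blocks: int = 2):
        super().__init__()
        half = channels // 2
        # Any computational block (plain conv block, CSP block, ...) can be plugged in.
        self.blocks = nn.ModuleList([block_factory(half) for _ in range(n_blocks)])
        # 1x1 conv fuses the untouched split plus every intermediate output.
        self.fuse = nn.Conv2d(half * (n_blocks + 2), channels, kernel_size=1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)          # split channels in two
        outs = [a, b]
        for blk in self.blocks:
            b = blk(b)                    # chain blocks along one branch
            outs.append(b)                # keep every intermediate feature
        return self.fuse(torch.cat(outs, dim=1))

# Example block factory: a plain 3x3 conv stage, illustrating that only
# conventional convolution is needed.
def conv_block(c):
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU())

y = GELANStyleBlock(64, conv_block)(torch.randn(1, 64, 32, 32))  # -> [1, 64, 32, 32]
```

The aggregation of every intermediate feature is what gives each layer multiple gradient paths back to the input, in the spirit of gradient path planning.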
We summarize the contributions of this paper as follows:
1. We theoretically analyzed existing deep neural network architectures from the perspective of reversible functions, and through this process successfully explained many phenomena that were difficult to explain in the past. We also designed PGI and the auxiliary reversible branch based on this analysis and achieved excellent results.
2. The PGI we designed solves the problem that deep supervision can only be used for extremely deep neural network architectures, and therefore allows new lightweight architectures to be truly applied in daily life.
3. The GELAN we designed uses only conventional convolution to achieve higher parameter usage than state-of-the-art designs based on depth-wise convolution, while showing great advantages of being light, fast, and accurate.
4. Combining the proposed PGI and GELAN, the object detection performance of YOLOv9 on the MS COCO dataset greatly surpasses existing real-time object detectors in all aspects.
2. Related work
2.1. Real-time Object Detectors
The current mainstream real-time object detectors are the YOLO series [2, 7, 13–15, 25, 30, 31, 47–49, 61–63, 74, 75], and most of these models use CSPNet [64] or ELAN [65] and their variants as the main computing units. In terms of feature integration, an improved PAN [37] or FPN [35] is often used, and an improved YOLOv3 head [49] or FCOS head [57, 58] is then used as the prediction head. Recently, some real-time object detectors built on DETR [4], such as RT DETR [43], have also been proposed. However, since it is extremely difficult for DETR-series object detectors to be applied to new domains without a corresponding domain pre-trained model, the most widely used real-time object detectors at present are still the YOLO series. This paper chooses YOLOv7 [63], which has been proven effective in a variety of computer vision tasks and various scenarios, as the base for developing the proposed method. We use GELAN to improve the architecture and the proposed PGI to improve the training process. This novel approach makes the proposed YOLOv9 the top real-time object detector of the new generation.
2.2. Reversible Architectures
The operation units of reversible architectures [3, 16, 19] must maintain the characteristic of reversible conversion, which ensures that the output feature map of each layer of operation units retains the complete original information. Previously, RevCol [3] generalized the traditional reversible unit to multiple levels, and in doing so expanded the semantic levels expressed by different layer units. Through a literature review of various neural network architectures, we found that there are many high-performing architectures with varying degrees of reversible properties. For example, the Res2Net module [11] combines different input partitions with the next partition in a hierarchical manner, and concatenates all converted partitions before passing them backwards. CBNet [34, 39] re-introduces the original input data through a composite backbone to obtain complete original information, and obtains multi-level reversible information at different levels through various composition methods. These network architectures generally have excellent parameter utilization, but the extra composite layers cause slow inference speeds. DynamicDet [36] combines CBNet [34] and the high-efficiency real-time object detector YOLOv7 [63] to achieve a very good trade-off among speed, number of parameters, and accuracy. This paper introduces the DynamicDet architecture as the basis for designing reversible branches. In addition, reversible information is further introduced into the proposed PGI. The proposed new architecture does not require additional connections during the inference process, so it can fully retain the advantages in speed, parameter count, and accuracy.
2.3. Auxiliary Supervision
Deep supervision [28, 54, 68] is the most common auxiliary supervision method, which performs training by inserting additional prediction layers in the middle layers. The application of multi-layer decoders in transformer-based methods is an especially common example. Another common auxiliary supervision method is to utilize relevant meta information to guide the feature maps produced by the intermediate layers and make them have the properties required by the target tasks [18, 20, 24, 29, 76]. Examples of this type include using a segmentation loss or depth loss to enhance the accuracy of object detectors. Recently, there have been many reports in the literature [53, 67, 82] of using different label assignment methods to generate different auxiliary supervision mechanisms, speeding up model convergence while improving robustness. However, the auxiliary supervision mechanism is usually only applicable to large models; when it is applied to lightweight models, it easily causes an under-parameterization phenomenon, which makes the performance worse. The PGI we propose designs a way to reprogram multi-level semantic information, and this design allows lightweight models to also benefit from the auxiliary supervision mechanism.
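For reference, classic deep supervision in its simplest form looks like the following PyTorch sketch: an extra classifier attached to an intermediate feature, trained with the same targets. The layer sizes and the 0.3 auxiliary weight are illustrative choices, not values from the paper.

```python
# Sketch of classic deep supervision: an auxiliary prediction layer on an
# intermediate feature. Sizes and loss weight are illustrative assumptions.
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(64, num_classes)
        self.aux_head = nn.Linear(32, num_classes)  # supervises the shallow stage

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        out = self.head(f2.mean(dim=(2, 3)))        # global average pooling
        if self.training:
            aux = self.aux_head(f1.mean(dim=(2, 3)))
            return out, aux
        return out

# loss = ce(out, y) + 0.3 * ce(aux, y)   # auxiliary term trains shallow layers directly
```

The under-parameterization issue described above arises because the small shallow stage is asked to fit the final targets on its own; PGI is designed to avoid forcing that premature mapping.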
3. Problem Statement
Usually, people attribute the difficulty of deep neural network convergence to factors such as vanishing gradients or gradient saturation, and these phenomena do exist in traditional deep neural networks. However, modern deep neural networks have already fundamentally solved these problems by designing various normalization and activation functions. Nevertheless, deep neural networks still suffer from slow convergence or poor convergence results.

In this paper, we explore the nature of the above issue further. Through in-depth analysis of the information bottleneck, we deduced that the root cause of this problem is that the initial gradient, originally coming from a very deep network, loses a lot of the information needed to achieve the goal soon after it is transmitted. To confirm this inference, we feedforward deep networks of different architectures with initial weights, and then visualize the results in Figure 2. Obviously, PlainNet has lost a lot of important information required for object detection in deep layers. As for the proportion of important information that ResNet, CSPNet, and GELAN can retain, it is indeed positively related to the accuracy that can be obtained after training. We further design reversible network-based methods to solve the causes of the above problems. In this section we shall elaborate our analysis of the information bottleneck principle and reversible functions.
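A Figure 2 style probe is easy to reproduce. The sketch below feeds one input through a randomly initialized network and dumps intermediate feature maps via forward hooks; torchvision's resnet18 stands in for the architectures compared in the paper, and the random input is a placeholder for a real image.

```python
# Sketch of the random-initial-weights probe behind Figure 2: capture
# intermediate feature maps of an untrained network with forward hooks.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # random initial weights
model.eval()

feature_maps = {}
def save_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_hook(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # replace with a real image tensor

for name, fmap in feature_maps.items():
    # Channel-averaged map: a rough view of how much input structure survives.
    print(name, tuple(fmap.shape), float(fmap.mean(dim=1).std()))
```

Visualizing the channel-averaged maps at each depth shows, qualitatively, how much of the input's spatial structure each architecture preserves before any training.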
3.1. Information Bottleneck Principle
According to the information bottleneck principle, we know that data X may suffer information loss when going through transformation, as shown in Eq. 1 below:

I(X, X) ≥ I(X, f_θ(X)) ≥ I(X, g_ϕ(f_θ(X))),  (1)

where I indicates mutual information, f and g are transformation functions, and θ and ϕ are the parameters of f and g, respectively.
In deep neural networks, f_θ(·) and g_ϕ(·) respectively represent the operations of two consecutive layers of a deep neural network. From Eq. 1, we can predict that as the number of network layers becomes deeper, the original data will be more likely to be lost. However, the parameters of a deep neural network are updated based on the output of the network as well as the given target: a loss function is calculated to generate new gradients, which are then used to update the network. As one can imagine, the output of a deeper neural network is less able to retain complete information about the prediction target. This makes it possible for incomplete information to be used during network training, resulting in unreliable gradients and poor convergence.
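Eq. 1 is the data-processing inequality, and it can be checked exactly on a toy discrete example. The sketch below computes mutual information in closed form for a uniform X and two deterministic, non-injective transforms (f and g here are arbitrary toy functions chosen for illustration): each transform can only lose information about X.

```python
# Toy check of Eq. 1 with exact discrete mutual information.
import math
from collections import Counter

def mutual_info(pairs):
    """Exact I(A;B) in bits for a list of equally likely (a, b) samples."""
    n = len(pairs)
    pa, pb, pab = Counter(), Counter(), Counter(pairs)
    for a, b in pairs:
        pa[a] += 1
        pb[b] += 1
    return sum((c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

xs = [0, 1, 2, 3]                      # X uniform on 4 symbols: H(X) = 2 bits
f = lambda x: x // 2                   # merges pairs of symbols (lossy)
g = lambda y: 0                        # collapses everything (lossier still)

print(mutual_info([(x, x) for x in xs]))          # I(X;X)       = 2.0
print(mutual_info([(x, f(x)) for x in xs]))       # I(X;f(X))    = 1.0
print(mutual_info([(x, g(f(x))) for x in xs]))    # I(X;g(f(X))) = 0.0
```

Each successive transform strictly decreases the mutual information here, mirroring the layer-by-layer loss the inequality predicts.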
One way to solve the above problem is to directly increase the size of the model. When we use a large number of parameters to construct a model, it is more capable of performing a more complete transformation of the data. With this approach, even if information is lost during the data feedforward process, there is still a chance of retaining enough information to perform the mapping to the target. This phenomenon explains why width is more important than depth in most modern models. However, this conclusion cannot fundamentally solve the problem of unreliable gradients in very deep neural networks. Below, we will introduce how to use reversible functions to solve the problem and conduct the relevant analysis.
3.2. Reversible Functions
When a function r has an inverse transformation function v, we call this function a reversible function, as shown in Eq. 2:

X = v_ζ(r_ψ(X)),  (2)

where ψ and ζ are the parameters of r and v, respectively. Data X is converted by a reversible function without losing information, as shown in Eq. 3:

I(X, X) = I(X, r_ψ(X)) = I(X, v_ζ(r_ψ(X))).  (3)
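A standard way to build such an r with an explicit v is an additive coupling block (RevNet-style; a generic example, not this paper's construction). The sketch below verifies Eq. 2 and Eq. 3 in the only way code can: the input is recovered exactly, so no information is lost.

```python
# Sketch of Eq. 2/3 with an additive coupling block: the forward map
# r is exactly invertible by construction, so I(X, r(X)) = I(X, X).
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x1, x2):          # plays the role of r_psi
        return x1 + self.f(x2), x2

    def inverse(self, y1, y2):          # plays the role of v_zeta
        return y1 - self.f(y2), y2

block = AdditiveCoupling(8)
x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1), torch.allclose(r2, x2))  # True True: X recovered exactly
```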
When the network’s transformation function is composed of reversible functions, more reliable gradients can be obtained to update the model. Almost all of today’s popular deep learning methods are architectures that conform to the reversible property, such as Eq. 4:
X^{l+1} = X^l + f_θ^{l+1}(X^l),  (4)
where l indicates the l-th layer of a PreAct ResNet and f is the transformation function of the l-th layer. PreAct ResNet [22] repeatedly passes the original data X to subsequent layers in an explicit way. Although such a design can make a deep neural network with more than a thousand layers converge very well, it destroys an important reason why we need deep neural networks: for difficult problems, it is hard to directly find simple mapping functions to map data to targets. This also explains why PreAct ResNet performs worse than ResNet [21] when the number of layers is small.
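For concreteness, Eq. 4 corresponds to a pre-activation residual block like the following sketch, in which the identity path carries X^l forward explicitly. Layer sizes are illustrative.

```python
# Sketch of Eq. 4: a pre-activation residual block, X^{l+1} = X^l + f(X^l).
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Pre-activation order: norm and nonlinearity come before each conv.
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.f(x)  # identity path passes the original data through

x = torch.randn(2, 16, 32, 32)
print(PreActBlock(16)(x).shape)  # torch.Size([2, 16, 32, 32])
```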
In addition, we tried to use the masked modeling that allowed transformer models to achieve significant breakthroughs. We use approximation methods, such as Eq. 5, to try to find the inverse transformation v of r, so that the transformed features can retain enough information using sparse features. The form of Eq. 5 is as follows:

X = v_ζ(r_ψ(X) · M),  (5)
where M is a dynamic binary mask. Other methods commonly used to perform the above task are the diffusion model and the variational autoencoder, and both have the function of finding the inverse function. However, when we apply these approaches to a lightweight model, there will be defects, because the lightweight model will be under-parameterized relative to the large amount of raw data. For this reason, the important information I(Y, X) that maps data X to target Y will also face the same problem. We will explore this issue using the concept of the information bottleneck [59]. The formula for the information bottleneck is as follows:

I(X, X) ≥ I(Y, X) ≥ I(Y, f_θ(X)) ≥ ... ≥ I(Y, Ŷ).  (6)

Generally speaking, I(Y, X) only occupies a very small part of I(X, X). However, it is critical to the target task. Therefore, even if the amount of information lost in the feedforward stage is not significant, as long as the loss covers I(Y, X), the training effect will be greatly affected. The lightweight model itself is in an under-parameterized state, so it easily loses a lot of important information in the feedforward stage. Therefore, our goal for the lightweight model is to accurately filter I(Y, X) from I(X, X); fully preserving the information of X is difficult to achieve. Based on the above analysis, we hope to propose a new deep neural network training method that can not only generate reliable gradients to update the model, but also be suitable for shallow and lightweight neural networks.