YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Chien-Yao Wang¹,², I-Hau Yeh², and Hong-Yuan Mark Liao¹,²,³
¹ Institute of Information Science, Academia Sinica, Taiwan
² National Taipei University of Technology, Taiwan
³ Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan
kinyiu@iis.sinica.edu.tw, ihyeh@emc.com.tw, and liao@iis.sinica.edu.tw
Abstract
Today's deep learning methods focus on designing the most appropriate objective functions so that a model's predictions can be as close as possible to the ground truth. Meanwhile, an appropriate architecture must be designed that facilitates the acquisition of enough information for prediction. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), is designed based on gradient path planning. GELAN's architecture confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO dataset-based object detection. The results show that GELAN uses only conventional convolution operators yet achieves better parameter utilization than state-of-the-art methods developed on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained on large datasets; the comparison results are shown in Figure 1. The source code is available at: https://github.com/WongKinYiu/yolov9.
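To make the idea of an aggregation block built purely from conventional convolutions more concrete, the following PyTorch sketch shows a simplified CSP/ELAN-style block. The module names, channel widths, depths, and split/fusion choices here are illustrative assumptions for exposition, not the released YOLOv9 architecture.

```python
# Illustrative sketch only: a simplified GELAN-style aggregation block built
# from plain (non-depth-wise) convolutions. Widths, depths, and the exact
# split/fusion rules are assumptions, not the official YOLOv9 design.
import torch
import torch.nn as nn


class ConvBNAct(nn.Module):
    """Conventional 2D convolution followed by BatchNorm and SiLU."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class GELANBlockSketch(nn.Module):
    """CSP-style split, two stacked conv branches, then concatenation of all
    intermediate outputs followed by a 1x1 fusion convolution."""
    def __init__(self, c_in, c_out, c_hidden, n=2):
        super().__init__()
        self.split = ConvBNAct(c_in, 2 * c_hidden, k=1)
        self.branches = nn.ModuleList([
            nn.Sequential(*[ConvBNAct(c_hidden, c_hidden) for _ in range(n)])
            for _ in range(2)
        ])
        # fuse the two split halves plus the two branch outputs
        self.fuse = ConvBNAct(4 * c_hidden, c_out, k=1)

    def forward(self, x):
        y = list(self.split(x).chunk(2, dim=1))   # keep both halves
        y.append(self.branches[0](y[-1]))          # first aggregation stage
        y.append(self.branches[1](y[-1]))          # second aggregation stage
        return self.fuse(torch.cat(y, dim=1))


if __name__ == "__main__":
    block = GELANBlockSketch(c_in=64, c_out=128, c_hidden=32)
    print(block(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```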
1. Introduction
Figure 1. Comparison of real-time object detectors on the MS COCO dataset. The GELAN- and PGI-based object detection method surpasses all previous train-from-scratch methods in terms of object detection performance. In terms of accuracy, the new method outperforms RT DETR [43] pre-trained on a large dataset, and it also outperforms the depth-wise convolution-based design YOLO MS [7] in terms of parameter utilization.

Deep learning-based models have demonstrated far better performance than past artificial intelligence systems in various fields, such as computer vision, language processing, and speech recognition. In recent years, researchers in the field of deep learning have mainly focused on how to develop more powerful system architectures and learning methods, such as CNNs [21–23, 42, 55, 71, 72], Transformers [8, 9, 40, 41, 60, 69, 70], Perceivers [26, 26, 32, 52, 56, 81, 81], and Mambas [17, 38, 80]. In addition, some researchers have tried to develop more general objective functions, such as loss functions [5, 45, 46, 50, 77, 78], label assignment [10, 12, 33, 67, 79], and auxiliary supervision [18, 20, 24, 28, 29, 51, 54, 68, 76]. The above studies all try to precisely find the mapping between input and target tasks. However, most past approaches have ignored that input data may suffer a non-negligible amount of information loss during the feedforward process. This loss of information can lead to biased gradient flows, which are subsequently used to update the model. These problems can cause deep networks to establish incorrect associations between targets and inputs, leading the trained model to produce incorrect predictions.
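As a minimal toy illustration of the kind of information loss described above (not an experiment from the paper), the sketch below constructs two different inputs that collapse to identical activations after a ReLU and max-pooling. Everything computed downstream of that point, including the objective and the gradients derived from it, can no longer distinguish the two inputs.

```python
# Minimal sketch (illustrative assumption, not from the paper): common
# feedforward operations are lossy. Two distinct inputs collapse to the same
# activation after ReLU + max-pooling, so the difference between them is
# unavailable to any objective or gradient computed from that activation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x1 = torch.randn(1, 1, 4, 4)
x2 = x1.clone()
x2[x1 < 0] -= 1.0  # perturb only the values that ReLU will discard anyway

f1 = F.max_pool2d(F.relu(x1), kernel_size=2)
f2 = F.max_pool2d(F.relu(x2), kernel_size=2)

print(torch.equal(f1, f2))           # True: the features are indistinguishable
print((x1 - x2).abs().sum().item())  # > 0: yet the inputs clearly differ
```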