YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
Chien-Yao Wang^{1,2}, I-Hau Yeh^{2}, and Hong-Yuan Mark Liao^{1,2,3}
^1 Institute of Information Science, Academia Sinica, Taiwan
^2 National Taipei University of Technology, Taiwan
^3 Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan
kinyiu@iis.sinica.edu.tw, ihyeh@emc.com.tw, and liao@iis.sinica.edu.tw
Abstract
Today’s deep learning methods focus on designing the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate the acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning is designed. GELAN’s architecture confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on object detection based on the MS COCO dataset. The results show that GELAN uses only conventional convolution operators yet achieves better parameter utilization than state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large. It can be used to obtain complete information, so that models trained from scratch can achieve better results than state-of-the-art models pre-trained on large datasets; the comparison results are shown in Figure 1. The source code is at: https://github.com/WongKinYiu/yolov9.
1. Introduction
Figure 1. Comparisons of real-time object detectors on the MS COCO dataset. The GELAN- and PGI-based object detection method surpasses all previous train-from-scratch methods in object detection performance. In terms of accuracy, the new method outperforms RT DETR [43] pre-trained on a large dataset, and it also outperforms the depth-wise convolution-based design YOLO MS [7] in terms of parameter utilization.

Deep learning-based models have demonstrated far better performance than past artificial intelligence systems in various fields, such as computer vision, language processing, and speech recognition. In recent years, researchers in the field of deep learning have mainly focused on how to develop more powerful system architectures and learning methods, such as CNNs [21–23, 42, 55, 71, 72], Transformers [8, 9, 40, 41, 60, 69, 70], Perceivers [26, 32, 52, 56, 81], and Mambas [17, 38, 80]. In addition, some researchers have tried to develop more general objective functions, such as loss functions [5, 45, 46, 50, 77, 78], label assignment [10, 12, 33, 67, 79], and auxiliary supervision [18, 20, 24, 28, 29, 51, 54, 68, 76]. The above studies all try to precisely find the mapping between input and target tasks. However, most past approaches have ignored that input data may suffer a non-negligible amount of information loss during the feedforward process. This loss of information can lead to biased gradient flows, which are subsequently used to update the model. These problems can cause deep networks to establish incorrect associations between targets and inputs, causing the trained model to produce incorrect predictions.
Figure 2. Visualization results of random initial weight output feature maps for different network architectures: (a) input image, (b) PlainNet, (c) ResNet, (d) CSPNet, and (e) proposed GELAN. From the figure, we can see that in different architectures, the information provided to the objective function to calculate the loss is lost to varying degrees, and our architecture can retain the most complete information and provide the most reliable gradient information for calculating the objective function.
In deep networks, the phenomenon of input data losing information during the feedforward process is commonly known as the information bottleneck [59], and its schematic diagram is shown in Figure 2. At present, the main methods that can alleviate this phenomenon are as follows: (1) the use of reversible architectures [3, 16, 19]: this method mainly uses repeated input data and maintains the information of the input data in an explicit way; (2) the use of masked modeling [1, 6, 9, 27, 71, 73]: it mainly uses a reconstruction loss and adopts an implicit way to maximize the extracted features and retain the input information; and (3) the introduction of the deep supervision concept [28, 51, 54, 68]: it uses shallow features that have not lost too much important information to pre-establish a mapping from features to targets, ensuring that important information can be transferred to deeper layers. However, the above methods have different drawbacks in the training and inference processes. For example, a reversible architecture requires additional layers to combine repeatedly fed input data, which significantly increases the inference cost. In addition, since the path from the input layer to the output layer cannot be too deep, this limitation makes it difficult to model high-order semantic information during the training process. As for masked modeling, its reconstruction loss sometimes conflicts with the target loss. In addition, most mask mechanisms also produce incorrect associations with data. For the deep supervision mechanism, it produces error accumulation, and if the shallow supervision loses information during the training process, the subsequent layers will not be able to retrieve the required information. The above phenomena are more significant on difficult tasks and small models.
To address the above-mentioned issues, we propose a new concept: programmable gradient information (PGI). The concept is to generate reliable gradients through an auxiliary reversible branch, so that deep features can still maintain the key characteristics needed to execute the target task. The design of the auxiliary reversible branch avoids the semantic loss that may be caused by a traditional deep supervision process that integrates multi-path features. In other words, we are programming gradient information propagation at different semantic levels, and thereby achieving the best training results. The reversible architecture of PGI is built on the auxiliary branch, so there is no additional cost. Since PGI can freely select a loss function suitable for the target task, it also overcomes the problems encountered by masked modeling. The proposed PGI mechanism can be applied to deep neural networks of various sizes and is more general than the deep supervision mechanism, which is only suitable for very deep neural networks.
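To make the training-only auxiliary branch idea concrete, the following is a minimal PyTorch sketch. It is a sketch of the general pattern, not YOLOv9's actual modules: the names Backbone, MainHead, and AuxHead, and the 0.25 loss weight, are hypothetical placeholders.

```python
# Minimal sketch of a training-only auxiliary branch.
# All module names and the loss weight are illustrative assumptions,
# not YOLOv9's actual components or hyperparameters.
import torch.nn as nn

class PGIStyleModel(nn.Module):
    def __init__(self, backbone: nn.Module, main_head: nn.Module, aux_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.main_head = main_head
        self.aux_head = aux_head  # supplies extra gradients during training only

    def forward(self, x):
        feats = self.backbone(x)
        main_out = self.main_head(feats)
        if self.training:
            # The auxiliary branch sees the same features and produces its own
            # prediction, giving shallow layers a second, more direct gradient path.
            return main_out, self.aux_head(feats)
        return main_out  # auxiliary branch is dropped at inference: no extra cost

# Training step (sketch): combine the main and auxiliary losses.
# main_out, aux_out = model(images)
# loss = criterion(main_out, targets) + 0.25 * criterion(aux_out, targets)
# loss.backward()
```

Because the auxiliary branch only exists on the training graph, the deployed network pays no latency or parameter penalty for it.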
In this paper, we also design generalized ELAN (GELAN) based on ELAN [65]. The design of GELAN simultaneously takes into account the number of parameters, computational complexity, accuracy, and inference speed. This design allows users to arbitrarily choose appropriate computational blocks for different inference devices. We combined the proposed PGI and GELAN to design a new generation of the YOLO series object detection system, which we call YOLOv9. We used the MS COCO dataset to conduct experiments, and the experimental results verified that our proposed YOLOv9 achieves top performance in all comparisons.
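As an illustration of the "pluggable computational block" idea, here is a schematic sketch of a generalized-ELAN-style block in PyTorch. The topology, channel widths, and block count are illustrative assumptions, not the exact GELAN architecture.

```python
# Illustrative sketch of a generalized-ELAN-style block: a CSP-like channel
# split whose inner computational block is pluggable. Not the exact GELAN topology.
import torch
import torch.nn as nn

class GELANStyleBlock(nn.Module):
    def __init__(self, channels: int, block_factory, n_blocks: int = 2):
        super().__init__()
        half = channels // 2
        # Any computational block (plain conv block, CSP block, ...) can be plugged in.
        self.blocks = nn.ModuleList([block_factory(half) for _ in range(n_blocks)])
        # 1x1 conv fuses the untouched split plus every intermediate output.
        self.fuse = nn.Conv2d(half * (n_blocks + 2), channels, kernel_size=1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)          # split channels in two
        outs = [a, b]
        for blk in self.blocks:
            b = blk(b)                    # chain blocks along one branch
            outs.append(b)                # keep every intermediate feature
        return self.fuse(torch.cat(outs, dim=1))

# Example block factory: a plain 3x3 conv stage, illustrating that only
# conventional convolution is needed.
def conv_block(c):
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU())

y = GELANStyleBlock(64, conv_block)(torch.randn(1, 64, 32, 32))  # -> [1, 64, 32, 32]
```

The aggregation of every intermediate feature is what gives each layer multiple gradient paths back to the input, in the spirit of gradient path planning.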
We summarize the contributions of this paper as follows:
1. We theoretically analyzed existing deep neural network architectures from the perspective of reversible functions, and through this process successfully explained many phenomena that were difficult to explain in the past. We also designed PGI and the auxiliary reversible branch based on this analysis and achieved excellent results.
2. The PGI we designed solves the problem that deep supervision can only be used for extremely deep neural network architectures, and therefore allows new lightweight architectures to be truly applied in daily life.
3. The GELAN we designed uses only conventional convolution to achieve higher parameter usage than state-of-the-art designs based on depth-wise convolution, while showing great advantages of being light, fast, and accurate.
4. Combining the proposed PGI and GELAN, the object detection performance of YOLOv9 on the MS COCO dataset greatly surpasses existing real-time object detectors in all aspects.
2. Related work
2.1. Real-time Object Detectors
The current mainstream real-time object detectors are the YOLO series [2, 7, 13–15, 25, 30, 31, 47–49, 61–63, 74, 75], and most of these models use CSPNet [64] or ELAN [65] and their variants as the main computing units. In terms of feature integration, an improved PAN [37] or FPN [35] is often used, and an improved YOLOv3 head [49] or FCOS head [57, 58] is then used as the prediction head. Recently, some real-time object detectors built on DETR [4], such as RT DETR [43], have also been proposed. However, since it is extremely difficult for DETR-series object detectors to be applied to new domains without a corresponding domain pre-trained model, the most widely used real-time object detectors at present are still the YOLO series. This paper chooses YOLOv7 [63], which has been proven effective in a variety of computer vision tasks and various scenarios, as the base for developing the proposed method. We use GELAN to improve the architecture and the proposed PGI to improve the training process. This novel approach makes the proposed YOLOv9 the top real-time object detector of the new generation.
2.2. Reversible Architectures
The operation units of reversible architectures [3, 16, 19] must maintain the characteristic of reversible conversion, which ensures that the output feature map of each layer of operation units retains the complete original information. Previously, RevCol [3] generalized the traditional reversible unit to multiple levels, and in doing so expanded the semantic levels expressed by different layer units. Through a literature review of various neural network architectures, we found that there are many high-performing architectures with varying degrees of reversible properties. For example, the Res2Net module [11] combines different input partitions with the next partition in a hierarchical manner, and concatenates all converted partitions before passing them backwards. CBNet [34, 39] re-introduces the original input data through a composite backbone to obtain complete original information, and obtains multi-level reversible information at different levels through various composition methods. These network architectures generally have excellent parameter utilization, but the extra composite layers cause slow inference speeds. DynamicDet [36] combines CBNet [34] and the high-efficiency real-time object detector YOLOv7 [63] to achieve a very good trade-off among speed, number of parameters, and accuracy. This paper introduces the DynamicDet architecture as the basis for designing reversible branches. In addition, reversible information is further introduced into the proposed PGI. The proposed new architecture does not require additional connections during the inference process, so it can fully retain the advantages in speed, parameter count, and accuracy.
2.3. Auxiliary Supervision
Deep supervision [28, 54, 68] is the most common auxiliary supervision method, which performs training by inserting additional prediction layers in the middle layers. The application of multi-layer decoders in transformer-based methods is an especially common example. Another common auxiliary supervision method is to utilize relevant meta information to guide the feature maps produced by the intermediate layers and make them have the properties required by the target tasks [18, 20, 24, 29, 76]. Examples of this type include using a segmentation loss or depth loss to enhance the accuracy of object detectors. Recently, there have been many reports in the literature [53, 67, 82] of using different label assignment methods to generate different auxiliary supervision mechanisms, speeding up model convergence while improving robustness. However, the auxiliary supervision mechanism is usually only applicable to large models; when it is applied to lightweight models, it easily causes an under-parameterization phenomenon, which makes the performance worse. The PGI we propose designs a way to reprogram multi-level semantic information, and this design allows lightweight models to also benefit from the auxiliary supervision mechanism.
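For reference, classic deep supervision in its simplest form looks like the following PyTorch sketch: an extra classifier attached to an intermediate feature, trained with the same targets. The layer sizes and the 0.3 auxiliary weight are illustrative choices, not values from the paper.

```python
# Sketch of classic deep supervision: an auxiliary prediction layer on an
# intermediate feature. Sizes and loss weight are illustrative assumptions.
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(64, num_classes)
        self.aux_head = nn.Linear(32, num_classes)  # supervises the shallow stage

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        out = self.head(f2.mean(dim=(2, 3)))        # global average pooling
        if self.training:
            aux = self.aux_head(f1.mean(dim=(2, 3)))
            return out, aux
        return out

# loss = ce(out, y) + 0.3 * ce(aux, y)   # auxiliary term trains shallow layers directly
```

The under-parameterization issue described above arises because the small shallow stage is asked to fit the final targets on its own; PGI is designed to avoid forcing that premature mapping.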
3. Problem Statement
Usually, people attribute the difficulty of deep neural network convergence to factors such as vanishing gradients or gradient saturation, and these phenomena do exist in traditional deep neural networks. However, modern deep neural networks have already fundamentally solved these problems by designing various normalization and activation functions. Nevertheless, deep neural networks still suffer from slow convergence or poor convergence results.

In this paper, we explore the nature of the above issue further. Through in-depth analysis of the information bottleneck, we deduced that the root cause of this problem is that the initial gradient, originally coming from a very deep network, loses a lot of the information needed to achieve the goal soon after it is transmitted. To confirm this inference, we feedforward deep networks of different architectures with initial weights, and then visualize the results in Figure 2. Obviously, PlainNet has lost a lot of important information required for object detection in deep layers. As for the proportion of important information that ResNet, CSPNet, and GELAN can retain, it is indeed positively related to the accuracy that can be obtained after training. We further design reversible network-based methods to solve the causes of the above problems. In this section we shall elaborate our analysis of the information bottleneck principle and reversible functions.
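A Figure 2 style probe is easy to reproduce. The sketch below feeds one input through a randomly initialized network and dumps intermediate feature maps via forward hooks; torchvision's resnet18 stands in for the architectures compared in the paper, and the random input is a placeholder for a real image.

```python
# Sketch of the random-initial-weights probe behind Figure 2: capture
# intermediate feature maps of an untrained network with forward hooks.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # random initial weights
model.eval()

feature_maps = {}
def save_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_hook(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # replace with a real image tensor

for name, fmap in feature_maps.items():
    # Channel-averaged map: a rough view of how much input structure survives.
    print(name, tuple(fmap.shape), float(fmap.mean(dim=1).std()))
```

Visualizing the channel-averaged maps at each depth shows, qualitatively, how much of the input's spatial structure each architecture preserves before any training.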
3.1. Information Bottleneck Principle
According to the information bottleneck principle, we know that data X may suffer information loss when going through transformation, as shown in Eq. 1 below:

I(X, X) ≥ I(X, f_θ(X)) ≥ I(X, g_ϕ(f_θ(X))),  (1)

where I indicates mutual information, f and g are transformation functions, and θ and ϕ are the parameters of f and g, respectively.
In deep neural networks, f_θ(·) and g_ϕ(·) respectively represent the operations of two consecutive layers of a deep neural network. From Eq. 1, we can predict that as the number of network layers becomes deeper, the original data will be more likely to be lost. However, the parameters of a deep neural network are updated based on the output of the network as well as the given target: a loss function is calculated to generate new gradients, which are then used to update the network. As one can imagine, the output of a deeper neural network is less able to retain complete information about the prediction target. This makes it possible for incomplete information to be used during network training, resulting in unreliable gradients and poor convergence.
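Eq. 1 is the data-processing inequality, and it can be checked exactly on a toy discrete example. The sketch below computes mutual information in closed form for a uniform X and two deterministic, non-injective transforms (f and g here are arbitrary toy functions chosen for illustration): each transform can only lose information about X.

```python
# Toy check of Eq. 1 with exact discrete mutual information.
import math
from collections import Counter

def mutual_info(pairs):
    """Exact I(A;B) in bits for a list of equally likely (a, b) samples."""
    n = len(pairs)
    pa, pb, pab = Counter(), Counter(), Counter(pairs)
    for a, b in pairs:
        pa[a] += 1
        pb[b] += 1
    return sum((c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

xs = [0, 1, 2, 3]                      # X uniform on 4 symbols: H(X) = 2 bits
f = lambda x: x // 2                   # merges pairs of symbols (lossy)
g = lambda y: 0                        # collapses everything (lossier still)

print(mutual_info([(x, x) for x in xs]))          # I(X;X)       = 2.0
print(mutual_info([(x, f(x)) for x in xs]))       # I(X;f(X))    = 1.0
print(mutual_info([(x, g(f(x))) for x in xs]))    # I(X;g(f(X))) = 0.0
```

Each successive transform strictly decreases the mutual information here, mirroring the layer-by-layer loss the inequality predicts.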
One way to solve the above problem is to directly increase the size of the model. When we use a large number of parameters to construct a model, it is more capable of performing a more complete transformation of the data. With this approach, even if information is lost during the data feedforward process, there is still a chance of retaining enough information to perform the mapping to the target. This phenomenon explains why width is more important than depth in most modern models. However, this conclusion cannot fundamentally solve the problem of unreliable gradients in very deep neural networks. Below, we will introduce how to use reversible functions to solve the problem and conduct the relevant analysis.
3.2. Reversible Functions
When a function r has an inverse transformation function v, we call this function a reversible function, as shown in Eq. 2:

X = v_ζ(r_ψ(X)),  (2)

where ψ and ζ are the parameters of r and v, respectively. Data X is converted by a reversible function without losing information, as shown in Eq. 3:

I(X, X) = I(X, r_ψ(X)) = I(X, v_ζ(r_ψ(X))).  (3)
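A standard way to build such an r with an explicit v is an additive coupling block (RevNet-style; a generic example, not this paper's construction). The sketch below verifies Eq. 2 and Eq. 3 in the only way code can: the input is recovered exactly, so no information is lost.

```python
# Sketch of Eq. 2/3 with an additive coupling block: the forward map
# r is exactly invertible by construction, so I(X, r(X)) = I(X, X).
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x1, x2):          # plays the role of r_psi
        return x1 + self.f(x2), x2

    def inverse(self, y1, y2):          # plays the role of v_zeta
        return y1 - self.f(y2), y2

block = AdditiveCoupling(8)
x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1), torch.allclose(r2, x2))  # True True: X recovered exactly
```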
When the network’s transformation function is composed of reversible functions, more reliable gradients can be obtained to update the model. Almost all of today’s popular deep learning methods are architectures that conform to the reversible property, such as Eq. 4:
X^{l+1} = X^l + f_θ^{l+1}(X^l),  (4)
where l indicates the l-th layer of a PreAct ResNet and f is the transformation function of the l-th layer. PreAct ResNet [22] repeatedly passes the original data X to subsequent layers in an explicit way. Although such a design can make a deep neural network with more than a thousand layers converge very well, it destroys an important reason why we need deep neural networks: for difficult problems, it is hard to directly find simple mapping functions to map data to targets. This also explains why PreAct ResNet performs worse than ResNet [21] when the number of layers is small.
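For concreteness, Eq. 4 corresponds to a pre-activation residual block like the following sketch, in which the identity path carries X^l forward explicitly. Layer sizes are illustrative.

```python
# Sketch of Eq. 4: a pre-activation residual block, X^{l+1} = X^l + f(X^l).
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Pre-activation order: norm and nonlinearity come before each conv.
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.f(x)  # identity path passes the original data through

x = torch.randn(2, 16, 32, 32)
print(PreActBlock(16)(x).shape)  # torch.Size([2, 16, 32, 32])
```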
In addition, we tried to use the masked modeling that allowed transformer models to achieve significant breakthroughs. We use approximation methods, such as Eq. 5, to try to find the inverse transformation v of r, so that the transformed features can retain enough information using sparse features. The form of Eq. 5 is as follows:

X = v_ζ(r_ψ(X) · M),  (5)
where M is a dynamic binary mask. Other methods commonly used to perform the above task are the diffusion model and the variational autoencoder, and both have the function of finding the inverse function. However, when we apply these approaches to a lightweight model, there will be defects, because the lightweight model will be under-parameterized relative to the large amount of raw data. For this reason, the important information I(Y, X) that maps data X to target Y will also face the same problem. We will explore this issue using the concept of the information bottleneck [59]. The formula for the information bottleneck is as follows:

I(X, X) ≥ I(Y, X) ≥ I(Y, f_θ(X)) ≥ ... ≥ I(Y, Ŷ).  (6)

Generally speaking, I(Y, X) only occupies a very small part of I(X, X). However, it is critical to the target task. Therefore, even if the amount of information lost in the feedforward stage is not significant, as long as the loss covers I(Y, X), the training effect will be greatly affected. The lightweight model itself is in an under-parameterized state, so it easily loses a lot of important information in the feedforward stage. Therefore, our goal for the lightweight model is to accurately filter I(Y, X) from I(X, X); fully preserving the information of X is difficult to achieve. Based on the above analysis, we hope to propose a new deep neural network training method that can not only generate reliable gradients to update the model, but also be suitable for shallow and lightweight neural networks.