rectangle whose size can be chosen according to the size of the objects in each image. Thus, the network should not learn features that rely on the whole object of interest. The same idea underlies other methods, such as GridMask [17] and Hide-and-Seek (HaS) [18].
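As an illustration, a minimal sketch of this rectangle-erasing idea (the function name, the NumPy usage, and the fixed fill value are illustrative choices, not taken from the cited works) could look as follows:

```python
import numpy as np

def erase_random_rectangle(image, box_h, box_w, fill=0):
    """Zero out a randomly placed box_h x box_w rectangle in an HxWxC image."""
    h, w = image.shape[:2]
    top = np.random.randint(0, max(h - box_h, 1))    # random vertical position
    left = np.random.randint(0, max(w - box_w, 1))   # random horizontal position
    out = image.copy()
    out[top:top + box_h, left:left + box_w] = fill   # mask the rectangle
    return out
```

Here, box_h and box_w would be chosen according to the object sizes in the dataset, as described above.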
Augmenting across a batch of samples can also be beneficial, as it extends the vicinity of the dataset by mixing several images to produce a new one [19]. Several image classification augmentations operate on a batch of images, such as mixup [19], CutMix [20], and Puzzle Mix [21]. Attempts have been made to extend these techniques beyond image classification to object detection [19]; a particularly compelling example is the mosaic augmentation implemented in [22].
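For concreteness, a minimal sketch of the mixup idea for a batch of images with one-hot labels (the use of NumPy arrays and the mixing parameter alpha are assumptions for illustration) is:

```python
import numpy as np

def mixup_batch(images, one_hot_labels, alpha=0.2):
    """Mix every sample with a randomly chosen partner from the same batch."""
    lam = np.random.beta(alpha, alpha)           # mixing coefficient
    perm = np.random.permutation(len(images))    # partner indices
    mixed_x = lam * images + (1 - lam) * images[perm]
    mixed_y = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y
```

CutMix and mosaic augmentation follow the same batch-mixing spirit but paste image regions instead of interpolating pixel values.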
Studies suggest that the severity and the number of augmentation techniques used during training affect model accuracy [23]–[25]. By training a reinforcement learning agent on a small dataset, AutoAugment attempts to find a policy for combining augmentation transformations. The high computational cost of AutoAugment encouraged the authors of [24] to develop RandAugment, which parameterizes the data augmentation process with only two parameters: the number of operations (N) and the severity (M). Combining ideas from RandAugment [24] and mixup [19], AugMix augments an image separately along several augmentation chains. The chain outputs are combined by a weighted sum whose coefficients are drawn from a Dirichlet distribution. Finally, a coefficient drawn from a Beta distribution weights the sum of the original and the augmented image.
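A simplified sketch of this mixing step (the chains argument stands for user-supplied augmentation chains; parameter names and defaults are illustrative) is:

```python
import numpy as np

def augmix_combine(image, chains, alpha=1.0):
    """Blend augmentation chains with Dirichlet weights, then mix with the original image."""
    w = np.random.dirichlet([alpha] * len(chains))      # one weight per chain
    mixed = np.zeros_like(image, dtype=np.float32)
    for weight, chain in zip(w, chains):
        mixed += weight * chain(image).astype(np.float32)
    m = np.random.beta(alpha, alpha)                    # original-vs-augmented weight
    return m * image.astype(np.float32) + (1 - m) * mixed
```

Each element of chains is a callable that applies a random sequence of RandAugment-style operations to the image.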
B. Regularization
As with augmentation, overfitting can be reduced by regularization techniques such as dropout. Although dropout works well with fully connected layers, the authors of [26] developed DropBlock for convolutional layers. Rather than dropping features at independent random locations, DropBlock drops a contiguous region. According to their study, gradually decreasing the probability of keeping blocks during training is more effective than using a fixed probability.
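A simplified PyTorch-style sketch of the block-dropping idea (this omits DropBlock's exact mask statistics and valid-seed-region handling) is:

```python
import torch
import torch.nn.functional as F

def drop_block(x, keep_prob=0.9, block_size=7, training=True):
    """Simplified DropBlock: zero out contiguous block_size x block_size regions of an NCHW tensor."""
    if not training or keep_prob >= 1.0:
        return x
    gamma = (1.0 - keep_prob) / (block_size ** 2)        # rate of block centres (simplified)
    centres = (torch.rand_like(x) < gamma).float()
    # Expand every sampled centre into a square block with a max-pool.
    block_mask = F.max_pool2d(centres, block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask
    # Rescale so the expected activation magnitude is preserved.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```

The gradual schedule suggested in [26] can be realized by passing a keep_prob value that decreases over the course of training.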
Some methods apply only to a specific structure, like shake-shake regularization [27], which targets multi-branch networks. The approach was developed for a three-branch block: during the training forward pass, two branches are scaled by random numbers and summed with the third branch, while different random numbers drawn from a Beta distribution are used as multipliers during backpropagation. At test time, the two branches are each multiplied by 0.5.
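A PyTorch-style sketch of the scaling logic for such a three-branch block (uniform coefficients in [0, 1] are used here for simplicity, and the branch outputs are assumed to be given) is:

```python
import torch

def shake_shake_combine(identity, branch1, branch2, training=True):
    """Combine two residual branches with random weights during training and 0.5 each at test time."""
    if not training:
        return identity + 0.5 * branch1 + 0.5 * branch2
    alpha = torch.rand(1, device=branch1.device)   # forward-pass coefficient
    beta = torch.rand(1, device=branch1.device)    # backward-pass coefficient
    mix_fwd = alpha * branch1 + (1 - alpha) * branch2
    mix_bwd = beta * branch1 + (1 - beta) * branch2
    # Forward value uses alpha; gradients flow as if beta had been used.
    return identity + mix_bwd + (mix_fwd - mix_bwd).detach()
```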
C. Attention Mechanisms
The application of attention mechanisms in artificial neural networks has been associated with NLP tasks [28]. In machine translation, the network must concentrate on certain parts of the input sequence from the source language to predict a word in the target language. The attention mechanism proposed in [29] helps the network focus on the relevant parts of the source sequence. This work encouraged other researchers to investigate the applicability of attention to different tasks [30]–[33]. Previously, the common choices for NLP tasks such as machine translation were recurrent and convolutional neural networks. The authors of [34] proposed a different architecture called the Transformer. Unlike earlier works, which combined attention with either recurrent or convolutional neural networks, the Transformer is based solely on scaled dot-product multi-head self-attention (SDMHSA). They claimed that attention could solve the machine translation task on its own, a concept further investigated in studies such as BERT [35].
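At the core of SDMHSA is the scaled dot-product attention operation; a single-head sketch in PyTorch (tensor shapes and names are illustrative) is:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for tensors shaped (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)   # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)               # attention weights per query
    return weights @ v
```

The multi-head variant runs several such heads on learned linear projections of the inputs and concatenates their outputs.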
The promising results of attention in NLP have motivated computer vision researchers to improve their results by adding attention to their networks [36]–[39]. In [40], the convolutional block attention module (CBAM) was introduced for convolutional neural networks. This module includes two sub-modules: a spatial attention module (SAM) and a channel attention module (CAM). The authors embedded CBAM in several state-of-the-art architectures such as ResNet50 [41], ResNeXt50 [42], and MobileNet [43]. By taking advantage of attention, they achieved higher accuracy in image classification on ImageNet [44] and in object detection on Pascal VOC [45] and Microsoft COCO [46]. The vision transformer (ViT) has bridged the gap between image classification and the transformer architecture by treating an image as a sequence of patches. This network has achieved state-of-the-art accuracy on ImageNet classification. Similar to [35] and [34], ViT uses SDMHSA as the main component throughout the network [12].
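As an illustration of the two CBAM sub-modules described above, a simplified PyTorch sketch (the reduction ratio and kernel size follow common defaults and are assumptions here) is:

```python
import torch
import torch.nn as nn

class SimplifiedCBAM(nn.Module):
    """Channel attention (CAM) followed by spatial attention (SAM)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP for channel attention
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        n, c, _, _ = x.shape
        # CAM: average- and max-pooled channel descriptors share one MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        # SAM: pool across channels, then apply a convolution over the spatial map.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(pooled))
```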
III. NETWORK STRUCTURE
The YOLOv4 architecture can be divided into three sub-networks: the backbone, the neck, and the head. The backbone of YOLOv4 is called CSP-Darknet-53. CSP-Darknet-53 extracts features from the input image and generates outputs at three levels. The first-level output has the highest spatial resolution and is suitable for detecting small objects. The second-level output has a lower spatial resolution than the first, making it appropriate for finding medium-sized objects in the image; its feature map is deeper than the first-level feature map. The third and last level output has the deepest feature map with the lowest spatial resolution. The YOLOv4 neck takes these feature maps and up-samples the lowest-resolution feature map with bilinear interpolation to match the spatial resolution of the second-level feature map. This up-sampled feature map is then concatenated with the second-level feature map to enrich the mid-resolution features for detecting medium-sized objects. The resulting feature map is up-sampled again and concatenated with the highest-resolution feature map. The YOLOv4 head receives the feature maps from the neck and detects objects at three scales.
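A sketch of this top-down neck flow (omitting the SPP and the convolutional blocks of the actual YOLOv4 neck; the names and the use of PyTorch are illustrative) is:

```python
import torch
import torch.nn.functional as F

def top_down_neck(p_high, p_mid, p_low):
    """p_high/p_mid/p_low: backbone outputs of decreasing spatial resolution."""
    # Up-sample the deepest map and enrich the mid-resolution features.
    low_up = F.interpolate(p_low, size=p_mid.shape[-2:], mode="bilinear", align_corners=False)
    mid_cat = torch.cat([p_mid, low_up], dim=1)
    # Repeat one level up for the highest-resolution features.
    mid_up = F.interpolate(mid_cat, size=p_high.shape[-2:], mode="bilinear", align_corners=False)
    high_cat = torch.cat([p_high, mid_up], dim=1)
    return high_cat, mid_cat, p_low    # passed to the three detection scales
```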
Residual blocks evidently play a vital role inside the YOLOv4 backbone, as CSP-Darknet-53 contains 23 of them. Motivated by the ViT transformer block, a transformer attention block is implemented to replace the residual blocks in CSP-Darknet-53. Replacement of the