TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

Xingkui Zhu*, Shuchang Lyu*, Xu Wang, Qi Zhao†
Beihang University, Beijing, China
{adlith, lyushuchang, sy2002406, zhaoqi}@buaa.edu.cn

*Contributed equally. †Corresponding author.
Abstract
Object detection on drone-captured scenarios has recently become a popular task. As drones always navigate at different altitudes, the object scale varies drastically, which burdens the optimization of networks. Moreover, high-speed and low-altitude flight brings motion blur on densely packed objects, which makes distinguishing objects a great challenge. To solve these two issues, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect objects of different scales. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential with the self-attention mechanism. We also integrate the Convolutional Block Attention Module (CBAM) to find attention regions in scenarios with dense objects. To further improve our proposed TPH-YOLOv5, we provide a bag of useful strategies such as data augmentation, multi-scale testing, multi-model ensemble and an extra classifier. Extensive experiments on the VisDrone2021 dataset show that TPH-YOLOv5 performs well, with impressive interpretability, on drone-captured scenarios. On the DET-test-challenge dataset, the AP of TPH-YOLOv5 is 39.18%, better than the previous SOTA method (DPNetV3) by 1.81%. In the VisDrone Challenge 2021, TPH-YOLOv5 wins 5th place and achieves results well matched with the 1st-place model (AP 39.43%). Compared to the baseline model (YOLOv5), TPH-YOLOv5 improves AP by about 7%, which is encouraging and competitive.
1. Introduction
Object detection technology on drone-captured scenarios has been widely used in many practical applications, such as plant protection [18, 41], wildlife protection [23, 22] and urban surveillance [1, 15]. In this paper, we focus on improving the performance of object detection on drone-captured images and providing insight for the above-mentioned applications.

Figure 1. Intuitive cases illustrating the three main problems in object detection on drone-captured images. The first, second and third rows respectively show the size variation, high density and large coverage of objects in drone-captured images.
Recent years have witnessed significant progress in object detection tasks using deep convolutional neural networks [40, 37, 34, 27, 58]. Notable benchmark datasets like MS COCO [30] and PASCAL VOC [9] have greatly promoted the development of object detection applications.

Figure 2. Overview of the working pipeline of TPH-YOLOv5. Compared to the original version, we mainly improve the head by applying Transformer Prediction Heads (TPH). We also add one more head to better detect objects of different scales. In addition, we employ a bag of tricks such as data augmentation, multi-scale testing, model ensemble and a self-trained classifier to make TPH-YOLOv5 stronger.
However, most previous deep convolutional neural networks are designed for natural-scene images. Directly applying previous models to the object detection task on drone-captured scenarios raises three main problems, intuitively illustrated by the cases in Fig. 1. First, the object scale varies drastically because the flight altitude of drones changes greatly. Second, drone-captured images contain objects in high density, which brings occlusion between objects. Third, drone-captured images always contain confusing geographic elements because they cover large areas. These three problems make object detection on drone-captured images very challenging.
In object detection, the YOLO series [37, 38, 39, 2] plays an important role among one-stage detectors. In this paper, we propose an improved model, TPH-YOLOv5, based on YOLOv5 [21] to solve the above-mentioned three problems. The overview of the detection pipeline using TPH-YOLOv5 is shown in Fig. 2. Following the original version, we use CSPDarknet53 [52, 2] as the backbone and the path aggregation network (PANet [33]) as the neck of TPH-YOLOv5. In the head part, we first introduce one more head for tiny-object detection. In total, TPH-YOLOv5 contains four detection heads, used separately for the detection of tiny, small, medium and large objects. Then, we replace the original prediction heads with Transformer Prediction Heads (TPH) [7, 49] to explore the prediction potential with the self-attention mechanism. To find the attention region in images with large coverage, we adopt the Convolutional Block Attention Module (CBAM [54]) to sequentially generate attention maps along the channel and spatial dimensions. Compared to YOLOv5, our improved TPH-YOLOv5 can better handle drone-captured images.
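To make these two building blocks concrete, the following is a minimal PyTorch sketch of a Transformer encoder block (the core of a TPH) and of CBAM. Dimensions, head counts and module names are illustrative assumptions for this sketch, not the authors' exact configuration.

```python
# Minimal sketch of the two blocks described above. Requires PyTorch >= 1.9
# for batch_first in nn.MultiheadAttention. All sizes are illustrative.
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block as used inside a Transformer Prediction Head (TPH):
    multi-head self-attention followed by an MLP, each with a residual."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                    # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)   # -> (B, H*W, C) token sequence
        y = self.norm1(seq)
        seq = seq + self.attn(y, y, y, need_weights=False)[0]
        seq = seq + self.mlp(self.norm2(seq))
        return seq.transpose(1, 2).reshape(b, c, h, w)

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed
    sequentially by spatial attention, as described in the text."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.shared_mlp = nn.Sequential(      # shared MLP for pooled descriptors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention from global average- and max-pooled descriptors.
        avg = self.shared_mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.shared_mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

In YOLOv5 terms, such encoder blocks would stand in for bottleneck blocks in the prediction heads, with CBAM inserted between stages; the exact placement is our reading of Fig. 2, not a verified configuration.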
To further improve the performance of TPH-YOLOv5, we employ a bag of tricks (Fig. 2). Specifically, we adopt data augmentation during training, which promotes adaptation to the dramatic size changes of objects in images. We also add multi-scale testing (ms-testing) and multi-model ensemble strategies during inference to obtain more convincing detection results. Moreover, by visualizing failure cases, we find that our proposed architecture has excellent localization ability but poor classification ability, especially on similar categories like "tricycle" and "awning-tricycle". To solve this problem, we provide a self-trained classifier (ResNet18 [17]) trained on image patches cropped from the training data. With the self-trained classifier, our method gains a 0.8%–1.0% improvement in AP.
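As a rough illustration of this trick, the sketch below fine-tunes a ResNet18 on ground-truth patches cropped from the training images; the directory layout, patch size, class count and training schedule are placeholders, not the paper's exact recipe.

```python
# Hedged sketch of the self-trained classifier: fine-tune a ResNet18 on
# patches cropped from annotated boxes, then use its predictions to refine
# the class of each detected box at inference time.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

NUM_CLASSES = 10  # VisDrone defines 10 object categories

model = torchvision.models.resnet18(pretrained=True)  # weights=... on newer torchvision
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Assumes patches were pre-cropped from ground-truth boxes into per-class
# folders, so the standard ImageFolder layout applies.
train_tf = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])
dataset = torchvision.datasets.ImageFolder("patches/train", transform=train_tf)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
for images, labels in loader:  # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```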
Our contributions are listed as follows:
• We add one more prediction head to deal with the large scale variance of objects.
• We integrate Transformer Prediction Heads (TPH) into YOLOv5, which can accurately localize objects in high-density scenes.
• We integrate CBAM into YOLOv5, which helps the network find regions of interest in images with large region coverage.
• We provide a bag of useful tricks and filter out useless ones for the object detection task on drone-captured scenarios.
• We use a self-trained classifier to improve the classification ability on confusing categories.
• On the VisDrone2021 test-challenge dataset, our proposed TPH-YOLOv5 achieves 39.18% AP, outperforming DPNetV3 (the previous SOTA method) by 1.81%. In the VisDrone2021 DET challenge, TPH-YOLOv5 wins 5th place and has only a minor gap to the 1st-place models.
2. Related Work
2.1. Data Augmentation
The effectiveness of data augmentation lies in expanding the dataset so that the model becomes more robust to images obtained from different environments. Photometric distortions and geometric distortions are widely used by researchers. For photometric distortion, we adjust the hue, saturation and value of the images. For geometric distortion, we add random scaling, cropping, translation, shearing and rotation. In addition to these global pixel augmentation methods, there are some more distinctive data augmentation methods. Some researchers have proposed methods that combine multiple images for augmentation, namely MixUp [57], CutMix [56] and Mosaic [2]. MixUp randomly selects two samples from the training images and performs a random weighted summation of them; the labels of the samples follow the same weighted summation. Unlike occlusion methods that generally use a zero-pixel "black cloth" to occlude an image, CutMix covers the occluded area with a region of another image. Mosaic is an improved version of CutMix: it stitches four images together, which greatly enriches the background of the detected objects. In addition, batch normalization then calculates activation statistics over four different images at each layer.
In TPH-YOLOv5, we use a combination of MixUp, Mosaic and traditional methods for data augmentation, as sketched below.
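Here is a minimal MixUp sketch: it randomly pairs samples within a batch and takes the same weighted sum of both the images and their one-hot labels. The Beta(alpha, alpha) sampling follows the original MixUp paper; alpha is an illustrative hyperparameter, not a value from this paper.

```python
import torch

def mixup(images, one_hot_labels, alpha=0.2):
    # Sample the mixing weight from a Beta prior, then pair each sample
    # with a randomly permuted partner from the same batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_images, mixed_labels
```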
2.2. Multi-Model Ensemble Method in Object Detection
Deep neural networks are non-linear methods. They provide great flexibility and can scale in proportion to the amount of training data. One disadvantage of this flexibility is that they learn via stochastic training algorithms, which makes them sensitive to the details of the training data: they may find a different set of weights each time they are trained, resulting in different predictions. This gives neural networks a high variance. A successful way to reduce the variance of neural network models is to train multiple models instead of a single model and combine their predictions.
There are three different methods to ensemble boxes from different object detection models: non-maximum suppression (NMS) [36], Soft-NMS [53] and weighted boxes fusion (WBF) [43]. In NMS, if the overlap (intersection over union, IoU) of boxes is higher than a certain threshold, they are considered to belong to the same object. For each object, NMS keeps only the bounding box with the highest confidence and deletes the others. The box filtering process therefore depends on the choice of this single IoU threshold, which has a big impact on model performance. Soft-NMS makes a slight change to NMS that yields a significant improvement over traditional NMS on standard benchmark datasets (such as PASCAL VOC [10] and MS COCO [30]): instead of setting the confidence scores of adjacent bounding boxes to zero and deleting them, it applies an attenuation function to their confidences based on the IoU value. WBF works differently: both NMS and Soft-NMS exclude some boxes, while WBF merges all boxes to form the final result, so it can compensate for inaccurate predictions from individual models. We use WBF to ensemble the final models, which performs much better than NMS.
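To make the attenuation idea concrete, the following is a NumPy sketch of Soft-NMS with a Gaussian decay; sigma and score_thr are tunable assumptions, and for WBF itself the authors of [43] provide a reference implementation.

```python
# Soft-NMS sketch: instead of deleting boxes that overlap the current
# top-scoring box, their confidences are decayed as a function of IoU.
# Boxes are (x1, y1, x2, y2) arrays of shape (N, 4).
import numpy as np

def iou_one_to_many(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    boxes, scores = boxes.copy(), scores.copy()
    kept = []
    while len(boxes) > 0:
        top = scores.argmax()
        kept.append((boxes[top], scores[top]))
        decay = np.exp(-iou_one_to_many(boxes[top], boxes) ** 2 / sigma)
        scores = scores * decay               # Gaussian attenuation, not deletion
        mask = scores > score_thr
        mask[top] = False                     # drop the box we just kept
        boxes, scores = boxes[mask], scores[mask]
    return kept
```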
2.3. Object Detection
CNN-based object detectors can be divided into many types: 1) one-stage detectors: YOLOX [11], FCOS [48], DETR [65], Scaled-YOLOv4 [51], EfficientDet [45]; 2) two-stage detectors: VFNet [59], CenterNet2 [62]; 3) anchor-based detectors: Scaled-YOLOv4 [51], YOLOv5 [21]; 4) anchor-free detectors: CenterNet [63], YOLOX [11], RepPoints [55]. Some detectors are specially designed for drone-captured images, such as RRNet [4], PENet [46] and CenterNet [63]. From the perspective of components, however, they generally consist of two parts: a CNN-based backbone used for image feature extraction, and a detection head used to predict the class and bounding box of each object. In addition, object detectors developed in recent years often insert layers between the backbone and the head, usually called the neck of the detector. Next, we introduce these three structures in detail.
Backbone. Commonly used backbones include VGG [42], ResNet [17], DenseNet [20], MobileNet [19], EfficientNet [44], CSPDarknet53 [52], Swin Transformer [35], etc., rather than networks designed from scratch, because these networks have proven strong feature extraction capabilities on classification and other problems. Researchers also fine-tune the backbone to make it more suitable for their specific tasks.
Neck. The neck is designed to make better use of the features extracted by the backbone. It reprocesses and rationally uses the feature maps extracted by the backbone at different stages. Usually, a neck consists of several bottom-up paths and several top-down paths, and it is a key link in the object detection framework. The earliest necks used up- and down-sampling blocks without any feature-layer aggregation operation; for example, SSD [34] attaches the head directly to the multi-level feature maps. Commonly used path-aggregation blocks in necks are FPN [28], PANet [33], NAS-FPN [12], BiFPN [45], ASFF [32] and SFAM [61]. What these methods have in common is the repeated use of various up- and down-sampling, splicing, dot-sum or dot-product operations to design aggregation strategies.