TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

Xingkui Zhu*, Shuchang Lyu*, Xu Wang, Qi Zhao†
Beihang University, Beijing, China
{adlith, lyushuchang, sy2002406, zhaoqi}@buaa.edu.cn

*Contributed equally. †Corresponding author.
Abstract
Object detection on drone-captured scenarios has recently become a popular task. As drones always navigate at different altitudes, the object scale varies drastically, which burdens the optimization of networks. Moreover, high-speed and low-altitude flight brings motion blur on densely packed objects, which makes distinguishing objects a great challenge. To solve these two issues, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect objects of different scales. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential with the self-attention mechanism. We also integrate the Convolutional Block Attention Module (CBAM) to find attention regions in scenarios with dense objects. To further improve our proposed TPH-YOLOv5, we provide a bag of useful strategies such as data augmentation, multi-scale testing, multi-model ensemble and an extra classifier. Extensive experiments on the VisDrone2021 dataset show that TPH-YOLOv5 performs well, with impressive interpretability, on drone-captured scenarios. On the DET-test-challenge dataset, the AP of TPH-YOLOv5 is 39.18%, better than the previous SOTA method (DPNetV3) by 1.81%. In the VisDrone Challenge 2021, TPH-YOLOv5 wins 5th place and achieves results well matched with the 1st-place model (AP 39.43%). Compared to the baseline model (YOLOv5), TPH-YOLOv5 improves AP by about 7%, which is encouraging and competitive.
1. Introduction
Object detection technology on drone-captured scenarios has been widely used in many practical applications, such as plant protection [18, 41], wildlife protection [23, 22] and urban surveillance [1, 15]. In this paper, we focus on improving the performance of object detection on drone-captured images and providing insight for the above-mentioned applications.

Figure 1. Intuitive cases illustrating the three main problems in object detection on drone-captured images. The first, second and third rows respectively show the size variation, high density and large coverage of objects in drone-captured images.
Recent years have witnessed significant progress in object detection tasks using deep convolutional neural networks [40, 37, 34, 27, 58]. Notable benchmark datasets like MS COCO [30] and PASCAL VOC [9] have greatly promoted the development of object detection applications.

Figure 2. Overview of the working pipeline of TPH-YOLOv5. Compared to the original version, we mainly improve the head by applying Transformer Prediction Heads (TPH). We also add one more head to better detect objects of different scales. In addition, we employ a bag of tricks such as data augmentation, multi-scale testing, model ensemble and a self-trained classifier to make TPH-YOLOv5 stronger.
However, most previous deep convolutional neural networks are designed for natural-scene images. Directly applying previous models to the object detection task on drone-captured scenarios raises three main problems, intuitively illustrated by the cases in Fig. 1. First, the object scale varies drastically because the flight altitude of drones changes greatly. Second, drone-captured images contain objects in high density, which brings occlusion between objects. Third, drone-captured images always contain confusing geographic elements because they cover large areas. These three problems make object detection on drone-captured images very challenging.
In object detection, the YOLO series [37, 38, 39, 2] plays an important role among one-stage detectors. In this paper, we propose an improved model, TPH-YOLOv5, based on YOLOv5 [21] to solve the above-mentioned three problems. The overview of the detection pipeline using TPH-YOLOv5 is shown in Fig. 2. Following the original version, we use CSPDarknet53 [52, 2] as the backbone and the path aggregation network (PANet [33]) as the neck of TPH-YOLOv5. In the head part, we first introduce one more head for tiny-object detection. In total, TPH-YOLOv5 contains four detection heads, used separately for the detection of tiny, small, medium and large objects. Then, we replace the original prediction heads with Transformer Prediction Heads (TPH) [7, 49] to explore the prediction potential with the self-attention mechanism. To find the attention region in images with large coverage, we adopt the Convolutional Block Attention Module (CBAM [54]) to sequentially generate attention maps along the channel and spatial dimensions. Compared to YOLOv5, our improved TPH-YOLOv5 can better handle drone-captured images.
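To make these two building blocks concrete, the following is a minimal PyTorch sketch of a Transformer encoder block (the core of a TPH) and of CBAM. Dimensions, head counts and module names are illustrative assumptions for this sketch, not the authors' exact configuration.

```python
# Minimal sketch of the two blocks described above. Requires PyTorch >= 1.9
# for batch_first in nn.MultiheadAttention. All sizes are illustrative.
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block as used inside a Transformer Prediction Head (TPH):
    multi-head self-attention followed by an MLP, each with a residual."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                    # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)   # -> (B, H*W, C) token sequence
        y = self.norm1(seq)
        seq = seq + self.attn(y, y, y, need_weights=False)[0]
        seq = seq + self.mlp(self.norm2(seq))
        return seq.transpose(1, 2).reshape(b, c, h, w)

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed
    sequentially by spatial attention, as described in the text."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.shared_mlp = nn.Sequential(      # shared MLP for pooled descriptors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention from global average- and max-pooled descriptors.
        avg = self.shared_mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.shared_mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

In YOLOv5 terms, such encoder blocks would stand in for bottleneck blocks in the prediction heads, with CBAM inserted between stages; the exact placement is our reading of Fig. 2, not a verified configuration.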
To further improve the performance of TPH-YOLOv5, we employ a bag of tricks (Fig. 2). Specifically, we adopt data augmentation during training, which promotes adaptation to the dramatic size changes of objects in images. We also add multi-scale testing (ms-testing) and multi-model ensemble strategies during inference to obtain more convincing detection results. Moreover, by visualizing failure cases, we find that our proposed architecture has excellent localization ability but poor classification ability, especially on similar categories like "tricycle" and "awning-tricycle". To solve this problem, we provide a self-trained classifier (ResNet18 [17]) trained on image patches cropped from the training data. With the self-trained classifier, our method gains a 0.8%–1.0% improvement in AP.
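As a rough illustration of this trick, the sketch below fine-tunes a ResNet18 on ground-truth patches cropped from the training images; the directory layout, patch size, class count and training schedule are placeholders, not the paper's exact recipe.

```python
# Hedged sketch of the self-trained classifier: fine-tune a ResNet18 on
# patches cropped from annotated boxes, then use its predictions to refine
# the class of each detected box at inference time.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

NUM_CLASSES = 10  # VisDrone defines 10 object categories

model = torchvision.models.resnet18(pretrained=True)  # weights=... on newer torchvision
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Assumes patches were pre-cropped from ground-truth boxes into per-class
# folders, so the standard ImageFolder layout applies.
train_tf = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])
dataset = torchvision.datasets.ImageFolder("patches/train", transform=train_tf)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
for images, labels in loader:  # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```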
Our contributions are listed as follows:
• We add one more prediction head to deal with the large scale variance of objects.
• We integrate Transformer Prediction Heads (TPH) into YOLOv5, which can accurately localize objects in high-density scenes.
• We integrate CBAM into YOLOv5, which helps the network find regions of interest in images with large region coverage.
• We provide a bag of useful tricks and filter out useless ones for the object detection task on drone-captured scenarios.
• We use a self-trained classifier to improve the classification ability on confusing categories.
• On the VisDrone2021 test-challenge dataset, our proposed TPH-YOLOv5 achieves 39.18% AP, outperforming DPNetV3 (the previous SOTA method) by 1.81%. In the VisDrone2021 DET challenge, TPH-YOLOv5 wins 5th place and has only a minor gap to the 1st-place models.
2. Related Work
2.1. Data Augmentation
The effectiveness of data augmentation lies in expanding the dataset so that the model becomes more robust to images obtained from different environments. Photometric distortions and geometric distortions are widely used by researchers. For photometric distortion, we adjust the hue, saturation and value of the images. For geometric distortion, we add random scaling, cropping, translation, shearing and rotation. In addition to these global pixel augmentation methods, there are some more distinctive data augmentation methods. Some researchers have proposed methods that combine multiple images for augmentation, namely MixUp [57], CutMix [56] and Mosaic [2]. MixUp randomly selects two samples from the training images and performs a random weighted summation of them; the labels of the samples follow the same weighted summation. Unlike occlusion methods that generally use a zero-pixel "black cloth" to occlude an image, CutMix covers the occluded area with a region of another image. Mosaic is an improved version of CutMix: it stitches four images together, which greatly enriches the background of the detected objects. In addition, batch normalization then calculates activation statistics over four different images at each layer.
In TPH-YOLOv5, we use a combination of MixUp, Mosaic and traditional methods for data augmentation, as sketched below.
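Here is a minimal MixUp sketch: it randomly pairs samples within a batch and takes the same weighted sum of both the images and their one-hot labels. The Beta(alpha, alpha) sampling follows the original MixUp paper; alpha is an illustrative hyperparameter, not a value from this paper.

```python
import torch

def mixup(images, one_hot_labels, alpha=0.2):
    # Sample the mixing weight from a Beta prior, then pair each sample
    # with a randomly permuted partner from the same batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_images, mixed_labels
```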
2.2. Multi-Model Ensemble Method in Object Detection
Deep neural networks are non-linear methods. They provide great flexibility and can scale in proportion to the amount of training data. One disadvantage of this flexibility is that they learn via stochastic training algorithms, which makes them sensitive to the details of the training data: they may find a different set of weights each time they are trained, resulting in different predictions. This gives neural networks a high variance. A successful way to reduce the variance of neural network models is to train multiple models instead of a single model and combine their predictions.
There are three different methods to ensemble boxes from different object detection models: non-maximum suppression (NMS) [36], Soft-NMS [53] and weighted boxes fusion (WBF) [43]. In NMS, if the overlap (intersection over union, IoU) of boxes is higher than a certain threshold, they are considered to belong to the same object. For each object, NMS keeps only the bounding box with the highest confidence and deletes the others. The box filtering process therefore depends on the choice of this single IoU threshold, which has a big impact on model performance. Soft-NMS makes a slight change to NMS that yields a significant improvement over traditional NMS on standard benchmark datasets (such as PASCAL VOC [10] and MS COCO [30]): instead of setting the confidence scores of adjacent bounding boxes to zero and deleting them, it applies an attenuation function to their confidences based on the IoU value. WBF works differently: both NMS and Soft-NMS exclude some boxes, while WBF merges all boxes to form the final result, so it can compensate for inaccurate predictions from individual models. We use WBF to ensemble the final models, which performs much better than NMS.
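To make the attenuation idea concrete, the following is a NumPy sketch of Soft-NMS with a Gaussian decay; sigma and score_thr are tunable assumptions, and for WBF itself the authors of [43] provide a reference implementation.

```python
# Soft-NMS sketch: instead of deleting boxes that overlap the current
# top-scoring box, their confidences are decayed as a function of IoU.
# Boxes are (x1, y1, x2, y2) arrays of shape (N, 4).
import numpy as np

def iou_one_to_many(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    boxes, scores = boxes.copy(), scores.copy()
    kept = []
    while len(boxes) > 0:
        top = scores.argmax()
        kept.append((boxes[top], scores[top]))
        decay = np.exp(-iou_one_to_many(boxes[top], boxes) ** 2 / sigma)
        scores = scores * decay               # Gaussian attenuation, not deletion
        mask = scores > score_thr
        mask[top] = False                     # drop the box we just kept
        boxes, scores = boxes[mask], scores[mask]
    return kept
```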
2.3. Object Detection
CNN-based object detectors can be divided into many types: 1) one-stage detectors: YOLOX [11], FCOS [48], DETR [65], Scaled-YOLOv4 [51], EfficientDet [45]; 2) two-stage detectors: VFNet [59], CenterNet2 [62]; 3) anchor-based detectors: Scaled-YOLOv4 [51], YOLOv5 [21]; 4) anchor-free detectors: CenterNet [63], YOLOX [11], RepPoints [55]. Some detectors are specially designed for drone-captured images, such as RRNet [4], PENet [46] and CenterNet [63]. From the perspective of components, however, they generally consist of two parts: a CNN-based backbone used for image feature extraction, and a detection head used to predict the class and bounding box of each object. In addition, object detectors developed in recent years often insert layers between the backbone and the head, usually called the neck of the detector. Next, we introduce these three structures in detail.
Backbone. Commonly used backbones include VGG [42], ResNet [17], DenseNet [20], MobileNet [19], EfficientNet [44], CSPDarknet53 [52], Swin Transformer [35], etc., rather than networks designed from scratch, because these networks have proven strong feature extraction capabilities on classification and other problems. Researchers also fine-tune the backbone to make it more suitable for their specific tasks.
Neck. The neck is designed to make better use of the features extracted by the backbone. It reprocesses and rationally uses the feature maps extracted by the backbone at different stages. Usually, a neck consists of several bottom-up paths and several top-down paths, and it is a key link in the object detection framework. The earliest necks used up- and down-sampling blocks without any feature-layer aggregation operation; for example, SSD [34] attaches the head directly to the multi-level feature maps. Commonly used path-aggregation blocks in necks are FPN [28], PANet [33], NAS-FPN [12], BiFPN [45], ASFF [32] and SFAM [61]. What these methods have in common is the repeated use of various up- and down-sampling, splicing, dot-sum or dot-product operations to design aggregation strategies.