HyperYOLO: A Real-Time Lightweight Object Detection Framework for Multimodal Remote Sensing Imagery
This article presents HyperYOLO, an object detection framework for multimodal remote sensing images that offers faster inference and lower computational cost. The architecture comprises a channel and spatial exchange (CSE) fusion module that efficiently extracts complementary information across modalities, and an auxiliary branch based on a feature pyramid network for super resolution (FPNSR) that improves small-object recognition accuracy while adding no extra computational burden at deployment. Experiments show that, compared with SOTA methods such as SuperYOLO, the proposed scheme markedly reduces the parameter count and reaches a new level of average precision on the VEDAI remote sensing dataset, making it well suited to resource-constrained platforms such as satellites and unmanned aerial vehicles.

Intended audience: researchers and developers in machine vision and remote sensing, especially those building high-performance visual perception for unmanned aerial vehicles (UAVs), miniaturized sensing, and low-resource environments.

Usage scenarios and goals: object detection on multimodal remote sensing imagery from visible-light and infrared sensors, particularly the recognition and localization of small-scale targets, with real-time efficiency suitable for deployment on devices with limited compute.

The article details the design of the novel modules and the associated experimental analysis, offering theoretical and practical reference for understanding how each component improves system performance, with implications for optimizing and accelerating future model iterations.
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 17, 2024
Object Detection by Channel and Spatial Exchange
for Multimodal Remote Sensing Imagery
Guozheng Nan, Yue Zhao, Liyong Fu, and Qiaolin Ye, Member, IEEE
Abstract—Smart satellites and unmanned aerial vehicles (UAVs)
are typically equipped with visible light and infrared (IR) spec-
trum sensors. However, achieving real-time object detection uti-
lizing these multimodal data on such resource-limited devices is a
challenging task. This article proposes HyperYOLO, a real-time
lightweight object detection framework for multimodal remote
sensing images. First, we propose a lightweight multimodal fusion
module named channel and spatial exchange (CSE) to effectively
extract complementary information from different modalities. The
CSE module consists of two stages: channel exchange and spatial
exchange. Channel exchange achieves global fusion by learning
global weights to better utilize cross-channel information corre-
lation, while spatial exchange captures details by considering spa-
tial relationships to calibrate local fusion. Second, we propose an
effective auxiliary branch module based on the feature pyramid
network for super resolution (FPNSR) to enhance the framework’s
responsiveness to small objects by learning high-quality feature
representations. Moreover, we embed a coordinate attention mech-
anism to assist our network in precisely localizing and attending to
the objects of interest. The experimental results show that on the
VEDAI remote sensing dataset, HyperYOLO achieves a 76.72%
mAP50, surpassing the SOTA SuperYOLO by 1.63%. Meanwhile,
the parameter size and GFLOPs of HyperYOLO are about 1.34
million (28%) and 3.97 (22%) less than SuperYOLO, respectively.
In addition, HyperYOLO has a file size of only 7.3 MB after the
removal of the auxiliary FPNSR branch, which makes it easier to
deploy on these resource-constrained devices.
Index Terms—Multimodal feature fusion, remote sensing image
(RSI), RGB-infrared object detection, super resolution (SR).
I. INTRODUCTION
Object detection is an important task in the field of
remote sensing image (RSI) processing, which not only
contributes to applications in monitoring natural disasters and
military reconnaissance but also has far-reaching impacts on
Manuscript received 9 March 2024; accepted 7 April 2024. Date of publication
12 April 2024; date of current version 24 April 2024. This work was supported
in part by the National Key Research and Development Program under Grant
2022YFD2201005-03, in part by the National Natural Science Foundation of
China under Grant 62072246 and Grant 32371877, in part by the Technology
Winter Olympics Special Project under Grant 201001D, and in part by the Forest
Fire Comprehensive System Construction-Unmanned Aerial Patrol Monitoring
System of Chongli under Grant DA2-20001. (Corresponding author: Qiaolin
Ye.)
Guozheng Nan, Yue Zhao, and Qiaolin Ye are with the College of In-
formation Science and Technology, College of Artificial Intelligence, Nan-
jing Forestry University, Nanjing 210037, China (e-mail: gzn@njfu.edu.cn;
zyue0109@163.com; yqlcom@njfu.edu.cn).
Liyong Fu is with the Institute of Forest Resource Information Techniques,
Chinese Academy of Forestry, Beijing 100091, China, and also with the College
of Forestry, Hebei Agricultural University, Baoding 071000, China.
Digital Object Identifier 10.1109/JSTARS.2024.3388013
urban planning and forest management. Traditional image fea-
ture extraction [1], [2] is important in computer vision, but its
performance in object detection is limited by complex visual
patterns and manual engineering. On the contrary, deep learning
significantly improves detection performance by automatically
learning discriminative features. In recent years, due to the
rapid development of deep learning technology, many excellent
algorithms [3], [4], [5], [6] have emerged in the field of object
detection.
However, compared to general object detection tasks, RSIs
have various characteristics such as complex backgrounds, small
and densely arranged objects, and shadow occlusion. Therefore,
it is necessary to adjust and optimize the model structure ac-
cording to the characteristics of RSIs. Traditional object detec-
tion algorithms [7], [8], [9] are typically designed based on a
single modality, primarily utilizing the visual information from
images for detection, lacking the assistance of other modalities’
information. This may result in limited feature representation
capability and difficulty capturing the diversity and contextual
information of objects in complex scenes. In addition, different
sensors may encounter various defects and noise when acquiring
target detection data [10]. These sensors can be influenced by
factors such as weather conditions, terrain, obstructions, and
shadows, leading to a decrease in image quality or incomplete
target information. Therefore, relying solely on a single modality
for target detection may have certain limitations. Furthermore,
certain targets may be difficult to discern in specific modalities
but may become more apparent in other modalities. For instance,
in RGB images, some targets may blend with the background
color, making them challenging to distinguish. However, in
IR images, these targets may exhibit distinct thermal features
compared to the surrounding environment, making them easier
to identify. By fusing information from both RGB and IR modal-
ities, it is possible to enhance the visibility and distinctiveness of
targets under different spectra. Manish et al. [11] improved the
performance of detection through the introduction of a fusion
strategy for multimodal data. To discover potential correlations
between different modalities, many researchers employ complex
fusion modules, such as transformer [12] and illumination-
aware [13], which lead to increased computational complex-
ity. Similarly, widely adopted fusion methods, encompassing
feature-level and decision-level fusion [14], [15], [16], may lead
to redundant computations among different modality branches
or the introduction of additional backbone networks, thereby
restricting the deployment of the model. The recent development
direction of remote sensing object detection algorithms [9], [17],
[18] involves the use of super resolution (SR) technology to learn
the mapping relationship from low resolution (LR) images to
high resolution (HR) images, enabling the reconstruction and
detection of LR images. Although this approach can improve
the performance of small target detection by increasing the
detailed information in the image, the benefit comes at the cost
of increased model complexity, introducing a certain level of
intricacy and time overhead. Zhang et al. [19] introduced an
auxiliary SR branch to guide the detector in learning high-quality
HR representations, facilitating the distinction of small objects
from the LR input background. It is worth noting that many
upsampling methods, like bilinear and nearest-neighbor inter-
polation, estimate new pixel values from nearby pixels. While
these increase image size, they may lose texture details, and thus,
hinder small object reconstruction in LR images. In addition,
many methods fail to fully utilize multiscale information when
increasing the size of images, which limits their performance.
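To make this concrete, the short PyTorch sketch below (an illustration, not code from the paper) contrasts fixed interpolation with learned sub-pixel upsampling: interpolation only redistributes existing pixel values, whereas the convolution feeding PixelShuffle has trainable weights that can learn to synthesize texture detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)  # an LR feature map: (batch, channels, H, W)

# Fixed interpolation: each new pixel is estimated from nearby pixels,
# so the enlarged map contains no information beyond the input.
up_nearest = F.interpolate(x, scale_factor=2, mode="nearest")
up_bilinear = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

# Learned sub-pixel upsampling: a convolution expands channels by r^2,
# then PixelShuffle rearranges them into an (rH, rW) grid, letting the
# network learn to reconstruct high-frequency detail.
r = 2
to_subpixel = nn.Conv2d(64, 64 * r * r, kernel_size=3, padding=1)
up_learned = F.pixel_shuffle(to_subpixel(x), upscale_factor=r)

print(up_nearest.shape, up_bilinear.shape, up_learned.shape)
# all three: torch.Size([1, 64, 64, 64])
```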
In recent years, the you only look once (YOLO) series of
algorithms [7], [20], [21], [22], [23] has emerged as a repre-
sentative in the field of object detection due to its rapid, accu-
rate, and proven engineering capabilities. To further enhance
real-time performance and achieve efficient object detection in
computationally constrained environments, several lightweight
and real-time improvement algorithms based on YOLO have
emerged. Zi et al. [24] proposed TP-YOLO, integrating self-
attention mechanisms and omnidimensional dynamic convolu-
tion (ODConv) [25] into YOLOv8 [23] to improve small target
detection while reducing parameters and computational com-
plexity. However, ODConv may not be suitable for sparse image
processing since it requires computing unique convolution ker-
nels for each position and channel, making it difficult to achieve
stable and reliable convolution operations with sparse input data.
In addition, Nvidia’s acceleration library, TensorRT, is not very
friendly toward ODConv operations. Zhang et al. [26] intro-
duced FFCA-YOLO, an improved version of YOLOv5 [22], to
address the issue of insufficient feature representation in small
object detection, achieving performance enhancement through
feature fusion and spatial context-aware modules while mini-
mizing complexity. However, FFCA-YOLO is an algorithm de-
signed based on a single modality, and its detection performance
may not meet expectations when encountering extreme weather
conditions or severe target occlusion.
In light of the aforementioned challenges, we propose a real-
time object detection framework for multimodal RSIs that excels
not only in having fewer parameters and lower computational
complexity but also in delivering exceptional detection accu-
racy. First, we choose the YOLOv7tiny [7] architecture as our
detection baseline. It has a smaller network structure and fewer
parameters, making it easier to deploy and run with limited com-
putational resources. Second, we propose a novel lightweight
fusion module employing channel and spatial exchange (CSE)
to ensure that each modality retains its unique features while
effectively integrating the prominent features of other modali-
ties. Moreover, an effective auxiliary branch module based on
the feature pyramid network [8] for super resolution (FPNSR),
which uses the PixelShuffle (PS) upsampling [27] method and
multiscale feature information, is designed to preserve the fea-
tures of small targets in LR input within the backbone network.
Meanwhile, it is necessary to remove the FPNSR branch to
avoid additional computational overhead in the inference and
deployment stages. Finally, we introduce a coordinate attention
(CA) [28] mechanism to suppress interference from complex
backgrounds on small objects, while focusing on the regions of
interest to enhance boundary accuracy and preserve fine details.
In summary, the main contributions of our article are as follows.
1) We propose a computationally efficient lightweight multimodal fusion method that symmetrically and compactly combines internal information in a bidirectional manner, utilizing spatial and channel relationships so that the upper and lower branches each attend to the other's complementary information (a minimal illustrative sketch follows this list).
2) We designed the FPNSR auxiliary branch to address in-
conspicuous features in small targets within LR inputs,
utilizing SR techniques to enhance the model’s ability to
learn HR feature representations and improve the identi-
fication of small objects in cluttered backgrounds.
3) The FPNSR branch is a flexible component, guiding the
network to enhance the detector’s responsiveness to small
targets during training, and it can be removed during de-
ployment to reduce computational demands while main-
taining target detection accuracy.
4) We introduce the CA mechanism to enable our frame-
work to focus on specific regions within the target image,
thereby more effectively capturing crucial information
and achieving precise localization and recognition of the
target.
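As referenced in contribution 1, the PyTorch sketch below illustrates one plausible two-stage channel-and-spatial exchange fusion for RGB/IR features. It is a hypothetical reconstruction from the description in this article; the gating design, exchange ratio, and 7x7 spatial convolution are assumptions, not the authors' exact CSE module.

```python
import torch
import torch.nn as nn

class CSEFusion(nn.Module):
    """Illustrative two-stage exchange fusion for RGB/IR features.

    A hypothetical reconstruction from the textual description, not the
    authors' exact CSE implementation.
    """

    def __init__(self, channels: int, exchange_ratio: float = 0.5):
        super().__init__()
        # Channel-exchange stage: one global gate per modality that scores
        # every channel (learned global weights).
        self.gate_rgb = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
        )
        self.gate_ir = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
        )
        # Spatial-exchange stage: per-pixel soft weights over both modalities.
        self.spatial = nn.Sequential(
            nn.Conv2d(2 * channels, 2, kernel_size=7, padding=3), nn.Sigmoid()
        )
        self.k = int(channels * exchange_ratio)  # channels to swap per branch

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # Stage 1: channel exchange -- replace each branch's lowest-weighted
        # channels with the other modality's channels at the same indices.
        w_rgb = self.gate_rgb(rgb).flatten(1)  # (B, C) global channel weights
        w_ir = self.gate_ir(ir).flatten(1)
        idx_rgb = w_rgb.argsort(dim=1)[:, : self.k]
        idx_ir = w_ir.argsort(dim=1)[:, : self.k]
        rgb_x, ir_x = rgb.clone(), ir.clone()
        for b in range(rgb.size(0)):
            rgb_x[b, idx_rgb[b]] = ir[b, idx_rgb[b]]
            ir_x[b, idx_ir[b]] = rgb[b, idx_ir[b]]
        # Stage 2: spatial exchange -- per-location weights calibrate the
        # local contribution of each modality before the final fusion.
        s = self.spatial(torch.cat([rgb_x, ir_x], dim=1))  # (B, 2, H, W)
        return s[:, 0:1] * rgb_x + s[:, 1:2] * ir_x
```

The channel stage swaps whole channels between branches guided by learned global weights, while the spatial stage reweights each location, matching the global/local division of labor described above.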
II. RELATED WORK
A. Remote Sensing Object Detection
HR satellite RSIs usually contain rich information about
ground objects, making the high-precision extraction of these
objects through the application of remote sensing object detec-
tion a current hot research direction [29], [30]. Object detection
algorithms often fall into two categories: two-stage methods
based on anchor boxes, such as Faster R-CNN [31] and Mask
R-CNN [32], and single-stage detectors like SSD [33] and
the YOLO series [22], [34], [35], among others. Recently,
some anchor-free detectors, such as the fully convolutional
FCOS [36], and transformer-based algorithms like deformable
DETR [37] and Swin Transformer [38], have also achieved
remarkable results. The two-stage detection method consists
of two core components: the feature extraction network and
the candidate region generation network. Initially, the feature
extraction network identifies the regions of interest in the image.
Subsequently, the candidate region generation network further
processes the features to generate potential target candidate
regions. While this two-stage approach achieves high accuracy, it
comes at the cost of slower detection speed. On the contrary, the
single-stage detection method directly predicts the category and
location information of the object without generating candidate
regions to improve the detection speed. Given the significant
differences between general objects and remote sensing objects,
researchers have proposed various solutions built on the one-stage YOLO series of detectors to effectively address small object detection in remote sensing imagery. Zakria
et al. [39] proposed an improvement for the YOLOv4 [21]
network by introducing a nonmaximum suppression threshold
classification setting and an anchor box assignment scheme.
Lin et al. [30] introduced a decoupled detection head and
terminal attention mechanism to enhance the target localiza-
tion performance of the YOLOv5 framework. Yi et al. [40]
used a vision transformer for feature extraction and designed an
attention-guided bidirectional FPN to improve the performance
of YOLOv8 for small target detection in RSIs.
Although the aforementioned improvement methods have
made progress in certain aspects, they only utilize unimodal data
and fail to fully exploit the inherent value of multimodal data.
The vibrant advancement of imaging technology offers greater
opportunities for collecting multimodal data in RSIs, presenting
new possibilities to improve the accuracy of RSI analysis and
detection.
B. SR in Object Detection
Existing methods for small object detection in RSIs mainly fo-
cus on two aspects: context information and multiscale process-
ing [41], [42], [43]. However, these methods overlook a crucial
issue, which is the severe loss of feature information for small
objects after multiple downsampling operations in RSIs, as well
as the inadequate preservation of HR contextual information. As
a result, the current models exhibit poor detection accuracy for
RSIs. The SR technology is a highly regarded research direction
in the fields of computer vision and image processing [44],
[45]. In recent years, it has made significant advancements and
found widespread application, particularly in the domain of RSI
processing [18]. Ji et al. [9] introduced a two-branch network
for simultaneous SR and target detection, where the SR branch
generates HR feature maps for use in the detection branch, and
jointly optimizes the SR and target detection losses to train the
network. Courtrai et al. [46] improved small object detection in
RSIs by using generative adversarial networks (GANs) and the
EDSR [47] network to generate HR images for input into a detec-
tor. Although the above methods address the challenge of small
target detection to some extent, they are not suitable for practical
deployment of models in real-time application scenarios due to
the introduction of a large number of additional computations,
including the complexity associated with SR techniques and the
fact that HR features increase the complexity of the detection
model.
Recently, Wang et al. [48] and Zhang et al. [19] each proposed an SR module that maintains an HR representation even with LR inputs, while reducing model computation in segmentation and detection tasks. However, a limitation of these
SR modules is their underutilization of multiscale features. In
SR networks, multiscale features play a crucial role in enhancing
reconstruction quality and preserving fine details. By integrating
information from different scales, the network can better capture
subtle changes and texture details in the image. Building upon
these structures, we propose a multiscale SR module, namely, the
FPNSR module, that achieves high-quality LR reconstruction to
preserve HR representation by integrating features from multiple
scales.
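A minimal sketch of how such a multiscale SR auxiliary branch could be wired is shown below, assuming three backbone scales, FPN-style lateral connections, and PixelShuffle reconstruction; the layer widths and merge order are assumptions rather than the paper's exact FPNSR design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNSRBranch(nn.Module):
    """Illustrative FPN-style super-resolution auxiliary branch.

    A sketch of the idea in the text (multiscale aggregation followed by
    sub-pixel reconstruction), not the paper's exact FPNSR. Assumes three
    backbone maps at strides 8/16/32 with the given channel widths.
    """

    def __init__(self, in_chs=(128, 256, 512), mid=64, scale=2):
        super().__init__()
        # 1x1 lateral convs project every scale to a common width (FPN style).
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in in_chs)
        self.smooth = nn.Conv2d(mid, mid, 3, padding=1)
        # Sub-pixel head: expand to 3 * scale^2 channels, then shuffle to RGB.
        self.head = nn.Sequential(
            nn.Conv2d(mid, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, feats, out_hw):
        # feats = [p3, p4, p5] from the backbone (fine to coarse);
        # out_hw = (H, W) of the LR input image.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: merge from the coarsest map upward.
        x = laterals[-1]
        for lat in reversed(laterals[:-1]):
            x = lat + F.interpolate(x, size=lat.shape[-2:], mode="nearest")
        x = self.smooth(x)
        # Bring the merged map to the LR input size, then sub-pixel upsample.
        x = F.interpolate(x, size=out_hw, mode="bilinear", align_corners=False)
        return self.head(x)  # reconstructed HR image: (B, 3, scale*H, scale*W)
```

During training such a branch would be supervised by an SR reconstruction loss against HR ground truth and, as noted earlier, removed at inference so it adds no deployment cost.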
C. CA Mechanism
The CA mechanism is a computational unit used to enhance
the feature representation capability of convolutional neural
networks (CNNs). Its design purpose is to assist the model in
focusing on important locations and content while addressing
the potential issue of position information loss in the squeeze-
and-excitation (SE) [49] attention module. To counteract spatial
information loss caused by 2-D global pooling layers, the CA
mechanism utilizes two 1-D networks to generate 1-D features
along the X and Y directions, producing corresponding attention features aligned
with the spatial characteristics of the image. Specifically, as
illustrated in Fig. 2, the CA mechanism utilizes two 1-D global
pooling layers to extract directional features along the vertical
and horizontal directions from image features. Then, these direc-
tional feature maps are concatenated and subjected to dimension
reduction using a 1×1 convolution, followed by nonlinear acti-
vation operations, generating a new feature map. Subsequently,
the feature map is split along the spatial dimension, resulting
in two split features. Each split feature is further subjected to
dimension expansion using a 1×1 convolution and finally com-
bined with a sigmoid activation function to obtain the final atten-
tion vector feature. This operation effectively captures long-term
dependencies in image features along both directions, preserving
spatial information. The combination of these attention vector
features with the original image is achieved through elementwise
multiplication, resulting in image features weighted by attention
scores which indicate the degree of emphasis on the regions of
interest within the image features.
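The steps just described correspond, roughly, to the following PyTorch sketch of a CA block (after [28]); the channel-reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention following the description above (a sketch after
    [28]; the reduction ratio is an assumption)."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Step 1: two 1-D global poolings, along the vertical and horizontal
        # directions respectively.
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # -> (B, C, 1, W)
        # Steps 2-3: concatenate, reduce with a 1x1 conv, then activate.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        # Step 4: per-direction 1x1 expansion back to C channels.
        self.expand_h = nn.Conv2d(mid, channels, 1)
        self.expand_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xh = self.pool_h(x)                        # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)    # (B, C, W, 1)
        # Concatenate along the spatial axis and reduce dimensions.
        y = self.reduce(torch.cat([xh, xw], dim=2))  # (B, mid, H+W, 1)
        yh, yw = torch.split(y, [h, w], dim=2)       # split per direction
        # Step 5: expand, squash with sigmoid, and reweight the input.
        ah = torch.sigmoid(self.expand_h(yh))                       # (B, C, H, 1)
        aw = torch.sigmoid(self.expand_w(yw.permute(0, 1, 3, 2)))   # (B, C, 1, W)
        return x * ah * aw  # elementwise multiplication, broadcast over H, W
```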
The CA mechanism not only captures crucial features across
channel dimensions but also possesses the ability to perceive
and extract spatial coordinate features in different directions,
effectively highlighting objects of interest in the input features.
In addition, with a low computational cost and complexity, the
CA mechanism can be efficiently utilized in object detection
models, achieving powerful enhancement of features.
III. PROPOSED METHOD
A. Baseline Architecture
Designed for edge computing devices, YOLOv7tiny is a
model in the YOLOv7 series that boasts a smaller model size
and faster inference speed. As shown in Fig. 1, the YOLOv7tiny
network architecture consists of three main components: the
Backbone, Neck, and Head. The Backbone section is composed
of several Convolution-BatchNorm-LeakyReLU (CBL) mod-
ules, UP modules, MP modules, and efficient layer aggregation
network (ELAN) [50] modules. The UP is built using CBL
modules and upsampling operations. The MP performs max
pooling operations. The ELAN, composed of multiple stacked
CBL modules and featuring a two-branch structure, contributes
to the reduction of gradient propagation delay and information
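For reference, here is a minimal sketch of the CBL and UP building blocks named above, assuming the usual Conv-BatchNorm-LeakyReLU composition (the kernel sizes and the 0.1 LeakyReLU slope are assumptions drawn from common YOLO practice, not details given in this excerpt).

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Convolution-BatchNorm-LeakyReLU block, as described in the text."""

    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1, inplace=True)  # slope is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

class UP(nn.Module):
    """UP module: a CBL followed by an upsampling operation, per the text."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.cbl = CBL(c_in, c_out, k=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.cbl(x))
```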