HyperYOLO: A Real-Time Lightweight Object Detection Framework for Multimodal Remote Sensing Imagery


IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 17, 2024 8581

Object Detection by Channel and Spatial Exchange for Multimodal Remote Sensing Imagery

Guozheng Nan, Yue Zhao, Liyong Fu, and Qiaolin Ye, Member, IEEE

Abstract—Smart satellites and unmanned aerial vehicles (UAVs) are typically equipped with visible light and infrared (IR) spectrum sensors. However, achieving real-time object detection utilizing these multimodal data on such resource-limited devices is a challenging task. This article proposes HyperYOLO, a real-time lightweight object detection framework for multimodal remote sensing images. First, we propose a lightweight multimodal fusion module named channel and spatial exchange (CSE) to effectively extract complementary information from different modalities. The CSE module consists of two stages: channel exchange and spatial exchange. Channel exchange achieves global fusion by learning global weights to better utilize cross-channel information correlation, while spatial exchange captures details by considering spatial relationships to calibrate local fusion. Second, we propose an effective auxiliary branch module based on the feature pyramid network for super resolution (FPNSR) to enhance the framework’s responsiveness to small objects by learning high-quality feature representations. Moreover, we embed a coordinate attention mechanism to assist our network in precisely localizing and attending to the objects of interest. The experimental results show that on the VEDAI remote sensing dataset, HyperYOLO achieves a 76.72% mAP, surpassing the SOTA SuperYOLO by 1.63%. Meanwhile, the parameter size and GFLOPs of HyperYOLO are about 1.34 million (28%) and 3.97 (22%) less than SuperYOLO, respectively. In addition, HyperYOLO has a file size of only 7.3 MB after the removal of the auxiliary FPNSR branch, which makes it easier to deploy on these resource-constrained devices.

Index Terms—Multimodal feature fusion, remote sensing image (RSI), RGB-infrared object detection, super resolution (SR).

Manuscript received 9 March 2024; accepted 7 April 2024. Date of publication 12 April 2024; date of current version 24 April 2024. This work was supported in part by the National Key Research and Development Program under Grant 2022YFD2201005-03, in part by the National Natural Science Foundation of China under Grant 62072246 and Grant 32371877, in part by the Technology Winter Olympics Special Project under Grant 201001D, and in part by the Forest Fire Comprehensive System Construction-Unmanned Aerial Patrol Monitoring System of Chongli under Grant DA2-20001. (Corresponding author: Qiaolin Ye.)

Guozheng Nan, Yue Zhao, and Qiaolin Ye are with the College of Information Science and Technology, College of Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China (e-mail: gzn@njfu.edu.cn; zyue0109@163.com; yqlcom@njfu.edu.cn).

Liyong Fu is with the Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, China, and also with the College of Forestry, Hebei Agricultural University, Baoding 071000, China.

Digital Object Identifier 10.1109/JSTARS.2024.3388013

https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

OBJECT detection is an important task in the field of remote sensing image (RSI) processing, which not only contributes to applications in monitoring natural disasters and military reconnaissance but also has far-reaching impacts on urban planning and forest management. Traditional image feature extraction [1], [2] is important in computer vision, but its performance in object detection is limited by complex visual patterns and manual engineering. On the contrary, deep learning significantly improves detection performance by automatically learning discriminative features. In recent years, due to the rapid development of deep learning technology, many excellent algorithms [3], [4], [5], [6] have emerged in the field of object detection.

However, compared to general object detection tasks, RSIs have various characteristics such as complex backgrounds, small and densely arranged objects, and shadow occlusion. Therefore, it is necessary to adjust and optimize the model structure according to the characteristics of RSIs. Traditional object detection algorithms [7], [8], [9] are typically designed based on a single modality, primarily utilizing the visual information from images for detection, lacking the assistance of other modalities’ information. This may result in limited feature representation capability and difficulty capturing the diversity and contextual information of objects in complex scenes. In addition, different sensors may encounter various defects and noise when acquiring target detection data [10]. These sensors can be influenced by factors such as weather conditions, terrain, obstructions, and shadows, leading to a decrease in image quality or incomplete target information. Therefore, relying solely on a single modality for target detection may have certain limitations. Furthermore, certain targets may be difficult to discern in specific modalities but may become more apparent in other modalities. For instance, in RGB images, some targets may blend with the background color, making them challenging to distinguish. However, in IR images, these targets may exhibit distinct thermal features compared to the surrounding environment, making them easier to identify. By fusing information from both RGB and IR modalities, it is possible to enhance the visibility and distinctiveness of targets under different spectra. Manish et al. [11] improved the performance of detection through the introduction of a fusion strategy for multimodal data. To discover potential correlations between different modalities, many researchers employ complex fusion modules, such as transformer [12] and illumination-aware [13], which lead to increased computational complexity. Similarly, widely adopted fusion methods, encompassing feature-level and decision-level fusion [14], [15], [16], may lead to redundant computations among different modality branches or the introduction of additional backbone networks, thereby restricting the deployment of the model. The recent development direction of remote sensing object detection algorithms [9], [17], [18] involves the use of super resolution (SR) technology to learn the mapping relationship from low resolution (LR) images to high resolution (HR) images, enabling the reconstruction and detection of LR images. Although this approach can improve the performance of small target detection by increasing the detailed information in the image, the benefit comes at the cost of increased model complexity, introducing a certain level of intricacy and time overhead. Zhang et al. [19] introduced an auxiliary SR branch to guide the detector in learning high-quality HR representations, facilitating the distinction of small objects from the LR input background. It is worth noting that many upsampling methods, like bilinear and nearest-neighbor interpolation, estimate new pixel values from nearby pixels. While these increase image size, they may lose texture details, and thus, hinder small object reconstruction in LR images. In addition, many methods fail to fully utilize multiscale information when increasing the size of images, which limits their performance.
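To make the point about interpolation-based upsampling concrete, the following is a minimal sketch in plain Python, not code from the article. The function names and the corner-aligned sampling convention are assumptions chosen for illustration. It shows that nearest-neighbor upsampling only copies existing pixels and that bilinear upsampling only averages neighbors, so neither can produce values outside the range already present in the LR input.

```python
def upsample_nearest(img, scale):
    """Nearest-neighbor upsampling: every output pixel copies one input pixel."""
    return [
        [img[i // scale][j // scale] for j in range(len(img[0]) * scale)]
        for i in range(len(img) * scale)
    ]


def upsample_bilinear(img, scale):
    """Bilinear upsampling (corner-aligned, an illustrative choice): each
    output pixel is a weighted average of up to four nearby input pixels."""
    h, w = len(img), len(img[0])
    out_h, out_w = h * scale, w * scale
    out = []
    for i in range(out_h):
        # Map the output row index back to a fractional input row.
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A 2x2 low-resolution patch upsampled by a factor of 2.
lr = [[0.0, 10.0], [20.0, 30.0]]
nearest = upsample_nearest(lr, 2)
bilinear = upsample_bilinear(lr, 2)
```

Both outputs are 4×4, yet every value is either a copy or a blend of the four original pixels, which is why such methods tend to smooth rather than recover texture.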

In recent years, the you only look once (YOLO) series of algorithms [7], [20], [21], [22], [23] has emerged as a representative in the field of object detection due to its rapid, accurate, and proven engineering capabilities. To further enhance real-time performance and achieve efficient object detection in computationally constrained environments, several lightweight and real-time improvement algorithms based on YOLO have emerged. Zi et al. [24] proposed TP-YOLO, integrating self-attention mechanisms and omnidimensional dynamic convolution (ODConv) [25] into YOLOv8 [23] to improve small target detection while reducing parameters and computational complexity. However, ODConv may not be suitable for sparse image processing since it requires computing unique convolution kernels for each position and channel, making it difficult to achieve stable and reliable convolution operations with sparse input data. In addition, Nvidia’s acceleration library, TensorRT, is not very friendly toward ODConv operations. Zhang et al. [26] introduced FFCA-YOLO, an improved version of YOLOv5 [22], to address the issue of insufficient feature representation in small object detection, achieving performance enhancement through feature fusion and spatial context-aware modules while minimizing complexity. However, FFCA-YOLO is an algorithm designed based on a single modality, and its detection performance may not meet expectations when encountering extreme weather conditions or severe target occlusion.

In light of the aforementioned challenges, we propose a real-time object detection framework for multimodal RSIs that excels not only in having fewer parameters and lower computational complexity but also in delivering exceptional detection accuracy. First, we choose the YOLOv7tiny [7] architecture as our detection baseline. It has a smaller network structure and fewer parameters, making it easier to deploy and run with limited computational resources. Second, we propose a novel lightweight fusion module employing channel and spatial exchange (CSE) to ensure that each modality retains its unique features while effectively integrating the prominent features of other modalities. Moreover, an effective auxiliary branch module based on the feature pyramid network [8] for super resolution (FPNSR), which uses the PixelShuffle (PS) upsampling [27] method and multiscale feature information, is designed to preserve the features of small targets in LR input within the backbone network. Meanwhile, it is necessary to remove the FPNSR branch to avoid additional computational overhead in the inference and deployment stages. Finally, we introduce a coordinate attention (CA) [28] mechanism to suppress interference from complex backgrounds on small objects, while focusing on the regions of interest to enhance boundary accuracy and preserve fine details. In summary, the main contributions of our article are as follows.

1) We propose a computationally efficient lightweight multimodal fusion method that symmetrically and compactly combines internal information in a bidirectional manner, utilizing spatial and channel relationships to ensure both upper and lower branches focus on each other’s complementary information.

2) We designed the FPNSR auxiliary branch to address inconspicuous features in small targets within LR inputs, utilizing SR techniques to enhance the model’s ability to learn HR feature representations and improve the identification of small objects in cluttered backgrounds.

3) The FPNSR branch is a flexible component, guiding the network to enhance the detector’s responsiveness to small targets during training, and it can be removed during deployment to reduce computational demands while maintaining target detection accuracy.

4) We introduce the CA mechanism to enable our framework to focus on specific regions within the target image, thereby more effectively capturing crucial information and achieving precise localization and recognition of the target.
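The PixelShuffle (PS) upsampling named above for the FPNSR branch can be illustrated with a short sketch. This is not the authors' implementation: it shows only the rearrangement step, in plain Python on nested lists, using the channel ordering common to sub-pixel convolution layers (for example, the one documented for PyTorch's `nn.PixelShuffle`). The function name `pixel_shuffle` and the toy input are hypothetical, and the convolution that would normally produce the `C*r*r` input channels is omitted.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Output pixel (c, h*r + i, w*r + j) is read from input channel
    c*r*r + i*r + j at spatial position (h, w).
    """
    channels_in, h, w = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for hh in range(h):
                    for ww in range(w):
                        out[c][hh * r + i][ww * r + j] = src[hh][ww]
    return out


# Four 1x2 channels become a single 2x4 channel for r = 2.
features = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
upsampled = pixel_shuffle(features, 2)
# upsampled == [[[1, 3, 2, 4], [5, 7, 6, 8]]]
```

Unlike interpolation, each output pixel here comes from a separate learned channel, so the network can place distinct values at every sub-pixel position instead of blending neighbors.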

II. RELATED WORK

A. Remote Sensing Object Detection

HR satellite RSIs usually contain rich information about ground objects, making the high-precision extraction of these objects through the application of remote sensing object detection a current hot research direction [29], [30]. Object detection algorithms often fall into two categories: two-stage methods based on anchor boxes, such as Faster R-CNN [31] and Mask R-CNN [32], and single-stage detectors like SSD [33] and the YOLO series [22], [34], [35], among others. Recently, some anchor-free detectors, such as the fully convolutional FCOS [36], and transformer-based algorithms like deformable DETR [37] and Swin Transformer [38], have also achieved remarkable results. The two-stage detection method consists of two core components: the feature extraction network and the candidate region generation network. Initially, the feature extraction network identifies the regions of interest in the image. Subsequently, the candidate region generation network further processes the features to generate potential target candidate regions. While this two-stage approach achieves high accuracy, it comes at the cost of slower detection speed. On the contrary, the single-stage detection method directly predicts the category and location information of the object without generating candidate regions to improve the detection speed. Given the significant differences between general objects and remote sensing objects, researchers have proposed various solutions to address the issue of small object detection in remote sensing imagery for the one-stage detector YOLO series networks effectively. Zakria et al. [39] proposed an improvement for the YOLOv4 [21] network by introducing a nonmaximum suppression threshold classification setting and an anchor box assignment scheme. Lin et al. [30] introduced a decoupled detection head and terminal attention mechanism to enhance the target localization performance of the YOLOv5 framework. Yi et al. [40] used a visual transform for feature extraction and designed an attention-guided bidirectional FPN to improve the performance of YOLOv8 for small target detection in RSIs.

Although the aforementioned improvement methods have made progress in certain aspects, they only utilize unimodal data and fail to fully exploit the inherent value of multimodal data. The vibrant advancement of imaging technology offers greater opportunities for collecting multimodal data in RSIs, presenting new possibilities to improve the accuracy of RSI analysis and detection.

B. SR in Object Detection

Existing methods for small object detection in RSIs mainly focus on two aspects: context information and multiscale processing [41], [42], [43]. However, these methods overlook a crucial issue, which is the severe loss of feature information for small objects after multiple downsampling operations in RSIs, as well as the inadequate preservation of HR contextual information. As a result, the current models exhibit poor detection accuracy for RSIs. The SR technology is a highly regarded research direction in the fields of computer vision and image processing [44], [45]. In recent years, it has made significant advancements and found widespread application, particularly in the domain of RSI processing [18]. Ji et al. [9] introduced a two-branch network for simultaneous SR and target detection, where the SR branch generates HR feature maps for use in the detection branch, and jointly optimizes the SR and target detection losses to train the network. Courtrai et al. [46] improved small object detection in RSIs by using generative adversarial networks (GANs) and the EDSR [47] network to generate HR images for input into a detector. Although the above methods address the challenge of small target detection to some extent, they are not suitable for practical deployment of models in real-time application scenarios due to the introduction of a large number of additional computations, including the complexity associated with SR techniques and the fact that HR features increase the complexity of the detection model.

Recently, Wang et al. [48] and Zhang et al. [19] each proposed an SR module that can maintain an HR representation even with LR inputs, while reducing model computation in segmentation and detection tasks. However, a limitation of these SR modules is their underutilization of multiscale features. In SR networks, multiscale features play a crucial role in enhancing reconstruction quality and preserving fine details. By integrating information from different scales, the network can better capture subtle changes and texture details in the image. Building upon these structures, we propose a multiscale SR module, namely, the FPNSR module, that achieves high-quality LR reconstruction to preserve HR representation by integrating features from multiple scales.

C. CA Mechanism

The CA mechanism is a computational unit used to enhance the feature representation capability of convolutional neural networks (CNNs). Its design purpose is to assist the model in focusing on important locations and content while addressing the potential issue of position information loss in the squeeze-and-excitation (SE) [49] attention module. To counteract spatial information loss caused by 2-D global pooling layers, the CA mechanism utilizes two 1-D networks to generate X and Y 1-D features, producing corresponding attention features aligned with the spatial characteristics of the image. Specifically, as illustrated in Fig. 2, the CA mechanism utilizes two 1-D global pooling layers to extract directional features along the vertical and horizontal directions from image features. Then, these directional feature maps are concatenated and subjected to dimension reduction using a 1×1 convolution, followed by nonlinear activation operations, generating a new feature map. Subsequently, the feature map is split along the spatial dimension, resulting in two split features. Each split feature is further subjected to dimension expansion using a 1×1 convolution and finally combined with a sigmoid activation function to obtain the final attention vector feature. This operation effectively captures long-range dependencies in image features along both directions, preserving spatial information. The combination of these attention vector features with the original image is achieved through elementwise multiplication, resulting in image features weighted by attention scores which indicate the degree of emphasis on the regions of interest within the image features.

The CA mechanism not only captures crucial features across channel dimensions but also possesses the ability to perceive and extract spatial coordinate features in different directions, effectively highlighting objects of interest in the input features. In addition, with a low computational cost and complexity, the CA mechanism can be efficiently utilized in object detection models, achieving powerful enhancement of features.
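The CA steps described in this subsection (directional pooling, concatenation, 1×1 reduction, split, 1×1 expansion, sigmoid, elementwise reweighting) can be sketched as follows. This is a simplified illustration in plain Python, not the configuration used in the article: it processes one feature map without a batch dimension, omits normalization, uses ReLU as the nonlinearity (an assumption, since the text only says "nonlinear activation"), and takes hypothetical weight matrices `w_reduce`, `w_h`, and `w_w` as arguments instead of learning them.

```python
import math


def conv1x1(weights, feats):
    """1x1 convolution on a (C_in, P) map: a matrix product over channels."""
    return [
        [sum(w_row[c] * feats[c][p] for c in range(len(feats)))
         for p in range(len(feats[0]))]
        for w_row in weights
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x, w_reduce, w_h, w_w):
    """x: (C, H, W) nested list. w_reduce: (C_mid, C). w_h, w_w: (C, C_mid)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # 1) Two 1-D global poolings: average over width, then over height.
    pool_h = [[sum(x[c][h]) / W for h in range(H)] for c in range(C)]
    pool_w = [[sum(x[c][h][w] for h in range(H)) / H for w in range(W)]
              for c in range(C)]
    # 2) Concatenate along the spatial axis -> (C, H + W), reduce channels,
    #    then apply a nonlinearity (ReLU here, as an assumption).
    cat = [pool_h[c] + pool_w[c] for c in range(C)]
    mid = [[max(0.0, v) for v in row] for row in conv1x1(w_reduce, cat)]
    # 3) Split back into the two directional parts.
    mid_h = [row[:H] for row in mid]
    mid_w = [row[H:] for row in mid]
    # 4) Expand channels with 1x1 convolutions and squash with sigmoid.
    att_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, mid_h)]
    att_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, mid_w)]
    # 5) Reweight the input elementwise with both attention vectors.
    return [
        [[x[c][h][w] * att_h[c][h] * att_w[c][w] for w in range(W)]
         for h in range(H)]
        for c in range(C)
    ]


# Toy input: 2 channels, 2x3 spatial size, 1 intermediate channel.
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]]
zero_reduce = [[0.0, 0.0]]
zero_expand = [[0.0], [0.0]]
y = coordinate_attention(x, zero_reduce, zero_expand, zero_expand)
```

With all weights set to zero, both attention vectors equal sigmoid(0) = 0.5 everywhere, so the output is the input scaled by 0.25. Trained weights would instead raise the attention at rows and columns where objects of interest lie.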

III. PROPOSED METHOD

A. Baseline Architecture

Designed for edge computing devices, YOLOv7tiny is a model in the YOLOv7 series that boasts a smaller model size and faster inference speed. As shown in Fig. 1, the YOLOv7tiny network architecture consists of three main components: the Backbone, Neck, and Head. The Backbone section is composed of several Convolution-BatchNorm-LeakyReLU (CBL) modules, UP modules, MP modules, and efficient layer aggregation network (ELAN) [50] modules. The UP is built using CBL modules and upsampling operations. The MP performs max pooling operations. The ELAN, composed of multiple stacked CBL modules and featuring a two-branch structure, contributes to the reduction of gradient propagation delay and information
