内容概要:本文提出了DAMSDet,一种动态自适应多光谱检测Transformer方法,解决了红外可见物体检测中的互补信息融合和模态对齐问题。DAMSDet采用了模态竞争查询选择(Modality Competitive Query Selection)策略和多光谱变形交叉注意力模块(Multispectral Deformable Cross-attention),能有效挖掘细粒度的部分互补信息,并适应模态对齐情况。实验结果表明,在多个公开数据集上,该方法比现有方法显著提升了性能。 适合人群:对计算机视觉尤其是红外可见物体检测感兴趣的研究人员和技术人员。 使用场景及目标:适用于需要高精度多光谱物体检测的应用场景,如智能监控、自动驾驶、安防系统等。旨在提高复杂光照条件下的物体检测效果,特别是在低光照、烟雾或雾霾环境中。 其他说明:本方法不仅能够有效地融合多模态特征,还能动态地适应不同的模态特征变化,提高了检测的鲁棒性和准确性。
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion

Junjie Guo¹, Chenqiang Gao²*, Fangcen Liu¹, Deyu Meng³, and Xinbo Gao¹

¹ School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
² School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, Guangdong 518107, China
³ School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
Abstract. Infrared-visible object detection aims to achieve robust, even full-day, object detection by fusing the complementary information of infrared and visible images. However, the highly dynamic complementary characteristics and the commonly occurring modality misalignment make the fusion of complementary information difficult. In this paper,
we propose a Dynamic Adaptive Multispectral Detection Transformer
(DAMSDet) to simultaneously address these two challenges. Specifically,
we propose a Modality Competitive Query Selection strategy to pro-
vide useful prior information. This strategy can dynamically select basic
salient modality feature representation for each object. To effectively
mine the complementary information and adapt to misalignment situa-
tions, we propose a Multispectral Deformable Cross-attention module to
adaptively sample and aggregate multi-semantic level features of infrared
and visible images for each object. In addition, we further adopt the
cascade structure of DETR to better mine complementary information.
Experiments on four public datasets of different scenes demonstrate sig-
nificant improvements compared to other state-of-the-art methods. The
code will be released at https://github.com/gjj45/DAMSDet.
Keywords: Object detection · Multispectral detection · Infrared · DETR
· Query selection · Adaptive feature fusion
1 Introduction
Object detection is a fundamental task in computer vision, and most research works are based on visible images with detailed object information, e.g., texture and color. Thanks to the development of deep learning, the object detection technique has made great progress. However, it is still challenged by poor imaging conditions, such as low illumination, smoke, and fog, which can make objects quite obscure and significantly degrade the performance of object detection. Thus, infrared images are introduced into the object detection task.
Fig. 1: Illustrations of two typical challenges in infrared-visible object detection. (a) Three pedestrians exhibit different complex complementary characteristics. In this example, the objects in the visible image provide useless interference information (red), partial complementary information (blue), and full complementary information (green). (b) An example of the misalignment problem, in which the ground truths of the infrared and visible objects show obvious dislocation. This misalignment commonly happens in infrared-visible images. We propose a Multispectral Transformer Decoder with a Multispectral Deformable Cross-attention module to simultaneously address these two typical challenges.
Different from visible imaging, infrared imaging captures the thermal radiation of objects, making it unaffected by illumination and by occlusion from smoke and fog. Therefore, infrared imaging can still capture objects well even in low illumination, heavy smoke, or fog, although the detailed texture and color information is lost. These complementary characteristics of infrared and visible imaging can not only improve the performance of object detection but are also considered promising for implementing full-day object detection. Thus, infrared-visible object detection has attracted extensive attention in recent years [5, 15, 30, 33, 39, 41].
However, existing methods tend to neglect the modality interference encoun-
tered in complex scenes during the fusion process. When the object signal in one modality is poor or absent, directly fusing the information of the two modalities brings in useless interference information, which can lead to feature confusion and thus degrade object detection performance. For example, as shown in Fig. 1(a), the pedestrian within the smoke fully disappears, and intuitively, the best way is to suppress or discard the visible information for that
pedestrian. Some works learn a global fusion weight to adapt to specific scenes; representative ones adopt an illumination-aware network to obtain an illumination score as the global fusion weight [12, 33, 41]. Other works learn local region fusion weights through bounding-box-level semantic segmentation [12, 33, 41] or regions-of-interest (ROI) prediction [9, 39, 40].
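As a rough sketch of the illumination-aware global weighting idea described above (the module name and the tiny scoring network are illustrative assumptions, not the implementation of [12, 33, 41]):

```python
import torch
import torch.nn as nn

class IlluminationAwareFusion(nn.Module):
    """Hypothetical sketch: a scalar illumination score, predicted from the
    visible features, acts as a global weight when fusing the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        # Tiny scoring head; real illumination-aware networks are typically
        # trained on the visible image with day/night supervision.
        self.illum_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat_vis: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        # w -> 1 under good illumination, so visible features dominate;
        # w -> 0 at night, so infrared features dominate.
        w = self.illum_net(feat_vis).view(-1, 1, 1, 1)
        return w * feat_vis + (1.0 - w) * feat_ir
```

Because the weight is a single scalar per image, such schemes cannot treat the three pedestrians in Fig. 1(a) differently, which is precisely the limitation discussed next.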
In fact, due to the entirely different imaging principles, the complementary characteristics of infrared-visible images vary greatly with the specific scenes and objects, as shown in Fig. 1. In particular, from Fig. 1(a), we can observe
that three pedestrians have obviously different complementary characteristics.
The one with the green bounding box has good complementary information in
both modalities, while the one with the red bounding box has only infrared
information available, as mentioned previously. In contrast, the one with the
blue bounding box has partial information available in both modalities, which
commonly exists in practical applications. This situation causes current methods to fail to fuse features effectively, even the above-mentioned region-based weight fusion methods, in which the segmented or predicted regions are usually larger than the objects. Therefore, fine-grained two-modality information fusion remains a challenge.
Another important challenge in infrared-visible object detection is the modal-
ity misalignment problem. Most feature fusion methods assume that the two
modalities are well-aligned. However, precise registration is difficult because
infrared-visible images often exhibit significant visual differences and are not
always captured at the exact same timestamp [43]. As a result, even after manual registration, the two imaging objects corresponding to the same physical object are usually misaligned, as shown in Fig. 1(b). This can disrupt the consistency of the fused feature representation in current methods, degrading the final detection performance. AR-CNN [39, 40] explicitly learned the offsets of objects in both modalities to achieve alignment on object features. However, this method requires additional paired bounding-box annotations for the two modalities during training, which is time-consuming and labor-intensive.
In this paper, we propose a novel adaptive infrared-visible object detection method that contains a Multispectral Transformer Decoder with a Multispectral Deformable Cross-attention module, inspired by deformable cross-attention [42], to simultaneously address the above two challenges. Specifically, we adopt an effective strategy of adaptive sparse feature sampling and weighted aggregation on two-modality feature maps at different semantic levels. This strategy can effectively fuse fine-grained complementary information even when the two modalities are misaligned. Since the two challenges of fine-grained information fusion and modality alignment are handled simultaneously in a single module, our method is more efficient than existing methods, which usually handle them separately. Furthermore, unlike the one-step fusion strategy adopted by existing methods, the information fusion for each specific object in our method happens at different semantic levels, which allows the complementary information to be fully mined and utilized. In fact, we observed that the complementary information of the two modalities also varies dynamically with the semantic level, as discussed in Sec. 3.3, similar to our earlier observation on scenes and objects. Thus, our adaptive multi-level fusion is more reasonable.
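To make the mechanism concrete, here is a minimal single-level, single-head sketch of deformable cross-attention over two modalities, in the spirit of the module described above (all names and shapes are illustrative assumptions; the paper's module spans multiple semantic levels and attention heads):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultispectralDeformableCrossAttention(nn.Module):
    """Illustrative sketch: each object query predicts sparse sampling
    locations and aggregation weights over BOTH infrared and visible
    feature maps, so fusion is per-object and misalignment-tolerant."""

    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        # Per query: (modality, point, xy) offsets and one weight per sample.
        self.offsets = nn.Linear(dim, 2 * n_points * 2)
        self.weights = nn.Linear(dim, 2 * n_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_ir, feat_vis):
        # queries: (B, Q, C); ref_points: (B, Q, 2), normalized to [0, 1]
        # feat_ir, feat_vis: (B, C, H, W)
        B, Q, C = queries.shape
        offs = self.offsets(queries).view(B, Q, 2, self.n_points, 2)
        # Softmax over all 2 * n_points samples: the two modalities
        # compete for aggregation weight at every query.
        w = self.weights(queries).view(B, Q, 2 * self.n_points).softmax(-1)
        out = queries.new_zeros(B, Q, C)
        for m, feat in enumerate((feat_ir, feat_vis)):
            # Map sampling locations to grid_sample's [-1, 1] convention.
            loc = (ref_points[:, :, None, :] + offs[:, :, m]) * 2.0 - 1.0
            sampled = F.grid_sample(feat, loc, align_corners=False)  # (B, C, Q, P)
            wm = w[:, :, m * self.n_points:(m + 1) * self.n_points]  # (B, Q, P)
            out = out + (sampled.permute(0, 2, 3, 1) * wm[..., None]).sum(dim=2)
        return self.proj(out)
```

Because each modality contributes through its own learned offsets, a query anchored on an infrared object can still sample the visible features at a shifted location, which is what makes this style of sampling tolerant to misalignment.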
In order to provide reliable input at the early stage, we design a Modality Competitive Query Selection strategy that selects dominant-modality features for each object as initial position and content queries, building a basic salient feature representation for the Multispectral Transformer Decoder and providing useful prior information for subsequent processing. To further exploit more reliable and comprehensive complementary information step by step, the cascaded layer structure of DETR [35] is employed in this paper. Overall, our method resembles the human observation pattern, which dynamically focuses on objects in each modality and gradually aggregates the key information of the two modalities.
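A minimal sketch of how such a competitive selection could look (the scoring head and tensor layout are assumptions for illustration, not the paper's exact design):

```python
import torch
import torch.nn as nn

def competitive_query_selection(feat_ir, feat_vis, score_head, num_queries=300):
    """Illustrative sketch: encoder tokens of both modalities compete, and
    the top-scoring tokens (whichever modality they come from) initialize
    the decoder's content queries.

    feat_ir, feat_vis: (B, N, C) flattened encoder tokens per modality.
    score_head: e.g. a shared nn.Linear(C, num_classes) classification head.
    """
    tokens = torch.cat([feat_ir, feat_vis], dim=1)   # (B, 2N, C)
    scores = score_head(tokens).max(-1).values       # best class score per token
    topk = scores.topk(num_queries, dim=1).indices   # (B, K) winning tokens
    # Gather the winning tokens: each query is initialized from whichever
    # modality was more salient at that location.
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                     # (B, K, C)
```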
Our contributions can be summarized as follows:
• We propose a novel infrared-visible object detection method, named DAMS-
Det, which can dynamically focus on dominant modality objects and adap-
tively fuse complementary information.
• We propose a Modality Competitive Query Selection strategy for initializing multimodal queries, which dynamically focuses on the dominant modality of each object and provides useful prior information for the following fusion process.
• We propose a Multispectral Deformable Cross-attention module, which simultaneously mines fine-grained partial complementary information at different semantic levels and adapts to modality misalignment.
• Experiments on four public datasets with different scenarios demonstrate
that the proposed method achieves significant improvement compared with
other state-of-the-art methods.
2 Related Work
Infrared-Visible object detection. Previous research in infrared-visible object detection is primarily built upon single-modality object detection frameworks, which are generally divided into two-stage object detectors, such as Faster R-CNN [28], and one-stage object detectors, such as YOLO [25–27, 31].
In order to fuse the complementary information of infrared and visible im-
ages, Konig et al. [10] introduced a fully convolutional fusion RPN network,
which fused infrared and visible image features by concatenation, and concluded
that halfway fusion obtains better results [16]. On this foundation, some studies designed CNN-based attention modules to better exploit the potential complementarity of infrared and visible images [2, 23, 29]. Additionally, other works
introduced transformer-based fusion modules to capture a more global comple-
mentary relationship between infrared and visible images [5,22, 30, 43].
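For reference, halfway fusion by concatenation as described above can be sketched as follows (a minimal illustration under assumed layer choices, not the exact architecture of [10]):

```python
import torch
import torch.nn as nn

class HalfwayConcatFusion(nn.Module):
    """Minimal sketch of halfway fusion: mid-level backbone features of the
    two modalities are concatenated and reduced back to a single stream."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv merges the concatenated infrared + visible channels.
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # feat_ir, feat_vis: (B, C, H, W) mid-level feature maps
        return self.reduce(torch.cat([feat_ir, feat_vis], dim=1))
```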
In addition to the above methods that directly fuse image features, some works adopted illumination information as global weights to fuse infrared and visible image features, or to post-fuse multi-branch detection results and reduce the impact of interfering information [12, 33, 41]. Considering that the complementary characteristics of different regions can differ, some studies introduced bounding-box-level semantic segmentation [1, 11, 12, 37, 38] or regions-of-interest (ROI) prediction [9, 39, 40] to guide the fusion of different regions. Other works utilized the confidence or uncertainty scores of regions to post-fuse the predictions of multiple branches [14, 15].
To address the challenge of modality misalignment, Zhang et al. [39,40] devel-
oped the AR-CNN network and explicitly aligned the features of two modalities
by incorporating additional paired bounding box annotations to learn object
[The remaining 16 pages are not shown.]