内容概要:本文提出了DAMSDet,一种动态自适应多光谱检测Transformer方法,解决了红外可见物体检测中的互补信息融合和模态对齐问题。DAMSDet采用了模态竞争查询选择(Modality Competitive Query Selection)策略和多光谱变形交叉注意力模块(Multispectral Deformable Cross-attention),能有效挖掘细粒度的部分互补信息,并适应模态对齐情况。实验结果表明,在多个公开数据集上,该方法比现有方法显著提升了性能。 适合人群:对计算机视觉尤其是红外可见物体检测感兴趣的研究人员和技术人员。 使用场景及目标:适用于需要高精度多光谱物体检测的应用场景,如智能监控、自动驾驶、安防系统等。旨在提高复杂光照条件下的物体检测效果,特别是在低光照、烟雾或雾霾环境中。 其他说明:本方法不仅能够有效地融合多模态特征,还能动态地适应不同的模态特征变化,提高了检测的鲁棒性和准确性。
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion

Junjie Guo¹, Chenqiang Gao²*, Fangcen Liu¹, Deyu Meng³, and Xinbo Gao¹

¹ School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
² School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, Guangdong 518107, China
³ School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
Abstract. Infrared-visible object detection aims to achieve robust, even full-day, object detection by fusing the complementary information of infrared and visible images. However, the highly dynamic complementary characteristics and the commonly occurring modality misalignment make the fusion of complementary information difficult. In this paper,
we propose a Dynamic Adaptive Multispectral Detection Transformer
(DAMSDet) to simultaneously address these two challenges. Specifically,
we propose a Modality Competitive Query Selection strategy to pro-
vide useful prior information. This strategy can dynamically select basic
salient modality feature representation for each object. To effectively
mine the complementary information and adapt to misalignment situa-
tions, we propose a Multispectral Deformable Cross-attention module to
adaptively sample and aggregate multi-semantic level features of infrared
and visible images for each object. In addition, we further adopt the
cascade structure of DETR to better mine complementary information.
Experiments on four public datasets of different scenes demonstrate sig-
nificant improvements compared to other state-of-the-art methods. The
code will be released at https://github.com/gjj45/DAMSDet.
Keywords: Object detection · Multispectral detection · Infrared · DETR
· Query selection · Adaptive feature fusion
1 Introduction
Object detection is a fundamental task in computer vision, and most research works are based on visible images with detailed object information, e.g., texture and color. Thanks to the development of deep learning, the object detection technique has made great progress. However, it is still challenged by poor imaging conditions, such as low illumination, smoke, and fog, which can make objects quite obscure and significantly degrade the performance of object detection. Thus, infrared images are introduced into the object detection task.
Fig. 1: Illustrations of two typical challenges in infrared-visible object detection. (a) Three pedestrians exhibit different complex complementary characteristics. In this example, the objects in the visible image provide useless interference information (red), partial complementary information (blue), and full complementary information (green). (b) An example of the misalignment problem, in which the ground truths of the infrared and visible objects show obvious dislocation. This misalignment commonly happens in infrared-visible images. We propose a Multispectral Transformer Decoder with a Multispectral Deformable Cross-attention module to simultaneously address these two typical challenges.
Different from visible imaging, infrared imaging captures the thermal radiation of objects, making it unaffected by illumination and by occlusion from smoke and fog. Therefore, infrared imaging can still capture objects well even in low illumination, heavy smoke, or fog, although the detailed texture and color information is lost. These complementary characteristics of infrared and visible imaging can not only improve the performance of object detection but are also considered promising for implementing full-day object detection. Thus, infrared-visible object detection has attracted extensive attention in recent years [5, 15, 30, 33, 39, 41].
However, existing methods tend to neglect the modality interference encoun-
tered in complex scenes during the fusion process. When the object signal in one modality is poor or absent, directly fusing the information of the two modalities brings in useless interference information, which can lead to feature confusion and thus degrade object detection performance. For example, as shown in Fig. 1(a), the pedestrian within the smoke fully disappears, and intuitively, the best way is to suppress or discard the visible information for that
pedestrian. Some works learn a global fusion weight to adapt to specific scenes; representative ones adopt an illumination-aware network to obtain an illumination score as the global fusion weight [12, 33, 41]. Other works learn local region fusion weights through bounding-box-level semantic segmentation [12, 33, 41] or regions-of-interest (ROI) prediction [9, 39, 40].
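As a rough sketch of the illumination-aware global weighting idea described above (the module name and the tiny scoring network are illustrative assumptions, not the implementation of [12, 33, 41]):

```python
import torch
import torch.nn as nn

class IlluminationAwareFusion(nn.Module):
    """Hypothetical sketch: a scalar illumination score, predicted from the
    visible features, acts as a global weight when fusing the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        # Tiny scoring head; real illumination-aware networks are typically
        # trained on the visible image with day/night supervision.
        self.illum_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat_vis: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        # w -> 1 under good illumination, so visible features dominate;
        # w -> 0 at night, so infrared features dominate.
        w = self.illum_net(feat_vis).view(-1, 1, 1, 1)
        return w * feat_vis + (1.0 - w) * feat_ir
```

Because the weight is a single scalar per image, such schemes cannot treat the three pedestrians in Fig. 1(a) differently, which is precisely the limitation discussed next.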
In fact, due to the entirely different imaging principles, the complementary characteristics of infrared-visible images vary greatly with the specific scenes and objects, as shown in Fig. 1. In particular, from Fig. 1(a), we can observe
that three pedestrians have obviously different complementary characteristics.
The one with the green bounding box has good complementary information in
both modalities, while the one with the red bounding box has only infrared
information available, as mentioned previously. In contrast, the one with the
blue bounding box has partial information available in both modalities, which
commonly exists in practical applications. This situation causes current methods to fail to fuse features effectively, even the above-mentioned region-based weight fusion methods, in which the segmented or predicted regions are usually larger than the objects. Therefore, fine-grained two-modality information fusion remains a challenge.
Another important challenge in infrared-visible object detection is the modal-
ity misalignment problem. Most feature fusion methods assume that the two
modalities are well-aligned. However, precise registration is difficult because
infrared-visible images often exhibit significant visual differences and are not
always captured at the exact same timestamp [43]. As a result, even after manual registration, the two imaging objects corresponding to the same physical object are usually misaligned, as shown in Fig. 1(b). This can disrupt the consistency of the fused feature representation in current methods, degrading the final detection performance. AR-CNN [39, 40] explicitly learned the offsets of objects in both modalities to achieve alignment on object features. However, this method requires additional paired bounding-box annotations for the two modalities during training, which is time-consuming and labor-intensive.
In this paper, we propose a novel adaptive infrared-visible object detection method that contains a Multispectral Transformer Decoder with a Multispectral Deformable Cross-attention module, inspired by deformable cross-attention [42], to simultaneously address the above two challenges. Specifically, we adopt an effective strategy of adaptive sparse feature sampling and weighted aggregation on two-modality feature maps at different semantic levels. This strategy can effectively fuse fine-grained complementary information even when the two modalities are misaligned. Since the two challenges of fine-grained information fusion and modality alignment are handled simultaneously in a single module, our method is more efficient than existing methods, which usually handle them separately. Furthermore, unlike the one-step fusion strategy adopted by existing methods, the information fusion for each specific object in our method happens at different semantic levels, which allows the complementary information to be fully mined and utilized. In fact, we observed that the complementary information of the two modalities also varies dynamically with the semantic level, as discussed in Sec. 3.3, similar to our earlier observation on scenes and objects. Thus, our adaptive multi-level fusion is more reasonable.
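To make the mechanism concrete, here is a minimal single-level, single-head sketch of deformable cross-attention over two modalities, in the spirit of the module described above (all names and shapes are illustrative assumptions; the paper's module spans multiple semantic levels and attention heads):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultispectralDeformableCrossAttention(nn.Module):
    """Illustrative sketch: each object query predicts sparse sampling
    locations and aggregation weights over BOTH infrared and visible
    feature maps, so fusion is per-object and misalignment-tolerant."""

    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        # Per query: (modality, point, xy) offsets and one weight per sample.
        self.offsets = nn.Linear(dim, 2 * n_points * 2)
        self.weights = nn.Linear(dim, 2 * n_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_ir, feat_vis):
        # queries: (B, Q, C); ref_points: (B, Q, 2), normalized to [0, 1]
        # feat_ir, feat_vis: (B, C, H, W)
        B, Q, C = queries.shape
        offs = self.offsets(queries).view(B, Q, 2, self.n_points, 2)
        # Softmax over all 2 * n_points samples: the two modalities
        # compete for aggregation weight at every query.
        w = self.weights(queries).view(B, Q, 2 * self.n_points).softmax(-1)
        out = queries.new_zeros(B, Q, C)
        for m, feat in enumerate((feat_ir, feat_vis)):
            # Map sampling locations to grid_sample's [-1, 1] convention.
            loc = (ref_points[:, :, None, :] + offs[:, :, m]) * 2.0 - 1.0
            sampled = F.grid_sample(feat, loc, align_corners=False)  # (B, C, Q, P)
            wm = w[:, :, m * self.n_points:(m + 1) * self.n_points]  # (B, Q, P)
            out = out + (sampled.permute(0, 2, 3, 1) * wm[..., None]).sum(dim=2)
        return self.proj(out)
```

Because each modality contributes through its own learned offsets, a query anchored on an infrared object can still sample the visible features at a shifted location, which is what makes this style of sampling tolerant to misalignment.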
In order to provide reliable input at the early stage, we design a Modality Competitive Query Selection strategy that selects dominant-modality features for each object as initial position and content queries, building a basic salient feature representation for the Multispectral Transformer Decoder and providing useful prior information for subsequent processing. To further exploit more reliable and comprehensive complementary information step by step, the cascaded layer structure of DETR [35] is employed in this paper. Overall, our method resembles the human observation pattern, which dynamically focuses on objects in each modality and gradually aggregates the key information of the two modalities.
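A minimal sketch of how such a competitive selection could look (the scoring head and tensor layout are assumptions for illustration, not the paper's exact design):

```python
import torch
import torch.nn as nn

def competitive_query_selection(feat_ir, feat_vis, score_head, num_queries=300):
    """Illustrative sketch: encoder tokens of both modalities compete, and
    the top-scoring tokens (whichever modality they come from) initialize
    the decoder's content queries.

    feat_ir, feat_vis: (B, N, C) flattened encoder tokens per modality.
    score_head: e.g. a shared nn.Linear(C, num_classes) classification head.
    """
    tokens = torch.cat([feat_ir, feat_vis], dim=1)   # (B, 2N, C)
    scores = score_head(tokens).max(-1).values       # best class score per token
    topk = scores.topk(num_queries, dim=1).indices   # (B, K) winning tokens
    # Gather the winning tokens: each query is initialized from whichever
    # modality was more salient at that location.
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                     # (B, K, C)
```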
Our contributions can be summarized as follows:
• We propose a novel infrared-visible object detection method, named DAMS-
Det, which can dynamically focus on dominant modality objects and adap-
tively fuse complementary information.
• We propose a Modality Competitive Query Selection strategy for initializing multimodal queries, which dynamically focuses on the dominant modality of each object and provides useful prior information for the following fusion process.
• We propose a Multispectral Deformable Cross-attention module, which simultaneously mines fine-grained partial complementary information at different semantic levels and adapts to modality misalignment.
• Experiments on four public datasets with different scenarios demonstrate
that the proposed method achieves significant improvement compared with
other state-of-the-art methods.
2 Related Work
Infrared-Visible object detection. Previous research in infrared-visible object detection is primarily built upon single-modality object detection frameworks, which are generally divided into two-stage object detectors, such as Faster R-CNN [28], and one-stage object detectors, such as YOLO [25–27, 31].
In order to fuse the complementary information of infrared and visible im-
ages, Konig et al. [10] introduced a fully convolutional fusion RPN network,
which fused infrared and visible image features by concatenation, and concluded
that halfway fusion obtains better results [16]. On this foundation, some studies designed CNN-based attention modules to better exploit the potential complementarity of infrared and visible images [2, 23, 29]. Additionally, other works
introduced transformer-based fusion modules to capture a more global comple-
mentary relationship between infrared and visible images [5,22, 30, 43].
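For reference, halfway fusion by concatenation as described above can be sketched as follows (a minimal illustration under assumed layer choices, not the exact architecture of [10]):

```python
import torch
import torch.nn as nn

class HalfwayConcatFusion(nn.Module):
    """Minimal sketch of halfway fusion: mid-level backbone features of the
    two modalities are concatenated and reduced back to a single stream."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv merges the concatenated infrared + visible channels.
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # feat_ir, feat_vis: (B, C, H, W) mid-level feature maps
        return self.reduce(torch.cat([feat_ir, feat_vis], dim=1))
```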
In addition to the above methods that directly fuse image features, some works adopted illumination information as global weights to fuse infrared and visible image features, or to post-fuse multi-branch detection results and reduce the impact of interfering information [12, 33, 41]. Considering that the complementary characteristics of different regions can differ, some studies introduced bounding-box-level semantic segmentation [1, 11, 12, 37, 38] or regions-of-interest (ROI) prediction [9, 39, 40] to guide the fusion of different regions. Other works utilized the confidence or uncertainty scores of regions to post-fuse the predictions of multiple branches [14, 15].
To address the challenge of modality misalignment, Zhang et al. [39,40] devel-
oped the AR-CNN network and explicitly aligned the features of two modalities
by incorporating additional paired bounding box annotations to learn object
[The remaining 16 pages are not shown.]