reasoning process, thus limiting the improvement of object detection.
To solve the above problem, several studies have been devoted to introducing the attention mechanism into the relationship searching process to improve object detection performance. A notable work named DETR [15] employs the encoder and decoder modules of the Transformer [16] to explore the dependencies within object queries and thereby enhance the object representations. Several studies, including [17–22], are devoted to optimizing the inference efficiency of DETR [15]. Despite the progress made by these works, they mainly focus on establishing dependencies within the visual feature space, while the relationships within the label embedding space and those across the two spaces are not fully investigated, thus limiting the improvement of object detection.
Aiming to address the issues mentioned above, we propose a novel
learnable inner-inter relational reasoning framework (IRR) for object
detection, which fully explores the inner- and inter-relationships within and between the visual feature space and the label embedding space. Specifically, instead of building graph patterns on the extracted proposals or category labels, we regard them as two independent sets in the visual feature space and the label embedding space, respectively. Subsequently, inspired by the multi-head attention strategy, we develop a self-attention module to model the inner-relationships between each pair of elements within the two sets, which are utilized to generate the structural representations for each set. We also design a cross-attention module to explore the inter-relationships across the two spaces and form the attributes in one set based on the computed relationships with the other set. After several iterations of the self-attention and cross-attention modules, the final representations of the two sets are used to predict the classes and regress the locations of the extracted bounding boxes.
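For intuition, the PyTorch-style sketch below illustrates one reasoning iteration of this scheme: self-attention within each set followed by cross-attention across the two sets, with residual updates. The module name IRRBlock, the shared hidden dimension and the exact layer layout are illustrative assumptions, not the precise design detailed in the remainder of the paper.

```python
import torch
import torch.nn as nn

class IRRBlock(nn.Module):
    """One reasoning iteration (schematic): self-attention inside each set,
    then cross-attention between the visual set and the label set."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_lab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_lab = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lab):
        # vis: (B, N_proposals, dim) visual set, lab: (B, N_labels, dim) label set
        vis = vis + self.self_vis(vis, vis, vis)[0]   # inner-relationships (visual space)
        lab = lab + self.self_lab(lab, lab, lab)[0]   # inner-relationships (label space)
        vis = vis + self.cross_vis(vis, lab, lab)[0]  # inter-relationships: proposals attend to labels
        lab = lab + self.cross_lab(lab, vis, vis)[0]  # inter-relationships: labels attend to proposals
        return vis, lab

# Several stacked iterations; the outputs would feed the classification
# and box-regression heads of the host detector.
blocks = nn.ModuleList([IRRBlock() for _ in range(3)])
vis, lab = torch.randn(2, 100, 256), torch.randn(2, 80, 256)
for blk in blocks:
    vis, lab = blk(vis, lab)
```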
To evaluate the proposed IRR framework, we embed it into several state-of-the-art region-based object detection methods and report their performance on two public benchmarks, Pascal VOC [23] and MS-COCO [24]. Experimental results reveal that, when combined with our IRR module, the performance of these methods is consistently improved on both datasets, which demonstrates the effectiveness and flexibility of the proposed model.
In summary, with the proposed learnable inner-inter relational rea-
soning framework for object detection, this paper makes the following
main contributions:
• Instead of building graph patterns for the extracted proposals or category labels, we regard them as two independent sets, which enables us to model the inner- and inter-relationships and capture long-range dependencies between them.
• With the multi-head attention strategy, we develop two relational reasoning modules, namely a self-attention module and a cross-attention module, which form the representative attributes for the two sets according to the inner- and inter-relationships.
• Experimental results on two public object detection benchmarks reveal that the proposed model can consistently improve the performance of state-of-the-art region-based methods.
2. Related works
For decades, many works have been devoted to improving the quality of object detection, including feature enhancement [25,26], false positive elimination [27] and context information aggregation [28]. Recently, solving object detection with learnable relationships between extracted proposals and category labels has attracted increasing research attention. Therefore, we focus on reviewing the recently proposed relation-based object detection methods, which are mainly designed on top of graph message passing mechanisms and attention frameworks, and refer readers to the comprehensive surveys [29–31] for other types of works.
2.1. Object detection via graph neural networks
The methods that explore the relationships using graph message passing strategies generally build graph patterns in a heuristic manner on the visual feature space and/or the label space, and employ well-designed graph neural networks to aggregate and update the representations in the two spaces. For example, Liu et al. [7] proposed a Structure Inference Network (SIN) which learns a fully-connected graph with stacked Gated Recurrent Unit (GRU) [32] cells to perform relational reasoning. Du et al. [9] explicitly established a graph structure based on the spatial positions of the regions. The features of each region are regarded as a node in the graph, and the feature representations are enhanced by fusing the features from neighbors through graph convolutional networks [33]. Li et al. [12] leveraged local and global relations between region proposals to facilitate feature enhancement. In addition, Xu et al. [11] proposed a spatial-aware graph relation network (SGRN) to implicitly establish a graph structure and perform adaptive graph reasoning on regions, where context information and spatial information are extracted and propagated through graphs.
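For clarity, the snippet below sketches the generic GCN-style feature fusion that such region-graph methods build upon: each proposal is a node, a handcrafted adjacency encodes spatial relations, and neighbor features are aggregated and projected. This is a schematic illustration under these assumptions, not the implementation of any particular method cited above.

```python
import torch
import torch.nn as nn

def gcn_layer(x, adj, weight):
    """One GCN-style message-passing step over region proposals.
    x:      (N, d)     region features, one node per proposal
    adj:    (N, N)     handcrafted adjacency (e.g. from spatial positions)
    weight: (d, d_out) learnable projection
    """
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    agg = (adj / deg) @ x            # aggregate neighbor features (row-normalized)
    return torch.relu(agg @ weight)  # update node representations

# Toy usage: 5 proposals with 256-d features and a fixed adjacency with self-loops.
x = torch.randn(5, 256)
adj = (torch.rand(5, 5) > 0.5).float()
adj.fill_diagonal_(1.0)
w = nn.Parameter(torch.randn(256, 256) * 0.01)
x = gcn_layer(x, adj, w)
```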
There are also several works [14,13] that explore relationships between category labels for relational reasoning. For example, Xu et al. [14] employed the prior co-occurrence between labels to construct a label graph where the word embeddings are taken as the raw features, and GCN-style modules are adopted to perform feature fusion to enhance the label representations. Similarly, OD-GCN [13] is an object detection framework that develops a GCN module with adaptive parameters to modify the classification results based on a constructed category knowledge graph. Despite the progress made by these graph-based object detection methods, the graph structures defined in their frameworks are generally handcrafted and fixed, which may introduce unreliable relationships and makes them inflexible in capturing long-range dependencies, thus limiting their performance in object detection.
2.2. Object detection with attention framework
Recently, with the growing interest in utilizing attention mechanisms for sequence data, more and more research attention has been devoted to designing attention-based object detection methods. DETR [15] borrowed the encoder module of the Transformer into the object detection pipeline to automatically distinguish positive samples from negative ones, which simplifies the process of extracting candidate boxes. Although DETR [15] achieves excellent performance in object detection, it suffers from a low convergence rate in training. Thus, its training process requires large-scale training samples and more time to obtain the optimal parameters, which limits its adaptability and scalability.
Aiming to address the issues mentioned above, several works are
devoted to simplifying the training process and improving the conver-
gence rate for attention-based object detection models. For example,
Deformable-DETR [17] introduced the sparse spatial sampling of deformable convolution into the attention-based object detector, which makes the model focus on key sampling points and thus accelerates convergence in model training. UP-DETR [19] proposed to first pre-train the DETR model [15] on large-scale datasets in an unsupervised manner and then fine-tune it on labeled data to improve the convergence speed. To efficiently simplify the DETR framework [15], Gao et al. [20] proposed a Spatially Modulated Co-Attention mechanism that focuses on a few regions inside the estimated boxes by learning a Gaussian-like weight map centered on the reference point according to the decoder embeddings. Moreover, Meng et al. [21] proposed a conditional DETR that deduces the spatial range of the distinct regions used for object classification and box regression by learning conditional spatial queries from the decoder content embeddings. Jiang et al. [22] proposed a Guided Query Position (GQPos) scheme that iteratively embeds the latest location information of query objects, and a Similar Attention (SiA) mechanism that fuses multi-scale attention weight