reasoning process, thus limiting the improvement of object detection.
To solve the above problem, several studies have been devoted to introducing the attention mechanism into the relationship searching process to improve object detection performance. A notable work named DETR [15] employs the encoder and decoder modules of the Transformer [16] to explore the dependencies within object queries and thereby enhance the object representations. Several studies, including [17–22], are devoted to optimizing the inference efficiency of DETR [15]. Despite the progress made by these works, they mainly focus on establishing dependencies within the visual feature space, while the relationships within the label embedding space and those across the two spaces are not fully investigated, thus limiting the improvement of object detection.
Aiming to address the issues mentioned above, we propose a novel
learnable inner-inter relational reasoning framework (IRR) for object
detection, which fully explores the inner- and inter-relationships within and between the visual feature space and the label embedding space. Specifically, instead of building graph patterns on the extracted proposals or category labels, we regard them as two independent sets in the visual feature space and the label embedding space, respectively. Subsequently, inspired by the multi-head attention strategy, we develop a self-attention module to model the inner-relationships between each pair of elements within the two sets, which are utilized to generate the structural representations for each set. We also design a cross-attention module to explore the inter-relationships across the two spaces and form the attributes in one set based on the computed relationships with the other set. After several iterations of the self-attention and cross-attention modules, the final representations of the two sets are used to predict the classes and regress the locations of the extracted bounding boxes.
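For intuition, the PyTorch-style sketch below illustrates one reasoning iteration of this scheme: self-attention within each set followed by cross-attention across the two sets, with residual updates. The module name IRRBlock, the shared hidden dimension and the exact layer layout are illustrative assumptions, not the precise design detailed in the remainder of the paper.

```python
import torch
import torch.nn as nn

class IRRBlock(nn.Module):
    """One reasoning iteration (schematic): self-attention inside each set,
    then cross-attention between the visual set and the label set."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_lab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_lab = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lab):
        # vis: (B, N_proposals, dim) visual set, lab: (B, N_labels, dim) label set
        vis = vis + self.self_vis(vis, vis, vis)[0]   # inner-relationships (visual space)
        lab = lab + self.self_lab(lab, lab, lab)[0]   # inner-relationships (label space)
        vis = vis + self.cross_vis(vis, lab, lab)[0]  # inter-relationships: proposals attend to labels
        lab = lab + self.cross_lab(lab, vis, vis)[0]  # inter-relationships: labels attend to proposals
        return vis, lab

# Several stacked iterations; the outputs would feed the classification
# and box-regression heads of the host detector.
blocks = nn.ModuleList([IRRBlock() for _ in range(3)])
vis, lab = torch.randn(2, 100, 256), torch.randn(2, 80, 256)
for blk in blocks:
    vis, lab = blk(vis, lab)
```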
To evaluate the proposed IRR framework, we embed it into several state-of-the-art region-based object detection methods and report their performance on two public benchmarks, Pascal VOC [23] and MS-COCO [24]. Experimental results reveal that, when combined with our IRR module, the performance of these methods is consistently improved on both datasets, which demonstrates the effectiveness and flexibility of the proposed model.
In summary, with the proposed learnable inner-inter relational rea-
soning framework for object detection, this paper makes the following
main contributions:
• Instead of building graph patterns for the extracted proposals or category labels, we regard them as two independent sets, which enables us to model the inner- and inter-relationships and capture long-range dependencies between them.
• With the multi-head attention strategy, we develop two relational reasoning modules, namely a self-attention module and a cross-attention module, which form the representative attributes for the two sets according to the inner- and inter-relationships.
• Experimental results on two public object detection benchmarks reveal that the proposed model can consistently improve the performance of state-of-the-art region-based methods.
2. Related works
For decades, many works have been devoted to improving the quality of object detection, including feature enhancement [25,26], false positive elimination [27] and context information aggregation [28]. Recently, solving object detection with learnable relationships between extracted proposals and category labels has attracted increasing research attention. Therefore, we focus on reviewing the recently proposed relation-based object detection methods, which are mainly designed on top of graph message passing mechanisms and attention frameworks, and refer readers to the comprehensive surveys [29–31] for other types of works.
2.1. Object detection via graph neural networks
The methods that explore the relationships using graph message passing strategies generally build graph patterns in a heuristic manner on the visual feature space and/or the label space, and employ well-designed graph neural networks to aggregate and update the representations in the two spaces. For example, Liu et al. [7] proposed a Structure Inference Network (SIN) which learns a fully-connected graph with stacked Gated Recurrent Unit (GRU) [32] cells to perform relational reasoning. Du et al. [9] explicitly established a graph structure based on the spatial positions of the regions. The features of each region are regarded as a node in the graph, and the feature representations are enhanced by fusing the features from neighbors through graph convolutional networks [33]. Li et al. [12] leveraged local and global relations between region proposals to facilitate feature enhancement. In addition, Xu et al. [11] proposed a spatial-aware graph relation network (SGRN) to implicitly establish a graph structure and perform adaptive graph reasoning on regions, where context information and spatial information are extracted and propagated through graphs.
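For clarity, the snippet below sketches the generic GCN-style feature fusion that such region-graph methods build upon: each proposal is a node, a handcrafted adjacency encodes spatial relations, and neighbor features are aggregated and projected. This is a schematic illustration under these assumptions, not the implementation of any particular method cited above.

```python
import torch
import torch.nn as nn

def gcn_layer(x, adj, weight):
    """One GCN-style message-passing step over region proposals.
    x:      (N, d)     region features, one node per proposal
    adj:    (N, N)     handcrafted adjacency (e.g. from spatial positions)
    weight: (d, d_out) learnable projection
    """
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    agg = (adj / deg) @ x            # aggregate neighbor features (row-normalized)
    return torch.relu(agg @ weight)  # update node representations

# Toy usage: 5 proposals with 256-d features and a fixed adjacency with self-loops.
x = torch.randn(5, 256)
adj = (torch.rand(5, 5) > 0.5).float()
adj.fill_diagonal_(1.0)
w = nn.Parameter(torch.randn(256, 256) * 0.01)
x = gcn_layer(x, adj, w)
```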
There are also several works [14,13] that explore relationships between category labels for relational reasoning. For example, Xu et al. [14] employed the prior co-occurrence between labels to construct a label graph where the word embeddings are taken as the raw features, and GCN-style modules are adopted to perform feature fusion to enhance the label representations. Similarly, OD-GCN [13] is an object detection framework that develops a GCN module with adaptive parameters to modify the classification results based on a constructed category knowledge graph. Despite the progress made by these graph-based object detection methods, the graph structures defined in their frameworks are generally handcrafted and fixed, which may introduce unreliable relationships and makes them inflexible in capturing long-range dependencies, thus limiting their performance in object detection.
2.2. Object detection with attention framework
Recently, with the growing interest in utilizing attention mechanisms for sequence data, more and more research attention has been devoted to designing attention-based object detection methods. DETR [15] borrowed the encoder module of the Transformer into the object detection pipeline to automatically distinguish positive samples from negative ones, which simplifies the process of extracting candidate boxes. Although DETR [15] achieves excellent performance in object detection, it suffers from a low convergence rate in training. Thus, its training process requires large-scale training samples and more time to obtain the optimal parameters, which limits its adaptability and scalability.
Aiming to address the issues mentioned above, several works are
devoted to simplifying the training process and improving the conver-
gence rate for attention-based object detection models. For example,
Deformable-DETR [17] introduced the sparse spatial sampling of deformable convolution into the attention-based object detector, which makes the model focus on key sampling points and thus accelerates convergence in model training. UP-DETR [19] proposed to first pre-train the DETR model [15] on large-scale datasets in an unsupervised manner and then fine-tune it on labeled data to improve the convergence speed. To efficiently simplify the DETR framework [15], Gao et al. [20] proposed a Spatially Modulated Co-Attention mechanism that focuses on a few regions inside the estimated boxes by learning a Gaussian-like weight map centered on the reference point according to the decoder embeddings. Moreover, Meng et al. [21] proposed a conditional DETR that deduces the spatial range of the distinct regions used for object classification and box regression by learning conditional spatial queries from the decoder content embeddings. Jiang et al. [22] proposed a Guided Query Position (GQPos) scheme that iteratively embeds the latest location information of query objects, and a Similar Attention (SiA) mechanism that fuses multi-scale attention weight