RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder
Cheng Chi∗
Institute of Automation, CAS
chicheng15@mails.ucas.ac.cn
Fangyun Wei
Microsoft Research Asia
fawe@microsoft.com
Han Hu
Microsoft Research Asia
hanhu@microsoft.com
Abstract
Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine these representations in a single framework to make good use of each strength, due to the heterogeneous or non-grid feature extraction by different representations. This paper presents an attention-based decoder module similar to that in Transformer [31] to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of key instances to strengthen the main query representation features in the vanilla detectors. Novel techniques are proposed towards efficient computation of the decoder module, including a key sampling approach and a shared location embedding approach. The proposed module is named bridging visual representations (BVR). It can work in-place, and we demonstrate its broad effectiveness in bridging other representations into prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS, where about 1.5 ∼ 3.0 AP improvements are achieved. In particular, we improve a state-of-the-art framework with a strong backbone by about 2.0 AP, reaching 52.7 AP on COCO test-dev. The resulting network is named RelationNet++. The code is available at https://github.com/microsoft/RelationNet2.
1 Introduction
Object detection is a vital problem in computer vision that many visual applications build on. While
there have been numerous approaches towards solving this problem, they usually leverage a single
visual representation format. For example, most object detection frameworks [9, 8, 24, 18] utilize the rectangle box to represent object hypotheses in all intermediate stages. Recently, there have also been some frameworks adopting points to represent an object hypothesis, e.g., center point in CenterNet [38] and FCOS [29], point set in RepPoints [35, 36, 3] and PSN [34]. In contrast to representing whole objects, some keypoint-based methods, e.g., CornerNet [15], leverage part representations of
corner points to compose an object. In general, different representation methods usually steer the
detectors to perform well in different aspects. For example, the bounding box representation is better
aligned with annotation formats for object detection. The center representation avoids the need for an
anchoring design and is usually friendly to small objects. The corner representation is usually more
accurate for finer localization.
∗The work was done while Cheng Chi was an intern at Microsoft Research Asia.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Figure 1: (a) An illustration of bridging various representations, specifically leveraging corner/center representations to enhance the anchor box features. (b) Object/part representations used in object detection (geometric description and feature extraction): anchor bounding box, proposal bounding box, object center, and corners. The red dashed box denotes ground-truth.

It is natural to raise a question: could we combine these representations into a single framework to make good use of each strength? Noticing that different representations and their feature extractions are usually heterogeneous, such a combination is difficult. To address this issue, we present an attention-based decoder module similar to that in Transformer [31], which can effectively model dependency
between heterogeneous features. The main representations in an object detector are set as the query
input, and other visual representations act as the auxiliary keys to enhance the query features by
certain interactions, where both appearance and geometry relationships are considered.
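To make the query-key interaction concrete, the sketch below implements a generic decoder-style attention with an additive appearance + geometry score. This is our simplified reading of the idea, not the paper's exact formulation (which details multi-head attention, the location embedding, and efficiency tricks separately); all function and parameter names here are ours.

```python
import torch
import torch.nn.functional as F

def bvr_attention(query_feats, key_feats, query_pos, key_pos,
                  proj_q, proj_k, proj_v, geom_mlp):
    """Enhance query features (e.g., per-anchor features) with key features
    (e.g., corner/center point features) via decoder-style attention.

    query_feats: (Nq, C)   key_feats: (Nk, C)
    query_pos:   (Nq, 2)   key_pos:   (Nk, 2)  -- 2-d locations
    """
    q, k, v = proj_q(query_feats), proj_k(key_feats), proj_v(key_feats)

    # Appearance term: scaled dot-product similarity between queries and keys.
    appearance = q @ k.t() / q.shape[-1] ** 0.5          # (Nq, Nk)

    # Geometry term: embed pairwise relative locations with a small MLP.
    rel = query_pos[:, None, :] - key_pos[None, :, :]    # (Nq, Nk, 2)
    geometry = geom_mlp(rel).squeeze(-1)                 # (Nq, Nk)

    attn = F.softmax(appearance + geometry, dim=-1)      # (Nq, Nk)
    return query_feats + attn @ v                        # residual enhancement
```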
In general, all feature map points can act as corner/center key instances, which are usually too
many for practical attention computation. In addition, the pairwise geometry term is both computation- and memory-consuming. To address these issues, two novel techniques are proposed, including a
key sampling approach and a shared location embedding approach for efficient computation of the
geometry term. The proposed module is named bridging visual representations (BVR).
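A minimal sketch of the key sampling idea (names ours): only the top-k highest-scoring corner/center predictions are kept as keys, rather than all feature map points. The shared location embedding, roughly, computing location embeddings once over a shared grid of relative locations and looking them up rather than embedding every query-key pair, is omitted here.

```python
import torch

def sample_keys(score_map, feat_map, k=50):
    """Keep only the k highest-scoring points (e.g., predicted corners or
    centers) on a feature map to serve as attention keys.

    score_map: (H, W) pointwise scores; feat_map: (C, H, W)
    Returns key features (k, C) and key locations (k, 2) as (x, y).
    """
    H, W = score_map.shape
    scores, idx = score_map.flatten().topk(k)
    ys, xs = idx // W, idx % W
    key_feats = feat_map[:, ys, xs].t()                  # (k, C)
    key_pos = torch.stack([xs, ys], dim=-1).float()      # (k, 2)
    return key_feats, key_pos
```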
Figure 1a illustrates the application of this module to bridge center and corner representations into an
anchor-based object detector. The center and corner representations act as key instances to enhance
the anchor box features, and the enhanced features are then used for category classification and
bounding box regression to produce the detection results. The module can work in-place. Compared
with the original object detector, the main change is that the input features for classification and
regression are replaced by the enhanced features, and thus the strengthened detector largely maintains
its convenience in use.
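Putting the two sketches above together, a toy forward pass illustrates the in-place property; every shape and module choice below is an illustrative assumption, not the paper's configuration.

```python
import torch
import torch.nn as nn

C, H, W, Nq = 256, 32, 32, 10
feat_map = torch.randn(C, H, W)          # one feature map level, say
corner_scores = torch.rand(H, W)         # predicted corner score map
anchor_feats = torch.randn(Nq, C)        # features of Nq anchor boxes
anchor_pos = torch.rand(Nq, 2) * H       # anchor center locations

key_feats, key_pos = sample_keys(corner_scores, feat_map, k=50)
enhanced = bvr_attention(
    anchor_feats, key_feats, anchor_pos, key_pos,
    proj_q=nn.Linear(C, C), proj_k=nn.Linear(C, C), proj_v=nn.Linear(C, C),
    geom_mlp=nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1)))

# 'enhanced' simply replaces 'anchor_feats' as the input to the detector's
# existing classification and regression heads; nothing else changes.
```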
The proposed BVR module is general. It is applied to various prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS. Extensive experiments on the COCO dataset [19] show that the BVR module substantially improves these various detectors by 1.5 ∼ 3.0 AP. In particular, we improve a strong ATSS detector by about 2.0 AP with small overhead, reaching 52.7
AP on COCO test-dev. The resulting network is named RelationNet++, which strengthens the
relation modeling in [12] from bbox-to-bbox to across heterogeneous object/part representations.
The main contributions of this work are summarized as:
• A general module, named BVR, to bridge various heterogeneous visual representations and combine the strengths of each. The proposed module can be applied in-place and does not break the overall inference process by the main representations.
• Novel techniques to make the proposed bridging module efficient, including a key sampling approach and a shared location embedding approach.
• Broad effectiveness of the proposed module for four prevalent object detectors: RetinaNet, Faster R-CNN, FCOS and ATSS.
2 A Representation View for Object Detection
2.1 Object / Part Representations
Object detection aims to find all objects in a scene with their location described by rectangle
bounding boxes. To discriminate object bounding boxes from background and to categorize objects,
intermediate geometric object/part candidates with associated features are required. We refer to the
joint geometric description and feature extraction as the representation, where typical representations
used in object detection are illustrated in Figure 1b and summarized below.
Figure 2: Representation flows for several typical detection frameworks. (a) Faster R-CNN: anchor → proposal → detection. (b) RetinaNet: anchor → detection. (c) FCOS: object center → detection. (d) CornerNet: corner points → grouping → detection.

Object bounding box representation
Object detection uses bounding boxes as the final output. Probably because of this, bounding box is now the most prevalent representation. Geometrically, a bounding box can be described by a 4-d vector, either as center-size $(x_c, y_c, w, h)$ or as opposing corners $(x_{tl}, y_{tl}, x_{br}, y_{br})$. Besides the final output, this representation is also commonly used as initial and intermediate object representations, such as anchors [24, 20, 22, 23, 18] and proposals [9, 4, 17, 11]. For bounding box representations, features are usually extracted by pooling operators within the bounding box area on an image feature map. Common pooling operators include RoIPool [8], RoIAlign [11], and Deformable RoIPool [5, 40]. There are also simplified feature extraction methods, e.g., the box center features are usually employed in the anchor box representation [24, 18].
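The two 4-d parameterizations are equivalent and easy to convert between; for illustration (helper names ours):

```python
def cs_to_corners(box):
    """Center-size (xc, yc, w, h) -> opposing corners (x_tl, y_tl, x_br, y_br)."""
    xc, yc, w, h = box
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

def corners_to_cs(box):
    """Opposing corners -> center-size; the inverse of cs_to_corners."""
    x_tl, y_tl, x_br, y_br = box
    return ((x_tl + x_br) / 2, (y_tl + y_br) / 2, x_br - x_tl, y_br - y_tl)

assert corners_to_cs(cs_to_corners((50.0, 40.0, 20.0, 10.0))) == (50.0, 40.0, 20.0, 10.0)
```

Pooling operators such as RoIAlign are available off-the-shelf, e.g., torchvision.ops.roi_align.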
Object center representation
The 4-d vector space of a bounding box representation is at a scale of $O(H^2 \times W^2)$ for an image with resolution $H \times W$, which is too large to fully process. To reduce the representation space, some recent frameworks [29, 35, 38, 14, 32] use the center point as a simplified representation. Geometrically, a center point is described by a 2-d vector $(x_c, y_c)$, in which the hypothesis space is of the scale $O(H \times W)$, which is much more tractable. For a center point representation, the image feature on the center point is usually employed as the object feature.
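To make these scales concrete: for a $1000 \times 600$ input, the center hypothesis space contains $H \times W = 6 \times 10^5$ points, whereas the box space is on the order of $(H \times W)^2 = 3.6 \times 10^{11}$.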
Corner representation
A bounding box can be determined by two points, e.g., a top-left corner and a bottom-right corner. Some approaches [30, 15, 16, 7, 21, 39, 26] first detect these individual points and then compose bounding boxes from them. We refer to these representation methods as corner representation. The image feature at the corner location can be employed as the part feature.
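As a sketch of the composition step such methods require, the following pairs corners by geometric validity and embedding similarity, loosely following CornerNet's associative grouping; the scalar-embedding form and all names are ours.

```python
def compose_boxes(tl_points, br_points, tl_embeds, br_embeds, thresh=0.5):
    """Pair detected top-left / bottom-right corners into boxes.

    Each corner is an (x, y) tuple with a scalar grouping embedding;
    a pair is kept if it is geometrically valid and the embeddings
    are close (CornerNet-style associative grouping, simplified).
    """
    boxes = []
    for (x1, y1), e1 in zip(tl_points, tl_embeds):
        for (x2, y2), e2 in zip(br_points, br_embeds):
            if x2 > x1 and y2 > y1 and abs(e1 - e2) < thresh:
                boxes.append((x1, y1, x2, y2))
    return boxes

# e.g., two objects whose corners are matched by embedding similarity:
print(compose_boxes([(10, 10), (50, 60)], [(40, 30), (90, 100)],
                    [0.1, 0.9], [0.15, 0.88]))
# -> [(10, 10, 40, 30), (50, 60, 90, 100)]
```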
Summary and comparison
Different representation approaches usually have strengths in different
aspects. For example, object based representations (bounding box and center) are better in category
classification while worse in object localization than part based representations (corners). Object
based representations are also more friendly for end-to-end learning because they do not require
a post-processing step to compose objects from corners as in part-based representation methods.
Comparing different object-based representations, while the bounding box representation enables
more sophisticated feature extraction and multiple-stage processing, the center representation is
attractive due to the simplified system design.
2.2 Object Detection Frameworks in a Representation View
Object detection methods can be seen as evolving intermediate object/part representations until the
final bounding box outputs. The representation flows largely shape different object detectors. Several
major categorizations of object detectors are based on such representation flows, e.g., top-down
(object-based representation) vs bottom-up (part-based representation), anchor-based (bounding
box based) vs anchor-free (center point based), and single-stage (one-time representation flow) vs
multiple-stage (multiple-time representation flow). Figure 2 shows the representation flows of several
typical object detection frameworks, as detailed below.
Faster R-CNN
[24] employs bounding boxes as its intermediate object representations in all stages. At the beginning, multiple anchor boxes at each feature map position are hypothesized to coarsely cover the 4-d bounding box space in an image, i.e., 3 anchor boxes with different aspect ratios.
The image feature vector at the center point is extracted to represent each anchor box, which is
then used for foreground/background classification and localization refinement. After anchor box
selection and localization refinement, the object representation is evolved to a set of proposal boxes,
where the object features are usually extracted by an RoIAlign operator within each box area. The
final bounding box outputs are obtained by localization refinement, through a small network on the
proposal features.
RetinaNet
[18] is a one-stage object detector, which also employs bounding boxes as its intermediate representation. Due to its one-stage nature, it usually requires denser anchor hypotheses, i.e., 9 anchor boxes at each feature map position. The final bounding box outputs are also obtained by applying a localization refinement head network.
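For illustration, a minimal per-position anchor generator in the spirit of both designs; the scale and ratio values below follow commonly cited RetinaNet defaults but should be treated as assumptions.

```python
import math

def make_anchors(base_size, scales=(2**0, 2**(1/3), 2**(2/3)),
                 ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes in center-size form (xc, yc, w, h), centered
    at the origin; shift by each feature map position to tile an image.
    Three ratios alone give Faster R-CNN's 3 anchors per position;
    3 scales x 3 ratios give RetinaNet's 9.
    """
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:                 # r is the aspect ratio h / w
            w = math.sqrt(area / r)
            anchors.append((0.0, 0.0, w, w * r))
    return anchors

print(len(make_anchors(32)))             # 9 anchors per position
```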