Received 17 June 2023, accepted 5 July 2023, date of publication 10 July 2023, date of current version 2 August 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3293532
Recurrent DETR: Transformer-Based Object
Detection for Crowded Scenes
HYEONG KYU CHOI¹, CHONG KEUN PAIK², HYUN WOO KO¹·³, MIN-CHUL PARK¹·³, AND HYUNWOO J. KIM¹
¹Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
²Samsung Electro-Mechanics, Suwon 16674, Republic of Korea
³Center for Opto-Electronic Materials and Device, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
Corresponding author: Hyunwoo J. Kim (hyunwoojkim@korea.ac.kr)
This work was supported in part by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET)
and the Korea Smart Farm Research and Development Foundation (KosFarm) through the Smart Farm Innovation Technology
Development Program by the Ministry of Agriculture, Food and Rural Affairs (MAFRA), and Ministry of Science and ICT (MSIT),
Rural Development Administration (RDA), under Grant 421025-04; in part by the Grant of the Korea Health Technology Research and
Development Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare,
Republic of Korea, under Grant HR20C0021; and in part by Neubla.
ABSTRACT Recent Transformer-based object detectors have achieved remarkable performance on
benchmark datasets, but few have addressed the real-world challenge of object detection in crowded scenes
using transformers. This limitation stems from the fixed query set size of the transformer decoder, which
restricts the model’s inference capacity. To overcome this challenge, we propose Recurrent Detection
Transformer (Recurrent DETR), an object detector that iterates the decoder block to render more predictions
with a finite number of query tokens. Recurrent DETR can adaptively control the number of decoder
block iterations based on the image’s crowdedness or complexity, resulting in a variable-size prediction
set. This is enabled by our novel Pondering Hungarian Loss, which helps the model to learn when additional
computation is required to identify all the objects in a crowded scene. We demonstrate the effectiveness
of Recurrent DETR on two datasets: COCO 2017, which represents a standard setting, and CrowdHuman,
which features a crowded setting. Our experiments on both datasets show that Recurrent DETR achieves
significant performance gains of 0.8 AP and 0.4 AP, respectively, over its base architectures. Moreover,
we conduct comprehensive analyses under different query set size constraints to provide a thorough
evaluation of our proposed method.
INDEX TERMS Computer vision, object detection, detection transformers, dynamic computation.
I. INTRODUCTION
Object detection is a fundamental task in computer vision that
involves both recognizing and localizing objects within an
image. Several CNN-based methods have introduced models
to first locate probable objects and then categorize their
types [1], [2], [3], whereas others have tried to locate and
classify objects simultaneously in a single stage [4], [5], [6].
While convolutional neural networks (CNNs) have achieved
outstanding performance on this task, recent works have
explored the use of Transformer [7] architectures to further
improve model performance. The self-attention mechanism
in Transformers allows for a wider receptive field per layer,
while the architecture’s high capacity yields outstanding
performance.
One of the main challenges with transformer-based object
detectors is their high computational overhead. The burden
stems from the large number of spurious tokens input into the
decoder blocks, which do not necessarily query foreground
objects. Several works have attempted to address this issue
by pruning tokens to improve computational efficiency [8],
[9]. However, few have addressed the real-world challenge of
object detection in crowded scenes, where there are too many
objects to identify with a fixed number of query tokens and
FIGURE 1. Recurrent DETR ponders. Depending on the crowdedness of the image, Recurrent DETR adaptively controls the number of iterations. At each iteration, it detects different objects in different positions.
more computation might be required. Considering that simply increasing the number of query tokens may be infeasible due to resource constraints, other measures to expand the prediction set are necessary.
To address this challenge, we propose Recurrent Detec-
tion Transformer (Recurrent DETR), a learning method
for transformer detectors that iterates the decoder block to
generate more predictions. Recurrent DETR adaptively con-
trols the number of iterations depending on the crowdedness
and difficulty of the input image, resulting in a variable-
size prediction set. That is, our model ponders for longer
computational steps if the scene is crowded with foreground
objects (Figure 1), and vice versa for relatively sparse scenes.
This is enabled by our novel Pondering Hungarian Loss,
which helps the model learn when to halt the decoder iteration
process given the current token state.
To the best of our knowledge, our approach is the first
transformer-based object detector that dynamically computes
predictions based on sample difficulty. It can render a larger
prediction set with only a fixed number of query tokens,
vastly improving usability of previously trained object
detection Transformers. We first evaluate the effectiveness of
Recurrent DETR on the standard object detection benchmark
dataset, COCO 2017 [10]. To further verify our method’s
performance on crowded scenes, we provide experimental
results on CrowdHuman [11] and our self-constructed
CrowdHuman++, which contains four times more objects
per image compared to the original CrowdHuman dataset.
In summary, our contributions are threefold:
• We propose Recurrent DETR, an object detector that
adaptively iterates the decoder block to output a
variable-size prediction set, specifically designed to
handle crowded scenes.
• We formulate the Pondering Hungarian Loss, which
enables dynamic iterations with transformer-based
object detectors. We also introduce the Novelty Bias to
enforce sufficient content difference among iterations,
guaranteeing dynamic halting.
• We improve the performance of base architectures on
COCO 2017 and CrowdHuman benchmark datasets
by significant margins. We also demonstrate Recurrent
DETR’s robustness under various query size settings
through extensive analyses on its learning properties.
Section II provides the background knowledge required for Recurrent DETR. Architectural details and the loss functions of our method can be found in section III, and their effectiveness is evidenced by the experimental results in section IV, where we also provide extensive analyses to further support Recurrent DETR's utility.
II. RELATED WORKS
A. TRANSFORMERS FOR OBJECT DETECTION
The Transformer [7] model, proven to be successful in natural
language processing, has been widely applied across various
tasks including image classification [12], [13], [14], [15],
[16], [17], object detection [8], [9], [18], [19], [20], [21], [22],
[23], [24], [25], human-object interaction detection [26], [27],
[28], [29], [30], and efficient deep learning [31], [32], [33],
[34]. Especially in the object detection domain, DETR [18]
first introduced the transformer module to significantly
boost performance, and Deformable DETR [22] provided an
effective means to gather token information via deformable
attention. Its overall efficiency enabled the adoption of multi-
scale feature maps, thereby improving detection performance
on small objects. Many subsequent works tried to further
enhance detection performance [19], [20], [21], and some
have tried to address the high computational cost [8], [9], [23], [24], [25] by reducing the number of tokens, encoder layers, or decoder layers. Also, several works have studied object detection in crowded scenes with occlusions [35], [36], [37]. On the other hand, transformers
are also actively adopted in various applications including
autonomous driving [38], [39], animal detection and segmen-
tation [40], [41], or produce detection [42], [43].
B. LEARNING TO PONDER
The required amount of computation varies not only by task
but also by sample. Relatively difficult samples may take
longer computation steps, and vice versa for easier ones.
Accordingly, methods to dynamically assign computational
graphs have long been researched. ACT [44] introduced
the pondering mechanism for RNN’s to adaptively control
computation time. Dynamic Routing Neural Networks [45]
is another branch of learning to ponder, where the network
learns to route different paths depending on the input signal.
Similar approaches have been applied for NLP transformers
FIGURE 2. Overall Recurrent DETR architecture. The encoder extracts the input image feature; its parameters are frozen. The decoder (θ) takes the image feature and initial query tokens Q_1 and P_1 to compute the output Z_1, which is used to localize and classify foreground objects (ψ_t). This decoder block is iterated multiple times until the halt controller (ξ) decides to halt, depending on the crowdedness of the scene. For each new iteration, the query updater φ_t updates the positional embeddings. When the iteration ends, the predictions from each computation step are finally unioned.
that adaptively iterate the encoder and decoder structure for
the prediction of word tokens [46]. Recently, PonderNet [47]
proposed a general loss form that learns to dynamically halt
when the target loss is minimized. AdaViT [48] applied these
concepts to learn to halt the processing of background tokens,
which can be redundant for image classification.
III. METHODS
In this section, we introduce the Recurrent Detection Transformer (Recurrent DETR), a simple and intuitive framework that allows
a variable-size prediction set. We first explain prerequisite
loss functions for our method in section III-A. In the follow-
ing section III-B, architectural details are introduced, and the
Pondering Hungarian Loss is proposed in section III-C.
A. PRELIMINARIES
We here describe the two primary loss functions that define
our approach: the Hungarian loss [49] and the Ponder
loss [47].
1) HUNGARIAN LOSS
The transformer-based detection models [9], [18], [22], [23]
formulate object detection as a set prediction problem. Since
detection transformers output a prediction set, the Hungarian
loss [49] is adopted to map the permuted outputs to ground
truth labels. Prior to loss computation, bipartite matching is
performed on the prediction set
$\hat{Y} = \{\hat{y}_i\}_{i=1}^{M}$ and target set $Y$, which is padded with the no object class to equate its set size with the former. Hungarian matching is performed by finding the optimal permutation $\sigma^* \in S_M$ of $\hat{Y}$, where $S_M$ refers to all possible permutations. This is expressed as

$$\sigma^* = \arg\min_{\sigma \in S_M} \sum_{i=1}^{M} \mathcal{L}_{\text{match}}(\hat{Y}_{\sigma(i)}, Y_i) \qquad (1)$$

such that

$$\mathcal{L}_{\text{match}}(\hat{Y}_{\sigma(i)}, Y_i) = \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{cls}}(\hat{Y}_{\sigma(i)}, Y_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{bbox}}(\hat{Y}_{\sigma(i)}, Y_i) \qquad (2)$$

where $\sigma(i)$ maps the original index $i$ to a new permuted index with respect to $\sigma \in S_M$, $c_i$ is the class of the $i$-th prediction $\hat{Y}_i$, and $\mathbb{1}_{\{c_i \neq \varnothing\}}$ is an indicator function that returns 1 if $c_i$ is not the no object class. Then, the Hungarian loss $\mathcal{L}_{\text{hung}}$ is computed as

$$\mathcal{L}_{\text{hung}} = \sum_{i=1}^{M} \left[ \mathcal{L}_{\text{cls}}(\hat{Y}_{\sigma^*(i)}, Y_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{bbox}}(\hat{Y}_{\sigma^*(i)}, Y_i) \right] \qquad (3)$$

where $\mathcal{L}_{\text{cls}}$ refers to the classification loss between the raw prediction set $\hat{Y}_{\sigma^*}$ and the target set $Y$. Specifically, the Focal loss [6] is utilized as the classification loss in Deformable DETR [22]. $\mathcal{L}_{\text{bbox}}$ refers to the bounding box loss, which is the sum of the L1 loss and the generalized IoU loss [50] of the predicted bounding box coordinates.
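As a concrete illustration of Eq. (1), the sketch below solves the bipartite matching with `scipy.optimize.linear_sum_assignment`. It is a simplified stand-in rather than the Deformable DETR implementation: the `hungarian_match` helper, the cost weights, and the omission of the generalized IoU cost term are assumptions made for brevity.

```python
# Minimal sketch of Hungarian matching for set prediction (not the exact
# Deformable DETR implementation; cost terms and weights are illustrative).
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (M, C), pred_boxes: (M, 4), gt_labels: (K,), gt_boxes: (K, 4)."""
    probs = np.exp(pred_logits) / np.exp(pred_logits).sum(-1, keepdims=True)
    # Classification cost: negative probability of the ground-truth class.
    cost_cls = -probs[:, gt_labels]                                             # (M, K)
    # Box cost: L1 distance between predicted and ground-truth boxes
    # (a full implementation would also add a generalized IoU term).
    cost_bbox = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # (M, K)
    cost = 1.0 * cost_cls + 5.0 * cost_bbox                                     # illustrative weights
    pred_idx, gt_idx = linear_sum_assignment(cost)                              # optimal permutation sigma*
    return list(zip(pred_idx, gt_idx))
```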
2) PONDER LOSS
PonderNet [47] provides a general learning scheme that
helps learn to dynamically adapt computation based on
VOLUME 11, 2023 78625
H. K. Choi et al.: Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes
the complexity of the given task. The loss requires each
computational step to take the general form of
$$\hat{y}_t,\; h_{t+1},\; \lambda_t = s(x, h_t), \qquad (4)$$

where the function $s(\cdot, \cdot)$ is the computational step function that takes data $x$ and state $h_t$ as input, and outputs the prediction $\hat{y}_t$, the next state $h_{t+1}$, and the probability $\lambda_t$ by which to halt computation at step $t$. Then, the ponder loss is defined as

$$\mathcal{L}_{\text{ponder}} = \sum_{t=1}^{T} p_t\, \mathcal{L}(\hat{y}_t, y) + \beta\, \mathrm{KL}\big(P(\xi) \,\|\, P_{\text{geom}}(\Lambda)\big), \qquad (5)$$

where $\mathcal{L}$ refers to the task loss, and $p_t$ denotes the probability of halting at computation step $t$, computed as

$$p_t = \lambda_t \prod_{k=1}^{t-1} (1 - \lambda_k), \qquad (6)$$

which is the probability that halting occurs for the first time at step $t$. Moreover, $P(\xi)$ and $P_{\text{geom}}(\Lambda)$ refer to the probability distribution of $p_t$ parameterized by the learnable parameter $\xi$ and the geometric distribution defined by the random variable $\Lambda$, respectively. The Kullback–Leibler divergence term regularizes the learned distribution, $P(\xi)$, to be similar to the geometric distribution parameterized by $\Lambda$.
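To make Eqs. (5) and (6) concrete, the following minimal PyTorch sketch (not the authors' code) turns per-step halting probabilities λ_t into the halting distribution p_t and combines it with per-step task losses; the geometric prior rate `lambda_p` and the weight `beta` are assumed hyperparameters.

```python
# Minimal sketch of the ponder loss of Eqs. (5)-(6); hyperparameters are assumed.
import torch

def ponder_loss(step_losses, lambdas, lambda_p=0.2, beta=0.01, eps=1e-8):
    """step_losses, lambdas: tensors of shape (T,) with task loss and halt prob per step."""
    T = lambdas.shape[0]
    not_halted = torch.cumprod(1.0 - lambdas, dim=0)             # prod_{k<=t} (1 - lambda_k)
    prev_not_halted = torch.cat([torch.ones(1), not_halted[:-1]])
    p = lambdas * prev_not_halted                                 # p_t of Eq. (6)
    # Geometric prior over halting steps, truncated to T steps and renormalized.
    t = torch.arange(T, dtype=torch.float32)
    p_geom = lambda_p * (1.0 - lambda_p) ** t
    p_geom = p_geom / p_geom.sum()
    kl = torch.sum(p * torch.log((p + eps) / (p_geom + eps)))     # KL(P || P_geom)
    return torch.sum(p * step_losses) + beta * kl
```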
Recurrent DETR builds on these two concepts, and its architecture is designed to accommodate the dynamic pondering process. The architectural details are provided in the following section.
B. RECURRENT DETR
Here, we introduce the architectural components of the Recurrent Detection Transformer (Recurrent DETR). Our method can easily be applied to any transformer-based detector model with two additional components: the query updater and the halt controller. As depicted in Figure 2, the two components are applied to the decoder blocks in the detection transformer architecture. The decoder, denoted $\theta$, predicts the foreground objects with $\psi$. Then, the halt controller $\xi$ decides whether to halt with respect to the predicted probability $\lambda_i$, and the query updater $\phi_i$ updates the input query embedding if Recurrent DETR decides to iterate further. In this work, we specifically extend the decoder module in Deformable DETR [22].
1) HALT CONTROLLER
Recurrent DETR iterates the decoder module to render predictions in each iteration (the decoder is equivalent to the function $s$ in Eq. (4)). To adaptively control when to stop iterating, we define the halt controller network. At timestep $t$, the halt decision is sampled by a network that takes the $M$ output token embeddings $Z_t \in \mathbb{R}^{M \times D}$ as input. Global Average Pooling (GAP) is first applied to the output embeddings $Z_t$. Then, the pooled vector is propagated through a two-layer FFN with a sigmoid activation as

$$\lambda_t = \mathrm{sigmoid}\big(\mathrm{FFN}_{\xi}(\mathrm{GAP}(Z_t))\big), \qquad (7)$$

where $\lambda_t$ refers to the halting probability at iteration $t$, and $\xi$ is the parameter of the FFN layers, shared across iterations. At inference, the halting decision is made at the end of each step by sampling from a Bernoulli distribution parameterized by $\lambda_t$. In the training phase, in contrast, the decoder is always iterated up to the maximal step $T$, and the predicted $\lambda_t$ ($t = 1, 2, \ldots, T$) are used for learning. The learning details are discussed in sections III-C and IV-A.
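A minimal PyTorch sketch of the halt controller in Eq. (7) follows; the hidden width and inner activation of the FFN are assumptions, since the text only specifies a two-layer FFN applied to the pooled tokens with a sigmoid output.

```python
# Sketch of the halt controller (Eq. (7)); layer sizes are assumed.
import torch
import torch.nn as nn

class HaltController(nn.Module):
    def __init__(self, d_model=256, d_hidden=256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, z_t):                  # z_t: (batch, M, D) output token embeddings
        pooled = z_t.mean(dim=1)             # Global Average Pooling over the M tokens
        return torch.sigmoid(self.ffn(pooled)).squeeze(-1)   # halting probability lambda_t
```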
2) QUERY UPDATER
For each additional iteration, we update the query tokens. The query token embeddings are the sum of the semantic query $Q_t$ and the positional query $P_t$, which are both learnable. Generally, the decoder module processes the input semantic query $Q_t$, while the positional query $P_t$ is only used to determine the attention values and is not updated during forward propagation. Thus, to encourage exploration of diverse image regions, we also update $P_t$ as

$$P_{t+1} = P_t + \tanh\big(\mathrm{FFN}_{\phi_t}(P_t + Z_t)\big), \qquad (8)$$

where $\phi_t$ refers to the query updater parameter at iteration $t$. The FFN consists of two linear layers with a final hyperbolic tangent activation, tanh, to mitigate aggressive updates. The output tokens from the previous iteration, $Z_t$, are also input into the query updater to provide semantic context in deciding where to explore. In the case of the next-step semantic query $Q_{t+1}$, we directly use $Z_t$.
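The positional query update of Eq. (8) can be sketched as follows; again, the FFN width and inner activation are assumptions not specified above.

```python
# Sketch of the query updater (Eq. (8)); layer sizes are assumed.
import torch
import torch.nn as nn

class QueryUpdater(nn.Module):
    def __init__(self, d_model=256, d_hidden=256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, p_t, z_t):
        # P_{t+1} = P_t + tanh(FFN_phi(P_t + Z_t)); tanh bounds the update step.
        return p_t + torch.tanh(self.ffn(p_t + z_t))
```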
Overall, the linear layers of the halt controller and query updater account for only a small portion of the entire model. The number of additional parameters can be further reduced by sharing the query updater parameters across iterations. Hence, without much increase in model parameters, Recurrent DETR can modify the original detection transformer architecture into a recurrent structure that iteratively renders a larger prediction set. See Algorithm 1 for pseudocode.
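Putting the components together, the inference loop of Algorithm 1 (reproduced later in this section) can be sketched as below. The callables `decoder`, `cls_head`, `bbox_head`, `halt_controller`, and `query_updaters` stand in for the corresponding Deformable DETR modules; their interfaces are assumed for illustration.

```python
# Sketch of Recurrent DETR inference (Algorithm 1); module interfaces are assumed.
import torch

@torch.no_grad()
def recurrent_detr_inference(memory, q, p, decoder, cls_head, bbox_head,
                             halt_controller, query_updaters, max_steps=4):
    """memory: encoder output features; q, p: initial semantic/positional queries (M, D)."""
    predictions = []
    for t in range(max_steps):
        z = decoder(memory, q + p)                   # (M, D) output tokens
        cls = torch.softmax(cls_head(z), dim=-1)     # class scores per query
        boxes = torch.sigmoid(bbox_head(z))          # normalized box coordinates
        predictions.append((cls, boxes))             # union of per-step predictions
        halt_prob = halt_controller(z.unsqueeze(0))  # lambda_t
        if torch.rand(1).item() <= halt_prob.item() or t == max_steps - 1:
            break
        q = z                                        # Q_{t+1} <- Z_t
        p = query_updaters[t](p, z)                  # P_{t+1} via Eq. (8)
    return predictions
```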
C. PONDERING HUNGARIAN LOSS
A proper loss function is crucial to enabling dynamic
computation with the architecture described in Section III-
B. We thus present our Pondering Hungarian Loss to help
Recurrent DETR dynamically determine its iteration steps.
First, we modify the original Ponder loss [47] into an
adequate form, which is then merged with the Hungarian
loss [49].
1) MODIFIED PONDER LOSS
By omitting the KL divergence term in Eq. (5), the ponder
loss with infinite iterations (i.e. without step truncation) can
be expressed as

$$\mathcal{L}_{\text{ponder}} = \sum_{t=1}^{\infty} p_t\, \mathcal{L}_t, \qquad (9)$$

where $\mathcal{L}_t$ is the loss value computed with the prediction outputs at step $t$.
Algorithm 1 Recurrent DETR Inference
Input: $X \in \mathbb{R}^{N \times D}$; $Q_1 \in \mathbb{R}^{M \times D}$; $P_1 \in \mathbb{R}^{M \times D}$
Output: Prediction Set $S$
1: Initialize set $S \leftarrow \{\}$
2: $X \leftarrow \mathrm{Encoder}(X)$
3: for $t = 1, 2, \ldots, T$ do
4:   $Z_t \leftarrow \mathrm{Decoder}_{\theta}(X, Q_t + P_t)$
5:   $Y_{\text{cls}} \leftarrow \mathrm{softmax}(\mathrm{FFN}_{\text{cls}}(Z_t))$
6:   $Y_{\text{bbox}} \leftarrow \mathrm{sigmoid}(\mathrm{FFN}_{\text{bbox}}(Z_t))$
7:   $S \leftarrow S \cup \{(Y_{\text{cls}}^{(i)}, Y_{\text{bbox}}^{(i)})\}_{i=1,\ldots,M}$
8:   $\lambda_t \leftarrow \mathrm{sigmoid}(\mathrm{FFN}_{\xi}(\mathrm{GAP}(Z_t)))$  ▷ halt controller
9:   if $\mathrm{random}(0, 1) \leq \lambda_t$ or $t = T$ then
10:     return $S$
11:   else
12:     $Q_{t+1} \leftarrow Z_t$
13:     $P_{t+1} \leftarrow P_t + \tanh(\mathrm{FFN}_{\phi_t}(P_t + Z_t))$  ▷ query updater
14:   end if
15: end for
Note that the step $t$ predictions include all the outputs from steps 1 to $t$. We then define the relation of two consecutive losses in terms of their value difference as $\mathcal{L}_{t+1} = \mathcal{L}_t + \delta_t$. Applying this to Eq. (9) and developing the equation results in

$$\mathcal{L}_{\text{ponder}} = \mathcal{L}_1 + \sum_{t=1}^{\infty} \delta_t \prod_{i=1}^{t} r_i, \qquad (10)$$

where $r_i = 1 - \lambda_i$ is the continuing probability at step $i$ (see Appendix D for the derivation). Eq. (10) defines the final loss in terms of each step's contribution to the total loss value. We find this exposition particularly useful in our setting, where $\mathcal{L}_{t+1}$ computes the loss with the prediction set that subsumes the predictions from the previous $t$ steps, i.e., $\hat{Y}_1 \subset \cdots \subset \hat{Y}_t \subset \hat{Y}_{t+1} \subset \cdots$. Thus, it is straightforward to consider the effect of additional predictions at each iteration, where the evaluation of each timestep is crucial in deciding when to halt computation.

In practice, the maximum pondering step $T$ is set to prevent computation that never halts. We simply truncate the infinite summation in Eq. (10) as

$$\mathcal{L}_{\text{ponder}} = \mathcal{L}_1 + \sum_{t=1}^{T-1} \delta_t \prod_{i=1}^{t} r_i, \qquad (11)$$

on top of which we merge the Hungarian loss to derive our Pondering Hungarian loss.
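For intuition, the following small sketch evaluates Eq. (11) directly from per-step losses and halting probabilities; it is an illustrative computation, not the training code.

```python
# Sketch of the truncated ponder loss of Eq. (11) using the delta_t / r_i formulation.
import torch

def truncated_ponder_loss(step_losses, lambdas):
    """step_losses: (T,) losses L_t; lambdas: (T,) halting probabilities lambda_t."""
    T = step_losses.shape[0]
    loss = step_losses[0]                               # L_1
    for t in range(T - 1):
        delta_t = step_losses[t + 1] - step_losses[t]   # L_{t+1} - L_t
        r_prod = torch.prod(1.0 - lambdas[: t + 1])     # prod_{i<=t} (1 - lambda_i)
        loss = loss + delta_t * r_prod
    return loss
```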
2) PONDERING HUNGARIAN LOSS
A straightforward way to incorporate the Hungarian loss into Eq. (11) would be to substitute all $\mathcal{L}_i$ ($i = 1, \ldots, T$) computations with the standard $\mathcal{L}_{\text{hung}}$ in Eq. (3). However, this is an inappropriate approach, since the scale of the Hungarian loss generally decreases as the total prediction set size increases. To illustrate, the accumulated predictions at iteration $t+1$ subsume the predictions from the previous iteration $t$. Then, the Hungarian matching at $t+1$ should be at least as optimal as the matching done at step $t$. That is, the $\delta_t$ in Eq. (11) is generally negative, leaving the halt controller with no incentive to ever halt. This can cause redundant computations without much gain in performance. Thus, we need a way to regulate Recurrent DETR to halt iteration at an adequate timestep.
Accordingly, we propose the Novelty Bias, $\Phi_t$. The novelty bias is added to the original $\delta_t$ value in Eq. (11) to redefine

$$\delta_t = \mathcal{L}_{t+1} - \bar{\mathcal{L}}_t + \Phi_t, \qquad (12)$$

where $\bar{\mathcal{L}}_t$ is a stop-gradient of $\mathcal{L}_t$. The bias term sets a threshold on the level of loss difference between two consecutive iterations. That is, the $\delta_t$ value becomes negative only when the loss at $t+1$ decreases by more than $\Phi_t$. Here, the novelty bias is computed as

$$\Phi_t = \frac{1}{\tau} \frac{1}{M} \sum_{i=1}^{M} \left[ \frac{1}{t} \sum_{j=1}^{t} \bar{Z}_j^{(i)\top} \bar{Z}_{t+1}^{(i)} \right], \qquad (13)$$

where $M$ is the number of input query tokens, $\tau$ is a parameter that scales the magnitude of the novelty bias, and $\bar{Z}_t^{(i)}$ is a stop-gradient of the $i$-th output token embedding at decoder iteration step $t$. Note that truncating the gradients in Eq. (12) and Eq. (13) stabilized learning.
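The novelty bias of Eq. (13) can be sketched as follows; the value of the temperature τ is an assumed hyperparameter.

```python
# Sketch of the novelty bias of Eq. (13); tau is an assumed hyperparameter value.
import torch

def novelty_bias(z_history, z_next, tau=1.0):
    """z_history: (t, M, D) token embeddings from steps 1..t;
    z_next: (M, D) token embeddings at step t+1. Both are treated with stop-gradient."""
    z_history = z_history.detach()
    z_next = z_next.detach()
    # Per-token dot products between each past step and the new step, averaged over steps.
    sims = torch.einsum('tmd,md->tm', z_history, z_next).mean(dim=0)   # (M,)
    # Average over the M query tokens and scale by 1/tau.
    return sims.mean() / tau
```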
Furthermore, since Recurrent DETR is designed for crowded scenes, we apply our Pondering Hungarian loss only after the total prediction set size exceeds the target set size. To be specific, if the query set size $M$ is smaller than the given ground-truth target set size $K$, we begin computing the loss at step $I = \lfloor (K-1)/M \rfloor + 1$. Finally, our Pondering Hungarian loss $\mathcal{L}_{\text{PH}}$ is expressed as

$$\mathcal{L}_{\text{PH}} = \mathcal{L}_I + \sum_{t=I}^{T-1} \left( \mathcal{L}_{t+1} - \bar{\mathcal{L}}_t + \Phi_t \right) \prod_{i=1}^{t} r_i, \qquad (14)$$

where $T$ is the maximum iteration allowed for the model.
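Combining the pieces, Eq. (14) can be sketched as below; the per-step Hungarian losses are assumed to be computed externally on the accumulated prediction sets, and the novelty bias values come from a helper such as the one sketched above.

```python
# Sketch of the Pondering Hungarian loss of Eq. (14); helper callables are assumed.
import torch

def pondering_hungarian_loss(step_hung_losses, lambdas, novelty_biases,
                             num_queries, num_targets):
    """step_hung_losses: (T,) Hungarian loss on the accumulated predictions after each step;
    lambdas: (T,) halting probabilities; novelty_biases: (T-1,) Phi_t values."""
    T = step_hung_losses.shape[0]
    start = (num_targets - 1) // num_queries          # I = floor((K-1)/M) + 1, zero-indexed here
    loss = step_hung_losses[start]                    # L_I
    for t in range(start, T - 1):
        # delta_t = L_{t+1} - stopgrad(L_t) + Phi_t, as in Eq. (12).
        delta_t = step_hung_losses[t + 1] - step_hung_losses[t].detach() + novelty_biases[t]
        r_prod = torch.prod(1.0 - lambdas[: t + 1])   # prod_{i<=t} (1 - lambda_i)
        loss = loss + delta_t * r_prod
    return loss
```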
IV. EXPERIMENTS
In this section, we provide experimental results and relevant
analyses on Recurrent DETR to address the following
research questions:
Q1. In what cases is Recurrent DETR useful? (IV-B)
Q2. What are the effects of each component? (IV-C)
Q3. Does Recurrent DETR learn to ponder? (IV-D)
A. IMPLEMENTATION DETAILS
Here, we first provide implementation details of our
experiments, including dataset information and training &
evaluation protocols.
1) DATASETS
We evaluate Recurrent DETR on COCO 2017 [10] and
CrowdHuman [11]. COCO 2017 data samples are relatively
sparse, whereas CrowdHuman contains more densely pop-
ulated target objects per scene. We summarize the dataset
statistics in Table 1.