MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath¹  Mannat Singh²  Yann LeCun¹²³  Gabriel Synnaeve²  Ishan Misra²  Nicolas Carion³
¹NYU Center for Data Science  ²Facebook AI Research  ³NYU Courant Institute
{aish, yann.lecun, nc2794}@nyu.edu, {mannatsingh,imisra,gab}@fb.com
Abstract
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free-form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.
1. Introduction
Object detection forms an integral component of most state-of-the-art multi-modal understanding systems [6, 28], typically used as a black box to detect a fixed vocabulary of concepts in an image, followed by multi-modal alignment. This “pipelined” approach limits co-training with other modalities as context, and restricts the downstream model to accessing only the detected objects rather than the whole image. In addition, the detection system is usually frozen, which prevents further refinement of the model’s perceptive capability. In the vision-language setting, this restricts the vocabulary of the resulting system to the categories and attributes of the detector, and is often a bottleneck for performance on these tasks [72]. As a result, such a system cannot recognize novel combinations of concepts expressed in free-form text.
Figure 1: Output of MDETR for the query “A pink elephant”. The colors are not segmentation masks but the real colors of the pixels. The model has never seen a pink nor a blue elephant in training.
A recent line of work [66, 45, 13] considers the problem of text-conditioned object detection. These methods extend mainstream one-stage and two-stage detection architectures to achieve this goal. However, to the best of our knowledge, it has not been demonstrated that such detectors can improve performance on downstream tasks that require reasoning over the detected objects, such as visual question answering (VQA). We believe this is because these detectors are not end-to-end differentiable and thus cannot be trained in synergy with downstream tasks.
Our method, MDETR, is an end-to-end modulated detector based on the recent DETR [2] detection framework, and performs object detection in conjunction with natural language understanding, enabling truly end-to-end multi-modal reasoning. MDETR relies solely on text and aligned boxes as a form of supervision for concepts in an image. Thus, unlike current detection methods, MDETR detects nuanced concepts from free-form text, and generalizes to unseen combinations of categories and attributes. We showcase such a combination, as well as modulated detection, in Fig. 1.
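To make the early-fusion idea concrete, below is a minimal PyTorch sketch of a text-conditioned, DETR-style detector. It is an illustration under stated assumptions, not the paper’s implementation: the class name ModulatedDetectorSketch, the choice of RoBERTa and ResNet-50 encoders, the projection layers, and the output heads are ours, and positional encodings, padding masks, and the training loss are omitted.

    # Minimal sketch of early fusion for text-conditioned detection.
    # Hypothetical names and encoder choices; not the paper's exact code.
    import torch
    from torch import nn
    from torchvision.models import resnet50
    from transformers import RobertaModel, RobertaTokenizerFast

    class ModulatedDetectorSketch(nn.Module):  # hypothetical name
        def __init__(self, d_model=256, num_queries=100):
            super().__init__()
            # Image backbone: CNN features flattened into a token sequence.
            cnn = resnet50(weights="IMAGENET1K_V1")
            self.backbone = nn.Sequential(*list(cnn.children())[:-2])
            self.img_proj = nn.Conv2d(2048, d_model, kernel_size=1)
            # Text encoder: a pre-trained language model, trained end to end.
            self.tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
            self.text_encoder = RobertaModel.from_pretrained("roberta-base")
            self.txt_proj = nn.Linear(self.text_encoder.config.hidden_size, d_model)
            # Joint transformer: image and text tokens are concatenated, so the
            # encoder fuses both modalities before any box is predicted.
            self.transformer = nn.Transformer(d_model, batch_first=True)
            self.queries = nn.Embedding(num_queries, d_model)
            self.box_head = nn.Linear(d_model, 4)   # normalized (cx, cy, w, h)
            self.align_head = nn.Linear(d_model, d_model)

        def forward(self, images, captions):
            b = images.shape[0]
            img = self.img_proj(self.backbone(images))  # (B, d, H', W')
            img = img.flatten(2).transpose(1, 2)        # (B, H'W', d)
            tok = self.tokenizer(captions, return_tensors="pt", padding=True)
            txt = self.txt_proj(self.text_encoder(**tok).last_hidden_state)
            fused = torch.cat([img, txt], dim=1)        # early fusion
            tgt = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
            hs = self.transformer(fused, tgt)           # (B, num_queries, d)
            boxes = self.box_head(hs).sigmoid()
            # Query-to-token similarities, so each box aligns to a span of the
            # text rather than to a fixed class vocabulary.
            align = self.align_head(hs) @ txt.transpose(1, 2)
            return boxes, align

Because every component in such a sketch is differentiable, gradients from a downstream task can flow back through both the detector and the language encoder, which the pipelined systems discussed above cannot support.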