End-to-End Object Detection with Transformers

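The paper "End-to-End Object Detection with Transformers" explores a new approach to object detection that treats the task as a direct set prediction problem. Traditional detection frameworks usually contain many hand-designed components, such as non-maximum suppression (NMS) and anchor generation, which encode prior knowledge about the task. The DEtection TRansformer (DETR) drops these steps and performs detection in a simpler way.

DETR is built on a transformer encoder-decoder architecture and a set-based loss function. Transformers first achieved strong results in natural language processing; DETR brings them to computer vision to reason about the relations between objects and the global image context. The model takes a fixed set of learned object queries as input and outputs the final set of predictions directly and in parallel, with no post-processing such as NMS.

DETR works as follows: a CNN extracts a set of image features, the transformer encoder processes them, and the decoder produces predictions by interacting with a fixed number of object queries. Each object query can be viewed as a representation of a potential object; the decoder learns how these queries interact with the image features in order to predict each object's bounding box and class. Because the set-based loss relies on bipartite matching, predictions are forced to be unique, so that each ground-truth object corresponds to exactly one prediction. On the COCO object detection dataset, DETR performs on par with a highly optimized Faster R-CNN baseline in both accuracy and run time. DETR also extends easily to unified panoptic segmentation, where it outperforms competitive baselines. Its implementation does not require a specialized library, which makes it more flexible in practice. The paper's main contributions are:

1. A new end-to-end object detection framework that removes the hand-designed components of traditional methods and simplifies the detection pipeline.
2. The use of the transformer architecture, whose sequence modeling ability captures object relations and context in the image.
3. A set-based loss function that uses bipartite matching to resolve duplicate predictions and guarantee unique ones.
4. A demonstration of strong performance on COCO, and of extension to other vision tasks such as panoptic segmentation.

DETR offers later research a more direct way to approach set prediction problems, and it shows how broadly transformers can be applied in computer vision. As deep learning and transformer techniques develop, more models in the style of DETR are likely to appear and to advance object detection and other vision tasks.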

End-to-End Object Detection with Transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko

Facebook AI

Abstract. We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.

1 Introduction

The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37,5], anchors [23], or window centers [53,46]. Their performances are significantly influenced by postprocessing steps to collapse near-duplicate predictions, by the design of the anchor sets and by the heuristics that assign target boxes to anchors [52]. To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks. This paper aims to bridge this gap.

Equal contribution

arXiv:2005.12872v3 [cs.CV] 28 May 2020

2 Carion et al .

[Figure 1 diagram: image -> CNN -> set of image features -> transformer encoder-decoder -> set of box predictions (including "no object (ø)" entries) -> bipartite matching loss]

Fig. 1: DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions to ground truth boxes. A prediction with no match should yield a "no object" (∅) class prediction.
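As an illustration of the caption's last sentence, the sketch below builds per-slot class targets once a one-to-one matching is known. The function name, the `NO_OBJECT` marker, and the example matching are hypothetical and chosen only for this example; they are not taken from the paper's code.

```python
NO_OBJECT = "no object"  # stands in for the "no object" class of Fig. 1


def build_class_targets(num_predictions, matches, gt_labels):
    """Return one class target per prediction slot.

    `matches` maps a prediction index to a ground-truth index (one-to-one).
    Every slot that received no match is assigned NO_OBJECT.
    """
    targets = [NO_OBJECT] * num_predictions
    for pred_idx, gt_idx in matches.items():
        targets[pred_idx] = gt_labels[gt_idx]
    return targets


gt_labels = ["bird", "boat"]
matches = {3: 0, 0: 1}  # hypothetical matching: slot 3 <-> bird, slot 0 <-> boat
targets = build_class_targets(5, matches, gt_labels)
print(targets)  # ['boat', 'no object', 'no object', 'bird', 'no object']
```

With five prediction slots and two ground-truth objects, three slots end up with the "no object" target.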

We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers [47], a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.

Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression. Unlike most existing detection methods, DETR doesn't require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.

Compared to most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [29,12,10,8]. In contrast, previous work focused on autoregressive decoding with RNNs [43,41,30,36,42]. Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.
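The unique assignment and the permutation invariance can be sketched with a toy matcher. The code below is a minimal illustration under stated assumptions: it searches all assignments by brute force rather than using an efficient matching algorithm, the pairwise cost is a plain L1 box distance chosen for simplicity, and the names `l1_cost` and `match` are hypothetical. It assumes there are at least as many predictions as ground-truth boxes.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """Toy pairwise cost: L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match(preds, gts):
    """Exhaustively find the cheapest one-to-one assignment.

    Returns (total_cost, {prediction_index: ground_truth_index}).
    Brute force is only practical for a handful of boxes.
    """
    best_cost, best = float("inf"), {}
    for pred_indices in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_cost(preds[p], gts[g]) for g, p in enumerate(pred_indices))
        if cost < best_cost:
            best_cost = cost
            best = {p: g for g, p in enumerate(pred_indices)}
    return best_cost, best


gts = [(0.2, 0.2, 0.1, 0.1), (0.7, 0.6, 0.3, 0.4)]
preds = [(0.7, 0.6, 0.3, 0.5), (0.5, 0.5, 0.9, 0.9), (0.2, 0.25, 0.1, 0.1)]

cost, assignment = match(preds, gts)
# Reordering the predictions changes the indices but not the matched cost.
cost_reversed, _ = match(preds[::-1], gts)
print(assignment, abs(cost - cost_reversed) < 1e-9)
```

Each ground-truth box receives exactly one prediction, and the total cost does not depend on the order in which the predictions are listed, which is what allows them to be emitted in parallel.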

We evaluate DETR on one of the most popular object detection datasets, COCO [24], against a very competitive Faster R-CNN baseline [37]. Faster R-CNN has undergone many design iterations and its performance has greatly improved since the original publication. Our experiments show that our new model achieves comparable performance. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performance on small objects. We expect that future work will improve this aspect in the same way the development of FPN [22] did for Faster R-CNN.

Training settings for DETR differ from standard object detectors in multiple ways. The new model requires an extra-long training schedule and benefits from auxiliary decoding losses in the transformer. We thoroughly explore what components are crucial for the demonstrated performance.

In our work we use standard implementations of Transformers [47] and ResNet [15] backbones from standard deep learning libraries.

The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pre-trained DETR outperforms competitive baselines on Panoptic Segmentation [19], a challenging pixel-level recognition task that has recently gained popularity.

2 Related work

Our work builds on prior work in several domains: bipartite matching losses for set prediction, encoder-decoder architectures based on the transformer, parallel decoding, and object detection methods.

2.1 Set Prediction

There is no canonical deep learning model to directly predict sets. The basic set prediction task is multilabel classification (see e.g., [40,33] for references in the context of computer vision) for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). The first difficulty in these tasks is to avoid near-duplicates. Most current detectors use postprocessing such as non-maximal suppression to address this issue, but direct set prediction methods are postprocessing-free. They need global inference schemes that model interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [48]. In all cases, the loss function should be invariant to a permutation of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [20], to find a bipartite matching between ground-truth and prediction. This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.
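The matching described in this subsection can be written compactly. The notation below is an assumption introduced for illustration: $y_i$ denotes the $i$-th ground-truth element (with the ground-truth set padded to size $N$ by "no object" entries), $\hat{y}_j$ the $j$-th prediction, $\mathfrak{S}_N$ the set of permutations of $N$ elements, and $\mathcal{L}_{\mathrm{match}}$ a pairwise matching cost. The first expression is the optimal assignment; the second states the permutation invariance, which holds because $\pi \circ \sigma$ ranges over all of $\mathfrak{S}_N$ as $\sigma$ does.

```latex
\hat{\sigma} \;=\; \operatorname*{arg\,min}_{\sigma \in \mathfrak{S}_N}
  \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)

\min_{\sigma \in \mathfrak{S}_N}
  \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\bigl(y_i, \hat{y}_{\pi(\sigma(i))}\bigr)
\;=\;
\min_{\sigma \in \mathfrak{S}_N}
  \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)
\qquad \text{for every } \pi \in \mathfrak{S}_N
```

Because $\sigma$ is a permutation, every target element is paired with exactly one prediction, which is the unique-match guarantee stated above.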

2.2 Transformers and Parallel Decoding

Transformers were introduced by Vaswani et al. [47] as a new attention-based building block for machine translation. Attention mechanisms [2] are neural network layers that aggregate information from the entire input sequence. Transformers introduced self-attention layers, which, similarly to Non-Local Neural Networks [49], scan through each element of a sequence and update it by aggregating information from the whole sequence. One of the main advantages of attention-based models is their global computations and perfect memory, which makes them more suitable than RNNs on long sequences. Transformers are now replacing RNNs in many problems in natural language processing, speech processing and computer vision [8,27,45,34,31].
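The "update each element by aggregating information from the whole sequence" step can be sketched in a few lines. This is a deliberately simplified illustration: it assumes the queries, keys, and values are the raw input vectors, whereas a real transformer layer applies learned projections and uses multiple heads. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Replace every element by a weighted average of the whole sequence.

    Simplification (assumption): no learned projections, single head.
    """
    dim = len(sequence[0])
    updated = []
    for query in sequence:
        # Scaled dot-product score against every element, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in sequence]
        weights = softmax(scores)
        updated.append([sum(w * value[d] for w, value in zip(weights, sequence))
                        for d in range(dim)])
    return updated


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens)
print(len(out), len(out[0]))  # 3 2
```

Every output vector depends on all input vectors, so the computation is global, and reordering the inputs simply reorders the outputs in the same way.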

Transformers were first used in auto-regressive models, following early sequence-to-sequence models [44], generating output tokens one by one. However, the prohibitive inference cost (proportional to output length, and hard to batch) led to the development of parallel sequence generation, in the domains of audio [29], machine translation [12,10], word representation learning [8], and more recently speech recognition [6]. We also combine transformers and parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.
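The difference in inference cost can be made concrete with a toy comparison that counts sequential model calls. Everything here is hypothetical scaffolding: `toy_step` and `toy_batch` stand in for a real decoder, and the only point is the number of calls each decoding style needs.

```python
def autoregressive_decode(step_fn, num_outputs):
    """Emit outputs one at a time; each step sees all previous outputs."""
    outputs, calls = [], 0
    for _ in range(num_outputs):
        outputs.append(step_fn(tuple(outputs)))  # one sequential call per output
        calls += 1
    return outputs, calls


def parallel_decode(batch_fn, queries):
    """Emit all outputs from a single call over a fixed set of queries."""
    return batch_fn(queries), 1


def toy_step(previous):
    # Toy "model": the next output depends on how many outputs exist so far.
    return 2 * len(previous)


def toy_batch(queries):
    # Toy "model": every query is processed in the same call.
    return [2 * q for q in queries]


seq_out, seq_calls = autoregressive_decode(toy_step, 4)
par_out, par_calls = parallel_decode(toy_batch, [0, 1, 2, 3])
print(seq_calls, par_calls)  # 4 1
```

The autoregressive loop needs as many dependent calls as there are outputs, while the parallel variant needs one call regardless of the output count.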

2.3 Object detection

Most modern object detection methods make predictions relative to some initial guesses. Two-stage detectors [37,5] predict boxes w.r.t. proposals, whereas single-stage methods make predictions w.r.t. anchors [23] or a grid of possible object centers [53,46]. Recent work [52] demonstrates that the final performance of these systems heavily depends on the exact way these initial guesses are set. In our model we are able to remove this hand-crafted process and streamline the detection process by directly predicting the set of detections with absolute box prediction w.r.t. the input image rather than an anchor.
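To make the contrast concrete, the sketch below decodes a box in both styles. The relative decoder uses the widely used R-CNN-style offset parameterization; the absolute decoder assumes a box given as normalized (cx, cy, w, h) in [0, 1], which is an assumption made for this example rather than something stated in this section. Function names are hypothetical.

```python
import math


def decode_relative(anchor, deltas):
    """Turn offsets (dx, dy, dw, dh) into a box relative to an initial guess.

    `anchor` is (cx, cy, w, h) in pixels; the result depends on how the
    anchor was chosen.
    """
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


def decode_absolute(box, image_size):
    """Turn a normalized (cx, cy, w, h) box into pixels; no anchor involved."""
    width, height = image_size
    cx, cy, w, h = box
    return (cx * width, cy * height, w * width, h * height)


relative_box = decode_relative((100.0, 100.0, 50.0, 80.0), (0.2, -0.1, 0.0, 0.0))
absolute_box = decode_absolute((0.5, 0.5, 0.25, 0.5), (640, 480))
print(relative_box, absolute_box)
```

The relative decoder cannot be evaluated without an anchor or proposal, whereas the absolute decoder needs only the image size.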

Set-based loss. Several object detectors [9,25,35] used the bipartite matching loss. However, in these early deep learning models, the relation between different predictions was modeled with convolutional or fully-connected layers only and a hand-designed NMS post-processing can improve their performance. More recent detectors [37,23,53] use non-unique assignment rules between ground truth and predictions together with an NMS.
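For reference, the kind of hand-designed NMS post-processing mentioned here can be sketched as a greedy filter. This is a generic textbook version, not the implementation of any particular detector; the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box is a near-duplicate of the first and is suppressed; the threshold is exactly the sort of hand-tuned prior that a unique, matching-based assignment avoids.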

Learnable NMS methods [16,4] and relation networks [17] explicitly model relations between different predictions with attention. Using direct set losses, they do not require any post-processing steps. However, these methods employ additional hand-crafted context features like proposal box coordinates to model relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.

Recurrent detectors. Closest to our approach are end-to-end set predictions for object detection [43] and instance segmentation [41,30,36,42]. Similarly to us, they use bipartite-matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage the recent transformers with parallel decoding.

3 The DETR model

Two ingredients are essential for direct set predictions in detection: (1) a set

prediction loss that forces unique matching between predicted and ground truth
