in DETR [6] has no semantic information about specific
objects. However, the track query in Transformer-based
MOT methods like MOTR [43] carries information about a
tracked object. This difference causes a semantic information gap and thus degrades the final tracking performance.
Therefore, to overcome this issue, we use a light decoder to perform preliminary object detection, which outputs detect embeddings with specific semantics. Then we jointly feed the detect and track embeddings into the subsequent decoder to make MeMOTR's tracking results more precise.
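This two-stage decoding idea can be sketched roughly as follows (a hypothetical illustration, not the exact MeMOTR implementation; the module names, layer depths, and tensor shapes are assumptions):

```python
import torch
import torch.nn as nn

class TwoStageDecoder(nn.Module):
    """Sketch: a light detection decoder first turns semantic-free detect
    queries into detect embeddings with object semantics; these are then
    decoded jointly with the track embeddings of already-tracked objects."""
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        # one-layer "light" decoder for preliminary detection
        self.det_decoder = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        # deeper joint decoder refining detect + track embeddings together
        self.joint_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, n_heads, batch_first=True), num_layers=2)

    def forward(self, detect_query, track_embed, img_feat):
        # Stage 1: preliminary detection yields semantically meaningful embeddings.
        detect_embed = self.det_decoder(detect_query, img_feat)
        # Stage 2: concatenate detect and track embeddings and decode them jointly,
        # so both share context from the current image feature.
        joint = torch.cat([detect_embed, track_embed], dim=1)
        return self.joint_decoder(joint, img_feat)

dec = TwoStageDecoder()
out = dec(torch.randn(1, 300, 256),   # learnable detect queries
          torch.randn(1, 5, 256),     # track embeddings of 5 tracked objects
          torch.randn(1, 1000, 256))  # encoder image feature
print(out.shape)  # torch.Size([1, 305, 256])
```

Feeding both embedding types through the same decoder avoids the semantic gap between newly detected and already-tracked objects.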
We mainly evaluate our method on the DanceTrack dataset [30] because of its severe association challenges. Experimental results show that our method achieves state-of-the-art performance on this challenging dataset, especially on association metrics (e.g., AssA, IDF1). We also evaluate our model on the traditional pedestrian tracking dataset MOT17 [23] and the multi-category tracking dataset BDD100K [42]. In addition, we perform extensive ablation studies to further demonstrate the effectiveness of our designs.
2. Related Work
Tracking-by-Detection is a widely used MOT paradigm
that has recently dominated the community. These methods
always get trajectories by associating a given set of detec-
tions in a streaming video.
The objects in classic pedestrian tracking scenarios [9, 23] usually have distinct appearances and regular motion patterns. Therefore, appearance matching and linear motion estimation are widely used to associate targets across consecutive frames. SORT [3] uses the Intersection-over-Union
(IoU) to match predictions of the Kalman filter [34] and
detected boxes. Deep-SORT [35] applies an additional network to extract appearance features, then utilizes cosine distances for matching, in addition to the motion cues of SORT [3].
JDE [33], FairMOT [45], and Unicorn [39] further ex-
plore the architecture of appearance embedding and match-
ing. ByteTrack [44] employs a robust detector based on
YOLOX [12] and reuses low-confidence detections to en-
hance the association ability. Furthermore, OC-SORT [5]
improves SORT [3] by rehabilitating lost targets. In recent years, some studies [36, 48] have also applied Transformers, a trendy framework in vision tasks, to match detection bounding boxes. Moreover, Dendorfer et al. [10] attempt to model pedestrian trajectories by leveraging more
complex motion estimation methods (like S-GAN [14])
from the trajectory prediction task.
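The IoU-based association at the core of these SORT-style trackers can be sketched as follows (a minimal greedy-matching illustration; SORT itself pairs this with a Kalman filter and Hungarian matching, which are omitted here):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily match predicted track boxes to detections by descending IoU."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little to be the same object
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 19, 31, 29), (1, 0, 11, 10)]
print(associate(tracks, dets))  # [(0, 1), (1, 0)]
```

Such purely geometric matching works well when motion is smooth, which is exactly the assumption that breaks down in the irregular-motion scenarios discussed next.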
Benefiting from robust detectors, the methods described above have powerful detection capabilities. However, although they achieve outstanding performance on pedestrian tracking datasets, they struggle with more complex scenarios involving irregular movements. Such unforeseeable motion patterns cause the trajectory estimation and prediction modules to fail.
Tracking-by-Query usually does not require additional
post-processing to associate detection results. Unlike the
tracking-by-detection paradigm mentioned above, tracking-
by-query methods apply the track query to decode the loca-
tion of tracked objects progressively.
Inspired by the DETR family [6], most of these methods [22, 43] leverage the learnable object query to perform newborn object detection, while the track query localizes the position of tracked objects. TransTrack [31]
builds a siamese network for detection and tracking, then
applies an IoU matching to produce newborn targets.
TrackFormer [22] utilizes the same Transformer decoder
for both detection and tracking, then employs a non-
maximum suppression (NMS) with a high IoU threshold
to remove strongly overlapping duplicate bounding boxes.
MOTR [43] builds an elegant and fully end-to-end Trans-
former for multi-object tracking. This paradigm performs
excellently in dealing with irregular movements due to the
flexibility of query-based design. Furthermore, MQT [17] employs different queries to represent one tracked object and focuses more on class-agnostic tracking.
However, current query-based methods typically only exploit information from adjacent frames (via query [43] or feature [22] fusion). Although the track query can be continuously updated over time, most methods still do not explicitly
exploit longer temporal information. Cai et al. [4] explore a
large memory bank to benefit from time-related knowledge
but suffer enormous storage costs. In order to use long-term
information, we propose a long-term memory to stabilize
the tracked object feature over time and a memory-attention
layer for a more distinguishable representation. Our experiments further verify that this approach significantly improves association performance in MOT.
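Such a long-term memory can be maintained far more cheaply than a full memory bank, e.g., with a running average of each track's output embedding. The following is a minimal sketch under the assumption of an exponential-moving-average update; the update rate lam and the embedding size are illustrative, not the paper's exact values:

```python
def update_long_term_memory(memory, track_embed, lam=0.01):
    """EMA update: the memory moves only slightly toward the current
    embedding, stabilizing the tracked object's feature over time."""
    return [(1.0 - lam) * m + lam * e for m, e in zip(memory, track_embed)]

# long-term memory for one tracked target (256-dim embedding)
memory = [0.0] * 256
for _ in range(100):                 # simulate 100 frames of a static target
    frame_embed = [1.0] * 256        # current-frame track embedding
    memory = update_long_term_memory(memory, frame_embed)
# memory[i] ≈ 0.634 after 100 frames, i.e. 1 - (1 - lam) ** 100
```

Because the update rate is small, per-frame noise in the embedding is smoothed out while the memory's storage cost stays constant, unlike a memory bank that grows with the sequence length.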
3. Method
3.1. Overview
We propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Different
from most existing methods [22, 43] that only explicitly uti-
lize the states of tracked objects between adjacent frames,
our core contribution is to build a long-term memory (in
Section 3.3) that maintains the long-term temporal feature
for each tracked target, together with a temporal interaction
module (TIM) that effectively injects the temporal informa-
tion into subsequent tracking processes.
Like most DETR-family methods [6], we use a ResNet-
50 [15] backbone and a Transformer Encoder to produce the
image feature of an input frame I^t. As shown in Figure 1, the learnable detect query Q_det is fed into the Detection Decoder D_det (in Section 3.2) to generate the detect embedding E_det^t for the current frame. Afterward, by query-