MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
Multi-object tracking + Transformer. Project link: https://link.zhihu.com/?target=https%3A//github.com/MCG-NJU/MeMOTR
Summary:
1) Direction: Multi-Object Tracking
2) Application: video tasks
3) Background: most existing multi-object tracking methods can only explicitly exploit target features between adjacent frames and lack the capacity to model long-term temporal information.
4) Method: this paper proposes a long-term memory-augmented Transformer (MeMOTR) for multi-object tracking. By injecting long-term memory through a customized memory-attention layer, the method makes the track embedding of the same target more stable and distinguishable, which significantly improves the model's target-association ability.
5) Results: experiments on the DanceTrack dataset show that MeMOTR surpasses the state-of-the-art method by 7.9% and 13.0% on the HOTA and AssA metrics, respectively. The model also outperforms other Transformer-based methods in association performance on MOT17 and generalizes well on BDD100K.
MeMOTR: Long-Term Memory-Augmented Transformer
for Multi-Object Tracking
Ruopeng Gao^1, Limin Wang^{1,2,B}
^1 State Key Laboratory for Novel Software Technology, Nanjing University
^2 Shanghai AI Lab
Abstract
As a video task, Multi-Object Tracking (MOT) is ex-
pected to capture temporal information of targets effec-
tively. Unfortunately, most existing methods only explicitly
exploit the object features between adjacent frames, while
lacking the capacity to model long-term temporal infor-
mation. In this paper, we propose MeMOTR, a long-term
memory-augmented Transformer for multi-object tracking.
Our method is able to make the same object’s track embed-
ding more stable and distinguishable by leveraging long-
term memory injection with a customized memory-attention
layer. This significantly improves the target association
ability of our model. Experimental results on DanceTrack
show that MeMOTR impressively surpasses the state-of-the-
art method by 7.9% and 13.0% on HOTA and AssA met-
rics, respectively. Furthermore, our model also outperforms
other Transformer-based methods on association perfor-
mance on MOT17 and generalizes well on BDD100K. Code
is available at https://github.com/MCG-NJU/MeMOTR.
1. Introduction
Multi-Object Tracking (MOT) [8, 23, 30] aims to de-
tect multiple objects and maintain their identities in a video
stream. MOT can be applied to numerous downstream
tasks, such as action recognition [7], behavior analysis [16],
and so on. It is also an important technique for real-world
applications, e.g., autonomous driving and surveillance.
According to the definition of MOT, this task can be
formally divided into two parts: object detection and as-
sociation. For a long time, pedestrian tracking datasets
(like MOT17 [23]) have dominated the community. However,
these datasets pose insufficient challenges for target
association because of their almost linear motion patterns.
Therefore, tracking-by-detection methods [5, 33, 44] have
achieved the state-of-the-art performance in MOT for several
years. They first adopt a robust object detector (e.g.,
YOLOX [12]) to independently localize the objects in each
frame and associate them with IoU [3, 41] or ReID features [27].
(B: Corresponding author, lmwang@nju.edu.cn.)
However, associating targets becomes
a critical challenge in some complex scenarios, like group
dancers [30] and sports players [8, 13]. These similar ap-
pearances and erratic movements may cause existing meth-
ods to fail. Recently, Transformer-based tracking meth-
ods [22, 43] have introduced a new fully-end-to-end MOT
paradigm. Through the interaction and progressive decod-
ing of detect and track queries in Transformer, they simul-
taneously complete detection and tracking. This paradigm
is expected to have greater potential for object association
due to the flexibility of Transformer, especially in the above
complex scenes.
Although these Transformer-based methods achieve ex-
cellent performance, they still struggle with some compli-
cated issues, such as analogous appearances, irregular mo-
tion patterns, and long-term occlusions. We hypothesize
that more intelligent leverage of temporal information can
provide the tracker a more effective and robust representa-
tion for each tracked target, thereby relieving the above is-
sues and boosting the tracking performance. Unfortunately,
most previous methods [22, 43] only exploit the image or
object features between two adjacent frames, lacking
the utilization of long-term temporal information.
Based on the analysis above, in this paper, we focus on
leveraging temporal information by proposing a long-term
Memory-augmented Multi-Object Tracking method with
TRansformer, coined as MeMOTR. We exploit detect and
track embeddings to localize newborn and tracked objects
via a Transformer Decoder, respectively. Our model main-
tains a long-term memory with the exponential recursion
update algorithm [29] for each tracked object. Afterward,
we inject this memory into the track embedding, reducing
its abrupt changes and thus improving the model association
ability. As multiple tracked targets exist in a video stream,
we apply a memory-attention layer to produce a more dis-
tinguishable representation. Besides, we present an adap-
tive aggregation to fuse the object feature from two adjacent
frames to improve tracking robustness.
arXiv:2307.15700v1 [cs.CV] 28 Jul 2023
In addition, we argue that the learnable detection query
in DETR [6] has no semantic information about specific
objects. However, the track query in Transformer-based
MOT methods like MOTR [43] carries information about a
tracked object. This difference will cause a semantic infor-
mation gap and thus degrade the final tracking performance.
Therefore, to overcome this issue, we use a light decoder to
perform preliminary object detection, which outputs the de-
tect embedding with specific semantics. Then we jointly
input detect and track embeddings into the subsequent decoder
to make MeMOTR's tracking results more precise.
We mainly evaluate our method on the DanceTrack
dataset [30] because of its serious association challenge.
Experimental results show that our method achieves the
state-of-the-art performance on this challenging Dance-
Track dataset, especially on association metrics (e.g., AssA,
IDF1). We also evaluate our model on the traditional
pedestrian tracking dataset of MOT17 [23] and the multi-
categories tracking dataset of BDD100K [42]. In addition,
we perform extensive ablation studies to further demonstrate
the effectiveness of our designs.
2. Related Work
Tracking-by-Detection is a widely used MOT paradigm
that has recently dominated the community. These methods
always get trajectories by associating a given set of detec-
tions in a streaming video.
The objects in classic pedestrian tracking scenarios [9,
23] always have different appearances and regular motion
patterns. Therefore, appearance matching and linear mo-
tion estimation are widely used to match targets in consec-
utive frames. SORT [3] uses the Intersection-over-Union
(IoU) to match predictions of the Kalman filter [34] and
detected boxes. Deep-SORT [35] applies an additional net-
work to extract target features, then utilizes cosine distances
for matching besides motion consideration in SORT [3].
JDE [33], FairMOT [45], and Unicorn [39] further ex-
plore the architecture of appearance embedding and match-
ing. ByteTrack [44] employs a robust detector based on
YOLOX [12] and reuses low-confidence detections to en-
hance the association ability. Furthermore, OC-SORT [5]
improves SORT [3] by rehabilitating lost targets. In recent
years, as a trendy framework in vision tasks, some stud-
ies [36, 48] have also applied Transformers to match detec-
tion bounding boxes. Moreover, Dendorfer et al. [10] at-
tempt to model pedestrian trajectories by leveraging more
complex motion estimation methods (like S-GAN [14])
from the trajectory prediction task.
The methods described above have powerful detection
capabilities due to their robust detectors. However, although
such methods have achieved outstanding performance on
pedestrian tracking datasets, they are mediocre at handling
more complex scenarios with irregular movements.
These unforeseeable motion patterns will cause the trajec-
tory estimation and prediction module to fail.
Tracking-by-Query usually does not require additional
post-processing to associate detection results. Unlike the
tracking-by-detection paradigm mentioned above, tracking-
by-query methods apply the track query to decode the loca-
tion of tracked objects progressively.
Inspired by DETR-family [6], most of these meth-
ods [22, 43] leverage the learnable object query to per-
form newborn object detection, while the track query lo-
calizes the position of tracked objects. TransTrack [31]
builds a siamese network for detection and tracking, then
applies an IoU matching to produce newborn targets.
TrackFormer [22] utilizes the same Transformer decoder
for both detection and tracking, then employs a non-
maximum suppression (NMS) with a high IoU threshold
to remove strongly overlapping duplicate bounding boxes.
MOTR [43] builds an elegant and fully end-to-end Trans-
former for multi-object tracking. This paradigm performs
excellently in dealing with irregular movements due to the
flexibility of query-based design. Furthermore, MQT [17]
employs different queries to represent one tracked object
and cares more about class-agnostic tracking.
However, current query-based methods typically exploit
the information of adjacent frames (query [43] or fea-
ture [22] fusion). Although the track query can be continu-
ously updated over time, most methods still do not explicitly
exploit longer temporal information. Cai et al. [4] explore a
large memory bank to benefit from time-related knowledge
but suffer enormous storage costs. In order to use long-term
information, we propose a long-term memory to stabilize
the tracked object feature over time and a memory-attention
layer for a more distinguishable representation. Our experiments
further confirm that this approach significantly improves
association performance in MOT.
3. Method
3.1. Overview
We propose MeMOTR, a long-term memory-augmented
Transformer for multi-object tracking. Different
from most existing methods [22, 43] that only explicitly uti-
lize the states of tracked objects between adjacent frames,
our core contribution is to build a long-term memory (in
Section 3.3) that maintains the long-term temporal feature
for each tracked target, together with a temporal interaction
module (TIM) that effectively injects the temporal informa-
tion into subsequent tracking processes.
Like most DETR-family methods [6], we use a ResNet-
50 [15] backbone and a Transformer Encoder to produce the
image feature of an input frame I^t. As shown in Figure 1,
the learnable detect query Q_det is fed into the Detection
Decoder D_det (in Section 3.2) to generate the detect
embedding E^t_det for the current frame. Afterward, by query-
[Figure 1 diagram: frames I^t, I^{t-1}, I^{t-2} from the input video
stream pass through the backbone and Encoder; the learnable detect
query Q_det enters the Detection Decoder, and the Transformer Joint
Decoder consumes the detect and track embeddings; the outputs O^t_tck
and O^{t-1}_tck feed the Temporal Interaction Module, whose memory
M^t_tck is copied for newborn targets and updated for subsequent frames.]
Figure 1. Overview of MeMOTR. Like most DETR-based [6] methods, we
exploit a ResNet-50 [15] backbone and a Transformer [32] Encoder to
learn a 2D representation of an input image. We use different colors
to indicate different tracked targets, and the learnable detect query
Q_det is illustrated in gray. Then the Detection Decoder D_det
processes the detect query to generate the detect embedding E^t_det,
which aligns with the track embedding E^t_tck from previous frames.
Long-term memory is denoted as M^t_tck. The initialization process
in the blue dotted arrow will be applied to newborn objects. Our
Long-Term Memory and Temporal Interaction Module are discussed in
Sections 3.3 and 3.4. More details are illustrated in Figure 2.
ing the encoded image feature with [E^t_det, E^t_tck], the
Transformer Joint Decoder D_joint produces the corresponding
output [Ô^t_det, Ô^t_tck]. For simplicity, we merge the newborn
objects in Ô^t_det (yellow box) with tracked objects' output
Ô^t_tck, denoted by O^t_tck. Afterward, we predict the
classification confidence c^t_i and bounding box b^t_i
corresponding to the i-th target from the output embeddings.
Finally, we feed the output from adjacent frames [O^t_tck,
O^{t-1}_tck] and the long-term memory M^t_tck into the Temporal
Interaction Module, updating the subsequent track embedding
E^{t+1}_tck and long-term memory M^{t+1}_tck. The details of our
components will be elaborated in the following sections.
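The per-frame pipeline above can be condensed into a structural sketch. Every component below is a toy stand-in operating on scalar "embeddings" (simple arithmetic), purely illustrative of the data flow and not the authors' released implementation:

```python
# Structural sketch of MeMOTR's per-frame loop (Figure 1); all modules
# here are toy stand-ins on scalar "embeddings", not real networks.

def detection_decoder(q_det, feat):
    # Detection Decoder D_det: turn learnable detect queries into E^t_det.
    return [q + feat for q in q_det]

def joint_decoder(embeddings, feat):
    # Joint Decoder D_joint: refine detect and track embeddings together.
    return [e + 0.1 * feat for e in embeddings]

def track_frame(feat, q_det, e_tck, memory, lam=0.01):
    e_det = detection_decoder(q_det, feat)        # E^t_det
    o_tck = joint_decoder(e_det + e_tck, feat)    # merged outputs O^t_tck
    # Temporal Interaction Module (simplified): EMA-update the memory of
    # existing targets, and initialize newborn targets with their output.
    memory = [(1 - lam) * m + lam * o for m, o in zip(memory, o_tck)]
    memory += o_tck[len(memory):]
    e_tck_next = list(o_tck)                      # simplified E^{t+1}_tck
    return e_tck_next, memory

# Two frames: one newborn target in frame 1, carried as a track in frame 2.
e_tck, memory = track_frame(1.0, [0.0], [], [], lam=0.5)
e_tck, memory = track_frame(1.0, [0.0], e_tck, memory, lam=0.5)
```

The real Temporal Interaction Module also fuses features from two adjacent frames (Section 3.4); this sketch keeps only the memory bookkeeping to show how track state flows between frames.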
3.2. Detection Decoder
In the previous Transformer-based methods [22, 43],
the learnable detect query and the previous track query
are jointly input to Transformer Decoder from scratch.
This simple idea extends the end-to-end detection Trans-
former [6] to multi-object tracking. Nonetheless, we argue
that this design may cause misalignment between detect and
track queries. As discussed in numerous works [6, 20], the
learnable object query in DETR-family plays a role similar
to a learnable anchor with little semantic information. On
the other hand, track queries have specific semantic knowl-
edge to resolve their category and bounding boxes since
they are generated from the output of previous frames.
Therefore, as illustrated in Figure 1, we split the origi-
nal Transformer Decoder into two parts. The first decoder
layer is used for detection, and the remaining five layers are
used for joint detection and tracking. These two decoders
have the same structure but different inputs. The Detection
Decoder D_det takes the original learnable detect query Q_det
as input and generates the corresponding detect embedding
E^t_det, carrying enough semantic information to locate and
classify the target roughly. After that, we concatenate the
detect and track embeddings together and feed them into the
Joint Decoder D_joint.
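The split can be sketched by partitioning a six-layer decoder stack into a one-layer Detection Decoder and a five-layer Joint Decoder. The "layers" below are toy callables that merely increment each embedding, purely to illustrate the 1 + 5 partition:

```python
# Illustrative 1 + 5 split of a six-layer decoder stack; each "layer"
# is a toy callable that just adds 1 to every embedding.
layers = [lambda x: [v + 1 for v in x] for _ in range(6)]
detection_decoder, joint_decoder = layers[:1], layers[1:]

def run(stack, embeddings):
    # Apply the layers of a (sub-)stack sequentially.
    for layer in stack:
        embeddings = layer(embeddings)
    return embeddings

q_det = [0, 0]                                # learnable detect queries (toy)
e_det = run(detection_decoder, q_det)         # preliminary detection pass
e_tck = [10]                                  # track embedding from the past
outputs = run(joint_decoder, e_det + e_tck)   # joint detection and tracking
```

The point of the design is that by the time queries reach the joint stack, both detect and track embeddings carry object-specific semantics, removing the semantic gap the paper describes.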
3.3. Long-Term Memory
Unlike previous methods [17, 43] that only exploit ad-
jacent frames’ information, we explicitly introduce a long-
term memory M^t_tck to maintain longer temporal information
for tracked targets. When a newborn object is detected, we
initialize its long-term memory with the current output.
It should be noted that in a video stream, objects only
have minor deformation and movement in consecutive
frames. Thus, we suppose the semantic feature of a tracked
object changes only slightly in a short time. In the same
way, our long-term memory should also update smoothly
over time. Inspired by [29], we apply a simple but effec-
tive running average with exponentially decaying weights
to update the long-term memory M^t_tck:

    M̃^{t+1}_tck = (1 − λ) · M^t_tck + λ · O^t_tck,    (1)

where M̃^{t+1}_tck is the new long-term memory for the next
frame. The memory update rate λ is experimentally set to
0.01, following the assumption that the memory changes
smoothly and consistently in consecutive frames. We also
tried some other values in Table 7.
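Eq. (1) is a plain exponential moving average. A minimal pure-Python sketch (the function name and list-based embeddings are illustrative assumptions, not the released code):

```python
def update_memory(memory, output, lam=0.01):
    """EMA update of one track's long-term memory (Eq. 1).

    memory: long-term memory M^t_tck of a tracked target.
    output: current output embedding O^t_tck of the same target.
    lam:    update rate λ; a small value makes the memory change slowly.
    """
    return [(1.0 - lam) * m + lam * o for m, o in zip(memory, output)]

# A newborn target initializes its memory with its current output;
# afterward the memory drifts slowly toward new observations.
memory = [1.0, 1.0]                     # toy 2-dim embedding
for _ in range(3):
    memory = update_memory(memory, [0.0, 0.0], lam=0.5)
```

With λ = 0.01 the effective time constant is about 1/λ = 100 frames, which matches the assumption that a tracked object's semantic feature changes only slightly over short time spans.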
3.4. Temporal Interaction Module
Adaptive Aggregation for Temporal Enhancement. Is-
sues such as blurring or occlusion are often seen in a video
stream. An intuitive idea to solve this problem is using