MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object
Updated 2023-08-02
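Multi-object tracking + Transformer
Project link: https://link.zhihu.com/?target=https%3A//github.com/MCG-NJU/MeMOTR
Summary:
1) Area: Multi-Object Tracking (MOT)
2) Application: video tasks
3) Background: Most existing multi-object tracking methods explicitly exploit object features only between adjacent frames and lack the capacity to model long-term temporal information.
4) Method: The paper proposes MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. By injecting long-term memory through a customized memory-attention layer, it makes the track embedding of the same object more stable and distinguishable, which significantly improves the model's target association ability.
5) Results: On the DanceTrack dataset, MeMOTR surpasses the state-of-the-art method by 7.9% and 13.0% on the HOTA and AssA metrics, respectively. It also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well to BDD100K.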
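MeMOTR: A Long-Term Memory-Augmented Transformer for Multi-Object Tracking
Multi-Object Tracking (MOT) is an important task in video analysis: the system must detect multiple objects in a video stream and maintain their identities. Transformer models are now used widely across many fields, but how to exploit long-term temporal information effectively in MOT remains a challenge. Most existing methods use target features only between adjacent frames and cannot model information over long time spans.
To address this problem, the authors propose MeMOTR (Long-Term Memory-Augmented Transformer), which injects long-term memory through a customized memory-attention layer so that the track embedding of the same target becomes more stable and distinguishable. This design helps the model associate targets along their trajectories and improves its target association ability.
Experiments on the DanceTrack dataset show that MeMOTR improves on the previous state-of-the-art method by 7.9% on HOTA (Higher Order Tracking Accuracy) and by 13.0% on AssA (Association Accuracy). This indicates strong tracking performance in complex scenes. On MOT17, MeMOTR's association performance exceeds that of other Transformer-based methods, and it generalizes well on BDD100K, a multi-category tracking dataset.
MeMOTR combines the Transformer attention mechanism with the idea of a long-term memory. Transformers are known for strong sequence modeling and parallel computation, first in natural language processing and later in vision tasks. The memory-attention layer lets the model attend not only to the current frame but also to context accumulated from past frames, which matters in complex, dynamic multi-target scenes.
Accurate multi-object tracking is essential in applications such as autonomous driving and surveillance. MeMOTR's results suggest that it could help improve the robustness of such systems, particularly when targets move erratically or interact in complex ways. Because the code is publicly available, researchers and developers can study and extend MeMOTR further.
In short, MeMOTR introduces a long-term memory-augmented Transformer for multi-object tracking and improves the stability and accuracy with which target identities are maintained. The work is a useful reference for future video analysis tasks, especially those with difficult target association problems.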
MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
Ruopeng Gao^1, Limin Wang^{1,2,B}
^1 State Key Laboratory for Novel Software Technology, Nanjing University
^2 Shanghai AI Lab
Abstract
As a video task, Multi-Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.
1. Introduction
Multi-Object Tracking (MOT) [8, 23, 30] aims to detect multiple objects and maintain their identities in a video stream. MOT can be applied to numerous downstream tasks, such as action recognition [7], behavior analysis [16], and so on. It is also an important technique for real-world applications, e.g., autonomous driving and surveillance.
According to the definition of MOT, this task can be formally divided into two parts: object detection and association. For a long time, pedestrian tracking datasets (like MOT17 [23]) have dominated the community. However, these datasets pose insufficient challenges for target association because of their almost linear motion patterns. Therefore, tracking-by-detection methods [5, 33, 44] have achieved state-of-the-art MOT performance for several years. They first adopt a robust object detector (e.g., YOLOX [12]) to independently localize the objects in each frame and then associate them with IoU [3, 41] or ReID features [27]. However, associating targets becomes a critical challenge in some complex scenarios, like group dancers [30] and sports players [8, 13]. Their similar appearances and erratic movements may cause existing methods to fail. Recently, Transformer-based tracking methods [22, 43] have introduced a new fully end-to-end MOT paradigm. Through the interaction and progressive decoding of detect and track queries in the Transformer, they complete detection and tracking simultaneously. This paradigm is expected to have greater potential for object association due to the flexibility of the Transformer, especially in the complex scenes above.
B: Corresponding author (lmwang@nju.edu.cn).
Although these Transformer-based methods achieve excellent performance, they still struggle with some complicated issues, such as analogous appearances, irregular motion patterns, and long-term occlusions. We hypothesize that more intelligent leverage of temporal information can provide the tracker with a more effective and robust representation for each tracked target, thereby relieving the above issues and boosting the tracking performance. Unfortunately, most previous methods [22, 43] only exploit the image or object features between two adjacent frames and thus lack the utilization of long-term temporal information.
Based on the analysis above, in this paper, we focus on leveraging temporal information by proposing a long-term Memory-augmented Multi-Object Tracking method with TRansformer, coined as MeMOTR. We exploit detect and track embeddings to localize newborn and tracked objects via a Transformer Decoder, respectively. Our model maintains a long-term memory with the exponential recursion update algorithm [29] for each tracked object. Afterward, we inject this memory into the track embedding, reducing its abrupt changes and thus improving the model association ability. As multiple tracked targets exist in a video stream, we apply a memory-attention layer to produce a more distinguishable representation. Besides, we present an adaptive aggregation to fuse the object feature from two adjacent frames to improve tracking robustness.
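The exponential recursion update mentioned above can be pictured as an exponential moving average over each target's per-frame embedding. The following is only a minimal sketch under stated assumptions: the function name `update_long_term_memory`, the plain-list embedding representation, and the update rate `lam` (including its default value) are illustrative choices made here, not details given in this section.

```python
from typing import List


def update_long_term_memory(
    memory: List[float],
    embedding: List[float],
    lam: float = 0.01,  # hypothetical update rate, chosen for illustration
) -> List[float]:
    """Blend the current-frame embedding into a running memory.

    Computes new_memory = (1 - lam) * memory + lam * embedding element-wise.
    A small lam makes the memory change slowly, so a single noisy frame
    has little effect on the stored representation.
    """
    if len(memory) != len(embedding):
        raise ValueError("memory and embedding must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * m + lam * e for m, e in zip(memory, embedding)]


if __name__ == "__main__":
    # Start the memory from the first observed embedding of a target.
    memory = [1.0, 0.0]
    # An abrupt change in the per-frame embedding moves the memory only slightly.
    memory = update_long_term_memory(memory, [0.0, 1.0], lam=0.01)
    print(memory)
```

With a small `lam`, the memory acts as a low-pass filter over the sequence of embeddings, which matches the stated goal of reducing abrupt changes in the track embedding.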
In addition, we argue that the learnable detection query in DETR [6] has no semantic information about specific objects. However, the track query in Transformer-based MOT methods like MOTR [43] carries information about a tracked object. This difference will cause a semantic information gap and thus degrade the final tracking performance. Therefore, to overcome this issue, we use a light decoder to perform preliminary object detection, which outputs the detect embedding with specific semantics. Then we jointly input detect and track embeddings into the subsequent decoder to make MeMOTR tracking results more precise.
arXiv:2307.15700v1 [cs.CV] 28 Jul 2023
We mainly evaluate our method on the DanceTrack dataset [30] because of its serious association challenge. Experimental results show that our method achieves the state-of-the-art performance on this challenging DanceTrack dataset, especially on association metrics (e.g., AssA, IDF1). We also evaluate our model on the traditional pedestrian tracking dataset of MOT17 [23] and the multi-category tracking dataset of BDD100K [42]. In addition, we perform extensive ablation studies to further demonstrate the effectiveness of our designs.
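For readers unfamiliar with the metrics named above, HOTA is commonly defined as a combination of a detection score and an association score. The restatement below follows the standard definition from the HOTA metric literature rather than anything stated in this text; the symbols $\alpha$, $\mathrm{DetA}_\alpha$ and $\mathrm{AssA}_\alpha$ are introduced here for illustration only.

```latex
% At a single localization threshold \alpha, HOTA is the geometric mean
% of detection accuracy and association accuracy:
\mathrm{HOTA}_\alpha = \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha}

% The reported score averages over 19 thresholds:
\mathrm{HOTA} = \frac{1}{19} \sum_{\alpha \in \{0.05,\, 0.10,\, \ldots,\, 0.95\}} \mathrm{HOTA}_\alpha
```

Because of the geometric mean, a gain in AssA raises HOTA even when detection accuracy is unchanged, which is why the two are usually reported together.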
2. Related Work
Tracking-by-Detection is a widely used MOT paradigm that has recently dominated the community. These methods always get trajectories by associating a given set of detections in a streaming video.
The objects in classic pedestrian tracking scenarios [9, 23] always have different appearances and regular motion patterns. Therefore, appearance matching and linear motion estimation are widely used to match targets in consecutive frames. SORT [3] uses the Intersection-over-Union (IoU) to match predictions of the Kalman filter [34] and detected boxes. Deep-SORT [35] applies an additional network to extract target features, then utilizes cosine distances for matching besides motion consideration in SORT [3]. JDE [33], FairMOT [45], and Unicorn [39] further explore the architecture of appearance embedding and matching. ByteTrack [44] employs a robust detector based on YOLOX [12] and reuses low-confidence detections to enhance the association ability. Furthermore, OC-SORT [5] improves SORT [3] by rehabilitating lost targets. In recent years, as a trendy framework in vision tasks, some studies [36, 48] have also applied Transformers to match detection bounding boxes. Moreover, Dendorfer et al. [10] attempt to model pedestrian trajectories by leveraging more complex motion estimation methods (like S-GAN [14]) from the trajectory prediction task.
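To make the IoU-based association step concrete, the sketch below computes IoU between axis-aligned boxes and pairs predicted track boxes with detections. It is a simplified illustration, not any cited tracker's implementation: it uses greedy matching in place of the optimal assignment that trackers such as SORT typically solve, and the names `iou`, `greedy_match` and the threshold value are assumptions made here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(
    tracks: List[Box],
    detections: List[Box],
    iou_threshold: float = 0.3,  # hypothetical threshold for illustration
) -> List[Tuple[int, int]]:
    """Pair track and detection indices, highest IoU first.

    Greedy matching is a simplification used here; it is not guaranteed
    to find the assignment with the best total IoU.
    """
    candidates = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches: List[Tuple[int, int]] = []
    used_tracks, used_dets = set(), set()
    for score, ti, di in candidates:
        if score < iou_threshold:
            break  # remaining candidates overlap too little to be matched
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 12.0, 12.0)]
    detected = [(10.0, 10.0, 12.0, 12.0), (0.5, 0.0, 2.5, 2.0)]
    print(greedy_match(predicted, detected))
```

Matching of this kind relies on a target's predicted box overlapping its next detection, which is the assumption that breaks down under the irregular motion discussed in this section.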
The methods described above have powerful detection capabilities due to their robust detectors. However, although such methods have achieved outstanding performance in pedestrian tracking datasets, they are mediocre at dealing with more complex scenarios having irregular movements. These unforeseeable motion patterns will cause the trajectory estimation and prediction module to fail.
Tracking-by-Query usually does not require additional post-processing to associate detection results. Unlike the tracking-by-detection paradigm mentioned above, tracking-by-query methods apply the track query to decode the location of tracked objects progressively.
Inspired by DETR-family [6], most of these methods [22, 43] leverage the learnable object query to perform newborn object detection, while the track query localizes the position of tracked objects. TransTrack [31] builds a siamese network for detection and tracking, then applies an IoU matching to produce newborn targets. TrackFormer [22] utilizes the same Transformer decoder for both detection and tracking, then employs a non-maximum suppression (NMS) with a high IoU threshold to remove strongly overlapping duplicate bounding boxes. MOTR [43] builds an elegant and fully end-to-end Transformer for multi-object tracking. This paradigm performs excellently in dealing with irregular movements due to the flexibility of query-based design. Furthermore, MQT [17] employs different queries to represent one tracked object and cares more about class-agnostic tracking.
However, current query-based methods typically exploit the information of adjacent frames (query [43] or feature [22] fusion). Although the track query can be continuously updated over time, most methods still do not explicitly exploit longer temporal information. Cai et al. [4] explore a large memory bank to benefit from time-related knowledge but suffer from enormous storage costs. In order to use long-term information, we propose a long-term memory to stabilize the tracked object feature over time and a memory-attention layer for a more distinguishable representation. Our experiments further confirm that this approach significantly improves association performance in MOT.
3. Method
3.1. Overview
We propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Different from most existing methods [22, 43] that only explicitly utilize the states of tracked objects between adjacent frames, our core contribution is to build a long-term memory (in Section 3.3) that maintains the long-term temporal feature for each tracked target, together with a temporal interaction module (TIM) that effectively injects the temporal information into subsequent tracking processes.
Like most DETR-family methods [6], we use a ResNet-50 [15] backbone and a Transformer Encoder to produce the image feature of an input frame $I^t$. As shown in Figure 1, the learnable detect query $Q_{det}$ is fed into the Detection Decoder $D_{det}$ (in Section 3.2) to generate the detect embedding $E_{det}^t$ for the current frame. Afterward, by query-