in DETR [6] has no semantic information about specific
objects. However, the track query in Transformer-based
MOT methods like MOTR [43] carries information about a
tracked object. This difference causes a semantic information gap and thus degrades the final tracking performance.
Therefore, to overcome this issue, we use a light decoder to perform preliminary object detection, which outputs detect embeddings with specific semantics. Then we jointly feed the detect and track embeddings into the subsequent decoder to make MeMOTR's tracking results more precise.
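This two-stage decoding idea can be sketched roughly as follows (a hypothetical illustration, not the exact MeMOTR implementation; the module names, layer depths, and tensor shapes are assumptions):

```python
import torch
import torch.nn as nn

class TwoStageDecoder(nn.Module):
    """Sketch: a light detection decoder first turns semantic-free detect
    queries into detect embeddings with object semantics; these are then
    decoded jointly with the track embeddings of already-tracked objects."""
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        # one-layer "light" decoder for preliminary detection
        self.det_decoder = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        # deeper joint decoder refining detect + track embeddings together
        self.joint_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, n_heads, batch_first=True), num_layers=2)

    def forward(self, detect_query, track_embed, img_feat):
        # Stage 1: preliminary detection yields semantically meaningful embeddings.
        detect_embed = self.det_decoder(detect_query, img_feat)
        # Stage 2: concatenate detect and track embeddings and decode them jointly,
        # so both share context from the current image feature.
        joint = torch.cat([detect_embed, track_embed], dim=1)
        return self.joint_decoder(joint, img_feat)

dec = TwoStageDecoder()
out = dec(torch.randn(1, 300, 256),   # learnable detect queries
          torch.randn(1, 5, 256),     # track embeddings of 5 tracked objects
          torch.randn(1, 1000, 256))  # encoder image feature
print(out.shape)  # torch.Size([1, 305, 256])
```

Feeding both embedding types through the same decoder avoids the semantic gap between newly detected and already-tracked objects.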
We mainly evaluate our method on the DanceTrack dataset [30] because of its severe association challenges. Experimental results show that our method achieves state-of-the-art performance on this challenging dataset, especially on association metrics (e.g., AssA, IDF1). We also evaluate our model on the traditional pedestrian tracking dataset MOT17 [23] and the multi-category tracking dataset BDD100K [42]. In addition, we perform extensive ablation studies to further demonstrate the effectiveness of our designs.
2. Related Work
Tracking-by-Detection is a widely used MOT paradigm
that has recently dominated the community. These methods
always get trajectories by associating a given set of detec-
tions in a streaming video.
The objects in classic pedestrian tracking scenarios [9, 23] usually have distinct appearances and regular motion patterns. Therefore, appearance matching and linear motion estimation are widely used to associate targets across consecutive frames. SORT [3] uses the Intersection-over-Union
(IoU) to match predictions of the Kalman filter [34] and
detected boxes. Deep-SORT [35] applies an additional network to extract appearance features, then utilizes cosine distances for matching, in addition to the motion cues of SORT [3].
JDE [33], FairMOT [45], and Unicorn [39] further ex-
plore the architecture of appearance embedding and match-
ing. ByteTrack [44] employs a robust detector based on
YOLOX [12] and reuses low-confidence detections to en-
hance the association ability. Furthermore, OC-SORT [5]
improves SORT [3] by rehabilitating lost targets. In recent years, some studies [36, 48] have also applied Transformers, a trendy framework in vision tasks, to match detection bounding boxes. Moreover, Dendorfer et al. [10] attempt to model pedestrian trajectories by leveraging more
complex motion estimation methods (like S-GAN [14])
from the trajectory prediction task.
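The IoU-based association at the core of these SORT-style trackers can be sketched as follows (a minimal greedy-matching illustration; SORT itself pairs this with a Kalman filter and Hungarian matching, which are omitted here):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily match predicted track boxes to detections by descending IoU."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little to be the same object
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 19, 31, 29), (1, 0, 11, 10)]
print(associate(tracks, dets))  # [(0, 1), (1, 0)]
```

Such purely geometric matching works well when motion is smooth, which is exactly the assumption that breaks down in the irregular-motion scenarios discussed next.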
Benefiting from robust detectors, the methods described above have powerful detection capabilities. However, although they achieve outstanding performance on pedestrian tracking datasets, they struggle with more complex scenarios involving irregular movements. Such unforeseeable motion patterns cause the trajectory estimation and prediction modules to fail.
Tracking-by-Query usually does not require additional
post-processing to associate detection results. Unlike the
tracking-by-detection paradigm mentioned above, tracking-
by-query methods apply the track query to decode the loca-
tion of tracked objects progressively.
Inspired by the DETR family [6], most of these methods [22, 43] leverage the learnable object query to perform newborn object detection, while the track query localizes the position of tracked objects. TransTrack [31]
builds a siamese network for detection and tracking, then
applies an IoU matching to produce newborn targets.
TrackFormer [22] utilizes the same Transformer decoder
for both detection and tracking, then employs a non-
maximum suppression (NMS) with a high IoU threshold
to remove strongly overlapping duplicate bounding boxes.
MOTR [43] builds an elegant and fully end-to-end Trans-
former for multi-object tracking. This paradigm performs
excellently in dealing with irregular movements due to the
flexibility of query-based design. Furthermore, MQT [17] employs different queries to represent one tracked object and focuses more on class-agnostic tracking.
However, current query-based methods typically only exploit information from adjacent frames (via query [43] or feature [22] fusion). Although the track query can be continuously updated over time, most methods still do not explicitly
exploit longer temporal information. Cai et al. [4] explore a
large memory bank to benefit from time-related knowledge
but suffer enormous storage costs. In order to use long-term
information, we propose a long-term memory to stabilize
the tracked object feature over time and a memory-attention
layer for a more distinguishable representation. Our experiments further verify that this approach significantly improves association performance in MOT.
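Such a long-term memory can be maintained far more cheaply than a full memory bank, e.g., with a running average of each track's output embedding. The following is a minimal sketch under the assumption of an exponential-moving-average update; the update rate lam and the embedding size are illustrative, not the paper's exact values:

```python
def update_long_term_memory(memory, track_embed, lam=0.01):
    """EMA update: the memory moves only slightly toward the current
    embedding, stabilizing the tracked object's feature over time."""
    return [(1.0 - lam) * m + lam * e for m, e in zip(memory, track_embed)]

# long-term memory for one tracked target (256-dim embedding)
memory = [0.0] * 256
for _ in range(100):                 # simulate 100 frames of a static target
    frame_embed = [1.0] * 256        # current-frame track embedding
    memory = update_long_term_memory(memory, frame_embed)
# memory[i] ≈ 0.634 after 100 frames, i.e. 1 - (1 - lam) ** 100
```

Because the update rate is small, per-frame noise in the embedding is smoothed out while the memory's storage cost stays constant, unlike a memory bank that grows with the sequence length.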
3. Method
3.1. Overview
We propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Different
from most existing methods [22, 43] that only explicitly uti-
lize the states of tracked objects between adjacent frames,
our core contribution is to build a long-term memory (in
Section 3.3) that maintains the long-term temporal feature
for each tracked target, together with a temporal interaction
module (TIM) that effectively injects the temporal informa-
tion into subsequent tracking processes.
Like most DETR-family methods [6], we use a ResNet-
50 [15] backbone and a Transformer Encoder to produce the
image feature of an input frame I^t. As shown in Figure 1, the learnable detect query Q_det is fed into the Detection Decoder D_det (in Section 3.2) to generate the detect embedding E_det^t for the current frame. Afterward, by query-