DeepLearningforVisualTrackingAComprehensiveSurvey.pdf


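Deep Learning for Visual Tracking: A Comprehensive Survey. Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its wide use in real-world scenarios, several large-scale benchmark datasets have been established, on which many methods have been developed and demonstrated, with significant progress in recent years, especially from deep learning (DL)-based methods. This survey systematically investigates current DL-based visual tracking methods, benchmark datasets, and evaluation metrics, and evaluates and analyzes the leading visual tracking methods in depth. The fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from six key aspects: network architecture, network exploitation, network training for visual tracking, network objective, network output, and the exploitation of correlation filter advantages. The network architecture part examines how structures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and gated recurrent units (GRUs) are adapted to tracking. The network exploitation part covers how pre-trained models, transfer learning, and online learning are used to improve tracking performance. The network training part discusses end-to-end, online, and offline training strategies for visual tracking. The network objective part explains the design of loss functions, such as minimizing position error, classification loss, and regression loss. The network output part concerns how the target's position, shape, and motion state are predicted. The part on exploiting correlation filter advantages describes how deep learning is combined with traditional filtering methods to improve tracking stability and accuracy. Popular visual tracking benchmark datasets, such as OTB2013, OTB2015, VOT2018, and LaSOT, are compared together with their properties, such as number of videos, target classes, scene complexity, and motion patterns. The evaluation metrics commonly used with these datasets, such as accuracy, overlap, the success plot, and the precision plot, are also summarized. Third, the latest DL methods are comprehensively evaluated on these benchmarks. They include, but are not limited to, Siamese networks, deep correlation filters, attention mechanisms, multi-task learning, and meta-learning. Comparing them reveals their respective strengths and limitations, such as the effectiveness of Siamese networks for fast tracking, the contribution of deep correlation filters to tracking stability, and how attention mechanisms help an algorithm focus on key regions. These state-of-the-art methods are critically analyzed, and possible future research directions are discussed, such as online adaptation, robustness, real-time performance, model generalization, and adaptation to complex environmental changes. The importance of open source code and fair experimental settings for advancing the field is also emphasized. Deep learning has brought revolutionary progress to visual tracking, but many challenges remain. This comprehensive survey offers an in-depth understanding of the field and is an important resource for researchers and practitioners seeking to understand and further develop visual tracking technology.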


© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

arXiv:1912.00535v1 [cs.CV] 2 Dec 2019


Deep Learning for Visual Tracking: A Comprehensive Survey

Seyed Mojtaba Marvasti-Zadeh, Student Member, IEEE, Li Cheng, Senior Member, IEEE, Hossein Ghanei-Yakhdan, and Shohreh Kasaei, Senior Member, IEEE

Abstract—Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which a considerable number of methods have been developed and have demonstrated significant progress in recent years, predominantly through deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from six key aspects: network architecture, network exploitation, network training for visual tracking, network objective, network output, and the exploitation of correlation filter advantages. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks: OTB2013, OTB2015, VOT2018, and LaSOT. Finally, by conducting critical analyses of these state-of-the-art methods both quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. The survey may serve as a gentle guide for practitioners weighing when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.

Index Terms—Visual tracking, deep learning, computer vision, appearance modeling.

1 INTRODUCTION

GENERIC visual tracking aims to estimate the trajectory of an unknown visual target when only an initial state of the target (in a video frame) is available. Visual tracking is an open and attractive research field (see Fig. 1) with a broad range of categories and applications, including self-driving cars [1]–[4], autonomous robots [5], [6], surveillance [7]–[10], augmented reality [11]–[13], unmanned aerial vehicle (UAV) tracking [14], sports [15], surgery [16], biology [17]–[19], and ocean exploration [20], to name a few. The ill-posed definition of visual tracking (i.e., model-free tracking, on-the-fly learning, single-camera, 2D information) becomes more challenging in complicated real-world scenarios, which may include arbitrary classes of target appearance and motion model (e.g., human, drone, animal, vehicle), different imaging characteristics (e.g., static/moving camera, smooth/abrupt movement, camera resolution), and changes in environmental conditions (e.g., illumination variation, background clutter, crowded scenes). Although traditional

• S. M. Marvasti-Zadeh is with the Digital Image and Video Processing Lab (DIVPL), Department of Electrical Engineering, Yazd University, Yazd, Iran. He is also a member of the Image Processing Lab (IPL), Sharif University of Technology, Tehran, Iran, and the Visual Analysis of People Lab (VAP), Aalborg University, Aalborg, Denmark. E-mail: mojtaba.marvasti@stu.yazd.ac.ir

• L. Cheng is with the Vision and Learning Lab, Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. E-mail: lcheng5@ualberta.ca

• H. Ghanei-Yakhdan is with the Digital Image and Video Processing Lab (DIVPL), Department of Electrical Engineering, Yazd University, Yazd, Iran. E-mail: hghaneiy@yazd.ac.ir

• S. Kasaei is with the Image Processing Lab (IPL), Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. E-mail: kasaei@sharif.edu

Manuscript received ...; revised ...

[Figure 1: tree diagram of visual tracking. Branches: Number and Type of States (generic object tracking, specific object tracking, single object tracking, multi-object tracking); Benchmark Dataset (short-term video sequences, long-term video sequences); Evaluation Metrics (performance measures, performance plots); Camera (single-view tracking, multi-view tracking, RGB camera, other cameras); Detection Module (no: short-term tracking, yes: long-term tracking); Environment (constrained environments, real-world scenarios).]

Fig. 1: An overview of visual target tracking.

visual tracking methods utilize various frameworks, such as discriminative correlation filters (DCF) [21]–[24], silhouette tracking [25], [26], kernel tracking [27]–[29], and point tracking [30], [31], these methods cannot provide satisfactory results in unconstrained environments. The main reasons are target representation by handcrafted features (such as the histogram of oriented gradients (HOG) [32] and Color-Names (CN) [33]) and inflexible target modeling. Inspired by deep learning (DL) breakthroughs [34]–[38] in the ImageNet large scale visual recognition competition (ILSVRC) [39] and also in the visual object tracking (VOT) challenge [40]–[46], DL-based methods have attracted considerable interest in the visual tracking community as a way to provide robust visual trackers. Although convolutional neural networks

[Figure 2: timeline from 2013 to 2019 showing visual tracking datasets and challenges alongside the development of CNN-based, RNN-based, SNN-based, RL- and GAN-based, and custom NN-based methods.]

Fig. 2: Timeline of deep visual tracking methods.

2015: Exploring and studying deep features to exploit the traditional methods.

2016: Offline training/fine-tuning of DNNs for visual tracking purposes, employing Siamese networks for real-time tracking, integrating DNNs into traditional frameworks.

2017: Incorporating temporal and contextual information, investigating various offline training on large-scale image/video datasets.

2018: Studying different learning and search strategies, designing more sophisticated architectures for the visual tracking task.

2019: Investigating deep detection and segmentation approaches for visual tracking, taking advantage of deeper backbone networks.

(CNNs) were initially the dominant networks, a broad range of architectures such as Siamese neural networks (SNNs), recurrent neural networks (RNNs), auto-encoders (AEs), generative adversarial networks (GANs), and custom neural networks is currently being investigated. Fig. 2 presents a brief history of the development of deep visual trackers in recent years. The state-of-the-art DL-based visual trackers differ in characteristics such as the exploited deep architecture, backbone network, learning procedure, training datasets, network objective, network output, types of exploited deep features, CPU/GPU implementation, programming language and framework, speed, and so forth. Besides, several visual tracking benchmark datasets have been proposed in the past few years for the practical training and evaluation of visual tracking methods. Despite their various properties, some of these benchmark datasets share common video sequences. Thus, a comparative study of DL-based methods, their benchmark datasets, and evaluation metrics is provided in this paper to facilitate the development of advanced methods by the visual tracking community.

The visual tracking methods can be roughly classified into two main categories: before and after the revolution of DL in computer vision. The first category of visual tracking survey papers [47]–[50] mainly reviews traditional methods based on classical object and motion representations, and then examines their pros and cons systematically, experimentally, or both. Considering the significant progress of DL-based visual trackers, the methods reviewed by these papers are outdated. On the other hand, the second category reviews a limited number of deep visual trackers [51]–[53]. The papers [51], [52] (two versions of one paper) categorize 81 and 93 handcrafted and deep visual trackers, respectively, into correlation filter trackers and non-correlation filter trackers, and then apply a further classification based on architectures and tracking mechanisms. These papers study fewer than 40 DL-based methods with limited investigations. Although the paper [54] specifically investigates the network branches, layers, and training aspects of nine SNN-based methods, it does not include state-of-the-art SNN-based trackers (e.g., [55]–[57]) or the custom networks (e.g., [58]) that partially exploit SNNs. The last review paper [53] categorizes 43 DL-based methods according to their structure, function, and training. Then, 16 DL-based visual trackers are evaluated alongside different hand-crafted visual tracking methods. From the structure perspective, these trackers are categorized into 34 CNN-based (including ten CNN-Matching and 24 CNN-Classification), five RNN-based, and four other architecture-based methods (e.g., AE). Besides, from the network function perspective, these methods are categorized into feature extraction networks (FENs) or end-to-end networks (EENs). While the FENs are methods that exploit models pre-trained on different tasks, the EENs are classified in terms of their outputs, namely, object score, confidence map, and bounding box (BB). From the network training perspective, these methods are categorized into the NP-OL, IP-NOL, IP-OL, VP-NOL, and VP-OL categories, in which NP, IP, VP, OL, and NOL are the abbreviations of not pre-trained, image pre-trained, video pre-trained, online learning, and no online learning, respectively.
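The training categories attributed to [53] above can be read as a pair of choices: the pre-training source and whether online learning is used. The following Python sketch is illustrative only. The names `training_category`, `PRETRAINING`, and `LISTED`, and the argument values `"none"`, `"image"`, and `"video"`, are hypothetical; the sketch encodes just the five categories listed in the text, so the NP-NOL combination is rejected.

```python
# Illustrative encoding of the training categories described for [53].
# All names below are hypothetical and chosen for this sketch only.

PRETRAINING = {"none": "NP", "image": "IP", "video": "VP"}

# The five categories listed in the text; NP-NOL is not among them.
LISTED = {"NP-OL", "IP-NOL", "IP-OL", "VP-NOL", "VP-OL"}


def training_category(pretraining: str, online_learning: bool) -> str:
    """Return the category label for a pre-training source and an online-learning flag."""
    suffix = "OL" if online_learning else "NOL"
    label = f"{PRETRAINING[pretraining]}-{suffix}"
    if label not in LISTED:
        raise ValueError(f"{label} is not one of the listed categories")
    return label


print(training_category("image", True))   # IP-OL
print(training_category("video", False))  # VP-NOL
```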

Despite all efforts, there is no comprehensive study that not only extensively categorizes DL-based trackers, their motivations, and their solutions to different problems, but also experimentally analyzes the best methods under different challenging scenarios. Motivated by these concerns, the main goal of this survey is to fill this gap and investigate the main present problems and future directions. The main differences between this survey and prior ones are described as follows.

Differences to Prior Surveys: In contrast to the currently available review papers, this paper focuses exclusively on 129 state-of-the-art DL-based visual tracking methods, which have been published in major image processing and computer vision conferences and journals. These methods include the HCFT [59], DeepSRDCF [60], FCNT [61], CNN-SVM [62], DPST [63], CCOT [64], GOTURN [65], SiamFC [66], SINT [67], MDNet [68], HDT [69], STCT [70], RPNT [71], DeepTrack [72], CNT [73], CF-CNN [74], TCNN [75], RDLT [76], PTAV [77], [78], CREST [79], UCT/UCT-Lite [80], DSiam/DSiamM [81], TSN [82], WECO [83], RFL [84], IBCCF [85], DTO [86], SRT [87], R-FCSN [88], GNET [89], LST [90], VRCPF [91], DCPF [92], CFNet [93], ECO [94], DeepCSRDCF [95], MCPF [96], BranchOut [97], DeepLMCF [98], Obli-RaFT [99], ACFN [100], SANet [101], DCFNet/DCFNet2 [102], DET [103], DRN [104], DNT [105], STSGS [106], TripletLoss [107], DSLT [108], UPDT [109], ACT [110], DaSiamRPN [111], RT-MDNet [112], StructSiam [113], MMLT [114], CPT [115], STP [116], Siam-MCF [117], Siam-BM [118], WAEF [119], TRACA [120], VITAL [121], DeepSTRCF [122], SiamRPN [123], SA-Siam [124], FlowTrack [125], DRT [126], LSART [127], RASNet [128], MCCT [129], DCPF2 [130], VDSR-SRT [131], FCSFN [132], FRPN2T-Siam [133], FMFT [134], IMLCF [135], TGGAN [136], DAT [137], DCTN [138], FPRNet [139], HCFTs [140], adaDDCF [141], YCNN [142], DeepHPFT [143], CFCF [144], CFSRL [145], P2T [146], DCDCF [147], FICFNet [148], LCTdeep [149], HSTC [150], DeepFWDCF [151], CF-FCSiam [152], MGNet [153], ORHF [154], ASRCF [155], ATOM [156], C-RPN [157], GCT [158], RPCF [159], SPM [160], SiamDW [56], SiamMask [57], SiamRPN++ [55], TADT [161], UDT [162], DiMP [163], ADT [164], CODA [165], DRRL [166], SMART [167], MRCNN [168], MM [169], MTHCF [170], AEPCF [171], IMM-DFT [172], TAAT [173], DeepTACF [174], MAM [175], ADNet [176], [177], C2FT [178], DRL-IS [179], DRLT [180], EAST [181], HP [182], P-Track [183], RDT [184], and SINT++ [58].

The trackers include 73 CNN-based, 35 SNN-based, 15 custom-based (including AE-based, reinforcement learning (RL)-based, and combined networks), three RNN-based, and three GAN-based methods. One major contribution and novelty of this paper is the inclusion and comparison of SNN-based visual tracking methods, which are currently of great interest in the visual tracking community. Moreover, the recent visual trackers based on GANs and custom networks (which include RL-based methods) are reviewed. Although the methods in this survey are categorized into the exploitation of off-the-shelf deep features and deep features for visual tracking (similar to the FENs and EENs in [53]), detailed characteristics of these methods, such as pre-trained or backbone networks, exploited layer(s), training datasets, objective function, tracking speed, used features, types of tracking output, CPU/GPU implementation, programming language, and DL framework, are also presented. From the network training perspective, this survey studies deep off-the-shelf features and deep features for visual tracking independently. Because deep off-the-shelf features (i.e., extracted from FENs) are mostly pre-trained on ImageNet for object recognition tasks, their training details are reviewed separately. Hence, network training for visual tracking purposes is categorized into DL-based methods that exploit only offline training, only online training, or both offline and online training procedures. Finally, this paper comprehensively analyzes different aspects of 45 state-of-the-art visual tracking methods on four visual tracking datasets.
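As a quick consistency check, the per-architecture counts stated above should add up to the 129 reviewed methods. The short Python sketch below uses only the numbers given in the text; the dictionary keys and variable names are chosen here for illustration.

```python
# Per-architecture counts as stated in the text.
counts = {"CNN": 73, "SNN": 35, "custom": 15, "RNN": 3, "GAN": 3}

total = sum(counts.values())
print(total)  # 129

# Share of each architecture among the reviewed methods, in percent.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(shares)
```

The total matches the 129 methods named in the text, with CNN-based and SNN-based trackers together accounting for the large majority.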

The main contributions of this paper are summarized as follows:

1) State-of-the-art DL-based visual tracking methods are categorized based on their architecture (i.e., CNN, SNN, RNN, GAN, and custom networks), network exploitation (i.e., off-the-shelf deep features and deep features for visual tracking), network training for visual tracking (i.e., only offline training, only online training, both offline and online training), network objective (i.e., regression-based, classification-based, and both classification and regression-based), and exploitation of correlation filter advantages (i.e., DCF framework and utilizing correlation filter/layer/functions). A study covering all of these aspects in a detailed categorization of visual tracking methods has not been previously presented.

2) The main motivations and contributions of the DL-based methods to tackle the visual tracking problems are summarized. To the best of our knowledge, this is the first paper that investigates the primary problems and proposed solutions of visual tracking methods. This classification provides proper insight into designing accurate and robust DL-based visual tracking methods.

3) Based on fundamental characteristics (including the number of videos, number of frames, number of classes or clusters, sequence attributes, absent labels, and overlap with other datasets), recent visual tracking benchmark datasets including OTB2013 [185], VOT [40]–[46], ALOV [48], OTB2015 [186], TC128 [187], UAV123 [188], NUS-PRO [189], NfS [190], DTB [191], TrackingNet [192], OxUvA [193], BUAA-PRO [194], GOT10k [195], and LaSOT [196] are compared.

4) Finally, extensive quantitative and qualitative experimental evaluations are performed on the well-known OTB2013, OTB2015, VOT2018, and LaSOT visual tracking datasets, and the state-of-the-art visual trackers are analyzed based on different aspects. Moreover, this paper specifies the most challenging visual attributes not only for the VOT2018 dataset, but also, for the first time, for the OTB2015 and LaSOT datasets. Lastly, the VOT toolkit [45] has been modified to qualitatively compare different methods according to the TraX protocol [197].

According to the comparisons, the following observations are made:

1) The SNN-based methods are the most attractive deep architectures due to their satisfactory balance between performance and efficiency for visual tracking. Moreover, visual tracking methods have recently attempted to exploit the advantages of RL and GAN methods to refine their decision making and alleviate the lack of training data. Based on these advantages, recent visual tracking methods aim to design custom neural networks for visual tracking purposes.

2) The offline end-to-end learning of deep features appropriately adapts the pre-trained features for visual tracking. Although the online training of DNNs increases the computational complexity such that most of these methods are not suitable for real-time applications, it considerably helps visual trackers to adapt to significant appearance variation, avoid visual distractors, and improve the accuracy and robustness of visual tracking methods. Hence, exploiting both offline and online training procedures provides more robust visual trackers.

3) Leveraging deeper and wider backbone networks improves the discriminative power for distinguishing the target from its background.

4) The best visual tracking methods use both regression and classification objective functions not only to estimate the best target proposal but also to find the tightest BB for target localization.

5) The exploitation of different features enhances the robustness of the target model. For instance, most of the DCF-based methods fuse deep off-the-shelf features and hand-crafted features (e.g., HOG and CN) for this reason. Also, the exploitation of complementary features such as temporal or contextual information has led to more discriminative and robust features for target representation.

6) The most challenging attributes for DL-based visual tracking methods are occlusion, out-of-view, and fast motion. Moreover, visual distractors with similar semantics may cause the drifting problem.
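The fourth observation, that the best trackers combine classification and regression objectives, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than a formulation taken from any surveyed tracker: binary cross-entropy for the target/background score, a smooth L1 penalty on the box coordinates, the weighting parameter `reg_weight`, and all function names.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 penalty summed over box coordinates (assumed regression loss)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def binary_cross_entropy(score, label, eps=1e-12):
    """Cross-entropy for a target (1) / background (0) score in (0, 1)."""
    score = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))


def joint_loss(score, label, box_pred, box_target, reg_weight=1.0):
    """Classification term plus a weighted box-regression term.

    The regression term is applied only to positive (target) proposals,
    since background proposals have no ground-truth box to regress to.
    """
    cls = binary_cross_entropy(score, label)
    reg = smooth_l1(box_pred, box_target) if label == 1 else 0.0
    return cls + reg_weight * reg


box = (10.0, 20.0, 50.0, 80.0)
# A perfectly localized positive proposal: only the classification term remains.
print(round(joint_loss(0.9, 1, box, box), 4))  # 0.1054
# The same score with a poorly localized box is penalized by the regression term.
print(joint_loss(0.9, 1, (12.0, 20.0, 50.0, 80.0), box))
```

In this sketch the classification term ranks proposals, while the regression term rewards tighter boxes, which mirrors the two roles named in the observation.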

The rest of this paper is organized as follows. Section 2 introduces our taxonomy of deep visual tracking methods. The visual tracking benchmark datasets and evaluation metrics are briefly compared in Section 3. Experimental comparisons of the state-of-the-art visual tracking methods are performed in Section 4. Finally, Section 5 summarizes the conclusions and future directions.

2 TAXONOMY OF DEEP VISUAL TRACKING METHODS

In this section, three major components, namely target representation/information, training process, and learning procedure, are described. Then, the proposed comprehensive taxonomy of DL-based methods is presented.

One of the primary motivations of DL-based methods is improving a target representation by utilizing/fusing deep
