IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3
of objective functions: those that do not consider the power
spectrum or image translations, such as Synthetic Discrim-
inant Function (SDF) filters [25], [26], and those that do,
such as Minimum Average Correlation Energy [28], Optimal
Trade-Off [27] and Minimum Output Sum of Squared Error
(MOSSE) filters [9]. Since the spatial structure can effectively
be ignored, the former are easier to kernelize, and Kernel
SDF filters have been proposed [26], [27], [25]. However,
lacking a clearer relationship between translated images,
non-linear kernels and the Fourier domain, applying the
kernel trick to other filters has proven much more difficult
[25], [24], with some proposals requiring significantly higher
computation times and imposing strong limits on the num-
ber of image shifts that can be considered [24].
For us, this hinted that a deeper connection between
translated image patches and training algorithms was
needed, in order to overcome the limitations of direct
Fourier domain formulations.
2.3 Subsequent work
Since the initial version of this work [29], an interesting
time-domain variant of the proposed cyclic shift model has
been used very successfully for video event retrieval [30].
Generalizations of linear correlation filters to multiple channels have also been proposed [31], [32], [33], some of which build on our initial work. This allows them to leverage
more modern features (e.g. Histogram of Oriented Gradi-
ents – HOG). A generalization to other linear algorithms,
such as Support Vector Regression, was also proposed [31].
We must point out that all of these works target off-line
training, and thus rely on slower solvers [31], [32], [33]. In
contrast, we focus on fast element-wise operations, which
are more suitable for real-time tracking, even with the kernel
trick.
3 CONTRIBUTIONS
A preliminary version of this work was presented earlier
[29]. It demonstrated, for the first time, the connection
between Ridge Regression with cyclically shifted samples
and classical correlation filters. This enabled fast learning
with O(n log n) Fast Fourier Transforms instead of expen-
sive matrix algebra. The first Kernelized Correlation Filter
was also proposed, though limited to a single channel.
Additionally, it proposed closed-form solutions to compute
kernels at all cyclic shifts. These carried the same O(n log n)
computational cost, and were derived for radial basis and
dot-product kernels.
The present work adds to the initial version in signifi-
cant ways. All the original results were re-derived using a
much simpler diagonalization technique (Sections 4-6). We
extend the original work to deal with multiple channels,
which allows the use of state-of-the-art features that give an
important boost to performance (Section 7). Considerable
new analysis and intuitive explanations are added to the
initial results. We also extend the original experiments from
12 to 50 videos, and add a new variant of the Kernelized
Correlation Filter (KCF) tracker based on Histogram of
Oriented Gradients (HOG) features instead of raw pixels.
Via a linear kernel, we additionally propose a linear multi-
channel filter with very low computational complexity, that
almost matches the performance of non-linear kernels. We
name it Dual Correlation Filter (DCF), and show how it
is related to a set of recent, more expensive multi-channel
filters [31]. Experimentally, we demonstrate that the KCF
already performs better than a linear filter, without any
feature extraction. With HOG features, both the linear DCF
and non-linear KCF outperform top-ranking trackers, such as Struck [7] or Track-Learn-Detect (TLD) [4], by a large margin, while comfortably running at hundreds of frames per second.
4 BUILDING BLOCKS
In this section, we propose an analytical model for image
patches extracted at different translations, and work out the
impact on a linear regression algorithm. We will show a
natural underlying connection to classical correlation filters.
The tools we develop will allow us to study more compli-
cated algorithms in Sections 5-7.
4.1 Linear regression
We will focus on Ridge Regression, since it admits a simple
closed-form solution, and can achieve performance that is
close to more sophisticated methods, such as Support Vector
Machines [8]. The goal of training is to find a function
$f(z) = w^T z$ that minimizes the squared error over samples $x_i$ and their regression targets $y_i$,

$$\min_w \sum_i \left( f(x_i) - y_i \right)^2 + \lambda \left\| w \right\|^2 . \qquad (1)$$
Here $\lambda$ is a regularization parameter that controls overfitting, as in the SVM. As mentioned earlier, the minimizer has a closed form, which is given by [8]

$$w = \left( X^T X + \lambda I \right)^{-1} X^T y, \qquad (2)$$

where the data matrix $X$ has one sample $x_i$ per row, each element of $y$ is a regression target $y_i$, and $I$ is an identity matrix.
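As an illustration, the closed form in Eq. 2 can be computed directly with a linear solver. The toy data below (sample count, dimension, and the value of λ) are made-up assumptions, not values from the paper:

```python
import numpy as np

# Illustrative toy problem: n samples of dimension d (sizes are assumptions).
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))   # data matrix, one sample x_i per row
y = rng.standard_normal(n)        # regression targets y_i
lam = 0.1                         # regularization parameter lambda

# Closed-form Ridge Regression (Eq. 2): w = (X^T X + lam*I)^{-1} X^T y.
# Solving the linear system is preferred over forming the explicit inverse.
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Setting the gradient of Eq. 1 to zero yields exactly this linear system, so the residual $X^T(Xw - y) + \lambda w$ vanishes at the solution.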
Starting in Section 4.4, we will have to work in the Fourier domain, where quantities are usually complex-valued. They are not harder to deal with, as long as we use the complex version of Eq. 2 instead,

$$w = \left( X^H X + \lambda I \right)^{-1} X^H y, \qquad (3)$$

where $X^H$ is the Hermitian transpose, i.e., $X^H = (X^*)^T$, and $X^*$ is the complex-conjugate of $X$. For real numbers, Eq. 3 reduces to Eq. 2.
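A minimal sketch of the complex case, again with made-up data standing in for Fourier-domain quantities: the only change from Eq. 2 is replacing the transpose by the Hermitian (conjugate) transpose.

```python
import numpy as np

# Made-up complex-valued toy data (sizes and lambda are assumptions).
rng = np.random.default_rng(1)
n, d = 8, 4
X = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)
lam = 0.1

# Complex Ridge Regression (Eq. 3): X^H = (X^*)^T is the Hermitian transpose.
XH = X.conj().T
w = np.linalg.solve(XH @ X + lam * np.eye(d), XH @ y)
```

For real-valued $X$ and $y$ the conjugate is a no-op, and this sketch computes the same $w$ as Eq. 2.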
In general, a large system of linear equations must be solved to compute the solution, which can become prohibitive in a real-time setting. Over the next paragraphs we will see a special case of $x_i$ that bypasses this limitation.