IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3
of objective functions: those that do not consider the power
spectrum or image translations, such as Synthetic Discrim-
inant Function (SDF) filters [25], [26], and those that do,
such as Minimum Average Correlation Energy [28], Optimal
Trade-Off [27] and Minimum Output Sum of Squared Error
(MOSSE) filters [9]. Since the spatial structure can effectively
be ignored, the former are easier to kernelize, and Kernel
SDF filters have been proposed [26], [27], [25]. However,
lacking a clearer relationship between translated images,
non-linear kernels and the Fourier domain, applying the
kernel trick to other filters has proven much more difficult
[25], [24], with some proposals requiring significantly higher
computation times and imposing strong limits on the num-
ber of image shifts that can be considered [24].
For us, this hinted that a deeper connection between
translated image patches and training algorithms was
needed, in order to overcome the limitations of direct
Fourier domain formulations.
2.3 Subsequent work
Since the initial version of this work [29], an interesting
time-domain variant of the proposed cyclic shift model has
been used very successfully for video event retrieval [30].
Generalizations of linear correlation filters to multiple channels have also been proposed [31], [32], [33], some of which build on our initial work. This allows them to leverage
more modern features (e.g. Histogram of Oriented Gradi-
ents – HOG). A generalization to other linear algorithms,
such as Support Vector Regression, was also proposed [31].
We must point out that all of these works target off-line
training, and thus rely on slower solvers [31], [32], [33]. In
contrast, we focus on fast element-wise operations, which
are more suitable for real-time tracking, even with the kernel
trick.
3 CONTRIBUTIONS
A preliminary version of this work was presented earlier
[29]. It demonstrated, for the first time, the connection
between Ridge Regression with cyclically shifted samples
and classical correlation filters. This enabled fast learning
with O(n log n) Fast Fourier Transforms instead of expen-
sive matrix algebra. The first Kernelized Correlation Filter
was also proposed, though limited to a single channel.
Additionally, it proposed closed-form solutions to compute
kernels at all cyclic shifts. These carried the same O(n log n)
computational cost, and were derived for radial basis and
dot-product kernels.
The present work adds to the initial version in signifi-
cant ways. All the original results were re-derived using a
much simpler diagonalization technique (Sections 4-6). We
extend the original work to deal with multiple channels,
which allows the use of state-of-the-art features that give an
important boost to performance (Section 7). Considerable
new analysis and intuitive explanations are added to the
initial results. We also extend the original experiments from
12 to 50 videos, and add a new variant of the Kernelized
Correlation Filter (KCF) tracker based on Histogram of
Oriented Gradients (HOG) features instead of raw pixels.
Via a linear kernel, we additionally propose a linear multi-
channel filter with very low computational complexity, that
almost matches the performance of non-linear kernels. We
name it Dual Correlation Filter (DCF), and show how it
is related to a set of recent, more expensive multi-channel
filters [31]. Experimentally, we demonstrate that the KCF
already performs better than a linear filter, without any
feature extraction. With HOG features, both the linear DCF
and non-linear KCF outperform top-ranking trackers, such as Struck [7] or Track-Learn-Detect (TLD) [4], by a large margin, while comfortably running at hundreds of frames per second.
4 BUILDING BLOCKS
In this section, we propose an analytical model for image
patches extracted at different translations, and work out the
impact on a linear regression algorithm. We will show a
natural underlying connection to classical correlation filters.
The tools we develop will allow us to study more compli-
cated algorithms in Sections 5-7.
4.1 Linear regression
We will focus on Ridge Regression, since it admits a simple
closed-form solution, and can achieve performance that is
close to more sophisticated methods, such as Support Vector
Machines [8]. The goal of training is to find a function
$f(z) = w^T z$ that minimizes the squared error over samples $x_i$ and their regression targets $y_i$,

$$\min_w \sum_i \left( f(x_i) - y_i \right)^2 + \lambda \left\| w \right\|^2 . \qquad (1)$$
Here $\lambda$ is a regularization parameter that controls overfitting, as in the SVM. As mentioned earlier, the minimizer has a closed form, which is given by [8]

$$w = \left( X^T X + \lambda I \right)^{-1} X^T y, \qquad (2)$$

where the data matrix $X$ has one sample $x_i$ per row, each element of $y$ is a regression target $y_i$, and $I$ is an identity matrix.
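As an illustration, the closed form in Eq. 2 can be computed directly with a linear solver. The toy data below (sample count, dimension, and the value of λ) are made-up assumptions, not values from the paper:

```python
import numpy as np

# Illustrative toy problem: n samples of dimension d (sizes are assumptions).
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))   # data matrix, one sample x_i per row
y = rng.standard_normal(n)        # regression targets y_i
lam = 0.1                         # regularization parameter lambda

# Closed-form Ridge Regression (Eq. 2): w = (X^T X + lam*I)^{-1} X^T y.
# Solving the linear system is preferred over forming the explicit inverse.
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Setting the gradient of Eq. 1 to zero yields exactly this linear system, so the residual $X^T(Xw - y) + \lambda w$ vanishes at the solution.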
Starting in Section 4.4, we will have to work in the Fourier domain, where quantities are usually complex-valued. They are not harder to deal with, as long as we use the complex version of Eq. 2 instead,

$$w = \left( X^H X + \lambda I \right)^{-1} X^H y, \qquad (3)$$

where $X^H$ is the Hermitian transpose, i.e., $X^H = (X^*)^T$, and $X^*$ is the complex-conjugate of $X$. For real numbers, Eq. 3 reduces to Eq. 2.
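A minimal sketch of the complex case, again with made-up data standing in for Fourier-domain quantities: the only change from Eq. 2 is replacing the transpose by the Hermitian (conjugate) transpose.

```python
import numpy as np

# Made-up complex-valued toy data (sizes and lambda are assumptions).
rng = np.random.default_rng(1)
n, d = 8, 4
X = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)
lam = 0.1

# Complex Ridge Regression (Eq. 3): X^H = (X^*)^T is the Hermitian transpose.
XH = X.conj().T
w = np.linalg.solve(XH @ X + lam * np.eye(d), XH @ y)
```

For real-valued $X$ and $y$ the conjugate is a no-op, and this sketch computes the same $w$ as Eq. 2.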
In general, a large system of linear equations must be solved to compute the solution, which can become prohibitive in a real-time setting. Over the next paragraphs we will see a special case of $x_i$ that bypasses this limitation.