2 L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr
Several recent works have aimed to overcome this limitation using a pre-
trained deep conv-net that was learnt for a different but related task. These
approaches either apply “shallow” methods (e.g. correlation filters) using the
network’s internal representation as features [5,6] or perform SGD (stochastic
gradient descent) to fine-tune multiple layers of the network [7,8,9]. While the use
of shallow methods does not take full advantage of the benefits of end-to-end
learning, methods that apply SGD during tracking to achieve state-of-the-art
results have not been able to operate in real-time.
We advocate an alternative approach in which a deep conv-net is trained to
address a more general similarity learning problem in an initial offline phase,
and then this function is simply evaluated online during tracking. The key con-
tribution of this paper is to demonstrate that this approach achieves very com-
petitive performance in modern tracking benchmarks at speeds that far exceed
the frame-rate requirement. Specifically, we train a Siamese network to locate
an exemplar image within a larger search image. A further contribution is a
novel Siamese architecture that is fully-convolutional with respect to the search
image: dense and efficient sliding-window evaluation is achieved with a bilinear
layer that computes the cross-correlation of its two inputs.
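The dense evaluation described above can be pictured as sliding the exemplar's embedding over the search image's embedding and taking an inner product at every translation. The following is a minimal numpy sketch of that cross-correlation operation on two feature maps; the function name and the naive loop are illustrative (a real implementation would use an optimized convolution routine), but the computed score map is the same quantity.

```python
import numpy as np

def cross_correlation(z_feat, x_feat):
    """Dense score map from sliding the exemplar embedding over the
    search embedding (illustrative sketch, not the paper's code).

    z_feat: (C, h, w) exemplar feature map.
    x_feat: (C, H, W) search feature map, with H >= h and W >= w.
    Returns an (H-h+1, W-w+1) map of inner products, one per translation.
    """
    C, h, w = z_feat.shape
    _, H, W = x_feat.shape
    score = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            # Inner product between the exemplar embedding and the
            # co-located window of the search embedding.
            score[i, j] = np.sum(z_feat * x_feat[:, i:i + h, j:j + w])
    return score
```

The peak of the returned map gives the translation at which the exemplar's representation best matches the search image's representation, which is what makes a single forward pass sufficient for localisation.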
We posit that the similarity learning approach has been relatively neglected
because the tracking community did not have access to vast labelled datasets.
In fact, until recently the available datasets comprised only a few hundred anno-
tated videos. However, we believe that the emergence of the ILSVRC dataset for
object detection in video [10] (henceforth ImageNet Video) makes it possible to
train such a model. Furthermore, the fairness of training and testing deep models
for tracking using videos from the same domain is a point of controversy, as it
has been recently prohibited by the VOT committee. We show that our model
generalizes from the ImageNet Video domain to the ALOV/OTB/VOT [1,11,12]
domain, enabling the videos of tracking benchmarks to be reserved for testing
purposes.
2 Deep similarity learning for tracking
Learning to track arbitrary objects can be addressed using similarity learning.
We propose to learn a function f(z, x) that compares an exemplar image z to a
candidate image x of the same size and returns a high score if the two images
depict the same object and a low score otherwise. To find the position of the
object in a new image, we can then exhaustively test all possible locations and
choose the candidate with the maximum similarity to the past appearance of the
object. In experiments, we will simply use the initial appearance of the object
as the exemplar. The function f will be learnt from a dataset of videos with
labelled object trajectories.
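The exhaustive search over locations can be sketched as follows. Here f is a stand-in cosine similarity between flattened same-size crops, purely to illustrate the interface; in the paper f is a learnt deep embedding, and the function names are assumptions of this sketch.

```python
import numpy as np

def f(z, x):
    """Toy similarity between two same-size crops: cosine similarity of
    the flattened pixels. A placeholder for the learnt function f(z, x)."""
    z, x = z.ravel(), x.ravel()
    return float(z @ x / (np.linalg.norm(z) * np.linalg.norm(x) + 1e-8))

def locate(exemplar, frame):
    """Score every same-size window of `frame` against the exemplar and
    return the top-left corner of the highest-scoring candidate."""
    h, w = exemplar.shape
    H, W = frame.shape
    best, best_pos = -np.inf, (0, 0)
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            s = f(exemplar, frame[i:i + h, j:j + w])
            if s > best:
                best, best_pos = s, (i, j)
    return best_pos
```

The fully-convolutional architecture of the previous section computes exactly this kind of dense score map in one pass, rather than cropping and scoring each candidate window separately.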
Given their widespread success in computer vision [13,14,15,16], we will use a
deep conv-net as the function f. Similarity learning with deep conv-nets is typ-
ically addressed using Siamese architectures [17,18,19]. Siamese networks apply
an identical transformation ϕ to both inputs and then combine their represen-