PERCEPTION ENHANCED FRAME FOR VISUAL OBJECT TRACKING
Binpeng Song 1,2, Jianfeng Liu 1,2, Jian Ye 1
1 Institute of Computing Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
ABSTRACT
Deep trackers built on networks pre-trained for object detection
have shown great potential in visual object tracking. However, the
gap between object detection and object tracking is non-negligible.
Moreover, the fixed template used by some previous deep trackers,
which holds only the initial target features throughout tracking,
greatly limits their performance. We therefore propose a perception
enhanced frame (PEF) that exploits target-aware features, which better
distinguish the target from the background, and updates the template
features through the response map. Our PEF tracker uses a fully
connected network with a mask loss to select target-aware feature
channels and updates the template to improve robustness. This reduces
the number of deep features used, enhances discriminative ability, and
ensures the diversity of the comparison template. Experimental results
on three popular datasets show that our method outperforms
state-of-the-art trackers in terms of accuracy and speed.
Index Terms— feature selection, convolution network,
mask loss, template update, object tracking
1. INTRODUCTION
Visual object tracking is a core task in computer vision and is
critical to many online, real-time applications such as intelligent
transportation, video monitoring, and intelligent robots. Object
tracking aims to capture the trajectory of a target through a sequence
of images, given the target's bounding box in the initial frame.
Traditional tracking methods that use raw color features or
hand-crafted features such as HOG and Color Names [1, 2, 3, 4, 5]
guarantee real-time performance but rarely achieve sufficient
localization accuracy [5, 6, 7]. Recently, visual trackers based on
convolutional features have attracted wide attention [8, 9, 10], and
performance has improved significantly owing to the power of
convolutional feature extraction.
Fig. 1. The image (left) shows the target search area and the target
object (red box); the heat map (middle) shows the confidence map of the
search area produced by the original Siamese tracker using the VGG-16
network; the heat map (right) shows the confidence map produced by our
PEF.
Many popular deep trackers draw inspiration from networks pre-trained
for object detection. Detection modules can improve localization
precision and provide better discriminability against occlusions and
background [11]. Although detection-based frameworks achieve
state-of-the-art performance [2, 7], the gap between object detection
and object tracking is non-negligible. First, object detection aims to
distinguish specific classes, whereas object tracking is supposed to
follow moving objects. Second, object detection does not need to
differentiate intra-class instances, whereas object tracking does [11].
The pre-trained network therefore contains convolution channels that
are redundant for object tracking, which may harm target localization
and tracking efficiency. In addition, we observe that earlier tracking
work uses a fixed template. During tracking, however, the target's
appearance may change because of its own motion or a change in
viewpoint; for example, a person takes on different poses while
walking.
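The two problems above, redundant channels and a fixed template, can be made concrete with a small sketch. The code below is not the PEF implementation: it assumes per-channel importance scores are already available (hand-written numbers here, not the output of a fully connected network with mask loss), computes a Siamese-style response map by plain cross-correlation, and blends the best-matching patch into the template with a fixed rate. All function names, the rate, and the score threshold are hypothetical.

```python
# Illustrative sketch only; names, scores and thresholds are assumptions,
# not taken from the paper. Features are nested lists: [channel][row][col].

def top_channels(importance, keep):
    """Indices of the `keep` channels with the highest importance scores."""
    order = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
    return sorted(order[:keep])

def take(features, channels):
    """Keep only the listed channels of a feature tensor."""
    return [features[c] for c in channels]

def response_map(template, search):
    """Valid cross-correlation of template over search, summed over channels."""
    th, tw = len(template[0]), len(template[0][0])
    sh, sw = len(search[0]), len(search[0][0])
    resp = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            total = 0.0
            for c in range(len(template)):
                for i in range(th):
                    for j in range(tw):
                        total += template[c][i][j] * search[c][y + i][x + j]
            row.append(total)
        resp.append(row)
    return resp

def peak(resp):
    """Return (row, col, score) of the strongest response."""
    score, y, x = max((v, y, x) for y, r in enumerate(resp) for x, v in enumerate(r))
    return y, x, score

def update_template(template, search, y, x, rate):
    """Linearly blend the search patch at (y, x) into the template."""
    th, tw = len(template[0]), len(template[0][0])
    return [[[(1.0 - rate) * template[c][i][j] + rate * search[c][y + i][x + j]
              for j in range(tw)] for i in range(th)]
            for c in range(len(template))]

# Toy data: channel 1 fires strongly on a background location.
template = [[[1.0]], [[5.0]]]
search = [[[0.0, 2.0], [1.0, 0.0]],
          [[9.0, 0.0], [0.0, 0.0]]]
importance = [0.9, 0.1]  # hypothetical target-awareness scores

print("all channels:", peak(response_map(template, search)))

kept = top_channels(importance, keep=1)
t_sel, s_sel = take(template, kept), take(search, kept)
y, x, score = peak(response_map(t_sel, s_sel))
print("selected channels:", (y, x, score))

if score >= 1.0:  # hypothetical confidence gate on the response peak
    t_sel = update_template(t_sel, s_sel, y, x, rate=0.25)
print("updated template:", t_sel)
```

In this toy case the low-importance channel pulls the response peak to a different location, and dropping it moves the peak; it also shrinks the feature tensor that has to be correlated. The update rule shown is a plain running average chosen for brevity.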
To address the above issues, we propose the perception enhanced frame
(PEF). Our PEF is built upon an advanced deep model, the Siamese
matching network [8]. To deal with the redundant channels of the
pre-trained model, PEF exploits target-aware features, which better
distinguish the target from the background, as shown in Figure 1.
Instead of using a fixed template, we incorporate a dynamic template
mechanism that updates the template features using feedback from the
response map. In experiments, we evaluate the proposed PEF tracker on
three benchmark datasets and the result