Visual Object Tracking Based on Combination of
Local Description and Global Representation
Li Sun and Guizhong Liu, Member, IEEE
Abstract—This paper presents a novel method for visual object
tracking based on the combination of local scale-invariant feature
transform (SIFT) description and global incremental principal
component analysis (PCA) representation in loosely constrained
conditions. The state of the object is defined by the position and shape of a parallelogram; that is, tracking results are given by locating the object in every frame with a parallelogram. The
whole method is constructed in the framework of a particle filter, which includes two models: the dynamic model and the observa-
tion model. In the dynamic model, particle states are predicted
with the help of local SIFT descriptors. Local key point matching
between successive frames based on SIFT descriptors provides an important cue for predicting particle states; thus, we can
efficiently spread particles in the neighborhood of the predicted
position. In the observation model, every particle is evaluated by
local key point-weighted incremental PCA representation, which describes the object more accurately by assigning larger weights to the pixels in the influence areas of key points. Moreover, by incorporating a dynamic forgetting factor, we update the PCA eigenvectors online according to the object states, which makes our method more adaptable to different situations.
Experimental results show that, compared with other state-of-the-art methods, the proposed method is especially robust under difficult conditions, such as strong motion of both the object and the background, large pose changes, and illumination changes.
Index Terms—Forgetting factor, object tracking, PCA, SIFT.
I. Introduction
For decades, visual object tracking has drawn many researchers’ attention and has become a hot research topic. It has been widely used in computer vision systems,
such as video surveillance [1], [2], activity analysis and
recognition [3], [4], human–computer interaction [5], [6], and
video retrieval and summarization [7], [8]. The goal of tracking
is to automatically locate the same object in each successive frame of a video sequence once it has been initialized. Although many
methods have already been proposed for different applications,
it is still a challenging task to develop a robust object tracking
algorithm for the following two main reasons. First, there often
exist pose changes and shape deformations of the object, which make it difficult to obtain stable and discriminative features from an image. Second, varying environmental conditions, including illumination variation, cluttered backgrounds, and even possible occlusions, also pose obstacles to the adaptability of the algorithm.
There are usually two major modules in an object tracking
system, which are the appearance description module and the
location determination module. The former module tells us what is being tracked by describing the appearance of the target with specific features. It provides the foundation of the system, since reliable and discriminative features are of extreme importance for object tracking. The description can be based either on global features or representations, such as a raw image patch [9] or histograms of pixel color [10], [11], or on local features or descriptions, such as an object contour [12], [13] or a collection of local key points [14], [15]. The latter module aims to determine the state of the object based on the description features; the state can be the location in the image plane, the curvature of the contour, or the speed of the object. It is essentially an estimation process based on the previous state and can be performed either at a low level (e.g., pixel-based optical flow [16] or block-based matching [17]) or at a high level (e.g., a dynamic state transition model [12], [18]). In a particle filter, these two modules correspond to the observation model and the dynamic model, respectively.
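To make the roles of these two models concrete, the following minimal sketch (Python with NumPy, not the implementation used in this paper) shows one particle filter step in which hypothetical dynamic_model and observation_model callables play the parts described above; the state is simplified to a plain vector rather than the parallelogram used here.

```python
import numpy as np

def particle_filter_step(particles, frame, dynamic_model, observation_model):
    # Dynamic model: predict each particle's new state from its previous state.
    predicted = np.array([dynamic_model(p) for p in particles])
    # Observation model: score how well each predicted state explains the current frame.
    weights = np.array([observation_model(p, frame) for p in predicted])
    weights /= weights.sum()
    # Resample so high-weight particles are duplicated and low-weight ones die out.
    idx = np.random.choice(len(predicted), size=len(predicted), p=weights)
    return predicted[idx]
```

The tracking output for the frame is then typically taken as the weighted mean state or the state of the highest-weight particle.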
In this paper, we focus on both modules and design a compact object tracking scheme based on the combination of global representation by incremental principal component analysis (PCA) and local description by scale-invariant feature transform (SIFT) descriptors. In particular, the proposed scheme is constructed in the framework of a particle filter. In the dynamic model, local key points are reliably matched between successive frames based on SIFT descriptors, which gives a prediction of the object state from its corresponding location in the previous frame. In the observation model, the incremental PCA representation accurately describes the object as a whole; it maintains a set of eigenvectors that are dynamically updated during tracking. Moreover, key point information is also used to calculate the particle weights.
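As an illustration only, the sketch below (Python with NumPy, using OpenCV's SIFT) shows one possible way to realize these two ideas: an inter-frame shift predicted from matched SIFT key points, and a particle score based on a pixel-weighted PCA reconstruction error. The names mean_patch, eigenvectors, and pixel_weights are assumed placeholders for the learned incremental PCA model and the key point-based weighting, not the exact quantities defined later in the paper.

```python
import numpy as np
import cv2

def predict_shift(prev_gray, cur_gray):
    """Estimate the object's inter-frame displacement from matched SIFT key points."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(cur_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    shifts = [np.subtract(kp2[m.trainIdx].pt, kp1[m.queryIdx].pt) for m in matches]
    return np.median(shifts, axis=0)  # robust (dx, dy); particles are spread around it

def particle_score(patch, mean_patch, eigenvectors, pixel_weights):
    """Score a candidate patch by its pixel-weighted PCA reconstruction error."""
    d = (patch - mean_patch).ravel()
    coeffs = eigenvectors @ d               # project onto the current eigenbasis
    residual = d - eigenvectors.T @ coeffs  # part not explained by the subspace
    err = np.sum(pixel_weights.ravel() * residual ** 2)
    return np.exp(-err)                     # larger weight for smaller error
```

In this sketch, weighting the squared residual by pixel_weights mimics the idea of emphasizing pixels near key points, while the eigenbasis would be refreshed online as new frames arrive.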
The proposed method is consistent with human visual perception [19], in which both global and local information are important for locating objects. In object detection and localization, global PCA representation has already achieved considerable success, which can be found