Int J Comput Vis (2008) 77: 125–141
DOI 10.1007/s11263-007-0075-7
Incremental Learning for Robust Visual Tracking
David A. Ross · Jongwoo Lim · Ruei-Sung Lin · Ming-Hsuan Yang
Received: 6 September 2005 / Accepted: 17 July 2007 / Published online: 17 August 2007
© Springer Science+Business Media, LLC 2007
Abstract Visual tracking, in essence, deals with non-stationary image streams that change over time. While most existing algorithms are able to track objects well in controlled environments, they usually fail in the presence of significant variation of the object's appearance or surrounding illumination. One reason for such failures is that many algorithms employ fixed appearance models of the target. Such models are trained using only appearance data available before tracking begins, which in practice limits the range of appearances that are modeled, and ignores the large volume of information (such as shape changes or specific lighting conditions) that becomes available during tracking. In this paper, we present a tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in the appearance of the target. The model update, based on incremental algorithms for principal component analysis, includes two important features: a method for correctly updating the sample mean, and a forgetting factor to ensure less modeling power is expended fitting older observations. Both of these features contribute measurably to improving overall tracking performance. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large changes in pose, scale, and illumination.

D.A. Ross (✉)
University of Toronto, 10 King's College Road, Toronto, ON, M5S 3G4, Canada
e-mail: dross@cs.toronto.edu

J. Lim · M.-H. Yang
Honda Research Institute, 800 California Street, Mountain View, CA 94041, USA
J. Lim e-mail: jlim@honda-ri.com
M.-H. Yang e-mail: mhyang@ieee.org

R.-S. Lin
Motorola Labs, 1303 E Algonquin Rd., Schaumburg, IL 60196, USA
e-mail: ruei-sung.lin@motorola.com
Keywords Visual tracking · Subspace update · Online algorithms · Adaptive methods · Particle filter · Illumination
1 Introduction
Visual tracking essentially deals with non-stationary data,
both the target object and the background, that change over
time. Most existing algorithms are able to track objects,
either previously viewed or not, for short durations in well-controlled environments. However, these algorithms usually fail to track the object motion, or drift significantly, after some period of time, due to drastic change
in the object’s appearance or large lighting variation in its
surroundings. Although such situations can be ameliorated
with recourse to richer representations, effective prediction schemes, or a combination of the two, most algorithms typically operate on the premise that the model of the target object does
not change drastically over time. Examples abound, ranging
from representation methods based on view-based appear-
ance models (Black and Jepson 1996), contours (Isard and
Blake 1996), parametric templates of geometry and illumi-
nation (Hager and Belhumeur 1996), integration of shape
and color (Birchfield 1998), mixture models (Black et al.
1998), 3D models (La Cascia and Sclaroff 1999), exemplars
(Toyama and Blake 2001), foreground/background models
(Harville 2002); templates with updating (Matthews et al.
2004); prediction schemes using particle filters (Isard and
Blake 1996), joint probabilistic data association filters (Ras-
mussen and Hager 1998), kernel-based filters (Comaniciu
et al. 2003; Georgescu et al. 2004), support vector machines
(Avidan 2001; Williams et al. 2003) and variational infer-
ence (Vermaak et al. 2003). These algorithms usually build
or learn a model of the target object first and then use it for
tracking, without adapting the model to account for changes
in the appearance of the object, e.g., large variation of pose
or facial expression, or the surroundings, e.g., lighting varia-
tion. Furthermore, it is assumed that all images are acquired
with a stationary camera. Such an approach, in our view, is
prone to performance instability, an issue that needs to be addressed
when building a robust visual tracker.
The chief challenge of visual tracking can be attributed
to the difficulty in handling the appearance variability of a
target object. Intrinsic appearance variability includes pose
variation and shape deformation, whereas extrinsic factors, including illumination change, camera motion, camera viewpoint, and occlusion, inevitably cause large appearance variation. Due to the
nature of the tracking problem, it is imperative for a robust
algorithm to model such appearance variation.
In this paper we propose a method¹ that, during visual tracking, efficiently learns and updates a low-dimensional
subspace representation of the target object. The advantages
of this adaptive subspace representation are severalfold.
The subspace representation provides a compact notion of
the “thing” being tracked rather than treating the target as a
set of independent pixels, i.e., “stuff” (Adelson and Bergen
1991), and facilitates object recognition. An efficient incre-
mental method continually updates the subspace model to
reflect changes in appearance caused by intrinsic and extrin-
sic factors, thereby facilitating the tracking process. Incre-
mentally updating the subspace removes the offline learning
phase required by other eigentrackers, allowing one to track
objects for which a database of training images is not even
available. To estimate the locations of the target objects in
consecutive frames, we use a sampling algorithm with like-
lihood estimates, which is in contrast to other tracking meth-
ods that usually solve complex optimization problems using
gradient descent. Furthermore, while numerous algorithms
operate under the assumption that there is no camera motion,
our method is able to track objects without this constraint.
The remaining part of this paper is organized as follows.
We begin, in the next section, by reviewing the most relevant
algorithms that motivated this work. The details of our algo-
rithm are described in Sect. 3, where we propose an efficient
incremental subspace method with a mean update and for-
getting factor, followed by an effective tracking algorithm.
¹ Preliminary results of this paper were presented in Ross et al. (2004) and Lim et al. (2005).
The results of numerous experiments and performance eval-
uation are presented in Sect. 4. We conclude this paper with
remarks on potential extensions for future work. The data,
source code, and videos corresponding to this work can all
be found at http://www.cs.toronto.edu/~dross/ivt/.
2 Related Work and Motivation
There is a rich literature in visual tracking and a thorough
discussion on this topic is beyond the scope of this paper. In
this section we review only the most relevant visual track-
ing work, focusing on algorithms that operate directly on
grayscale images. We contrast our method with these meth-
ods in terms of their representation scheme, target prediction
approach, and their ability to handle changes in illumination
as well as appearance.
Visual tracking problems have conventionally been for-
mulated as prediction tasks within which fixed templates and
optical flow techniques are utilized to estimate the motion of
a target object (Lucas and Kanade 1981). Such approaches
do not take the appearance variability into consideration, and
thus perform well only over short periods of time. To en-
hance the robustness of such object trackers, Black and Jep-
son proposed an algorithm using a pre-trained view-based
eigenbasis representation and a robust error norm (Black
and Jepson 1996). Instead of relying on the brightness constancy principle assumed in optical flow techniques, they
advocated the use of a subspace constancy assumption for
motion estimation. Although their algorithm demonstrated
excellent empirical results, it entailed learning a set of view-
based eigenbases before the tracking task began. To achieve
robust visual tracking with this method, it is imperative to
collect a large set of training images covering the range of
possible appearance variation (including viewing angles and
illumination) from which to construct the eigenbasis, as this
representation, once learned, is not updated.
Observing that low-dimensional linear subspaces are able to effectively model image variation due to illumination (Belhumeur and Kriegman 1997), Hager and Belhumeur developed a tracking algorithm to handle the appearance variation caused by illumination and pose change using parametric models (Hager and Belhumeur 1996). Their method
extends a gradient-based optical flow algorithm by incor-
porating low-dimensional representations (Belhumeur and Kriegman 1997) for object tracking under varying illumination conditions. Before tracking begins, a set of illumination
bases needs to be constructed at a fixed pose in order to ac-
count for changes in appearance due to lighting variation.
However, this basis does not attempt to account for changes
in pose such as out-of-plane rotations.
Realizing the limitations of having a single (unimodal or
Gaussian) hypothesis of target location at each timestep—
as produced by the Kalman filter and its relatives—Isard
and Blake introduced particle filters to visual tracking and
presented the Condensation algorithm for contour track-
ing in which multiple plausible interpretations are propa-
gated over time (Isard and Blake 1996). This probabilis-
tic approach has demonstrated success in tracking the out-
line of target objects in clutter. However, the representa-
tion scheme employed (curves or splines) ignores the inter-
nal appearance of the target, and is not updated to account
for variations in its appearance, due to pose or illumination
change.
Supervised discriminative methods for classification and
regression have also been exploited to solve visual tracking
problems. For example, Avidan (2001) developed a tracking
algorithm that employs the support vector machine (SVM)
classifier within an optical flow framework. Avidan modified
the conventional use of the SVM classification score to in-
stead predict target location, by computing image gradi-
ents as is done in optical flow algorithms. Although this al-
gorithm has demonstrated success in tracking specific objects, e.g., cars viewed from a camera mounted in a moving vehicle, significant effort is required in training an SVM. Along
similar lines, Williams et al. developed a method in which
an SVM-based regressor was used for tracking (Williams
et al. 2003). Instead of relying on optical flow to predict
object location, they learned a perturbation function of spa-
tial in-plane displacements between frames, thereby predict-
ing the most likely object location. As a result of train-
ing the regressor on in-plane image motion, this method
is not effective in tracking objects with out-of-plane move-
ments.
Mixture models have been studied as alternatives to lin-
ear representations, to better account for appearance change
in motion estimation. Black et al. (1998) identified four pos-
sible factors causing appearance change, fitting them with a
mixture model which was then used to estimate image mo-
tion. A more elaborate mixture model fit via an online EM
algorithm was recently proposed by Jepson et al. (2003), in
which three components were used to model the responses
of wavelet filters, and thereby account for appearance vari-
ation during tracking. Their method is able to handle vari-
ations in pose, illumination and expression. However, their
appearance model treats pixels within the target region inde-
pendently (ignoring their covariance) and thus has no notion of the "thing" being tracked. This can result in modeling the background rather than the foreground, thereby failing
to track the target object (Jepson et al. 2003).
An attempt to improve the classic Lucas–Kanade tracker (Lucas and Kanade 1981) with online updates was recently made by Matthews et al. (2004). They developed a template update method for visual tracking, which employs an active
appearance model (Cootes et al. 2001) to account for im-
age variation. Thus instead of using a fixed template, the
object appearance is modeled by a linear combination of ap-
pearance images. The tracking problem is then formulated
as a search (using gradient descent) for the affine parame-
ters and linear combination which minimize the difference
between the target object and the current appearance model.
The newly tracked object is then used to update the appearance
model, as necessary. They demonstrated good tracking re-
sults on vehicles and faces with varying expressions. How-
ever, the authors noted that the computational cost of updating the template increases dramatically, as principal component analysis is carried out at each update, and that their
work covers the case where the visibility of the target object
does not change.
Our work is motivated in part by the prowess of subspace
representations as appearance models (Murase and Nayar
1995; Belhumeur and Kriegman 1997), the effectiveness of
particle filters (Isard and Blake 1996), and the adaptability
of on-line update schemes (Jepson et al. 2003). In contrast to
the eigentracking algorithm (Black and Jepson 1996), our
algorithm does not require a training phase but learns the
eigenbases on-line during the object tracking process. Thus
our appearance model can adapt to changes in pose, view
angle, and illumination not captured by the set of training
images—in fact the need to manually collect training im-
ages prior to tracking is eliminated. Further, our method uses
a particle filter for motion parameter estimation rather than
the gradient descent method, which often gets stuck in local
minima or is distracted by outliers (Black and Jepson 1996).
Our appearance-based model provides a richer description
than simple curves or splines as used in (Isard and Blake
1996), and has a notion of the “thing” being tracked. In ad-
dition, the learned representation can be utilized for other
tasks such as object recognition. With respect to the template
update method (Cootes et al. 2001), we concurrently devel-
oped an efficient subspace update algorithm that facilitates
object tracking under varying pose and lighting conditions.
Furthermore, our algorithm is able to handle camera motion
while learning compact representations and tracking objects.
In this work, an eigenbasis representation is learned directly
from pixel values corresponding to a target object in the
image space. Experiments show that good tracking results
can be obtained using this representation without employing
more complicated wavelet features as in Jepson et al. (2003),
although this elaboration is still possible and may lead to
even better results. Note also that the view-based eigenba-
sis representation has demonstrated its ability to model the
appearance of objects in different poses (Murase and Nayar
1995), and under different lighting conditions (Belhumeur and Kriegman 1997). Consequently, the learned eigenbasis
facilitates tracking objects undergoing illumination and pose
change.
3 Incremental Learning for Tracking
We present details of the proposed incremental learning al-
gorithm for object tracking in this section. First we propose
an efficient method that incrementally updates an eigenbasis
as new observations arrive, which is used to learn the ap-
pearance of the target while tracking progresses. Next we
describe our approach for drawing particles in the motion
parameter space and predicting the most likely object loca-
tion with the help of the learned appearance model. Collec-
tively, we show how these two modules work in tandem to
track objects well under varying conditions.
3.1 Incremental Update of Eigenbasis and Mean
The appearance of a target object may change drastically
due to intrinsic and extrinsic factors as discussed earlier.
Therefore, to produce a robust tracker, it is important to
adapt the appearance model online, while tracking, to re-
flect these changes. The appearance model we have chosen,
an eigenbasis, is typically learned off-line from a set of training images $\{I_1, \ldots, I_n\}$, by computing the eigenvectors $U$ of the sample covariance matrix $\frac{1}{n-1}\sum_{i=1}^{n}(I_i - \bar{I})(I_i - \bar{I})^\top$, where $\bar{I} = \frac{1}{n}\sum_{i=1}^{n} I_i$ is the sample mean of the training images. Equivalently, one can obtain $U$ by computing the singular value decomposition $U\Sigma V^\top$ of the centered data matrix $[(I_1 - \bar{I}) \cdots (I_n - \bar{I})]$, whose columns are the respective training images minus their mean.
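As a quick numerical illustration of this equivalence (not from the paper; random data stands in for the vectorized images and the dimensions are arbitrary), the left singular vectors of the centered data matrix diagonalize the scatter matrix:

```python
import numpy as np

# Batch PCA via SVD of the centered data matrix, as described above.
# A is d x n with one (vectorized) image per column; random data here.
rng = np.random.default_rng(0)
d, n = 20, 8
A = rng.standard_normal((d, n))

I_bar = A.mean(axis=1, keepdims=True)            # sample mean of the images
U, S, Vt = np.linalg.svd(A - I_bar, full_matrices=False)

# U contains the eigenvectors of the scatter matrix (and hence of the
# sample covariance, up to the 1/(n-1) scale): scatter = U diag(S^2) U^T.
scatter = (A - I_bar) @ (A - I_bar).T
print(np.allclose(U @ np.diag(S**2) @ U.T, scatter))
```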
Adapting the appearance model to account for novel views of the target can be thought of as retraining the eigenbasis with an additional $m$ images $\{I_{n+1}, \ldots, I_{n+m}\}$, for some value of $m$. Naively, this update could be performed by computing the singular value decomposition $U'\Sigma'V'^\top$ of the augmented (centered) data matrix $[(I_1 - \bar{I}') \cdots (I_{n+m} - \bar{I}')]$, where $\bar{I}'$ is the average of all $n + m$ training images.
Unfortunately this approach is unsatisfactory for online
applications, like visual tracking, due to its storage and com-
putational requirements. First, the naive approach uses the
entire set of training images for each update. If an update is
made at each video frame, then the number of images which
must be retained grows linearly with the length of the se-
quence. Second, the cost of computing the mean and singu-
lar value decomposition grows with the number of images,
so the algorithm will run ever slower as time progresses. In-
stead, the requirements of our application dictate that any
algorithm for updating the mean and eigenbasis must have
storage and computational requirements that are constant,
regardless of the number of images seen so far.
Numerous more sophisticated algorithms have been developed to efficiently update an eigenbasis as more data
arrive (Golub and Van Loan 1996; Hall et al. 1998; Levy and Lindenbaum 2000; Brand 2002). However, most methods assume the sample mean is fixed when updating the
eigenbasis, or equivalently that the data is inherently zero-
mean. Neither assumption is appropriate in our application.
An exception is the method by Hall et al. (2002), which
does consider the change of the mean as each new datum
arrives. Although similar to our (independently-developed)
algorithm, it lacks the forgetting factor, which hurts its suit-
ability for tracking, and has a greater computational cost.
(Both of these disadvantages are demonstrated quantitatively in Sect. 4.3.) Part of the additional complexity arises because Hall's algorithm, which is based on the notion of adding eigenspaces, must compute the eigenvalue decomposition of each block of new data as it arrives. In this respect our algorithm is simpler, since it incorporates new data directly, without this additional step.
Here we extend one of these efficient update proce-
dures—the Sequential Karhunen–Loeve (SKL) algorithm of
Levy and Lindenbaum (2000)—presenting a new incremen-
tal PCA algorithm that correctly updates the eigenbasis as
well as the mean, given one or more additional training data.
Our algorithm, a variation of which was first presented in
Lim et al. (2005), has also been applied to algorithms where
the subspace mean plays an important role. For example, it
can be applied to adaptively update the between-class and
within-class covariance matrices used in Fisher linear dis-
criminant analysis (Lin et al. 2005). We begin with a sum-
mary of the SKL algorithm, then describe our new incre-
mental PCA algorithm, and follow with a discussion of a
forgetting factor which can be used to down-weight the ef-
fect of earlier observations on the PCA model.
Putting aside for the moment the problem of the sample mean, suppose we have a $d \times n$ data matrix $A = [I_1, \ldots, I_n]$, where each column $I_i$ is an observation (a $d$-dimensional image vector in this paper), for which we have already computed the singular value decomposition $A = U\Sigma V^\top$. When a $d \times m$ matrix $B$ of new observations is available, the goal is to efficiently compute the SVD of the concatenation of $A$ and $B$: $[A \; B] = U'\Sigma'V'^\top$. Letting $\tilde{B}$ be the component of $B$ orthogonal to $U$, we can express the concatenation of $A$ and $B$ in a partitioned form as follows:

$$[A \; B] = [U \; \tilde{B}] \begin{bmatrix} \Sigma & U^\top B \\ 0 & \tilde{B}^\top B \end{bmatrix} \begin{bmatrix} V & 0 \\ 0 & I \end{bmatrix}^\top. \tag{1}$$
Let $R = \begin{bmatrix} \Sigma & U^\top B \\ 0 & \tilde{B}^\top B \end{bmatrix}$, which is a square matrix of size $k + m$, where $k$ is the number of singular values in $\Sigma$. The SVD of $R$, $R = \tilde{U}\tilde{\Sigma}\tilde{V}^\top$, can be computed in constant time regardless of $n$, the initial number of data. Now the SVD of $[A \; B]$ can be expressed as

$$[A \; B] = \big([U \; \tilde{B}]\,\tilde{U}\big) \, \tilde{\Sigma} \, \left( \tilde{V}^\top \begin{bmatrix} V & 0 \\ 0 & I \end{bmatrix}^\top \right).$$

Since an incremental PCA is only interested in computing $U'$ and $\Sigma'$, the matrix $V'$, whose size scales with the number of observed data, need not be computed. Thus we arrive at the following formulation of the SKL algorithm.
Given $U$ and $\Sigma$ from the SVD of $A$, compute $U'$ and $\Sigma'$ from the SVD of $[A \; B]$:

1. Obtain $\tilde{B}$ and $R$ by taking the QR decomposition of $[U\Sigma \; B]$: $[U \; \tilde{B}]\,R \overset{\mathrm{QR}}{=} [U\Sigma \; B]$.
2. Compute the SVD of $R$: $R \overset{\mathrm{SVD}}{=} \tilde{U}\tilde{\Sigma}\tilde{V}^\top$.
3. Finally $U' = [U \; \tilde{B}]\,\tilde{U}$ and $\Sigma' = \tilde{\Sigma}$. If the desired number of basis vectors in $U'$ is less than the number of non-zero singular values, then these excess vectors and singular values may be discarded.
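The steps above can be sketched with NumPy (a sketch only: zero-mean data, no basis truncation, and arbitrary illustrative dimensions). The singular values recovered incrementally match those of a direct SVD of $[A \; B]$:

```python
import numpy as np

# SKL update sketch: given U, S from the SVD of A, fold in new
# observations B without revisiting A. Random data; no truncation.
rng = np.random.default_rng(1)
d, n, m = 30, 10, 4
A = rng.standard_normal((d, n))
B = rng.standard_normal((d, m))
U, S, _ = np.linalg.svd(A, full_matrices=False)

# Step 1: QR decomposition of [U*Sigma  B] yields [U  B_tilde] and R.
Q, R = np.linalg.qr(np.hstack([U * S, B]))       # U * S == U @ diag(S)
# Step 2: SVD of the small (k+m) x (k+m) matrix R.
U_t, S_t, _ = np.linalg.svd(R)
# Step 3: rotate back into image space: U' = [U B_tilde] U_tilde.
U_new = Q @ U_t

# The singular values agree with a batch SVD of the concatenation.
S_direct = np.linalg.svd(np.hstack([A, B]), compute_uv=False)
print(np.allclose(S_t, S_direct))
```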
The algorithm can also be made slightly faster, although somewhat more complicated, by modifying the arrangement of calculations in Step 1. Instead of computing the QR decomposition of $[U\Sigma \; B]$, $\tilde{B}$ and $R$ can be obtained directly as follows: $\tilde{B} = \mathrm{orth}(B - UU^\top B)$ and

$$R = \begin{bmatrix} \Sigma & U^\top B \\ 0 & \tilde{B}^\top (B - UU^\top B) \end{bmatrix},$$

where $\mathrm{orth}()$ performs orthogonalization, perhaps via QR. This reorganization, which follows from (1), avoids performing QR on the entire matrix $[U\Sigma \; B]$ (note that the columns corresponding to $U$ are already orthogonal), instead orthogonalizing only $(B - UU^\top B)$, the component of $B$ not already in the subspace $U$.
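In NumPy terms, the reorganized computation looks like the following sketch (same kind of illustrative random-data setup as before; `orth()` is realized here by an economy QR of the residual):

```python
import numpy as np

# Reorganized Step 1: orthogonalize only the component of B that lies
# outside span(U), instead of QR-factoring all of [U*Sigma  B].
rng = np.random.default_rng(2)
d, n, m = 30, 10, 4
A = rng.standard_normal((d, n))
B = rng.standard_normal((d, m))
U, S, _ = np.linalg.svd(A, full_matrices=False)

proj = U.T @ B                                   # U^T B
residual = B - U @ proj                          # B - U U^T B
B_tilde, _ = np.linalg.qr(residual)              # orth() via economy QR

# R = [[Sigma, U^T B], [0, B_tilde^T (B - U U^T B)]]
k = len(S)
R = np.block([[np.diag(S), proj],
              [np.zeros((m, k)), B_tilde.T @ residual]])

# R has the same singular values as the concatenation [A B].
S_direct = np.linalg.svd(np.hstack([A, B]), compute_uv=False)
S_R = np.linalg.svd(R, compute_uv=False)
print(np.allclose(S_R, S_direct))
```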
The computational advantage of the SKL algorithm over the naive approach is that it has space and time complexity that is constant in $n$, the number of training data seen so far. Specifically, each update makes use of only the $k$ largest singular values and basis vectors from the previous stage. This, together with the storage required for the $m$ new images, reduces the space complexity to $O(d(k + m))$, down from $O(d(n + m)^2)$ with the naive approach. Similarly, the computational requirements are also reduced, to $O(dm^2)$ versus $O(d(n + m)^2)$ for recomputing the entire SVD. More details and complexity analysis of the SKL algorithm are described in Levy and Lindenbaum (2000).
The problem with the SKL algorithm as stated above is
that it makes no attempt to account for the sample mean
of the training data, which changes over time as new data
arrive. We will now show how this can be overcome. The
essence of the approach is, at each update of the eigenbasis,
to augment the new training data with an additional vector
carefully chosen to correct for the time-varying mean. We
begin by proving the following lemma:
Lemma 1 Let $A = [I_1, I_2, \ldots, I_n]$ and $B = [I_{n+1}, I_{n+2}, \ldots, I_{n+m}]$ be data matrices and $C = [A \; B]$ be their concatenation. Denote the means and scatter matrices of $A$, $B$, $C$ as $\bar{I}_A$, $\bar{I}_B$, $\bar{I}_C$, and $S_A$, $S_B$, $S_C$ respectively. It can be shown that

$$S_C = S_A + S_B + \frac{nm}{n+m} (\bar{I}_B - \bar{I}_A)(\bar{I}_B - \bar{I}_A)^\top.$$

In this lemma, we define a scatter matrix to be the outer product of the centered data matrix, for example $S_B = \sum_{i=n+1}^{n+m} (I_i - \bar{I}_B)(I_i - \bar{I}_B)^\top$. Thus a scatter matrix differs from the sample covariance matrix by only a scalar multiple: $S_B = m\,\mathrm{cov}(B)$. The proof of this lemma appears in the Appendix.
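The lemma is easy to verify numerically; the following sketch (random data and arbitrary sizes, for illustration only) checks the scatter identity directly:

```python
import numpy as np

# Numerical check of Lemma 1: the scatter of C = [A B] equals
# S_A + S_B plus a rank-one term built from the difference of means.
rng = np.random.default_rng(3)
d, n, m = 6, 9, 5
A = rng.standard_normal((d, n))
B = rng.standard_normal((d, m))
C = np.hstack([A, B])

def scatter(X):
    Xc = X - X.mean(axis=1, keepdims=True)       # center the columns
    return Xc @ Xc.T

diff = (B.mean(axis=1) - A.mean(axis=1)).reshape(-1, 1)
S_C = scatter(A) + scatter(B) + (n * m) / (n + m) * (diff @ diff.T)
print(np.allclose(S_C, scatter(C)))
```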
From Lemma 1 we can see that the SVD of $(C - \bar{I}_C)$ is equal to the SVD of the horizontal concatenation of $(A - \bar{I}_A)$, $(B - \bar{I}_B)$, and one additional vector $\sqrt{\frac{nm}{n+m}}\,(\bar{I}_B - \bar{I}_A)$. (The slight abuse of notation $(A - \bar{I}_A)$ is meant as a shorthand for the matrix $[(I_1 - \bar{I}_A) \cdots (I_n - \bar{I}_A)]$.) This motivates our new algorithm, appearing in Fig. 1.
As can be seen, this algorithm shares the favorable com-
plexity of the SKL algorithm, incurring only a small con-
stant overhead to store, update, and correct for the changing
sample mean.
3.1.1 Forgetting Factor
In numerous vision applications it is desirable to focus more
on recently-acquired images and less on earlier observa-
tions. For example, when tracking a target with a chang-
ing appearing, it is likely that recent observations will be
more indicative of its appearance than would more distant
ones. Down-weighting the contribution of earlier observa-
tions also plays an important role in online learning. As time
progresses the observation history can become very large, to
the point of overwhelming the relative contribution of each
Given $U$ and $\Sigma$ from the SVD of $(A - \bar{I}_A)$, as well as $\bar{I}_A$, $n$, and $B$, compute $\bar{I}_C$ as well as $U'$ and $\Sigma'$ from the SVD of $(C - \bar{I}_C)$:

1. Compute the mean vectors $\bar{I}_B = \frac{1}{m}\sum_{i=n+1}^{n+m} I_i$ and $\bar{I}_C = \frac{n}{n+m}\bar{I}_A + \frac{m}{n+m}\bar{I}_B$.
2. Form the matrix $\hat{B} = \big[ (I_{n+1} - \bar{I}_B) \; \cdots \; (I_{n+m} - \bar{I}_B) \;\; \sqrt{\frac{nm}{n+m}}\,(\bar{I}_B - \bar{I}_A) \big]$.
3. Compute $\tilde{B} = \mathrm{orth}(\hat{B} - UU^\top\hat{B})$ and $R = \begin{bmatrix} \Sigma & U^\top\hat{B} \\ 0 & \tilde{B}^\top(\hat{B} - UU^\top\hat{B}) \end{bmatrix}$. Note that $\tilde{B}$ will be one column larger than in the SKL algorithm.
4. Compute the SVD of $R$: $R \overset{\mathrm{SVD}}{=} \tilde{U}\tilde{\Sigma}\tilde{V}^\top$.
5. Finally $U' = [U \; \tilde{B}]\,\tilde{U}$ and $\Sigma' = \tilde{\Sigma}$.

Fig. 1 The incremental PCA algorithm with mean update
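A minimal NumPy sketch of the algorithm in Fig. 1 (assuming no truncation of the basis and no forgetting factor; the random data and dimensions are arbitrary stand-ins) confirms that the incrementally updated singular values match a batch SVD of the centered concatenation:

```python
import numpy as np

# Incremental PCA with mean update (Fig. 1), sketched without basis
# truncation or a forgetting factor. Random data stands in for images.
rng = np.random.default_rng(4)
d, n, m = 25, 12, 5
A = rng.standard_normal((d, n))
B = rng.standard_normal((d, m))

mean_A = A.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(A - mean_A, full_matrices=False)

# Step 1: means of B and of the concatenation C = [A B].
mean_B = B.mean(axis=1, keepdims=True)
mean_C = (n * mean_A + m * mean_B) / (n + m)

# Step 2: centered new data, augmented with the mean-correction vector.
B_hat = np.hstack([B - mean_B,
                   np.sqrt(n * m / (n + m)) * (mean_B - mean_A)])

# Steps 3-5: the SKL-style update applied to B_hat.
proj = U.T @ B_hat
residual = B_hat - U @ proj
B_tilde, _ = np.linalg.qr(residual)
k = len(S)
R = np.block([[np.diag(S), proj],
              [np.zeros((B_hat.shape[1], k)), B_tilde.T @ residual]])
U_t, S_t, _ = np.linalg.svd(R)
U_new = np.hstack([U, B_tilde]) @ U_t            # updated eigenbasis

# Compare with a batch SVD of the centered concatenation.
C = np.hstack([A, B])
S_direct = np.linalg.svd(C - mean_C, compute_uv=False)
print(np.allclose(S_t[:len(S_direct)], S_direct))
```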