IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 0, NO. 0, JUNE 2008 1
Context-aware Visual Tracking
Ming Yang, Member, IEEE, Ying Wu, Senior Member, IEEE, and Gang Hua, Member, IEEE
Abstract—Enormous uncertainties in unconstrained environments lead to a fundamental dilemma that many tracking algorithms have
to face in practice: tracking has to be computationally efficient but verifying whether or not the tracker is following the true target tends
to be demanding, especially when the background is cluttered and/or when occlusion occurs. Due to the lack of a good solution to
this problem, many existing methods tend to be either effective but computationally intensive by using sophisticated image observation
models, or efficient but vulnerable to false alarms. This greatly challenges long-duration robust tracking. This paper presents a novel
solution to this dilemma by considering the context of the tracking scene. Specifically, we integrate into the tracking process a set of
auxiliary objects that are automatically discovered in the video on the fly by data mining. Auxiliary objects have three properties, at
least in a short time interval: (1) persistent co-occurrence with the target; (2) consistent motion correlation to the target; and (3) easy
to track. Regarding these auxiliary objects as the context of the target, the collaborative tracking of these auxiliary objects leads to
efficient computation as well as strong verification. Our extensive experiments have exhibited exciting performance in very challenging
real-world testing cases.
Index Terms—Computer vision, visual object tracking, context-aware, collaborative tracking, data mining, robust fusion, belief
inconsistency.
✦
1 INTRODUCTION
Robust long-duration visual tracking is demanded by many
contemporary applications such as video-based surveil-
lance and vision-based interfaces. One fundamental obstacle
in the way is the lack of efficient means for verification, i.e.,
to determine whether the object being followed by the tracker
is really the target. At the extreme, this is in fact a recognition
task. Without effective verification, the tracker is likely to drift
away gradually, or fail when the target is occluded even for
a short period of time. Therefore, although extensive research
efforts have been taken, it is still quite difficult in practice to
achieve robust and efficient long-duration tracking in uncon-
strained real-world environments. Most existing methods are
in a dilemma: either be fast-but-fallible, or be robust-but-slow.
This dilemma originates from the opposite requirements
for the image likelihood models: on one hand, the likelihood
model should be simple for efficient motion estimation and
tracking; on the other hand, it has to be sophisticated for
comprehensive verification of the target. We call them de-
scriptive likelihood and discriminative likelihood, respectively.
In general, the descriptive likelihood is based on descriptive
image features that are easily accessible and specified, e.g.,
contours [1], [2], colors [3], or even image regions [4],
[5]. Matching these image features leads to efficient
computation of the descriptive likelihood and thus fast motion
estimation (e.g., differential methods such as kernel-based
tracking [3], [5], [6]).
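For concreteness, a descriptive likelihood of this kind can be sketched as a Bhattacharyya similarity between normalized color histograms, in the spirit of kernel-based trackers [3], [5], [6]. This is a minimal illustration, not the exact formulation used in those works; the 8-bin grayscale quantization, the function names, and the bandwidth parameter are all illustrative assumptions.

```python
import math

def color_histogram(pixels, bins=8):
    """Quantize intensities (0-255) into a normalized histogram.
    The bin count is an illustrative choice."""
    hist = [0.0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def bhattacharyya(h1, h2):
    """Similarity in [0, 1]; 1 means identical distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(h1, h2))

def descriptive_likelihood(candidate_pixels, target_model, sigma=0.1):
    """Likelihood of a candidate region under the target's color model,
    decaying with histogram distance as in mean-shift style trackers."""
    rho = bhattacharyya(color_histogram(candidate_pixels), target_model)
    return math.exp(-(1.0 - rho) / (sigma ** 2))
```

Because the model reduces a region to a single histogram, evaluating it over many candidate locations is cheap, which is exactly what makes fast differential motion estimation possible.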
• Ming Yang and Ying Wu are with the Electrical Engineering and Computer
Science Department, Northwestern University, 2145 Sheridan Road,
Evanston, IL 60208-3118. Email: m-yang4@u.northwestern.edu,
yingwu@ece.northwestern.edu. Gang Hua is with Microsoft Research,
Redmond, WA 98053. Email: ganghua@microsoft.com.
Manuscript received June 15, 2007; revised January 3, 2008.

However, in practice, many real-world complications, such
as clutters, illumination and view changes, low image quality,
motion blur, and partial occlusions, all may invalidate simple
descriptive likelihood models. As a result, good matches of
these descriptive features do not necessarily correspond to
the true target; background objects may match equally well
and produce false positives. Over the years, there have
been two approaches to address this issue: on-line adaptation
of the descriptive likelihood models [5], [7]–[9], or using
discriminative likelihood models that distinguish the true target
from false positives. Without strong verification that provides
confident supervision, on-line adaptation is risky and lacks a
mechanism to prevent drifting. On the other hand, discrimi-
native likelihood is generally associated with classifiers, e.g.,
the SVM tracker [10]. These classifiers can be trained off-
line or on-line [11], [12]. Since learning a classifier relies on
a large number of training samples and features, it tends to be
computationally demanding.
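To make the contrast with the descriptive case concrete, a discriminative likelihood can be sketched as the confidence of a classifier trained to separate target features from background features, in the spirit of the SVM and on-line classifier trackers [10]-[12]. A simple perceptron stands in for those classifiers here; the feature vectors, training loop, and sigmoid squashing are illustrative assumptions, not the cited methods.

```python
import math

def train_perceptron(pos_feats, neg_feats, epochs=20, lr=0.1):
    """Fit a linear boundary between target (+1) and background (-1)
    feature vectors; this training pass is the expensive step."""
    dim = len(pos_feats[0])
    w, b = [0.0] * dim, 0.0
    data = [(f, 1) for f in pos_feats] + [(f, -1) for f in neg_feats]
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: update the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def discriminative_likelihood(feat, w, b):
    """Margin-based confidence that a candidate is the true target,
    squashed to (0, 1)."""
    score = sum(wi * xi for wi, xi in zip(w, feat)) + b
    return 1.0 / (1.0 + math.exp(-score))
```

Even in this toy form, the cost structure is visible: the likelihood itself is a cheap dot product, but obtaining a reliable boundary requires gathering and iterating over many labeled features, which is what makes the discriminative approach robust but slow.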
Is there a way to get out of the dilemma so as to have more
efficient but still effective verification? In all these existing
methods, the dynamic environment is treated as the tracker's
adversary: it generates false positives, and most of the
computation is spent on separating the true target from the
environment. However, the environment can also
be advantageous to the tracker if it contains objects that are
correlated to the target. For example, if we need to track a face
in a crowd, it is almost impossible to learn a discriminative
model to distinguish the face of interest from the rest of the
crowd. Why do we have to focus our attention only on the
target? If the person (with that face) is wearing a distinctive
shirt (or a hat), then including the shirt (or the hat) in matching
will surely make the tracking much easier and more robust.
By the same token, if another face is always accompanying
the target face, treating them as a geometric structure and
tracking them as a group will be much easier than tracking
either of them. It is clear that this makes the verification much