related to the validation and association of the measurements
arise [5, p. 150]. Gating techniques are used to validate only
measurements whose predicted probability of appearance is
high. After validation, a strategy is needed to associate the
measurements with the current targets. In addition to the
Nearest Neighbor Filter, which selects the closest measure-
ment, techniques such as Probabilistic Data Association Filter
(PDAF) are available for the single target case. The under-
lying assumption of the PDAF is that for any given target only
one measurement is valid, and the other measurements are
modeled as random interference, that is, i.i.d. uniformly
distributed random variables. The Joint Probabilistic Data
Association Filter (JPDAF) [5, p. 222], on the other hand, calculates the
measurement-to-target association probabilities jointly
across all the targets. A different strategy is represented by
the Multiple Hypothesis Filter (MHF) [63], [20], [5, p. 106],
which evaluates the probability that a given target gave rise to
a certain measurement sequence. The MHF formulation can
be adapted to track the modes of the state density [13]. The
data association problem for multiple target particle filtering
is presented in [62], [38].
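
As an illustration of validation followed by nearest-neighbor association, the sketch below gates measurements with a squared Mahalanobis distance and then picks the closest survivor. The Mahalanobis gate, the function name `gate_and_associate`, and the default threshold (9.21, roughly the 99 percent chi-square quantile for two degrees of freedom) are assumptions made for this example; the text above does not prescribe a particular gate.

```python
import numpy as np

def gate_and_associate(z_pred, S, measurements, gate=9.21):
    """Return the index of the nearest validated measurement, or None.

    z_pred       : predicted measurement, shape (d,)
    S            : innovation covariance, shape (d, d)
    measurements : iterable of candidate measurements, each of shape (d,)
    gate         : threshold on the squared Mahalanobis distance (assumed value)
    """
    z_pred = np.asarray(z_pred, dtype=float)
    S_inv = np.linalg.inv(np.asarray(S, dtype=float))
    best_idx, best_d2 = None, gate
    for i, z in enumerate(measurements):
        v = np.asarray(z, dtype=float) - z_pred   # innovation
        d2 = float(v @ S_inv @ v)                 # squared Mahalanobis distance
        if d2 <= best_d2:                         # inside the gate and closest so far
            best_idx, best_d2 = i, d2
    return best_idx
```

A measurement outside the gate is never selected, so the function returns `None` when every candidate is rejected. PDAF-style methods would instead weight all validated measurements rather than commit to one.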
The filtering and association techniques discussed above
were applied in computer vision for various tracking
scenarios. Boykov and Huttenlocher [9] employed the Kal-
man filter to track vehicles in an adaptive framework. Rosales
and Sclaroff [65] used the Extended Kalman Filter to estimate
a 3D object trajectory from 2D image motion. Particle filtering
was first introduced, in vision, as the Condensation algorithm
by Isard and Blake [40]. Probabilistic exclusion for tracking
multiple objects was discussed in [51]. Wu and Huang
developed an algorithm to integrate multiple target clues [76].
Li and Chellappa [48] proposed simultaneous tracking and
verification based on particle filters applied to vehicles and
faces. Chen et al. [15] used the Hidden Markov Model
formulation for tracking combined with JPDAF data associa-
tion. Rui and Chen proposed to track the face contour based
on the unscented particle filter [66]. Cham and Rehg [13]
applied a variant of MHF for figure tracking.
The emphasis in this paper is on the other component of
tracking: target representation and localization. While the
filtering and data association have their roots in control
theory, algorithms for target representation and localization
are specific to images and related to registration methods [72],
[64], [56]. Both target localization and registration maximize
a likelihood-type function. The difference is that in tracking,
as opposed to registration, only small changes are assumed in
the location and appearance of the target in two consecutive
frames. This property can be exploited to develop efficient,
gradient-based localization schemes using the normalized
correlation criterion [6]. Since the correlation is sensitive to
illumination, Hager and Belhumeur [33] explicitly modeled
the geometry and illumination changes. The method
was improved by Sclaroff and Isidoro [67] using robust
M-estimators. Learning of appearance models by employing
a mixture of stable image structure, motion information, and
an outlier process, was discussed in [41]. In a different
approach, Ferrari et al. [26] presented an affine tracker based
on planar regions and anchor points. Tracking people, which
raises many challenges due to the presence of large 3D,
nonrigid motion, was extensively analyzed in [36], [1], [30],
[73]. Explicit approaches to tracking people [69] are time-
consuming, and the simpler blob model [75] or adaptive
mixture models [53] are often employed instead.
The main contribution of the paper is to introduce a new
framework for efficient tracking of nonrigid objects. We
show that by spatially masking the target with an isotropic
kernel, a spatially-smooth similarity function can be defined
and the target localization problem is then reduced to a
search in the basin of attraction of this function. The
smoothness of the similarity function allows application of
a gradient optimization method which yields much faster
target localization compared with the (optimized) exhaus-
tive search. The similarity between the target model and the
target candidates in the next frame is measured using the
metric derived from the Bhattacharyya coefficient. In our
case, the Bhattacharyya coefficient has the meaning of a
correlation score. The new target representation and
localization method can be integrated with various motion
filters and data association techniques. We present tracking
experiments in which our method successfully coped with
complex camera motion, partial occlusion of the target,
presence of significant clutter, and large variations in target
scale and appearance. We also discuss the integration of
background information and Kalman filter based tracking.
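
For readers unfamiliar with the Bhattacharyya coefficient, the following minimal sketch computes it for two discrete distributions, together with a distance of the form sqrt(1 - rho). The function names are hypothetical, and the distance form is given here only as the one commonly associated with this coefficient; Section 3 contains the paper's own definition of the metric.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Sum over bins of sqrt(p_u * q_u) for two normalized histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

def bhattacharyya_distance(p, q):
    """Distance sqrt(1 - rho); clipped at zero to absorb rounding error."""
    rho = bhattacharyya_coefficient(p, q)
    return float(np.sqrt(max(0.0, 1.0 - rho)))
```

Identical histograms give a coefficient of 1 and a distance of 0, while histograms with no overlapping bins give a coefficient of 0 and a distance of 1, which is consistent with reading the coefficient as a correlation score.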
The paper is organized as follows: Section 2 discusses
issues of target representation and the importance of a
spatially-smooth similarity function. Section 3 introduces
the metric derived from the Bhattacharyya coefficient. The
optimization algorithm is described in Section 4. Experi-
mental results are shown in Section 5. Section 6 presents
extensions of the basic algorithm and the new approach is
put in the context of computer vision literature in Section 7.
2 TARGET REPRESENTATION
To characterize the target, first a feature space is chosen.
The reference target model is represented by its pdf q in the
feature space. For example, the reference model can be
chosen to be the color pdf of the target. Without loss of
generality, the target model can be considered as centered
at the spatial location 0. In the subsequent frame, a target
candidate is defined at location y, and is characterized by the
pdf p(y). Both pdfs are to be estimated from the data. To
satisfy the low computational cost imposed by real-time
processing, discrete densities, i.e., m-bin histograms, should
be used. Thus, we have
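$$
\begin{aligned}
\text{target model:} \quad & \hat{q} = \{\hat{q}_u\}_{u=1 \ldots m}, & \sum_{u=1}^{m} \hat{q}_u &= 1 \\
\text{target candidate:} \quad & \hat{p}(y) = \{\hat{p}_u(y)\}_{u=1 \ldots m}, & \sum_{u=1}^{m} \hat{p}_u &= 1.
\end{aligned}
$$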
The histogram is not the best nonparametric density
estimate [68], but it suffices for our purposes. Other discrete
density estimates can also be employed.
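
A normalized m-bin color histogram of the kind described above might be computed as in the sketch below. The function name `color_histogram`, the uniform RGB quantization, and the default of 16 bins per channel (so m = 4096) are illustrative assumptions, and every pixel is counted equally, with no spatial weighting.

```python
import numpy as np

def color_histogram(pixels, bins_per_channel=16):
    """Normalized m-bin RGB histogram, with m = bins_per_channel ** 3.

    pixels : non-empty array of shape (n, 3) holding 8-bit RGB values.
    """
    pixels = np.asarray(pixels).astype(np.int64)
    # Quantize each channel from 0..255 down to 0..bins_per_channel-1.
    b = (pixels * bins_per_channel) // 256
    # Map each (r, g, b) bin triple to a single bin index u.
    u = (b[:, 0] * bins_per_channel + b[:, 1]) * bins_per_channel + b[:, 2]
    m = bins_per_channel ** 3
    hist = np.bincount(u, minlength=m).astype(float)
    return hist / hist.sum()   # bins sum to 1, as required of a discrete density
```

Applying the same function to the reference region and to a candidate region yields two discrete densities over the same m bins, which can then be compared bin by bin.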
We will denote by
$$
\hat{\rho}(y) \equiv \rho[\hat{p}(y), \hat{q}] \tag{1}
$$
a similarity function between $\hat{p}$ and $\hat{q}$. The function $\hat{\rho}(y)$ plays
the role of a likelihood and its local maxima in the image
indicate the presence of objects in the second frame having
representations similar to $\hat{q}$ defined in the first frame. If only
spectral information is used to characterize the target, the
similarity function can have large variations for adjacent
locations on the image lattice and the spatial information is
COMANICIU ET AL.: KERNEL-BASED OBJECT TRACKING 565