The geometric configuration of the filters is captured by
a set of deformation costs (“springs”) connecting each
part filter to the root filter, leading to a star-structured
pictorial structure model. Note that we do not model
interactions between overlapping parts. While we might
benefit from modeling such interactions, this does not
appear to be a problem when using models trained with
a discriminative procedure, and it significantly simplifies
the problem of matching a model to an image.
The introduction of new local and semi-local features
has played an important role in advancing the perfor-
mance of object recognition methods. These features are
typically invariant to illumination changes and small
deformations. Many recent approaches use wavelet-like
features [30], [41] or locally-normalized histograms of
gradients [10], [29]. Other methods, such as [5], learn
dictionaries of local structures from training images. In
our work, we use histogram of gradient (HOG) features
from [10] as a starting point, and introduce a variation
that reduces the feature size with no loss in performance.
As in [26], we use principal component analysis (PCA)
to discover low dimensional features, but we note that
the eigenvectors we obtain have a clear structure that
leads to a new set of “analytic” features. This removes
the need to perform a costly projection step when com-
puting dense feature maps.
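The PCA step described above can be illustrated with a minimal sketch. This is our own toy example, not the system's code; the feature dimensions and sample counts below are hypothetical values chosen for illustration:

```python
import numpy as np

def pca_project(features, k):
    """Project feature vectors onto their top-k principal components.

    features: (n, d) array with one feature vector per row.
    Returns the (n, k) projected features and the (k, d) basis.
    """
    centered = features - features.mean(axis=0)
    # Principal directions via SVD of the centered data matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                   # top-k principal directions
    return centered @ basis.T, basis

# Hypothetical sizes: 500 cells of dimension 31, reduced to 11 dimensions.
rng = np.random.default_rng(0)
low, basis = pca_project(rng.normal(size=(500, 31)), 11)
```

The "analytic" features mentioned in the text avoid the projection `centered @ basis.T` entirely by exploiting structure in the learned eigenvectors, which is what makes dense feature-map computation cheap.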
Significant variations in shape and appearance, such as
those caused by extreme viewpoint changes, are not well
captured by a 2D deformable model. Aspect graphs [31] are
a classical formalism for capturing significant changes
that are due to viewpoint variation. Mixture models
provide a simpler alternative approach. For example, it
is common to use multiple templates to encode frontal
and side views of faces and cars [36]. Mixture models
have been used to capture other aspects of appearance
variation as well, such as when there are multiple natural
subclasses in an object category [5].
Matching a deformable model to an image is a diffi-
cult optimization problem. Local search methods require
initialization near the correct solution [2], [7], [43]. To
guarantee a globally optimal match, more aggressive
search is needed. One popular approach for part-based
models is to restrict part locations to a small set of
possible locations returned by an interest point detector
[1], [18], [42]. Tree (and star) structured pictorial structure
models [9], [15], [19] allow for the use of dynamic
programming and generalized distance transforms to
efficiently search over all possible object configurations
in an image, without restricting the possible locations
for each part. We use these techniques for matching our
models to images.
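The generalized distance transform that makes this exhaustive search efficient can be sketched in one dimension. The following is a minimal version of the O(n) lower-envelope algorithm for quadratic deformation costs; the variable names and the test data are ours, not the paper's:

```python
def distance_transform_1d(costs):
    """Generalized distance transform under squared distance:
    out[p] = min_q costs[q] + (p - q)**2, computed in O(n).
    """
    n = len(costs)
    v = [0] * n          # grid locations of parabolas in the lower envelope
    z = [0.0] * (n + 1)  # boundaries between consecutive parabolas
    z[0], z[1] = -float('inf'), float('inf')
    k = 0
    for q in range(1, n):
        while True:
            # Intersection of the parabola rooted at q with the one at v[k].
            s = ((costs[q] + q * q) - (costs[v[k]] + v[k] * v[k])) \
                / (2 * q - 2 * v[k])
            if s <= z[k]:
                k -= 1   # parabola at v[k] is not part of the envelope
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = float('inf')
    out = [0.0] * n
    k = 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        out[p] = costs[v[k]] + (p - v[k]) ** 2
    return out

print(distance_transform_1d([0, 100, 100, 100]))  # -> [0, 1, 4, 9]
```

In a star model, running this transform along each image dimension lets dynamic programming score every root placement against the best part placements without enumerating part locations explicitly.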
Part-based deformable models are parameterized by
the appearance of each part and a geometric model
capturing spatial relationships among parts. For gen-
erative models one can learn model parameters using
maximum likelihood estimation. In a fully-supervised
setting, training images are labeled with part locations
and models can often be learned using simple methods
[9], [15]. In a weakly-supervised setting, training images
may not specify locations of parts. In this case one can
simultaneously estimate part locations and learn model
parameters with EM [2], [18], [42].
Discriminative training methods select model param-
eters so as to minimize the mistakes of a detection algo-
rithm on a set of training images. Such approaches di-
rectly optimize the decision boundary between positive
and negative examples. We believe this is one reason for
the success of simple models trained with discriminative
methods, such as the Viola-Jones [41] and Dalal-Triggs
[10] detectors. It has been more difficult to train part-
based models discriminatively, though strategies exist
[4], [23], [32], [34].
Latent SVMs are related to hidden CRFs [32]. How-
ever, in a latent SVM we maximize over latent part loca-
tions as opposed to marginalizing over them, and we use
a hinge-loss rather than log-loss in training. This leads
to an efficient coordinate-descent style algorithm for
training, as well as a data-mining algorithm that allows
for learning with very large datasets. A latent SVM can
be viewed as a type of energy-based model [27].
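The coordinate-descent style of training alternates between fixing the latent values and solving a convex problem. The sketch below is our simplified illustration of that alternation (plain subgradient descent on a regularized hinge loss, with fixed negative examples for brevity), not the paper's actual training procedure:

```python
import numpy as np

def latent_svm_train(positives, negatives, dim,
                     rounds=10, epochs=50, lr=0.01, C=1.0):
    """Toy latent SVM training by coordinate descent (a sketch, not the
    paper's implementation).

    Each positive example is an array of candidate feature vectors, one
    row per latent value; an example's score is max_z w . x_z.
    """
    w = np.zeros(dim)
    for _ in range(rounds):
        # Step 1: fix w, choose the highest-scoring latent value
        # for each positive example (maximize, do not marginalize).
        chosen = [cands[int(np.argmax(cands @ w))] for cands in positives]
        # Step 2: fix the latent values and optimize the now-convex
        # regularized hinge loss by subgradient descent.
        data = [(x, 1.0) for x in chosen] + [(x, -1.0) for x in negatives]
        for _ in range(epochs):
            grad = w.copy()              # gradient of 0.5 * ||w||^2
            for x, y in data:
                if y * (w @ x) < 1:      # margin violated: hinge active
                    grad -= C * y * x
            w -= lr * grad
    return w

# Toy problem: one positive with a good and a bad latent placement.
pos = [np.array([[2.0, 0.0], [-2.0, 0.0]])]
neg = [np.array([-2.0, 0.0])]
w = latent_svm_train(pos, neg, dim=2)
```

Maximizing over latent values in step 1 (rather than marginalizing, as a hidden CRF would) is what keeps step 2 a standard convex SVM problem.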
A latent SVM is equivalent to the MI-SVM formulation
of multiple instance learning (MIL) in [3], but we find
the latent variable formulation more natural for the prob-
lems we are interested in.
A different MIL framework
was previously used for training object detectors with
weakly labeled data in [40].
Our method for data-mining hard examples during
training is related to working set methods for SVMs (e.g.
[25]). The approach described here requires relatively
few passes through the complete set of training examples
and is particularly well suited for training with very
large data sets, where only a fraction of the examples
can fit in RAM.
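The data-mining idea can be sketched as maintaining a bounded cache of hard examples. This is a simplified illustration of the general working-set idea, with hypothetical names and parameters, not the paper's exact procedure:

```python
def mine_hard_examples(score, negatives, cache,
                       threshold=-1.0, cache_size=1000):
    """One pass of hard-example mining (a sketch of the idea).

    'score' maps an example to its current detector score; a negative is
    'hard' when its score exceeds the hinge threshold.  The cache is
    bounded so the working set always fits in memory.  Duplicates are
    not handled in this sketch.
    """
    # Shrink: drop examples the current model already handles easily.
    cache = [x for x in cache if score(x) > threshold]
    # Grow: scan the full negative set for new hard examples.
    for x in negatives:
        if len(cache) >= cache_size:
            break
        if score(x) > threshold:
            cache.append(x)
    return cache

# Toy scores via the identity function; only scores above -1 are kept.
hard = mine_hard_examples(lambda x: x, list(range(-5, 5)), [])
print(hard)  # -> [0, 1, 2, 3, 4]
```

Alternating between training on the cache and rescanning the full negative set is what lets training touch datasets far larger than memory in only a few full passes.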
The use of context for object detection and recognition
has received increasing attention in recent years.
Some methods (e.g. [39]) use low-level holistic image
features for defining likely object hypotheses. The method
in [22] uses a coarse but semantically rich representation
of a scene, including its 3D geometry, estimated using a
variety of techniques. Here we define the context of an
image using the results of running a variety of object
detectors in the image. The idea is related to [33] where
a CRF was used to capture co-occurrences of objects,
although we use a very different approach to capture
this information.
A preliminary version of our system was described in
[17]. The system described here differs from the one in
[17] in several ways, including: the introduction of
mixture models; the optimization of the true latent SVM
objective function using stochastic gradient descent,
whereas in [17] we used an SVM package to optimize a
heuristic approximation of the objective; and the use of
new features that are both lower-dimensional and more
informative;
1. We defined a latent SVM in [17] before realizing the relationship
to MI-SVM.