Selective Search for Object Recognition
J.R.R. Uijlings (1, 2), K.E.A. van de Sande (2), T. Gevers (2), and A.W.M. Smeulders (2)
(1) University of Trento, Italy
(2) University of Amsterdam, the Netherlands
Technical Report 2012, submitted to IJCV
Abstract
This paper addresses the problem of generating possible object
locations for use in object recognition. We introduce Selective
Search, which combines the strengths of both exhaustive search and
segmentation. Like segmentation, we use the image structure to guide
our sampling process. Like exhaustive search, we aim to capture
all possible object locations. Instead of a single technique to gen-
erate possible object locations, we diversify our search and use a
variety of complementary image partitionings to deal with as many
image conditions as possible. Our Selective Search results in a
small set of data-driven, class-independent, high quality locations,
yielding 99% recall and a Mean Average Best Overlap of 0.879 at
10,097 locations. The reduced number of locations compared to
an exhaustive search enables the use of stronger machine learning
techniques and stronger appearance models for object recognition.
In this paper we show that our selective search enables the use of
the powerful Bag-of-Words model for recognition. The Selective
Search software is made publicly available (see footnote 1).
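To make the two quality figures quoted in the abstract concrete, the following is a minimal sketch of how recall and Mean Average Best Overlap could be computed for a set of candidate locations. It assumes that overlap is the usual intersection over union of boxes, that an object counts as recalled when its best overlap reaches 0.5, and that boxes are given as continuous (x1, y1, x2, y2) coordinates for a single image. The names `iou` and `evaluate` are hypothetical and do not come from the released software.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(ground_truth, proposals, recall_threshold=0.5):
    """Return (MABO, recall) for one image.

    ground_truth: list of (class_name, box) pairs.
    proposals: list of candidate boxes.
    recall_threshold: assumed overlap needed to count an object as found.
    """
    if not ground_truth:
        raise ValueError("at least one ground-truth object is required")

    # Best overlap achieved for every annotated object, grouped by class.
    best_per_class = defaultdict(list)
    for cls, gt_box in ground_truth:
        best = max((iou(gt_box, p) for p in proposals), default=0.0)
        best_per_class[cls].append(best)

    # Average Best Overlap per class, then the mean over classes.
    abo = {c: sum(v) / len(v) for c, v in best_per_class.items()}
    mabo = sum(abo.values()) / len(abo)

    # Recall: fraction of objects whose best overlap reaches the threshold.
    all_best = [b for v in best_per_class.values() for b in v]
    recall = sum(b >= recall_threshold for b in all_best) / len(all_best)
    return mabo, recall
```

Recall only checks whether each object is covered at all, while the Mean Average Best Overlap also rewards how tightly the best location fits, which is why the two figures are reported together.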
1 Introduction
For a long time, the prevailing approach was to delineate objects
before identifying them. This gave rise to segmentation, which aims
for a unique partitioning of the image through a generic algorithm,
with one part for each object silhouette in the image. Research on
this topic has yielded tremendous progress over the past years
[3, 6, 13, 26]. But images are intrinsically hierarchical: in
Figure 1a the salad and spoons are inside the salad bowl, which in
turn stands on the table. Furthermore, depending on the context the
term table in this picture can refer to only the wood or include ev-
erything on the table. Therefore both the nature of images and the
different uses of an object category are hierarchical. This prohibits
the unique partitioning of objects for all but the most specific pur-
poses. Hence for most tasks multiple scales in a segmentation are a
necessity. This is most naturally addressed by using a hierarchical
partitioning, as done for example by Arbelaez et al. [3].
Besides the need for a segmentation to be hierarchical, a generic
solution for segmentation using a single strategy may not exist at all.
There are many, sometimes conflicting, reasons why regions should be
grouped together: in Figure 1b the cats can be separated using colour, but
their texture is the same. Conversely, in Figure 1c the chameleon
is similar to its surrounding leaves in terms of colour, yet its
texture differs. Finally, in Figure 1d, the wheels are wildly different
from the car in terms of both colour and texture, yet are enclosed
by the car. Individual visual features therefore cannot resolve the
ambiguity of segmentation.

Figure 1: There is a high variety of reasons that an image region
forms an object. In (b) the cats can be distinguished by colour, not
texture. In (c) the chameleon can be distinguished from the
surrounding leaves by texture, not colour. In (d) the wheels can be part
of the car because they are enclosed, not because they are similar
in texture or colour. Therefore, to find objects in a structured way
it is necessary to use a variety of diverse strategies. Furthermore,
an image is intrinsically hierarchical as there is no single scale for
which the complete table, salad bowl, and salad spoon can be found
in (a).

1 http://disi.unitn.it/~uijlings/SelectiveSearch.html

And, finally, there is a more fundamental problem. Regions with
very different characteristics, such as a face over a sweater, can
only be combined into one object after it has been established that
the object at hand is a human. Hence without prior recognition it is
hard to decide that a face and a sweater are part of one object [29].
This has led to the opposite of the traditional approach: to do
localisation through the identification of an object. This recent ap-
proach in object recognition has made enormous progress in less
than a decade [8, 12, 16, 35]. With an appearance model learned
from examples, an exhaustive search is performed where every location
within the image is examined so as not to miss any potential
object location [8, 12, 16, 35].
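To give a feel for the size of a truly exhaustive search, the sketch below counts every axis-aligned box whose corners lie on a regular pixel grid. This is only an illustration under stated assumptions: the function name `count_windows` and its `stride` parameter are hypothetical, and practical sliding-window detectors restrict the search further, for example to coarser grids and a few aspect ratios.

```python
from math import comb


def count_windows(width, height, stride=1):
    """Count axis-aligned boxes with corners on a grid of the given stride.

    A box is defined by two distinct grid x-coordinates and two distinct
    grid y-coordinates, so the count is C(nx, 2) * C(ny, 2).
    """
    nx = len(range(0, width + 1, stride))   # candidate x-coordinates
    ny = len(range(0, height + 1, stride))  # candidate y-coordinates
    return comb(nx, 2) * comb(ny, 2)


if __name__ == "__main__":
    # Hypothetical image size; compare against the 10,097 locations
    # quoted in the abstract for Selective Search.
    for stride in (1, 4, 16):
        print(stride, count_windows(500, 375, stride))
```

The count grows with the fourth power of the grid resolution, which is the reason a data-driven set of about ten thousand locations leaves room for more expensive appearance models.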