Distinctive Image Features
from Scale-Invariant Keypoints
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., Canada
lowe@cs.ubc.ca
January 5, 2004
Abstract
This paper presents a method for extracting distinctive invariant features from
images that can be used to perform reliable matching between different views of
an object or scene. The features are invariant to image scale and rotation, and
are shown to provide robust matching across a substantial range of affine dis-
tortion, change in 3D viewpoint, addition of noise, and change in illumination.
The features are highly distinctive, in the sense that a single feature can be cor-
rectly matched with high probability against a large database of features from
many images. This paper also describes an approach to using these features
for object recognition. The recognition proceeds by matching individual fea-
tures to a database of features from known objects using a fast nearest-neighbor
algorithm, followed by a Hough transform to identify clusters belonging to a sin-
gle object, and finally performing verification through least-squares solution for
consistent pose parameters. This approach to recognition can robustly identify
objects among clutter and occlusion while achieving near real-time performance.
Accepted for publication in the
International Journal of Computer Vision,
2004.
1 Introduction
Image matching is a fundamental aspect of many problems in computer vision, including
object or scene recognition, solving for 3D structure from multiple images, stereo correspon-
dence, and motion tracking. This paper describes image features that have many properties
that make them suitable for matching differing images of an object or scene. The features are
invariant to image scaling and rotation, and partially invariant to change in illumination and
3D camera viewpoint. They are well localized in both the spatial and frequency domains, re-
ducing the probability of disruption by occlusion, clutter, or noise. Large numbers of features
can be extracted from typical images with efficient algorithms. In addition, the features are
highly distinctive, which allows a single feature to be correctly matched with high probability
against a large database of features, providing a basis for object and scene recognition.
The cost of extracting these features is minimized by taking a cascade filtering approach,
in which the more expensive operations are applied only at locations that pass an initial test.
Following are the major stages of computation used to generate the set of image features:
1. Scale-space extrema detection: The first stage of computation searches over all scales
and image locations. It is implemented efficiently by using a difference-of-Gaussian
function to identify potential interest points that are invariant to scale and orientation.
2. Keypoint localization: At each candidate location, a detailed model is fit to determine
location and scale. Keypoints are selected based on measures of their stability.
3. Orientation assignment: One or more orientations are assigned to each keypoint lo-
cation based on local image gradient directions. All future operations are performed
on image data that has been transformed relative to the assigned orientation, scale, and
location for each feature, thereby providing invariance to these transformations.
4. Keypoint descriptor: The local image gradients are measured at the selected scale
in the region around each keypoint. These are transformed into a representation that
allows for significant levels of local shape distortion and change in illumination.
This approach has been named the Scale Invariant Feature Transform (SIFT), as it transforms
image data into scale-invariant coordinates relative to local features.
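For concreteness, the following minimal Python sketch (not part of the original paper) runs all four stages through OpenCV's SIFT implementation; the image file name is hypothetical and OpenCV 4.4 or later is assumed.

import cv2

# Load a grayscale image; "scene.jpg" is a hypothetical file name.
image = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT_create is available in OpenCV >= 4.4 without the contrib package.
sift = cv2.SIFT_create()

# detectAndCompute runs all four stages: scale-space extrema detection,
# keypoint localization, orientation assignment, and descriptor computation.
keypoints, descriptors = sift.detectAndCompute(image, None)

print(len(keypoints))      # typically on the order of 2000 for a 500x500 image
print(descriptors.shape)   # (number of keypoints, 128)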
An important aspect of this approach is that it generates large numbers of features that
densely cover the image over the full range of scales and locations. A typical image of size
500×500 pixels will give rise to about 2000 stable features (although this number depends on
both image content and choices for various parameters). The quantity of features is partic-
ularly important for object recognition, where the ability to detect small objects in cluttered
backgrounds requires that at least 3 features be correctly matched from each object for reli-
able identification.
For image matching and recognition, SIFT features are first extracted from a set of ref-
erence images and stored in a database. A new image is matched by individually comparing
each feature from the new image to this previous database and finding candidate match-
ing features based on Euclidean distance of their feature vectors. This paper will discuss
fast nearest-neighbor algorithms that can perform this computation rapidly against large
databases.
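As a sketch of this matching step (assuming descriptors are stored as NumPy arrays of 128-dimensional rows; a real system would use the approximate nearest-neighbor search discussed later rather than this brute-force version):

import numpy as np

def nearest_neighbors(new_desc, db_desc):
    """Return, for each row of new_desc, the index of the closest row of db_desc
    under Euclidean distance. Shapes: (m, 128) and (n, 128)."""
    # Pairwise squared Euclidean distances, computed with broadcasting.
    d2 = ((new_desc[:, None, :] - db_desc[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)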
The keypoint descriptors are highly distinctive, which allows a single feature to find its
correct match with good probability in a large database of features. However, in a cluttered
image, many features from the background will not have any correct match in the database,
giving rise to many false matches in addition to the correct ones. The correct matches can
be filtered from the full set of matches by identifying subsets of keypoints that agree on the
object and its location, scale, and orientation in the new image. The probability that several
features will agree on these parameters by chance is much lower than the probability that
any individual feature match will be in error. The determination of these consistent clusters
can be performed rapidly by using an efficient hash table implementation of the generalized
Hough transform.
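A schematic sketch of this clustering step is given below (hash-table voting over coarse pose bins; the bin widths and the match tuple format are illustrative assumptions, not the paper's exact parameters):

from collections import defaultdict
from math import floor

def pose_clusters(matches, loc_bin=64.0, ori_bin=30.0):
    """matches: iterable of (object_id, x, y, log2_scale_ratio, orientation_deg)
    predicted for the object pose by each feature match."""
    votes = defaultdict(list)
    for obj, x, y, log_scale, ori in matches:
        # Each match votes for one coarse bin in (object, x, y, scale, orientation).
        key = (obj, floor(x / loc_bin), floor(y / loc_bin),
               floor(log_scale), floor((ori % 360.0) / ori_bin))
        votes[key].append((obj, x, y, log_scale, ori))
    # Clusters of 3 or more consistent matches become recognition hypotheses.
    return [v for v in votes.values() if len(v) >= 3]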
Each cluster of 3 or more features that agree on an object and its pose is then subject
to further detailed verification. First, a least-squares estimate is made for an affine approxi-
mation to the object pose. Any other image features consistent with this pose are identified,
and outliers are discarded. Finally, a detailed computation is made of the probability that a
particular set of features indicates the presence of an object, given the accuracy of fit and
number of probable false matches. Object matches that pass all these tests can be identified
as correct with high confidence.
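The least-squares step can be sketched as follows (variable names are hypothetical; the affine parameters map model keypoint coordinates to their matched image coordinates):

import numpy as np

def fit_affine(model_pts, image_pts):
    """Solve u = m11*x + m12*y + tx, v = m21*x + m22*y + ty in the least-squares
    sense. model_pts and image_pts have shape (n, 2) with n >= 3 matches."""
    n = model_pts.shape[0]
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = model_pts   # rows for u: [x  y  0  0  1  0]
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = model_pts   # rows for v: [0  0  x  y  0  1]
    A[1::2, 5] = 1.0
    b = image_pts.reshape(-1)  # interleaved [u1, v1, u2, v2, ...]
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params              # (m11, m12, m21, m22, tx, ty)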
2 Related research
The development of image matching by using a set of local interest points can be traced back
to the work of Moravec (1981) on stereo matching using a corner detector. The Moravec
detector was improved by Harris and Stephens (1988) to make it more repeatable under small
image variations and near edges. Harris also showed its value for efficient motion tracking
and 3D structure from motion recovery (Harris, 1992), and the Harris corner detector has
since been widely used for many other image matching tasks. While these feature detectors
are usually called corner detectors, they are not selecting just corners, but rather any image
location that has large gradients in all directions at a predetermined scale.
The initial applications were to stereo and short-range motion tracking, but the approach
was later extended to more difficult problems. Zhang et al. (1995) showed that it was possi-
ble to match Harris corners over a large image range by using a correlation window around
each corner to select likely matches. Outliers were then removed by solving for a funda-
mental matrix describing the geometric constraints between the two views of a rigid scene and
removing matches that did not agree with the majority solution. At the same time, a similar
approach was developed by Torr (1995) for long-range motion matching, in which geometric
constraints were used to remove outliers for rigid objects moving within an image.
The ground-breaking work of Schmid and Mohr (1997) showed that invariant local fea-
ture matching could be extended to general image recognition problems in which a feature
was matched against a large database of images. They also used Harris corners to select
interest points, but rather than matching with a correlation window, they used a rotationally
invariant descriptor of the local image region. This allowed features to be matched under
arbitrary orientation change between the two images. Furthermore, they demonstrated that
multiple feature matches could accomplish general recognition under occlusion and clutter
by identifying consistent clusters of matched features.
The Harris corner detector is very sensitive to changes in image scale, so it does not
provide a good basis for matching images of different sizes. Earlier work by the author
(Lowe, 1999) extended the local feature approach to achieve scale invariance. This work
also described a new local descriptor that provided more distinctive features while being less
sensitive to local image distortions such as 3D viewpoint change. This current paper provides
a more in-depth development and analysis of this earlier work, while also presenting a number
of improvements in stability and feature invariance.
There is a considerable body of previous research on identifying representations that are
stable under scale change. Some of the first work in this area was by Crowley and Parker
(1984), who developed a representation that identified peaks and ridges in scale space and
linked these into a tree structure. The tree structure could then be matched between images
with arbitrary scale change. More recent work on graph-based matching by Shokoufandeh,
Marsic and Dickinson (1999) provides more distinctive feature descriptors using wavelet co-
efficients. The problem of identifying an appropriate and consistent scale for feature detection
has been studied in depth by Lindeberg (1993, 1994). He describes this as a problem of scale
selection, and we make use of his results below.
Recently, there has been an impressive body of work on extending local features to be
invariant to full affine transformations (Baumberg, 2000; Tuytelaars and Van Gool, 2000;
Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002; Brown and Lowe, 2002).
This allows for invariant matching to features on a planar surface under changes in ortho-
graphic 3D projection, in most cases by resampling the image in a local affine frame. How-
ever, none of these approaches are yet fully affine invariant, as they start with initial feature
scales and locations selected in a non-affine-invariant manner due to the prohibitive cost of
exploring the full affine space. The affine frames are also more sensitive to noise than
those of the scale-invariant features, so in practice the affine features have lower repeatability
than the scale-invariant features unless the affine distortion is greater than about a 40 degree
tilt of a planar surface (Mikolajczyk, 2002). Wider affine invariance may not be important for
many applications, as training views are best taken at least every 30 degrees rotation in view-
point (meaning that recognition is within 15 degrees of the closest training view) in order to
capture non-planar changes and occlusion effects for 3D objects.
While the method to be presented in this paper is not fully affine invariant, a different
approach is used in which the local descriptor allows relative feature positions to shift signif-
icantly with only small changes in the descriptor. This approach not only allows the descrip-
tors to be reliably matched across a considerable range of affine distortion, but it also makes
the features more robust against changes in 3D viewpoint for non-planar surfaces. Other
advantages include much more efficient feature extraction and the ability to identify larger
numbers of features. On the other hand, affine invariance is a valuable property for matching
planar surfaces under very large view changes, and further research should be performed on
the best ways to combine this with non-planar 3D viewpoint invariance in an efficient and
stable manner.
Many other feature types have been proposed for use in recognition, some of which could
be used in addition to the features described in this paper to provide further matches under
differing circumstances. One class of features are those that make use of image contours or
region boundaries, which should make them less likely to be disrupted by cluttered back-
grounds near object boundaries. Matas et al. (2002) have shown that their maximally-stable
extremal regions can produce large numbers of matching features with good stability. Miko-
lajczyk et al. (2003) have developed a new descriptor that uses local edges while ignoring
unrelated nearby edges, providing the ability to find stable features even near the boundaries
of narrow shapes superimposed on background clutter. Nelson and Selinger (1998) have
shown good results with local features based on groupings of image contours. Similarly,
Pope and Lowe (2000) used features based on the hierarchical grouping of image contours,
which are particularly useful for objects lacking detailed texture.
The history of research on visual recognition contains work on a diverse set of other
image properties that can be used as feature measurements. Carneiro and Jepson (2002)
describe phase-based local features that represent the phase rather than the magnitude of local
spatial frequencies, which is likely to provide improved invariance to illumination. Schiele
and Crowley (2000) have proposed the use of multidimensional histograms summarizing the
distribution of measurements within image regions. This type of feature may be particularly
useful for recognition of textured objects with deformable shapes. Basri and Jacobs (1997)
have demonstrated the value of extracting local region boundaries for recognition. Other
useful properties to incorporate include color, motion, figure-ground discrimination, region
shape descriptors, and stereo depth cues. The local feature approach can easily incorporate
novel feature types because extra features contribute to robustness when they provide correct
matches, but otherwise do little harm other than their cost of computation. Therefore, future
systems are likely to combine many feature types.
3 Detection of scale-space extrema
As described in the introduction, we will detect keypoints using a cascade filtering approach
that uses efficient algorithms to identify candidate locations that are then examined in further
detail. The first stage of keypoint detection is to identify locations and scales that can be
repeatably assigned under differing views of the same object. Detecting locations that are
invariant to scale change of the image can be accomplished by searching for stable features
across all possible scales, using a continuous function of scale known as scale space (Witkin,
1983).
It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of
reasonable assumptions the only possible scale-space kernel is the Gaussian function. There-
fore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from
the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):
L(x, y, σ) = G(x, y, σ) ∗ I(x, y),
where ∗ is the convolution operation in x and y, and
G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²)/2σ²).
To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999)
using scale-space extrema in the difference-of-Gaussian function convolved with the image,
D(x, y, σ), which can be computed from the difference of two nearby scales separated by a
constant multiplicative factor k:
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)
= L(x, y, kσ) − L(x, y, σ). (1)
There are a number of reasons for choosing this function. First, it is a particularly efficient
function to compute, as the smoothed images, L, need to be computed in any case for scale
space feature description, and D can therefore be computed by simple image subtraction.
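For illustration, equation (1) can be sketched in a few lines of Python (SciPy's Gaussian filter stands in for the incremental smoothing of an actual scale-space pyramid, and the value of k is an illustrative choice):

import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(image, sigma, k=2 ** (1 / 3)):
    """D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma), equation (1)."""
    img = image.astype(np.float32)
    L1 = gaussian_filter(img, sigma)       # L(x, y, sigma)
    L2 = gaussian_filter(img, k * sigma)   # L(x, y, k*sigma)
    return L2 - L1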