An implementation of SIFT detector and descriptor
Andrea Vedaldi
University of California – VisionLab
Contents
1 Introduction
2 User reference: the sift function
  2.1 Scale space parameters
  2.2 Detector parameters
  2.3 Descriptor parameters
  2.4 Direct access to SIFT components
A Internals
  A.1 Scale spaces
  A.2 The detector
  A.3 The descriptor
1 Introduction
These notes describe an implementation of the Scale-Invariant Feature Transform (SIFT) interest point detector
and descriptor [1]. This implementation is designed to produce results close to Lowe's original implementation.1
The SIFT detector and descriptor are discussed in some depth in the paper [1]. Here we describe the interface to
our implementation and, in the appendix, some technical details.
2 User reference: the sift function
The SIFT detector and the SIFT descriptor are invoked by means of the function sift, which provides a unified
interface to both.
Example 1 (Invocation). The following lines run the SIFT detector and descriptor on the image data/test.jpg.
I = imread('data/test.jpg') ;
I = double(rgb2gray(I)) / 256 ;
[frames, descriptors] = sift(I, 'Verbosity', 1) ;
The image is first converted to grayscale and scaled to the [0, 1] range (note that the conversion to double must
precede the division, otherwise the integer division discards the image data). The option-value pair 'Verbosity', 1
causes the function to print a detailed progress report.
The sift function returns a 4 × K matrix frames containing the SIFT frames and a 128 × K matrix descriptors
containing their descriptors. Each frame is characterized by four numbers, which are in order: (x1, x2) for the center
of the frame, σ for its scale and θ for its orientation. The coordinates (x1, x2) are relative to the upper-left corner of
the image, which is assigned coordinates (0, 0), and may be fractional numbers (sub-pixel precision). The scale σ is
the smoothing level at which the frame has been detected. This number can also be interpreted as the size of the frame,
which is usually visualized as a disk of radius 6σ. Each descriptor is a vector describing coarsely the appearance of
the image patch corresponding to the frame (further details are discussed in Appendix A.3). Typically this vector
has dimension 128, but this number can be changed by the user as described later.
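As an illustration, the layout of the frames matrix can be visualized with plain MATLAB commands. The following is a sketch only: it assumes that I and frames were computed as in Example 1, and that the first two rows of frames hold the horizontal and vertical coordinates of the centers, respectively.

```matlab
% Sketch: overlay each SIFT frame as a disk of radius 6*sigma
% together with a radius marking its orientation.
% Assumes I and frames were computed as in Example 1.
t = linspace(0, 2*pi, 64) ;                % angles used to draw the circles
imagesc(I) ; colormap gray ; axis image ; hold on ;
for k = 1:size(frames, 2)
  x     = frames(1, k) ;                   % center; (0, 0) is the upper-left corner
  y     = frames(2, k) ;
  r     = 6 * frames(3, k) ;               % visualization radius 6*sigma
  theta = frames(4, k) ;                   % orientation
  % +1 offsets: the frames use the (0, 0) upper-left convention,
  % while MATLAB image axes are 1-based.
  plot(x + 1 + r * cos(t), y + 1 + r * sin(t), 'y-') ;
  plot(x + 1 + [0, r * cos(theta)], y + 1 + [0, r * sin(theta)], 'y-') ;
end
```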
Once frames and descriptors of two images I1 and I2 have been computed, siftmatch can be used to estimate
the pairs of matching features. This function uses Lowe's method to discard ambiguous matches [1]. The result is
a 2 × M matrix, each column of which is a pair (k1, k2) of indices of corresponding SIFT frames.
1 See http://www.cs.ubc.ca/~lowe/keypoints/