International Journal of Computer Vision 59(3), 207–232, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Visual Modeling with a Hand-Held Camera
MARC POLLEFEYS
Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599-3175, USA
marc@cs.unc.edu
LUC VAN GOOL, MAARTEN VERGAUWEN, FRANK VERBIEST, KURT CORNELIS AND JAN TOPS
Center for Processing of Speech and Images, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10,
B-3001 Leuven, Belgium
Luc.VanGool@esat.kuleuven.ac.be
Maarten.Vergauwen@esat.kuleuven.ac.be
Frank.Verbiest@esat.kuleuven.ac.be
Kurt.Cornelis@esat.kuleuven.ac.be
Jan.Tops@esat.kuleuven.ac.be
REINHARD KOCH
Institut für Informatik und Praktische Mathematik, Christian-Albrechts-Universität Kiel,
Hermann-Rodewald-Str. 3, D-24098 Kiel, Germany
rk@mip.informatik.uni-kiel.de
Received March 14, 2001; Revised October 8, 2003; Accepted October 10, 2003
Abstract. In this paper a complete system to build visual models from camera images is presented. The system can
deal with uncalibrated image sequences acquired with a hand-held camera. Based on tracked or matched features the
relations between multiple views are computed. From this both the structure of the scene and the motion of the camera
are retrieved. The ambiguity of the reconstruction is restricted from projective to metric through self-calibration. A
flexible multi-view stereo matching scheme is used to obtain a dense estimation of the surface geometry. From the
computed data different types of visual models are constructed. Besides the traditional geometry- and image-based
approaches, a combined approach with view-dependent geometry and texture is presented. As an application, the fusion
of real and virtual scenes is also shown.
Keywords: visual modeling, structure-from-motion, projective reconstruction, self-calibration, multi-view stereo
matching, dense reconstruction, 3D reconstruction, image-based rendering, augmented video, hand-held camera
1. Introduction
In recent years a lot of effort has been put into developing new approaches for modeling and rendering visual scenes. A few years ago the main applications of 3D modeling in vision were robot navigation and visual inspection. Nowadays, however, the emphasis has shifted: there is a growing demand for 3D models in computer graphics, virtual reality and communication, and this changes the requirements. The visual quality of the models becomes the main point of attention. There is also an important demand for simple and flexible acquisition procedures; therefore, calibration should be absent or restricted to a minimum.
Many new applications also require robust, low-cost acquisition systems. This stimulates the use of consumer photo or video cameras.
In this paper we present an approach that can be
used to obtain several types of visual models from im-
ages acquired with an uncalibrated camera. The user ac-
quires the images by freely moving the camera around
an object or scene. Neither the camera motion nor the
camera settings have to be known a priori. The pre-
sented approach can generate a textured 3D surface
model or alternatively render new views using a com-
bined geometry- and image-based approach that uses
view-dependent texture and geometry. The system can
also be used to combine virtual objects with real video,
yielding augmented video sequences.
Other approaches for extracting 3D shape and tex-
ture from image sequences acquired with a freely
moving camera have been proposed. The approach of
Tomasi and Kanade (1992) used an affine factoriza-
tion method to extract 3D from image sequences. An
important restriction of this system is the assumption
of orthographic projection. Another type of approach
starts from an approximate 3D model and camera poses
and refines the model based on images (e.g. Facade, proposed by Debevec et al. (1996)). The advantage is that fewer images are required. On the other hand a prelimi-
nary model must be available which mostly limits the
approach to man-made environments. This approach
also combines geometry- and image-based techniques; however, only the texture is view-dependent.
The approach presented here combines many ideas
and algorithms that have been developed in recent
years. This paper aims at consolidating these results
by bringing them together and showing how they can
be combined to yield a complete visual modeling
approach. In Section 1.1 notations and background are
given. The rest of the paper is then organized as follows.
Section 2 discusses feature extraction and matching
and the computation of the multi-view relations.
Section 3 deals with the structure and motion recovery,
including self-calibration. In Section 4 the approach
to obtain dense depth maps is presented and in
Section 5 the construction of the different visual mod-
els is discussed. The paper is concluded in Section 6.
1.1. Notations and Background
In this section we briefly introduce some of the geomet-
ric concepts used throughout the paper. A more in-depth
description can be found in Faugeras et al. (2001) and
Hartley and Zisserman (2000). A perspective camera
is modeled through the projection equation
m ∼ PM (1)
where ∼ represents the equality up to a non-zero scale
factor, M is a 4-vector that represents a 3D world point
in homogeneous coordinates, similarly m is a 3-vector
that represents a corresponding 2D image point and P
is a 3 × 4 projection matrix. In a metric or Euclidean
frame P can be factorized as follows
\[ P = K R^\top [\, I \mid -t \,] \quad\text{where}\quad K = \begin{bmatrix} f & s & u \\ & rf & v \\ & & 1 \end{bmatrix} \tag{2} \]
contains the intrinsic camera parameters, R is a rotation
matrix representing the orientation and t is a 3-vector
representing the position of the camera. The intrinsic
camera parameter f represents the focal length mea-
sured in width of pixels, r is the aspect ratio of pixels,
(u,v) represent the coordinates of the principal point
and s is a term accounting for the skew. In general s
can be assumed zero. In practice, the principal point is
often close to the center of the image and the aspect
ratio r close to one. In many cases the camera does
not perfectly satisfy the perspective projection model
and distortions have to be taken into account, the most
important being radial distortion. In practice, it is often
sufficient to model the radial distortion as follows:
\[ m \sim \mathcal{P}(M) = K\,\mathrm{R}\big( R^\top [\, I \mid -t \,] M \big) \quad\text{with}\quad \mathrm{R}(x) = \big( 1 + K_1 (x^2 + y^2) \big)\, [x\ y\ 0]^\top + [0\ 0\ 1]^\top \tag{3} \]
where $K_1$ indicates the amount of radial distortion that
is present in the image. For high accuracy applica-
tions more advanced models can be used (Slama, 1980;
Willson, 1994).
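To make the camera model above concrete, here is a minimal sketch in Python/NumPy of the projection of Eqs. (1)-(3); the function name and the bundling of parameters are illustrative, not part of the system described in this paper.

```python
import numpy as np

def project(M, K, R, t, k1=0.0):
    """Sketch of the projection of Eqs. (1)-(3): world point M (3-vector),
    intrinsics K (3x3), orientation R, camera position t and radial
    distortion coefficient k1 (the K_1 of Eq. (3)). Illustrative only."""
    # Camera coordinates: R^T [I | -t] M
    x_cam = R.T @ (np.asarray(M, float) - t)
    # Normalize to the plane z = 1 before applying radial distortion
    x, y = x_cam[0] / x_cam[2], x_cam[1] / x_cam[2]
    r2 = x * x + y * y
    # R(x) = (1 + K_1 (x^2 + y^2)) [x y 0]^T + [0 0 1]^T
    xd = np.array([(1 + k1 * r2) * x, (1 + k1 * r2) * y, 1.0])
    # Apply the intrinsics and return inhomogeneous pixel coordinates
    m = K @ xd
    return m[:2] / m[2]
```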
In this paper the notation d(., .) will be used to in-
dicate the Euclidean distance between entities in the
images.
Two-View Geometry. The point m′ corresponding to the point m in another image is bound to lie on the projection of its line of sight, l′ ∼ Fm, where F is the fundamental matrix for the two views under consideration. Therefore, the following equation should be satisfied for all corresponding points:
\[ m'^\top F m = 0. \tag{4} \]
The fundamental matrix has rank 2 and the right and left null-spaces of F correspond to the epipoles. The epipoles e and e′ are the projections of the projection center of one image in the other image. The fundamental matrix can be obtained from two projection matrices P and P′ as
\[ F = (P'^\top)^\dagger P^\top [e]_\times \tag{5} \]
where the epipole e = PC′, with C′ the solution of P′C′ = 0.
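Equation (5) translates directly into a few lines of NumPy; the sketch below assumes the conventions used here (m′⊤Fm = 0, with m in the first view) and uses illustrative names.

```python
import numpy as np

def fundamental_from_projections(P1, P2):
    """Compute F for views (P1, P2) via Eq. (5), up to scale. A sketch."""
    # C2 is the projection center of the second camera: P2 C2 = 0
    _, _, Vt = np.linalg.svd(P2)
    C2 = Vt[-1]
    # Epipole in the first image: e = P1 C2
    e = P1 @ C2
    ex = np.array([[0, -e[2], e[1]],
                   [e[2], 0, -e[0]],
                   [-e[1], e[0], 0]])  # the cross-product matrix [e]_x
    # F = (P2^T)^+ P1^T [e]_x, Eq. (5)
    F = np.linalg.pinv(P2.T) @ P1.T @ ex
    return F / np.linalg.norm(F)
```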
Homographies. These can be used to transfer image points that correspond to 3D points on a specific plane from one image to the other, i.e. m′ ∼ Hm where H is the homography that corresponds to that plane (for the two views under consideration). There is an important relationship between such homographies and the fundamental matrix:
\[ F \sim [e']_\times H \quad\text{and}\quad H = [e']_\times F - e' a^\top \tag{6} \]
with [e′]× an antisymmetric matrix representing the vector product with the epipole and with a a vector related to the plane. Homographies for a plane L = [a b c d]^⊤ can also be obtained from projection matrices as
\[ H_{L_{ji}} = H_{L_j} H_{L_i}^{-1} \quad\text{with}\quad H_{L_i} = P_i \begin{bmatrix} d\, I_{3\times3} \\ -[a\ b\ c] \end{bmatrix} \tag{7} \]
From 3 points $M_1$, $M_2$ and $M_3$ a plane is obtained as the right null space of $[M_1\ M_2\ M_3]^\top$.
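A sketch of the plane-induced transfer of Eq. (7) and of the plane through three points, again in NumPy with illustrative names:

```python
import numpy as np

def plane_homography(P_i, P_j, L):
    """Homography transferring image points of view i to view j for 3D
    points on the plane L = [a b c d]^T, following Eq. (7). A sketch."""
    a, b, c, d = L
    # 4x3 rank-3 parameterization of the plane: L^T A = 0
    A = np.vstack([d * np.eye(3), -np.array([[a, b, c]])])
    H_i, H_j = P_i @ A, P_j @ A  # the H_Li of Eq. (7)
    return H_j @ np.linalg.inv(H_i)

def plane_from_points(M1, M2, M3):
    """Plane through three homogeneous 3D points: the right null space
    of [M1 M2 M3]^T."""
    _, _, Vt = np.linalg.svd(np.vstack([M1, M2, M3]))
    return Vt[-1]
```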
Comparing Image Regions. Image regions are typically compared using sum-of-squared-differences (SSD) or zero-mean normalized cross-correlation (ZNCC). Consider a window W in image I and a corresponding region T(W) in image J. The dissimilarity between two image regions based on SSD is given by
\[ D = \iint_W \big[ J(T(x, y)) - I(x, y) \big]^2 w(x, y)\, dx\, dy \tag{8} \]
where w(x, y) is a weighting function that is defined over W. Typically, w(x, y) = 1 or it is a Gaussian. The similarity measure between two image regions based on ZNCC is given by
\[ S = \frac{\iint_W \big( J(T(x, y)) - \bar J \big) \big( I(x, y) - \bar I \big)\, w(x, y)\, dx\, dy}{\sqrt{\iint_W \big( J(T(x, y)) - \bar J \big)^2 w(x, y)\, dx\, dy \cdot \iint_W \big( I(x, y) - \bar I \big)^2 w(x, y)\, dx\, dy}} \tag{9} \]
with $\bar J = \iint_W J(T(x, y))\, dx\, dy$ and $\bar I = \iint_W I(x, y)\, dx\, dy$ the mean image intensities in the considered regions. Note that this last measure is invariant to global intensity and contrast changes over the considered regions.
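For sampled windows of equal size, the discrete counterparts of Eqs. (8) and (9) can be sketched as follows (the window extraction and the warp T are assumed to happen elsewhere):

```python
import numpy as np

def ssd(win_I, win_J, w=None):
    """Discrete dissimilarity of Eq. (8) for two equally sized windows."""
    w = np.ones_like(win_I) if w is None else w
    return np.sum(w * (win_J - win_I) ** 2)

def zncc(win_I, win_J, w=None):
    """Discrete similarity of Eq. (9); invariant to global intensity and
    contrast changes over the windows."""
    w = np.ones_like(win_I) if w is None else w
    I0 = win_I - np.average(win_I, weights=w)
    J0 = win_J - np.average(win_J, weights=w)
    return np.sum(w * I0 * J0) / np.sqrt(
        np.sum(w * J0 ** 2) * np.sum(w * I0 ** 2))
```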
2. Relating Images
Starting from a collection of images or a video se-
quence the first step consists in relating the different
images to each other. This is not an easy problem. A
restricted number of corresponding points is sufficient
to determine the geometric relationship or multi-view
constraints between the images. Since not all points
are equally suited for matching or tracking (e.g. a pixel
in a homogeneous region), the first step consists of selecting a number of interesting points or feature points.
Some approaches also use other features, such as lines
or curves, but these will not be discussed here. Depend-
ing on the type of image data (i.e. video or still pictures)
the feature points are tracked or matched and a number
of potential correspondences are obtained. From these
the multi-view constraints can be computed. However,
since the correspondence problem is an ill-posed prob-
lem, the set of corresponding points can be contam-
inated with an important number of wrong matches
or outliers. In this case, a traditional least-squares ap-
proach will fail and therefore a robust method is needed.
Once the multi-view constraints have been obtained
they can be used to guide the search for additional cor-
respondences. These can then be used to further refine
the results for the multi-view constraints.
2.1. Feature Extraction and Matching
One of the most important requirements for a fea-
ture point is that it can be differentiated from its
neighboring image points. If this were not the case,
it wouldn’t be possible to match it uniquely with
a corresponding point in another image. Therefore,
the neighborhood of a feature should be sufficiently
different from the neighborhoods obtained after a small
displacement.
A second order approximation of the dissimilarity,
as defined in Eq. (8), between an image window W and
a slightly translated image window is given by
\[ d(\Delta x, \Delta y) = [\, \Delta x \ \ \Delta y \,]\, M \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} \quad\text{with}\quad M = \iint_W \begin{bmatrix} \left(\frac{\partial I}{\partial x}\right)^{\!2} & \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} \\[4pt] \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} & \left(\frac{\partial I}{\partial y}\right)^{\!2} \end{bmatrix} w(x, y)\, dx\, dy \tag{10} \]
To ensure that no displacement exists for which D
is small, the eigenvalues of M should both be large.
This can be achieved by enforcing a minimal value
for the smallest eigenvalue (Shi and Tomasi, 1994) or
alternatively for the following corner response func-
tion $R = \det M - k(\operatorname{trace} M)^2$ (Harris and Stephens, 1988) where k is a parameter set to 0.04 (a sugges-
tion of Harris). In the case of tracking this is sufficient
to ensure that features can be tracked from one video
frame to the next. In this case it is natural to use the
tracking neighborhood to evaluate the quality of a fea-
ture (e.g. a 7 × 7 window with w(x, y) = 1). Tracking
itself is done by minimizing Eq. (8) over the param-
eters of T. For small steps a translation is sufficient for T. To evaluate the accumulated difference from the
start of the track it is advised to use an affine motion
model.
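One translational update step of such a tracker can be sketched as follows; `bilinear_sample` is a hypothetical helper (bilinear interpolation of J at a sub-pixel location) and the explicit loop is for clarity only:

```python
import numpy as np

def klt_translation_step(I, J, Ix, Iy, window, d):
    """One Gauss-Newton update of the translation d = (dx, dy) minimizing
    Eq. (8) over a window, in the spirit of Shi and Tomasi (1994).
    I, J are consecutive frames, Ix and Iy the gradients of I, `window`
    a list of integer pixel coordinates (x, y). A sketch."""
    M = np.zeros((2, 2))
    b = np.zeros(2)
    for x, y in window:
        g = np.array([Ix[y, x], Iy[y, x]])
        M += np.outer(g, g)  # the structure tensor of Eq. (10)
        # Residual between the reference window and the shifted window in J
        # (bilinear_sample is a hypothetical interpolation helper)
        b += g * (I[y, x] - bilinear_sample(J, x + d[0], y + d[1]))
    return d + np.linalg.solve(M, b)
```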
In the case of separate frames, as obtained with a still camera, there is the additional requirement that as many image points as possible originating from the same 3D points should be extracted. Therefore, only local maxima of the corner response function are considered as suitable features. Sub-pixel precision can be achieved through quadratic approximation of the neighborhood of the local maxima. A typical choice for w(x, y) in this case is a Gaussian with σ = 0.7.
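Combining the structure tensor of Eq. (10), the corner response and the local-maximum test, this feature extraction could be sketched as follows (SciPy/NumPy; the response threshold is an illustrative value):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_corners(I, sigma=0.7, k=0.04, thresh=1e-6):
    """Corner response R = det M - k (trace M)^2 with a Gaussian window
    (sigma and k as in the text), keeping local maxima only. A sketch."""
    Iy, Ix = np.gradient(I.astype(float))
    # Gaussian-weighted entries of the matrix M of Eq. (10)
    Mxx = gaussian_filter(Ix * Ix, sigma)
    Mxy = gaussian_filter(Ix * Iy, sigma)
    Myy = gaussian_filter(Iy * Iy, sigma)
    R = (Mxx * Myy - Mxy ** 2) - k * (Mxx + Myy) ** 2
    # Suitable features: local maxima of the response above a threshold
    peaks = (R == maximum_filter(R, size=7)) & (R > thresh)
    ys, xs = np.nonzero(peaks)
    return np.column_stack([xs, ys])
```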
Matching is typically done by comparing small, e.g.
7 × 7, windows centered around the feature through
SSD or ZNCC. This measure is only invariant to image
translations and can therefore not cope with too large
variations in camera pose.
To match images that are more widely separated, it
is required to cope with a larger set of image variations.
Exhaustive search over all possible variations is com-
putationally intractable. A more interesting approach
consists of extracting a more complex feature that not
only determines the position, but also the other un-
knowns of a local similarity (Schmid and Mohr, 1997)
or affine transformation (Lowe, 1999; Tuytelaars and
Van Gool, 2000).
2.2. Two View Geometry Computation
Even for an arbitrary geometric structure, the projec-
tions of points in two views contain some structure.
Recovering this structure is not only interesting to re-
trieve information on the relative pose between the two
views (Faugeras, 1992; Hartley et al., 1992), but also
to eliminate mismatches and to facilitate the search
for additional matches. This structure corresponds to
the epipolar geometry and is mathematically expressed
by the fundamental matrix. Given a number of corre-
sponding points Eq. (4) can be used to compute F. This
equation can be rewritten in the following form:
\[ [\, x x' \ \ y x' \ \ x' \ \ x y' \ \ y y' \ \ y' \ \ x \ \ y \ \ 1 \,]\, \mathbf{f} = 0 \tag{11} \]
with $m = [x\ y\ 1]^\top$, $m' = [x'\ y'\ 1]^\top$ and $\mathbf{f}$ a vector containing the elements of the fundamental matrix. Stacking 8 or more of these equations makes it possible to solve linearly
for the fundamental matrix. Even for 7 corresponding
points the one parameter family of solutions obtained
by solving the linear equations can be restricted to 1
or 3 solutions by enforcing the cubic rank-2 constraint
$\det(F_1 + \lambda F_2) = 0$. As pointed out by Hartley (1997) it
is important to normalize the image coordinates before
solving the linear equations. Otherwise the columns of
Eq. (11) would differ by several orders of magnitude
and the error would concentrate on the coefficients cor-
responding to the smaller columns. This normalization
can be achieved by transforming the image center to the
origin and scaling the images so that the coordinates
have a standard deviation of unity. More advanced ap-
proaches have been proposed (e.g. Matei and Meer,
2000), but in practice the simple approach is sufficient
to initialize a non-linear minimization. The result of
the linear equations can be refined by minimizing the
following criterion (Zhang et al., 1995):
\[ C(F) = \sum \big( d(m', Fm)^2 + d(m, F^\top m')^2 \big) \tag{12} \]
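As an illustration, the normalized linear estimate and the evaluation of criterion (12) can be sketched as follows (NumPy; a simplification of what the text describes, with illustrative names):

```python
import numpy as np

def normalize(pts):
    """Hartley normalization: centroid to the origin, unit standard
    deviation. pts is an (n, 2) array; returns (n, 3) points and T."""
    c, s = pts.mean(axis=0), pts.std()
    T = np.array([[1 / s, 0, -c[0] / s], [0, 1 / s, -c[1] / s], [0, 0, 1]])
    return (T @ np.column_stack([pts, np.ones(len(pts))]).T).T, T

def linear_f(m1, m2):
    """Linear estimate of F from n >= 8 correspondences (m2^T F m1 = 0)."""
    x1, T1 = normalize(m1)
    x2, T2 = normalize(m2)
    # One row of Eq. (11) per correspondence, f stacked row by row
    A = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(m1))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)           # enforce the rank-2 constraint
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    return T2.T @ F @ T1                  # undo the normalization

def epipolar_cost(F, m1, m2):
    """Symmetric epipolar distance criterion of Eq. (12)."""
    def d2(F, a, b):  # squared distance from point b to the line F a
        l = F @ np.append(a, 1.0)
        return (l @ np.append(b, 1.0)) ** 2 / (l[0] ** 2 + l[1] ** 2)
    return sum(d2(F, p, q) + d2(F.T, q, p) for p, q in zip(m1, m2))
```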
This criterion can be minimized through a Levenberg-Marquardt algorithm (Press et al., 1992). An even better
approach consists of computing a maximum-likelihood
estimation (MLE) by minimizing the following
criterion:
\[ C(F, \hat m, \hat m') = \sum \big( d(\hat m, m)^2 + d(\hat m', m')^2 \big) \quad\text{with}\quad \hat m'^\top F \hat m = 0 \tag{13} \]
Although in this case the minimization has to be carried
out over a much larger set of variables, this can be
achieved efficiently by taking advantage of the sparsity
of the problem (see Section 3.3).
To compute the fundamental matrix from a set of
matches that were automatically obtained from a pair
of real images, it is important to explicitly deal with
outliers. If the set of matches is contaminated with even
a small set of outliers, the result of the above method
can become unusable. This is typical for all types of
least-squares approaches (even non-linear ones). The
problem is that the quadratic penalty (which is optimal
for Gaussian noise) allows for a single outlier that is
very far away from the true solution to completely bias
the final result (Torr, 1995).
The approach that is used to cope with this problem is
the RANSAC algorithm that was proposed by Fischler
and Bolles (1981). A minimal subset of the data is
randomly selected and the solution obtained from it is
used to segment the remainder of the dataset into “inliers” and “outliers”. If the initial set contains no outliers, it
can be expected that an important number of inliers
will support the solution, otherwise the initial subset is
probably contaminated with outliers. This procedure is
repeated until a satisfying solution is obtained. This can
for example be defined as a probability in excess of 95%
that a good subsample was selected. The expression for
this probability is $P = 1 - (1 - \gamma^p)^m$ with γ the fraction of inliers, p the number of features in each sample and m the number of trials (see Rousseeuw (1987)).
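A sketch of this loop, reusing `linear_f` and `epipolar_cost` from the previous sketch (and, for simplicity, an 8-point sample rather than the 7-point minimal solver of Table 1):

```python
import numpy as np

def ransac_f(m1, m2, thresh=1.0, p_success=0.95, max_iter=1000):
    """RANSAC (Fischler and Bolles, 1981) for the fundamental matrix.
    m1, m2 are (n, 2) arrays of corresponding points. A sketch."""
    rng = np.random.default_rng()
    best = np.zeros(len(m1), dtype=bool)
    needed, trials = max_iter, 0
    while trials < min(needed, max_iter):
        sample = rng.choice(len(m1), 8, replace=False)
        F = linear_f(m1[sample], m2[sample])
        # Inliers: symmetric epipolar distance below the threshold
        d = np.array([epipolar_cost(F, [p], [q]) for p, q in zip(m1, m2)])
        inliers = d < thresh ** 2
        if inliers.sum() > best.sum():
            best = inliers
            gamma = inliers.mean()  # fraction of inliers
            # Trials m needed so that P = 1 - (1 - gamma^p)^m >= p_success
            needed = np.log(1 - p_success) / np.log(1 - gamma ** 8 + 1e-12)
        trials += 1
    return linear_f(m1[best], m2[best]), best  # refine on all inliers
```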
Once the epipolar geometry has been computed it
can be used to guide the matching procedure toward
additional matches. At this point only features that
are close to the epipolar line should be considered for
matching. Table 1 summarizes the robust approach to
the determination of the two-view geometry.
3. Structure and Motion Recovery
In the previous section it was seen how different views
could be related to each other. In this section the relation
between the views and the correspondences between
the features will be used to retrieve the structure of the
scene and the motion of the camera. This problem is
called Structure and Motion.
Table 1. Overview of the two-view geometry computation algorithm.

Step 1. Compute a set of potential matches
Step 2. While P(#inliers, #samples) < 95% do
    Step 2.1 select minimal sample (7 matches)
    Step 2.2 compute solutions for F
    Step 2.3 determine inliers
Step 3. Refine F based on all inliers
Step 4. Look for additional matches
Step 5. Refine F based on all correct matches
The approach that is proposed here extends
(Beardsley et al., 1997; Koch et al., 1999b) by be-
ing fully projective and therefore not dependent on
the quasi-Euclidean initialization. This was achieved by
carrying out all measurements in the images. This ap-
proach provides an alternative for the triplet-based ap-
proach proposed in Fitzgibbon and Zisserman (1998).
An image-based measure that is able to obtain a quali-
tative distance between viewpoints is also proposed to
support initialization and determination of close views
(independently of the actual projective frame).
At first two images are selected and an initial recon-
struction frame is set up. Then the pose of the camera
for the other views is determined in this frame and
each time the initial reconstruction is refined and ex-
tended. In this way the pose estimation of views that
have no common features with the reference views also
becomes possible. Typically, a view is only matched
with its predecessor in the sequence. In most cases this
works fine, but in some cases (e.g. when the camera
moves back and forth) it can be interesting to also re-
late a new view to a number of additional views. Once
the structure and motion have been determined for the
whole sequence, the results can be refined through a
projective bundle adjustment. Then the ambiguity will
be restricted to metric through self-calibration. Finally,
a metric bundle adjustment is carried out to obtain an
optimal estimation of the structure and motion.
3.1. Initial Structure and Motion
The first step consists of selecting two views that are
suited for initializing the sequential structure and mo-
tion computation. On the one hand it is important that
sufficient features are matched between these views, on
the other hand the views should not be too close to each
other so that the initial structure is well-conditioned.
The first of these criteria is easy to verify, the