International Journal of Computer Vision 59(3), 207–232, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Visual Modeling with a Hand-Held Camera
MARC POLLEFEYS
Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599-3175, USA
marc@cs.unc.edu
LUC VAN GOOL, MAARTEN VERGAUWEN, FRANK VERBIEST, KURT CORNELIS AND JAN TOPS
Center for Processing of Speech and Images, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10,
B-3001 Leuven, Belgium
Luc.VanGool@esat.kuleuven.ac.be
Maarten.Vergauwen@esat.kuleuven.ac.be
Frank.Verbiest@esat.kuleuven.ac.be
Kurt.Cornelis@esat.kuleuven.ac.be
Jan.Tops@esat.kuleuven.ac.be
REINHARD KOCH
Institut für Informatik und Praktische Mathematik, Christian-Albrechts-Universität Kiel,
Hermann-Rodewald-Str. 3, D-24098 Kiel, Germany
rk@mip.informatik.uni-kiel.de
Received March 14, 2001; Revised October 8, 2003; Accepted October 10, 2003
Abstract. In this paper a complete system to build visual models from camera images is presented. The system can
deal with uncalibrated image sequences acquired with a hand-held camera. Based on tracked or matched features the
relations between multiple views are computed. From this both the structure of the scene and the motion of the camera
are retrieved. The ambiguity of the reconstruction is restricted from projective to metric through self-calibration. A
flexible multi-view stereo matching scheme is used to obtain a dense estimation of the surface geometry. From the
computed data different types of visual models are constructed. Besides the traditional geometry- and image-based
approaches, a combined approach with view-dependent geometry and texture is presented. As an application, the fusion
of real and virtual scenes is also shown.
Keywords: visual modeling, structure-from-motion, projective reconstruction, self-calibration, multi-view stereo
matching, dense reconstruction, 3D reconstruction, image-based rendering, augmented video, hand-held camera
1. Introduction
In recent years a lot of effort has been put into developing new approaches for modeling and rendering visual scenes. A few years ago the main applications of 3D modeling in vision were robot navigation and visual inspection. Nowadays, however, the emphasis has shifted: there is a growing demand for 3D models in computer graphics, virtual reality and communication, and this changes the requirements. The visual quality of the models becomes the main point of attention. There is also an important demand for simple and flexible acquisition procedures; therefore, calibration should be absent or restricted to a minimum.
Many new applications also require robust, low-cost acquisition systems. This stimulates the use of consumer photo or video cameras.
In this paper we present an approach that can be
used to obtain several types of visual models from im-
ages acquired with an uncalibrated camera. The user ac-
quires the images by freely moving the camera around
an object or scene. Neither the camera motion nor the
camera settings have to be known a priori. The pre-
sented approach can generate a textured 3D surface
model or alternatively render new views using a com-
bined geometry- and image-based approach that uses
view-dependent texture and geometry. The system can
also be used to combine virtual objects with real video,
yielding augmented video sequences.
Other approaches for extracting 3D shape and tex-
ture from image sequences acquired with a freely
moving camera have been proposed. The approach of
Tomasi and Kanade (1992) used an affine factoriza-
tion method to extract 3D from image sequences. An
important restriction of this system is the assumption
of orthographic projection. Another type of approach
starts from an approximate 3D model and camera poses
and refines the model based on images (e.g. Facade, proposed by Debevec et al. (1996)). The advantage is that fewer images are required. On the other hand a prelimi-
nary model must be available which mostly limits the
approach to man-made environments. This approach
also combines geometry- and image-based techniques; however, only the texture is view-dependent.
The approach presented here combines many ideas
and algorithms that have been developed in recent
years. This paper aims at consolidating these results
by bringing them together and showing how they can
be combined to yield a complete visual modeling
approach. In Section 1.1 notations and background are
given. The rest of the paper is then organized as follows.
Section 2 discusses feature extraction and matching
and the computation of the multi-view relations.
Section 3 deals with the structure and motion recovery,
including self-calibration. In Section 4 the approach
to obtain dense depth maps is presented and in
Section 5 the construction of the different visual mod-
els is discussed. The paper is concluded in Section 6.
1.1. Notations and Background
In this section we briefly introduce some of the geomet-
ric concepts used throughout the paper. A more in-depth
description can be found in Faugeras et al. (2001) and
Hartley and Zisserman (2000). A perspective camera
is modeled through the projection equation
m ∼ PM (1)
where ∼ represents the equality up to a non-zero scale
factor, M is a 4-vector that represents a 3D world point
in homogeneous coordinates, similarly m is a 3-vector
that represents a corresponding 2D image point and P
is a 3 × 4 projection matrix. In a metric or Euclidean
frame P can be factorized as follows
\[ P = K R^\top [\, I \mid -t \,] \quad\text{where}\quad K = \begin{bmatrix} f & s & u \\ & rf & v \\ & & 1 \end{bmatrix} \tag{2} \]
contains the intrinsic camera parameters, R is a rotation
matrix representing the orientation and t is a 3-vector
representing the position of the camera. The intrinsic
camera parameter f represents the focal length mea-
sured in width of pixels, r is the aspect ratio of pixels,
(u,v) represent the coordinates of the principal point
and s is a term accounting for the skew. In general s
can be assumed zero. In practice, the principal point is
often close to the center of the image and the aspect
ratio r close to one. In many cases the camera does
not perfectly satisfy the perspective projection model
and distortions have to be taken into account, the most
important being radial distortion. In practice, it is often
sufficient to model the radial distortion as follows:
\[ m \sim \mathcal{P}(M) = K\,\mathrm{R}\big( R^\top [\, I \mid -t \,] M \big) \quad\text{with}\quad \mathrm{R}(x) = \big( 1 + K_1 (x^2 + y^2) \big)\, [x\ y\ 0]^\top + [0\ 0\ 1]^\top \tag{3} \]
where $K_1$ indicates the amount of radial distortion that
is present in the image. For high accuracy applica-
tions more advanced models can be used (Slama, 1980;
Willson, 1994).
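To make the camera model above concrete, here is a minimal sketch in Python/NumPy of the projection of Eqs. (1)-(3); the function name and the bundling of parameters are illustrative, not part of the system described in this paper.

```python
import numpy as np

def project(M, K, R, t, k1=0.0):
    """Sketch of the projection of Eqs. (1)-(3): world point M (3-vector),
    intrinsics K (3x3), orientation R, camera position t and radial
    distortion coefficient k1 (the K_1 of Eq. (3)). Illustrative only."""
    # Camera coordinates: R^T [I | -t] M
    x_cam = R.T @ (np.asarray(M, float) - t)
    # Normalize to the plane z = 1 before applying radial distortion
    x, y = x_cam[0] / x_cam[2], x_cam[1] / x_cam[2]
    r2 = x * x + y * y
    # R(x) = (1 + K_1 (x^2 + y^2)) [x y 0]^T + [0 0 1]^T
    xd = np.array([(1 + k1 * r2) * x, (1 + k1 * r2) * y, 1.0])
    # Apply the intrinsics and return inhomogeneous pixel coordinates
    m = K @ xd
    return m[:2] / m[2]
```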
In this paper the notation d(., .) will be used to in-
dicate the Euclidean distance between entities in the
images.
Two-View Geometry. The point m′ corresponding to the point m in another image is bound to lie on the projection of its line of sight, l′ ∼ Fm, where F is the fundamental matrix for the two views under consideration. Therefore, the following equation should be satisfied for all corresponding points:
\[ m'^\top F m = 0. \tag{4} \]
The fundamental matrix has rank 2 and the right and left null-spaces of F correspond to the epipoles. The epipoles e and e′ are the projections of the projection center of one image in the other image. The fundamental matrix can be obtained from two projection matrices P and P′ as
\[ F = (P'^\top)^\dagger P^\top [e]_\times \tag{5} \]
where the epipole e = PC′, with C′ the solution of P′C′ = 0.
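Equation (5) translates directly into a few lines of NumPy; the sketch below assumes the conventions used here (m′⊤Fm = 0, with m in the first view) and uses illustrative names.

```python
import numpy as np

def fundamental_from_projections(P1, P2):
    """Compute F for views (P1, P2) via Eq. (5), up to scale. A sketch."""
    # C2 is the projection center of the second camera: P2 C2 = 0
    _, _, Vt = np.linalg.svd(P2)
    C2 = Vt[-1]
    # Epipole in the first image: e = P1 C2
    e = P1 @ C2
    ex = np.array([[0, -e[2], e[1]],
                   [e[2], 0, -e[0]],
                   [-e[1], e[0], 0]])  # the cross-product matrix [e]_x
    # F = (P2^T)^+ P1^T [e]_x, Eq. (5)
    F = np.linalg.pinv(P2.T) @ P1.T @ ex
    return F / np.linalg.norm(F)
```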
Homographies. These can be used to transfer image points that correspond to 3D points on a specific plane from one image to the other, i.e. m′ ∼ Hm where H is the homography that corresponds to that plane (for the two views under consideration). There is an important relationship between such homographies and the fundamental matrix:
\[ F \sim [e']_\times H \quad\text{and}\quad H = [e']_\times F - e' a^\top \tag{6} \]
with [e′]× an antisymmetric matrix representing the vector product with the epipole and with a a vector related to the plane. Homographies for a plane L = [a b c d]^⊤ can also be obtained from projection matrices as
\[ H_{L_{ji}} = H_{L_j} H_{L_i}^{-1} \quad\text{with}\quad H_{L_i} = P_i \begin{bmatrix} d\, I_{3\times3} \\ -[a\ b\ c] \end{bmatrix} \tag{7} \]
From 3 points $M_1$, $M_2$ and $M_3$ a plane is obtained as the right null space of $[M_1\ M_2\ M_3]^\top$.
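A sketch of the plane-induced transfer of Eq. (7) and of the plane through three points, again in NumPy with illustrative names:

```python
import numpy as np

def plane_homography(P_i, P_j, L):
    """Homography transferring image points of view i to view j for 3D
    points on the plane L = [a b c d]^T, following Eq. (7). A sketch."""
    a, b, c, d = L
    # 4x3 rank-3 parameterization of the plane: L^T A = 0
    A = np.vstack([d * np.eye(3), -np.array([[a, b, c]])])
    H_i, H_j = P_i @ A, P_j @ A  # the H_Li of Eq. (7)
    return H_j @ np.linalg.inv(H_i)

def plane_from_points(M1, M2, M3):
    """Plane through three homogeneous 3D points: the right null space
    of [M1 M2 M3]^T."""
    _, _, Vt = np.linalg.svd(np.vstack([M1, M2, M3]))
    return Vt[-1]
```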
Comparing Image Regions. Image regions are typically compared using sum-of-squared-differences (SSD) or zero-mean normalized cross-correlation (ZNCC). Consider a window W in image I and a corresponding region T(W) in image J. The dissimilarity between two image regions based on SSD is given by
\[ D = \iint_W \big[ J(T(x, y)) - I(x, y) \big]^2 w(x, y)\, dx\, dy \tag{8} \]
where w(x, y) is a weighting function that is defined over W. Typically, w(x, y) = 1 or it is a Gaussian. The similarity measure between two image regions based on ZNCC is given by
\[ S = \frac{\iint_W \big( J(T(x, y)) - \bar J \big) \big( I(x, y) - \bar I \big)\, w(x, y)\, dx\, dy}{\sqrt{\iint_W \big( J(T(x, y)) - \bar J \big)^2 w(x, y)\, dx\, dy \cdot \iint_W \big( I(x, y) - \bar I \big)^2 w(x, y)\, dx\, dy}} \tag{9} \]
with $\bar J = \iint_W J(T(x, y))\, dx\, dy$ and $\bar I = \iint_W I(x, y)\, dx\, dy$ the mean image intensities in the considered regions. Note that this last measure is invariant to global intensity and contrast changes over the considered regions.
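For sampled windows of equal size, the discrete counterparts of Eqs. (8) and (9) can be sketched as follows (the window extraction and the warp T are assumed to happen elsewhere):

```python
import numpy as np

def ssd(win_I, win_J, w=None):
    """Discrete dissimilarity of Eq. (8) for two equally sized windows."""
    w = np.ones_like(win_I) if w is None else w
    return np.sum(w * (win_J - win_I) ** 2)

def zncc(win_I, win_J, w=None):
    """Discrete similarity of Eq. (9); invariant to global intensity and
    contrast changes over the windows."""
    w = np.ones_like(win_I) if w is None else w
    I0 = win_I - np.average(win_I, weights=w)
    J0 = win_J - np.average(win_J, weights=w)
    return np.sum(w * I0 * J0) / np.sqrt(
        np.sum(w * J0 ** 2) * np.sum(w * I0 ** 2))
```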
2. Relating Images
Starting from a collection of images or a video se-
quence the first step consists in relating the different
images to each other. This is not an easy problem. A
restricted number of corresponding points is sufficient
to determine the geometric relationship or multi-view
constraints between the images. Since not all points
are equally suited for matching or tracking (e.g. a pixel
in a homogeneous region), the first step consists of selecting a number of interesting points or feature points.
Some approaches also use other features, such as lines
or curves, but these will not be discussed here. Depend-
ing on the type of image data (i.e. video or still pictures)
the feature points are tracked or matched and a number
of potential correspondences are obtained. From these
the multi-view constraints can be computed. However,
since the correspondence problem is an ill-posed prob-
lem, the set of corresponding points can be contam-
inated with an important number of wrong matches
or outliers. In this case, a traditional least-squares ap-
proach will fail and therefore a robust method is needed.
Once the multi-view constraints have been obtained
they can be used to guide the search for additional cor-
respondences. These can then be used to further refine
the results for the multi-view constraints.
2.1. Feature Extraction and Matching
One of the most important requirements for a fea-
ture point is that it can be differentiated from its
neighboring image points. If this were not the case,
it wouldn’t be possible to match it uniquely with
a corresponding point in another image. Therefore,
the neighborhood of a feature should be sufficiently
different from the neighborhoods obtained after a small
displacement.
A second order approximation of the dissimilarity,
as defined in Eq. (8), between an image window W and
a slightly translated image window is given by
\[ d(\Delta x, \Delta y) = [\, \Delta x \ \ \Delta y \,]\, M \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} \quad\text{with}\quad M = \iint_W \begin{bmatrix} \left(\frac{\partial I}{\partial x}\right)^{\!2} & \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} \\[4pt] \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} & \left(\frac{\partial I}{\partial y}\right)^{\!2} \end{bmatrix} w(x, y)\, dx\, dy \tag{10} \]
To ensure that no displacement exists for which D
is small, the eigenvalues of M should both be large.
This can be achieved by enforcing a minimal value
for the smallest eigenvalue (Shi and Tomasi, 1994) or
alternatively for the following corner response func-
tion $R = \det M - k(\operatorname{trace} M)^2$ (Harris and Stephens, 1988) where k is a parameter set to 0.04 (a sugges-
tion of Harris). In the case of tracking this is sufficient
to ensure that features can be tracked from one video
frame to the next. In this case it is natural to use the
tracking neighborhood to evaluate the quality of a fea-
ture (e.g. a 7 × 7 window with w(x, y) = 1). Tracking
itself is done by minimizing Eq. (8) over the param-
eters of T. For small steps a translation is sufficient for T. To evaluate the accumulated difference from the
start of the track it is advised to use an affine motion
model.
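One translational update step of such a tracker can be sketched as follows; `bilinear_sample` is a hypothetical helper (bilinear interpolation of J at a sub-pixel location) and the explicit loop is for clarity only:

```python
import numpy as np

def klt_translation_step(I, J, Ix, Iy, window, d):
    """One Gauss-Newton update of the translation d = (dx, dy) minimizing
    Eq. (8) over a window, in the spirit of Shi and Tomasi (1994).
    I, J are consecutive frames, Ix and Iy the gradients of I, `window`
    a list of integer pixel coordinates (x, y). A sketch."""
    M = np.zeros((2, 2))
    b = np.zeros(2)
    for x, y in window:
        g = np.array([Ix[y, x], Iy[y, x]])
        M += np.outer(g, g)  # the structure tensor of Eq. (10)
        # Residual between the reference window and the shifted window in J
        # (bilinear_sample is a hypothetical interpolation helper)
        b += g * (I[y, x] - bilinear_sample(J, x + d[0], y + d[1]))
    return d + np.linalg.solve(M, b)
```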
In the case of separate frames, as obtained with a still camera, there is the additional requirement that as many image points as possible originating from the same 3D points should be extracted. Therefore, only local maxima of the corner response function are considered as suitable features. Sub-pixel precision can be achieved through quadratic approximation of the neighborhood of the local maxima. A typical choice for w(x, y) in this case is a Gaussian with σ = 0.7.
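Combining the structure tensor of Eq. (10), the corner response and the local-maximum test, this feature extraction could be sketched as follows (SciPy/NumPy; the response threshold is an illustrative value):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_corners(I, sigma=0.7, k=0.04, thresh=1e-6):
    """Corner response R = det M - k (trace M)^2 with a Gaussian window
    (sigma and k as in the text), keeping local maxima only. A sketch."""
    Iy, Ix = np.gradient(I.astype(float))
    # Gaussian-weighted entries of the matrix M of Eq. (10)
    Mxx = gaussian_filter(Ix * Ix, sigma)
    Mxy = gaussian_filter(Ix * Iy, sigma)
    Myy = gaussian_filter(Iy * Iy, sigma)
    R = (Mxx * Myy - Mxy ** 2) - k * (Mxx + Myy) ** 2
    # Suitable features: local maxima of the response above a threshold
    peaks = (R == maximum_filter(R, size=7)) & (R > thresh)
    ys, xs = np.nonzero(peaks)
    return np.column_stack([xs, ys])
```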
Matching is typically done by comparing small, e.g.
7 × 7, windows centered around the feature through
SSD or ZNCC. This measure is only invariant to image
translations and can therefore not cope with too large
variations in camera pose.
To match images that are more widely separated, it
is required to cope with a larger set of image variations.
Exhaustive search over all possible variations is com-
putationally intractable. A more interesting approach
consists of extracting a more complex feature that not
only determines the position, but also the other un-
knowns of a local similarity (Schmid and Mohr, 1997)
or affine transformation (Lowe, 1999; Tuytelaars and
Van Gool, 2000).
2.2. Two View Geometry Computation
Even for an arbitrary geometric structure, the projec-
tions of points in two views contain some structure.
Recovering this structure is not only interesting to re-
trieve information on the relative pose between the two
views (Faugeras, 1992; Hartley et al., 1992), but also
to eliminate mismatches and to facilitate the search
for additional matches. This structure corresponds to
the epipolar geometry and is mathematically expressed
by the fundamental matrix. Given a number of corre-
sponding points Eq. (4) can be used to compute F. This
equation can be rewritten in the following form:
\[ [\, x x' \ \ y x' \ \ x' \ \ x y' \ \ y y' \ \ y' \ \ x \ \ y \ \ 1 \,]\, \mathbf{f} = 0 \tag{11} \]
with $m = [x\ y\ 1]^\top$, $m' = [x'\ y'\ 1]^\top$ and $\mathbf{f}$ a vector containing the elements of the fundamental matrix. Stacking 8 or more of these equations makes it possible to solve linearly
for the fundamental matrix. Even for 7 corresponding
points the one parameter family of solutions obtained
by solving the linear equations can be restricted to 1
or 3 solutions by enforcing the cubic rank-2 constraint
$\det(F_1 + \lambda F_2) = 0$. As pointed out by Hartley (1997) it
is important to normalize the image coordinates before
solving the linear equations. Otherwise the columns of
Eq. (11) would differ by several orders of magnitude
and the error would concentrate on the coefficients cor-
responding to the smaller columns. This normalization
can be achieved by transforming the image center to the
origin and scaling the images so that the coordinates
have a standard deviation of unity. More advanced ap-
proaches have been proposed (e.g. Matei and Meer,
2000), but in practice the simple approach is sufficient
to initialize a non-linear minimization. The result of
the linear equations can be refined by minimizing the
following criterion (Zhang et al., 1995):
\[ C(F) = \sum \big( d(m', Fm)^2 + d(m, F^\top m')^2 \big) \tag{12} \]
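As an illustration, the normalized linear estimate and the evaluation of criterion (12) can be sketched as follows (NumPy; a simplification of what the text describes, with illustrative names):

```python
import numpy as np

def normalize(pts):
    """Hartley normalization: centroid to the origin, unit standard
    deviation. pts is an (n, 2) array; returns (n, 3) points and T."""
    c, s = pts.mean(axis=0), pts.std()
    T = np.array([[1 / s, 0, -c[0] / s], [0, 1 / s, -c[1] / s], [0, 0, 1]])
    return (T @ np.column_stack([pts, np.ones(len(pts))]).T).T, T

def linear_f(m1, m2):
    """Linear estimate of F from n >= 8 correspondences (m2^T F m1 = 0)."""
    x1, T1 = normalize(m1)
    x2, T2 = normalize(m2)
    # One row of Eq. (11) per correspondence, f stacked row by row
    A = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(m1))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)           # enforce the rank-2 constraint
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    return T2.T @ F @ T1                  # undo the normalization

def epipolar_cost(F, m1, m2):
    """Symmetric epipolar distance criterion of Eq. (12)."""
    def d2(F, a, b):  # squared distance from point b to the line F a
        l = F @ np.append(a, 1.0)
        return (l @ np.append(b, 1.0)) ** 2 / (l[0] ** 2 + l[1] ** 2)
    return sum(d2(F, p, q) + d2(F.T, q, p) for p, q in zip(m1, m2))
```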
This criterion can be minimized through a Levenberg-Marquardt algorithm (Press et al., 1992). An even better
approach consists of computing a maximum-likelihood
estimation (MLE) by minimizing the following
criterion:
\[ C(F, \hat m, \hat m') = \sum \big( d(\hat m, m)^2 + d(\hat m', m')^2 \big) \quad\text{with}\quad \hat m'^\top F \hat m = 0 \tag{13} \]
Although in this case the minimization has to be carried
out over a much larger set of variables, this can be
achieved efficiently by taking advantage of the sparsity
of the problem (see Section 3.3).
To compute the fundamental matrix from a set of
matches that were automatically obtained from a pair
of real images, it is important to explicitly deal with
outliers. If the set of matches is contaminated with even
a small set of outliers, the result of the above method
can become unusable. This is typical for all types of
least-squares approaches (even non-linear ones). The
problem is that the quadratic penalty (which is optimal
for Gaussian noise) allows for a single outlier that is
very far away from the true solution to completely bias
the final result (Torr, 1995).
The approach that is used to cope with this problem is
the RANSAC algorithm that was proposed by Fischler
and Bolles (1981). A minimal subset of the data is
randomly selected and the solution obtained from it is
used to segment the remainder of the dataset into “inliers” and “outliers”. If the initial set contains no outliers, it
can be expected that an important number of inliers
will support the solution, otherwise the initial subset is
probably contaminated with outliers. This procedure is
repeated until a satisfying solution is obtained. This can
for example be defined as a probability in excess of 95%
that a good subsample was selected. The expression for
this probability is $P = 1 - (1 - \gamma^p)^m$ with γ the fraction of inliers, p the number of features in each sample and m the number of trials (see Rousseeuw (1987)).
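A sketch of this loop, reusing `linear_f` and `epipolar_cost` from the previous sketch (and, for simplicity, an 8-point sample rather than the 7-point minimal solver of Table 1):

```python
import numpy as np

def ransac_f(m1, m2, thresh=1.0, p_success=0.95, max_iter=1000):
    """RANSAC (Fischler and Bolles, 1981) for the fundamental matrix.
    m1, m2 are (n, 2) arrays of corresponding points. A sketch."""
    rng = np.random.default_rng()
    best = np.zeros(len(m1), dtype=bool)
    needed, trials = max_iter, 0
    while trials < min(needed, max_iter):
        sample = rng.choice(len(m1), 8, replace=False)
        F = linear_f(m1[sample], m2[sample])
        # Inliers: symmetric epipolar distance below the threshold
        d = np.array([epipolar_cost(F, [p], [q]) for p, q in zip(m1, m2)])
        inliers = d < thresh ** 2
        if inliers.sum() > best.sum():
            best = inliers
            gamma = inliers.mean()  # fraction of inliers
            # Trials m needed so that P = 1 - (1 - gamma^p)^m >= p_success
            needed = np.log(1 - p_success) / np.log(1 - gamma ** 8 + 1e-12)
        trials += 1
    return linear_f(m1[best], m2[best]), best  # refine on all inliers
```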
Once the epipolar geometry has been computed it
can be used to guide the matching procedure toward
additional matches. At this point only features that
are close to the epipolar line should be considered for
matching. Table 1 summarizes the robust approach to
the determination of the two-view geometry.
3. Structure and Motion Recovery
In the previous section it was seen how different views
could be related to each other. In this section the relation
between the views and the correspondences between
the features will be used to retrieve the structure of the
scene and the motion of the camera. This problem is
called Structure and Motion.
Table 1. Overview of the two-view geometry computation algorithm.

Step 1. Compute a set of potential matches
Step 2. While P(#inliers, #samples) < 95% do
    Step 2.1 select minimal sample (7 matches)
    Step 2.2 compute solutions for F
    Step 2.3 determine inliers
Step 3. Refine F based on all inliers
Step 4. Look for additional matches
Step 5. Refine F based on all correct matches
The approach that is proposed here extends
(Beardsley et al., 1997; Koch et al., 1999b) by be-
ing fully projective and therefore not dependent on
the quasi-Euclidean initialization. This was achieved by
carrying out all measurements in the images. This ap-
proach provides an alternative for the triplet-based ap-
proach proposed in Fitzgibbon and Zisserman (1998).
An image-based measure that is able to obtain a quali-
tative distance between viewpoints is also proposed to
support initialization and determination of close views
(independently of the actual projective frame).
At first two images are selected and an initial recon-
struction frame is set up. Then the pose of the camera
for the other views is determined in this frame and
each time the initial reconstruction is refined and ex-
tended. In this way the pose estimation of views that
have no common features with the reference views also
becomes possible. Typically, a view is only matched
with its predecessor in the sequence. In most cases this
works fine, but in some cases (e.g. when the camera
moves back and forth) it can be interesting to also re-
late a new view to a number of additional views. Once
the structure and motion have been determined for the
whole sequence, the results can be refined through a
projective bundle adjustment. Then the ambiguity will
be restricted to metric through self-calibration. Finally,
a metric bundle adjustment is carried out to obtain an
optimal estimation of the structure and motion.
3.1. Initial Structure and Motion
The first step consists of selecting two views that are
suited for initializing the sequential structure and mo-
tion computation. On the one hand it is important that
sufficient features are matched between these views, on
the other hand the views should not be too close to each
other so that the initial structure is well-conditioned.
The first of these criteria is easy to verify, the