Point-Based Multi-View Stereo Network
Rui Chen^{1,3*}   Songfang Han^{2,3*}   Jing Xu^1   Hao Su^3
^1 Tsinghua University   ^2 The Hong Kong University of Science and Technology   ^3 University of California, San Diego
chenr17@mails.tsinghua.edu.cn   shanaf@connect.ust.hk   jingxu@tsinghua.edu.cn   haosu@eng.ucsd.edu
[Figure 1 graphic: point positions before and after flow; PointFlow and Dynamic Feature Fetching stages; coarse, refined, and final predictions shown against the GT surface.]
Figure 1: Point-MVSNet performs multi-view stereo reconstruction in a coarse-to-fine fashion, learning to predict the 3D flow of each point to the ground-truth surface based on geometry priors and 2D image appearance cues dynamically fetched from multi-view images, and regressing accurate and dense point clouds iteratively.
Abstract
We introduce Point-MVSNet, a novel point-based deep
framework for multi-view stereo (MVS). Distinct from
existing cost volume approaches, our method directly
processes the target scene as point clouds. More specifically,
our method predicts the depth in a coarse-to-fine manner.
We first generate a coarse depth map, convert it into a point
cloud and refine the point cloud iteratively by estimating
the residual between the depth of the current iteration
and that of the ground truth. Our network leverages 3D
geometry priors and 2D texture information jointly and
effectively by fusing them into a feature-augmented point
cloud, and processes the point cloud to estimate the 3D flow
for each point. This point-based architecture allows higher
accuracy, greater computational efficiency, and more flexibility
than cost-volume-based counterparts. Experimental results
show that our approach achieves a significant improvement
in reconstruction quality compared with state-of-the-art
methods on the DTU and Tanks and Temples datasets.
Our source code and trained models are available at
https://github.com/callmeray/PointMVSNet.
* Equal contribution.
1. Introduction
Recent learning-based multi-view stereo (MVS) methods [12, 29, 10] have shown great success compared with their traditional counterparts, as learning-based approaches are able to learn to exploit global scene semantic information, including object materials, specularity, and environmental illumination, to obtain more robust matching and more complete reconstruction. All these approaches apply dense multi-scale 3D CNNs to predict the depth map or voxel occupancy. However, 3D CNNs require memory cubic in the model resolution, which can be prohibitive to achieving optimal performance. While Maxim et al. [24] addressed this problem by progressively generating an octree structure, the quantization artifacts brought by grid partitioning still remain, and errors may accumulate since the tree is generated layer by layer.
In this work, we propose a novel point cloud multi-view
stereo network, where the target scene is directly processed
as a point cloud, a more efficient representation, particularly
when the 3D resolution is high. Our framework is composed
of two steps: first, in order to carve out the approximate
object surface from the whole scene, an initial coarse depth
map is generated by a relatively small 3D cost volume and
then converted to a point cloud. Subsequently, our novel
PointFlow module is applied to iteratively regress accurate
and dense point clouds from the initial point cloud. Similar to
ResNet [8], we explicitly formulate the PointFlow to predict
the residual between the depth of the current iteration and
that of the ground truth. The 3D flow is estimated based on
geometry priors inferred from the predicted point cloud and
the 2D image appearance cues dynamically fetched from
multi-view input images (Figure 1).
We find that our Point-based Multi-view Stereo Network
(Point-MVSNet) framework enjoys advantages in accuracy,
efficiency, and flexibility compared with previous MVS methods that are built upon a predefined 3D volume of fixed resolution to aggregate information from
views. Our method adaptively samples potential surface
points in the 3D space. It keeps the continuity of the surface
structure naturally, which is necessary for high-precision
reconstruction. Furthermore, because our network only
processes valid information near the object surface instead
of the whole 3D space as is the case in 3D CNNs, the
computation is much more efficient. Lastly, the adaptive
refinement scheme allows us to first peek at the scene at
coarse resolution and then densify the reconstructed point
cloud only in the region of interest. For scenarios such as
interaction-oriented robot vision, this flexibility would result in savings of computational power.
Our method achieves state-of-the-art performance on
standard multi-view stereo benchmarks among learning-
based methods, including DTU [1] and Tanks and Temples [15]. Compared with previous state-of-the-art, our
method produces better results in terms of both completeness
and overall quality. In addition, we show potential applications
of our proposed method, such as foveated depth inference.
2. Related work
Multi-view Stereo Reconstruction
MVS is a classical problem that was extensively studied before the rise of deep learning. A number of 3D representations have been adopted, including volumes [26, 9], deformation models [3, 31], and patches [5], which are iteratively updated through multi-view photometric consistency and regularization optimization.
Our iterative refinement procedure shares a similar idea
with these classical solutions by updating the depth map
iteratively. However, our learning-based algorithm achieves
improved robustness to input image corruption and avoids
the tedious manual hyper-parameter tuning.
Learning-based MVS
Inspired by the recent success of
deep learning in image recognition tasks, researchers began
to apply learning techniques to stereo reconstruction tasks
for better patch representation and matching [7, 22, 16].
Although these methods, which use only 2D networks, have brought great improvements to stereo tasks, it is
difficult to extend them to multi-view stereo tasks, and their
performance is limited in challenging scenes due to the lack
of contextual geometry knowledge. Concurrently, 3D cost
volume regularization approaches have been proposed [14, 12, 13], where a 3D cost volume is built either in the camera
frustum or the scene. Next, the 2D image features of the multiple views are warped into the cost volume so that 3D CNNs can
be applied to it. The key advantage of 3D cost volume is
that the 3D geometry of the scene can be captured by the
network explicitly, and the photometric matching can be
performed in 3D space, alleviating the influence of image
distortion caused by perspective transformation and potential
occlusions, which enables these methods to achieve better results than 2D learning-based methods. Instead of using voxel grids, in this paper we propose to use a point-based network for MVS tasks to take advantage of 3D geometry learning without being burdened by the inefficiency of 3D CNN computation.
High-Resolution MVS
High-resolution MVS is critical to
real applications such as robot manipulation and augmented
reality. Traditional methods [17, 5, 18] generate dense 3D
patches by expanding from confident matching key-points
repeatedly, which is potentially time-consuming. These
methods are also sensitive to noise and change of viewpoint
owing to the usage of hand-crafted features. Recent learning
methods try to ease memory consumption by advanced space
partitioning [21, 27, 24]. However, most of these methods
construct a fixed cost volume representation for the whole
scene, lacking flexibility. In our work, we use point clouds
as the representation of the scene, which is more flexible and
enables us to approach the accurate position progressively.
Point-based 3D Learning
Recently, a new type of deep
network architecture has been proposed in [19, 20], which
is able to process point clouds directly without converting
them to volumetric grids. Compared with voxel-based
methods, this kind of architecture concentrates on the
point cloud data and saves unnecessary computation. Also,
the continuity of space is preserved during the process.
While PointNets have shown significant performance and
efficiency improvement in various 3D understanding tasks,
such as object classification and detection [20], it remains under-explored how this architecture can be used for the MVS task,
where the 3D scene is unknown to the network. In this paper,
we propose the PointFlow module, which estimates the 3D flow
based on joint 2D-3D features of point hypotheses.
3. Method
This section describes the detailed network architecture
of Point-MVSNet (Figure 2). Our method can be divided
into two steps: coarse depth prediction and iterative depth
[Figure 2 graphic: an image feature pyramid and a coarse depth prediction network produce a coarse depth map; unprojection and point hypotheses generation yield a feature-augmented point cloud; the PointFlow module (dynamic feature fetching, depth residual prediction) refines the depth map through iterative refinement, with losses on both the coarse and refined depth map predictions against the GT depth map.]
Figure 2: Overview of Point-MVSNet architecture. A coarse depth map is first predicted with low GPU memory and computation cost and
then unprojected to a point cloud along with hypothesized points. For each point, the feature is fetched from the multi-view image feature
pyramid dynamically. The PointFlow module uses the feature-augmented point cloud for depth residual prediction, and the depth map is
refined iteratively.
refinement. Let $I_0$ denote the reference image and $\{I_i\}_{i=1}^{N}$ denote a set of its neighbouring source images. We first generate a coarse depth map for $I_0$ (subsection 3.1). Since
the resolution is low, an existing volumetric MVS method is efficient enough and can be used. Second, we
introduce the 2D-3D feature lifting (subsection 3.2), which
associates the 2D image information with 3D geometry
priors. Then we propose our novel PointFlow module
(subsection 3.3) to iteratively refine the input depth map
to higher resolution with improved accuracy.
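For intuition on the unprojection step, the following is a minimal PyTorch sketch of lifting a depth map to a world-space point cloud; the function name and argument conventions are illustrative assumptions, not the released code's API:

```python
import torch

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map to a world-space point cloud (illustrative sketch).

    depth:        (H, W) depth map of the reference view
    K:            (3, 3) camera intrinsic matrix
    cam_to_world: (4, 4) camera-to-world extrinsic matrix
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # homogeneous pixels (H, W, 3)
    rays = pix @ torch.inverse(K).T             # camera-space rays at unit depth
    pts_cam = rays * depth.unsqueeze(-1)        # scale rays by per-pixel depth
    pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)      # homogeneous 3D points
    pts_world = pts_h @ cam_to_world.T          # transform to world space
    return pts_world[..., :3].reshape(-1, 3)    # (H*W, 3) point cloud
```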
3.1. Coarse depth prediction
Recently, learning-based MVS [12, 29, 11] achieves state-
of-the-art performance using multi-scale 3D CNNs on cost
volume regularization. However, this step can be extremely memory-expensive, as the memory requirement grows cubically with the cost volume resolution. Taking
memory and time into consideration, we use the recently
proposed MVSNet [29] to predict a relatively low-resolution
cost volume.
Given the images and corresponding camera parameters,
MVSNet [29] builds a 3D cost volume upon the reference
camera frustum. Then the initial depth map for reference
view is regressed through multi-scale 3D CNNs and the
soft argmin [15] operation. In MVSNet, feature maps are
downsampled to 1/4 of the original input image in each dimension, and the number of virtual depth planes is 256 for
both training and evaluation. On the other hand, in our coarse
depth estimation network, the cost volume is constructed
with feature maps of 1/8 the size of the reference image, containing 48 or 96 virtual depth planes for training and evaluation, respectively. Therefore, our memory usage of
this 3D feature volume is about 1/20 of that in MVSNet.
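For reference, the soft argmin depth regression mentioned above can be written in a few lines of PyTorch; this is a minimal sketch of the standard operation, not the authors' exact implementation:

```python
import torch

def soft_argmin_depth(cost_volume, depth_values):
    """Regress depth from a regularized cost volume via soft argmin.

    cost_volume:  (B, D, H, W) matching cost per virtual depth plane
    depth_values: (D,) candidate depths of the virtual planes
    """
    prob = torch.softmax(-cost_volume, dim=1)  # lower cost -> higher probability
    return (prob * depth_values.view(1, -1, 1, 1)).sum(dim=1)  # (B, H, W) expected depth
```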
3.2. 2D-3D feature lifting
Image Feature Pyramid
Learning-based image features
are demonstrated to be critical to boosting dense pixel
correspondence quality [29, 23]. In order to endow points
with a larger receptive field of contextual information at
multiple scales, we construct a 3-scale feature pyramid.
2D convolutional networks with stride 2 are applied to downsample the feature map, and the last layer before each downsampling is extracted to construct the final feature pyramid $F_i = [F_i^1, F_i^2, F_i^3]$ for image $I_i$. Similar to common
MVS methods [29, 11], feature pyramids are shared among
all input images.
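As an illustration of such a pyramid, a 3-scale extractor can be sketched in PyTorch as below; the channel widths and layer counts here are assumptions for illustration, not the paper's exact configuration:

```python
import torch.nn as nn

class FeaturePyramid(nn.Module):
    """3-scale feature pyramid; the map feeding each stride-2 conv is kept."""
    def __init__(self, in_ch=3, ch=(16, 32, 64)):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(in_ch, ch[0], 3, padding=1), nn.ReLU())
        self.down1 = nn.Conv2d(ch[0], ch[1], 3, stride=2, padding=1)  # to 1/2 resolution
        self.block2 = nn.Sequential(nn.Conv2d(ch[1], ch[1], 3, padding=1), nn.ReLU())
        self.down2 = nn.Conv2d(ch[1], ch[2], 3, stride=2, padding=1)  # to 1/4 resolution
        self.block3 = nn.Sequential(nn.Conv2d(ch[2], ch[2], 3, padding=1), nn.ReLU())

    def forward(self, img):
        f1 = self.block1(img)             # finest scale, kept before downsampling
        f2 = self.block2(self.down1(f1))  # middle scale
        f3 = self.block3(self.down2(f2))  # coarsest scale
        return [f1, f2, f3]               # F_i = [F_i^1, F_i^2, F_i^3]
```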
Dynamic Feature Fetching
The point feature used in our network is composed of the fetched multi-view image feature variance together with the normalized 3D coordinates in world space $X_p$. We introduce them separately below.
Image appearance features for each 3D point can
be fetched from the multi-view feature maps using a
differentiable unprojection given corresponding camera
parameters. Note that the features $F_i^1, F_i^2, F_i^3$ are at different
image resolutions, thus the camera intrinsic matrix should
be scaled at each level of the feature maps for correct feature
warping. Similar to MVSNet [29], we keep a variance-based
cost metric, i.e. the feature variance among different views,
to aggregate features warped from an arbitrary number of
views. For the pyramid feature at level $j$, the variance metric for $N$ views is defined as below:

$$C^j = \frac{\sum_{i=1}^{N} \left(F_i^j - \overline{F}^j\right)^2}{N}, \qquad (j = 1, 2, 3) \qquad (1)$$
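Read as code, the dynamic fetching plus Eq. (1) amounts to differentiable bilinear sampling at the projected point locations followed by a per-channel variance across views. A minimal PyTorch sketch (assuming the projected coordinates are already normalized to [-1, 1], as grid_sample expects) might be:

```python
import torch
import torch.nn.functional as F

def fetch_and_aggregate(feat_maps, uv):
    """Fetch per-point features from N views and aggregate them by Eq. (1).

    feat_maps: (N, C, H, W) level-j feature maps F_i^j of the N views
    uv:        (N, P, 2) projected point coordinates, normalized to [-1, 1]
    """
    sampled = F.grid_sample(feat_maps, uv.unsqueeze(1), align_corners=True)  # (N, C, 1, P)
    feats = sampled.squeeze(2)                # (N, C, P) per-view point features
    mean = feats.mean(dim=0, keepdim=True)    # average feature across views
    return ((feats - mean) ** 2).mean(dim=0)  # (C, P) variance cost C^j
```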