Monocular Depth Estimation + DSO
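Combining monocular depth estimation with DSO (Direct Sparse Odometry) is an effective way to address the scale drift of traditional monocular visual odometry (VO). In this paper, the authors use deep learning to predict depth from monocular images and integrate the predictions into DSO, yielding a method called Deep Virtual Stereo Odometry (DVSO). The method overcomes limitations of geometry-based monocular VO and does not need significant motion parallax between successive frames for motion estimation and 3D reconstruction.
The design of the depth network is key. To predict depth, the authors design a novel deep neural network that refines the depth predicted from a single image in a two-stage process. The network is trained in a semi-supervised way: on photoconsistency in stereo images, and on consistency with the accurate sparse depth reconstructions produced by Stereo DSO. On the KITTI benchmark, this depth prediction outperforms existing monocular depth estimation methods.
DVSO is clearly more accurate than previous monocular and deep-learning-based VO methods. It relies on a single camera, yet achieves performance comparable to state-of-the-art stereo methods. This shows that combining deep learning with DSO enables accurate localization and scene understanding even in a monocular setup.
Monocular depth estimation is generally difficult because depth must be recovered from a 2D image without any direct depth measurement. A deep network can learn depth patterns and contextual cues from large amounts of data and thereby produce more accurate depth predictions. DSO is a direct visual odometry method: it minimizes photometric error at the pixel level instead of relying on feature matching. Combining the two turns the depth predictions into "virtual stereo" measurements that resolve the scale ambiguity of a monocular system.
The paper also points to applications in autonomous driving, robot navigation, and augmented reality. Because no complex hardware is needed and a single camera suffices for accurate localization, the approach is especially attractive for resource-constrained mobile platforms.
In summary, the paper proposes a framework that combines deep learning with direct visual odometry to improve the accuracy and robustness of monocular VO. The method advances monocular depth prediction and is competitive in practice, offering a new direction for monocular visual localization and 3D reconstruction. Future work may explore better network architectures, improved real-time performance, and extensions to more complex environments and dynamic scenes.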
Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry
Nan Yang 1,2 [0000-0002-1497-9630], Rui Wang 1,2 [0000-0002-2252-9955], Jörg Stückler 1 [0000-0002-2328-4363], and Daniel Cremers 1,2
1 Technical University of Munich
2 Artisense
{yangn,wangr,stueckle,cremers}@in.tum.de
Abstract. Monocular visual odometry approaches that purely rely on geometric cues are prone to scale drift and require sufficient motion parallax in successive frames for motion estimation and 3D reconstruction. In this paper, we propose to leverage deep monocular depth prediction to overcome limitations of geometry-based monocular visual odometry. To this end, we incorporate deep depth predictions into Direct Sparse Odometry (DSO) as direct virtual stereo measurements. For depth prediction, we design a novel deep network that refines predicted depth from a single image in a two-stage process. We train our network in a semi-supervised way on photoconsistency in stereo images and on consistency with accurate sparse depth reconstructions from Stereo DSO. Our deep predictions outperform state-of-the-art approaches for monocular depth on the KITTI benchmark. Moreover, our Deep Virtual Stereo Odometry clearly exceeds previous monocular and deep-learning based methods in accuracy. It even achieves comparable performance to the state-of-the-art stereo methods, while only relying on a single camera.
Keywords: Monocular depth estimation · Monocular visual odometry · Semi-supervised learning
1 Introduction
Visual odometry (VO) is a highly active field of research in computer vision with a plethora of applications in domains such as autonomous driving, robotics, and augmented reality. VO with a single camera using traditional geometric approaches inherently suffers from the fact that camera trajectory and map can only be estimated up to an unknown scale, which also leads to scale drift. Moreover, sufficient motion parallax is required to estimate motion and structure from successive frames. To avoid these issues, typically more complex sensors such as active depth cameras or stereo rigs are employed. However, these sensors require larger efforts in calibration and increase the costs of the vision system. Metric depth can also be recovered from a single image if a-priori knowledge about the typical sizes or appearances of objects is used. Deep learning based approaches tackle this by training deep neural networks on large amounts of data.
Fig. 1: DVSO achieves monocular visual odometry on KITTI on par with state-of-the-art stereo methods. It uses deep-learning based left-right disparity predictions (lower left) for initialization and virtual stereo constraints in an optimization-based direct visual odometry pipeline. This allows for recovering accurate metric estimates.
In this paper, we propose a novel approach to monocular visual odometry, Deep Virtual Stereo Odometry (DVSO), which incorporates deep depth predictions into a geometric monocular odometry pipeline. We use deep stereo disparity for virtual direct image alignment constraints within a framework for windowed direct bundle adjustment (e.g. Direct Sparse Odometry [8]). DVSO achieves comparable performance to the state-of-the-art stereo visual odometry systems on the KITTI odometry benchmark. It can even outperform the state-of-the-art geometric VO methods when tuning scale-dependent parameters such as the virtual stereo baseline.
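The role of the virtual stereo baseline can be made concrete with the standard rectified-stereo relation depth = focal length × baseline / disparity. The following is a minimal sketch assuming a pinhole camera and a rectified (virtual) stereo pair. The function name, focal length, and baseline value are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Convert disparity (pixels) to depth (meters) for a rectified stereo pair.

    Pixels with (near-)zero disparity are mapped to infinite depth.
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > eps
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Illustrative values only (hypothetical camera, not from the paper).
depth = disparity_to_depth([[10.0, 20.0], [0.0, 40.0]], focal_px=700.0, baseline_m=0.5)
depth_double_baseline = disparity_to_depth([[10.0]], focal_px=700.0, baseline_m=1.0)
```

Depth scales linearly with the assumed baseline, so the choice of virtual baseline fixes the metric scale of the reconstruction. This is consistent with the text treating it as a scale-dependent parameter.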
As an additional contribution, we propose a novel stacked residual network architecture that refines disparity estimates in two stages and is trained in a semi-supervised way. In typical supervised learning approaches [6,25,24], depth ground truth needs to be acquired for training with active sensors like RGB-D cameras and 3D laser scanners which are costly to obtain. Requiring a large amount of such labeled data is an additional burden that limits generalization to new environments. Self-supervised [11,14] and unsupervised learning approaches [49], on the other hand, overcome this limitation and do not require additional active sensors. Commonly, they train the networks on photometric consistency, for example in stereo imagery [11,14], which reduces the effort for collecting training data. Still, the current self-supervised approaches are not as accurate as supervised methods [23]. We combine self-supervised and supervised training, but avoid the costly collection of LiDAR data in our approach. Instead, we make use of Stereo Direct Sparse Odometry (Stereo DSO [40]) to provide accurate sparse 3D reconstructions on the training set. Our deep depth prediction network outperforms the current state-of-the-art methods on KITTI.
A video demonstrating our methods as well as the results is available at https://youtu.be/sLZOeC9z_tw.
1.1 Related Work
Deep learning for monocular depth estimation. Deep learning based approaches have recently achieved great advances in monocular depth estimation.
Employing deep neural networks avoids the hand-crafted features used in previous methods [36,19]. Supervised deep learning [6,25,24] has recently shown great success for monocular depth estimation. Eigen et al. [6,5] propose a two-scale CNN architecture which directly predicts the depth map from a single image. Laina et al. [24] propose a residual network [17] based fully convolutional encoder-decoder architecture [27] with a robust regression loss function. The aforementioned supervised learning approaches need large amounts of ground-truth depth data for training. Self-supervised approaches [11,44,14] overcome this limitation by exploiting photoconsistency and geometric constraints to define loss functions, for example, in a stereo camera setup. This way, only stereo images are needed for training, which are typically easier to obtain than accurate depth measurements from active sensors such as 3D lasers or RGB-D cameras. Godard et al. [14] achieve the state-of-the-art depth estimation accuracy for a fully self-supervised approach. The semi-supervised scheme proposed by Kuznietsov et al. [23] combines the self-supervised loss with supervision from sparse LiDAR ground truth. They do not need multi-scale depth supervision or left-right consistency in their loss, and achieve better performance than the self-supervised approach in [14]. The limitation of this semi-supervised approach is the requirement for LiDAR data, which are costly to collect. In our approach we use Stereo Direct Sparse Odometry to obtain sparse depth ground truth for semi-supervised training. Since the extracted depth maps are even sparser than LiDAR data, we also employ multi-scale self-supervised training and left-right consistency as in Godard et al. [14]. Inspired by [20,34], we design a stacked network architecture leveraging the concept of residual learning [17].
Fig. 2: Overview of StackNet architecture (diagram labels: StackNet, SimpleNet, ResidualNet, Outputs).
Deep learning for VO / SLAM. In recent years, large progress has been achieved in the development of monocular VO and SLAM methods [31,9,8,32]. Due to projective geometry, metric scale cannot be observed with a single camera [37] which introduces scale drift. A popular approach is hence to use stereo cameras for VO [10,8,31] which avoid scale ambiguity and leverage stereo matching with a fixed baseline for estimating 3D structure. While stereo VO delivers more reliable depth estimation, it requires self-calibration for long-term operation [4,46]. The integration of a second camera also introduces additional costs. Some recent monocular VO approaches have integrated monocular depth estimation [46,39] to recover the metric scale by scale-matching. CNN-SLAM [39] extends LSD-SLAM [9] by predicting depth with a CNN and refining the depth maps using Bayesian filtering [9,7]. Their method shows superior performance over monocular SLAM [9,30,45,35] on indoor datasets [15,38]. Yin et al. [46] propose to use convolutional neural fields and consecutive frames to improve the monocular depth estimation from a CNN. Camera motion is estimated using the refined depth. CodeSLAM [2] focuses on the challenge of dense 3D reconstruction. It jointly optimizes a learned compact representation of the dense geometry with camera poses. Our work tackles the problem of odometry with monocular cameras and integrates deep depth prediction with multi-view stereo to improve camera pose estimation. Another line of research trains networks to directly predict the ego-motion end-to-end using supervised [41] or unsupervised learning [49,26]. However, the estimated ego-motion of these methods is still by far inferior to geometric visual odometry approaches. In our approach, we phrase visual odometry as a geometric optimization problem but incorporate photoconsistency constraints with state-of-the-art deep monocular depth predictions into the optimization. This way, we obtain a highly accurate monocular visual odometry that is not prone to scale drift and achieves comparable results to traditional stereo VO methods.
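The statement that metric scale cannot be observed with a single camera can be checked numerically. The sketch below assumes an ideal pinhole camera with no rotation between views. The intrinsics, point coordinates, and function names are made up for illustration. It shows that scaling the scene and the camera translation by the same factor leaves every image unchanged.

```python
import numpy as np

def project(points_cam, focal_px=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([focal_px * x / z + cx, focal_px * y / z + cy], axis=1)

def observe(points_world, translation):
    """Project world points into a camera shifted by `translation` (no rotation)."""
    return project(points_world - translation)

# Hypothetical scene and camera motion.
points = np.array([[1.0, 0.5, 4.0], [-0.5, 0.2, 6.0], [0.3, -0.4, 8.0]])
t = np.array([0.2, 0.0, 0.5])
scale = 3.0

# First view (camera at the origin) and second view (camera moved by t),
# once for the original scene and once for the uniformly scaled scene.
first_view_same = np.allclose(project(points), project(scale * points))
second_view_same = np.allclose(observe(points, t), observe(scale * points, scale * t))
```

Both comparisons hold for any positive `scale`, which is why a purely geometric monocular pipeline needs extra information, such as a known stereo baseline or a learned depth prior, to fix the metric scale.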
2 Semi-Supervised Deep Monocular Depth Estimation
In this section, we will introduce our semi-supervised approach to deep monocular depth estimation. It builds on three key ingredients: self-supervised learning from photoconsistency in a stereo setup similar to [14], supervised learning based on accurate sparse depth reconstruction by Stereo DSO, and two-stage refinement of the network predictions in a stacked encoder-decoder architecture.
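The first two ingredients can be sketched as a single training objective. The following is a simplified illustration only and not the loss used in the paper. It assumes rectified grayscale stereo images, nearest-pixel warping, plain L1 terms, and hypothetical weights `w_photo` and `w_sup`. All function names are invented for the example.

```python
import numpy as np

def warp_right_to_left(right, disp_left):
    """Reconstruct the left image by sampling the right image at column x - d."""
    h, w = right.shape
    cols = np.arange(w)[None, :] - np.rint(disp_left).astype(int)
    cols = np.clip(cols, 0, w - 1)  # clamp samples that fall outside the image
    rows = np.arange(h)[:, None]
    return right[rows, cols]

def semi_supervised_loss(left, right, disp_left, sparse_disp, w_photo=1.0, w_sup=1.0):
    """Photoconsistency term plus supervision at pixels with a sparse target.

    `sparse_disp` holds target disparities (e.g. from a stereo odometry
    system); entries equal to zero mean "no target available".
    """
    photo = np.mean(np.abs(left - warp_right_to_left(right, disp_left)))
    valid = sparse_disp > 0
    sup = np.mean(np.abs(disp_left[valid] - sparse_disp[valid])) if valid.any() else 0.0
    return w_photo * photo + w_sup * sup

# Synthetic stereo pair with a constant true disparity of 2 pixels.
rng = np.random.default_rng(0)
right = rng.random((4, 8))
true_disp = np.full((4, 8), 2.0)
left = warp_right_to_left(right, true_disp)
sparse_disp = np.zeros((4, 8))
sparse_disp[1, 3] = 2.0
sparse_disp[2, 6] = 2.0

loss_good = semi_supervised_loss(left, right, true_disp, sparse_disp)
loss_bad = semi_supervised_loss(left, right, np.zeros((4, 8)), sparse_disp)
```

The correct disparity yields zero loss on this synthetic pair, while an all-zero prediction is penalized by both terms. A practical implementation would use differentiable bilinear sampling inside a deep learning framework instead of nearest-pixel indexing.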
2.1 Network Architecture
We coin our architecture StackNet since it stacks two sub-networks, SimpleNet and ResidualNet, as depicted in Figure 2. Both sub-networks are fully convolutional deep neural networks adopted from DispNet [28] with an encoder-decoder scheme. ResidualNet has fewer layers and takes the outputs of SimpleNet as inputs. Its purpose is to refine the disparity maps predicted by SimpleNet by learning an additive residual signal. Similar residual learning architectures have been successfully applied to related deep learning tasks [20,34]. The detailed network architecture is illustrated in the supplementary material.
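The two-stage composition described above amounts to adding a learned correction to a first estimate. The sketch below shows only that data flow. The two "networks" are toy stand-in functions invented for the example, and the exact inputs of the real ResidualNet are given in the paper's supplementary material, not here.

```python
import numpy as np

def stack_predict(image, simple_net, residual_net):
    """Two-stage prediction: coarse disparity plus an additive residual."""
    disp_simple = simple_net(image)
    residual = residual_net(image, disp_simple)
    return disp_simple + residual

def toy_simple_net(image):
    # Stand-in for the first stage: one constant disparity for the whole image.
    return np.full(image.shape, image.mean())

def toy_residual_net(image, disp_simple):
    # Stand-in for the second stage: a small per-pixel correction.
    return 0.1 * (image - disp_simple)

image = np.array([[1.0, 2.0], [3.0, 4.0]])
disp_final = stack_predict(image, toy_simple_net, toy_residual_net)
```

Predicting a residual rather than a full disparity map means the second stage only has to model the error of the first, which is the motivation usually given for residual learning.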
SimpleNet. SimpleNet is an encoder-decoder architecture with a ResNet-50 based encoder and skip connections between corresponding encoder and decoder layers. The decoder upprojects the feature maps to the original resolution and generates 4 pairs of disparity maps $disp^{left}_{simple,s}$ and $disp^{right}_{simple,s}$ in different resolutions $s \in [0, 3]$. The upprojection is implemented by resize-convolution [33], i.e. a nearest-neighbor upsampling layer by a factor of two followed by a convolutional layer. The usage of skip connections enables the decoder to recover high-resolution results with fine-grained details.
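Resize-convolution as described here (nearest-neighbor upsampling by a factor of two, then a convolution) can be sketched in plain NumPy. This is a single-channel illustration with an odd-sized kernel and zero padding. The function names and the kernel are assumptions for the example, whereas a real decoder would use multi-channel learned filters in a deep learning framework.

```python
import numpy as np

def upsample_nearest_2x(x):
    """Nearest-neighbor upsampling of an (H, W) map by a factor of two."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size).

    Assumes an odd kernel height and width.
    """
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def resize_conv(x, kernel):
    """Upsample first, then convolve."""
    return conv2d_same(upsample_nearest_2x(x), kernel)

features = np.array([[1.0, 2.0], [3.0, 4.0]])
identity_kernel = np.zeros((3, 3))
identity_kernel[1, 1] = 1.0  # passes the upsampled map through unchanged
upsampled = resize_conv(features, identity_kernel)
```

With the identity kernel the output is just the 4×4 nearest-neighbor upsampling of the input, and a learned kernel would then smooth or sharpen that result.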