
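Combining monocular depth estimation with DSO (Direct Sparse Odometry) is an effective way to address the scale drift of conventional monocular visual odometry (VO). The authors use deep learning to predict depth from a single image and integrate the prediction into DSO, yielding a method called Deep Virtual Stereo Odometry (DVSO). The method overcomes limitations of purely geometric monocular VO and does not need significant parallax between successive frames for motion estimation and 3D reconstruction.

The depth network is the key component. The authors design a novel deep neural network that refines the depth predicted from a single image in a two-stage process. The network is trained in a semi-supervised way: on photoconsistency in stereo images, and on consistency with the accurate sparse depth reconstructions produced by Stereo DSO. On the KITTI benchmark, this depth prediction outperforms existing monocular depth estimation methods.

DVSO is clearly more accurate than previous monocular and deep-learning based VO methods. It relies on a single camera, yet achieves performance comparable to state-of-the-art stereo methods.

Monocular depth estimation is difficult because depth must be recovered from a 2D image without any direct depth measurement. A deep network can learn depth patterns and contextual cues from large amounts of data and so provide more accurate predictions. DSO is a direct VO method: it minimizes a photometric error on pixel intensities instead of relying on feature matching. Combining the two turns the predicted depth into "virtual stereo" measurements that resolve the scale ambiguity of a monocular system.

The paper also names autonomous driving, robot navigation, and augmented reality as application domains. Because a single camera suffices and no complex hardware is needed, the approach is particularly attractive for resource-constrained mobile platforms.

In summary, the paper proposes a framework that combines deep learning with direct visual odometry to improve the accuracy and robustness of monocular VO. Future work may explore optimizing the network architecture, improving real-time performance, and extending the method to more complex environments and dynamic scenes.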

Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry

Nan Yang 1,2 [0000-0002-1497-9630], Rui Wang 1,2 [0000-0002-2252-9955], Jörg Stückler 1 [0000-0002-2328-4363], and Daniel Cremers 1,2

1 Technical University of Munich
2 Artisense

{yangn,wangr,stueckle,cremers}@in.tum.de

Abstract. Monocular visual odometry approaches that purely rely on geometric cues are prone to scale drift and require sufficient motion parallax in successive frames for motion estimation and 3D reconstruction. In this paper, we propose to leverage deep monocular depth prediction to overcome limitations of geometry-based monocular visual odometry. To this end, we incorporate deep depth predictions into Direct Sparse Odometry (DSO) as direct virtual stereo measurements. For depth prediction, we design a novel deep network that refines predicted depth from a single image in a two-stage process. We train our network in a semi-supervised way on photoconsistency in stereo images and on consistency with accurate sparse depth reconstructions from Stereo DSO. Our deep predictions outperform state-of-the-art approaches for monocular depth on the KITTI benchmark. Moreover, our Deep Virtual Stereo Odometry clearly exceeds previous monocular and deep-learning based methods in accuracy. It even achieves comparable performance to the state-of-the-art stereo methods, while only relying on a single camera.

Keywords: Monocular depth estimation · Monocular visual odometry · Semi-supervised learning

1 Introduction

Visual odometry (VO) is a highly active field of research in computer vision with a plethora of applications in domains such as autonomous driving, robotics, and augmented reality. VO with a single camera using traditional geometric approaches inherently suffers from the fact that camera trajectory and map can only be estimated up to an unknown scale, which also leads to scale drift. Moreover, sufficient motion parallax is required to estimate motion and structure from successive frames. To avoid these issues, more complex sensors such as active depth cameras or stereo rigs are typically employed. However, these sensors require more calibration effort and increase the cost of the vision system.
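The scale ambiguity described above can be checked numerically. The following is a minimal sketch, assuming an ideal pinhole camera; the intrinsics, pose, and 3D points are made-up illustrative values, not data from the paper. Scaling the scene and the camera translation by the same factor leaves every projected pixel unchanged, so image measurements alone cannot fix the scale.

```python
import numpy as np

def project(points_world, R, t, K):
    """Project Nx3 world points into a pinhole camera with pose (R, t) and intrinsics K."""
    points_cam = points_world @ R.T + t          # world frame -> camera frame
    pixels_h = points_cam @ K.T                  # apply intrinsics (homogeneous pixels)
    return pixels_h[:, :2] / pixels_h[:, 2:3]    # perspective division

# Illustrative values (assumptions for this sketch only).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.5, 0.0, 0.1])
points = np.array([[1.0, 0.5, 5.0],
                   [-2.0, 1.0, 8.0],
                   [0.3, -0.7, 12.0]])

scale = 3.7  # arbitrary global scale factor
pixels_original = project(points, R, t, K)
pixels_scaled = project(scale * points, R, scale * t, K)

# Both configurations produce the same image measurements.
scale_is_unobservable = np.allclose(pixels_original, pixels_scaled)
```

A stereo rig removes this ambiguity because its baseline is a known metric length, which is the role the predicted disparities are meant to play in a monocular setting.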

Metric depth can also be recovered from a single image if a-priori knowledge about the typical sizes or appearances of objects is used. Deep learning based approaches tackle this by training deep neural networks on large amounts of data. In this paper, we propose a novel approach to monocular visual odometry, Deep Virtual Stereo Odometry (DVSO), which incorporates deep depth predictions into a geometric monocular odometry pipeline. We use deep stereo disparity for virtual direct image alignment constraints within a framework for windowed direct bundle adjustment (e.g. Direct Sparse Odometry [8]). DVSO achieves comparable performance to the state-of-the-art stereo visual odometry systems on the KITTI odometry benchmark. It can even outperform the state-of-the-art geometric VO methods when tuning scale-dependent parameters such as the virtual stereo baseline.

Fig. 1: DVSO achieves monocular visual odometry on KITTI on par with state-of-the-art stereo methods. It uses deep-learning based left-right disparity predictions (lower left) for initialization and virtual stereo constraints in an optimization-based direct visual odometry pipeline. This allows for recovering accurate metric estimates.
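To make the role of the baseline concrete, here is a minimal sketch of the standard rectified-stereo relation depth = f · b / d, which links a disparity d (in pixels) to metric depth through the focal length f and the baseline b. The function name and the numeric values for `focal_px` and `baseline_m` are assumptions chosen for illustration; how DVSO uses the predicted disparities inside the optimization is described later in the paper, not in this excerpt.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map in pixels to depth in metres via depth = f * b / d."""
    disparity = np.maximum(np.asarray(disparity_px, dtype=np.float64), min_disparity)
    return focal_px * baseline_m / disparity

# Hypothetical camera parameters (assumed for this sketch).
focal_px = 720.0    # focal length in pixels
baseline_m = 0.54   # (virtual) stereo baseline in metres

disparity = np.array([[38.88, 19.44],
                      [9.72, 3.888]])
depth = disparity_to_depth(disparity, focal_px, baseline_m)

# Depth scales linearly with the baseline, which makes it a scale-dependent parameter.
depth_double_baseline = disparity_to_depth(disparity, focal_px, 2.0 * baseline_m)
```

Because depth is proportional to the baseline, changing the assumed baseline rescales every depth by the same factor.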

As an additional contribution, we propose a novel stacked residual network architecture that refines disparity estimates in two stages and is trained in a semi-supervised way. In typical supervised learning approaches [6,25,24], depth ground truth for training needs to be acquired with active sensors such as RGB-D cameras and 3D laser scanners, which is costly. Requiring a large amount of such labeled data is an additional burden that limits generalization to new environments. Self-supervised [11,14] and unsupervised learning approaches [49], on the other hand, overcome this limitation and do not require additional active sensors. Commonly, they train the networks on photometric consistency, for example in stereo imagery [11,14], which reduces the effort for collecting training data. Still, the current self-supervised approaches are not as accurate as supervised methods [23]. We combine self-supervised and supervised training, but avoid the costly collection of LiDAR data in our approach. Instead, we make use of Stereo Direct Sparse Odometry (Stereo DSO [40]) to provide accurate sparse 3D reconstructions on the training set. Our deep depth prediction network outperforms the current state-of-the-art methods on KITTI.

A video demonstrating our methods as well as the results is available at

https://youtu.be/sLZOeC9z_tw.

1.1 Related Work

Deep learning for monocular depth estimation. Deep learning based approaches have recently achieved great advances in monocular depth estimation. Employing deep neural networks avoids the hand-crafted features used in previous methods [36,19]. Supervised deep learning [6,25,24] has recently shown great success for monocular depth estimation. Eigen et al. [6,5] propose a two-scale CNN architecture which directly predicts the depth map from a single image. Laina et al. [24] propose a residual network [17] based fully convolutional encoder-decoder architecture [27] with a robust regression loss function. The aforementioned supervised learning approaches need large amounts of ground-truth depth data for training. Self-supervised approaches [11,44,14] overcome this limitation by exploiting photoconsistency and geometric constraints to define loss functions, for example, in a stereo camera setup. This way, only stereo images are needed for training, which are typically easier to obtain than accurate depth measurements from active sensors such as 3D lasers or RGB-D cameras. Godard et al. [14] achieve the state-of-the-art depth estimation accuracy for a fully self-supervised approach. The semi-supervised scheme proposed by Kuznietsov et al. [23] combines the self-supervised loss with supervision from sparse LiDAR ground truth. They do not need multi-scale depth supervision or left-right consistency in their loss, and achieve better performance than the self-supervised approach in [14]. The limitation of this semi-supervised approach is the requirement for LiDAR data, which are costly to collect. In our approach we use Stereo Direct Sparse Odometry to obtain sparse depth ground truth for semi-supervised training. Since the extracted depth maps are even sparser than LiDAR data, we also employ multi-scale self-supervised training and left-right consistency as in Godard et al. [14]. Inspired by [20,34], we design a stacked network architecture leveraging the concept of residual learning [17].

Deep learning for VO / SLAM. In recent years, great progress has been achieved in the development of monocular VO and SLAM methods [31,9,8,32]. Due to projective geometry, metric scale cannot be observed with a single camera [37], which introduces scale drift. A popular approach is hence to use stereo cameras for VO [10,8,31], which avoid scale ambiguity and leverage stereo matching with a fixed baseline for estimating 3D structure. While stereo VO delivers more reliable depth estimation, it requires self-calibration for long-term operation [4,46]. The integration of a second camera also introduces additional costs. Some recent monocular VO approaches have integrated monocular depth estimation [46,39] to recover the metric scale by scale-matching. CNN-SLAM [39] extends LSD-SLAM [9] by predicting depth with a CNN and refining the depth maps using Bayesian filtering [9,7]. Their method shows superior performance over monocular SLAM [9,30,45,35] on indoor datasets [15,38]. Yin et al. [46] propose to use convolutional neural fields and consecutive frames to improve the monocular depth estimation from a CNN. Camera motion is estimated using the refined depth. CodeSLAM [2] focuses on the challenge of dense 3D reconstruction. It jointly optimizes a learned compact representation of the dense geometry with camera poses. Our work tackles the problem of odometry with monocular cameras and integrates deep depth prediction with multi-view stereo to improve camera pose estimation. Another line of research trains networks to directly predict the ego-motion end-to-end using supervised [41] or unsupervised learning [49,26]. However, the estimated ego-motion of these methods is still by far inferior to geometric visual odometry approaches. In our approach, we phrase visual odometry as a geometric optimization problem but incorporate photoconsistency constraints with state-of-the-art deep monocular depth predictions into the optimization. This way, we obtain a highly accurate monocular visual odometry that is not prone to scale drift and achieves comparable results to traditional stereo VO methods.

Fig. 2: Overview of StackNet architecture (panel labels: SimpleNet, ResidualNet, StackNet, Outputs).

2 Semi-Supervised Deep Monocular Depth Estimation

In this section, we will introduce our semi-supervised approach to deep monocular depth estimation. It builds on three key ingredients: self-supervised learning from photoconsistency in a stereo setup similar to [14], supervised learning based on accurate sparse depth reconstruction by Stereo DSO, and two-stage refinement of the network predictions in a stacked encoder-decoder architecture.
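The first two ingredients can be illustrated with a toy loss. This is a minimal sketch under simplifying assumptions, not the loss used in the paper: it uses single-channel images, plain L1 terms, a horizontal warp with linear interpolation, and a hypothetical weight `supervised_weight`. All function and variable names are invented for this example.

```python
import numpy as np

def warp_right_to_left(right_image, left_disparity):
    """Reconstruct the left view by sampling the right image at x - d (rectified stereo)."""
    height, width = right_image.shape
    x_coords = np.arange(width, dtype=np.float64)
    reconstructed = np.empty_like(right_image, dtype=np.float64)
    for row in range(height):
        # Clamp sample positions to the image so border pixels stay defined.
        sample_x = np.clip(x_coords - left_disparity[row], 0.0, width - 1.0)
        reconstructed[row] = np.interp(sample_x, x_coords, right_image[row])
    return reconstructed

def semi_supervised_loss(left_image, right_image, predicted_disparity,
                         sparse_disparity, sparse_mask, supervised_weight=1.0):
    """Return (total, photometric, supervised) for one stereo pair."""
    # Self-supervised term: photoconsistency between left image and warped right image.
    reconstruction = warp_right_to_left(right_image, predicted_disparity)
    photometric = np.mean(np.abs(left_image - reconstruction))
    # Supervised term: L1 error only at pixels that have a sparse reference disparity.
    num_valid = max(int(sparse_mask.sum()), 1)
    supervised = np.sum(np.abs(predicted_disparity - sparse_disparity) * sparse_mask) / num_valid
    return photometric + supervised_weight * supervised, photometric, supervised

# Toy data: a horizontal intensity ramp seen with a constant true disparity of 2 px.
height, width = 4, 8
right_image = np.tile(np.arange(width, dtype=np.float64), (height, 1))
true_disparity = np.full((height, width), 2.0)
left_image = warp_right_to_left(right_image, true_disparity)

# Sparse reference disparities at two pixels (standing in for a sparse reconstruction).
sparse_mask = np.zeros((height, width))
sparse_mask[1, 3] = 1.0
sparse_mask[2, 5] = 1.0
sparse_disparity = true_disparity * sparse_mask

loss_true, _, _ = semi_supervised_loss(
    left_image, right_image, true_disparity, sparse_disparity, sparse_mask)
loss_wrong, photo_wrong, sup_wrong = semi_supervised_loss(
    left_image, right_image, true_disparity + 1.0, sparse_disparity, sparse_mask)
```

In this toy setup the correct disparity gives zero loss, while a prediction that is off by one pixel is penalized by both terms. The supervised term only touches the masked pixels, which is why the dense photometric term is still needed when the reference depth is sparse.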

2.1 Network Architecture

We coin our architecture StackNet since it stacks two sub-networks, SimpleNet and ResidualNet, as depicted in Figure 2. Both sub-networks are fully convolutional deep neural networks adopted from DispNet [28] with an encoder-decoder scheme. ResidualNet has fewer layers and takes the outputs of SimpleNet as inputs. Its purpose is to refine the disparity maps predicted by SimpleNet by learning an additive residual signal. Similar residual learning architectures have been successfully applied to related deep learning tasks [20,34]. The detailed network architecture is illustrated in the supplementary material.
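The two-stage data flow can be sketched as follows. This is only a structural illustration: `simple_net` and `residual_net` are hypothetical stand-ins that return fixed toy values instead of running trained networks, and the exact inputs of ResidualNet are given in the supplementary material, not here. The one point the sketch captures is that the refined disparity is the coarse prediction plus an additive residual.

```python
import numpy as np

def simple_net(image):
    """Hypothetical stand-in for SimpleNet: returns a coarse disparity map."""
    return np.full(image.shape, 10.0)

def residual_net(image, coarse_disparity):
    """Hypothetical stand-in for ResidualNet: returns an additive correction."""
    # A real network would learn this; here it is a fixed toy function of the image.
    return 0.1 * (image - image.mean())

def stack_net(image):
    """Two-stage prediction: coarse disparity, then additive residual refinement."""
    coarse = simple_net(image)
    residual = residual_net(image, coarse)
    refined = coarse + residual
    return coarse, residual, refined

image = np.arange(12, dtype=np.float64).reshape(3, 4)
coarse, residual, refined = stack_net(image)
```

Learning a residual rather than a full second prediction means the second stage only has to model the correction, and it starts from the identity mapping when the residual is zero.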

SimpleNet. SimpleNet is an encoder-decoder architecture with a ResNet-50 based encoder and skip connections between corresponding encoder and decoder layers. The decoder up-projects the feature maps to the original resolution and generates 4 pairs of disparity maps $disp^{left}_{simple,s}$ and $disp^{right}_{simple,s}$ in different resolutions $s \in [0, 3]$. The up-projection is implemented by resize-convolution [33], i.e. a nearest-neighbor upsampling layer by a factor of two followed by a convolutional layer. The usage of skip connections enables the decoder to recover high-resolution results with fine-grained details.
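Resize-convolution as described above can be sketched in a few lines. This is a minimal single-channel NumPy illustration with a hand-written zero-padded convolution; the function names and the kernel are assumptions for the example, and a real decoder would use learned multi-channel kernels from a deep learning framework.

```python
import numpy as np

def upsample_nearest_2x(feature_map):
    """Nearest-neighbor upsampling by a factor of two along both spatial axes."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

def conv2d_same(feature_map, kernel):
    """Single-channel 2D cross-correlation with zero padding (output keeps input size)."""
    k_h, k_w = kernel.shape
    pad_h, pad_w = k_h // 2, k_w // 2
    padded = np.pad(feature_map, ((pad_h, pad_h), (pad_w, pad_w)))
    out = np.zeros_like(feature_map, dtype=np.float64)
    rows, cols = feature_map.shape
    for dy in range(k_h):
        for dx in range(k_w):
            # Accumulate the shifted input weighted by the kernel entry.
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out

def resize_convolution(feature_map, kernel):
    """Upsample first, then convolve: the two steps of a resize-convolution layer."""
    return conv2d_same(upsample_nearest_2x(feature_map), kernel)

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# With an identity kernel the layer reduces to plain nearest-neighbor upsampling.
identity_kernel = np.zeros((3, 3))
identity_kernel[1, 1] = 1.0
upsampled = resize_convolution(coarse, identity_kernel)

# A 3x3 averaging kernel smooths the blocky upsampled map.
smoothed = resize_convolution(coarse, np.full((3, 3), 1.0 / 9.0))
```

Separating the resize from the convolution gives every output pixel the same kernel coverage, which is the usual motivation for preferring it over a strided transposed convolution.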
