Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization

Stefan Leutenegger∗, Simon Lynen∗, Michael Bosse∗, Roland Siegwart∗, and Paul Furgale∗

Autonomous Systems Lab (ASL), ETH Zurich, Switzerland

Abstract

Combining visual and inertial measurements has become popular in mobile robotics, since the two sensing modalities

offer complementary characteristics that make them the ideal choice for accurate Visual-Inertial Odometry or Simultaneous

Localization and Mapping (SLAM). While historically the problem has been addressed with ﬁltering, advancements in

visual estimation suggest that non-linear optimization offers superior accuracy, while remaining tractable in complexity thanks

to the sparsity of the underlying problem. Taking inspiration from these ﬁndings, we formulate a rigorously probabilistic

cost function that combines reprojection errors of landmarks and inertial terms. The problem is kept tractable, thus

ensuring real-time operation, by limiting the optimization to a bounded window of keyframes through marginalization.

Keyframes may be spaced in time by arbitrary intervals, while still related by linearized inertial terms. We present evaluation

results on complementary datasets recorded with our custom-built stereo visual-inertial hardware that accurately synchronizes

accelerometer and gyroscope measurements with imagery. A comparison of both a stereo and monocular version of our

algorithm with and without online extrinsics estimation is shown with respect to ground truth. Furthermore, we compare the

performance to an implementation of a state-of-the-art stochastic cloning sliding-window filter. This competitive reference

implementation performs tightly-coupled filtering-based visual-inertial odometry. While our approach admittedly demands

more computation, we show its superior performance in terms of accuracy.

I. INTRODUCTION

Visual and inertial measurements offer complementary properties which make them particularly suitable for fusion,

in order to address robust and accurate localization and mapping, a primary need for any mobile robotic system. The

rich representation of structure projected into an image, together with the accurate short-term estimates by gyroscopes

and accelerometers contained in an IMU, have been acknowledged to complement each other, with promising use-cases

in airborne (Mourikis and Roumeliotis, 2007; Weiss et al., 2012) and automotive (Li and Mourikis, 2012a) navigation.

Moreover, with the availability of these sensors in most smartphones, there is great interest and research activity in effective

solutions to visual-inertial SLAM (Li et al., 2013).

(a) Side view of the ETH main building. (b) 3D view of the building.

Fig. 1. ETH main building indoor reconstruction of both structure and pose as resulting from our suggested visual-inertial odometry framework (stereo

variant in this case, including online camera extrinsics calibration). The stereo-vision plus IMU sensor was walked handheld for 470 m through loops on

three ﬂoors as well as through staircases.

Historically, there have been two main concepts towards approaching the visual-inertial estimation problem: batch non-

linear optimization methods and recursive ﬁltering methods. While the former jointly minimizes the error originating from

integrated IMU measurements and the (reprojection) errors from visual terms (Jung and Taylor, 2001), recursive algorithms

commonly use the IMU measurements for state propagation while updates originate from the visual observations (Chai

et al., 2002; Roumeliotis et al., 2002).

Batch approaches offer the advantage of repeated linearization of the inherently non-linear cost terms involved in the

visual-inertial state estimation problem and thus they limit linearization errors. For a long time, however, the lack of

computational resources made recursive algorithms a favorable choice for online estimation. Nevertheless, both paradigms

have recently shown improvements over and compromises towards the other, so that recent work (Leutenegger et al., 2013;

Nerurkar et al., 2013; Indelman et al., 2012) showed batch-based algorithms reaching real-time operation and

filtering-based methods providing results of nearly equal quality (Mourikis and Roumeliotis, 2007; Li et al., 2013) to batch-based

methods. Leaving aside computational demands, batch-based methods promise results of higher accuracy compared to

ﬁltering approaches, given the inherent algorithmic differences as discussed in detail later in this article.

Apart from the separation into batch and ﬁltering, the visual-inertial fusion approaches found in the literature can be

divided into two other categories: loosely-coupled systems independently estimate the pose by a vision only algorithm and

fuse IMU measurements only in a separate estimation step, limiting computational complexity. Tightly-coupled approaches

in contrast include both the measurements from the IMU and the camera into a common problem where all states are jointly

estimated, thus considering all correlations amongst them. Comparisons of both approaches, however, show (Leutenegger

et al., 2013) that these correlations are key for any high precision Visual-Inertial Navigation System (VINS), which is also

why all high-accuracy visual-inertial estimators presented recently have implemented a tightly-coupled VINS. For example,

Mourikis and Roumeliotis (2007) proposed an Extended Kalman Filter (EKF) based real-time fusion using monocular

vision, named Multi-State Constraint Kalman Filter (MSCKF). This work performs impressively with open loop errors

below 0.5% of the distance traveled. We therefore compare our results to a competitive implementation of this sliding

window ﬁlter with on-the-ﬂy feature marginalization as published by Mourikis et al. (2009). For simpler reference we

denote this algorithm by “MSCKF” in the rest of the article, keeping in mind that the available reference implementation

does not include all of the possible modiﬁcations from (Li and Mourikis, 2012a,b; Li et al., 2013; Hesch et al., 2013).

In this article, which extends our previous work (Leutenegger et al., 2013), we propose a method that respects the

aforementioned ﬁndings: we advocate tightly-coupled fusion for best exploitation of all measurements and nonlinear

optimization where possible rather than ﬁltering, in order to reduce suboptimality due to linearization. Furthermore, the

optimization approach allows for employing robust cost functions which may drastically increase accuracy in the presence

of outliers, which occasionally occur (mostly in the visual part) even after application of sophisticated rejection schemes.

We devise a cost function that combines visual and inertial terms in a fully probabilistic manner. We adopt the concept of

keyframes due to its successful application in classical vision-only approaches: it is implemented using partial linearization

and marginalization, i.e. variable elimination—a compromise towards ﬁltering that is made for real-time compliance and

tractability. The keyframe paradigm also allows for drift-free estimation when motion is slow or absent altogether:

rather than using an optimization window of time-successive poses, our kept keyframes may be spaced arbitrarily far in

time, keeping visual constraints—while still incorporating an IMU term. Our formulation of relative uncertainty between

keyframes takes inspiration from RSLAM (Mei et al., 2011), although our parameterization uses global coordinates. We

provide a strictly probabilistic derivation of IMU error terms and the respective information matrix, relating successive

image frames without explicitly introducing states at IMU-rate. At the system level, we developed both the hardware and

the algorithms for accurate real-time SLAM, including robust keypoint matching, bootstrapping and outlier rejection using

inertial cues.
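To illustrate why a robust cost function can limit the influence of outliers, the following minimal sketch compares a plain squared cost with the Huber cost on a handful of scalar residuals. The choice of the Huber function, the threshold `delta`, the function names, and the residual values are all assumptions made for illustration; the text above does not state which robust cost is used.

```python
def squared_cost(r):
    """Standard least-squares cost for a scalar residual r."""
    return 0.5 * r * r


def huber_cost(r, delta=1.0):
    """Huber cost: quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


# Hypothetical reprojection residuals in pixels; the last one is an outlier.
residuals = [0.3, -0.5, 0.2, 10.0]

total_squared = sum(squared_cost(r) for r in residuals)
total_huber = sum(huber_cost(r) for r in residuals)

# Share of the total cost that is caused by the single outlier.
outlier_share_squared = squared_cost(10.0) / total_squared
outlier_share_huber = huber_cost(10.0) / total_huber
print(outlier_share_squared, outlier_share_huber)
```

With the squared cost the outlier contributes 50.0 to the total, whereas the Huber cost reduces its contribution to 9.5, so the inlier residuals retain more influence on the solution.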

Figure 1 shows the output of our stereo visual-inertial odometry algorithm as run on an indoor dataset: the stereo-vision

plus IMU sensor was walked for 470 m through several ﬂoors and staircases in the ETH main building. Along with the

state consisting of pose, speed, and IMU biases, we also obtain an impression of the environment represented as a sparse

map of 3D landmarks. Note that map and path are automatically aligned with gravity thanks to tightly-coupled IMU fusion.

In relation to the conference paper (Leutenegger et al., 2013), we make the following main additional contributions:

• After having shown the superior performance of the suggested method compared to a loosely-coupled approach, we

present extensive evaluation results with respect to a stochastic cloning sliding window ﬁlter (following the MSCKF

implementation of Mourikis et al. (2009), which includes ﬁrst-estimate Jacobians) in terms of accuracy on different

motion proﬁles. Our algorithm consistently outperforms the ﬁltering-based method, while it admittedly incurs higher

computational cost. To the best of our knowledge, such a direct comparison of visual-inertial state estimation algorithms

as suggested by different research groups is novel to the ﬁeld.

• Our framework has been extended to be used with a monocular camera setup. We present the necessary adaptations

concerning the estimation and bootstrapping parts. The monocular version was needed for fair comparison with the

reference implementation of the MSCKF algorithm which is currently only published in a monocular form. The

result is a generic N-camera (N ≥ 1) visual-inertial odometry framework. In the stereo version, the performance

gradually approaches that of the monocular case as the ratio between camera baseline and distance to structure

becomes small.

• We present the formulation for online camera extrinsics estimation that may be applied after standard intrinsics

calibration. Evaluation results demonstrate the applicability of this method, when initializing with inaccurate camera

pose estimates with respect to the IMU.

• We make an honest attempt to present our work at a level of detail that would allow the reader to re-implement our

framework.

• Various new datasets featuring individual characteristics in terms of motion, appearance, and scene depth were recorded

with our new hardware iteration ranging from hand-held indoor motion to bicycle riding. The comprehensive evaluation

demonstrates superior performance compared to our previously published results, owing to better calibration and

hardware synchronization available, as well as to algorithmic and software-level adaptations.
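The remark in the contributions above, that the stereo version behaves more and more like the monocular one when the baseline is small relative to the scene depth, can be made concrete with the standard rectified-stereo relation z = f·b/d. A first-order error propagation gives σ_z ≈ z²/(f·b)·σ_d, so the relative depth error grows linearly with depth. The sketch below evaluates this; the focal length, baseline, and disparity noise are hypothetical values chosen for illustration, not parameters of the sensor described in this article.

```python
def stereo_depth_sigma(z, focal_px, baseline_m, sigma_disp_px):
    """First-order depth standard deviation for rectified stereo.

    From z = f * b / d it follows that |dz/dd| = z**2 / (f * b).
    """
    return z * z / (focal_px * baseline_m) * sigma_disp_px


# Hypothetical sensor parameters, chosen only for illustration.
FOCAL_PX = 450.0
BASELINE_M = 0.11
SIGMA_DISP_PX = 0.5

relative_errors = []
for depth in (1.0, 5.0, 20.0, 80.0):
    sigma = stereo_depth_sigma(depth, FOCAL_PX, BASELINE_M, SIGMA_DISP_PX)
    relative_errors.append(sigma / depth)
    ratio = BASELINE_M / depth
    print(f"baseline/depth = {ratio:.4f}, relative depth error = {sigma / depth:.1%}")
```

As the baseline-to-depth ratio shrinks, the stereo depth measurement carries less and less information, and scale has to be inferred mainly from the inertial measurements, as in the monocular case.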

The remainder of this work is structured as follows: in Section II we provide a more detailed overview of how our work

relates to existing literature and differentiates itself. Section III introduces the notations and deﬁnitions used throughout

this article. The nonlinear error terms from camera and IMU measurements are described in-depth in Section IV, which is

then followed by an overview of frontend processing and initialization in Section V. As a last key element of the method,

Section VI introduces how the keyframe concept is applied by marginalization. Section VII describes the experimental

setup, evaluation scheme and presents extensive results on the different datasets.

II. RELATED WORK

The vision-only algorithms which form the foundation for today’s VINS can be categorized into batch

Structure-from-Motion (SfM) and filtering-based methods. Due to computational constraints, for a long time, vision-based real-time

odometry or SLAM algorithms such as those presented in Davison (2003) were only possible using a ﬁltering approach.

Subsequent research (Strasdat et al., 2010), however, has shown that nonlinear optimization based approaches, as commonly

used for offline SfM, can provide better accuracy for similar computational work when compared to filtering approaches,

given that the structural sparsity of the problem is preserved. Since then, it has been popular to maintain a relatively sparse

graph of keyframes and their associated landmarks subject to non-linear optimizations (Klein and Murray, 2007).

The earliest results in VINS originate from the work of Jung and Taylor (2001) for (spline based) batch and of Chai

et al. (2002); Roumeliotis et al. (2002) for ﬁltering based approaches. Subsequently, a variety of ﬁltering based approaches

have been published based on EKFs (Kim and Sukkarieh, 2007; Mourikis and Roumeliotis, 2007; Li and Mourikis, 2012a;

Weiss et al., 2012; Lynen et al., 2013), Iterated EKFs (IEKFs) (Strelow and Singh, 2004, 2003) and Unscented Kalman

Filters (UKFs) (Shin and El-Sheimy, 2004; Ebcin and Veth, 2007; Kelly and Sukhatme, 2011) to name a few, which over

the years showed an impressive improvement in precision and a reduction in computational complexity. Today such 6 DoF

visual-inertial estimation systems can be run online on consumer mobile devices (Li and Mourikis, 2012c; Li et al., 2013).

In order to limit computational complexity, many works follow the loosely coupled approach. Konolige et al. (2011)

integrate IMU measurements as independent inclinometer and relative yaw measurements into an optimization problem

using stereo vision measurements. In contrast, Weiss et al. (2012) use vision-only pose estimates as updates to an EKF

with indirect IMU propagation. Similar approaches can be followed for loosely coupled batch based algorithms such as in

Ranganathan et al. (2007) and Indelman et al. (2012), where relative stereo pose estimates are integrated into a factor-graph

with non-linear optimization including inertial terms and absolute GPS measurements. It is well known that loosely coupled

approaches are inherently sub-optimal since they disregard correlations amongst internal states of different sensors.

A notable contribution in the area of ﬁltering based VINS is the work of Mourikis and Roumeliotis (2007) who proposed

an EKF based real-time fusion using monocular vision, called the Multi-State Constraint Kalman Filter (MSCKF) which

performs nonlinear triangulation of landmarks from a set of camera poses over time before using them in the EKF update.

This contrasts with other works that only use visual constraints between pairwise camera poses (Bayard and Brugarolas,

2005). Mourikis and Roumeliotis (2007) also show how the correlations between errors of the landmarks and the camera

locations—which are introduced by using the estimated camera poses for triangulation—can be eliminated and thus result

in an estimator which is consistent and optimal up to linearization errors. Another monocular visual-inertial ﬁlter was

proposed by Jones and Soatto (2011), presenting results on a long outdoor trajectory including IMU to camera calibration

and loop closure. Li and Mourikis (2013) showed that further increases in the performance of the MSCKF are attainable

by switching between the landmark processing model, as used in the MSCKF, and the full estimation of landmarks, as

employed by EKF-SLAM.

Further improvements and extensions to both loosely and tightly-coupled ﬁltering based approaches include an alternative

rotation parameterization (Li and Mourikis, 2012b), inclusion of rolling shutter cameras (Jia and Evans, 2012; Li et al.,

2013), ofﬂine (Lobo and Dias, 2007; Mirzaei and Roumeliotis, 2007, 2008) and online (Weiss et al., 2012; Kelly and

Sukhatme, 2011; Jones and Soatto, 2011; Dong-Si and Mourikis, 2012) calibration of the relative position and orientation

of camera and IMU.

In order to beneﬁt from increased accuracy offered by re-linearization in batch optimization, recent work focused on

approximating the batch problem in order to allow real-time operation. Approaches to keep the problem tractable for online-

estimation can be separated into three groups (Nerurkar et al., 2013): Firstly, incremental approaches, such as the factor-graph

based algorithms by Kaess et al. (2012); Bryson et al. (2009), apply incremental updates to the problem while factorizing the

associated information matrix of the optimization problem or the measurement Jacobian into square root form (Bryson et al.,

2009; Indelman et al., 2012). Secondly, ﬁxed-lag smoother or sliding-window ﬁlter approaches (Dong-Si and Mourikis,

2011; Sibley et al., 2010; Huang et al., 2011) consider only poses from a ﬁxed time interval in the optimization. Poses

and landmarks which fall outside the window are marginalized with their corresponding measurements being dropped.

Forming non-linear constraints between different optimization parameters in the marginalization step however destroys the

sparsity of the problem, such that the window size has to be kept fairly small for real-time performance. The smaller the

window, however, the smaller the beneﬁt of repeated re-linearization. Thirdly, keyframe based approaches preserve sparsity

by maintaining only a subset of camera poses and landmarks and discard (rather than marginalize) intermediate quantities.

Nerurkar et al. (2013) present an efﬁcient ofﬂine MAP algorithm which uses all information from non-keyframes and

landmarks to form constraints between keyframes by marginalizing a set of frames and landmarks without impacting the

sparsity of the problem. While this form of marginalization shows small errors when compared to the full batch MAP

estimator, we target a version with a ﬁxed window size suitable for online and real-time operations. In this article and

our previous work (Leutenegger et al., 2013) we therefore drop measurements from non-keyframes and marginalize the

respective state. When keyframes drop out of the window over time, we marginalize the respective states and some landmarks

commonly observed to form a (linear) prior for a remaining sub-part of the optimization problem. Our approximation scheme

strictly keeps the sparsity of the original problem. This is in contrast to e.g. Sibley et al. (2010), who accept some loss of

sparsity due to marginalization. The latter sliding window ﬁlter, in a visual-inertial variant, is used for comparison in Li

and Mourikis (2012a): it proves to perform better than the original MSCKF, but interestingly, an improved MSCKF variant

using first-estimate Jacobians yields even better results. We aim at performing similar comparisons between an MSCKF

implementation (which includes the use of first-estimate Jacobians) and our keyframe- and optimization-based algorithm.
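The loss of sparsity caused by marginalization, mentioned above for sliding-window approaches, can be seen on a toy problem. The sketch below eliminates one scalar variable from a small linear system H·δx = b using the Schur complement. The matrix values and the function name `marginalize` are hypothetical, and the example uses scalar variables rather than poses and landmarks; it is meant as an illustration of the fill-in effect, not as the scheme used in this article.

```python
def marginalize(H, b, m):
    """Eliminate scalar variable m from the linear system H * dx = b
    via the Schur complement and return the reduced (H, b)."""
    keep = [i for i in range(len(H)) if i != m]
    pivot = H[m][m]
    H_red = [[H[i][j] - H[i][m] * H[m][j] / pivot for j in keep] for i in keep]
    b_red = [b[i] - H[i][m] * b[m] / pivot for i in keep]
    return H_red, b_red


# Toy information matrix over (x0, x1, x2): x0 is linked to x1 and x2,
# but x1 and x2 share no direct term (H[1][2] == 0).
H = [[4.0, 1.0, 1.0],
     [1.0, 3.0, 0.0],
     [1.0, 0.0, 2.0]]
b = [1.0, 2.0, 3.0]

H_red, b_red = marginalize(H, b, m=0)
print(H_red)  # the off-diagonal entry between x1 and x2 is now non-zero
```

Before marginalization, x1 and x2 are not directly connected. Afterwards, the reduced matrix is [[2.75, -0.25], [-0.25, 1.75]]: eliminating x0 has introduced a coupling between every pair of variables that were connected to it, which is the fill-in that limits the window size in practice.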

Apart from the differentiation between batch and ﬁltering approaches, it has been a major interest to increase the

estimation accuracy by studying the observability properties of VINS. There is substantial work on the observability

properties given a particular combination of sensors or measurements (Martinelli, 2011; Weiss, 2012) or only using data

from a reduced set of IMU axes (Martinelli, 2014). Global unobservability of yaw and position, as well as growing

uncertainty with respect to an initial pose of reference are intrinsic to the visual-inertial estimation problem (Hesch et al.,

2012b; Huang et al., 2013; Hesch et al., 2013). This property is therefore of particular interest when comparing ﬁltering

approaches to batch-algorithms: the representation of pose and its uncertainty in a global frame of reference usually becomes

numerically problematic as the uncertainty for parts of the state undergoes unbounded growth, while remaining low for

the observable sub parts of the state. Our batch approach therefore uses a formulation of relative uncertainty of keyframes

to avoid expressing global uncertainty.

Unobservability of the VINS problem poses a particular challenge to ﬁltering approaches where repeated linearization

is typically not possible: Huang et al. (2009) have shown that these linearization errors may erroneously render parts of

the estimated state numerically observable. Hesch et al. (2012a) and others (Huang et al., 2011; Kottas et al., 2012; Hesch

et al., 2012b, 2013; Huang et al., 2013) derived formulations that allow choosing the linearization points of the VINS

such that the observability properties of the linearized and non-linear systems are equal. In our proposed algorithm,

we employ ﬁrst-estimate Jacobians, i.e. whenever linearization of a variable is employed, we ﬁx the linearization point for

any subsequent linearization involving that particular variable.
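The idea of fixing a linearization point once a variable has been linearized can be sketched with a small bookkeeping class. Everything in this example is hypothetical: the class name `FirstEstimateStore`, the variable key `"x_k"`, and the toy measurement model h(x) = x² stand in for the actual states and error terms, which are not specified in this part of the article.

```python
class FirstEstimateStore:
    """Remembers the estimate at which each variable was first linearized."""

    def __init__(self):
        self._fixed = {}

    def linearization_point(self, name, current_estimate):
        # The first call fixes the point; later calls return the stored value
        # even if the current estimate has moved on.
        return self._fixed.setdefault(name, current_estimate)


def jacobian_h(x):
    """Jacobian of the toy measurement model h(x) = x**2."""
    return 2.0 * x


store = FirstEstimateStore()

# First linearization of x_k at the estimate 1.0.
J_first = jacobian_h(store.linearization_point("x_k", 1.0))

# Later the estimate has changed to 1.2, but the Jacobian is still
# evaluated at the first estimate.
J_later = jacobian_h(store.linearization_point("x_k", 1.2))
print(J_first, J_later)
```

Both Jacobians are evaluated at 1.0, so all linearized terms involving the variable share one consistent linearization point.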

III. NOTATION AND DEFINITIONS

A. Notation

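We employ the following notation throughout this work: $\underrightarrow{\mathcal{F}}_A$ denotes a reference frame $A$; a point $P$ represented in frame $\underrightarrow{\mathcal{F}}_A$ is written as position vector ${}_A\mathbf{r}_P$, or ${}_A\boldsymbol{r}_P$ when in homogeneous coordinates. A transformation between frames is represented by a homogeneous transformation matrix $\mathbf{T}_{AB}$ that transforms the coordinate representation of homogeneous points from $\underrightarrow{\mathcal{F}}_B$ to $\underrightarrow{\mathcal{F}}_A$. Its rotation matrix part is written as $\mathbf{C}_{AB}$; the corresponding quaternion is written as $\mathbf{q}_{AB} = [\boldsymbol{\epsilon}^T, \eta]^T \in S^3$, $\boldsymbol{\epsilon}$ and $\eta$ representing the imaginary and real parts. We adopt the notation introduced in Barfoot et al. (2011): concerning the quaternion multiplication $\mathbf{q}_{AC} = \mathbf{q}_{AB} \otimes \mathbf{q}_{BC}$, we introduce a left-hand side compound operator $[\cdot]^{+}$ and a right-hand side operator $[\cdot]^{\oplus}$ that output matrices such that $\mathbf{q}_{AC} = [\mathbf{q}_{AB}]^{+}\mathbf{q}_{BC} = [\mathbf{q}_{BC}]^{\oplus}\mathbf{q}_{AB}$.

Taking velocity as an example of a physical quantity represented in frame $\underrightarrow{\mathcal{F}}_A$ that relates frame $\underrightarrow{\mathcal{F}}_B$ and $\underrightarrow{\mathcal{F}}_C$, we write ${}_A\mathbf{v}_{BC}$, i.e. the velocity of frame $\underrightarrow{\mathcal{F}}_C$ with respect to $\underrightarrow{\mathcal{F}}_B$.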
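The role of the two compound operators in the notation above can be checked numerically: a quaternion product can be written either as a matrix built from the left factor applied to the right factor, or as a matrix built from the right factor applied to the left factor. The sketch below assumes the Hamilton product with the imaginary part stored first, i.e. [ε, η]. This sign convention, as well as the helper names `left_op` and `right_op`, are assumptions for illustration; the cited reference may define the operators with the opposite sign on the cross-product term.

```python
def skew(e):
    """Cross-product matrix of a 3-vector e, so that skew(e) @ v = e x v."""
    x, y, z = e
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def quat_mul(q, p):
    """Hamilton product of q and p, both stored as [ex, ey, ez, eta]."""
    qx, qy, qz, qw = q
    px, py, pz, pw = p
    return [
        qw * px + pw * qx + qy * pz - qz * py,
        qw * py + pw * qy + qz * px - qx * pz,
        qw * pz + pw * qz + qx * py - qy * px,
        qw * pw - qx * px - qy * py - qz * pz,
    ]


def compound_matrix(q, sign):
    """4x4 matrix [[eta*I + sign*skew(eps), eps], [-eps^T, eta]]."""
    eps, eta = q[:3], q[3]
    s = skew(eps)
    rows = [
        [(eta if i == j else 0.0) + sign * s[i][j] for j in range(3)] + [eps[i]]
        for i in range(3)
    ]
    rows.append([-eps[0], -eps[1], -eps[2], eta])
    return rows


def left_op(q):
    """Plays the role of the left-hand side compound operator."""
    return compound_matrix(q, +1.0)


def right_op(q):
    """Plays the role of the right-hand side compound operator."""
    return compound_matrix(q, -1.0)


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


S = 0.7071067811865476
q_AB = [0.0, 0.0, S, S]  # 90 degrees about the z axis
q_BC = [S, 0.0, 0.0, S]  # 90 degrees about the x axis

q_AC = quat_mul(q_AB, q_BC)
via_left = matvec(left_op(q_AB), q_BC)
via_right = matvec(right_op(q_BC), q_AB)
print(q_AC, via_left, via_right)
```

All three results agree up to rounding, which is what makes the operators convenient when deriving Jacobians: the product becomes linear in whichever factor is being differentiated.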

B. Frames

The performance of the proposed method is evaluated using an IMU and camera setup schematically depicted in Figure 2.

It is used both in monocular and stereo mode, where we want to emphasize that our methodology is generic enough to

handle an N-camera setup. Inside the tracked body that is represented relative to an inertial frame, $\underrightarrow{\mathcal{F}}_W$, we distinguish camera frames, $\underrightarrow{\mathcal{F}}_{C_i}$ (subscripted with $i = 1, \ldots, N$), and the IMU-sensor frame, $\underrightarrow{\mathcal{F}}_S$.

Fig. 2. Coordinate frames involved in the hardware setup used: two cameras are placed as a stereo setup with respective frames, $\underrightarrow{\mathcal{F}}_{C_i}$, $i \in \{1, 2\}$. IMU data is acquired in $\underrightarrow{\mathcal{F}}_S$. The algorithms estimate the position and orientation of $\underrightarrow{\mathcal{F}}_S$ with respect to the world (inertial) frame $\underrightarrow{\mathcal{F}}_W$.

C. States

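The variables to be estimated comprise the robot states at the image times (index $k$), $\mathbf{x}_R^k$, and the landmarks $\mathbf{x}_L$. $\mathbf{x}_R$ holds the robot position in the inertial frame ${}_W\mathbf{r}_S$, the body orientation quaternion $\mathbf{q}_{WS}$, the velocity expressed in the sensor frame ${}_S\mathbf{v}_{WS}$ (written in short as ${}_S\mathbf{v}$), as well as the biases of the gyroscopes $\mathbf{b}_g$ and the biases of the accelerometers $\mathbf{b}_a$. Thus, $\mathbf{x}_R$ is written as:

$$\mathbf{x}_R := \left[ {}_W\mathbf{r}_S^T,\ \mathbf{q}_{WS}^T,\ {}_S\mathbf{v}^T,\ \mathbf{b}_g^T,\ \mathbf{b}_a^T \right]^T \in \mathbb{R}^3 \times S^3 \times \mathbb{R}^9. \tag{1}$$

Furthermore, we use a partition into the pose states $\mathbf{x}_T := [{}_W\mathbf{r}_S^T, \mathbf{q}_{WS}^T]^T$ and the speed/bias states $\mathbf{x}_{sb} := [{}_S\mathbf{v}^T, \mathbf{b}_g^T, \mathbf{b}_a^T]^T$.

The $j$-th landmark is represented in homogeneous (World) coordinates: $\mathbf{x}_L^j \in \mathbb{R}^4$. At this point, we set the fourth component to one.

Optionally, we may include camera extrinsics estimation as part of an online calibration process. Camera extrinsics denoted $\mathbf{x}_{C_i} := [{}_S\mathbf{r}_{C_i}^T, \mathbf{q}_{SC_i}^T]^T$ can either be treated as constant entities to be calibrated or as time-varying states subjected to a first-order Gaussian process, which allows tracking changes that may occur, e.g. due to temperature-induced mechanical deformation of the setup.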
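As a rough illustration of the state layout described above, the sketch below stores one robot state as a small data structure and exposes the pose and speed/bias partitions. The class and field names (`RobotState`, `r_WS`, `q_WS`, `v_S`, `b_g`, `b_a`) and the numeric values are illustrative assumptions, and the quaternion is assumed to be stored with its imaginary part first; this is not the data structure of any particular implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RobotState:
    """Robot state at one image time k."""
    r_WS: List[float]  # position of the sensor frame in the world frame (3)
    q_WS: List[float]  # orientation quaternion [eps_x, eps_y, eps_z, eta] (4)
    v_S: List[float]   # velocity expressed in the sensor frame (3)
    b_g: List[float]   # gyroscope biases (3)
    b_a: List[float]   # accelerometer biases (3)

    def pose(self) -> List[float]:
        """Pose partition: position and orientation."""
        return self.r_WS + self.q_WS

    def speed_bias(self) -> List[float]:
        """Speed/bias partition: velocity, gyroscope and accelerometer biases."""
        return self.v_S + self.b_g + self.b_a

    def as_vector(self) -> List[float]:
        """Full state as one flat list (3 + 4 + 9 entries)."""
        return self.pose() + self.speed_bias()


def homogeneous_landmark(x: float, y: float, z: float) -> List[float]:
    """Landmark in homogeneous world coordinates with fourth component one."""
    return [x, y, z, 1.0]


state = RobotState(
    r_WS=[0.0, 0.0, 0.0],
    q_WS=[0.0, 0.0, 0.0, 1.0],  # identity rotation
    v_S=[0.1, 0.0, 0.0],
    b_g=[0.0, 0.0, 0.0],
    b_a=[0.0, 0.0, 0.0],
)
landmark = homogeneous_landmark(1.0, 2.0, 5.0)
print(len(state.pose()), len(state.speed_bias()), len(state.as_vector()))
```

Note that the stored vector has 16 entries, while the unit-norm constraint on the quaternion leaves only 15 degrees of freedom per robot state.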
