Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization


Combining visual and inertial measurements has become popular in mobile robotics, since the two sensing modalities offer complementary characteristics that make them the ideal choice for accurate Visual-Inertial Odometry or Simultaneous Localization and Mapping (SLAM). While historically the problem …
Our framework has been extended for use with a monocular camera setup; we present the necessary adaptations to the estimation and bootstrapping parts. The monocular version was needed for a fair comparison with the reference implementation of the MSCKF algorithm, which is currently only published in monocular form. The result is a generic N-camera (N >= 1) visual-inertial odometry framework. In the stereo version, the performance gradually transforms into the monocular case as the ratio between the camera baseline and the distance to structure becomes small.

We present the formulation for online camera extrinsics estimation that may be applied after standard intrinsics calibration. Evaluation results demonstrate the applicability of this method when initializing with inaccurate camera pose estimates with respect to the IMU. We make an honest attempt to present our work at a level of detail that would allow the reader to re-implement it.

Various new datasets featuring individual characteristics in terms of motion, appearance, and scene depth were recorded with our new hardware iteration, ranging from hand-held indoor motion to bicycle riding. The comprehensive evaluation demonstrates superior performance compared to our previously published results, owing to the better calibration and hardware synchronization available, as well as to algorithmic and software-level adaptations.

The remainder of this work is structured as follows: in Section II we provide a more detailed overview of how our work relates to, and differentiates itself from, the existing literature. Section III introduces the notation and definitions used throughout this article. The nonlinear error terms from camera and IMU measurements are described in depth in Section IV, which is then followed by an overview of frontend processing and initialization in Section V. As a last key element of the method, Section VI introduces how the keyframe concept is applied by marginalization.
Section VII describes the experimental setup and evaluation scheme, and presents extensive results on the different datasets.

II. RELATED WORK

The vision-only algorithms which form the foundation for today's VINS can be categorized into batch Structure-from-Motion (SfM) and filtering based methods. Due to computational constraints, for a long time vision-based real-time odometry or SLAM algorithms such as those presented in Davison (2003) were only possible using a filtering approach. Subsequent research (Strasdat et al., 2010), however, has shown that nonlinear optimization based approaches, as commonly used for offline SfM, can provide better accuracy for similar computational work when compared to filtering approaches, given that the structural sparsity of the problem is preserved. Henceforth, it has been popular to maintain a relatively sparse graph of keyframes and their associated landmarks subject to nonlinear optimization (Klein and Murray, 2007).

The earliest results in VINS originate from the work of Jung and Taylor (2001) for (spline based) batch approaches and of Chai et al. (2002); Roumeliotis et al. (2002) for filtering based approaches. Subsequently, a variety of filtering based approaches have been published based on EKFs (Kim and Sukkarieh, 2007; Mourikis and Roumeliotis, 2007; Li and Mourikis, 2012a; Weiss et al., 2012; Lynen et al., 2013), Iterated EKFs (IEKFs) (Strelow and Singh, 2003, 2004) and Unscented Kalman Filters (UKFs) (Shin and El-Sheimy, 2004; Ebcin and Veth, 2007; Kelly and Sukhatme, 2011), to name a few, which over the years showed an impressive improvement in precision and a reduction in computational complexity. Today such 6 DoF visual-inertial estimation systems can be run online on consumer mobile devices (Li and Mourikis, 2012c; Li et al., 2013). In order to limit computational complexity, many works follow the loosely coupled approach.
Konolige et al. (2011) integrate IMU measurements as independent inclinometer and relative yaw measurements into an optimization problem using stereo vision measurements. In contrast, Weiss et al. (2012) use vision-only pose estimates as updates to an EKF with indirect IMU propagation. Similar approaches can be followed for loosely coupled batch based algorithms such as in Ranganathan et al. (2007) and Indelman et al. (2012), where relative stereo pose estimates are integrated into a factor graph with nonlinear optimization including inertial terms and absolute GPS measurements. It is well known that loosely coupled approaches are inherently sub-optimal, since they disregard correlations amongst internal states of different sensors.

A notable contribution in the area of filtering based VINS is the work of Mourikis and Roumeliotis (2007), who proposed an EKF based real-time fusion using monocular vision, called the Multi-State Constraint Kalman Filter (MSCKF), which performs nonlinear triangulation of landmarks from a set of camera poses over time before using them in the EKF update. This contrasts with other works that only use visual constraints between pairwise camera poses (Bayard and Brugarolas, 2005). Mourikis and Roumeliotis (2007) also show how the correlations between errors of the landmarks and the camera locations (which are introduced by using the estimated camera poses for triangulation) can be eliminated, resulting in an estimator which is consistent and optimal up to linearization errors. Another monocular visual-inertial filter was proposed by Jones and Soatto (2011), presenting results on a long outdoor trajectory including IMU-to-camera calibration and loop closure.
Li and Mourikis (2013) showed that further increases in the performance of the MSCKF are attainable by switching between the landmark processing model, as used in the MSCKF, and the full estimation of landmarks, as employed by EKF-SLAM. Further improvements and extensions to both loosely and tightly coupled filtering based approaches include an alternative rotation parameterization (Li and Mourikis, 2012b), inclusion of rolling-shutter cameras (Jia and Evans, 2012; Li et al., 2013), and offline (Lobo and Dias, 2007; Mirzaei and Roumeliotis, 2007, 2008) as well as online (Weiss et al., 2012; Kelly and Sukhatme, 2011; Jones and Soatto, 2011; Dong-Si and Mourikis, 2012) calibration of the relative position and orientation of camera and IMU.

In order to benefit from the increased accuracy offered by re-linearization in batch optimization, recent work has focused on approximating the batch problem in order to allow real-time operation. Approaches to keep the problem tractable for online estimation can be separated into three groups (Nerurkar et al., 2013). Firstly, incremental approaches, such as the factor-graph based algorithms by Kaess et al. (2012); Bryson et al. (2009), apply incremental updates to the problem while factorizing the associated information matrix of the optimization problem or the measurement Jacobian into square root form (Bryson et al., 2009; Indelman et al., 2012). Secondly, fixed-lag smoother or sliding-window filter approaches (Dong-Si and Mourikis, 2011; Sibley et al., 2010; Huang et al., 2011) consider only poses from a fixed time interval in the optimization. Poses and landmarks which fall outside the window are marginalized, with their corresponding measurements being dropped. Forming nonlinear constraints between different optimization parameters in the marginalization step, however, destroys the sparsity of the problem, such that the window size has to be kept fairly small for real-time performance.
The smaller the window, however, the smaller the benefit of repeated re-linearization. Thirdly, keyframe based approaches preserve sparsity by maintaining only a subset of camera poses and landmarks and discard (rather than marginalize) intermediate quantities. Nerurkar et al. (2013) present an efficient offline MAP algorithm which uses all information from non-keyframes and landmarks to form constraints between keyframes by marginalizing a set of frames and landmarks without impacting the sparsity of the problem. While this form of marginalization shows small errors when compared to the full batch MAP estimator, we target a version with a fixed window size suitable for online and real-time operation. In this article and our previous work (Leutenegger et al., 2013) we therefore drop measurements from non-keyframes and marginalize the respective states. When keyframes drop out of the window over time, we marginalize the respective states and some commonly observed landmarks to form a (linear) prior for the remaining sub-part of the optimization problem. Our approximation scheme strictly keeps the sparsity of the original problem. This is in contrast to e.g. Sibley et al. (2010), who accept some loss of sparsity due to marginalization. The latter sliding window filter, in a visual-inertial variant, is used for comparison in Li and Mourikis (2012a): it proves to perform better than the original MSCKF, but interestingly, an improved MSCKF variant using first-estimate Jacobians yields even better results. We aim at performing similar comparisons between an MSCKF implementation (that includes the use of first-estimate Jacobians) and our keyframe and optimization based algorithm.

Apart from the differentiation between batch and filtering approaches, it has been of major interest to increase the estimation accuracy by studying the observability properties of VINS.
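The marginalization discussed above can be illustrated with a small, self-contained sketch (toy numbers, not taken from the paper): dropping states from a Gaussian information system H δx = b via the Schur complement yields a prior on the remaining states that preserves their linear MAP estimate.

```python
import numpy as np

def marginalize(H, b, keep, marg):
    """Marginalize the states indexed by `marg` out of the Gaussian
    information system H * dx = b, keeping those indexed by `keep`.
    Returns the Schur-complement prior (H', b') on the kept states."""
    H_kk = H[np.ix_(keep, keep)]
    H_km = H[np.ix_(keep, marg)]
    H_mm = H[np.ix_(marg, marg)]
    H_mm_inv = np.linalg.inv(H_mm)
    H_prior = H_kk - H_km @ H_mm_inv @ H_km.T
    b_prior = b[keep] - H_km @ H_mm_inv @ b[marg]
    return H_prior, b_prior

# Toy 3-state system: marginalize the middle state.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
H_p, b_p = marginalize(H, b, keep=[0, 2], marg=[1])
```

Solving the reduced system recovers exactly the kept components of the full solution, which is why such a prior can stand in for dropped keyframes; sparsity is destroyed, however, wherever the marginalized state connected otherwise-unrelated kept states.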
There is substantial work on the observability properties given a particular combination of sensors or measurements (Martinelli, 2011; Weiss, 2012), or only using data from a reduced set of IMU axes (Martinelli, 2014). Global unobservability of yaw and position, as well as growing uncertainty with respect to an initial pose of reference, are intrinsic to the visual-inertial estimation problem (Hesch et al., 2012b; Huang et al., 2013; Hesch et al., 2013). This property is therefore of particular interest when comparing filtering approaches to batch algorithms: the representation of pose and its uncertainty in a global frame of reference usually becomes numerically problematic, as the uncertainty for parts of the state undergoes unbounded growth while remaining low for the observable sub-parts of the state. Our batch approach therefore uses a formulation of relative uncertainty of keyframes to avoid expressing global uncertainty.

Unobservability of the VINS problem poses a particular challenge to filtering approaches, where repeated linearization is typically not possible: Huang et al. (2009) have shown that these linearization errors may erroneously render parts of the estimated state numerically observable. Hesch et al. (2012a) and others (Huang et al., 2011; Kottas et al., 2012; Hesch et al., 2012b, 2013; Huang et al., 2013) derived formulations allowing to choose the linearization points of the VINS system in a way such that the observability properties of the linearized and nonlinear systems are equal. In our proposed algorithm, we employ first-estimate Jacobians, i.e. whenever linearization of a variable is employed, we fix the linearization point for any subsequent linearization involving that particular variable.

III. NOTATION AND DEFINITIONS

A. Notation

We employ the following notation throughout this work: $\mathcal{F}_A$ denotes a reference frame $A$. A point $P$ represented in frame $\mathcal{F}_A$ is written as position vector ${}_A r_P$, or ${}_A \bar{r}_P$ when in homogeneous coordinates.
A transformation between frames is represented by a homogeneous transformation matrix $T_{AB}$ that transforms the coordinate representation of homogeneous points from $\mathcal{F}_B$ to $\mathcal{F}_A$. Its rotation matrix part is written as $C_{AB}$; the corresponding quaternion is written as $q_{AB} = [\varepsilon^T, \eta]^T$, with $\varepsilon$ and $\eta$ representing the imaginary and real parts. We adopt the notation introduced in Barfoot et al. (2011): concerning the quaternion multiplication $q_{AC} = q_{AB} \otimes q_{BC}$, we introduce a left-hand side compound operator $[\cdot]^+$ and a right-hand side operator $[\cdot]^\oplus$ that output matrices such that $q_{AC} = [q_{AB}]^+ q_{BC} = [q_{BC}]^\oplus q_{AB}$. Taking velocity as an example of a physical quantity represented in frame $\mathcal{F}_A$ that relates frames $\mathcal{F}_B$ and $\mathcal{F}_C$, we write ${}_A v_{BC}$, i.e. the velocity of frame $\mathcal{F}_B$ with respect to $\mathcal{F}_C$, expressed in $\mathcal{F}_A$.

B. Frames

The performance of the proposed method is evaluated using an IMU and camera setup schematically depicted in Figure 2. It is used both in monocular and stereo mode, where we want to emphasize that our methodology is generic enough to handle an N-camera setup. Inside the tracked body that is represented relative to an inertial frame, $\mathcal{F}_W$, we distinguish the camera frames, $\mathcal{F}_{C_i}$ (subscripted with $i = 1, \ldots, N$), and the IMU-sensor frame, $\mathcal{F}_S$.

Fig. 2. Coordinate frames involved in the hardware setup used: two cameras are placed as a stereo setup with respective frames $\mathcal{F}_{C_i}$, $i \in \{1, 2\}$. IMU data is acquired in $\mathcal{F}_S$. The algorithms estimate the position and orientation of $\mathcal{F}_S$ with respect to the world (inertial) frame $\mathcal{F}_W$.

C. States

The variables to be estimated comprise the robot states at the image times (index $k$), $x_R$, and the landmarks $x_L$. $x_R$ holds the robot position in the inertial frame ${}_W r_S$, the body orientation quaternion $q_{WS}$, the velocity expressed in the sensor frame ${}_S v_{WS}$ (written in short as ${}_S v$), as well as the biases of the gyroscopes $b_g$ and the biases of the accelerometers $b_a$.
Thus, $x_R$ is written as

$$x_R := \left[{}_W r_S^T,\; q_{WS}^T,\; {}_S v^T,\; b_g^T,\; b_a^T\right]^T \in \mathbb{R}^3 \times S^3 \times \mathbb{R}^9. \quad (1)$$

Furthermore, we use a partition into the pose states $x_T := [{}_W r_S^T, q_{WS}^T]^T$ and the speed/bias states $x_{sb} := [{}_S v^T, b_g^T, b_a^T]^T$. The $j$-th landmark is represented in homogeneous (World) coordinates: $x_L^j := {}_W l^j \in \mathbb{R}^4$. At this point, we set the fourth component to one.

Optionally, we may include camera extrinsics estimation as part of an online calibration process. The camera extrinsics, denoted $x_C^i := [{}_S r_{C_i}^T, q_{SC_i}^T]^T$, can either be treated as constant entities to be calibrated or as time-varying states subjected to a first-order Gaussian process, allowing to track changes that may occur e.g. due to temperature-induced mechanical deformation of the setup.

In general the states live on a manifold; therefore we use a perturbation in tangent space $\mathfrak{g}$ and employ the group operator $\boxplus$ (which is not commutative in general), the exponential $\exp$ and logarithm $\log$. Now we can define the perturbation $\delta x := x \boxplus \bar{x}$ around the estimate $\bar{x}$. We use a minimal coordinate representation $\delta\chi \in \mathbb{R}^{\dim \mathfrak{g}}$; a bijective mapping $\Phi$ transforms from minimal coordinates to tangent space. Thus we obtain the transformations from and to minimal coordinates:

$$\delta x = \exp(\Phi(\delta\chi)), \quad (2)$$
$$\delta\chi = \Phi^{-1}(\log(\delta x)). \quad (3)$$

Concretely, we use the minimal (3D) axis-angle perturbation of orientation, $\delta\alpha \in \mathbb{R}^3$, which can be converted into its quaternion equivalent $\delta q$ via the exponential map:

$$\delta q := \exp\left(\begin{bmatrix} \tfrac{1}{2}\delta\alpha \\ 0 \end{bmatrix}\right) = \begin{bmatrix} \operatorname{sinc}\left(\left\|\tfrac{\delta\alpha}{2}\right\|\right)\tfrac{\delta\alpha}{2} \\ \cos\left(\left\|\tfrac{\delta\alpha}{2}\right\|\right) \end{bmatrix}. \quad (4)$$

Therefore, using the group operator, we write $q_{WS} = \delta q \otimes \bar{q}_{WS}$. Note that linearization of the exponential map around $\delta\alpha = 0$ yields

$$\delta q \approx \begin{bmatrix} \tfrac{1}{2}\delta\alpha \\ 1 \end{bmatrix}, \quad (5)$$

i.e. a perturbation close to the identity quaternion. We obtain the minimal robot error state vector

$$\delta\chi_R := \left[\delta p^T, \delta\alpha^T, \delta v^T, \delta b_g^T, \delta b_a^T\right]^T \in \mathbb{R}^{15}. \quad (6)$$

Analogously to the robot state decomposition into $x_T$ and $x_{sb}$, we use the pose error state $\delta\chi_T := [\delta p^T, \delta\alpha^T]^T$ and the speed/bias error state $\delta\chi_{sb} := [\delta v^T, \delta b_g^T, \delta b_a^T]^T$. As landmark perturbation, we use a simple Euclidean version $\delta\beta$ that is applied as $\delta l := [\delta\beta^T, 0]^T$ by addition.

IV. BATCH VISUAL SLAM WITH INERTIAL TERMS

In this section, we present our approach of incorporating inertial measurements into batch visual SLAM.
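Before moving to the error terms, the minimal axis-angle orientation perturbation of Section III and its small-angle linearization can be sketched numerically (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def delta_q(delta_alpha):
    """Unit quaternion (x, y, z, w) for a minimal axis-angle perturbation
    delta_alpha in R^3, via the quaternion exponential map."""
    half = 0.5 * np.asarray(delta_alpha, dtype=float)
    theta = np.linalg.norm(half)
    if theta < 1e-10:
        xyz = half                          # sinc -> 1 in the small-angle limit
    else:
        xyz = (np.sin(theta) / theta) * half
    return np.concatenate([xyz, [np.cos(theta)]])

# Small-angle check: delta_q is approximately [delta_alpha / 2, 1]
da = np.array([1e-4, -2e-4, 0.5e-4])
dq = delta_q(da)
```

The output is a unit quaternion by construction, and for vanishing $\delta\alpha$ it reduces to the identity quaternion, matching the linearization used above.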
In visual odometry and SLAM, a nonlinear optimization is formulated to find the camera poses and landmark positions by minimizing the reprojection error of landmarks observed in camera frames. Figure 3 shows the respective graph representation inspired by Thrun and Montemerlo (2006): it displays measurements as edges with square boxes and estimated quantities as round nodes. As soon as inertial measurements are introduced, they not only create temporal constraints between successive poses, but also between successive speed and IMU bias estimates of both accelerometers and gyroscopes, by which the robot state vector is augmented.

Fig. 3. Graphs of the state variables and measurements involved in the visual SLAM problem (left) versus visual-inertial SLAM (right): incorporating inertial measurements introduces temporal constraints, and necessitates a state augmentation by the robot speed as well as IMU biases.

We seek to formulate the visual-inertial localization and mapping problem as one joint optimization of a cost function $J(x)$ containing both the (weighted) reprojection errors $e_r$ and the (weighted) temporal error terms from the IMU, $e_s$:

$$J(x) := \underbrace{\sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{j \in \mathcal{J}(i,k)} e_r^{i,j,k\,T}\, W_r^{i,j,k}\, e_r^{i,j,k}}_{\text{visual}} \;+\; \underbrace{\sum_{k=1}^{K-1} e_s^{k\,T}\, W_s^k\, e_s^k}_{\text{inertial}}, \quad (7)$$

where $i$ is the camera index of the assembly, $k$ denotes the camera frame index, and $j$ denotes the landmark index. The indices of the landmarks visible in the $k$-th frame and the $i$-th camera are written as the set $\mathcal{J}(i,k)$. Furthermore, $W_r^{i,j,k}$ represents the information matrix of the respective landmark measurement, and $W_s^k$ the information of the $k$-th IMU error. Throughout our work, we employ the Google Ceres optimizer (Agarwal et al., n.d.) integrated with our real-time capable C++ software infrastructure.

In the following, we will present the reprojection error formulation. Afterwards, an overview of IMU kinematics combined with bias term modeling is given, upon which we base the IMU error term.
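The structure of the joint cost $J(x)$ above is a plain sum of information-weighted squared residuals. A toy sketch (the residuals and information matrices would come from the reprojection and IMU error terms; the values here are illustrative only):

```python
import numpy as np

def total_cost(visual_terms, inertial_terms):
    """Evaluate the joint cost J(x): each term is a pair (e, W) of a
    residual vector and its information matrix; visual terms are the
    reprojection errors e_r, inertial terms the IMU errors e_s."""
    J = 0.0
    for e, W in visual_terms + inertial_terms:
        J += float(e @ W @ e)               # e^T W e
    return J

# Toy example with identity information matrices:
J = total_cost([(np.array([1.0, 2.0]), np.eye(2))],
               [(np.array([3.0]), np.eye(1))])
```

In the real system, the solver of course minimizes this sum over the states rather than merely evaluating it; the information matrices weight accurate measurements more strongly.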
A. Reprojection Error Formulation

We use a rather standard formulation of the reprojection error, adapted with minor modifications from Furgale (2011):

$$e_r^{i,j,k} = z^{i,j,k} - h_i\!\left(T_{C_i S}\, T_{SW}\, {}_W l^j\right), \quad (8)$$

where $h_i(\cdot)$ denotes the camera projection model (which may include distortion) and $z^{i,j,k}$ stands for the measurement image coordinates. We also provide the Jacobians, since they are not only needed for efficient solving, but also play a central role in the marginalization step explained in Section VI. Hereby $J_{r,i}$ denotes the Jacobian matrix of the projection $h_i(\cdot)$ into the $i$-th camera (including distortion) with respect to the landmark in homogeneous coordinates, and variables with an overbar represent our current guess. Our framework currently supports radial-tangential as well as equidistant distortion models.

B. IMU Kinematics and Bias Model

Before being able to formulate the nonlinear IMU term, we overview the differential equations that describe IMU kinematics and bias evolution. The model is commonly used in estimation with IMUs, originating from Savage (1998), using similar simplifications for MEMS-IMUs as in Shin and El-Sheimy (2004).

1) Nonlinear Model: Under the assumption that the measured effects of the Earth's rotation are small compared to the gyroscope accuracy, we can write the IMU kinematics combined with simple dynamic bias models as

$$\begin{aligned} {}_W \dot{r}_S &= C_{WS}\, {}_S v, \\ \dot{q}_{WS} &= \tfrac{1}{2}\, \Omega({}_S\omega)\, q_{WS}, \\ {}_S \dot{v} &= {}_S a + w_a - b_a + C_{SW}\, {}_W g - ({}_S\omega)^\times\, {}_S v, \\ \dot{b}_g &= w_{b_g}, \\ \dot{b}_a &= -\tfrac{1}{\tau}\, b_a + w_{b_a}, \end{aligned} \quad (10\text{--}12)$$

where the elements of $w := [w_g^T, w_a^T, w_{b_g}^T, w_{b_a}^T]^T$ are each uncorrelated zero-mean Gaussian white noise processes, ${}_S a$ are the accelerometer measurements, and ${}_W g$ represents the Earth's gravitational acceleration vector. The gyro bias is modeled as random walk, and in contrast the accelerometer bias is modeled as a bounded random walk with time constant $\tau > 0$. The matrix $\Omega$ is formed from the estimated angular rate

$${}_S\bar{\omega} = {}_S\tilde{\omega} - w_g - b_g, \quad (13)$$

with gyro measurement ${}_S\tilde{\omega}$.

2) Linearized Model of the Error States: The linearized version of the above equations around $\bar{x}_R$ will play a major role in the marginalization step.
We therefore review it here briefly: the error dynamics take the form

$$\delta\dot{\chi}_R \approx F_c(\bar{x}_R)\, \delta\chi_R + G(\bar{x}_R)\, w, \quad (14)$$

where $G$ is straightforward to derive, and $F_c$ (15) is composed of $3 \times 3$ blocks: zero blocks, $\bar{C}_{WS}$, and skew-symmetric cross-product matrices of the current estimates of angular rate, velocity, and gravity, where $[\cdot]^\times$ denotes the skew-symmetric cross-product matrix associated with a vector. Overbars generally stand for evaluation of the respective symbols with current estimates.

C. Formulation of the IMU Measurement Error Term

Figure 4 illustrates the difference in measurement rates, with camera measurements taken at time steps $k$ and $k+1$, as well as faster IMU measurements that are not necessarily synchronized with the camera measurements.

Fig. 4. Different rates of IMU and camera: one IMU error term uses all accelerometer and gyro readings between successive camera measurements, along with the respective time increments $\Delta t$, introducing a local time index for the (resampled) IMU measurements $z_s$.

We need the IMU error term $e_s^k(x_R^k, x_R^{k+1}, z_s^k)$ to be a function of the robot states at steps $k$ and $k+1$ as well as of all the IMU measurements in between these time instances (comprising accelerometer and gyro readings), summarized as $z_s^k$. Hereby we have to assume an approximate normal conditional probability density $f$ for given robot states at camera measurements $k$ and $k+1$:

$$f\!\left(e_s^k \,\middle|\, x_R^k, x_R^{k+1}\right) \approx \mathcal{N}\!\left(0, R_s^k\right). \quad (16)$$

We employ the propagation equations above to formulate a prediction $\hat{x}_R^{k+1}(x_R^k, z_s^k)$ with associated conditional covariance $P(\delta\hat{\chi}_R^{k+1} \mid x_R^k, z_s^k)$. The respective computation requires numeric integration; as common in the related literature (Mourikis and Roumeliotis, 2007), we applied the classical Runge-Kutta method in order to obtain the discrete-time nonlinear state transition equations $f_d(x_R)$ and the error state transition matrix $F_d(\bar{x}_R)$.
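The classical Runge-Kutta (RK4) integration just mentioned can be sketched on a translation-only toy state (position and velocity under constant acceleration; the full IMU propagation additionally integrates orientation and biases):

```python
import numpy as np

def rk4_step(x, f, dt):
    """One classical Runge-Kutta (RK4) step for the ODE x' = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy translation-only state [r, v] under constant acceleration a:
a = np.array([0.0, 0.0, -9.81])

def f(x):
    r, v = x[:3], x[3:]
    return np.concatenate([v, a])   # r' = v, v' = a

x0 = np.zeros(6)
x1 = rk4_step(x0, f, dt=0.1)
```

For this linear toy system RK4 reproduces the closed-form solution $r = \tfrac{1}{2} a\, \Delta t^2$, $v = a\, \Delta t$ exactly; for the full nonlinear kinematics it is only an approximation over each IMU sub-interval.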
The latter is found by integrating $\delta\dot{\chi}_R = F_c(\bar{x}_R)\, \delta\chi_R$ over $\Delta t$, keeping $\delta\chi_R$ symbolic. Using the prediction, we can now formulate the IMU error term:

$$e_s^k\!\left(x_R^k, x_R^{k+1}, z_s^k\right) := \begin{bmatrix} {}_W\hat{r}_S^{k+1} - {}_W r_S^{k+1} \\ 2\left[\hat{q}_{WS}^{k+1} \otimes \left(q_{WS}^{k+1}\right)^{-1}\right]_{1:3} \\ \hat{x}_{sb}^{k+1} - x_{sb}^{k+1} \end{bmatrix} \in \mathbb{R}^{15}. \quad (17)$$

This is simply the difference between the prediction based on the previous state and the actual state, except for orientation, where we use a simple multiplicative minimal error. Next, upon application of the error propagation law, the associated information matrix $W_s^k$ is found as

$$W_s^k = R_s^{k\,-1} = \left(\frac{\partial e_s^k}{\partial \delta\hat{\chi}_R^{k+1}}\; P\!\left(\delta\hat{\chi}_R^{k+1}\right) \frac{\partial e_s^k}{\partial \delta\hat{\chi}_R^{k+1}}^{T}\right)^{-1}. \quad (18)$$

The Jacobian $\partial e_s^k / \partial \delta\hat{\chi}_R^{k+1}$ (19) is straightforward to obtain but non-trivial, since the orientation error will be nonzero in general. Finally, the Jacobians with respect to $\delta\chi_R^k$ and $\delta\chi_R^{k+1}$ will be needed for efficient solution of the optimization problem. While differentiating with respect to $\delta\chi_R^{k+1}$ is straightforward (but non-trivial), some attention is given to the other Jacobian. Recall that the IMU error term (17) is calculated by iteratively applying the prediction. Differentiation with respect to the state $\delta\chi_R^k$ thus leads to application of the chain rule, yielding

$$\frac{\partial e_s^k}{\partial \delta\chi_R^k} = \frac{\partial e_s^k}{\partial \delta\hat{\chi}_R^{k+1}}\; F_d\!\left(\hat{x}_R^{p}, \Delta t^{p}\right) F_d\!\left(\hat{x}_R^{p-1}, \Delta t^{p-1}\right) \cdots F_d\!\left(\hat{x}_R^{1}, \Delta t^{1}\right) F_d\!\left(x_R^{k}, \Delta t^{0}\right), \quad (20)$$

with local IMU measurement index $p$.

V. FRONTEND OVERVIEW

This section overviews the image processing steps and data association, along with outlier detection and initialization of landmarks and states.

A. Keypoint Detection, Matching, and Variable Initialization

Our processing pipeline employs a customized multi-scale SSE-optimized Harris corner detector (Harris and Stephens, 1988), followed by BRISK descriptor extraction (Leutenegger et al., 2011).
The detection scheme favors a uniform keypoint distribution in the image by gradually suppressing corners with weaker corner response close to a stronger corner. BRISK would allow automatic orientation detection; however, better matching results are obtained by extracting descriptors oriented along the gravity direction projected into the image. This direction is globally observable thanks to IMU fusion.

As a first step to initialization and matching, we propagate the last pose using acquired IMU measurements in order to obtain a preliminary uncertain estimate of the states. Assume that a set of past frames (including keyframes), as well as a local map consisting of landmarks with sufficiently well-known 3D position, is available at this point (see V-B for details). As a first stage of establishing correspondences, we perform a 3D-2D matching step. Given the current pose prediction, all landmarks that should be visible are considered for brute-force descriptor matching. Outliers are only rejected afterwards. This scheme may seem illogical to the reader, who might intuitively want to apply the inverse order in the sense of a guided matching strategy; however, owing to the super-fast matching of binary descriptors, it would actually be more expensive to first look at image-space consistency. The outlier rejection consists of two steps: first, we use the uncertain pose predictions in order to perform a Mahalanobis test in image coordinates. Second, an absolute pose RANSAC, as provided in OpenGV (Kneip and Furgale, 2014), is applied.

Next, 2D-2D matching is performed in order to associate keypoints without 3D landmark correspondences. Again, we use brute-force matching first, followed by triangulation, in order to initialize landmark positions and as a first step to rejecting outlier pairings. Triangulation is performed both across stereo image pairs (in the non-mono case) and between the current frame and any previous frame available.
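The brute-force binary-descriptor matching described above can be sketched as follows (a toy Hamming-distance matcher on tiny descriptors; the actual pipeline matches full-length BRISK descriptors with SSE-optimized population counts, and the distance threshold here is illustrative):

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary descriptors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def brute_force_match(desc_a, desc_b, max_dist=60):
    """Brute-force matching of binary descriptors: for each descriptor in
    desc_a, accept the nearest neighbour in desc_b if its Hamming
    distance is below max_dist.  Returns (index_a, index_b, distance)."""
    matches = []
    for i, da in enumerate(desc_a):
        d = [hamming(da, db) for db in desc_b]
        j = int(np.argmin(d))
        if d[j] < max_dist:
            matches.append((i, j, d[j]))
    return matches
```

The accepted candidates would then still pass the Mahalanobis gating and RANSAC stages; the cheapness of the XOR/popcount distance is what makes matching first and gating afterwards the faster order.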
Only triangulations with sufficiently low depth uncertainty are labeled to be initialized; the rest will be treated as 2D measurements in subsequent matching. Finally, a relative RANSAC step (Kneip and Furgale, 2014) is performed between the current frame and the newest keyframe. The respective pose guess is furthermore used for bootstrapping in the very beginning. Figure 5 illustrates a typical detection and matching result in the stereo case. Note the challenging illumination with overexposed sky due to facing towards the sun.

Fig. 5. Visualization of typical data association on a bicycle dataset: current stereo image pair (bottom) with match lines to the newest keyframe (top). Green stands for a 3D-2D match, yellow for a 2D-2D match, blue for keypoints with left-right stereo match only, and red keypoints are left unmatched.

Fig. 6. Frames kept for matching and subsequent optimization in the stereo case: in this example, M = 3 keyframes and S = 4 most current frames are used.

B. Keyframe Selection

For the subsequent optimization, a bounded set of camera frames is maintained, i.e. poses with associated image(s) taken at that time instant; all landmarks co-visible in these images are kept in the local map. As illustrated in Figure 6, we distinguish two kinds of frames: we introduce a temporal window of the S most recent frames including the current frame, and we use a number of M keyframes that may have been taken far in the past. For keyframe selection, we use a
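The frame bookkeeping of Figure 6 can be sketched as follows (a structural sketch only; the actual keyframe selection heuristic, and the marginalization of dropped states into a linear prior, are the subject of Section VI):

```python
from collections import deque

class FrameWindow:
    """Bounded set of frames as in Figure 6: a temporal window of the S
    most recent frames plus up to M keyframes.  Frames leaving the
    temporal window are either kept as keyframes or removed (in the real
    system, marginalized into a linear prior)."""
    def __init__(self, S=4, M=3):
        self.S, self.M = S, M
        self.temporal = deque()      # (frame_id, is_keyframe), oldest left
        self.keyframes = deque()     # keyframe ids, oldest left
        self.marginalized = []       # frame ids removed from the problem

    def add_frame(self, frame_id, is_keyframe=False):
        self.temporal.append((frame_id, is_keyframe))
        if len(self.temporal) > self.S:
            fid, kf = self.temporal.popleft()
            if kf:
                self.keyframes.append(fid)
                if len(self.keyframes) > self.M:
                    self.marginalized.append(self.keyframes.popleft())
            else:
                self.marginalized.append(fid)
```

For example, with S = 4 and M = 2, streaming frames 0..9 with keyframes at 0, 3, and 6 leaves frames 6-9 in the temporal window, keyframes 0 and 3 retained, and the remaining non-keyframes removed.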
