in an estimator which is consistent and optimal up to linearization errors. Another monocular visual-inertial filter was
proposed by Jones and Soatto (2011), presenting results on a long outdoor trajectory including IMU to camera calibration
and loop closure. Li and Mourikis (2013) showed that further increases in the performance of the MSCKF are attainable
by switching between the landmark processing model, as used in the MSCKF, and the full estimation of landmarks, as
employed by EKF-SLAM.
Further improvements and extensions to both loosely and tightly coupled filtering-based approaches include an alternative
rotation parameterization (Li and Mourikis, 2012b), the inclusion of rolling shutter cameras (Jia and Evans, 2012; Li et al.,
2013), as well as offline (Lobo and Dias, 2007; Mirzaei and Roumeliotis, 2007, 2008) and online (Weiss et al., 2012; Kelly and
Sukhatme, 2011; Jones and Soatto, 2011; Dong-Si and Mourikis, 2012) calibration of the relative position and orientation
of camera and IMU.
In order to benefit from the increased accuracy offered by re-linearization in batch optimization, recent work has focused on
approximating the batch problem to allow real-time operation. Approaches that keep the problem tractable for online
estimation can be separated into three groups (Nerurkar et al., 2013): Firstly, incremental approaches, such as the factor-graph
based algorithms by Kaess et al. (2012) and Bryson et al. (2009), apply incremental updates to the problem while factorizing the
associated information matrix of the optimization problem or the measurement Jacobian into square root form (Bryson et al.,
2009; Indelman et al., 2012). Secondly, fixed-lag smoother or sliding-window filter approaches (Dong-Si and Mourikis,
2011; Sibley et al., 2010; Huang et al., 2011) consider only poses from a fixed time interval in the optimization. Poses
and landmarks that fall outside the window are marginalized, and their corresponding measurements are dropped.
Forming non-linear constraints between different optimization parameters in the marginalization step, however, destroys the
sparsity of the problem, so the window size has to be kept fairly small for real-time performance. Yet the smaller the
window, the smaller the benefit of repeated re-linearization. Thirdly, keyframe based approaches preserve sparsity
by maintaining only a subset of camera poses and landmarks and discarding (rather than marginalizing) intermediate quantities.
Nerurkar et al. (2013) present an efficient offline MAP algorithm which uses all information from non-keyframes and
landmarks to form constraints between keyframes by marginalizing a set of frames and landmarks without impacting the
sparsity of the problem. While this form of marginalization shows small errors when compared to the full batch MAP
estimator, we target a version with a fixed window size suitable for online, real-time operation. In this article and
in our previous work (Leutenegger et al., 2013), we therefore drop measurements from non-keyframes and marginalize the
respective state. When keyframes drop out of the window over time, we marginalize the respective states, together with some
commonly observed landmarks, to form a (linear) prior for a remaining sub-part of the optimization problem. Our approximation scheme
strictly keeps the sparsity of the original problem. This is in contrast to e.g. Sibley et al. (2010), who accept some loss of
sparsity due to marginalization. The latter sliding window filter, in a visual-inertial variant, is used for comparison in Li
and Mourikis (2012a): it proves to perform better than the original MSCKF, but interestingly, an improved MSCKF variant
using first-estimate Jacobians yields even better results. We aim to perform similar comparisons between an MSCKF
implementation that includes the use of first-estimate Jacobians and our keyframe-based, optimization-based algorithm.
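
The loss of sparsity caused by marginalization in fixed-lag smoothers, as discussed above, can be illustrated on a toy problem. The following is a minimal sketch under strong simplifying assumptions, not the implementation used in this article: all states are scalars, all factors are linear, and the helper names `build_information` and `marginalize` as well as the small graph of three poses and one landmark are hypothetical and chosen only for illustration. Marginalizing the oldest pose via the Schur complement couples its neighbors, here the pose `p1` and the landmark `l0`, which shared no measurement before.

```python
import numpy as np

def build_information(n, factors):
    """Accumulate H = sum(J^T J) and b = sum(J^T z) from scalar linear factors.

    Each factor is (state indices, Jacobian coefficients, measurement z).
    """
    H = np.zeros((n, n))
    b = np.zeros(n)
    for indices, coeffs, z in factors:
        J = np.zeros(n)
        J[list(indices)] = coeffs
        H += np.outer(J, J)
        b += J * z
    return H, b

def marginalize(H, b, keep, drop):
    """Eliminate the `drop` states from the system H x = b (Schur complement)."""
    H_kk = H[np.ix_(keep, keep)]
    H_kd = H[np.ix_(keep, drop)]
    H_dd = H[np.ix_(drop, drop)]
    H_marg = H_kk - H_kd @ np.linalg.solve(H_dd, H_kd.T)
    b_marg = b[keep] - H_kd @ np.linalg.solve(H_dd, b[drop])
    return H_marg, b_marg

# Toy graph (assumed for illustration). State order: p0, p1, p2, l0.
factors = [
    ((0,),   (1.0,),       0.0),  # prior:       p0      = 0
    ((0, 1), (-1.0, 1.0),  1.0),  # odometry:    p1 - p0 = 1
    ((1, 2), (-1.0, 1.0),  1.0),  # odometry:    p2 - p1 = 1
    ((0, 3), (-1.0, 1.0),  5.0),  # observation: l0 - p0 = 5
    ((2, 3), (-1.0, 1.0),  3.0),  # observation: l0 - p2 = 3
]
H, b = build_information(4, factors)

# Marginalize the oldest pose p0 and keep p1, p2, l0.
keep, drop = [1, 2, 3], [0]
H_marg, b_marg = marginalize(H, b, keep, drop)

# p1 and l0 share no factor, so their block is zero before marginalization.
print("p1-l0 entry before:", H[1, 3])
# After marginalization the same pair is coupled (fill-in).
print("p1-l0 entry after: ", H_marg[0, 2])
# The reduced system still yields the same estimate for the kept states.
print("estimate of kept states:", np.linalg.solve(H_marg, b_marg))
```

In a realistic window, every state connected to a marginalized pose or landmark becomes pairwise coupled in this way, which is why fixed-lag smoothers have to keep the window small, whereas schemes that drop selected measurements before marginalizing can limit such fill-in.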
Apart from the distinction between batch and filtering approaches, increasing the estimation accuracy by studying the
observability properties of VINS has been of major interest. There is substantial work on the observability
properties given a particular combination of sensors or measurements (Martinelli, 2011; Weiss, 2012) or only using data
from a reduced set of IMU axes (Martinelli, 2014). Global unobservability of yaw and position, as well as growing
uncertainty with respect to an initial reference pose, are intrinsic to the visual-inertial estimation problem (Hesch et al.,
2012b; Huang et al., 2013; Hesch et al., 2013). This property is therefore of particular interest when comparing filtering
approaches to batch-algorithms: the representation of pose and its uncertainty in a global frame of reference usually becomes
numerically problematic as the uncertainty for parts of the state undergoes unbounded growth, while remaining low for
the observable sub-parts of the state. Our batch approach therefore uses a formulation of relative uncertainty of keyframes
to avoid expressing global uncertainty.
Unobservability of the VINS problem poses a particular challenge to filtering approaches where repeated linearization
is typically not possible: Huang et al. (2009) have shown that these linearization errors may erroneously render parts of
the estimated state numerically observable. Hesch et al. (2012a) and others (Huang et al., 2011; Kottas et al., 2012; Hesch
et al., 2012b, 2013; Huang et al., 2013) derived formulations that allow the linearization points of the VINS system
to be chosen such that the observability properties of the linearized and non-linear systems are equal. In our proposed algorithm,
we employ first-estimate Jacobians, i.e. whenever linearization of a variable is employed, we fix the linearization point for