ORB-SLAM_ a Versatile and Accurate Monocular SLAM System.pdf


ORB-SLAM: a Versatile and Accurate Monocular SLAM System (original PDF)
IEEE TRANSACTIONS ON ROBOTICS

in Section IX-B on the possible causes that can make feature-based methods more accurate than direct methods.

The loop closing and relocalization methods presented here are based on our previous work [11]. A preliminary version of the system was presented in [12]. In the current paper we add the initialization method, the Essential Graph, and perfect all methods involved. We also describe in detail all building blocks and perform an exhaustive experimental validation.

To the best of our knowledge, this is the most complete and reliable solution to monocular SLAM, and for the benefit of the community we make the source code public. Demonstration videos and the code can be found in our project webpage¹.

¹ http://webdiis.unizar.es/raulmur/orbslam

II. RELATED WORK

A. Place Recognition

The survey by Williams et al. [13] compared several approaches for place recognition and concluded that techniques based on appearance, that is image to image matching, scale better in large environments than map to map or image to map methods. Within appearance based methods, bags of words techniques [14], such as the probabilistic approach FAB-MAP [15], are to the fore because of their high efficiency. DBoW2 [5] used for the first time bags of binary words obtained from BRIEF descriptors [16] along with the very efficient FAST feature detector [17]. This reduced by more than one order of magnitude the time needed for feature extraction, compared to the SURF [18] and SIFT [19] features that were used in bags of words approaches so far. Although the system proved to be very efficient and robust, the use of BRIEF, neither rotation nor scale invariant, limited the system to in-plane trajectories and loop detection from similar viewpoints. In our previous work [11], we proposed a bag of words place recognizer built on DBoW2 with ORB [9]. ORB are binary features invariant to rotation and scale (in a certain range), resulting in a very fast recognizer with good invariance to viewpoint. We demonstrated the high recall and robustness of the recognizer in four different datasets, requiring less than 39ms (including feature extraction) to retrieve a loop candidate from a 10K image database. In this work we use an improved version of that place recognizer, using covisibility information and returning several hypotheses when querying the database instead of just the best match.

B. Map Initialization

Monocular SLAM requires a procedure to create an initial map because depth cannot be recovered from a single image. One way to solve the problem is to initially track a known structure. In the context of filtering approaches, points can be initialized with high uncertainty in depth using an inverse depth parametrization [21], which hopefully will later converge to their real positions. The recent semi-dense work of Engel et al. [10] follows a similar approach, initializing the depth of the pixels to a random value with high variance.

Initialization methods from two views either assume locally scene planarity [4], [22] and recover the relative camera pose from a homography using the method of Faugeras et al. [23], or compute an essential matrix [24], [25] that models planar and general scenes, using the five-point algorithm of Nister [26], which requires dealing with multiple solutions. Both reconstruction methods are not well constrained under low parallax and suffer from a twofold ambiguity solution if all points of a planar scene are closer to one of the camera centers [27]. On the other hand, if a non-planar scene is seen with parallax, a unique fundamental matrix can be computed with the eight-point algorithm [2] and the relative camera pose can be recovered without ambiguity.

We present in Section IV a new automatic approach based on model selection between a homography for planar scenes and a fundamental matrix for non-planar scenes. A statistical approach to model selection was proposed by Torr et al. [28]. Under a similar rationale we have developed a heuristic initialization algorithm that takes into account the risk of selecting a fundamental matrix in close to degenerate cases (i.e. planar, nearly planar, and low parallax), favoring the selection of the homography. In the planar case, for the sake of safe operation, we refrain from initializing if the solution has a twofold ambiguity, as a corrupted solution could be selected. We delay the initialization until the method produces a unique solution with significant parallax.

C. Monocular SLAM

Monocular SLAM was initially solved by filtering [20], [21], [29], [30]. In that approach every frame is processed by the filter to jointly estimate the map feature locations and the camera pose. It has the drawbacks of wasting computation in processing consecutive frames with little new information and of accumulating linearization errors. On the other hand, keyframe-based approaches [3], [4] estimate the map using only selected frames (keyframes), which allows performing more costly but accurate bundle adjustment (BA) optimizations, as mapping is not tied to frame-rate. Strasdat et al. [31] demonstrated that keyframe-based techniques are more accurate than filtering for the same computational cost.

The most representative keyframe-based SLAM system is probably PTAM by Klein and Murray [4]. It was the first work to introduce the idea of splitting camera tracking and mapping in parallel threads, and proved to be successful for real time augmented reality applications in small environments. The original version was later improved with edge features, a rotation estimation step during tracking and a better relocalization method [32]. The map points of PTAM correspond to FAST corners matched by patch correlation. This makes the points only useful for tracking but not for place recognition. In fact PTAM does not detect large loops, and the relocalization is based on the correlation of low resolution thumbnails of the keyframes, yielding a low invariance to viewpoint.

Strasdat et al. [6] presented a large scale monocular SLAM system with a front-end based on optical flow implemented on a GPU, followed by FAST feature matching and motion-only BA, and a back-end based on sliding-window BA. Loop closures were solved with a pose graph optimization with similarity constraints (7DoF), which was able to correct the scale drift appearing in monocular SLAM. From this work we take the idea of loop closing with 7DoF pose graph optimization and apply it to the Essential Graph defined in Section III-D.

Strasdat et al. [7] used the front-end of PTAM, but performed the tracking only in a local map retrieved from a covisibility graph. They proposed a double window optimization back-end that continuously performs BA in the inner window and pose graph optimization in a limited-size outer window.
However, loop closing is only effective if the size of the outer window is large enough to include the whole loop. In our system we take advantage of the excellent ideas of using a local map based on covisibility, and building the pose graph from the covisibility graph, but apply them in a totally redesigned front-end and back-end. Another difference is that, instead of using specific features for loop detection (SURF), we perform the place recognition on the same tracked and mapped features, obtaining robust frame-rate relocalization and loop detection.

Pirker et al. [33] proposed CD-SLAM, a very complete system including loop closing, relocalization, large scale operation and efforts to work on dynamic environments. However, map initialization is not mentioned. The lack of a public implementation does not allow us to perform a comparison of accuracy, robustness or large-scale capabilities.

The visual odometry of Song et al. [34] uses ORB features for tracking and a temporal sliding window BA back-end. In comparison our system is more general, as they do not have global relocalization or loop closing and do not reuse the map. They also use the known distance from the camera to the ground to limit monocular scale drift.

Lim et al. [25], work published after we submitted our preliminary version of this work [12], also use the same features for tracking, mapping and loop detection. However, the choice of BRIEF limits the system to in-plane trajectories. Their system only tracks points from the last keyframe, so the map is not reused if revisited (similar to visual odometry) and has the problem of growing unbounded. We compare our results qualitatively with this approach in Section VIII-E.

The recent work of Engel et al. [10], known as LSD-SLAM, is able to build large scale semi-dense maps, using direct methods (i.e. optimization directly over image pixel intensities) instead of bundle adjustment over features. Their results are very impressive as the system is able to operate in real time, without GPU acceleration, building a semi-dense map, with more potential applications for robotics than the sparse output generated by feature-based SLAM. Nevertheless they still need features for loop detection, and their camera localization accuracy is significantly lower than in our system and PTAM, as we show experimentally in Section VIII-B. This surprising result is discussed in Section IX-B.

Halfway between direct and feature-based methods is the semi-direct visual odometry SVO of Forster et al. [22]. Without needing to extract features in every frame, they are able to operate at high frame-rates, obtaining impressive results in quadracopters. However, no loop detection is performed and the current implementation is mainly intended for downward looking cameras.

Finally we want to discuss keyframe selection. All visual SLAM works in the literature agree that running BA with all the points and all the frames is not feasible. The work of Strasdat et al. [31] showed that the most cost-effective approach is to keep as many points as possible while keeping only non-redundant keyframes. The PTAM approach was to insert keyframes very cautiously to avoid an excessive growth of the computational complexity. This restrictive keyframe insertion policy makes the tracking fail in hard exploration conditions. Our survival of the fittest strategy achieves unprecedented robustness in difficult scenarios by inserting keyframes as quickly as possible, and later removing the redundant ones to avoid the extra cost.

Fig. 1. ORB-SLAM system overview, showing all the steps performed by the tracking, local mapping and loop closing threads. The main components of the place recognition module and the map are also shown.

III. SYSTEM OVERVIEW

A. Feature Choice

One of the main design ideas in our system is that the same features used by the mapping and tracking are used for place recognition to perform frame-rate relocalization and loop detection. This makes our system efficient and avoids the need to interpolate the depth of the recognition features from near SLAM features as in previous works [6]. We require features whose extraction takes much less than 33ms per image, which excludes the popular SIFT (300ms) [19], SURF (300ms) [18] or the recent A-KAZE (100ms) [35]. To obtain general place recognition capabilities, we require rotation invariance, which excludes BRIEF [16] and LDB [36].

We chose ORB [9], which are oriented multi-scale FAST corners with an associated 256-bit descriptor. They are extremely fast to compute and match, while they have good invariance to viewpoint. This allows matching them from wide baselines, boosting the accuracy of BA. We have already shown the good performance of ORB for place recognition in [11]. While our current implementation makes use of ORB, the techniques proposed are not restricted to these features.

B. Three Threads: Tracking, Local Mapping and Loop Closing

Our system, see an overview in Fig. 1, incorporates three threads that run in parallel: tracking, local mapping and loop closing. The tracking is in charge of localizing the camera with every frame and deciding when to insert a new keyframe. We first perform an initial feature matching with the previous frame and optimize the pose using motion-only BA. If the tracking is lost (e.g.
due to occlusions or abrupt movements), the place recognition module is used to perform a global relocalization. Once there is an initial estimation of the camera pose and feature matchings, a local visible map is retrieved using the covisibility graph of keyframes that is maintained by the system, see Fig. 2(a) and Fig. 2(b). Then matches with the local map points are searched by reprojection, and the camera pose is optimized again with all matches. Finally the tracking thread decides if a new keyframe is inserted. All the tracking steps are explained in detail in Section V. The novel procedure to create an initial map is presented in Section IV.

The local mapping processes new keyframes and performs local BA to achieve an optimal reconstruction in the surroundings of the camera pose. New correspondences for unmatched ORB in the new keyframe are searched in connected keyframes in the covisibility graph to triangulate new points. Some time after creation, based on the information gathered during the tracking, an exigent point culling policy is applied in order to retain only high quality points. The local mapping is also in charge of culling redundant keyframes. We explain in detail all local mapping steps in Section VI.

The loop closing searches for loops with every new keyframe. If a loop is detected, we compute a similarity transformation that informs about the drift accumulated in the loop. Then both sides of the loop are aligned and duplicated points are fused. Finally a pose graph optimization over similarity constraints [6] is performed to achieve global consistency. The main novelty is that we perform the optimization over the Essential Graph, a sparser subgraph of the covisibility graph which is explained in Section III-D. The loop detection and correction steps are explained in detail in Section VII.

We use the Levenberg-Marquardt algorithm implemented in g2o [37] to carry out all optimizations. In the Appendix we describe the error terms, cost functions, and variables involved in each optimization.

Fig. 2. Reconstruction and graphs in the sequence fr3_long_office_household from the TUM RGB-D Benchmark [38]. Panels: (a) KeyFrames (blue), Current Camera (green), MapPoints (black), Current Local MapPoints (red); (b) Covisibility Graph; (c) Spanning Tree (green) and Loop Closure (red); (d) Essential Graph.

C. Map Points, KeyFrames and their Selection

Each map point $p_i$ stores:
• Its 3D position $\mathbf{X}_{w,i}$ in the world coordinate system.
• The viewing direction $\mathbf{n}_i$, which is the mean unit vector of all its viewing directions (the rays that join the point with the optical center of the keyframes that observe it).
• A representative ORB descriptor $\mathbf{D}_i$, which is the associated ORB descriptor whose Hamming distance is minimum with respect to all other associated descriptors in the keyframes in which the point is observed.
• The maximum $d_{\max}$ and minimum $d_{\min}$ distances at which the point can be observed, according to the scale invariance limits of the ORB features.

Each keyframe $K_i$ stores:
• The camera pose $\mathbf{T}_{iw}$, which is a rigid body transformation that transforms points from the world to the camera coordinate system.
• The camera intrinsics, including focal length and principal point.
• All the ORB features extracted in the frame, associated or not to a map point, whose coordinates are undistorted if a distortion model is provided.

Map points and keyframes are created with a generous policy, while a later very exigent culling mechanism is in charge of detecting redundant keyframes and wrongly matched or not trackable map points. This permits a flexible map expansion during exploration, which boosts tracking robustness under hard conditions (e.g. rotations, fast movements), while its size is bounded in continual revisits to the same environment, i.e. lifelong operation. Additionally our maps contain very few outliers compared with PTAM, at the expense of containing fewer points. Culling procedures of map points and keyframes are explained in Sections VI-B and VI-E respectively.

D. Covisibility Graph and Essential Graph

Covisibility information between keyframes is very useful in several tasks of our system, and is represented as an undirected weighted graph as in [7]. Each node is a keyframe, and an edge between two keyframes exists if they share observations of the same map points (at least 15), the weight $\theta$ of the edge being the number of common map points.

In order to correct a loop we perform a pose graph optimization [6] that distributes the loop closing error along the graph. In order not to include all the edges provided by the covisibility graph, which can be very dense, we propose to build an Essential Graph that retains all the nodes (keyframes) but fewer edges, still preserving a strong network that yields accurate results. The system incrementally builds a spanning tree from the initial keyframe, which provides a connected subgraph of the covisibility graph with a minimal number of edges. When a new keyframe is inserted, it is included in the tree linked to the keyframe which shares most point observations, and when a keyframe is erased by the culling policy, the system updates the links affected by that keyframe. The Essential Graph contains the spanning tree, the subset of edges from the covisibility graph with high covisibility ($\theta_{\min} = 100$), and the loop closure edges, resulting in a strong network of cameras. Fig. 2 shows an example of a covisibility graph, spanning tree and associated essential graph. As shown in the experiments of Section VIII-E, when performing the pose graph optimization, the solution is so accurate that an additional full bundle adjustment optimization barely improves the solution. The efficiency of the essential graph and the influence of $\theta_{\min}$ are shown at the end of Section VIII-E.

E. Bags of Words Place Recognition

The system has embedded a bags of words place recognition module, based on DBoW2² [5], to perform loop detection and relocalization. Visual words are just a discretization of the descriptor space, which is known as the visual vocabulary. The vocabulary is created offline with the ORB descriptors extracted from a large set of images. If the images are general enough, the same vocabulary can be used for different environments with good performance, as shown in our previous work [11]. The system incrementally builds a database that contains an inverted index, which stores for each visual word in the vocabulary the keyframes in which it has been seen, so that querying the database can be done very efficiently. The database is also updated when a keyframe is deleted by the culling procedure.

Because there exists visual overlap between keyframes, when querying the database there will not exist a unique keyframe with a high score. The original DBoW2 took this overlap into account, adding up the score of images that are close in time. This has the limitation of not including keyframes viewing the same place but inserted at a different time. Instead we group those keyframes that are connected in the covisibility graph. In addition our database returns all keyframe matches whose scores are higher than 75% of the best score.

An additional benefit of the bags of words representation for feature matching was reported in [5]. When we want to compute the correspondences between two sets of ORB features, we can constrain the brute force matching only to those features that belong to the same node in the vocabulary tree at a certain level (we select the second out of six), speeding up the search. We use this trick when searching matches for triangulating new points, and at loop detection and relocalization. We also refine the correspondences with an orientation consistency test, see [11] for details, that discards outliers, ensuring a coherent rotation for all correspondences.

² https://github.com/dorian3d/dbow2

IV. AUTOMATIC MAP INITIALIZATION

The goal of the map initialization is to compute the relative pose between two frames to triangulate an initial set of map points. This method should be independent of the scene (planar or general) and should not require human intervention to select a good two-view configuration, i.e. a configuration with significant parallax. We propose to compute in parallel two geometrical models, a homography assuming a planar scene and a fundamental matrix assuming a non-planar scene. We then use a heuristic to select a model and try to recover the relative pose with a specific method for the selected model. Our method only initializes when it is certain that the two-view configuration is safe, detecting low-parallax cases and the well-known twofold planar ambiguity [27], and so avoids initializing a corrupted map. The steps of our algorithm are:

1) Find initial correspondences: Extract ORB features (only at the finest scale) in the current frame $F_c$ and search for matches $\mathbf{x}_c \leftrightarrow \mathbf{x}_r$ in the reference frame $F_r$. If not enough matches are found, reset the reference frame.

2) Parallel computation of the two models: Compute in parallel threads a homography $\mathbf{H}_{cr}$ and a fundamental matrix $\mathbf{F}_{cr}$:

$$\mathbf{x}_c = \mathbf{H}_{cr}\,\mathbf{x}_r, \qquad \mathbf{x}_c^T\,\mathbf{F}_{cr}\,\mathbf{x}_r = 0 \qquad (1)$$

with the normalized DLT and 8-point algorithms respectively, as explained in [2], inside a RANSAC scheme. To make the procedure homogeneous for both models, the number of iterations is prefixed and the same for both models, along with the points to be used at each iteration, 8 for the fundamental matrix and 4 of them for the homography. At each iteration we compute a score $S_M$ for each model $M$ ($H$ for the homography, $F$ for the fundamental matrix):

$$S_M = \sum_i \Big( \rho_M\big(d_{cr}^2(\mathbf{x}_c^i, \mathbf{x}_r^i, M)\big) + \rho_M\big(d_{rc}^2(\mathbf{x}_c^i, \mathbf{x}_r^i, M)\big) \Big), \qquad \rho_M(d^2) = \begin{cases} \Gamma - d^2 & \text{if } d^2 < T_M \\ 0 & \text{if } d^2 \geq T_M \end{cases} \qquad (2)$$

where $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors [2] from one frame to the other. $T_M$ is the outlier rejection threshold based on the $\chi^2$ test at 95% ($T_H = 5.99$, $T_F = 3.84$, assuming a standard deviation of 1 pixel in the measurement error). $\Gamma$ is defined equal to $T_H$ so that both models score equally for the same $d$ in their inlier region, again to make the process homogeneous.

We keep the homography and fundamental matrix with highest score. If no model could be found (not enough inliers), we restart the process again from step 1.

3) Model selection: If the scene is planar, nearly planar or there is low parallax, it can be explained by a homography. However a fundamental matrix can also be found, but the problem is not well constrained [2] and any attempt to recover the motion from the fundamental matrix would yield wrong results. We should select the homography, as the reconstruction method will correctly initialize from a plane or it will detect the low parallax case and refuse the initialization.
On the other hand, a non-planar scene with enough parallax can only be explained by the fundamental matrix, but a homography can also be found explaining a subset of the matches if they lie on a plane or they have low parallax (they are far away). In this case we should select the fundamental matrix. We have found that a robust heuristic is to compute:

$$R_H = \frac{S_H}{S_H + S_F} \qquad (3)$$

and select the homography if $R_H > 0.45$, which adequately captures the planar and low parallax cases. Otherwise, we select the fundamental matrix.

4) Motion and Structure from Motion recovery: Once a model is selected we retrieve the associated motion hypotheses. In the case of the homography we retrieve 8 motion hypotheses using the method of Faugeras et al. [23]. The method proposes cheirality tests to select the valid solution. However, these tests fail if there is low parallax, as points easily go in front of or behind the cameras, which could yield the selection of a wrong solution. We propose to directly triangulate the eight solutions and check if there is one solution with most points seen with parallax, in front of both cameras and with low reprojection error. If there is not a clear winner solution, we do not initialize and continue from step 1. This technique to disambiguate the solutions makes our initialization robust under low parallax and the twofold ambiguity configuration, and could be considered the key to the robustness of our method.

In the case of the fundamental matrix, we convert it into an essential matrix using the calibration matrix $\mathbf{K}$:

$$\mathbf{E}_{rc} = \mathbf{K}^T\,\mathbf{F}_{rc}\,\mathbf{K} \qquad (4)$$

and then retrieve 4 motion hypotheses with the singular value decomposition method explained in [2]. We triangulate the four solutions and select the reconstruction as done for the homography.

5) Bundle adjustment: Finally we perform a full BA, see the Appendix for details, to refine the initial reconstruction.

An example of a challenging initialization in the outdoor New College robot sequence [39] is shown in Fig. 3. It can be seen how PTAM and LSD-SLAM have initialized all points in a plane, while our method has waited until there is enough parallax, initializing correctly from the fundamental matrix.

Fig. 3. Top: PTAM, middle: LSD-SLAM, bottom: ORB-SLAM, some time after initialization in the New College sequence [39]. PTAM and LSD-SLAM initialize a corrupted planar solution while our method has automatically initialized from the fundamental matrix when it has detected enough parallax. Depending on which keyframes are manually selected, PTAM is also able to initialize well.

V. TRACKING

In this section we describe the steps of the tracking thread that are performed with every frame from the camera. The camera pose optimizations, mentioned in several steps, consist of motion-only BA, which is described in the Appendix.

A. ORB Extraction

We extract FAST corners at 8 scale levels with a scale factor of 1.2. For image resolutions from 512 × 384 to 752 × 480 pixels we found it suitable to extract 1000 corners; for higher resolutions, such as the 1241 × 376 in the KITTI dataset [40], we extract 2000 corners. In order to ensure a homogeneous distribution we divide each scale level in a grid, trying to extract at least 5 corners per cell. Then we detect corners in each cell, adapting the detector threshold if not enough corners are found. The number of corners retained per cell is also adapted if some cells contain no corners (textureless or low contrast). The orientation and ORB descriptor are then computed on the retained FAST corners. The ORB descriptor is used in all feature matching, in contrast to the search by patch correlation in PTAM.

B. Initial Pose Estimation from Previous Frame

If tracking was successful for the last frame, we use a constant velocity motion model to predict the camera pose and perform a guided search of the map points observed in the last frame. If not enough matches were found (i.e. the motion model is clearly violated), we use a wider search of the map points around their position in the last frame. The pose is then optimized with the found correspondences.

C. Initial Pose Estimation via Global Relocalization

If the tracking is lost, we convert the frame into a bag of words and query the recognition database for keyframe candidates for global relocalization. We compute correspondences with ORB associated to map points in each keyframe, as explained in Section III-E. We then alternately perform RANSAC iterations for each keyframe and try to find a camera pose using the PnP algorithm [41]. If we find a camera pose with enough inliers, we optimize the pose and perform a guided search of more matches with the map points of the candidate keyframe. Finally the camera pose is again optimized, and if it is supported by enough inliers, the tracking procedure continues.

D. Track Local Map

Once we have an estimation of the camera pose and an initial set of feature matches, we can project the map into the frame and search more map point correspondences. To bound the complexity in large maps, we only project a local map. This local map contains the set of keyframes $\mathcal{K}_1$ that share map points with the current frame, and a set $\mathcal{K}_2$ with neighbors to the keyframes $\mathcal{K}_1$ in the covisibility graph. The local map also has a reference keyframe $K_{ref} \in \mathcal{K}_1$ which shares most map points with the current frame. Now each map point seen in $\mathcal{K}_1$ and $\mathcal{K}_2$ is searched in the current frame as follows:

1) Compute the map point projection $\mathbf{x}$ in the current frame. Discard if it lies outside the image bounds.
2) Compute the angle between the current viewing ray $\mathbf{v}$ and the map point mean viewing direction $\mathbf{n}$. Discard if $\mathbf{v} \cdot \mathbf{n} < \cos(60^\circ)$.
3) Compute the distance $d$ from map point to camera center. Discard if it is out of the scale invariance region of the map point, $d \notin [d_{\min}, d_{\max}]$.
4) Compute the scale in the frame by the ratio $d / d_{\min}$.
5) Compare the representative descriptor $\mathbf{D}$ of the map point with the still unmatched ORB features in the frame, at the predicted scale, and near $\mathbf{x}$, and associate the map point with the best match.

The camera pose is finally optimized with all the map points found in the frame.

E. New Keyframe Decision

The last step is to decide if the current frame is spawned as a new keyframe. As there is a mechanism in the local mapping to cull redundant keyframes, we will try to insert keyframes as fast as possible, because that makes the tracking more robust to challenging camera movements, typically rotations. To insert a new keyframe all the following conditions must be met:

1) More than 20 frames must have passed from the last global relocalization.
2) Local mapping is idle, or more than 20 frames have passed from the last keyframe insertion.
3) Current frame tracks at least 50 points.
4) Current frame tracks less than 90% of the points of $K_{ref}$.

Instead of using a distance criterion to other keyframes as PTAM does, we impose a minimum visual change (condition 4). Condition 1 ensures a good relocalization and condition 3 a good tracking. If a keyframe is inserted when the local mapping is busy (second part of condition 2), a signal is sent to stop local bundle adjustment, so that it can process the new keyframe as soon as possible.

VI. LOCAL MAPPING

In this section we describe the steps performed by the local mapping with every new keyframe $K_i$.

A. KeyFrame Insertion

At first we update the covisibility graph, adding a new node for $K_i$ and updating the edges resulting from the shared map points with other keyframes. We then update the spanning tree, linking $K_i$ with the keyframe with most points in common. We then compute the bags of words representation of the keyframe, which will help in the data association for triangulating new points.

B. Recent Map Points Culling

Map points, in order to be retained in the map, must pass a restrictive test during the first three keyframes after creation, which ensures that they are trackable and not wrongly triangulated, i.e. due to spurious data association. A point must fulfill these two conditions:

1) The tracking must find the point in more than 25% of the frames in which it is predicted to be visible.
2) If more than one keyframe has passed from map point creation, it must be observed from at least three keyframes.

Once a map point has passed this test, it can only be removed if at any time it is observed from fewer than three keyframes. This can happen when keyframes are culled and when local bundle adjustment discards outlier observations. This policy makes our map contain very few outliers.

C. New Map Point Creation

New map points are created by triangulating ORB from connected keyframes $\mathcal{K}_c$ in the covisibility graph. For each unmatched ORB in $K_i$ we search for a match with another unmatched point in another keyframe. This matching is done as explained in Section III-E, and we discard those matches that do not fulfill the epipolar constraint. ORB pairs are triangulated, and to accept the new points, positive depth in both cameras, parallax, reprojection error and scale consistency are checked. Initially a map point is observed from two keyframes but it could be matched in others, so it is projected in the rest of connected keyframes, and correspondences are searched as detailed in Section V-D.

D. Local Bundle Adjustment

The local BA optimizes the currently processed keyframe $K_i$, all the keyframes connected to it in the covisibility graph $\mathcal{K}_c$, and all the map points seen by those keyframes. All other keyframes that see those points but are not connected to the currently processed keyframe are included in the optimization but remain fixed. Observations that are marked as outliers are discarded at the middle and at the end of the optimization. See the Appendix for more details about this optimization.

E. Local Keyframe Culling

In order to maintain a compact reconstruction, the local mapping tries to detect redundant keyframes and delete them. This is beneficial as bundle adjustment complexity grows with the number of keyframes, but also because it enables lifelong operation in the same environment, as the number of keyframes will not grow unbounded unless the visual content in the scene changes. We discard all the keyframes in $\mathcal{K}_c$ for which 90% of the map points have been seen in at least three other keyframes in the same or finer scale. The scale condition ensures that map points maintain keyframes from which they are measured with most accuracy. This policy was inspired by the one proposed in the work of Tan et al. [24], where keyframes were discarded after a process of change detection.

VII. LOOP CLOSING

The loop closing thread takes $K_i$, the last keyframe processed by the local mapping, and tries to detect and close loops. The steps are described next.

A. Loop Candidates Detection

At first we compute the similarity between the bag of words vector of $K_i$ and all its neighbors in the covisibility graph ($\theta_{\min} = 30$) and retain the lowest score $s_{\min}$. Then we query the recognition database and discard all those keyframes whose score is lower than $s_{\min}$. This is a similar operation to gain robustness as the normalizing score in DBoW2, which is computed from the previous image, but here we use covisibility information. In addition, all those keyframes directly connected to $K_i$ are discarded from the results. To accept a loop candidate we must consecutively detect three loop candidates that are consistent (keyframes connected in the covisibility graph). There can be several loop candidates if there are several places with similar appearance to $K_i$.

B. Compute the Similarity Transformation

[...] a guided search of more correspondences. We optimize it again and, if $\mathbf{S}_{il}$ is supported by enough inliers, the loop with $K_l$ is accepted.

C. Loop Fusion

The first step in the loop correction is to fuse duplicated map points and insert new edges in the covisibility graph that will attach the loop closure. At first the current keyframe pose $\mathbf{T}_{iw}$ is corrected with the similarity transformation $\mathbf{S}_{il}$ and this correction is propagated to all the neighbors of $K_i$, concatenating transformations, so that both sides of the loop get aligned. All map points seen by the loop keyframe and its neighbors are projected into $K_i$ and its neighbors, and matches are searched in a narrow area around the projection, as done in Section V-D. All those map points matched and those that were inliers in the computation of $\mathbf{S}_{il}$ are fused. All keyframes involved in the fusion will update their edges in the covisibility graph, effectively creating edges that attach the loop closure.

D. Essential Graph Optimization

To effectively close the loop, we perform a pose graph optimization over the Essential Graph, described in Section III-D, that distributes the loop closing error along the graph. The optimization is performed over similarity transformations to correct the scale drift [6]. The error terms and cost function are detailed in the Appendix. After the optimization each map point is transformed according to the correction of one of the keyframes that observes it.

VIII. EXPERIMENTS

We have performed an extensive experimental validation of our system in the large robot sequence of New College [39], evaluating the general performance of the system; in 16 hand-held indoor sequences of the TUM RGB-D benchmark [38], evaluating the localization accuracy, relocalization and lifelong capabilities; and in 10 car outdoor sequences from the KITTI dataset [40], evaluating real-time large scale operation, localization accuracy and efficiency of the pose graph optimization.

Our system runs in real time and processes the images exactly at the frame rate they were acquired. We have carried out all experiments with an Intel Core i7-4700MQ (4 cores @ 2.40GHz) and 8Gb RAM.
ORB-SLAM has three main In monocular slAM there are seven degrees of freedom in which the map can drift, three translations, three rotations threads, that run in parallel with other tasks from ROS and the operating system, which introduces some randomness in and a scale factor [6]. Therefore to close a loop we need to the results. For this reason, in some experiments, we report compute a similarity transformation from the current keyframe Ki to the loop keyframe Ki that inf the median from several runs at the error accumulated in the loop. The computation of this similarity will serve also as geometrical validation of the loop A. System Performance in the New College dataset We first compute correspondences between ORB associated The NewCollege dataset [39] contains a 2. 2km sequence to map points in the current keyframe and the loop candidate from a robot traversing a campus and adjacent parks. The keyframes, following the procedure explained in section III-E. sequence is recorded by a stereo camera at 20 fps and a resolu At this point we have 3D to 3D correspondences for each tion 512 X 382. It contains several loops and fast rotations that loop candidate. We alternatively perform RANSAC iterations makes the sequence quite challenging for monocular vision with each candidate, trying to find a similarity transformation To the best of our knowledge there is no other monocular using the method of Horn [42]. If we find a similarity Si with system in the literature able to process this whole sequence enough inliers, we optimize it(see the Appendix), and perform For example Strasdat et al. 
[7], despite being able to close loops and work in large scale environments, only showed monocular results for a small part of this sequence.

As an example of our loop closing procedure we show in Fig. 4 the detection of a loop with the inliers that support the similarity transformation. Fig. 5 shows the reconstruction before and after the loop closure. The local map is shown in red; after the loop closure it extends along both sides of the loop. The whole map after processing the full sequence at its real frame rate is shown in Fig. 6. The big loop on the right does not perfectly align because it was traversed in opposite directions and the place recognizer was not able to find loop closures.

Fig. 4. Example of loop detected in the New College sequence. We draw the inlier correspondences supporting the similarity transformation found.

Fig. 5. Map before and after a loop closure in the New College sequence. The loop closure match is drawn in blue, the trajectory in green, and the local map for the tracking at that moment in red. The local map is extended along both sides of the loop after it is closed.

We have extracted statistics of the times spent by each thread in this experiment. Table I shows the results for the tracking and the local mapping. Tracking works at frame rates around 25-30 Hz, with tracking the local map being the most demanding task. If needed, this time could be reduced by limiting the number of keyframes that are included in the local map. In the local mapping thread the most demanding task is local bundle adjustment. The local BA time varies depending on whether the robot is exploring or in a well-mapped area, because during exploration bundle adjustment is interrupted if tracking inserts a new keyframe, as explained in Section V-E. When no new keyframes are needed, local bundle adjustment performs a generous number of prefixed iterations.

TABLE I
TRACKING AND MAPPING TIMES IN NEWCOLLEGE
Thread        | Operation          | Median (ms) | Mean (ms) | Std (ms)
TRACKING      | ORB extraction     | [...]       | [...]     | 1.61
TRACKING      | Initial Pose Est.  | [...]       | 3.45      | [...]
TRACKING      | Track Local Map    | 14.84       | 16.01     | [...]
TRACKING      | Total              | 30.57       | 31.60     | 10.39
LOCAL MAPPING | KeyFrame Insertion | [...]       | [...]     | 5.03
LOCAL MAPPING | Map Point Culling  | 0.10        | 3.18      | 6.70
LOCAL MAPPING | Map Point Creation | 66.79       | 72.96     | 31.48
LOCAL MAPPING | Local BA           | 296.08      | 360.41    | 171.11
LOCAL MAPPING | KeyFrame Culling   | 8.07        | 15.79     | 18.98
LOCAL MAPPING | Total              | 383.59      | 464.27    | 217.89

TABLE II
LOOP CLOSING TIMES IN NEWCOLLEGE
Loop | KeyFrames | Essential Graph Edges | Loop Detection (ms): Candidates Detection | Loop Detection (ms): Similarity Transformation | Loop Correction (s): Fusion | Loop Correction (s): Essential Graph Optimization | Total (s)
1    | 287       | 1347                  | 4.71  | [...] | 0.20  | 0.26  | 0.51
2    | [...]     | [...]                 | 4.14  | [...] | [...] | 1.06  | 1.52
3    | 1279      | 7128                  | 9.82  | 31.29 | 0.95  | 1.26  | 2.27
4    | 2648      | 12547                 | 12.37 | [...] | [...] | 2.30  | 3.33
5    | 3150      | 16033                 | 14.71 | [...] | [...] | [...] | 4.60
6    | [...]     | 21797                 | 13.52 | 48.68 | 0.97  | [...] | [...]

Table II shows the results for each of the 6 loop closures found. It can be seen how the loop detection time increases sublinearly with the number of keyframes. This is due to [...]
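
The visibility tests of Section V-D (Track Local Map, steps 1 to 4) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pinhole camera model, the `MapPoint` and `Camera` containers, the function name `predict_in_frame`, and the convention that both the viewing ray and the mean viewing direction point from the camera towards the point are all assumptions made for illustration. Step 5 (descriptor matching) is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class MapPoint:
    position: tuple       # 3D position in world coordinates
    mean_view_dir: tuple  # unit mean viewing direction n (assumed camera -> point)
    d_min: float          # scale invariance region, lower bound
    d_max: float          # scale invariance region, upper bound


@dataclass
class Camera:             # hypothetical pinhole camera, for illustration only
    R: list               # 3x3 world-to-camera rotation (nested lists)
    center: tuple         # camera center in world coordinates
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def predict_in_frame(mp, cam, max_angle_deg=60.0):
    """Apply tests 1-4; return ((u, v), scale_ratio) or None if discarded."""
    # Vector from the camera center to the map point, in world coordinates.
    ray = [mp.position[i] - cam.center[i] for i in range(3)]
    d = math.sqrt(sum(c * c for c in ray))
    if d == 0.0:
        return None

    # 1) Project into the image; discard if behind the camera or out of bounds.
    pc = [sum(cam.R[r][i] * ray[i] for i in range(3)) for r in range(3)]
    if pc[2] <= 0.0:
        return None
    u = cam.fx * pc[0] / pc[2] + cam.cx
    v = cam.fy * pc[1] / pc[2] + cam.cy
    if not (0.0 <= u < cam.width and 0.0 <= v < cam.height):
        return None

    # 2) Discard if v . n < cos(60 deg), with v the unit viewing ray.
    cos_angle = sum(ray[i] * mp.mean_view_dir[i] for i in range(3)) / d
    if cos_angle < math.cos(math.radians(max_angle_deg)):
        return None

    # 3) Discard if d is outside the scale invariance region [d_min, d_max].
    if not (mp.d_min <= d <= mp.d_max):
        return None

    # 4) Scale in the frame, given by the ratio d / d_min.
    return (u, v), d / mp.d_min
```

A point that passes all tests yields the pixel location around which ORB features would be searched, together with the ratio d/dmin used to predict the scale at which to compare descriptors.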

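
The keyframe insertion conditions of Section V-E and the culling rule of Section VI-E are simple enough to write down as predicates. The sketch below is only an illustration under stated assumptions: the function names, the `observations` dictionary layout, the use of pyramid levels to stand for scale (lower level taken as finer), and the inclusive 90% comparison are hypothetical choices, not details confirmed by the text.

```python
def should_insert_keyframe(frames_since_reloc, frames_since_last_kf,
                           local_mapping_idle, tracked, tracked_in_ref):
    """Return True only if all four conditions of Section V-E hold."""
    cond1 = frames_since_reloc > 20                        # 1) after relocalization
    cond2 = local_mapping_idle or frames_since_last_kf > 20  # 2) mapping idle or old keyframe
    cond3 = tracked >= 50                                  # 3) enough tracked points
    cond4 = tracked < 0.9 * tracked_in_ref                 # 4) minimum visual change vs Kref
    return cond1 and cond2 and cond3 and cond4


def is_redundant_keyframe(kf_id, observations, min_other_kfs=3, ratio=0.9):
    """Culling rule of Section VI-E.

    observations: {point_id: {keyframe_id: pyramid_level}} (assumed layout).
    A lower pyramid level is assumed to mean a finer scale.
    """
    own_points = [obs for obs in observations.values() if kf_id in obs]
    if not own_points:
        return False
    covered = 0
    for obs in own_points:
        level = obs[kf_id]
        # Count other keyframes that see the point at the same or finer scale.
        others = sum(1 for other, other_level in obs.items()
                     if other != kf_id and other_level <= level)
        if others >= min_other_kfs:
            covered += 1
    # Redundant if at least 90% of its points are covered elsewhere.
    return covered >= ratio * len(own_points)
```

Keeping the two rules separate mirrors the design described in the text: insertion is deliberately permissive so that tracking stays robust, and the culling rule later removes the keyframes that turn out to be redundant.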