b) Prevent the mapping algorithm from including
moving objects as part of the 3D map.
2) How to complete the part of the 3D map that is temporarily occluded by a moving object.
Many applications would greatly benefit from progress along these lines, among others augmented reality, autonomous vehicles, and medical imaging. All of them could, for instance, safely reuse maps from previous runs. Detecting and dealing with dynamic objects is a prerequisite for estimating stable maps, useful for long-term applications. If the dynamic content is not detected, it becomes part of the 3D map, complicating its usability for tracking or relocalization purposes.
In this work we propose an online algorithm to deal with dynamic objects in RGB-D, stereo, and monocular SLAM. This is done by adding a front-end stage to the state-of-the-art ORB-SLAM2 system [1], with the purpose of achieving more accurate tracking and a reusable map of the scene. In the monocular and stereo cases, our proposal is to use a CNN to segment, pixel-wise, the a priori dynamic objects in the frames (e.g., people and cars), so that the SLAM algorithm does not extract features on them. In the RGB-D case, we propose to combine multi-view geometry models and deep-learning-based algorithms for detecting dynamic objects and, after removing them from the images, inpainting the occluded background with the correct information of the scene (Fig. 1).
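To make this front end concrete, the following minimal sketch (Python with OpenCV and NumPy; an illustration of the idea, not the actual DynaSLAM implementation, which builds on ORB-SLAM2) shows how a pixel-wise mask of a priori dynamic content can keep features off people and cars. The dynamic_mask input is assumed to come from the segmentation CNN, with nonzero values on dynamic pixels.

import cv2
import numpy as np

def extract_static_orb_features(image, dynamic_mask, n_features=2000):
    # OpenCV detectors accept a mask and search for features only where
    # it is nonzero, so we invert the CNN's dynamic mask to restrict
    # detection to the static parts of the image.
    static_mask = np.where(dynamic_mask > 0, 0, 255).astype(np.uint8)
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(image, static_mask)
    return keypoints, descriptors

Since no features are ever extracted on the masked regions, matches on dynamic objects never reach the tracking and mapping back end.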
The rest of the paper is structured as follows: Section II discusses related work, Section III gives the details of our proposal, Section IV presents the experimental results, and Section V draws the conclusions and outlines future work.
II. RELATED WORK
Dynamic objects are, in most SLAM systems, classified as
spurious data and therefore neither included in the map nor
used for camera tracking. The most typical outlier rejection
algorithms are RANSAC (e.g., in ORB-SLAM [3], [1]) and
robust cost functions (e.g., in PTAM [2]).
There are several SLAM systems that specifically address dynamic scene content. Within feature-based SLAM methods, some of the most relevant are:
• Tan et al. [9] detect changes that take place in the scene
by projecting the map features into the current frame for
appearance and structure validation.
• Wangsiripitak and Murray [10] track known 3D dynamic objects in the scene. Similarly, Riazuelo et al. [11] deal with human activity by detecting and tracking people.
• More recently, the work of Li and Lee [12] uses depth edge points, which have an associated weight indicating their probability of belonging to a dynamic object.
Direct methods are, in general, more sensitive to dynamic
objects in the scene. The most relevant works specifically
designed for dynamic scenes are:
• Alcantarilla et al. [13] detect moving objects by means
of a scene flow representation with stereo cameras.
• Wang and Huang [14] segment the dynamic objects in
the scene using RGB optical flow.
• Kim et al. [15] propose to obtain the static parts of the
scene by computing the difference between consecutive
depth images projected over the same plane.
• Sun et al. [16] calculate the difference in intensity between consecutive RGB images; pixel classification is then done via segmentation of the quantized depth image.
All the methods, both feature-based and direct ones, that map the static scene parts only from the information contained in the sequence [1], [3], [9], [12], [13], [14], [15], [16], [17] fail to estimate lifelong models when an a priori dynamic object remains static, e.g., parked cars or people sitting. On the other hand, Wangsiripitak and Murray [10], and Riazuelo et al. [11] would detect those a priori dynamic objects, but would fail to detect changes produced by static objects, e.g., a chair a person is pushing, or a ball that someone has thrown. That is, the former approaches succeed in detecting moving objects, and the latter in detecting certain movable objects. Our proposal, DynaSLAM, combines multi-view geometry and deep learning in order to address both situations. Similarly, Ambrus et al. [18] segment dynamic objects by combining a dynamic classifier and multi-view geometry.
III. SYSTEM DESCRIPTION
Fig. 2 shows an overview of our system. First of all, the RGB channels pass through a CNN that segments out, pixel-wise, all the a priori dynamic content, e.g., people or vehicles.
In the RGB-D case, we use multi-view geometry to improve the dynamic content segmentation in two ways. First, we refine the segmentation of the dynamic objects previously obtained by the CNN. Second, we label as dynamic new object instances that are static most of the time, i.e., we detect moving objects that were not set to movable in the CNN stage.
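As an illustration of this geometric test, the sketch below (a simplified version of the idea, not the system's exact criterion) reprojects the 3D points of an overlapping keyframe into the current frame and flags pixels whose measured depth disagrees with the predicted one. The tolerance tol is a hypothetical value; K and T_curr_kf denote the camera intrinsics and the keyframe-to-current relative pose.

import numpy as np

def flag_dynamic_points(points_kf, T_curr_kf, K, depth_curr, tol=0.4):
    # Transform the keyframe's 3D points (N x 3) into the current camera.
    pts = T_curr_kf[:3, :3] @ points_kf.T + T_curr_kf[:3, 3:4]
    pts = pts[:, pts[2] > 0]  # keep points in front of the camera
    # Project with the pinhole model.
    u = np.round(K[0, 0] * pts[0] / pts[2] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[1] / pts[2] + K[1, 2]).astype(int)
    h, w = depth_curr.shape
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z_proj = u[ok], v[ok], pts[2][ok]
    z_meas = depth_curr[v, u]
    # A large depth residual means the pixel is now occupied by content
    # that was not there when the keyframe was taken, i.e., it moved.
    moved = (z_meas > 0) & (np.abs(z_meas - z_proj) > tol)
    return u[moved], v[moved]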
For this geometric check it is necessary to know the camera pose, for which a low-cost tracking module has been implemented to localize the camera within the already created scene map.
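As a rough stand-in for such low-cost tracking, camera localization from 2D-3D matches between the segmented frame's static keypoints and the map points can be written with OpenCV's RANSAC-based PnP solver, where RANSAC additionally rejects residual dynamic matches. The function below is an illustrative sketch, and the reprojection threshold is an assumed value, not the system's actual setting.

import cv2
import numpy as np

def track_camera(map_points_3d, matched_pixels_2d, K):
    # Estimate the world-to-camera pose from 2D-3D correspondences.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(map_points_3d, dtype=np.float64),
        np.asarray(matched_pixels_2d, dtype=np.float64),
        K, distCoeffs=None, reprojectionError=3.0)
    if not ok:
        return None  # tracking lost for this frame
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T, inliers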
These segmented frames are the ones used to obtain the camera trajectory and the map of the scene. Notice that if the moving objects in the scene are not within the CNN classes, the multi-view geometry stage would still detect the dynamic content, but the accuracy might decrease.
Once this full dynamic object detection and the localization of the camera have been done, we aim to reconstruct the occluded background of the current frame with static information from previous views. These synthetic frames are relevant for applications like augmented and virtual reality, and for place recognition in lifelong mapping.
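Assuming registered RGB-D keyframes with known poses, a basic version of this reconstruction forward-warps the color of an earlier static view into the current camera and writes it only into the masked (previously dynamic) pixels. The sketch below ignores occlusion ordering (no z-buffer) and interpolation, which a complete implementation would need.

import numpy as np

def inpaint_from_keyframe(rgb_kf, depth_kf, T_curr_kf, K, frame_curr, hole_mask):
    h, w = depth_kf.shape
    # Back-project every valid depth pixel of the keyframe to 3D.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_kf.ravel()
    good = z > 0
    pts = np.stack([(u.ravel() - K[0, 2]) * z / K[0, 0],
                    (v.ravel() - K[1, 2]) * z / K[1, 1],
                    z])[:, good]
    colors = rgb_kf.reshape(-1, 3)[good]
    # Move the points into the current camera and project them.
    pts = T_curr_kf[:3, :3] @ pts + T_curr_kf[:3, 3:4]
    front = pts[2] > 0
    pts, colors = pts[:, front], colors[front]
    u_c = np.round(K[0, 0] * pts[0] / pts[2] + K[0, 2]).astype(int)
    v_c = np.round(K[1, 1] * pts[1] / pts[2] + K[1, 2]).astype(int)
    inside = (u_c >= 0) & (u_c < w) & (v_c >= 0) & (v_c < h)
    u_c, v_c, colors = u_c[inside], v_c[inside], colors[inside]
    # Only fill pixels that the dynamic-object mask left empty.
    fill = hole_mask[v_c, u_c]
    out = frame_curr.copy()
    out[v_c[fill], u_c[fill]] = colors[fill]
    return out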
In the monocular and stereo cases, the images are segmented by the CNN so that keypoints belonging to the a priori dynamic objects are neither tracked nor mapped.
All the different stages are described in depth in the next
subsections (III-A to III-E).