2
be detected. Common feature tracks scattered over these
subsequences can also be reliably matched.
Our ENFT method can help reduce estimation errors
for long loopback sequences. Given limited memory, it is
generally intractable to use global bundle adjustment to
refine camera poses and 3D points for very long sequences.
Iteratively applying local bundle adjustment is difficult to
effectively distribute estimation errors to all frames. We
address this issue by adopting a segment-based coarse-to-
fine SfM estimation algorithm, which globally optimizes
structure and motion only requiring limited memory.
Based on our ENFT algorithm and segment-based
coarse-to-fine estimation scheme, we present a novel SfM
system called ENFT-SFM, which can effectively handle
long loopback sequences and even multiple sequences.
Fig. 1 shows a challenging example containing 6 sequences
with about 95, 476 frames in total in a large-scale scene.
Our system first splits them to 37 shorter sequences, then
quickly computes many long and accurate feature tracks, ef-
ficiently estimates camera trajectories in different sequences
and accurately registers them in a unified 3D system, as
shown in Fig. 1(b). The whole process only takes about 90
minutes (excluding I/O) on a desktop PC, i.e., 17.7 FPS on
average. Our supplementary video
1
contains the complete
result.
A preliminary conference version paper was presented
in [51]. In this manuscript, we have made a number of
major modifications to improve robustness and efficiency.
Particularly, we altered the second-pass matching by for-
mulating it as minimizing an energy function incorpo-
rating two geometric constraints. We have developed an
enhanced non-consecutive track matching algorithm, which
can significantly reduce the matching time and robustly
eliminate outliers. Finally, we proposed a novel segment-
based coarse-to-fine SfM method, which can handle large
sequence datasets with only limited memory.
II. RELATED WORK
We review feature tracking and large-scale SfM methods
in this section.
A. Feature Matching and Tracking
For video tracking, sequential matchers are used for
establishing correspondences between consecutive frames.
Kanade-Lucas-Tomasi (KLT) tracker [27], [38], [50] is
widely used for small baseline matching. Other methods
detect image features and match them considering local
image patches [32], [35] or advanced descriptors [26], [29],
[28].
Both the KLT tracker and invariant feature algorithms
depend on modeling feature appearance, and can be dis-
tracted by occlusion, similar structures, and noise. Gen-
erally, sequential matchers cannot match non-consecutive
frames under image transformation. Scale-invariant feature
1
http://www.cad.zju.edu.cn/home/gfzhang/projects/tracking
/featuretracking/featuretracking.wmv
detection and matching algorithms [26], [3] are effective
in matching images with large transformation. But they
generally produce many short tracks in consecutive point
tracking due primarily to the global indistinctiveness and
feature dropout problems. In addition, invariant features are
relatively sensitive to image distortion. Although variations,
such as ASIFT [30], can improve matching performance
under substantial viewpoint change, computation overhead
significantly increases owing to exhaustive viewpoint sim-
ulation.
In this paper, we propose a novel two-pass matching
method to address this problem. In [7], memory-based
tracking method was used to extend feature trajectories,
by matching each frame to its neighbors. However, if an
object re-enters the field-of-view after a long period of
time, the size of the neighboring windows has to be very
large and the computation becomes expensive. Besides, this
method cannot cope with multiple videos. Our method does
not have this limitation, and the computation complexity is
approximately linear to the number of processed frames.
There are methods using invariant features for object
and location recognition in images/videos [41], [36], [18],
[37], [19]. These methods typically use the bag-of-words
technique to perform global localization and loop-closure
detection in an image classification framework. For lo-
cation recognition, Nist
´
er and Stew
´
enius [33] proposed
using the feature descriptors to construct a vocabulary
tree, and computing an appearance vector for each input
image. Exhaustively comparing all image pairs is still time
consuming for a long sequence. Cummins and Newman [9]
proposed clustering similar images as a location, such that
the computation can be reduced to only comparing the
input image with previously visited locations. This method
reduces the number of frames to be compared, but could
perform less satisfyingly if consecutive frames have large
overlaps.
In addition, these methods divide the location recognition
and non-consecutive feature matching into two separated
phases [24], [6], [10], [17]. Because the match matrix by
bag-of-words only roughly reflects the match confidence,
completely trusting it may lose many common features.
In this paper, we introduce a novel strategy where the
match matrix can be refined and updated along with non-
consecutive feature matching. Our method can reliably and
efficiently match the common features even with a coarse
match matrix.
Engels et al. [11] proposed integrating wide-baseline
local features with the tracked ones to improve SfM. The
method creates small and independent submaps and links
them via feature recognition. This approach also cannot
produce many long and accurate point tracks. Short tracks
are not enough for drift-free SfM estimation. In compar-
ison, our method is effective in high-quality point track
estimation. We also address the ubiquitous nondistinctive
feature matching problem in dense frames. Similar to
[15], we utilize track descriptors, instead of the feature
descriptors, to reduce computation redundancy.
Wu et al. [49] proposed using dense 3D geometry infor-