• Achieving an order-of-magnitude speed-up compared to state-of-the-art 3D recognition algorithms.
• Removing all requirements for prior object segmentation or detector training needed by other algorithms.
2. Related Work
Existing 3D recognition methods usually require segmentation or detector training, and they are slow due to the complexity of searching in 3D space. Methods for object recognition in urban street data often require segmentation of objects from the ground [10, 12, 13, 15]. A set of object types is then defined to train either a global detector or a set of local descriptors.
Golovinskiy et al. [12] extend the targets to over twenty types of street objects, using classifiers trained with global features, but require the scene to be pre-processed based on ground estimation so that candidate objects are segmented before recognition is applied. Pang et al. [14] employ AdaBoost to train detectors from a combination of weighted 3D Haar-like features and exhaustively search for objects in 3D space, thus avoiding the requirement for segmentation. However, this method only handles limited rotation changes. Song et al. [17] use depth maps for object detection, scanning a 3D detector through 3D space. This shares some similarity with our depth-based projections, but their method focuses on RGB-D data rather than point clouds, and it is very time-consuming due to the extensive cost of detector training.
Local 3D shape descriptors are frequently used by existing methods. Most popular are spin images (SI) [1], which encode surface properties in a local object-oriented coordinate system, as well as others such as the 3D shape context [9], fast point feature histogram [16], signature of histograms of orientations [2], and unique shape context [3]. Several surveys
compare these 3D descriptors in more detail [11, 18, 19].
A few others focus on improving descriptor matching [6, 7,
8, 15]. However, 3D descriptor-based recognition methods require prior segmentation of background points, as well as descriptor computation and matching in 3D space; these time-consuming processes make such methods inefficient.
The strategy of reducing a 3D problem to 2D space has also been employed for 3D object retrieval. Chen et al. [20] use 2D shapes and silhouettes to retrieve 3D object mesh models. Ohbuchi et al. [21] extract 2D multi-scale local features from range images to aid 3D object model retrieval. Shang and Greenspan [22] use view-sphere sampling to extract features from the minima of the error surfaces for 3D recognition. Aubry et al. [23] also apply
the idea of 3D-to-2D to align 3D CAD chair models to 2D
images with trained mid-level visual elements. However,
these methods focus on matching individual, clean object mesh models, a much simpler task than handling the unsegmented, noisy, large-scale 3D point clouds in our case, which have vastly greater complexity.
Figure 2. Flow of the proposed algorithm: first project the 3D point clouds into 2D images from multiple views, then detect objects in each view separately, and finally re-project all 2D results back into 3D for a fused 3D object location estimate.
3. Algorithm Introduction
3.1. Multi-View Projection
The core idea of our recognition algorithm is to transform a 3D detection problem into a series of 2D detection problems, thereby reducing the complexity of an exhaustive 3D search to a fixed number of 2D searches. This is achieved by projecting the 3D point cloud from multiple viewpoints to decompose it into a series of 2D images, which works like the reverse of multi-view stereo reconstruction [24], where 2D images from multiple viewpoints are fused to reconstruct 3D information. To ensure that the original 3D information is not lost, the 3D-to-2D projection is performed at multiple viewing angles (evenly chosen on a sphere). Depth information is utilized when projecting the 2D image for each view, and is kept for later re-projection back into 3D to fuse the 2D results. As shown in the algorithm flow in fig. 2, after the input 3D point cloud is projected into 2D images from multiple views, each view is used to locate the target object. Lastly, all 2D detection results are re-projected back into 3D space for a fused 3D object location estimate.
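To make the projection step concrete, the following minimal sketch (Python with NumPy) illustrates how viewing directions can be evenly sampled on a sphere and how a point cloud can be rendered into a per-view depth image. The function names (sphere_viewpoints, view_basis, project_depth), the Fibonacci-spiral sampling, and the orthographic camera model are our own illustrative assumptions and not the paper's exact implementation.

import numpy as np

def sphere_viewpoints(n):
    # Evenly distribute n viewing directions on a unit sphere (Fibonacci spiral).
    i = np.arange(n)
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z ** 2)
    phi = np.pi * (1.0 + np.sqrt(5.0)) * i
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def view_basis(view_dir):
    # Orthonormal basis whose third row is the viewing direction.
    z_axis = view_dir / np.linalg.norm(view_dir)
    up = np.array([0.0, 0.0, 1.0]) if abs(z_axis[2]) < 0.9 else np.array([1.0, 0.0, 0.0])
    x_axis = np.cross(up, z_axis)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    return np.stack([x_axis, y_axis, z_axis])

def project_depth(points, view_dir, res=256):
    # Orthographic projection of an (N, 3) point cloud along view_dir,
    # keeping the nearest depth per pixel for later re-projection into 3D.
    R = view_basis(view_dir)
    pv = points @ R.T                      # points in view coordinates
    uv, depth = pv[:, :2], pv[:, 2]
    uv_min = uv.min(axis=0)
    uv_scale = np.ptp(uv, axis=0).max() + 1e-9
    px = np.clip(((uv - uv_min) / uv_scale * (res - 1)).astype(int), 0, res - 1)
    img = np.full((res, res), np.inf)
    for (u, v), d in zip(px, depth):       # keep the nearest point per pixel
        if d < img[v, u]:
            img[v, u] = d
    return img, uv_min, uv_scale

views = sphere_viewpoints(16)
# depth_images = [project_depth(cloud, v) for v in views]  # one 2D image per view

A perspective camera model, a different number of viewpoints, or a finer image resolution could be substituted without changing the overall flow of project-detect-fuse.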
The benefits of this multi-view projection are three-fold. Firstly, each view can compensate for the missing information of the others, which is equivalent to a pseudo-3D recognition process with reduced complexity. Secondly, target objects are also projected from multiple views and detected in all projected scene views, making the recognition process invariant to rotation changes. Thirdly, multiple independent 2D detection processes stabilize the final fused 3D object locations, filtering out the discrete location offsets common in 2D detection.
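As a rough illustration of this fusion step, continuing the sketch above (again an assumption-laden simplification rather than the paper's exact procedure), each 2D detection center can be re-projected into 3D using its stored depth and the view basis, and the per-view estimates combined with a robust statistic such as the median:

def backproject(u, v, d, view_dir, uv_min, uv_scale, res=256):
    # Inverse of project_depth above: map a 2D detection (pixel (u, v) plus
    # its stored depth d) back to a 3D point in world coordinates.
    R = view_basis(view_dir)
    xy = np.array([u, v], dtype=float) / (res - 1) * uv_scale + uv_min
    return np.array([xy[0], xy[1], d]) @ R   # view -> world coordinates

def fuse_detections(points_3d):
    # Combine the per-view 3D estimates; a robust statistic such as the median
    # suppresses the discrete pixel-offset errors of individual 2D detections.
    return np.median(np.asarray(points_3d), axis=0)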