Gradient Response Maps for Real-Time
Detection of Textureless Objects
Stefan Hinterstoisser, Member, IEEE, Cedric Cagniart, Slobodan Ilic, Member, IEEE,
Peter Sturm, Member, IEEE, Nassir Navab, Member, IEEE,
Pascal Fua, Fellow, IEEE, and Vincent Lepetit
Abstract—We present a method for real-time 3D object instance detection that does not require a time-consuming training stage, and
can handle untextured objects. At its core, our approach is a novel image representation for template matching designed to be robust
to small image transformations. This robustness is based on spread image gradient orientations and allows us to test only a small
subset of all possible pixel locations when parsing the image, and to represent a 3D object with a limited set of templates. In addition,
we demonstrate that if a dense depth sensor is available, we can extend our approach to also take 3D surface normal orientations
into account for even better performance. We show how to take advantage of the architecture of modern computers to build an efficient
but very discriminant representation of the input images that can be used to consider thousands of templates in real time. We
demonstrate in many experiments on real data that our method is much faster and more robust with respect to background clutter than
current state-of-the-art methods.
Index Terms—Computer vision, real-time detection and object recognition, tracking, multimodality template matching.
1 INTRODUCTION
REAL-TIME object instance detection and learning are two
important and challenging tasks in computer vision.
Among the application fields that drive development in this
area, robotics especially has a strong need for computation-
ally efficient approaches as autonomous systems continu-
ously have to adapt to a changing and unknown
environment and to learn and recognize new objects.
For such time-critical applications, real-time template
matching is an attractive solution because new objects can
be easily learned and matched online, in contrast to
statistical-learning techniques that require many training
samples and are often too computationally intensive for
real-time performance [1], [2], [3], [4], [5]. The reason for
this inefficiency is that those learning approaches aim at
detecting unseen objects from certain object classes instead
of detecting a priori known object instances from multiple
viewpoints. Classical template matching aims at the latter:
generalization is performed not over the object class but over
the viewpoint sampling. While this is considered an easier task, it does
not make the problem trivial, as the data still exhibit
significant changes in viewpoint, in illumination, and in
occlusion between the training and the runtime sequence.
When the object is textured enough for keypoints to be
found and recognized on the basis of their appearance, this
difficulty has been successfully addressed by defining patch
descriptors that can be computed quickly and used to
characterize the object [6]. However, this kind of approach
will fail on textureless objects such as those of Fig. 1, whose
appearance is often dominated by their projected contours.
To overcome this problem, we propose a novel approach
based on real-time template recognition for rigid 3D object
instances, where the templates can be both built and
matched very quickly. We will show that this makes it
very easy and virtually instantaneous to learn new
incoming objects by simply adding new templates to the
database while maintaining reliable real-time recognition.
However, we also wish to keep the efficiency and
robustness of statistical methods, as they learn how to reject
unpromising image locations very quickly and tend to be
very robust because they can generalize well from the
training set. We therefore propose a new image representa-
tion that holds local image statistics and is fast to compute. It
is designed to be invariant to small translations and
deformations of the templates, which has been shown to
be a key factor in generalizing to different viewpoints of the
same object [6]. In addition, it allows us to quickly parse the
image by skipping many locations without loss of reliability.
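
The effect of spreading gradient orientations can be illustrated with a small sketch. This is not the representation defined later in the paper, only a toy version under stated assumptions: orientations are quantized into 8 bins (an assumed value), each bin is encoded as one bit, and each pixel's bit mask is OR-ed with those of its neighbours within an assumed radius. The function names and the template layout (a list of `(dx, dy, bit)` features) are hypothetical.

```python
import math

N_BINS = 8  # assumed number of orientation bins (illustrative only)


def quantize_orientation(gx, gy, n_bins=N_BINS):
    """Map a gradient (gx, gy) to a one-hot bit mask over orientation bins.

    The orientation is taken modulo pi, so opposite gradient directions
    fall into the same bin.
    """
    angle = math.atan2(gy, gx) % math.pi
    bin_index = min(int(angle / math.pi * n_bins), n_bins - 1)
    return 1 << bin_index


def spread(bitmap, radius=1):
    """OR each pixel's bit mask with those of all neighbours within `radius`."""
    h, w = len(bitmap), len(bitmap[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    acc |= bitmap[ny][nx]
            out[y][x] = acc
    return out


def match_score(orientation_map, template, ox, oy):
    """Count template features whose orientation bit is set at offset (ox, oy)."""
    h, w = len(orientation_map), len(orientation_map[0])
    score = 0
    for dx, dy, bit in template:
        x, y = ox + dx, oy + dy
        if 0 <= x < w and 0 <= y < h and orientation_map[y][x] & bit:
            score += 1
    return score


# Toy 5x5 image: a vertical edge in column 2 with a horizontal gradient.
edge_bit = quantize_orientation(1.0, 0.0)
edge = [[edge_bit if x == 2 else 0 for x in range(5)] for _ in range(5)]

# Hypothetical template: three vertically stacked features of that orientation.
template = [(0, 0, edge_bit), (0, 1, edge_bit), (0, 2, edge_bit)]

spread_edge = spread(edge, radius=1)
exact_hit = match_score(edge, template, 2, 0)           # template placed on the edge
shifted_raw = match_score(edge, template, 3, 0)         # one pixel off, no spreading
shifted_spread = match_score(spread_edge, template, 3, 0)  # one pixel off, spread map
```

In this toy case the template placed one pixel away from the edge scores nothing on the raw orientation map but scores fully on the spread map. That tolerance to small shifts is what makes it plausible to evaluate only a subset of pixel locations.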
Our approach is related to recent and efficient template
matching methods [7], [8] which consider only images and
their gradients to detect objects. As such, they work even
when the object is not textured enough to use feature point
techniques, and learn new objects virtually instantaneously.
In addition, they can directly provide a coarse estimation of
the object pose, which is especially important for robots
876 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 5, MAY 2012
• S. Hinterstoisser, C. Cagniart, S. Ilic, and N. Navab are with the
Department of Computer Aided Medical Procedures (CAMP), Technische
Universität München, Garching bei München 85478, Germany.
E-mail: {hinterst, cagniart, Slobodan.Ilic, navab}@in.tum.de.
• P. Sturm is with the STEEP Team, INRIA Grenoble-Rhône-Alpes, Saint-
Ismier Cedex 38334, France. E-mail: Peter.Sturm@inrialpes.fr.
• P. Fua and V. Lepetit are with the Computer Vision Lab (CVLAB), Ecole
Polytechnique Fédérale de Lausanne, Lausanne 1015, Switzerland.
E-mail: {pascal.fua, vincent.lepetit}@epfl.ch.
Manuscript received 29 Sept. 2010; revised 15 Sept. 2011; accepted 17 Sept.
2011; published online 8 Oct. 2011.
Recommended for acceptance by S. Sclaroff.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number
TPAMI-2010-09-0748.
Digital Object Identifier no. 10.1109/TPAMI.2011.206.
0162-8828/12/$31.00 © 2012 IEEE Published by the IEEE Computer Society