Pose Guided RGBD Feature Learning for 3D Object Pose Estimation
Vassileios Balntas†, Andreas Doumanoglou∗, Caner Sahin∗, Juil Sock∗, Rigas Kouskouridas‡, Tae-Kyun Kim∗
†University of Oxford, UK    ‡Scape Technologies, UK    ∗Imperial College London, UK
balntas@robots.ox.ac.uk {a.doumanoglou12,c.sahin14,ju-il.sock08,tk.kim}@ic.ac.uk rigas@scape.io
Abstract
In this paper we examine the effects of using object poses as guidance for learning robust features for 3D object pose estimation. Previous works have focused on learning feature embeddings based on metric learning with triplet comparisons, and rely only on the qualitative distinction of similar and dissimilar pose labels. In contrast, we consider the exact pose differences between the training samples, and aim to learn embeddings such that the distances in the pose label space are proportional to the distances in the feature space. However, since it is less desirable to force the pose-feature correlation when objects are symmetric, we discuss the use of weights that reflect object symmetry when measuring the pose distances. Furthermore, end-to-end pose regression is investigated and is shown to further boost the discriminative power of feature learning, improving pose recognition accuracies. Experimental results show that the features learnt under pose guidance are significantly more discriminative than those learned in the traditional way, outperforming state-of-the-art works. Finally, we measure the generalisation capacity of pose-guided feature learning in previously unseen scenes containing objects under different occlusion levels, and show that it adapts well to novel tasks.
1. Introduction
Detecting objects and estimating their 3D pose is a very challenging task, since severe occlusions, background clutter and large scale changes dramatically affect the performance of any contemporary solution. State-of-the-art methods make use of Hough Forests for casting patch votes in the 3D space [27, 7], or train CNNs to either perform classification into the quantized 3D pose space [13] or regress the object position [14] from local patches.
Another approach to the 3D object pose estimation problem involves transforming the initial problem into a nearest neighbour matching one, where extracted feature descriptors are matched with a set of templates via nearest neighbour search [9]. End-to-end deep networks for feature-based nearest neighbour matching entail training a classification network with a classifier layer which is subsequently removed, while the penultimate layer serves as a feature descriptor [24]. Direct feature learning for discrete object classes with deep neural networks [26, 12] has demonstrated successful results by using siamese and triplet networks optimised for discriminative embeddings. The latter are learned in a way that ensures that features extracted from samples belonging to the same class are close in the learned embedding space, while samples from different classes are further apart. Wohlhart and Lepetit [29] adapted this framework to the problem of learning feature descriptors for 3D object pose estimation, by sampling the qualitative relation of pose similarity and forming triplets consisting of similar and dissimilar poses.

(Work done while VB and RK were at Imperial College London.)
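The difference between the qualitative triplet comparison of [29] and the quantitative, pose-guided objective advocated here can be illustrated with a minimal NumPy sketch. The function names, the margin value, and the scalar symmetry weight w are hypothetical illustrations, not the exact losses of this work, which are defined later in the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.01):
    """Qualitative objective: the embedding of a similar-pose sample
    (positive) must lie closer to the anchor than that of a
    dissimilar-pose sample (negative), by at least a fixed margin."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared feature distance
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)   # hinge on the ordering only

def pose_guided_loss(feat_a, feat_b, pose_dist, w=1.0):
    """Quantitative objective (sketch): penalise deviation of the
    feature-space distance from the pose-space distance, so that the
    two become proportional. `w` stands in for a symmetry-aware
    weighting of the pose distance (hypothetical)."""
    d_feat = np.linalg.norm(feat_a - feat_b)
    return (d_feat - w * pose_dist) ** 2
```

The triplet term only constrains an ordering of distances, so any embedding that separates dissimilar poses by more than the margin is equally good; the pose-guided term instead ties each pairwise feature distance to the actual pose difference, and for symmetric objects w can down-weight pose directions along which appearance does not change.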
It is apparent that moving from the continuous space of 3D object poses to the qualitative one of similar and dissimilar pose pairs leads to inevitable information loss. To this end, in this paper we are interested in creating a feature learning framework that directly uses the object pose in the optimization process. Our key idea is that, by using the pose labels in the feature learning layer, we can devise a learning procedure that has inherent knowledge of the final goal (i.e. 3D pose estimation), thus allowing for a switch from a qualitative optimization (similar and dissimilar poses) to a quantitative one (directly computed distance in the pose space). In the proposed learning framework, the pose-feature correlation is established with the adjusted distances in the pose space. Direct 3D pose regression seems challenging due to ambiguities in the appearance/pose space, the continuous nature of the multi-dimensional output space, and the discrepancy between synthetic data used for training and real data used for testing. However, training an end-to-end pose regression network can still facilitate feature learning. Similar to [24], while evaluating our system's performance, we remove the regression layer and use the feature layer for nearest neighbour matching. The regression term along with the pose-guided feature one further improves the