of 3D object recognition. To handle these issues, some researchers attempt to extract middle or even high level features by resorting to machine learning methods. For instance, to extract middle level features, sparse coding [36,35] has been introduced to build part-based features for 3D objects. Suppose the 3D object to be processed is a human body; sparse coding can decompose the object into parts such as the head, legs, and feet via constrained matrix factorization. In addition, to extract higher level semantic information, deep learning methods [26,21,23] have recently been exploited to extract features for 3D objects.
Compared with low level feature extraction approaches, sparse coding and deep learning methods can automatically extract conceptual and semantic level features, respectively. However, they have their own shortcomings. For example, sparse coding is hard to scale to large datasets. As for deep learning methods, training their models is very time-consuming. Apart from this, they must be trained by skillful researchers; that is, the performance of the resulting deep learning model depends heavily on the skill of the researcher. It follows that neither sparse coding nor deep learning can be widely applied to massive data. In other words, handling massive data is a challenging task for these two feature learning methods.
To avoid the limitations of these two feature learning methods, we propose a fast and scalable feature learning method based on a set of one-vs.-all logistic regression classifiers with an $\ell_2$-norm regularizer, also called $\ell_2$-norm logistic regression (abbreviated as $\ell_2$-LR). The motivation is as follows. According to the literature [29], the response values of a classifier on input features can be used to measure the similarity among those features. Therefore, the response values of two similar 3D objects are very close, and vice versa. However, a single classifier may perform well when measuring the similarity of similar 3D objects, but fail to distinguish dissimilar ones falling into different categories. To comprehensively measure the similarity among various kinds of 3D objects, we learn one classifier per class of training data and combine their response values into the feature of a 3D object. In addition, to make the feature learning method scalable to massive data, we exploit stochastic gradient ascent, a representative scalable machine learning technique, to update the model parameters.
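To make the idea concrete, the following is a minimal sketch of training one-vs.-all $\ell_2$-LR classifiers with stochastic gradient ascent and stacking their responses into a feature. This is an illustration under stated assumptions, not the authors' released code; the learning rate `eta`, the regularization weight `lam`, and the function names are our own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ova_l2lr(X, y, n_classes, eta=0.01, lam=1e-4, epochs=5, seed=0):
    """Train one l2-regularized logistic regression classifier per class
    by stochastic gradient ascent on the penalized log-likelihood."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n_classes, d))          # one weight vector per class
    for _ in range(epochs):
        for i in rng.permutation(n):      # one sample at a time (stochastic)
            x = X[i]
            for c in range(n_classes):
                t = 1.0 if y[i] == c else 0.0      # one-vs.-all target
                g = (t - sigmoid(W[c] @ x)) * x    # log-likelihood gradient
                W[c] += eta * (g - lam * W[c])     # ascent step with l2 penalty
    return W

def response_feature(W, x):
    """Stack the classifiers' response values into the learned feature."""
    return sigmoid(W @ x)
```

Because each update touches only one sample, the memory footprint is independent of the dataset size, which is what makes the scheme attractive for massive training data.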
It is worth highlighting the characteristics of this paper as follows:

• The label information is taken into account when training the feature learning model. This effectively avoids the semantic gap issue often suffered by handcrafted feature extraction methods.

• Stochastic gradient ascent updating is utilized, which is helpful for training over massive training data. This implies that we can improve the generalization ability by increasing the volume of training data.

• The proposed feature learning method allows us to extract features for test objects in an online way (see the sketch after this list). In contrast, neither sparse coding nor deep learning achieves this goal, since the sparse code of a test object is obtained by repeated iterations and the deep learning model is too computationally expensive.
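To illustrate the online extraction property, once the classifiers are trained, featurizing a new object reduces to a single matrix-vector product, with no test-time iteration. The snippet below continues the hypothetical sketch above with synthetic data.

```python
# Continuing the sketch above: after training, online feature extraction
# for a test object is one pass over the learned classifiers.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 64))                # 200 objects, 64-dim inputs
y_train = rng.integers(0, 10, size=200)             # 10 object classes
W = train_ova_l2lr(X_train, y_train, n_classes=10)
feature = response_feature(W, rng.normal(size=64))  # 10-dim learned feature
```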
The remainder of this paper is organized as follows. In Section 2, we review the state-of-the-art feature extraction methods for 3D objects. We present the feature learning algorithm via a set of logistic regression classifiers in Section 3. Then, in Section 4, extensive experiments are conducted to demonstrate the effectiveness and efficiency of the proposed algorithm. Finally, we draw a conclusion in Section 5.
2. Related works
Considering that the typical feature extraction methods for 3D objects have already been discussed in Section 1, for brevity we review only the state-of-the-art method from each class, namely SIFT and its variants, sparse coding, and deep learning. The detailed survey is as follows.
2.1. SIFT and its variants
Similar to 2D object retrieval and recognition applications, the scale-invariant feature transform (SIFT) descriptor [19] is also popular and dominant in 3D object applications [30,18,3,1], since the SIFT feature is superior to other features in handling intensity, rotation, scale, and affine variations. However, matching 3D objects is very tedious if the correspondence is computed at the level of individual SIFT features. To ease matching, the bag-of-features (BOF) idea has therefore been applied to aggregate SIFT features in 3D object retrieval and recognition tasks [8,31,3,5].
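As a minimal sketch of the BOF aggregation step (an illustration, not the code of the cited systems): local SIFT descriptors are quantized against a learned codebook and pooled into a normalized histogram. The random stand-ins for the descriptors and the k-means codebook are assumptions.

```python
import numpy as np

def bof_histogram(descriptors, codebook):
    """Aggregate local descriptors (m x d) into a bag-of-features
    histogram over a codebook of visual words (k x d)."""
    # Assign each descriptor to its nearest visual word (Euclidean distance).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)    # l1-normalize

# Example with random stand-ins for SIFT descriptors and a learned codebook.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(300, 128))   # 300 SIFT descriptors of one view
codebook = rng.normal(size=(64, 128))       # 64 visual words
h = bof_histogram(descriptors, codebook)    # 64-dim global feature
```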
Although the BOF method successfully aggregates local features, it decreases the discriminative ability to some extent because features are aggregated in a disordered way. To remedy this limitation, a multi-resolution version of BOF has been proposed by combining a series of BOFs at different resolutions, also called the pyramid kernel [9]. To further improve the discriminative ability, the pyramid kernel has been extended to the spatial pyramid kernel [14] by taking spatial information into account. With these advantages, the pyramid and spatial pyramid kernels have been applied to 3D object recognition [17,13].
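A minimal sketch of the spatial pyramid idea (illustrative, not the implementation of [14], and with the usual per-level weighting omitted for brevity): BOF histograms are computed over grid cells at several resolutions and concatenated, so coarse spatial order survives the pooling. It reuses the hypothetical `bof_histogram` above.

```python
def spatial_pyramid(descriptors, positions, codebook, levels=2):
    """Concatenate per-cell BOF histograms at resolutions 1x1, 2x2, 4x4, ...
    positions holds each descriptor's (x, y) location normalized to [0, 1)."""
    feats = []
    for l in range(levels + 1):
        cells = 2 ** l                                   # grid size at level l
        bins = (positions * cells).astype(int).clip(0, cells - 1)
        for cx in range(cells):
            for cy in range(cells):
                mask = (bins[:, 0] == cx) & (bins[:, 1] == cy)
                cell = descriptors[mask]
                if len(cell):
                    feats.append(bof_histogram(cell, codebook))
                else:
                    feats.append(np.zeros(len(codebook)))  # empty cell
    return np.concatenate(feats)
```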
As summarized in Section 1, low level features such as SIFT and its variants readily lead to the semantic gap issue, because they are constructed in a handcrafted way and label information is not considered.
2.2. Sparse coding
Sparse coding [15] refers to a class of algorithms for automatically finding succinct representations of data as a (often linear) combination of a few typical atoms (usually called a dictionary or codebook) learned from the data. Given only unlabeled input data, it learns basis functions that capture middle level features (i.e., concept level features) in the data. For example, the very large set of English sentences can be encoded by a small number of symbols (i.e., letters, numbers, punctuation, and spaces) combined in a particular order for a particular sentence, so a sparse coding of English would be those symbols.
More specifically, given k-dimensional real-valued input vectors $\vec{x} \in \mathbb{R}^k$, the goal of sparse coding is to find n k-dimensional basis vectors $\vec{b}_1, \ldots, \vec{b}_n \in \mathbb{R}^k$ along with a sparse n-dimensional vector of weights or coefficients $\vec{y} \in \mathbb{R}^n$ for each input vector, such that a linear combination of the basis vectors with proportions given by the coefficients closely approximates the input vector: $\vec{x} \approx \sum_{j=1}^{n} y_j \vec{b}_j$ [15]. Presently, sparse coding has been leveraged to extract conceptual features for 2D images of 3D objects [36,35]. Theoretically, sparse coding based 3D object recognition and retrieval should outperform approaches built on top of low level features (i.e., SIFT and its variants). In spite of this, extrapolating sparse codes to unseen test objects is quite time-consuming.
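For concreteness, here is a minimal sketch of the encoding step under the objective above: a generic iterative soft-thresholding (ISTA) solver for the $\ell_1$-penalized least-squares formulation, not the specific method of [36,35]. The penalty `lam` and iteration count are illustrative assumptions.

```python
import numpy as np

def sparse_encode(x, B, lam=0.1, n_iter=100):
    """Find sparse coefficients y minimizing 0.5*||x - B @ y||^2 + lam*||y||_1
    for a fixed dictionary B (k x n) via iterative soft-thresholding (ISTA)."""
    y = np.zeros(B.shape[1])
    L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = B.T @ (B @ y - x)           # gradient of the quadratic term
        z = y - grad / L                   # gradient descent step
        y = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return y

# Example: encode a vector against a random dictionary. Note the test-time
# iterations, which are why extrapolation to new objects is slow.
rng = np.random.default_rng(0)
B = rng.normal(size=(64, 256))             # k = 64 dimensions, n = 256 atoms
x = rng.normal(size=64)
y = sparse_encode(x, B)                    # sparse n-dimensional code
```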
2.3. Deep learning
Deep learning is a set of algorithms in machine learning that attempt to learn layered models of inputs, commonly neural networks.