Multistage Committees of Deep Feedforward Convolutional Sparse Denoising Autoencoders for Object Recognition
Shicao Luo¹, Yongsheng Ding¹,²,*, Kuangrong Hao¹,²
1. College of Information Science and Technology, Donghua University, Shanghai 201620, China
2. Engineering Research Center of Digitized Textile & Apparel Technology, Shanghai 201620, China
Email: shicaoLuo@163.com; * Corresponding author, email: ysding@dhu.edu.cn
Abstract—Deep learning and unsupervised feature learning systems are known to achieve good performance on benchmarks by using extremely large architectures with many features at each layer. However, we found that adding features contributes very little to performance once their number exceeds a threshold, whereas the size of the pooling layer has an important influence on performance. In this paper, we present an unsupervised method that improves classification by going deep and combining multistage classifiers in a committee, with a small number of features at each layer. The network is trained layer-wise via a denoising autoencoder (dA) with L-BFGS to optimize the convolutional kernels, and no backpropagation is used. In addition, we regularize the dA to encourage sparse representations at each coding layer. We apply it to the STL-10 dataset, which has very few training examples and a large amount of unlabeled data. Experimental results show that our method achieves higher performance than existing methods under the single-network condition.
Keywords—multistage classifiers; sparse denoising autoencoder; object recognition; deep learning
Recent theoretical studies indicate that unsupervised learning methods can automatically build feature extractors rather than relying on handcrafted ones. Classical methods for dimensionality reduction or clustering, such as principal component analysis and K-means, have been used routinely in numerous vision applications [1, 2].
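As one concrete illustration (a sketch of the common practice, not code from [1, 2]; the patch and dictionary sizes are assumptions), centroids learned by K-means on image patches can act directly as a feature extractor, here with the "triangle" encoding sometimes used in the literature:

import numpy as np
from sklearn.cluster import KMeans

# Learn a dictionary of patch centroids from unlabeled data
# (sizes are illustrative assumptions).
patches = np.random.rand(10000, 64)          # e.g., flattened 8x8 patches
kmeans = KMeans(n_clusters=100, n_init=10).fit(patches)

def encode(x):
    # Distance of each patch to every centroid...
    d = np.linalg.norm(x[:, None, :] - kmeans.cluster_centers_[None], axis=2)
    # ...mapped through the "triangle" activation: centroids closer than
    # average fire, the rest are zeroed, giving a sparse feature vector.
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)

features = encode(np.random.rand(5, 64))     # (5, 100) feature vectors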
In the context of object recognition, a key problem of
current machine vision systems is whether unsupervised
learning can be used to learn robust and invariant features.
Traditionally, the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) can be understood and generalized as ways to go from pixels to patch descriptors [3]. However, such handcrafted descriptors are often difficult to adapt to new settings. Unlike these artificial feature learning systems, the primate visual system [24] accomplishes such tasks effortlessly. The groundbreaking work of
Hubel and Wiesel [28] played a major role in the computer
vision community via Marr’s work [29] on building visual
hierarchies analogous to the primate visual system. To narrow the gap with biological systems while exploiting the computer's own strengths, much recent research has focused on training deep, multi-layered feature networks, such as deep belief nets [4], deep autoencoders [5], deep convolutional neural networks [6], hierarchical sparse coding [7, 8], and SIFT-based MFs coding [25]. The main benefit of these models is their high genericity: deep learning approaches learn to push pixels through multiple layers of feature transforms without any use of prior knowledge. In addition, a technique called the "receptive field" dramatically reduces the number of parameters that must be trained and is a key element of several state-of-the-art systems [9, 10].
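As a back-of-the-envelope illustration of why local receptive fields help (the feature count and kernel size below are assumptions for illustration, not the paper's configuration):

# Illustrative parameter-count comparison: fully connected layer vs.
# local receptive fields with shared weights (i.e., convolution).
image_h = image_w = 96       # STL-10 images are 96x96 pixels
channels = 3                 # RGB
n_features = 100             # number of learned feature maps (assumed)
rf = 8                       # receptive-field (kernel) size (assumed)

# Fully connected: every hidden unit sees every input pixel.
fc_params = (image_h * image_w * channels) * n_features    # 2,764,800

# Convolutional: each feature map shares one rf x rf x channels kernel,
# independent of the image size.
conv_params = (rf * rf * channels) * n_features            # 19,200

print(f"fully connected: {fc_params:,} weights")
print(f"convolutional:   {conv_params:,} weights")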
It has been shown that deep models are potentially more capable than shallow models of handling complex tasks [11]. Deep belief nets [12] learn a hierarchy of features greedily, layer by layer, using unsupervised restricted Boltzmann machines. This pre-training helps the optimization escape poor local minima. The learned weights are then further adjusted to the current task using supervised information. To make deep belief nets applicable to full-size images, convolutional deep belief nets [13] were proposed; they use small receptive fields and share the weights between the hidden and visible layers among all locations in an image. Stacked denoising autoencoders [14] build deep networks by stacking layers of denoising autoencoders, each of which trains a one-layer neural network to reconstruct its input from a partially and randomly corrupted version. Spike-and-slab sparse coding (S3C) [26] is an existing model that achieves very good performance, particularly when the number of labeled examples is low.
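As a minimal sketch of this denoising-autoencoder principle (illustrative only; the masking-noise level, tied weights, and layer sizes are our assumptions, not details from [14] or this paper):

import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, level=0.3):
    # Masking noise: randomly zero a fraction `level` of the inputs.
    return x * (rng.random(x.shape) > level)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes (assumed): 8x8x3 patches -> 100 hidden features.
n_in, n_hidden = 192, 100
W = rng.normal(0.0, 0.01, (n_in, n_hidden))  # tied encoder/decoder weights
b, c = np.zeros(n_hidden), np.zeros(n_in)

x = rng.random((5, n_in))         # a toy minibatch of clean patches
h = sigmoid(corrupt(x) @ W + b)   # encode the *corrupted* input
x_hat = sigmoid(h @ W.T + c)      # decode back to the input space

# The loss compares the reconstruction with the *clean* input, which
# forces the hidden code h to be robust to the corruption.
loss = np.mean((x_hat - x) ** 2)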
In this paper, we propose an unsupervised learning model for large-size image recognition. In this network, we learn the following key components: 1) a sparse and overcomplete feature bank at each layer; 2) a committee of multistage classifiers for voting. The network is trained layer-wise via a denoising autoencoder, and no backpropagation is used.
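As a minimal sketch of the committee component listed above (a hypothetical interface; the paper does not give this code), soft voting averages the per-stage class probabilities and picks the winning class:

import numpy as np

def committee_vote(stage_probs):
    # stage_probs: list of (n_examples, n_classes) probability arrays,
    # one per stage classifier. Average them, then take the argmax.
    return np.argmax(np.mean(stage_probs, axis=0), axis=1)

# Toy example: three stage classifiers, four examples, ten STL-10 classes.
rng = np.random.default_rng(0)
stage_probs = [rng.dirichlet(np.ones(10), size=4) for _ in range(3)]
predicted = committee_vote(stage_probs)    # shape (4,)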
We apply it to the STL-10 dataset, an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. In particular, each class has few labeled training examples but a very large set of unlabeled examples. Zou [15] built a hierarchical network that learns invariant features via simulated fixations in video; this method gains a 4.5% improvement in STL-10 classification accuracy. Coates [16] introduced the "Selecting Receptive Fields" method to limit the number of connections from lower-level features to higher ones and achieved the