Learning Deep and Wide: A Spectral Method
for Learning Deep Networks
Ling Shao, Senior Member, IEEE, Di Wu, and Xuelong Li, Fellow, IEEE
Abstract—Building intelligent systems that are capable of extracting high-level representations from high-dimensional sensory data lies at the core of solving many computer vision tasks. We propose multispectral neural networks (MSNN) to learn features from multicolumn deep neural networks and to embed the penultimate hierarchical discriminative manifolds into a compact representation. The low-dimensional embedding exploits the complementary properties of the different views, in which the distribution of each view is sufficiently smooth, and hence remains robust even when few labeled training data are given. Our experiments show that spectrally embedding several deep neural networks better exploits the outputs of the multicolumn networks and consistently decreases the error rate compared with a single deep network.
Index Terms—Deep networks, multispectral embedding,
representation learning.
I. INTRODUCTION
Recent publications suggest that unsupervised pretraining of deep, hierarchical neural networks improves supervised pattern classification [1]–[4]. Building learning machines that automatically construct feature extractors, instead of hand-crafting them, is a broad research area in pattern recognition. The main benefit of these models is their strong generalization ability, since they automatically learn to extract salient patterns directly from the raw input, without any use of prior knowledge. Recent advances and applications using learned features have yielded excellent results in several tasks, e.g., object recognition and video sequence classification. Krizhevsky et al. [5] train a large, deep convolutional neural network (CNN) to classify images into 1000 different classes; Baccouche et al. [6] learn a sparse shift-invariant representation of the local salient information using a spatio-temporal convolutional
sparse autoencoder and classify each sequence with a long short-term memory recurrent neural network [7]. Meanwhile, various architectures and techniques have been proposed to enhance the learning capacity: a multiresolution deep belief network (DBN) [8] combines a Laplacian pyramid with deep learning to learn coarse structures from low-resolution images, leading to a better generative model; the multicolumn deep neural networks proposed by Cireşan et al. [9]–[11] use GPUs to train several deep neural columns and average the outputs of the individual networks, showing that, given enough labeled data, such networks do not need additional heuristics, such as unsupervised pretraining or carefully prewired synapses.
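As a point of reference, the averaging scheme used by conventional multicolumn networks can be summarized by the following minimal Python sketch; the per-column predict_proba interface and the numpy-based averaging are illustrative assumptions, not the exact implementation of [9]–[11].

# Hypothetical sketch of the multicolumn averaging baseline: each column is
# an independently initialized deep net, and the ensemble prediction is the
# mean of the per-column class posteriors.
import numpy as np

def average_columns(columns, x):
    """Average the class-posterior outputs of several deep-net columns."""
    probs = np.stack([net.predict_proba(x) for net in columns], axis=0)
    return probs.mean(axis=0)  # shape: (n_samples, n_classes)

def predict(columns, x):
    """Classify by taking the arg max of the averaged posteriors."""
    return average_columns(columns, x).argmax(axis=1)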
Inspired by the microcolumns of neurons in the cerebral cortex, several deep neural columns are trained to become experts, and they unfold their potential when the architecture is wide. Conventional multicolumn deep neural networks average the prediction outputs under the assumption that enough labeled training data are available and that each individual network is close to the global optimum. However, simple output averaging may not reach the model's optimum if only few labeled data are provided. As indicated in [9], the columns unfold their potential when they are wide; yet, if the labeled training instances are few, i.e., fine-tuning information is scarce, the deep networks can suffer from overfitting. Such a setting is pervasive in real-world applications, such as gender prediction (Section IV-C), where randomized, controlled experiments may be costly, unethical, and intrusive.
In this brief, we show how combining several deep network columns as basic building blocks into a multicolumn deep net and embedding their spectral relationships can further enhance robustness and hence decrease the error rate. We define the wide deep net as the juxtaposition of multiple randomly initialized, nonconvex deep nets, and refer to the proposed architecture as multispectral neural networks (MSNN). The multicolumn procedure can easily be implemented in a parallelized, multithreaded fashion, so MSNN requires no significant extra training time. Our architecture achieves this by combining several techniques in a novel way, as sketched below and detailed in the two points that follow.
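As a rough illustration only, the following minimal Python sketch gives one deliberately simplified reading of the pipeline: penultimate-layer activations of all columns are concatenated and then spectrally embedded into a compact representation. The penultimate_features hook, the use of scikit-learn's SpectralEmbedding (Laplacian eigenmaps), and the chosen dimensions are illustrative assumptions, not the exact procedure of this brief.

# A minimal, hypothetical sketch: stack penultimate-layer features from
# several deep-net columns (the "wide" view) and embed them spectrally into
# a low-dimensional representation.
import numpy as np
from sklearn.manifold import SpectralEmbedding

def multicolumn_features(columns, x):
    """Concatenate penultimate-layer activations of all columns."""
    feats = [net.penultimate_features(x) for net in columns]  # assumed hook
    return np.concatenate(feats, axis=1)  # (n_samples, total feature dim)

def msnn_embedding(columns, x, n_components=32, n_neighbors=10):
    """Spectrally embed the multicolumn feature space (Laplacian eigenmaps)."""
    wide = multicolumn_features(columns, x)
    embedder = SpectralEmbedding(n_components=n_components,
                                 n_neighbors=n_neighbors,
                                 affinity="nearest_neighbors")
    return embedder.fit_transform(wide)  # compact multispectral representation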
1) We encourage the neural networks to learn deep models that reuse intermediate features to extract more abstract representations, ones that are more correlated with the underlying causes generating the data; we therefore utilize the penultimate layer of the hierarchy as our intermediate feature space, in contrast to the common paradigm that outputs the top predictor layer (also known as the softmax output layer). Such nets can be DBNs or CNNs with fully connected penultimate layers.
2) Our architecture encourages the networks to learn wide, i.e., horizontally, exploring the feature space while admitting the stochasticity of the deep nets and yielding a mixture-of-experts style field. Unlike the conventional multicommittee systems that extract only the trivial 1-D winner-take-all regions, that is, the top part of the hierarchy