1350 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 54, NO. 3, MARCH 2016
computationally efficient classification techniques. The
current state-of-the-art support vector machine (SVM)
[16], [17] is, however, unable to cope with more than
a few thousand labeled data points.
A very convenient way to alleviate the aforementioned prob-
lems is to extract relevant, potentially useful, nonredundant,
and nonlinear features from images in order to facilitate the
subsequent classification step. The extracted features could be
fed into a simple cost-effective (ideally linear) classifier. The
bottleneck would then be the feature learning step. Learning
expressive spatial–spectral features from HS images in an effi-
cient way is thus of paramount relevance. In addition, and very
importantly, learning such features in an unsupervised fashion
has also become extremely relevant given the few labeled pixels
typically available.
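The pipeline sketched above, i.e., unsupervised features followed by a simple linear classifier, can be illustrated as follows. This is a minimal sketch with hypothetical dimensions (the feature matrix stands in for any learned representation); the linear classifier is ridge regression on one-hot targets, one of the cheapest choices, not the specific classifier used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n labeled pixels, d-dimensional unsupervised features.
n, d, n_classes = 500, 64, 6
features = rng.standard_normal((n, d))      # stand-in for learned features
labels = rng.integers(0, n_classes, size=n)

# Simple cost-effective linear classifier: ridge regression on one-hot targets.
Y = np.eye(n_classes)[labels]               # one-hot encoding, shape (n, n_classes)
X = np.hstack([features, np.ones((n, 1))])  # append a bias column
lam = 1e-2                                  # ridge regularizer
W = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ Y)

pred = np.argmax(X @ W, axis=1)             # predicted class per pixel
train_acc = float(np.mean(pred == labels))
```

Once the features are fixed, training reduces to one linear solve, which is why the feature learning step, rather than classification, becomes the computational bottleneck.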
A. Background
Given the typically high dimensionality of remote sensing
data, feature extraction techniques have been widely used in the
literature to reduce the data dimensionality. While the classical
principal component analysis (PCA) [18] is still one of the
most popular choices, a plethora of nonlinear dimensionality
reduction methods, manifold learning, and dictionary learning
algorithms have been introduced in the last decade.
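As a baseline for the methods discussed next, linear dimensionality reduction with PCA on a hyperspectral cube can be sketched in a few lines. The cube dimensions below are illustrative, and the decomposition is computed via an SVD of the centered pixel-by-band matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperspectral cube: 50x50 pixels, 100 spectral bands.
h, w, bands = 50, 50, 100
cube = rng.standard_normal((h, w, bands))

X = cube.reshape(-1, bands)   # one row per pixel
X = X - X.mean(axis=0)        # center each band

# PCA via SVD of the centered data matrix: columns of Vt.T are the components.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 10                        # number of retained components
scores = X @ Vt[:k].T         # (pixels, k) reduced features
reduced = scores.reshape(h, w, k)
```

The singular values `S` are sorted in decreasing order, so the first `k` components capture the largest share of the spectral variance; nonlinear methods replace this fixed linear projection with a learned mapping.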
State-of-the-art manifold learning methods [19] include the
following: local approaches for the description of remote
sensing image manifolds [20]; kernel-based and spectral de-
compositions that learn mappings optimizing for maximum
variance, correlation, entropy, or minimum noise fraction [21];
neural networks that generalize PCA to encode nonlinear data
structures via autoassociative/autoencoding networks [22]; and
projection pursuit approaches leading to convenient Gaussian
domains [23]. In remote sensing, autoencoders have been
widely used [24]–[27]. However, several (critical) free
parameters must be tuned; regularization is an important issue,
mainly addressed by heuristically limiting the network’s
structure; and only shallow structures are considered, largely
due to limitations in computational resources and in the
efficiency of the training algorithms. On top of this, autoencoders
very often employ only the spectral information and, in the best
of cases, include spatial information naively by stacking
handcrafted spatial features.
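A minimal example of the spectral-only autoencoders criticized above is sketched here: a single hidden layer with tied weights, trained by gradient descent on the reconstruction error of per-pixel spectra. All sizes and the learning rate are illustrative assumptions, and the data are synthetic stand-ins for real spectra.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel spectra: 1000 pixels, 100 spectral bands.
X = rng.standard_normal((1000, 100))
n_hidden = 32

# Tied-weight autoencoder: encode H = tanh(X W + b), decode X_hat = H W^T + c.
W = 0.01 * rng.standard_normal((100, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(100)
lr = 1e-3

for _ in range(200):
    H = np.tanh(X @ W + b)
    X_hat = H @ W.T + c
    err = X_hat - X                    # reconstruction error
    dH = (err @ W) * (1 - H ** 2)      # backprop through tanh
    gW = X.T @ dH + err.T @ H          # tied-weight gradient (encode + decode)
    W -= lr * gW / len(X)
    b -= lr * dH.mean(axis=0)
    c -= lr * err.mean(axis=0)

loss = float(np.mean(err ** 2))
```

Note that the pixel spectra enter independently: no spatial neighborhood is used anywhere, which is exactly the limitation the text points out.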
To the authors’ knowledge, there is little evidence of the good
performance of deep architectures in remote sensing image
classification: [28] introduces a deep learning algorithm for
classification of (low-dimensional) VHR images; [29] explores
the robustness of deep networks to noisy class labels for aerial
image classification, and [30] introduces hybrid deep neural
networks to enable the extraction of variable-scale features
to detect vehicles in satellite images; [31] proposes a hybrid
framework based on stacked autoencoders for classification of
HS data. Although deep learning methods can cope with the
difficulties of nonlinear spatial–spectral image analysis,
sparsity of the feature representation and efficiency of the
training algorithms remain open issues in state-of-the-art
frameworks.
In recent years, dictionary learning has emerged as an
efficient way to learn sparse image features in unsupervised
settings, which are eventually used for image classification
and object recognition: discriminative dictionaries have been
proposed for spatial–spectral sparse representation and image
classification [32]; sparse kernel networks have recently been
introduced for classification [33], sparse representations over
learned dictionaries for image pansharpening [34], saliency-
based codes for segmentation [35], [36], sparse bag-of-words
codes for automatic target detection [37], and unsupervised
learning of sparse features for aerial image classification [38].
These methods describe the input images in sparse represen-
tation spaces but do not take advantage of the high nonlinear
nature of deep architectures.
Therefore, in the context of remote sensing, unsupervised
learning of features in a deep convolutional neural network
(CNN) architecture seeking sparse representations has not been
approached so far.
B. Contributions
In this paper, we aim to address two main challenges in
remote sensing data analysis. To this end, we introduce the
use of deep convolutional networks for remote sensing data
analysis [39] trained by means of an unsupervised learning
method seeking sparse feature representations. On one hand,
the following are observed: 1) deep architectures have a highly
nonlinear nature that is well suited to cope with the difficulties
of nonlinear spatial–spectral image analysis; 2) convolutional
architectures capture only local interactions, making them well
suited when the input shares similar statistics at all locations,
i.e., when there is high redundancy; and 3) sparse features are
well suited to describe remote sensing images [4], [6]–[8].
On the other hand, we want to train deep convolutional
architectures efficiently to alleviate the high computational cost
involved in remote sensing. Given the typically few labeled
data, applying unsupervised learning algorithms to train deep
architectures is a paramount aspect of remote sensing.
We propose the combination of greedy layerwise unsuper-
vised pretraining [40]–[43] coupled with the highly efficient
enforcing population and lifetime sparsity (EPLS) algorithm
[44] for unsupervised learning of sparse features and show the
applicability and potential of the method to extract hierarchical
(i.e., deep) sparse feature representations of remote sensing im-
ages. The EPLS seeks a sparse representation of the input data
(remote sensing images) and allows training systems with large
numbers of input channels efficiently (and numerous filters/
parameters), without requiring any metaparameter tuning.
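The core idea of enforcing population and lifetime sparsity can be sketched as follows. This is a simplified illustration of the principle, not the exact EPLS algorithm of [44]: each sample in a batch is assigned a one-hot target (population sparsity) while each filter is allowed roughly the same number of assignments over the batch (lifetime sparsity), and the filters are then regressed toward these sparse targets. All dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of vectorized patches: n samples, d inputs, k filters.
n, d, k = 256, 75, 16               # e.g. 5x5x3 patches, 16 filters
patches = rng.standard_normal((n, d))
Wf = rng.standard_normal((d, k))
Wf /= np.linalg.norm(Wf, axis=0)    # unit-norm filters

acts = patches @ Wf                 # linear activations, shape (n, k)

# Build sparse one-hot targets: each sample activates exactly one filter
# (population sparsity), and each filter is used at most ceil(n/k) times
# over the batch (lifetime sparsity).
targets = np.zeros((n, k))
usage = np.zeros(k, dtype=int)
cap = int(np.ceil(n / k))
for i in np.argsort(-np.abs(acts).max(axis=1)):   # most confident samples first
    order = np.argsort(-np.abs(acts[i]))          # preferred filters for sample i
    j = next(f for f in order if usage[f] < cap)  # best filter still available
    targets[i, j] = 1.0
    usage[j] += 1

# Regress the filters toward the sparse targets, then renormalize.
Wf = np.linalg.lstsq(patches, targets, rcond=None)[0]
Wf /= np.linalg.norm(Wf, axis=0) + 1e-12
```

Because the targets are built greedily from the activations themselves, no sparsity weight or penalty coefficient has to be tuned, which is the metaparameter-free property the text highlights.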
Thus, deep convolutional networks are trained efficiently in
an unsupervised greedy layerwise fashion [40]–[43] using the
EPLS algorithm [44] to learn the network filters. The learned
hierarchical representations of the input remote sensing images
are used for image/pixel classification, where lower layers ex-
tract low-level features, and higher layers exhibit more abstract
and complex representations.
To our knowledge, this is the first work dealing with sparse
unsupervised deep convolutional networks in remote sensing
data analysis in a systematic way. We want to emphasize the
fact that the methodology presented here is fully unsuper-
vised, which is a different (and more challenging) setting to