Flexible, High Performance Convolutional
Neural Networks for Image Classification
Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, Jürgen Schmidhuber
IDSIA, USI and SUPSI
Galleria 2, 6928 Manno-Lugano, Switzerland
{dan,ueli,jonathan,luca,juergen}@idsia.ch
Abstract
We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, 19.51%, 0.35%, respectively. Deep nets trained by simple back-propagation perform better than shallower ones. Learning is surprisingly rapid. NORB is completely trained within five epochs. Test error rates on MNIST drop to 2.42%, 0.97% and 0.48% after 1, 3 and 17 epochs, respectively.
1 Introduction
The human visual system efficiently recognizes and localizes objects within cluttered scenes. For artificial systems, however, this is still difficult due to viewpoint-dependent object variability and the high in-class variability of many object types. Deep hierarchical neural models roughly mimic the nature of the mammalian visual cortex, and by community consensus are among the most promising architectures for such tasks. The most successful hierarchical object recognition systems all extract localized features from input images, convolving image patches with filters. Filter responses are then repeatedly sub-sampled and re-filtered, resulting in a deep feed-forward network architecture whose output feature vectors are eventually classified. One of the first hierarchical neural systems was the Neocognitron [Fukushima, 1980], which inspired many of the more recent variants.
Unsupervised learning methods applied to patches of natural images tend to produce localized filters that resemble off-center-on-surround filters, orientation-sensitive bar detectors, and Gabor filters [Schmidhuber et al., 1996; Olshausen and Field, 1997; Hoyer and Hyvärinen, 2000]. These findings, in conjunction with experimental studies of the visual cortex, justify the use of such filters in the so-called standard model for object recognition [Riesenhuber and Poggio, 1999; Serre et al., 2007; Mutch and Lowe, 2008], whose filters are fixed, in contrast to those of Convolutional Neural Networks (CNNs) [LeCun et al., 1998; Behnke, 2003; Simard et al., 2003], whose weights (filters) are randomly initialized and changed in a supervised way using back-propagation (BP).
Despite the hardware progress of the past decades, computational speed is still a limiting factor for CNN architectures characterized by many building blocks typically set by trial and error. To systematically test the impact of various architectures on classification performance, we present a fast CNN implementation on Graphics Processing Units (GPUs). Previous GPU implementations of CNNs [Chellapilla et al., 2006; Uetz and Behnke, 2009; Strigl et al., 2010] were hard-coded to satisfy GPU hardware constraints or used general-purpose libraries, whereas our implementation is flexible and fully online (i.e., weight updates after each image). A notable exception is [Jarrett et al., 2009], who performed a thorough analysis of the influence of all building blocks of a multistage architecture on recognition performance. Our implementation allows for training large CNNs within days instead of months, so that we can investigate the influence of various structural parameters by exploring large parameter spaces [Pinto et al., 2009] and performing error analysis on repeated experiments.
We evaluate various networks on the handwritten digit benchmark MNIST [LeCun et al., 1998] and two image classification benchmarks: NORB [LeCun et al., 2004] and CIFAR10 [Krizhevsky, 2009].
2 Convolutional neural networks
CNNs are hierarchical neural networks whose convolutional layers alternate with subsampling layers, reminiscent of simple and complex cells in the primary visual cortex [Wiesel and Hubel, 1959]. CNNs vary in how convolutional and subsampling layers are realized and how the nets are trained.
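One such alternating stage can be sketched in a few lines of NumPy. This is purely illustrative and not the paper's GPU implementation; the filter here is random rather than learned, the tanh nonlinearity and max-subsampling are common choices, and all sizes are assumptions for the example.

```python
import numpy as np

def convolve_valid(image, kernel):
    """2-D 'valid' cross-correlation (the usual CNN convention):
    the output shrinks by (kernel size - 1) in each dimension."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max-subsampling over size x size blocks."""
    h = (fmap.shape[0] // size) * size
    w = (fmap.shape[1] // size) * size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# One convolution + subsampling stage of the alternating hierarchy.
rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))    # e.g. an MNIST-sized input
kernel = rng.standard_normal((5, 5))     # a filter (random init, later learned by BP)
fmap = np.tanh(convolve_valid(image, kernel))   # 24 x 24 feature map
pooled = max_pool(fmap, 2)                      # 12 x 12 after subsampling
```

Stacking several such stages, followed by fully connected layers, yields the deep feed-forward architecture described above.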
2.1 Image processing layer
The image processing layer is an optional pre-processing layer of predefined filters that are kept fixed during training. It can thus provide the network with additional information besides the raw input image, such as edges and gradients. In particular, we find that a contrast-extracting layer [Fukushima, 2003] helps to improve the recognition rate for NORB.
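A fixed contrast-extracting filter of this kind can be sketched as a zero-sum center-surround (difference-of-Gaussians) kernel. This is an assumed stand-in for illustration, not the exact filter of [Fukushima, 2003] or the one used for NORB; kernel size and sigmas are arbitrary choices.

```python
import numpy as np

def center_surround_kernel(size=5, sigma_c=0.8, sigma_s=1.6):
    """On-center, off-surround difference-of-Gaussians kernel.
    Normalizing each Gaussian to unit sum makes the kernel zero-sum,
    so uniform image regions produce (near) zero response."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    gauss = lambda s: np.exp(-(xx**2 + yy**2) / (2.0 * s**2))
    center, surround = gauss(sigma_c), gauss(sigma_s)
    return center / center.sum() - surround / surround.sum()

def contrast_layer(image, kernel):
    """Fixed pre-processing pass: 'same'-size convolution via zero padding.
    The kernel is predefined and never updated during training."""
    p = kernel.shape[0] // 2
    padded = np.pad(image, p)
    out = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out
```

The resulting contrast map can be fed to the network alongside the raw image as an extra input channel.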
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence