Rethinking the Inception Architecture for Computer Vision
Christian Szegedy
Google Inc.
szegedy@google.com
Vincent Vanhoucke
vanhoucke@google.com
Sergey Ioffe
sioffe@google.com
Jonathon Shlens
shlens@google.com
Zbigniew Wojna
University College London
zbigniewwojna@gmail.com
Abstract
Convolutional networks are at the core of most state-
of-the-art computer vision solutions for a wide variety of
tasks. Since 2014 very deep convolutional networks started
to become mainstream, yielding substantial gains in vari-
ous benchmarks. Although increased model size and com-
putational cost tend to translate to immediate quality gains
for most tasks (as long as enough labeled data is provided
for training), computational efficiency and low parameter
count are still enabling factors for various use cases such as
mobile vision and big-data scenarios. Here we explore
ways to scale up networks that aim to utilize the added
computation as efficiently as possible through suitably
factorized convolutions and aggressive regularization. We
benchmark our methods on the ILSVRC 2012 classification
challenge validation set and demonstrate substantial gains
over the state of the art: 21.2% top-1 and 5.6% top-5 error
for single-frame evaluation using a network with a computational
cost of 5 billion multiply-adds per inference and
fewer than 25 million parameters. With an ensemble of
4 models and multi-crop evaluation, we report 3.5% top-5
error and 17.3% top-1 error.
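The factorization idea named above can be illustrated with a back-of-the-envelope cost comparison (a sketch, not from the paper; the feature-map size and channel count below are hypothetical): two stacked 3×3 convolutions cover the same receptive field as one 5×5 convolution at 18/25 of the multiply-add cost.

```python
def conv_multiply_adds(h, w, c_in, c_out, k):
    """Multiply-adds for a k x k convolution on an h x w feature map,
    assuming 'same' padding so the output spatial size equals the input."""
    return h * w * c_in * c_out * k * k

# Hypothetical feature-map size and channel width, for illustration only.
h, w, c = 17, 17, 320

single_5x5 = conv_multiply_adds(h, w, c, c, 5)       # one 5x5 layer
stacked_3x3 = 2 * conv_multiply_adds(h, w, c, c, 3)  # two 3x3 layers

print(stacked_3x3 / single_5x5)  # 0.72 (= 18/25), a 28% saving
```

The spatial terms cancel, so the 18/25 ratio is independent of the feature-map size and channel counts; it depends only on the kernel areas (2 · 3² versus 5²).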
1. Introduction
Since the winning entry by Krizhevsky et al. [9] in the
2012 ImageNet competition [16], their network "AlexNet"
has been successfully applied to a wide variety of computer
vision tasks, for example object detection [5], segmentation
[12], human pose estimation [22], video classification [8],
object tracking [23], and super-resolution [3].
These successes spurred a new line of research focused
on finding higher-performing convolutional neural networks.
Starting in 2014, the quality of network architectures
improved significantly through the use of deeper and wider
networks. VGGNet [18] and GoogLeNet [20] yielded similarly
high performance in the 2014 ILSVRC [16] classification
challenge. One interesting observation was that gains
in classification performance tend to transfer to significant
quality gains in a wide variety of application domains.
This means that architectural improvements in deep convolutional
networks can be utilized to improve performance
on most other computer vision tasks that are increasingly
reliant on high-quality, learned visual features. Also,
improvements in network quality opened up new application
domains for convolutional networks in cases where
AlexNet features could not compete with hand-engineered
solutions, e.g. proposal generation in detection [4].
Although VGGNet [18] has the compelling feature of
architectural simplicity, this comes at a high cost: evaluating
the network requires a lot of computation. On the
other hand, the Inception architecture of GoogLeNet [20]
was designed to perform well even under strict constraints
on memory and computational budget. For example,
GoogLeNet employed only 5 million parameters, a 12×
reduction with respect to its predecessor AlexNet, which
used 60 million parameters. Furthermore, VGGNet employed
about 3× more parameters than AlexNet.
The computational cost of Inception is also much lower
than that of VGGNet or its higher-performing successors [6].
This has made it feasible to utilize Inception networks in
big-data scenarios [17], [13], where huge amounts of data
need to be processed at reasonable cost, or in scenarios
where memory or computational capacity is inherently limited,
for example in mobile vision settings. It is certainly
possible to mitigate parts of these issues by applying specialized
solutions to target memory use [2], [15] or by optimizing
the execution of certain operations via computational
tricks [10]. However, these methods add extra complexity.
Furthermore, these methods could be applied to optimize
the Inception architecture as well, widening the efficiency
gap again.
Still, the complexity of the Inception architecture makes
arXiv:1512.00567v3 [cs.CV] 11 Dec 2015