In this paper, we will focus on an efficient deep neural network architecture for computer vision,
codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12]
in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word
“deep” is used in two different meanings: first of all, in the sense that we introduce a new level of
organization in the form of the “Inception module” and also in the more direct sense of increased
network depth. In general, one can view the Inception model as a logical culmination of [12]
while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits
of the architecture are experimentally verified on the ILSVRC 2014 classification and detection
challenges, on which it significantly outperforms the current state of the art.
2 Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard
structure – stacked convolutional layers (optionally followed by contrast normalization and max-
pooling) are followed by one or more fully-connected layers. Variants of this basic design are
prevalent in the image classification literature and have yielded the best results to date on MNIST,
CIFAR and, most notably, on the ImageNet classification challenge [9, 21]. For larger datasets such
as ImageNet, the recent trend has been to increase the number of layers [12] and layer size [21, 14],
while using dropout [7] to address the problem of overfitting.
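As a rough illustration of this standard layout, the following shape-level sketch propagates a small input through stacked convolution/pooling stages before flattening for the fully-connected layers; all layer sizes here are hypothetical choices for illustration, not taken from any cited model.

```python
# Hypothetical shape bookkeeping for the "standard" CNN layout:
# stacked convolutions (with optional max-pooling) followed by
# fully-connected layers. Layer sizes are illustrative only.

def conv_shape(h, w, c, k, n_filters):
    # Output shape of an unpadded ('valid') k x k convolution.
    return (h - k + 1, w - k + 1, n_filters)

def pool_shape(h, w, c, k=2):
    # Output shape of k x k max-pooling with stride k.
    return (h // k, w // k, c)

s = (32, 32, 3)                        # e.g. a small RGB input
s = conv_shape(*s, k=5, n_filters=16)  # conv stage 1 -> (28, 28, 16)
s = pool_shape(*s)                     # -> (14, 14, 16)
s = conv_shape(*s, k=5, n_filters=32)  # conv stage 2 -> (10, 10, 32)
s = pool_shape(*s)                     # -> (5, 5, 32)
flat = s[0] * s[1] * s[2]              # flattened for fully-connected layers
print(s, flat)
```

The fully-connected layers at the end operate on the flattened vector, which is where most of the parameters of such designs typically reside.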
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same
convolutional network architecture as [9] has also been successfully employed for localization [9,
14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience
model of the primate visual cortex, Serre et al. [15] use a series of fixed Gabor filters of different sizes
in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed
2-layer deep model of [15], all filters in the Inception model are learned. Furthermore, Inception
layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet
model.
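The multi-scale idea can be sketched as filters of several sizes applied in parallel to the same input, with the branch outputs concatenated along a channel axis. The NumPy sketch below uses random stand-in filters purely to show the shapes; in the model of [15] these would be fixed Gabor filters, and in the Inception model they are learned.

```python
import numpy as np

def conv2d_same(x, k):
    # Naive single-channel 2-D correlation with zero 'same' padding,
    # so every branch preserves the input's spatial size.
    r = k.shape[0] // 2
    xp = np.pad(x, r)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
# Parallel branches with 1x1, 3x3 and 5x5 filters (random stand-ins here).
branches = [conv2d_same(x, rng.standard_normal((s, s))) for s in (1, 3, 5)]
y = np.stack(branches, axis=-1)  # concatenate along the channel axis
print(y.shape)  # (8, 8, 3)
```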
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representa-
tional power of neural networks. When applied to convolutional layers, the method could be viewed
as additional 1 × 1 convolutional layers, typically followed by the rectified linear activation [9]. This
enables it to be easily integrated into current CNN pipelines. We use this approach heavily in our
architecture. However, in our setting, 1 × 1 convolutions serve a dual purpose: most critically, they
are used as dimension-reduction modules to remove computational bottlenecks that would
otherwise limit the size of our networks. This allows for increasing not just the depth, but also the
width of our networks without significant performance penalty.
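To make the dimension-reduction role concrete: a 1 × 1 convolution is a per-pixel linear map across channels, and inserting one before an expensive large filter sharply cuts the multiply count. The channel sizes below are illustrative assumptions, not figures from this architecture.

```python
import numpy as np

def conv1x1(x, w):
    # x: (H, W, C_in); w: (C_in, C_out). A 1x1 convolution is just a
    # channel-wise matmul at every pixel, here followed by ReLU [9].
    return np.maximum(x @ w, 0.0)

H, W = 28, 28
x = np.random.default_rng(0).standard_normal((H, W, 256))
w = 0.01 * np.random.default_rng(1).standard_normal((256, 64))
y = conv1x1(x, w)  # (28, 28, 64): 256 channels reduced to 64

# Multiply counts for a 5x5 convolution producing 256 output channels:
direct  = H * W * 256 * 256 * 5 * 5                     # 5x5 on 256 channels
reduced = H * W * 256 * 64 + H * W * 64 * 256 * 5 * 5   # 1x1 to 64, then 5x5
print(reduced / direct)  # 0.26: roughly 4x fewer multiplications
```

The same reduction also helps the width: more parallel branches fit in a given computational budget when each expensive branch is preceded by a 1 × 1 reduction.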
The current leading approach for object detection is the Regions with Convolutional Neural Net-
works (R-CNN) proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem
into two subproblems: to first utilize low-level cues such as color and superpixel consistency for
potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify
object categories at those locations. Such a two-stage approach leverages the accuracy of
bounding-box segmentation with low-level cues together with the classification power of
state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have ex-
plored enhancements in both stages, such as multi-box [5] prediction for higher object bounding
box recall, and ensemble approaches for better categorization of bounding box proposals.
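The two-stage pipeline just described can be sketched abstractly as follows; `propose_regions` and `cnn_classify` are hypothetical stand-ins for the real components (a low-level proposal method and a trained CNN), not an implementation of R-CNN itself.

```python
# Hypothetical sketch of the two-stage detection pipeline: category-agnostic
# region proposals first, then per-region CNN classification.

def propose_regions(image):
    # Stage 1 stand-in: R-CNN derives proposals from low-level cues such as
    # color and superpixel consistency; here we just return fixed boxes.
    return [(0, 0, 50, 50), (10, 10, 80, 80), (30, 30, 40, 40)]

def cnn_classify(image, box):
    # Stage 2 stand-in: crop the box and score it with a trained CNN.
    # Here: a dummy score proportional to box area, for illustration only.
    x0, y0, x1, y1 = box
    score = min(1.0, (x1 - x0) * (y1 - y0) / 5000.0)
    return "object", score

def detect(image, threshold=0.4):
    detections = []
    for box in propose_regions(image):
        label, score = cnn_classify(image, box)
        if score >= threshold:
            detections.append((box, label, score))
    return detections
```

The enhancements mentioned above slot into this structure: multi-box prediction improves the recall of stage one, while ensembling improves the classification of stage two.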
3 Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increas-
ing their size. This includes both increasing the depth – the number of levels – of the network and its
width: the number of units at each level. This is an easy and safe way of training higher-quality
models, especially given the availability of a large amount of labeled training data. However, this
simple solution comes with two major drawbacks.
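One way to see the scale of the first drawback: in fully-connected layers the parameter count grows roughly quadratically with width, so doubling every layer's width roughly quadruples the parameters. The layer widths below are hypothetical, chosen only to illustrate the growth.

```python
# Illustrative only: parameter count of a chain of fully-connected layers.
def fc_params(widths):
    # weight matrix plus bias vector between each pair of adjacent layers
    return sum(a * b + b for a, b in zip(widths, widths[1:]))

base    = fc_params([1024, 1024, 1024])  # 2,099,200 parameters
doubled = fc_params([2048, 2048, 2048])  # 8,392,704 parameters
print(doubled / base)  # ~4x: quadratic growth in layer width
```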
Bigger size typically means a larger number of parameters, which makes the enlarged network more
prone to overfitting, especially if the number of labeled examples in the training set is limited.
This can become a major bottleneck, since the creation of high quality training sets can be tricky