Algorithm 1 Train a neural network with mini-batch stochastic gradient descent.
initialize(net)
for epoch = 1, . . . , K do
    for batch = 1, . . . , #images/b do
        images ← uniformly random sample b images
        X, y ← preprocess(images)
        z ← forward(net, X)
        ℓ ← loss(z, y)
        grad ← backward(ℓ)
        update(net, grad)
    end for
end for
useful for efficient training on new hardware in Section 3. In
Section 4 we review three minor model architecture tweaks
for ResNet and propose a new one. Four additional train-
ing procedure refinements are then discussed in Section 5.
Finally, we study whether these more accurate models can help transfer learning in Section 6.
Our model implementations and training scripts are publicly available in GluonCV (https://github.com/dmlc/gluon-cv).
2. Training Procedures
The template of training a neural network with mini-
batch stochastic gradient descent is shown in Algorithm 1.
In each iteration, we randomly sample b images, compute the gradients, and then update the network parameters. Training stops after K passes through the dataset. All functions
and hyper-parameters in Algorithm 1 can be implemented
in many different ways. In this section, we first specify a
baseline implementation of Algorithm 1.
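To make the template concrete, the following is a minimal sketch of Algorithm 1 in Python with MXNet Gluon (the framework behind GluonCV). The network, the data iterator, and the momentum value are placeholders and assumptions for illustration, not the exact training script.

```python
import mxnet as mx
from mxnet import autograd, gluon

# Placeholder network; the real scripts use GluonCV model definitions.
net = gluon.nn.Dense(1000)
net.initialize(mx.init.Xavier())

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
# NAG optimizer as in Section 2.1; the 0.9 momentum is an assumption.
trainer = gluon.Trainer(net.collect_params(), 'nag',
                        {'learning_rate': 0.1, 'momentum': 0.9})

def train(train_data, num_epochs, batch_size):
    for epoch in range(num_epochs):          # K passes over the dataset
        for X, y in train_data:              # b preprocessed images per batch
            with autograd.record():
                z = net(X)                   # forward(net, X)
                l = loss_fn(z, y)            # loss(z, y)
            l.backward()                     # backward(l)
            trainer.step(batch_size)         # update(net, grad)
```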
2.1. Baseline Training Procedure
We follow a widely used implementation [8] of ResNet
as our baseline. The preprocessing pipelines for training and validation differ. During training, we perform the following steps one by one:
1. Randomly sample an image and decode it into 32-bit
floating point raw pixel values in [0, 255].
2. Randomly crop a rectangular region whose aspect ratio
is randomly sampled in [3/4, 4/3] and area randomly
sampled in [8%, 100%], then resize the cropped region
into a 224-by-224 square image.
3. Flip horizontally with 0.5 probability.
4. Scale hue, saturation, and brightness with coefficients
uniformly drawn from [0.6, 1.4].
5. Add PCA noise with a coefficient sampled from a normal distribution N(0, 0.1).
6. Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively.

Model             | Baseline Top-1 | Baseline Top-5 | Reference Top-1 | Reference Top-5
ResNet-50 [9]     | 75.87          | 92.70          | 75.3            | 92.2
Inception-V3 [26] | 77.32          | 93.43          | 78.8            | 94.4
MobileNet [11]    | 69.03          | 88.71          | 70.6            | -

Table 2: Validation accuracy of reference implementations and our baseline. Note that the numbers for Inception-V3 are obtained with 299-by-299 input images.
During validation, we resize each image’s shorter edge to 256 pixels while keeping its aspect ratio. Next, we crop out the 224-by-224 region in the center and normalize the RGB channels as in training. We do not perform any random augmentations during validation.
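For illustration, the training and validation pipelines above roughly correspond to the following MXNet Gluon transforms. This is a sketch rather than the exact script: the color-jitter arguments are an approximation of step 4, and the normalization constants are the step-6 values rescaled to [0, 1] because ToTensor divides pixel values by 255.

```python
from mxnet.gluon.data.vision import transforms

# Step-6 constants rescaled to [0, 1], since ToTensor maps pixels to [0, 1].
mean = (123.68 / 255, 116.779 / 255, 103.939 / 255)
std = (58.393 / 255, 57.12 / 255, 57.375 / 255)

# Training pipeline: steps 2-6 (step 1, decoding, happens in the dataset loader).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),  # step 2
    transforms.RandomFlipLeftRight(),                                        # step 3
    transforms.RandomColorJitter(brightness=0.4, saturation=0.4, hue=0.4),   # step 4 (approx.)
    transforms.RandomLighting(0.1),                                          # step 5, PCA noise
    transforms.ToTensor(),
    transforms.Normalize(mean, std),                                         # step 6
])

# Validation pipeline: resize shorter edge to 256, center-crop 224, normalize.
val_transform = transforms.Compose([
    transforms.Resize(256, keep_ratio=True),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
```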
The weights of both convolutional and fully-connected
layers are initialized with the Xavier algorithm [6]. In particular, we set the parameters to random values uniformly drawn from [−a, a], where a = √(6/(d_in + d_out)). Here d_in and d_out are the input and output channel sizes, respectively. All biases are initialized to 0. For batch normalization layers, γ vectors are initialized to 1 and β vectors to 0.
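A minimal NumPy sketch of this initialization scheme (not the GluonCV implementation) follows; fan_in and fan_out stand for the d_in and d_out channel sizes above, and the layer shapes are hypothetical.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, shape, rng=None):
    """Draw weights uniformly from [-a, a] with a = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=shape).astype(np.float32)

# Example: a 3x3 convolution with 64 input channels and 128 output channels.
weight = xavier_uniform(fan_in=64, fan_out=128, shape=(128, 64, 3, 3))
bias = np.zeros(128, dtype=np.float32)    # biases initialized to 0
gamma = np.ones(128, dtype=np.float32)    # batch-norm scale initialized to 1
beta = np.zeros(128, dtype=np.float32)    # batch-norm shift initialized to 0
```

In MXNet, the same bound is obtained with mx.init.Xavier(rnd_type='uniform', factor_type='avg', magnitude=3).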
Nesterov Accelerated Gradient (NAG) descent [20] is
used for training. Each model is trained for 120 epochs on
8 Nvidia V100 GPUs with a total batch size of 256. The
learning rate is initialized to 0.1 and divided by 10 at the
30th, 60th, and 90th epochs.
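The step decay can be written as a small helper, sketched below; the commented wiring into the Gluon trainer from the earlier sketch uses momentum and weight decay values that are assumptions, not taken from the text.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=0.1):
    """Start at base_lr and divide by 10 at the 30th, 60th, and 90th epochs."""
    return base_lr * factor ** sum(epoch >= m for m in milestones)

# Hypothetical wiring into the training loop from Section 2:
# trainer = gluon.Trainer(net.collect_params(), 'nag',
#                         {'learning_rate': step_lr(0), 'momentum': 0.9, 'wd': 1e-4})
# for epoch in range(120):
#     trainer.set_learning_rate(step_lr(epoch))
#     ...  # one pass over the training data as in Algorithm 1
```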
2.2. Experiment Results
We evaluate three CNNs: ResNet-50 [9], Inception-V3 [26], and MobileNet [11]. For Inception-V3 we resize the input images to 299-by-299. We use the ILSVRC2012 [23] dataset, which has 1.3 million training images and 1000 classes. The validation accuracies are shown in Table 2. As can be seen, our ResNet-50 results are slightly better than the reference results, while our baseline Inception-V3 and MobileNet are slightly lower in accuracy due to differences in the training procedure.
3. Efficient Training
Hardware, especially GPUs, has been rapidly evolving
in recent years. As a result, the optimal choices for many performance-related trade-offs have changed. For example,
it is now more efficient to use lower numerical precision and
larger batch sizes during training. In this section, we review
various techniques that enable low precision and large batch