cient learning when input and output are highly correlated.
Moreover, our initial learning rate is 10^4 times higher than that of SRCNN [6]. This is enabled by residual-learning and gradient clipping.
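To make the role of gradient clipping concrete, here is a minimal PyTorch-style sketch (not the authors' implementation): gradients are clamped to a bound that shrinks as the learning rate grows, so the effective update step stays controlled even at high learning rates. The function name and the constant theta are illustrative assumptions.

```python
import torch.nn as nn

def clip_gradients(model: nn.Module, theta: float, lr: float) -> None:
    """Clamp every parameter gradient to [-theta/lr, theta/lr].

    Scaling the bound by the current learning rate keeps the effective
    step size (lr * gradient) bounded by theta, which is what permits
    training with learning rates far higher than usual.
    """
    bound = theta / lr
    for p in model.parameters():
        if p.grad is not None:
            p.grad.clamp_(-bound, bound)

# Typical use in a training step (sketch):
#   loss.backward()
#   clip_gradients(model, theta=0.01, lr=current_learning_rate)
#   optimizer.step()
```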
Scale Factor We propose a single-model SR approach.
Scales are typically user-specified and can be arbitrary in-
cluding fractions. For example, one might need smooth
zoom-in in an image viewer or resizing to a specific dimen-
sion. Training and storing many scale-dependent models in
preparation for all possible scenarios is impractical. We find
a single convolutional network is sufficient for multi-scale-
factor super-resolution.
Contribution In summary, in this work, we propose a highly accurate SR method based on a very deep convolutional network. Very deep networks converge too slowly if small learning rates are used. Boosting the convergence rate with high learning rates leads to exploding gradients, and we resolve the issue with residual-learning and gradient clipping. In addition, we extend our work to cope with the multi-scale SR problem in a single network. Our method is relatively accurate and fast in comparison to state-of-the-art methods, as illustrated in Figure 1.
2. Related Work
SRCNN is a representative state-of-the-art method for deep-learning-based SR, so we analyze it and compare it with our proposed method.
2.1. Convolutional Network for Image Super-Resolution
Model SRCNN consists of three layers: patch extrac-
tion/representation, non-linear mapping and reconstruction.
Filters of spatial sizes 9 × 9, 1 × 1, and 5 × 5 were used for the three layers, respectively.
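For concreteness, that baseline can be written down in a few lines; this PyTorch-style sketch assumes the commonly used channel widths (64 and 32), which are not stated in this section, and a single-channel (luminance) input.

```python
import torch.nn as nn

# Three-layer SRCNN baseline: 9x9 patch extraction/representation,
# 1x1 non-linear mapping, 5x5 reconstruction.  Without padding, each
# convolution shrinks the output, so the result is smaller than the
# input (a difference from our network, noted later in this section).
srcnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=9), nn.ReLU(inplace=True),
    nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 1, kernel_size=5),
)
```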
In [6], Dong et al. attempted to prepare deeper models, but failed to observe superior performance after a week of training. In some cases, deeper models gave inferior performance. They concluded that deeper networks do not result in better performance (Figure 9 in [6]).
However, we argue that increasing depth significantly boosts performance. We successfully use 20 weight layers (3 × 3 for each layer). Our network is very deep (20 vs. 3 [6]) and information used for reconstruction (receptive field) is much larger (41 × 41 vs. 13 × 13).
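These receptive-field numbers follow directly from the kernel sizes: a stack of stride-1 convolutions starts from a single pixel and grows by (k - 1) for every k × k layer. A quick check:

```python
def receptive_field(kernel_sizes):
    # Each stride-1 k x k convolution widens the receptive field by k - 1.
    return 1 + sum(k - 1 for k in kernel_sizes)

print(receptive_field([3] * 20))   # 41 -> 41 x 41 (our 20-layer network)
print(receptive_field([9, 1, 5]))  # 13 -> 13 x 13 (SRCNN)
```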
Training For training, SRCNN directly models high-resolution images. A high-resolution image can be decomposed into low-frequency information (corresponding to the low-resolution image) and high-frequency information (the residual image, or image details). Input and output images share the same low-frequency information. This indicates that SRCNN serves two purposes: carrying the input to the end layer and reconstructing residuals. Carrying the input to the end is conceptually similar to what an auto-encoder does. Training time may be spent on learning this auto-encoder function, so the convergence rate for learning the other part (image details) is significantly decreased. In contrast, since our network models the residual image directly, we achieve much faster convergence with even better accuracy.
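A minimal sketch of this residual formulation, assuming a network `net` that maps the interpolated low-resolution input `ilr` to predicted details, a high-resolution target `hr` of the same size, and a mean-squared-error loss (the variable names and loss choice here are illustrative):

```python
import torch.nn.functional as F

def residual_loss(net, ilr, hr):
    # The target is only the high-frequency part; the low-frequency
    # content (ilr itself) never has to be carried through the network.
    residual = hr - ilr
    return F.mse_loss(net(ilr), residual)
```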
Scale As in most existing SR methods, SRCNN is trained for a single scale factor and is supposed to work only with the specified scale. Thus, if a new scale is needed, a new model has to be trained. To cope with multi-scale SR (possibly including fractional factors) this way, an individual single-scale SR system would have to be constructed for each scale of interest.

However, preparing many individual machines for all possible scenarios is inefficient and impractical. In this work, we design and train a single network to handle the multi-scale SR problem efficiently. This turns out to work very well: our single machine compares favorably to a single-scale expert on each sub-task. For three scale factors (×2, ×3, ×4), we can reduce the number of parameters three-fold.
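One way to realize such a single multi-scale machine is to mix scale factors while building training pairs, so every mini-batch can contain examples of all scales. The following sketch is an assumption about the data pipeline, not the authors' exact recipe; it expects 4-D tensors (batch, channel, height, width).

```python
import random
import torch.nn.functional as F

SCALES = (2, 3, 4)  # the three factors discussed above

def make_training_pair(hr):
    """Downscale an HR patch by a random factor, then bicubic-upsample
    back to the original size, so one network sees all scales."""
    s = random.choice(SCALES)
    h, w = hr.shape[-2:]
    lr = F.interpolate(hr, size=(h // s, w // s),
                       mode="bicubic", align_corners=False)
    ilr = F.interpolate(lr, size=(h, w),
                        mode="bicubic", align_corners=False)
    return ilr, hr - ilr  # network input and residual target
```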
In addition to the aforementioned issues, there are some minor differences. Our output image has the same size as the input image because we zero-pad at every layer during training, whereas the output from SRCNN is smaller than the input. Finally, we simply use the same learning rate for all layers, while SRCNN uses different learning rates for different layers in order to achieve stable convergence.
3. Proposed Method
3.1. Proposed Network
For SR image reconstruction, we use a very deep convolutional network inspired by Simonyan and Zisserman [19]. The configuration is outlined in Figure 2. We use d layers, where all layers except the first and the last are of the same type: 64 filters of size 3 × 3 × 64, where each filter operates on a 3 × 3 spatial region across 64 channels (feature maps). The first layer operates on the input image. The last layer, used for image reconstruction, consists of a single filter of size 3 × 3 × 64.
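A minimal PyTorch-style sketch of this configuration; the ReLU nonlinearities and the single-channel (luminance) input are standard choices assumed here, and padding=1 keeps the feature-map size fixed at every layer, matching the zero-padding mentioned in Section 2.1.

```python
import torch.nn as nn

def make_network(d=20):
    """d-layer network: first layer maps the image to 64 feature maps,
    d - 2 middle layers apply 64 filters of size 3 x 3 x 64, and the
    last layer reconstructs a single-channel output."""
    layers = [nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
    for _ in range(d - 2):
        layers += [nn.Conv2d(64, 64, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(64, 1, kernel_size=3, padding=1)]
    return nn.Sequential(*layers)
```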
The network takes an interpolated low-resolution image
(to the desired size) as input and predicts image details.
Modelling image details is often used in super-resolution methods [21, 22, 15, 3], and we find that CNN-based methods can benefit from this domain-specific knowledge.
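At test time, the predicted details are simply added back to the interpolated input. A hedged sketch with illustrative names (`net`, `super_resolve`); clamping to [0, 1] assumes images are stored in that range.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def super_resolve(net, lr_img, scale):
    """Bicubic-interpolate to the target size, predict the residual,
    and add it back to obtain the final high-resolution estimate."""
    h, w = lr_img.shape[-2:]
    ilr = F.interpolate(lr_img, size=(h * scale, w * scale),
                        mode="bicubic", align_corners=False)
    return (ilr + net(ilr)).clamp(0.0, 1.0)
```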
In this work, we demonstrate that explicitly modelling
image details (residuals) has several advantages. These are
further discussed later in Section 4.2.
One problem with using a very deep network to predict dense outputs is that the size of the feature map gets reduced every time convolution operations are applied. For example, when an input of size (n+1) × (n+1) is applied to a network with receptive field size n × n, the output image is 1 × 1.