of video style transfer by imposing temporal constraints.
The framework of Gatys et al. [16] is based on a slow
optimization process that iteratively updates the image to
minimize a content loss and a style loss computed by a loss
network. It can take minutes to converge even with mod-
ern GPUs. On-device processing in mobile applications is
therefore too slow to be practical. A common workaround
is to replace the optimization process with a feed-forward
neural network that is trained to minimize the same ob-
jective [24, 51, 31]. These feed-forward style transfer ap-
proaches are about three orders of magnitude faster than
the optimization-based alternative, opening the door to real-
time applications. Wang et al. [53] enhanced the granularity
of feed-forward style transfer with a multi-resolution archi-
tecture. Ulyanov et al. [52] proposed ways to improve the
quality and diversity of the generated samples. However,
the above feed-forward methods are limited in the sense that
each network is tied to a fixed style. To address this prob-
lem, Dumoulin et al. [11] introduced a single network that
is able to encode 32 styles and their interpolations. Concurrent
with our work, Li et al. [32] proposed a feed-forward
architecture that can synthesize up to 300 textures and trans-
fer 16 styles. Still, the two methods above cannot adapt to
arbitrary styles that are not observed during training.
Very recently, Chen and Schmidt [6] introduced a feed-
forward method that can transfer arbitrary styles thanks to
a style swap layer. Given feature activations of the content
and style images, the style swap layer replaces the content
features with the closest-matching style features in a patch-
by-patch manner. Nevertheless, their style swap layer cre-
ates a new computational bottleneck: more than 95% of the
computation is spent on the style swap for 512 × 512 input
images. Our approach also permits arbitrary style transfer,
while being 1-2 orders of magnitude faster than [6].
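The patch-matching idea behind a style swap can be illustrated with a minimal NumPy sketch. It is deliberately simplified: it uses 1 × 1 patches (single feature vectors) and cosine similarity, both of which are assumptions of this sketch rather than details taken from [6], and the function name `style_swap_1x1` is hypothetical.

```python
import numpy as np

def style_swap_1x1(content_feat, style_feat, eps=1e-8):
    """Replace each content feature vector with its closest style feature vector.

    content_feat: array of shape (C, Hc, Wc)
    style_feat:   array of shape (C, Hs, Ws)
    Assumptions of this sketch: 1x1 patches and cosine similarity.
    """
    C, Hc, Wc = content_feat.shape
    c = content_feat.reshape(C, -1).T   # (Hc*Wc, C), one row per content position
    s = style_feat.reshape(C, -1).T     # (Hs*Ws, C), one row per style position

    # Normalize rows so that a dot product equals cosine similarity.
    c_unit = c / (np.linalg.norm(c, axis=1, keepdims=True) + eps)
    s_unit = s / (np.linalg.norm(s, axis=1, keepdims=True) + eps)

    # Similarity between every content position and every style position.
    sim = c_unit @ s_unit.T             # (Hc*Wc, Hs*Ws)
    nearest = sim.argmax(axis=1)        # index of the best style match per position

    swapped = s[nearest]                # (Hc*Wc, C)
    return swapped.T.reshape(C, Hc, Wc)
```

In this sketch the similarity matrix has one entry per pair of content and style positions, so its size grows with the product of the two feature-map areas.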
Another central problem in style transfer is which style
loss function to use. The original framework of Gatys et
al. [16] matches styles by matching the second-order statis-
tics between feature activations, captured by the Gram ma-
trix. Other effective loss functions have been proposed,
such as MRF loss [30], adversarial loss [31], histogram
loss [54], CORAL loss [41], MMD loss [33], and distance
between channel-wise mean and variance [33]. Note that all
the above loss functions aim to match some feature statistics
between the style image and the synthesized image.
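To make the notion of "feature statistics" concrete, the sketch below computes two of the statistics mentioned above for a single feature map: the Gram matrix and the channel-wise mean and variance. The normalization by H·W, the squared-error distances, and all function names are assumptions made for illustration; the cited papers may scale or weight these terms differently.

```python
import numpy as np

def gram_matrix(feat):
    """Second-order statistics of a (C, H, W) feature map.

    Dividing by H*W is an assumption of this sketch.
    """
    C, H, W = feat.shape
    f = feat.reshape(C, H * W)
    return f @ f.T / (H * W)            # (C, C)

def channel_mean_var(feat):
    """Channel-wise mean and variance of a (C, H, W) feature map."""
    f = feat.reshape(feat.shape[0], -1)
    return f.mean(axis=1), f.var(axis=1)

def gram_style_loss(feat_a, feat_b):
    """Mean squared difference between Gram matrices (assumed distance)."""
    return np.mean((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2)

def mean_var_style_loss(feat_a, feat_b):
    """Squared distance between channel-wise means and variances (assumed distance)."""
    mu_a, var_a = channel_mean_var(feat_a)
    mu_b, var_b = channel_mean_var(feat_b)
    return np.sum((mu_a - mu_b) ** 2) + np.sum((var_a - var_b) ** 2)
```

Both statistics discard spatial layout: shuffling the spatial positions of a feature map leaves its Gram matrix and its channel-wise mean and variance unchanged.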
Deep generative image modeling. There are several al-
ternative frameworks for image generation, including varia-
tional auto-encoders [27], auto-regressive models [40], and
generative adversarial networks (GANs) [18]. Remarkably,
GANs have achieved the most impressive visual quality.
Various improvements to the GAN framework have been
proposed, such as conditional generation [43, 23], multi-
stage processing [9, 20], and better training objectives [46,
1]. GANs have also been applied to style transfer [31] and
cross-domain image generation [50, 3, 23, 38, 37, 25].
3. Background
3.1. Batch Normalization
The seminal work of Ioffe and Szegedy [22] introduced
a batch normalization (BN) layer that significantly eases the
training of feed-forward networks by normalizing feature
statistics. BN layers were originally designed to accelerate
training of discriminative networks, but have also been
found effective in generative image modeling [42]. Given
an input batch $x \in \mathbb{R}^{N \times C \times H \times W}$, BN normalizes the mean
and standard deviation for each individual feature channel:
$$\mathrm{BN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta \tag{1}$$
where $\gamma, \beta \in \mathbb{R}^{C}$ are affine parameters learned from data;
$\mu(x), \sigma(x) \in \mathbb{R}^{C}$ are the mean and standard deviation,
computed across batch size and spatial dimensions independently
for each feature channel:
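$$\mu_c(x) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw} \tag{2}$$
$$\sigma_c(x) = \sqrt{\frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{nchw} - \mu_c(x) \right)^2 + \epsilon} \tag{3}$$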
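Equations (1) to (3) can be sketched in a few lines of NumPy. This is a minimal illustration using training-time (mini-batch) statistics only; the function name `batch_norm` and the default value of the small stabilizing constant `eps` are assumptions of this sketch.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization of x with shape (N, C, H, W).

    gamma, beta: affine parameters of shape (C,).
    eps: small constant inside the square root (value assumed here).
    """
    # Eq. (2): one mean per channel, taken over batch and spatial axes.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)                    # (1, C, 1, 1)
    # Eq. (3): one standard deviation per channel over the same axes.
    sigma = np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)   # (1, C, 1, 1)
    # Eq. (1): normalize, then apply the per-channel affine transform.
    x_hat = (x - mu) / sigma
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

With `gamma` set to ones and `beta` to zeros, each output channel has approximately zero mean and unit standard deviation across the whole batch.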
BN uses mini-batch statistics during training and replaces
them with population statistics during inference, introducing a
discrepancy between training and inference. Batch renormalization
[21] was recently proposed to address this issue
by gradually using population statistics during training. As
another interesting application of BN, Li et al. [34] found
that BN can alleviate domain shifts by recomputing population
statistics in the target domain. Recently, several alternative
normalization schemes have been proposed to extend BN’s
effectiveness to recurrent architectures [35, 2, 47, 8, 29, 44].
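The gap between training-time and inference-time statistics can be illustrated with a small sketch. It assumes that population statistics are estimated with an exponential moving average of the mini-batch statistics, a common convention that is not specified in the text; the class name `RunningBatchStats` and the `momentum` value are likewise hypothetical.

```python
import numpy as np

class RunningBatchStats:
    """Normalizes with batch statistics in training and running estimates at inference.

    The moving-average update and the momentum value are assumptions of this sketch.
    """

    def __init__(self, num_channels, momentum=0.1):
        self.mean = np.zeros(num_channels)
        self.var = np.ones(num_channels)
        self.momentum = momentum

    def normalize(self, x, training, eps=1e-5):
        if training:
            # Use the statistics of the current mini-batch ...
            mean = x.mean(axis=(0, 2, 3))
            var = x.var(axis=(0, 2, 3))
            # ... and fold them into the running population estimates.
            self.mean = (1 - self.momentum) * self.mean + self.momentum * mean
            self.var = (1 - self.momentum) * self.var + self.momentum * var
        else:
            # At inference, rely on the accumulated estimates only.
            mean, var = self.mean, self.var
        mean = mean[None, :, None, None]
        var = var[None, :, None, None]
        return (x - mean) / np.sqrt(var + eps)
```

Calling `normalize` on the same input with `training=True` and then `training=False` generally yields different outputs until the running estimates have converged, which is the discrepancy described above.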
3.2. Instance Normalization
In the original feed-forward stylization method [51], the
style transfer network contains a BN layer after each con-
volutional layer. Surprisingly, Ulyanov et al. [52] found
that significant improvement could be achieved simply by
replacing BN layers with IN layers:
$$\mathrm{IN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta \tag{4}$$
Unlike BN layers, here µ(x) and σ(x) are computed
across spatial dimensions independently for each
channel and each sample:
$$\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw} \tag{5}$$
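A minimal NumPy sketch of instance normalization follows. The only change relative to a batch normalization implementation is the set of axes over which statistics are taken. The standard deviation is assumed here to be computed over the same spatial axes as the mean in Eq. (5), with a small constant `eps` inside the square root; the function name and the value of `eps` are assumptions of this sketch.

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Instance normalization of x with shape (N, C, H, W).

    gamma, beta: affine parameters of shape (C,).
    """
    # Eq. (5): one mean per sample and channel, taken over spatial axes only.
    mu = x.mean(axis=(2, 3), keepdims=True)                    # (N, C, 1, 1)
    # Assumed analogue for the standard deviation, over the same axes.
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)   # (N, C, 1, 1)
    # Eq. (4): normalize, then apply the per-channel affine transform.
    x_hat = (x - mu) / sigma
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

Because no statistic is shared across the batch axis, the output for one sample does not depend on which other samples are in the batch.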