video clips. A similar approach was proposed in the TGAN
work [30]. We argue that assuming a video clip is a point
in the latent space unnecessarily increases the complexity
of the problem, because videos of the same action with different execution speeds are represented by different points
in the latent space. Moreover, this assumption forces ev-
ery generated video clip to have the same length, while the
length of real-world video clips varies. An alternative (and
likely more intuitive and efficient) approach would assume
a latent space of images and consider that a video clip is
generated by traversing the points in the latent space. Video
clips of different lengths correspond to latent space trajec-
tories of different lengths.
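Concretely, a clip of length $K$ would then be represented by a trajectory rather than by a single point. In notation we introduce here purely for illustration (an image latent space $\mathcal{Z}_{\mathrm{I}}$ and an image generator $G_{\mathrm{I}}$), the clip corresponds to
$$\big[\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \ldots, \mathbf{z}^{(K)}\big], \quad \mathbf{z}^{(t)} \in \mathcal{Z}_{\mathrm{I}}, \qquad \tilde{\mathbf{x}}^{(t)} = G_{\mathrm{I}}\big(\mathbf{z}^{(t)}\big),$$
so that clips of different lengths differ only in the number of latent points $K$ that are visited.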
In addition, as videos are about objects (content) per-
forming actions (motion), the latent space of images should
be further decomposed into two subspaces, where the devi-
ation of a point in the first subspace (the content subspace)
leads to content changes in a video clip, and the deviation in
the second subspace (the motion subspace) results in tem-
poral motions. Through this modeling, videos of an action
with different execution speeds will only result in different
traversal speeds of a trajectory in the motion space. Decom-
posing motion and content allows a more controlled video
generation process. By changing the content representation
while fixing the motion trajectory, we have videos of dif-
ferent objects performing the same motion. By changing
motion trajectories while fixing the content representation,
we have videos of the same object performing different motions, as illustrated in Fig. 1.
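In the same illustrative notation, one way to write this decomposition is $\mathcal{Z}_{\mathrm{I}} = \mathcal{Z}_{\mathrm{C}} \times \mathcal{Z}_{\mathrm{M}}$, with each frame code taking the form
$$\mathbf{z}^{(t)} = \big[\mathbf{z}_{\mathrm{C}}, \mathbf{z}_{\mathrm{M}}^{(t)}\big],$$
where the content code $\mathbf{z}_{\mathrm{C}}$ is held fixed across all frames of a clip and only the motion code $\mathbf{z}_{\mathrm{M}}^{(t)}$ varies over time; executing the same action faster then amounts to taking larger steps along the same trajectory in $\mathcal{Z}_{\mathrm{M}}$.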
In this paper, we propose the Motion and Content de-
composed Generative Adversarial Network (MoCoGAN)
framework for video generation. It generates a video clip
by sequentially generating video frames. At each time step,
an image generative network maps a random vector to an
image. The random vector consists of two parts: the first is sampled from a content subspace, and the second from a motion subspace. Since content in a short
video clip usually remains the same, we model the content
space using a Gaussian distribution and use the same real-
ization to generate each frame in a video clip. On the other
hand, sampling from the motion space is achieved through a recurrent neural network whose parameters are learned during training. Despite lacking supervision re-
garding the decomposition of motion and content in nat-
ural videos, we show that MoCoGAN can learn to disen-
tangle these two factors through a novel adversarial train-
ing scheme. Through extensive qualitative and quantitative experimental validation, with comparisons to state-of-the-art approaches including VGAN [40] and TGAN [30], as well as future frame prediction methods including
Conditional-VGAN (C-VGAN) [40] and Motion and Con-
tent Network (MCNET) [39], we verify the effectiveness of
MoCoGAN.
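To make the generation procedure concrete, the following minimal Python/PyTorch sketch mirrors the description above: a content code is sampled once per clip from a Gaussian, a recurrent network maps per-frame noise to motion codes, and an image generator renders one frame per time step. All dimensions, the GRU cell, and the toy fully connected frame generator are our own illustrative assumptions, not the architecture reported in this paper.

# Minimal sketch of MoCoGAN-style clip generation (illustrative only).
import torch
import torch.nn as nn

class MotionRNN(nn.Module):
    """Maps i.i.d. noise vectors to a trajectory of motion codes z_M^(t)."""
    def __init__(self, noise_dim=10, motion_dim=16):
        super().__init__()
        self.noise_dim, self.motion_dim = noise_dim, motion_dim
        self.cell = nn.GRUCell(noise_dim, motion_dim)

    def forward(self, batch, length):
        h = torch.zeros(batch, self.motion_dim)
        codes = []
        for _ in range(length):
            eps = torch.randn(batch, self.noise_dim)  # fresh noise each step
            h = self.cell(eps, h)                     # recurrent motion update
            codes.append(h)
        return torch.stack(codes, dim=1)              # (batch, length, motion_dim)

class FrameGenerator(nn.Module):
    """Toy stand-in for the image generator: latent code -> flattened frame."""
    def __init__(self, content_dim=50, motion_dim=16, pixels=64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + motion_dim, 256), nn.ReLU(),
            nn.Linear(256, pixels), nn.Tanh())

    def forward(self, z):
        return self.net(z)

def generate_clip(frame_gen, motion_rnn, batch=4, length=16, content_dim=50):
    z_content = torch.randn(batch, content_dim)       # sampled once per clip
    z_motion = motion_rnn(batch, length)              # one motion code per frame
    frames = [frame_gen(torch.cat([z_content, z_motion[:, t]], dim=1))
              for t in range(length)]
    return torch.stack(frames, dim=1)                 # (batch, length, pixels)

clip = generate_clip(FrameGenerator(), MotionRNN())
print(clip.shape)  # torch.Size([4, 16, 4096])

Because the clip length is simply the number of recurrent steps, the same generator can in principle produce clips of different lengths, consistent with the variable-length argument made above.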
1.1. Related Work
Video generation is not a new problem. Due to limita-
tions in computation, data, and modeling tools, early video
generation works focused on generating dynamic texture
patterns [34, 41, 9]. In recent years, with the availability
of GPUs, Internet videos, and deep neural networks, we are
now better positioned to tackle this intriguing problem.
Various deep generative models were recently proposed
for image generation including GANs [12], variational au-
toencoders (VAEs) [20, 28, 36], and PixelCNNs [38]. In
this paper, we propose the MoCoGAN framework for video
generation, which is based on GANs.
Multiple GAN-based image generation frameworks
were proposed. Denton et al. [8] showed a Laplacian pyra-
mid implementation. Radford et al. [27] used a deeper convolutional network. Zhang et al. [43] stacked two generative
networks to progressively render realistic images. Coupled
GANs [22] learned to generate corresponding images in different domains, an approach later extended to translate an image from one domain to another in an unsupervised fashion [21]. InfoGAN [5] learned a more interpretable latent
representation. Salimans et al. [31] proposed several GAN
training tricks. The WGAN [3] and LSGAN [23] frame-
works adopted alternative distribution distance metrics for
more stable adversarial training. Roth et al. [29] proposed
a special gradient penalty to further stabilize training. Kar-
ras et al. [18] used progressive growing of the discriminator
and the generator to generate high-resolution images. The
proposed MoCoGAN framework generates a video clip by
sequentially generating images using an image generator.
The framework can easily leverage advances in GAN-based image generation to improve the quality of the generated videos. As discussed in Section 1, [40, 30]
extended the GAN framework to the video generation prob-
lem by assuming a latent space of video clips where all the
clips have the same length.
Recurrent neural networks for image generation were
previously explored in [14, 16]. Specifically, these works used recurrent mechanisms to iteratively refine a generated image. Our work differs from [14, 16] in that we use
the recurrent mechanism to generate motion embeddings
of video frames in a video clip. The image generation is
achieved through a convolutional neural network.
The future frame prediction problem studied in [33, 26,
24, 17, 10, 37, 42, 39, 7] differs from the video gen-
eration problem. In future frame prediction, the goal is
to predict the future frames of a video given the observed ones. Previous works on future frame prediction can be roughly divided into two categories: one focuses on generating the raw pixel values of future frames based on the observed ones [33, 26, 24, 17, 42, 39], while the other focuses on generating transformations that reshuffle the pixels of the previous frames to construct future frames.