
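### VideoGPT: Video Generation Using VQ-VAE and Transformers

#### Overview

This paper introduces VideoGPT, a video generation architecture that combines a vector-quantized variational autoencoder (VQ-VAE) with a Transformer to generate natural videos. The design aims for high-quality generation on complex video datasets while keeping training simple and efficient. The method is evaluated on the BAIR Robot, UCF-101, and TGIF datasets, where its samples are competitive with state-of-the-art generative adversarial network (GAN) models.

#### Intended audience

- **Researchers and scholars in computer vision and machine learning**: studying VideoGPT shows how VQ-VAE and Transformers can be combined for efficient video generation, and how these techniques apply to video synthesis and prediction.
- **Engineers and developers interested in GAN-based video generation**: VideoGPT is a likelihood-based alternative, so its principles and implementation details offer a useful point of comparison with GAN approaches.
- **Data scientists exploring deep learning for video processing and action recognition**: VideoGPT could be used to simulate video data, for example to augment existing datasets when training machine learning models.
- **Students and educators in artificial intelligence**: especially those focused on video content generation and analysis, for whom VideoGPT is a compact teaching case for understanding and evaluating generative models.

#### Use cases and goals

- **Research and development**: VideoGPT gives researchers a simple baseline for exploring new methods in video generation, synthesis, and prediction.
- **Education**: as a teaching case, VideoGPT helps students understand likelihood-based video generation, how it contrasts with GANs, and how to evaluate generative models.
- **Industry**: VideoGPT could be applied in entertainment, virtual reality, and game development, for example to generate environment backgrounds or character animation.
- **Data analysis**: data scientists could use VideoGPT to simulate video data for dataset augmentation.
- **Technical evaluation**: researchers and developers can use the metrics reported in the paper, such as the Fréchet Video Distance (FVD) on BAIR, to compare the quality of videos generated by different models.

#### Technical highlights

- **VQ-VAE**: VideoGPT uses a VQ-VAE with 3D convolutions and axial self-attention to learn downsampled discrete latent representations of raw video. This step is what reduces compute cost and makes the model practical to train.
- **GPT-like architecture**: once the discrete latents are available, a GPT-like architecture with spatio-temporal position encodings models them autoregressively. The approach is simple and easy to train, yet its samples are competitive with state-of-the-art GAN models.
- **High-fidelity natural video generation**: VideoGPT produces competitive samples on the BAIR Robot dataset and generates high-fidelity natural videos from the UCF-101 and TGIF datasets.

#### Conclusion

VideoGPT combines the strengths of VQ-VAE and Transformers into a simple pipeline for generating natural videos. The authors release samples and code, and hope the architecture serves as a reproducible reference for minimal Transformer-based video generation models.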


VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan*¹, Yunzhi Zhang*¹, Pieter Abbeel, Aravind Srinivas

Abstract

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html.

1. Introduction

Deep generative models of multiple types (Kingma & Welling, 2013; Goodfellow et al., 2014; van den Oord et al., 2016b; Dinh et al., 2016) have seen incredible progress in the last few years on multiple modalities including natural images (van den Oord et al., 2016c; Zhang et al., 2019; Brock et al., 2018; Kingma & Dhariwal, 2018; Ho et al., 2019a; Karras et al., 2017; 2019; Van Den Oord et al., 2017; Razavi et al., 2019; Vahdat & Kautz, 2020; Ho et al., 2020; Chen et al., 2020; Ramesh et al., 2021), audio waveforms conditioned on language features (van den Oord et al., 2016a; Oord et al., 2017; Prenger et al., 2019; Bińkowski et al., 2019), natural language in the form of text (Radford et al., 2019; Brown et al., 2020), and music generation (Dhariwal et al., 2020). These results have been made possible thanks to fundamental advances in deep learning architectures (He et al., 2015; van den Oord et al., 2016b;c; Vaswani et al., 2017; Zhang et al., 2019; Menick & Kalchbrenner, 2018) as well as the availability of compute resources (Jouppi et al., 2017; Amodei & Hernandez, 2018) that are more powerful and plentiful than a few years ago.

* Equal contribution. ¹ University of California, Berkeley. Correspondence to: Wilson Yan, Aravind Srinivas <wilson1.yan@berkeley.edu, aravind_srinivas@berkeley.edu>.

While there have certainly been impressive efforts to model videos (Vondrick et al., 2016; Kalchbrenner et al., 2016; Tulyakov et al., 2018; Clark et al., 2019), high-fidelity natural video is one notable modality that has not seen the same level of progress in generative modeling as images, audio, and text. This is reasonable, since the complexity of natural videos requires modeling correlations across both space and time with much higher input dimensions. Video modeling is therefore a natural next challenge for current deep generative models. The complexity of the problem also demands more compute resources, which is likely one important reason for the relatively slow progress in generative modeling of videos.

Why is it useful to build generative models of videos? Conditional and unconditional video generation implicitly addresses the problem of video prediction and forecasting. Video prediction (Srivastava et al., 2015; Finn et al., 2016; Kalchbrenner et al., 2017; Sønderby et al., 2020) can be seen as learning a generative model of future frames conditioned on the past frames. Architectures developed for video generation can be useful in forecasting applications such as weather prediction (Sønderby et al., 2020) and autonomous driving (e.g., predicting the future in more semantic and dense abstractions like segmentation masks (Luc et al., 2017)). Finally, building generative models of the world around us is considered one way to measure our understanding of physical common sense and predictive intelligence (Lake et al., 2015).

Multiple classes of generative models have been shown to produce strikingly good samples, such as autoregressive models (van den Oord et al., 2016b;c; Parmar et al., 2018; Menick & Kalchbrenner, 2018; Radford et al., 2019; Chen et al., 2020), generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015), variational autoencoders (VAEs) (Kingma & Welling, 2013; Kingma et al., 2016; Mittal et al., 2017; Marwah et al., 2017; Vahdat & Kautz, 2020; Child, 2020), Flows (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018; Ho et al., 2019a), vector quantized VAE (VQ-VAE) (Van Den Oord et al., 2017; Razavi et al., 2019; Ramesh et al., 2021), and lately diffusion and score matching models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020). These different generative model families have their tradeoffs across various dimensions: sampling speed, sample diversity, sample quality, optimization stability, compute requirements, ease of evaluation, and so forth. Excluding score-matching models, at a broad level, one can group these models into likelihood-based (PixelCNNs, iGPT, NVAE, VQ-VAE, Glow) and adversarial generative models (GANs). The natural question is: What is a good model class to pick for studying and scaling video generation?

Figure 1. 64 × 64 and 128 × 128 video samples generated by VideoGPT.

arXiv:2104.10157v2 [cs.CV] 14 Sep 2021

First, we make a choice between likelihood-based and adversarial models. Likelihood-based models are convenient to train since the objective is well understood, easy to optimize across a range of batch sizes, and easy to evaluate. Given that videos already present a hard modeling challenge due to the nature of the data, we believe likelihood-based models present fewer difficulties in optimization and evaluation, hence allowing us to focus on the architecture modeling.

Next, among likelihood-based models, we pick autoregressive models simply because they have worked well on discrete data in particular, have shown greater success in terms of sample quality (Ramesh et al., 2021), and have well-established training recipes and modeling architectures that take advantage of the latest innovations in Transformer architectures (Vaswani et al., 2017; Child et al., 2019; Ho et al., 2019b; Huang et al., 2019).

Footnote: It is not the focus of this paper to say likelihood models are better than GANs for video modeling. This is purely a design choice guided by our inclination to explore likelihood-based generative models and non-empirically established beliefs with respect to stability of training.

Finally, among autoregressive models, we consider the following question: Is it better to perform autoregressive modeling in a downsampled latent space without spatio-temporal redundancies compared to modeling at the atomic level of all pixels across space and time? Below, we present our reasons for choosing the former. Natural images and videos contain a lot of spatial and temporal redundancies, which is why we use image compression tools such as JPEG (Wallace, 1992) and video codecs such as MPEG (Le Gall, 1991) every day. These redundancies can be removed by learning a denoised downsampled encoding of the high resolution inputs. For example, 4x downsampling across the two spatial dimensions and the temporal dimension results in a 64x reduction in resolution (4 × 4 × 4 = 64), so that the computation of powerful deep generative models is spent on these fewer, more useful bits. As shown in VQ-VAE (Van Den Oord et al., 2017), even a lossy decoder can transform the latents to generate sufficiently realistic samples. This framework has recently produced high quality text-to-image generation models such as DALL-E (Ramesh et al., 2021). Furthermore, modeling in the latent space downsampled across space and time instead of the pixel space improves sampling speed and compute requirements due to reduced dimensionality.
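The downsampling arithmetic can be checked with a small sketch. The clip size (16 frames of 64 × 64 pixels) and the helper names are illustrative assumptions, not values taken from the paper's configuration.

```python
def latent_shape(frames, height, width, t_down=4, s_down=4):
    """Shape of the latent grid after downsampling time and both spatial axes."""
    return (frames // t_down, height // s_down, width // s_down)


def reduction_factor(t_down=4, s_down=4):
    """Overall reduction in positions: one temporal axis, two spatial axes."""
    return t_down * s_down * s_down


# Hypothetical clip: 16 frames of 64 x 64 pixels.
video = (16, 64, 64)
latents = latent_shape(*video)

pixels = video[0] * video[1] * video[2]
codes = latents[0] * latents[1] * latents[2]

print("latent grid:", latents)
print("positions in pixel space:", pixels)
print("positions in latent space:", codes)
print("reduction:", pixels // codes)
```

With 4x downsampling on each of the three axes, the autoregressive prior has to model 64 times fewer positions than a pixel-level model would.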

The above line of reasoning leads us to our proposed model: VideoGPT, a simple video generation architecture that is a minimal adaptation of VQ-VAE and GPT architectures for videos. VideoGPT employs 3D convolutions and transposed convolutions (Tran et al., 2015) along with axial attention (Huang et al., 2019; Ho et al., 2019b) for the autoencoder in VQ-VAE, learning a downsampled set of discrete latents from raw pixels of the video frames. These latents are then modeled with a strong autoregressive prior using a GPT-like (Radford et al., 2019; Child et al., 2019; Chen et al., 2020) architecture. The generated latents from the autoregressive prior are then decoded to videos of the original resolution using the decoder of the VQ-VAE.

Footnote: We note that Video Transformers (Weissenborn et al., 2019) also employ generative pre-training for videos using the Subscale Pixel Networks (SPN) (Menick & Kalchbrenner, 2018) architecture. Despite this, it is fair to use the GPT terminology for our model because our architecture more closely resembles the vanilla Transformer in a manner similar to iGPT (Chen et al., 2020).

Figure 2. We break down the training pipeline into two sequential stages: training VQ-VAE (left) and training an autoregressive transformer in the latent space (right). The first stage is similar to the original VQ-VAE training procedure. During the second stage, VQ-VAE encodes video data to latent sequences as training data for the prior model. For inference, we first sample a latent sequence from the prior, and then use VQ-VAE to decode the latent sequence to a video sample. [Diagram labels: Conv3D Encoder, Codebook, Discrete Latents, Conv3D Decoder (left); Discrete Latents, Flattened sequence, Transformer, Target (right).]
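The "flattened sequence" step in the second stage can be sketched as follows. This is a minimal illustration that assumes raster-scan ordering (time, then height, then width); the excerpt does not state the ordering used, and the function names are hypothetical.

```python
def flatten_latents(grid):
    """Flatten a T x H x W grid of code indices in raster-scan order (assumed)."""
    return [code for frame in grid for row in frame for code in row]


def position_of(index, height, width):
    """Recover the (t, h, w) coordinate of a sequence index.

    A spatio-temporal position encoding would be built from these coordinates.
    """
    t, rest = divmod(index, height * width)
    h, w = divmod(rest, width)
    return t, h, w


def unflatten_latents(sequence, frames, height, width):
    """Rebuild the T x H x W grid so a decoder could consume it."""
    grid = []
    for t in range(frames):
        frame = []
        for h in range(height):
            start = (t * height + h) * width
            frame.append(sequence[start:start + width])
        grid.append(frame)
    return grid


# Toy 2 x 2 x 3 grid of code indices (values are arbitrary).
toy_grid = [
    [[5, 1, 7], [2, 2, 9]],
    [[0, 4, 4], [8, 3, 6]],
]
sequence = flatten_latents(toy_grid)
print(sequence)
print(position_of(7, height=2, width=3))
print(unflatten_latents(sequence, 2, 2, 3) == toy_grid)
```

The prior only ever sees the one-dimensional sequence; the coordinates returned by `position_of` are what let it tell apart positions that are neighbours in time from those that are neighbours in space.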

Our results are highlighted below:

1. On the widely benchmarked BAIR Robot Pushing dataset (Ebert et al., 2017), VideoGPT can generate realistic samples that are competitive with existing methods such as TrIVD-GAN (Luc et al., 2020), achieving an FVD of 103 when benchmarked with real samples, and an FVD* (Razavi et al., 2019) of 94 when benchmarked with reconstructions.

2. In addition, VideoGPT is able to generate realistic samples from complex natural video datasets, such as UCF-101 and the Tumblr GIF dataset.

3. We present careful ablation studies for several architecture design choices in VideoGPT, including the benefit of axial attention blocks, the size of the VQ-VAE latent space, the number of codebooks, and the capacity (model size) of the autoregressive prior.

4. VideoGPT can easily be adapted for action-conditional video generation. We present qualitative results on the BAIR Robot Pushing dataset and the ViZDoom simulator (Kempka et al., 2016).

2. Background

2.1. VQ-VAE

The Vector Quantized Variational Autoencoder (VQ-VAE) (Van Den Oord et al., 2017) is a model that learns to compress high dimensional data points into a discretized latent space and reconstruct them. The encoder $E(x) \to h$ first encodes $x$ into a series of latent vectors $h$, which is then discretized by performing a nearest neighbors lookup in a codebook of embeddings $C = \{e_i\}_{i=1}^{K}$ of size $K$. The decoder $D(e) \to \hat{x}$ then learns to reconstruct $x$ from the quantized encodings. The VQ-VAE is trained using the following objective:

$$\mathcal{L} = \underbrace{\|x - D(e)\|_2^2}_{\mathcal{L}_{\text{recon}}} + \underbrace{\|sg[E(x)] - e\|_2^2}_{\mathcal{L}_{\text{codebook}}} + \underbrace{\beta \, \|sg[e] - E(x)\|_2^2}_{\mathcal{L}_{\text{commit}}}$$

where $sg$ refers to a stop-gradient. The objective consists of a reconstruction loss $\mathcal{L}_{\text{recon}}$, a codebook loss $\mathcal{L}_{\text{codebook}}$, and a commitment loss $\mathcal{L}_{\text{commit}}$. The reconstruction loss encourages the VQ-VAE to learn good representations to accurately reconstruct data samples. The codebook loss brings codebook embeddings closer to their corresponding encoder outputs, and the commitment loss is weighted by a hyperparameter $\beta$ and prevents the encoder outputs from fluctuating between different code vectors.
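The nearest-neighbour lookup and the three loss terms can be illustrated with a dependency-free sketch. The toy vectors, the function names, and the default `beta=0.25` are assumptions for illustration only. Plain Python has no automatic differentiation, so the stop-gradient appears only in comments: it changes which parameters receive gradients, not the forward values.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(h, codebook):
    """Return (index, embedding) of the codebook entry nearest to h."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(h, codebook[i]))
    return index, codebook[index]


def vqvae_loss_terms(x, x_hat, h, e, beta=0.25):
    """Forward values of the reconstruction, codebook, and commitment terms."""
    recon = squared_distance(x, x_hat)
    # sg[E(x)] - e: in an autodiff framework only the embedding e is updated.
    codebook_term = squared_distance(h, e)
    # sg[e] - E(x): in an autodiff framework only the encoder output is updated.
    commit_term = beta * squared_distance(e, h)
    return recon, codebook_term, commit_term


# Toy codebook with K = 3 embeddings of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
h = [0.9, 1.2]                      # stand-in for an encoder output E(x)
index, e = quantize(h, codebook)

x = [1.0, 2.0, 3.0]                 # stand-in for the input
x_hat = [1.0, 2.5, 3.0]             # stand-in for the decoder output D(e)
recon, codebook_term, commit_term = vqvae_loss_terms(x, x_hat, h, e)

print("nearest code:", index, e)
print("recon, codebook, commit:", recon, codebook_term, commit_term)
```

The codebook and commitment terms have the same forward value up to the factor `beta`. They differ only in where the gradient flows, which is why the stop-gradient is essential in a real implementation.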

An alternative to the codebook loss, described in (Van Den Oord et al., 2017), is to use an exponential moving average (EMA) update, which empirically shows faster training and convergence. In this paper, we use the EMA update when training the VQ-VAE.
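A minimal sketch of one EMA codebook step follows. It uses the formulation given in the VQ-VAE paper (running cluster counts and running sums of assigned encoder outputs) and is not necessarily how the VideoGPT code implements it. The decay value, the toy state, and the function name are assumptions.

```python
def ema_codebook_update(codebook, counts, sums, encodings, assignments, gamma=0.99):
    """One EMA step; mutates counts and sums, returns the new codebook.

    counts[i]: running number of encoder outputs assigned to code i
    sums[i]:   running sum of the encoder outputs assigned to code i
    """
    dim = len(codebook[0])
    for i in range(len(codebook)):
        assigned = [z for z, a in zip(encodings, assignments) if a == i]
        batch_sum = [sum(z[d] for z in assigned) for d in range(dim)]
        counts[i] = gamma * counts[i] + (1 - gamma) * len(assigned)
        sums[i] = [gamma * s + (1 - gamma) * b for s, b in zip(sums[i], batch_sum)]
    # Each embedding becomes the running mean of the outputs assigned to it.
    return [
        [s / counts[i] for s in sums[i]] if counts[i] > 0 else list(codebook[i])
        for i in range(len(codebook))
    ]


# Toy state: two 1-D codes; both encoder outputs in the batch map to code 0.
codebook = [[0.0], [10.0]]
counts = [1.0, 1.0]
sums = [[0.0], [10.0]]
encodings = [[1.0], [3.0]]
assignments = [0, 0]

# gamma=0.5 is chosen only to make the effect visible in one step.
new_codebook = ema_codebook_update(codebook, counts, sums, encodings,
                                   assignments, gamma=0.5)
print(new_codebook)
```

Code 0 moves toward the mean of its assigned outputs, while the unused code 1 stays where it was because its running sum and count decay at the same rate.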

2.2. GPT

GPT and Image-GPT (Chen et al., 2020) are a class of autoregressive transformers that have shown tremendous success in modelling discrete data such as natural language and high dimensional images. These models factorize the data distribution $p(x)$ over the $d$ elements of $x$ according to

$$p(x) = \prod_{i=1}^{d} p(x_i \mid x_{<i})$$

through masked self-attention mechanisms and are optimized through maximum likelihood. The architectures employ multi-head self-attention blocks followed by pointwise MLP feedforward blocks following the standard design from (Vaswani et al., 2017).
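The factorization can be made concrete with a toy model. Here a hand-written conditional over binary tokens stands in for the masked Transformer; the `conditional` rule and all names are hypothetical and chosen only so the numbers are easy to verify.

```python
import itertools
import math
import random


def conditional(prefix):
    """Toy p(x_i | x_<i) over {0, 1}: repeat the previous token with prob 0.8."""
    if not prefix:
        return [0.5, 0.5]
    probs = [0.2, 0.2]
    probs[prefix[-1]] = 0.8
    return probs


def log_likelihood(sequence):
    """log p(x) as the sum of log-conditionals, one per position."""
    return sum(math.log(conditional(sequence[:i])[token])
               for i, token in enumerate(sequence))


def sample(length, rng):
    """Ancestral sampling: draw each token given the tokens drawn so far."""
    sequence = []
    for _ in range(length):
        token = rng.choices([0, 1], weights=conditional(sequence))[0]
        sequence.append(token)
    return sequence


# p([0, 0, 1]) = 0.5 * 0.8 * 0.2
prob = math.exp(log_likelihood([0, 0, 1]))

# The factorization defines a valid distribution: probabilities sum to one.
total = sum(math.exp(log_likelihood(list(seq)))
            for seq in itertools.product([0, 1], repeat=3))

print(prob, total)
print(sample(8, random.Random(0)))
```

Maximum-likelihood training minimizes the negative of `log_likelihood` averaged over the data. In a GPT-style model, the causal attention mask is what guarantees that the prediction at position `i` depends only on earlier positions.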

3. VideoGPT

Our primary contribution is VideoGPT, a new method to model complex video data in a computationally efficient
