AIGC video generation model: StreamingT2V, compatible with SVD and AnimateDiff
Ultra-long video generation! 1. Picsart AI Research and collaborators released StreamingT2V, an AI video model that generates videos up to 2 minutes long (1200 frames), exceeding Sora's 60-second length, and the project is open source. 2. StreamingT2V is compatible with models such as SVD and AnimateDiff, which helps grow the open-source ecosystem; in theory it can generate videos of unlimited length. 3. The model uses techniques such as a conditional attention module and an appearance preservation module to ensure temporal consistency and high image quality. 4. Application scenarios include film, games, and training environments for virtual worlds. The open-source code and demo of StreamingT2V are freely available on GitHub and Hugging Face. Demo: https://huggingface.co/spaces/PAIR/StreamingT2V
StreamingT2V: Consistent, Dynamic, and Extendable
Long Video Generation from Text

Roberto Henschel¹*, Levon Khachatryan¹*, Daniil Hayrapetyan¹*, Hayk Poghosyan¹, Vahram Tadevosyan¹, Zhangyang Wang¹,², Shant Navasardyan¹, Humphrey Shi¹,³

¹ Picsart AI Research (PAIR)   ² UT Austin   ³ SHI Labs @ Georgia Tech, Oregon & UIUC
* Equal contribution.

https://github.com/Picsart-AI-Research/StreamingT2V
arXiv:2403.14773v1 [cs.CV] 21 Mar 2024
Figure 1. StreamingT2V is an advanced autoregressive technique that enables the creation of long videos featuring rich motion dynamics
without any stagnation. It ensures temporal consistency throughout the video, aligns closely with the descriptive text, and maintains high
frame-level image quality. Our demonstrations include successful examples of videos up to 1200 frames, spanning 2 minutes, and can be
extended for even longer durations. Importantly, the effectiveness of StreamingT2V is not limited by the specific Text2Video model used,
indicating that improvements in base models could yield even higher-quality videos.
Abstract
Text-to-video diffusion models enable the generation of
high-quality videos that follow text instructions, making it
easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video
generation (typically 16 or 24 frames), ending up with hard-
cuts when naively extended to the case of long video syn-
thesis. To overcome these limitations, we introduce Stream-
ingT2V, an autoregressive approach for long video gener-
ation of 80, 240, 600, 1200 or more frames with smooth
transitions. The key components are: (i) a short-term mem-
ory block called conditional attention module (CAM), which
conditions the current generation on the features extracted
from the previous chunk via an attentional mechanism, lead-
ing to consistent chunk transitions, (ii) a long-term mem-
ory block called appearance preservation module, which
extracts high-level scene and object features from the first
video chunk to prevent the model from forgetting the ini-
tial scene, and (iii) a randomized blending approach that
enables applying a video enhancer autoregressively to infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates videos with a high amount of motion, whereas all competing image-to-video methods are prone to video stagnation when applied naively in an autoregressive manner. Thus, with StreamingT2V we propose a high-quality, seamless text-to-long-video generator that outperforms competitors in consistency and motion.
1. Introduction
In recent years, with the rise of Diffusion Models [15, 26,
28, 34], the task of text-guided image synthesis and manipu-
lation gained enormous attention from the community. The
huge success in image generation led to the further exten-
sion of diffusion models to generate videos conditioned by
textual prompts [4, 5, 7, 11–13, 17, 18, 20, 32, 37, 39, 45].
Despite the impressive generation quality and text align-
ment, the majority of existing approaches such as [4, 5,
17, 39, 45] are mostly focused on generating short frame
sequences (typically 16 or 24 frames long). However, short videos are of limited use in real-world applications such as ad making, storytelling, etc.
The naïve approach of simply training existing methods
on long videos (e.g. ≥ 64 frames) is normally unfeasi-
ble. Even for generating short sequences, a very expensive
training (e.g. using more than 260K steps and a batch size of 4500 [39]) is typically required. Without training on longer
videos, video quality commonly degrades when short video
generators are made to output long videos (see appendix).
Existing approaches, such as [5, 17, 23], thus extend the
baselines to autoregressively generate short video chunks
conditioned on the last frame(s) of the previous chunk.
However, the straightforward long-video generation ap-
proach of simply concatenating the noisy latents of a video
chunk with the last frame(s) of the previous chunk leads to
poor conditioning with inconsistent scene transitions (see
Sec. 5.3). Some works [4, 8, 40, 43, 48] also incorporate CLIP [25] image embeddings of the last frame of the previous chunk, achieving slightly better consistency, but are
still prone to inconsistent global motion across chunks (see
Fig. 5) as the CLIP image encoder loses information impor-
tant for perfectly reconstructing the conditional frames. The
concurrent work SparseCtrl [12] utilizes a more sophisti-
cated conditioning mechanism by sparse encoder. Its archi-
tecture requires to concatenate additional zero-filled frames
to the conditioning frames before being plugged into sparse
encoder. However, this inconsistency in the input leads to
inconsistencies in the output (see Sec.5.4). Moreover, we
observed that all image-to-video methods that we evaluated
in our experiments (see Sec.5.4) lead eventually to video
stagnation, when applied autoregressively by conditioning
on the last frame of the previous chunk.
To overcome the weaknesses and limitations of current
works, we propose StreamingT2V, an autoregressive text-
to-video method equipped with long/short-term memory
blocks that generates long videos without temporal incon-
sistencies.
To this end, we propose the Conditional Attention Mod-
ule (CAM) which, due to its attentional nature, effectively
borrows the content information from the previous frames
to generate new ones, while not restricting their motion by
the previous structures/shapes. Thanks to CAM, our results are smooth, with artifact-free video chunk transitions.
Existing approaches are not only prone to temporal in-
consistencies and video stagnation, but they suffer from
object appearance/characteristic changes and video quality
degradation over time (see e.g., SVD [4] in Fig. 7). The
reason is that, due to conditioning only on the last frame(s)
of the previous chunk, they overlook the long-term depen-
dencies of the autoregressive process. To address this is-
sue we design an Appearance Preservation Module (APM)
that extracts object or global scene appearance information
from an initial image (anchor frame), and conditions the
video generation process of all chunks with that informa-
tion, which helps to keep object and scene features across
the autoregressive process.
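To make this concrete, here is a minimal sketch of such anchor-frame conditioning. It is an illustration rather than the actual APM architecture: the CLIP-style image encoder for the anchor frame and the simple concatenation of its tokens to the prompt tokens are assumptions; the text only states that high-level anchor-frame features are injected into the text cross-attentions (see Fig. 3).

```python
import torch
import torch.nn as nn

class AnchorConditionedCrossAttention(nn.Module):
    """Toy cross-attention whose context is the prompt tokens augmented with
    tokens of a fixed anchor frame (an illustrative stand-in for APM)."""

    def __init__(self, dim, text_dim, img_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_proj = nn.Linear(text_dim, dim)   # maps prompt embeddings to the latent dim
        self.img_proj = nn.Linear(img_dim, dim)     # maps anchor-frame features to the latent dim

    def forward(self, x, text_tokens, anchor_tokens):
        # x:             (B, N, dim)          latent tokens of the current chunk
        # text_tokens:   (B, L_txt, text_dim) prompt embeddings
        # anchor_tokens: (B, L_img, img_dim)  anchor-frame features (e.g. from a CLIP image encoder)
        ctx = torch.cat([self.text_proj(text_tokens), self.img_proj(anchor_tokens)], dim=1)
        out, _ = self.attn(query=x, key=ctx, value=ctx)
        return out

# The same anchor_tokens are reused for every chunk, so scene/object
# appearance information stays available throughout the autoregression.
```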
To further improve the quality and resolution of our long
video generation, we adapt a video enhancement model for
autoregressive generation. For this purpose, we choose a
high-resolution text-to-video model and utilize the SDEdit
[22] approach for enhancing consecutive 24-frame chunks
(overlapping with 8 frames) of our video. To make the
chunk enhancement transitions smooth, we design a ran-
domized blending approach for seamless blending of over-
lapping enhanced chunks.
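The sketch below illustrates this enhancement pass under stated assumptions: `enhance_chunk` stands in for the SDEdit-based enhancer, and the random cut point inside each 8-frame overlap is a simplified stand-in for the randomized blending scheme, whose exact formulation is given later in the paper.

```python
import torch

def enhance_long_video(video, enhance_chunk, chunk=24, overlap=8):
    """Enhance a long video autoregressively in overlapping 24-frame chunks.
    In each overlap, a random cut decides which enhanced chunk supplies each frame."""
    num_frames = video.shape[0]
    out = torch.zeros_like(video)
    start = 0
    while start < num_frames:
        end = min(start + chunk, num_frames)
        enhanced = enhance_chunk(video[start:end])          # e.g. SDEdit-based high-res model
        if start == 0:
            out[start:end] = enhanced
        else:
            cut = start + torch.randint(0, overlap + 1, (1,)).item()
            out[cut:end] = enhanced[cut - start:]           # frames before `cut` keep the previous chunk
        start += chunk - overlap
    return out
```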
Experiments show that StreamingT2V successfully gen-
erates long and temporally consistent videos from text with-
out video stagnation. To summarize, our contributions are
three-fold:
• We introduce StreamingT2V, an autoregressive approach
for seamless synthesis of extended video content using
short and long-term dependencies.
• Our Conditional Attention Module (CAM) and Appear-
ance Preservation Module (APM) ensure the natural
continuity of the global scene and object characteristics
of generated videos.
• We seamlessly enhance generated long videos by intro-
ducing our randomized blending approach of consecu-
tive overlapping chunks.
2. Related Work
Text-Guided Video Diffusion Models. Generating videos
from textual instructions using Diffusion Models [15, 33]
is a recently established yet very active field of research
introduced by Video Diffusion Models (VDM) [17]. The
approach requires massive training resources, generates only low-resolution videos (up to 128x128), and is limited to at most 16 frames (without autoregression). Also, the training of text-to-video models is usually done on large datasets such as WebVid-
10M [3], or InternVid [41]. Several methods employ video
enhancement in the form of spatial/temporal upsampling
[5, 16, 17, 32], using cascades with up to 7 enhancer mod-
ules [16]. Such an approach produces high-resolution and
long videos. Yet, the generated content is still limited by the
key frames.
Towards generating longer videos (i.e. more keyframes),
Text-To-Video-Zero (T2V0) [18] and ART-V [42] employ
a text-to-image diffusion model. Therefore, they can gen-
erate only simple motions. T2V0 conditions on its first
frame via cross-frame attention and ART-V on an anchor
frame. Due to the lack of global reasoning, both approaches lead to unnatural or repetitive motions. MTVG [23] turns a text-to-
video model into an autoregressive method by a training-free
approach. It employs strong consistency priors within and across video chunks, which leads to a very low amount of motion and a mostly near-static background. FreeNoise [24]
samples a small set of noise vectors and re-uses them for the generation of all frames, while temporal attention is performed on local windows. As the employed temporal attention is invariant to such frame shuffling, this leads to high similarity between frames, almost always static global motion, and near-constant videos. Gen-L [38] generates
overlapping short videos and aggregates them via temporal
co-denoising, which can lead to quality degradations with
video stagnation.
Image-Guided Video Diffusion Models as Long Video
Generators. Several works condition the video generation
by a driving image or video [4, 6–8, 10, 12, 21, 27, 40,
43, 44, 48]. They can thus be turned into an autoregres-
sive method by conditioning on the frame(s) of the previous
chunk.
VideoDrafter [21] uses a text-to-image model to obtain
an anchor frame. A video diffusion model is conditioned
on the driving anchor to independently generate multiple videos that share the same high-level context. However, no
consistency among the video chunks is enforced, leading
to drastic scene cuts. Several works [7, 8, 44] concatenate
the (encoded) conditionings with an additional mask (which
indicates which frame is provided) to the input of the video
diffusion model.
In addition to concatenating the conditioning to the in-
put of the diffusion model, several works [4, 40, 48] replace
the text embeddings in the cross-attentions of the diffusion
model by CLIP [25] image embeddings of the conditional
frames. However, according to our experiments, their appli-
cability for long video generation is limited. SVD [4] shows
severe quality degradation over time (see Fig. 7), and both I2VGen-XL [48] and SVD [4] often generate inconsistencies between chunks, indicating that the conditioning mechanism is too weak.
Some works [6, 43] such as DynamiCrafter-XL [43] thus
add to each text cross-attention an image cross-attention,
which leads to better quality, but still to frequent inconsis-
tencies between chunks.
The concurrent work SparseCtrl [12] adds a ControlNet
[46]-like branch to the model, which takes the conditional
frames and a frame-indicating mask as input. By design, it requires appending additional frames consisting of black pixels to the conditional frames. This inconsistency is difficult for the model to compensate for, leading to frequent and severe scene cuts between frames.
Overall, only a small number of keyframes can currently
be generated at once with high quality. While in-between frames can be interpolated, interpolation does not add new content. Also, while image-to-video methods can be used autoregressively, their conditioning mechanisms either lead to inconsistencies or cause video stagnation. We conclude that existing works are not suitable for
high-quality and consistent long video generation without
video stagnation.
3. Preliminaries
Diffusion Models. Our text-to-video model, which we term
StreamingT2V, is a diffusion model that operates in the la-
tent space of the VQ-GAN [9, 35] autoencoder D(E(·)),
where E and D are the corresponding encoder and decoder,
respectively. Given a video V ∈ R^{F×H×W×3}, composed of F frames with spatial resolution H × W, its latent code x_0 ∈ R^{F×h×w×c} is obtained through frame-by-frame application of the encoder. More precisely, by identifying each tensor x ∈ R^{F×ĥ×ŵ×ĉ} as a sequence (x^f)_{f=1}^{F} with x^f ∈ R^{ĥ×ŵ×ĉ}, we obtain the latent code via x_0^f := E(V^f), for all f = 1, . . . , F. The diffusion forward process gradually adds Gaussian noise ϵ ∼ N(0, I) to the signal x_0:
q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I),   t = 1, . . . , T,     (1)

where q(x_t | x_{t−1}) is the conditional density of x_t given x_{t−1}, and {β_t}_{t=1}^{T} are hyperparameters. A high value for T is chosen such that the forward process completely destroys the initial signal x_0, resulting in x_T ∼ N(0, I).

Figure 2. The overall pipeline of StreamingT2V: In the Initialization Stage the first 16-frame chunk is synthesized by a text-to-video model (e.g. Modelscope [39]). In the Streaming T2V Stage the new content for further frames is autoregressively generated. Finally, in the Streaming Refinement Stage the generated long video (600, 1200 frames or more) is autoregressively enhanced by applying a high-resolution text-to-short-video model (e.g. MS-Vid2Vid-XL [48]) equipped with our randomized blending approach.

The goal of a diffusion model is then to learn a backward process
p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))     (2)

for t = T, . . . , 1 (see DDPM [15]), which allows generating a valid signal x_0 from standard Gaussian noise x_T.
Once x_0 is obtained from x_T, we obtain the generated video through frame-wise application of the decoder: Ṽ^f := D(x_0^f), for all f = 1, . . . , F. Yet, instead of learning a predictor for mean and variance in Eq. (2), we learn a model ϵ_θ(x_t, t) to predict the Gaussian noise ϵ that was used to form x_t from the input signal x_0 (which is a common reparametrization [15]).
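As a concrete illustration of this reparametrization, the sketch below samples x_t directly from the closed form of the forward process and returns the noise that ϵ_θ is trained to predict. The closed-form expression with ᾱ_t = Π_{s≤t}(1 − β_s) and the linear β schedule are standard DDPM choices [15], not details stated above.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # {beta_t}: a common linear schedule (assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # abar_t = prod_{s<=t} (1 - beta_s)

def noise_latent(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form and return the injected noise eps,
    which the model eps_theta(x_t, t) is trained to recover."""
    eps = torch.randn_like(x0)
    abar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return x_t, eps
```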
To guide the video generation by a textual prompt τ, we use a noise predictor ϵ_θ(x_t, t, τ) that is conditioned on τ. We model ϵ_θ(x_t, t, τ) as a neural network with learnable weights θ and train it on the denoising task:

min_θ E_{t, (x_0, τ)∼p_data, ϵ∼N(0,I)} ||ϵ − ϵ_θ(x_t, t, τ)||_2^2,     (3)

using the data distribution p_data. To simplify notation, we will denote by x_t^{r:s} = (x_t^j)_{j=r}^{s} the latent sequence of x_t from frame r to frame s, for all r, t, s ∈ N.
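A minimal training step for Eq. (3) could look as follows; `eps_theta` (the text-conditioned video UNet) and the precomputed ᾱ values are placeholders, and the prompt τ is assumed to be already encoded into embeddings `tau_emb`.

```python
import torch
import torch.nn.functional as F

def training_step(eps_theta, x0, tau_emb, alphas_bar, optimizer):
    """One denoising step for Eq. (3): predict the noise used to form x_t from x_0."""
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)      # random timestep per sample
    eps = torch.randn_like(x0)
    abar = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps              # closed-form forward process
    loss = F.mse_loss(eps_theta(x_t, t, tau_emb), eps)              # ||eps - eps_theta(x_t, t, tau)||_2^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```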
Text-To-Video Models. Most text-to-video models [5, 11,
16, 32, 39] extend pre-trained text-to-image models [26, 28]
by inserting new layers that operate on the temporal axis.
Modelscope (MS) [39] follows this approach by extend-
ing the UNet-like [29] architecture of Stable Diffusion [28]
with temporal convolutional and attentional layers. It was
trained in a large-scale setup to generate videos with 3
FPS@256x256 and 16 frames. The quadratic growth in memory and compute of the temporal attention layers (used in recent text-to-video models), together with very high training costs, prevents current text-to-video models from generating long sequences. In this paper, we demonstrate our
StreamingT2V method by taking MS as a basis and turning it
into an autoregressive model suitable for long video gener-
ation with high motion dynamics and consistency.
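As a rough sketch of what such a temporal extension looks like (illustrative, not Modelscope's actual layers), the module below folds the spatial positions into the batch dimension and runs self-attention along the frame axis only; its cost grows quadratically with the number of frames F, which is exactly the memory bottleneck mentioned above.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently at each spatial position."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, F, C, H, W) features of a (formerly image-only) UNet block
        B, F, C, H, W = x.shape
        h = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, F, C)   # fold space into the batch
        h = self.norm(h)
        h, _ = self.attn(h, h, h)                                # attend across the F frames
        h = h.reshape(B, H, W, F, C).permute(0, 3, 4, 1, 2)
        return x + h                                             # residual connection
```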
4. Method
In this section, we introduce our method for high-resolution
text-to-long video generation. We first generate 256 × 256
resolution long videos for 5 seconds (16fps), then enhance
them to higher resolution (720 × 720). The overview of the
whole pipeline is provided in Fig. 2. The long video gener-
ation part consists of (Initialization Stage) synthesizing the
first 16-frame chunk by a pre-trained text-to-video model
(for example one may take Modelscope [39]), and (Stream-
ing T2V Stage) autoregressively generating the new con-
tent for further frames. For the autoregression (see Fig. 3),
we propose our conditional attention module (CAM) that
leverages short-term information from the last F_cond = 8
frames of the previous chunk to enable seamless transitions
between chunks. Furthermore, we leverage our appearance
preservation module (APM), which extracts long-term in-
formation from a fixed anchor frame, making the autoregression process robust against losing object appearance or scene details during generation.
After having a long video (80, 240, 600, 1200 frames
or more) generated, we apply the Streaming Refinement
Stage which enhances the video by autoregressively apply-
ing a high-resolution text-to-short-video model (for exam-
ple one may take MS-Vid2Vid-XL [48]) equipped with our randomized blending approach for seamless chunk processing. The latter step is done without additional training, thereby keeping the computational cost of our approach low.
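Putting the three stages together, the overall flow can be sketched as follows. `t2v_model`, `streaming_step`, and `refine` are placeholders for the initialization text-to-video model (e.g. Modelscope), the CAM/APM-conditioned autoregressive generator, and the randomized-blending enhancer; the chunk bookkeeping is simplified.

```python
def streaming_t2v(prompt, num_frames, t2v_model, streaming_step, refine,
                  chunk_size=16, f_cond=8):
    """High-level StreamingT2V flow: Initialization -> Streaming T2V -> Streaming Refinement.
    `video` is treated as a plain list of frames; all three callables are placeholders."""
    video = list(t2v_model(prompt, num_frames=chunk_size))        # Initialization Stage: first 16-frame chunk
    anchor = video[0]                                              # fixed anchor frame for APM (long-term memory)
    while len(video) < num_frames:
        cond_frames = video[-f_cond:]                              # last F_cond frames of the previous chunk (CAM)
        video.extend(streaming_step(prompt, cond_frames, anchor))  # Streaming T2V Stage: next chunk of new frames
    return refine(prompt, video[:num_frames])                      # Streaming Refinement Stage (randomized blending)
```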
Figure 3. Method overview: StreamingT2V extends a video diffusion model (VDM) by the conditional attention module (CAM) as short-term memory, and the appearance preservation module (APM) as long-term memory. CAM conditions the VDM on the previous chunk using a frame encoder E_cond. The attentional mechanism of CAM leads to smooth transitions between chunks and, at the same time, videos with a high amount of motion. APM extracts high-level image features from an anchor frame and injects them into the text cross-attentions of the VDM. APM helps to preserve object/scene features across the autoregressive video generation.
4.1. Conditional Attention Module
To train a conditional network for our Streaming T2V stage,
we leverage a pre-trained text-to-video model (e.g. Modelscope [39]) as a prior for long video generation in an autoregressive manner. In the following, we refer to this pre-trained text-to-(short-)video model as Video-LDM. To autoregressively condition Video-LDM on
some short-term information from the previous chunk (see
Fig. 2, mid), we propose the Conditional Attention Mod-
ule (CAM), which consists of a feature extractor and a feature injector into the Video-LDM UNet, inspired by ControlNet [46]. The feature extractor utilizes a frame-wise image encoder E_cond, followed by the same encoder layers that the
Video-LDM UNet uses up to its middle layer (and initial-
ized with the UNet’s weights). For the feature injection,
we let each long-range skip connection in the UNet at-
tend to corresponding features generated by CAM via cross-
attention.
Let x denote the output of E_cond after zero-convolution. We use addition to fuse x with the output of the first temporal transformer block of CAM. For the injection of CAM's features into the Video-LDM UNet, we consider the UNet's skip-connection features x_SC ∈ R^{b×F×h×w×c} (see Fig. 3) with batch size b. We apply spatio-temporal group norm and a linear projection P_in on x_SC. Let x′_SC ∈ R^{(b·w·h)×F×c} be the resulting tensor after reshaping. We condition x′_SC on the corresponding CAM feature x_CAM ∈ R^{(b·w·h)×F_cond×c} (see Fig. 3), where F_cond is the number of conditioning frames, via temporal multi-head attention (T-MHA) [36], i.e. independently for each spatial position (and batch). Using learnable linear maps P_Q, P_K, P_V for queries, keys, and values, we apply T-MHA using keys and values from x_CAM and queries from x′_SC.
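A condensed sketch of this feature injection is given below: the skip-connection features are normalized, projected with P_in, reshaped to (b·h·w, F, c), and used as queries of a multi-head attention whose keys and values come from x_CAM. The module is illustrative (for instance, the group count of the group norm is an assumption), not the released implementation.

```python
import torch
import torch.nn as nn

class CAMInjection(nn.Module):
    """Condition UNet skip features x_SC on CAM features x_CAM via temporal
    multi-head attention: queries from x'_SC, keys/values from x_CAM."""

    def __init__(self, c, num_heads=8, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, c)   # spatio-temporal group norm (assumes c divisible by `groups`)
        self.p_in = nn.Linear(c, c)           # linear projection P_in
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)  # learns P_Q, P_K, P_V

    def forward(self, x_sc, x_cam):
        # x_sc:  (b, F, h, w, c)       skip-connection features of the Video-LDM UNet
        # x_cam: (b*h*w, F_cond, c)    CAM features of the conditioning frames
        b, F, h, w, c = x_sc.shape
        x = self.norm(x_sc.permute(0, 4, 1, 2, 3))               # norm over channels, stats over (F, h, w)
        x = self.p_in(x.permute(0, 2, 3, 4, 1))                  # back to (b, F, h, w, c), then project
        x = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, F, c)    # x'_SC: one sequence per spatial position
        out, _ = self.attn(query=x, key=x_cam, value=x_cam)      # T-MHA over the temporal axis
        return out.reshape(b, h, w, F, c).permute(0, 3, 1, 2, 4) # restore (b, F, h, w, c)
```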