STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin1⋆*, Wei Liu1⋆, Chen Chen2⋆, Jiasen Lu2⋆, Wenze Hu2⋆, Tsu-Jui Fu2⋆, Jesse Allardice2⋆, Zhengfeng Lai2⋆, Liangchen Song2⋆, Bowen Zhang2⋆, Cha Chen2⋆, Yiran Fei⋆, Yifan Jiang⋆, Lezhi Li⋆, Yizhou Sun⋄†, Kai-Wei Chang⋄†, Yinfei Yang⋄⋆

⋆Apple    †University of California, Los Angeles
Abstract
The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear,
systematic recipe that can guide the development of robust and scalable models. In this work, we present a
comprehensive study that systematically explores the interplay of model architectures, training recipes, and
data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method,
named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame
replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This
design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously.
Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation,
multi-view generation, and long video generation. With comprehensive ablation studies on T2I, T2V, and TI2V,
STIV demonstrates strong performance despite its simple design. An 8.7B model at 512² resolution achieves
83.1 on VBench T2V, surpassing both leading open- and closed-source models like CogVideoX-5B, Pika, Kling, and
Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512² resolution.
By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to
empower future research and accelerate progress toward more versatile and reliable video generation solutions.
1. Introduction
The field of video generation has witnessed significant progress with the introduction of Sora [42], a video
generation model based on the Diffusion Transformer (DiT) [43] architecture. Researchers have been actively
exploring optimal methods to incorporate text and other conditions into the DiT architecture. For example,
PixArt-α [8] leverages cross attention, while SD3 [19] concatenates text with the noised patches and applies
self-attention using the MMDiT block. Several video generation models [21, 46, 65] adopt similar approaches
and have made substantial progress in the text-to-video (T2V) task. However, pure T2V approaches often struggle to
produce coherent and realistic videos, as their outputs are not grounded in external references or contextual
constraints [13]. To address this limitation, text-image-to-video (TI2V) introduces an initial image frame along
with the textual prompt, providing more concrete grounding for the generated video.
* This work was done during an internship at Apple.
1 First authors    2 Core authors    ⋄ Senior authors
Figure 1. Performance comparison of our Text-to-Video model against both open-source and closed-source state-of-the-art models on VBench [31].
Despite substantial progress in video generation, achieving Sora-level performance for T2V and TI2V
remains challenging. A central challenge is how to seamlessly integrate image-based conditions into the
DiT architecture, calling for innovative techniques that blend visual inputs smoothly with textual cues.
Meanwhile, there is a pressing need for stable, efficient large-scale training strategies, as well as for improving
the overall quality of training datasets. To address these issues, a comprehensive, step-by-step "recipe"
would greatly assist in developing unified models that handle both T2V and TI2V tasks under one framework.
Overcoming these challenges is essential for advancing the field and fully realizing the potential of video
generation models.
Although various studies [2, 6, 11, 14, 49, 62, 70] have examined methods of integrating image conditions
into U-Net architectures, how to effectively incorporate such conditions into the DiT architecture remains
unsolved. Moreover, existing studies in video generation often focus on individual aspects independently,
overlooking their collective impact on overall performance. For instance, while stability tricks like QK-norm
[19, 28] have been introduced, they prove insufficient as models scale to larger sizes [57], and no existing
approach has successfully unified T2V and TI2V capabilities within a single model. This lack of systematic,
holistic research limits progress toward more efficient and versatile video generation solutions.
In this work, we first present a comprehensive study of model architectures and training strategies to establish
a robust foundation for T2V. Our analysis reveals three key insights: (1) stability techniques such as QK-norm
and sandwich-norm [17, 25] are critical for effectively scaling larger video generation models; (2) employing
factorized spatial-temporal attention [1], MaskDiT [73], and switching to AdaFactor [54] significantly improve
training efficiency and reduce memory usage with minimal loss in performance; (3) progressive training,
where spatial and temporal layers are initialized from separate models, outperforms using a single model under
the same compute constraints. Starting from a PixArt-α baseline architecture, we address scaling challenges with
these stability and efficiency measures, and further enhance performance with Flow Matching [41], RoPE [56],
and micro conditions [45]. As a result, our largest T2V model (8.7B parameters) achieves state-of-the-art semantic
alignment and a VBench score of 83.1.
We then take the optimal model architecture and hyperparameters established in the T2V setting and apply
them to the TI2V task. Our results show that simply replacing the first noised latent frame with the un-noised
image condition latent yields strong performance. Although ConsistI2V [49] introduced a similar idea in a U-Net
setting, it required spatial self-attention for each frame and window-based temporal self-attention to match our
quality. In contrast, the DiT architecture natively propagates the image-conditioned first frame through stacked
spatial-temporal attention layers, eliminating the need for these additional operations. However, as we scale up
spatial resolution, we observe the model producing slow or nearly static motion. To solve this, we introduce
random dropout of the image condition during training and apply joint image-text conditional classifier-free
guidance (JIT-CFG) for both text and image conditions during inference. This strategy resolves the motion issue
and also enables a single model to excel at both T2V and TI2V tasks.
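To make this conditioning scheme concrete, the following PyTorch-style sketch illustrates frame replacement with image-condition dropout during training and a joint image-text classifier-free guidance step at inference. The function names, dropout probabilities, guidance scale, and the handling of the loss on the replaced frame are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def stiv_train_step(model, video_latents, text_emb, t,
                    p_drop_image=0.1, p_drop_text=0.1):
    """Sketch of one training step. video_latents: (B, T, C, H, W) clean VAE latents,
    t: (B,) timesteps in [0, 1]."""
    noise = torch.randn_like(video_latents)
    t_ = t.view(-1, 1, 1, 1, 1)
    x_t = t_ * video_latents + (1.0 - t_) * noise            # flow-matching interpolant (see Sec. 2.1)

    # Frame replacement: when the image condition is kept, the first noised frame
    # is overwritten with its un-noised (ground-truth) latent.
    keep_image = torch.rand(video_latents.shape[0], device=x_t.device) > p_drop_image
    x_t[:, 0] = torch.where(keep_image.view(-1, 1, 1, 1), video_latents[:, 0], x_t[:, 0])

    # Text-condition dropout enables classifier-free guidance at inference.
    if torch.rand(()) < p_drop_text:
        text_emb = torch.zeros_like(text_emb)

    target = video_latents - noise                            # velocity target (see Sec. 2.1)
    pred = model(x_t, text_emb, t)
    # In practice the loss on the replaced frame may be masked; kept here for brevity.
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def jit_cfg_velocity(model, x_t, text_emb, first_frame_latent, t, scale=7.5):
    """Joint image-text CFG: a single guidance term contrasts the fully conditioned
    prediction with the fully unconditioned one."""
    x_cond = x_t.clone()
    x_cond[:, 0] = first_frame_latent                         # image condition via frame replacement
    v_cond = model(x_cond, text_emb, t)
    v_uncond = model(x_t, torch.zeros_like(text_emb), t)      # both conditions dropped
    return v_uncond + scale * (v_cond - v_uncond)
```

Because the image condition is sometimes dropped during training, the same weights can serve both T2V (no first-frame replacement) and TI2V (first-frame replacement) at inference time.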
With all these changes, we finalize our model and scale it up from 600M to 8.7B parameters. Our best STIV
model achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512² resolution. Beyond enhancing video
generation quality, we demonstrate the potential of extending our framework to various downstream applications,
including video prediction, frame interpolation, multi-view generation, and long video generation.
Figure 2. Text-to-Video and Text-Image-to-Video generation samples by T2V and STIV models. The text prompts and first-frame image conditions are borrowed from Sora's demos [42] and MovieGenBench [46].
Text-to-Video prompts: "An adorable kangaroo wearing blue jeans and a white t-shirt taking a pleasant stroll in Johannesburg, South Africa during a beautiful sunset." / "A swan with wings tipped in gold gliding across a misty lake, leaving a trail of soft, shimmering light that fades as the sun rises."
Text-Image-to-Video prompts: "The video presents a sequence of frames that depict a space scene with a large, green and yellow planet at the center, surrounded by smaller celestial bodies. The background is a deep blue, speckled with stars." / "Robots move efficiently through a futuristic laboratory, adjusting holographic displays and conducting experiments, while scientists observe and interact with the high-tech equipment."
[Figure 3 diagram: the STIV block with frame replacement and image-condition dropout, factorized spatial and temporal attention with QK-norm, cross attention over CLIP text embeddings, shared AdaLN scale-shift-gate modulation, and sandwich normalization.]
Figure 3. We replace the first frame of the noised video latents with the ground-truth latent and randomly drop out the image condition. We use cross attention to incorporate the text embedding, QK-norm in multi-head attention, sandwich-norm in both attention and feed-forward layers, and stateless layer normalization after singleton conditions to stabilize training.
These results validate the scalability and versatility of our approach, showcasing its ability to address diverse video generation
challenges. We summarize our contributions as follows:
• We present STIV, a single model capable of performing both T2V and TI2V tasks. At its core, we replace the noised latent with the un-noised image condition latent and introduce joint image-text conditioned CFG.
• We conduct a systematic study for T2I, T2V, and TI2V, covering model architectures, efficient and stable training techniques, and progressive training recipes to scale up model size, spatial resolution, and duration.
• These design features make the model easy to train and adaptable to various tasks, including video prediction, frame interpolation, and long video generation.
• Our experiments include detailed ablation studies on different design choices and hyperparameters, evaluated on VBench, VBench-I2V, and MSRVTT. These studies demonstrate the effectiveness of the proposed model compared with a range of recent state-of-the-art open-source and closed-source video generation models. Some of the generated videos are shown in Fig. 2; more examples can be found in Sec. K of the Appendix.
2. Basics for STIV
This section describes the key components of our proposed STIV method for text-image-to-video (TI2V) generation, illustrated in Fig. 3. Secs. 3 and 4 then present detailed experimental results.
2.1. Base Model Architecture
The STIV model is based on PixArt-α [8], which converts the input frames into spatial and temporal latent
embeddings using a frozen Variational Autoencoder (VAE). These embeddings are then processed by a stack of
learnable DiT-like blocks. We employ the T5 [48] tokenizer and an internally trained CLIP [47] text encoder to
process text prompts. The overall framework is illustrated in Fig. 3. For more details, please refer to the appendix.
The other significant architectural changes are outlined below.
Spatial-Temporal Attention We employ factorized spatial and temporal attention [1] to handle video frames. We
first fold the temporal dimension into the batch dimension and perform spatial self-attention on spatial tokens.
Then, we permute the outputs and fold the spatial dimension into the batch dimension to perform temporal
self-attention on temporal tokens. By using factorized spatial and temporal attention, we can easily preload
weights from a text-to-image (T2I) model, as images are a special case of videos with only one temporal token
and only need spatial attention.
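As a rough illustration of the folding described above, the sketch below runs spatial attention with time folded into the batch and temporal attention with space folded into the batch. The (B, T, S, D) token layout and the use of torch.nn.MultiheadAttention are assumptions for exposition, not the model's actual modules.

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Illustrative factorized spatial-temporal attention over (B, T, S, D) tokens,
    where T is the number of frames and S the number of spatial tokens per frame."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, S, D = x.shape

        # Spatial attention: fold time into the batch, attend over spatial tokens.
        xs = x.reshape(B * T, S, D)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]

        # Temporal attention: fold space into the batch, attend over frames.
        xt = xs.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]

        return xt.reshape(B, S, T, D).permute(0, 2, 1, 3)    # back to (B, T, S, D)
```

A T2I model corresponds to the special case T = 1, which is why its spatial-attention weights can be preloaded directly.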
Singleton Condition We use the original image resolution, crop coordinates, sampling stride, and number of
frames as micro conditions to encode the meta information of the training data. We first use a sinusoidal embedding
layer to encode these properties, followed by an MLP to project them into a d-dimensional embedding space.
These micro condition embeddings, together with the diffusion timestep embedding and the last text token embedding
from the last layer of the CLIP model, are each passed through a stateless layer normalization and then summed
to form the singleton condition. This singleton condition is used to produce shared scale-shift-gate parameters
that are utilized in the spatial attention and feed-forward layers of each Transformer block.
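A hedged sketch of how such a singleton condition might be assembled and turned into shared scale-shift-gate parameters; the sinusoidal embedding width, the number of micro conditions, and the MLP shapes are illustrative guesses rather than the paper's configuration.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of a scalar condition: (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class SingletonCondition(nn.Module):
    """Illustrative: embed each micro condition and the timestep, apply a stateless
    LayerNorm to every singleton embedding, sum them with the last CLIP text token,
    and emit shared scale/shift/gate parameters for MHA and FFN."""

    def __init__(self, d_model: int, num_micro: int = 4, sin_dim: int = 256):
        super().__init__()
        def make_mlp():
            return nn.Sequential(nn.Linear(sin_dim, d_model), nn.SiLU(),
                                 nn.Linear(d_model, d_model))
        self.micro_mlps = nn.ModuleList(make_mlp() for _ in range(num_micro))
        self.time_mlp = make_mlp()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)   # "stateless" norm
        self.to_ssg = nn.Linear(d_model, 6 * d_model)                 # scale/shift/gate x {MHA, FFN}

    def forward(self, micro, timestep, clip_text_token):
        cond = self.norm(self.time_mlp(sinusoidal_embedding(timestep)))
        for value, proj in zip(micro, self.micro_mlps):
            cond = cond + self.norm(proj(sinusoidal_embedding(value)))
        cond = cond + self.norm(clip_text_token)
        return self.to_ssg(nn.functional.silu(cond)).chunk(6, dim=-1)
```

In the STIV block, these six outputs would modulate the spatial attention and feed-forward branches, matching the sandwich-norm formulation in Sec. 2.2.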
Rotary Positional Embedding Rotary Positional Embeddings (RoPE) [56] are used so that the model has a
strong inductive bias for processing relative temporal and spatial relationships. Additionally, RoPE can be made
compatible with the masking methods used in high-compute applications and is highly adaptable to variations in
resolution [76]. We apply 2D RoPE [39] for the spatial attention and 1D RoPE for the temporal attention inside
the factorized spatial-temporal attention.
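For reference, a minimal sketch of 1D RoPE as it might be applied to temporal queries and keys; the channel pairing and the extension to 2D (splitting channels between height and width) are standard choices assumed here, not details taken from the paper.

```python
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (..., L, D) by position-dependent angles.
    positions: (L,) frame (or token) indices."""
    D = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2, device=x.device).float() / D))
    angles = positions.float()[:, None] * inv_freq[None, :]          # (L, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Inside temporal attention, both queries and keys would be rotated before the dot
# product, e.g. q = rope_1d(q, frame_ids); k = rope_1d(k, frame_ids). 2D RoPE for
# spatial attention applies the same idea separately to the height and width halves
# of the channel dimension.
```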
Flow Matching Instead of employing the conventional diffusion loss, we opt for a Flow Matching training
objective. This objective defines a conditional optimal transport between two examples drawn from a source and
a target distribution. In our case, we assume the source distribution to be Gaussian and utilize linear interpolants [41]:

$$x_t = t \cdot x_1 + (1 - t) \cdot \epsilon. \tag{1}$$

The training objective is then formulated as

$$\min_\theta \; \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0, I),\, c,\, t} \Big[ \big\| F_\theta(x_t, c, t) - v_t \big\|_2^2 \Big], \tag{2}$$

where the velocity vector field $v_t = x_1 - \epsilon$.

At inference time, we solve the corresponding reverse-time SDE from timestep 0 to 1 to generate samples from
randomly sampled Gaussian noise $\epsilon$.
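To show how generation proceeds under this objective, here is a simplified sampler sketch. The paper solves a reverse-time SDE; the fixed-step Euler integration of the learned velocity field below is only an illustrative stand-in, and the model signature is assumed.

```python
import torch

@torch.no_grad()
def sample_with_velocity_field(model, shape, text_emb, num_steps: int = 50,
                               device: str = "cuda") -> torch.Tensor:
    """Start from Gaussian noise at t=0 and integrate the predicted velocity field
    to t=1 with fixed-step Euler updates (deterministic ODE stand-in for the SDE)."""
    x = torch.randn(shape, device=device)                      # x_0 = epsilon
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = torch.full((shape[0],), float(ts[i]), device=device)   # timestep per batch element
        v = model(x, text_emb, t)                               # predicted velocity F_theta(x_t, c, t)
        x = x + (ts[i + 1] - ts[i]) * v                         # Euler step toward the data endpoint
    return x                                                    # approximately x_1 (clean latents)
```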
2.2. Model Scaling
As we scale up the model, we encounter training instability and infrastructure challenges in fitting larger models
into memory. In this section, we outline the methods to stabilize the training and enhance training efficiency.
Stable Training Recipes We discovered that QK-norm (applying RMSNorm [68] to the query and key vectors
prior to computing attention logits) significantly stabilizes training. This finding aligns with the results reported
in SD3 [19]. We also change from pre-norm to sandwich-norm [17] for both MHA and FFN, which adds pre- and
post-normalization with stateless layer normalization [37] to both layers within the STIV block:
$$\mathrm{MHA}(x) = x + \mathrm{gate} \cdot \mathrm{norm}\big(\mathrm{Attn}(\mathrm{scale} \cdot \mathrm{norm}(x) + \mathrm{shift})\big)$$
$$\mathrm{FFN}(x) = x + \mathrm{gate} \cdot \mathrm{norm}\big(\mathrm{MLP}(\mathrm{scale} \cdot \mathrm{norm}(x) + \mathrm{shift})\big)$$
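A compact sketch of one such residual branch, combining QK-norm inside the attention with sandwich normalization around it; the RMSNorm placement on per-head queries/keys and the module shapes are reasonable guesses at the structure these equations describe, not the released implementation (nn.RMSNorm requires a recent PyTorch).

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    """Multi-head self-attention with RMSNorm applied to per-head queries and keys."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.RMSNorm(self.head_dim)   # available in recent PyTorch versions
        self.k_norm = nn.RMSNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
                   for y in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)      # QK-norm stabilizes attention logits
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, L, D))

class SandwichAttentionBranch(nn.Module):
    """Sandwich-norm residual branch: pre-norm with scale/shift, post-norm with gate."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.post_norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = QKNormAttention(dim, num_heads)

    def forward(self, x, scale, shift, gate):      # scale/shift/gate broadcastable, e.g. (B, 1, D)
        # MHA(x) = x + gate * norm(Attn(scale * norm(x) + shift))
        return x + gate * self.post_norm(self.attn(scale * self.pre_norm(x) + shift))
```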
Efficient DiT Training We follow MaskDiT [73] by randomly masking 50% of spatial tokens before passing
them into the major DiT blocks. After unmasking, we add two additional DiT blocks. We also switch from
AdamW to AdaFactor optimizer and employ gradient checkpointing to only store the self-attention outputs. These
modifications significantly enhance efficiency and reduce memory consumption, enabling the training of larger
models at higher resolution and longer duration.
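As a rough illustration of the masking idea (not the exact MaskDiT recipe or the paper's code), the sketch below drops a random 50% of spatial tokens before the main transformer stack, then scatters the processed tokens back alongside mask tokens for the final blocks; the tensor layout and the learned mask token are assumptions.

```python
import torch
import torch.nn as nn

def mask_and_restore(tokens: torch.Tensor, main_blocks: nn.Module,
                     final_blocks: nn.Module, mask_token: torch.Tensor,
                     mask_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (B, L, D) spatial tokens. Run the heavy `main_blocks` on a random
    keep-subset only, restore full length with mask tokens in the gaps, then run a
    few lighter `final_blocks` on all tokens."""
    B, L, D = tokens.shape
    num_keep = int(L * (1.0 - mask_ratio))

    # Per-sample random permutation of token indices; keep the first `num_keep`.
    ids_shuffle = torch.argsort(torch.rand(B, L, device=tokens.device), dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    kept = torch.gather(tokens, 1, ids_keep[..., None].expand(-1, -1, D))

    kept = main_blocks(kept)                                   # heavy compute on ~50% of tokens

    # Scatter processed tokens back to their original positions, mask token elsewhere.
    full = mask_token.expand(B, L, D).clone()
    full.scatter_(1, ids_keep[..., None].expand(-1, -1, D), kept)
    return final_blocks(full)                                  # e.g. the two extra DiT blocks
```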