DiffiT-DiffusionVisionTransformersforImageGeneration

需积分: 5 181 浏览量 2024-03-26 14:59:49 上传评论收藏 43.22MB PDF 举报

### DiffiT: Diffusion Vision Transformers for Image Generation #### 摘要《DiffiT: Diffusion Vision Transformers for Image Generation》是一篇由Ali Hatamizadeh、Jiaming Song、Guilin Liu、Jan Kautz和Arash Vahdat等来自NVIDIA的研究人员共同撰写的论文。该研究探讨了在扩散模型（Diffusion Models）领域中引入视觉变换器（Vision Transformers, ViTs）作为去噪网络架构的有效性，并提出了名为DiffiT的新模型。 #### 引言近年来，扩散模型因其强大的表达能力和高质量样本生成能力，在多个领域实现了许多新的应用和用例。这些模型通过一个去噪神经网络来生成图像，该网络通过迭代去噪过程逐步构建图像。然而，关于去噪网络架构的选择并没有得到充分的研究，大多数工作依赖于卷积残差U-Net结构。本文提出了一种结合了视觉变换器的新型扩散模型——DiffiT，旨在解决这一问题。 #### DiffiT：Diffusion Vision Transformers ##### 模型概述 DiffiT模型采用了一种混合层次结构，包括U形编码器和解码器。它引入了一种新的时间依赖自注意力模块，使注意力层能够在去噪过程的不同阶段适应性地调整其行为，从而更高效地处理图像生成任务。此外，DiffiT还利用了Transformer模型的优势，特别是在高分辨率图像生成方面表现突出。 ##### 自注意力机制新提出的自注意力模块允许模型根据去噪过程的不同阶段动态调整注意力权重，这对于处理复杂的图像结构尤为重要。这种机制有助于模型更好地捕捉到图像中的细节特征，提高生成图像的质量。 ##### 高分辨率图像生成针对高分辨率图像生成任务，研究人员开发了latent DiffiT模型，该模型基于具有上述自注意力层的Transformer架构。实验结果显示，latent DiffiT在ImageNet-256数据集上取得了目前最低的FID分数1.73，表明该模型在高分辨率图像生成方面达到了当前最优水平。 #### 实验结果与分析研究团队对DiffiT进行了广泛的评估，包括类条件合成任务和无条件合成任务。实验结果表明，DiffiT在生成高质量图像方面表现出色，并在多个基准测试中取得了最佳性能。 #### 结论与展望《DiffiT: Diffusion Vision Transformers for Image Generation》这篇论文通过引入视觉变换器来改进传统的扩散模型，成功地提高了图像生成的质量。DiffiT不仅在理论上证明了视觉变换器在扩散模型中的潜力，还在实际应用中验证了这种方法的有效性。未来的研究可以进一步探索如何将视觉变换器与其他类型的网络结构相结合，以应对更多样的生成任务需求。 #### 相关知识点解析 1. **扩散模型**：扩散模型是一种基于深度学习的生成模型，它通过逐渐添加噪声并训练一个去噪模型来生成数据。这种模型能够捕获复杂的数据分布，并在各种生成任务中展现出优异的表现。 2. **视觉变换器(Vision Transformers)**：视觉变换器是近年来兴起的一种用于计算机视觉领域的神经网络架构。不同于传统的卷积神经网络(CNNs)，ViTs利用自注意力机制来处理图像数据，展现了在图像分类和其他视觉任务上的出色性能。 3. **去噪网络**：在扩散模型中，去噪网络负责逐步去除加到输入数据上的噪声，从而恢复原始数据。这个过程中，网络的架构设计至关重要，直接影响着生成数据的质量。 4. **FID分数(Frechet Inception Distance)**：FID分数是一种常用的评估生成模型性能的指标，用于量化生成图像与真实图像之间的相似度。较低的FID分数通常意味着更好的图像质量。 5. **ImageNet-256数据集**：这是一个广泛使用的图像数据集，包含了256x256分辨率的图像。在图像识别和生成领域，它是评估算法性能的重要基准之一。 6. **类条件合成任务**：这类任务要求生成模型能够根据特定类别或条件信息生成图像。例如，根据“狗”的类别生成相应的图像。 7. **无条件合成任务**：与类条件合成任务不同，无条件合成任务不要求生成图像时提供任何额外的条件信息，即模型随机生成图像。通过上述介绍可以看出，《DiffiT: Diffusion Vision Transformers for Image Generation》这篇论文不仅为扩散模型带来了新的视角，也为视觉变换器在图像生成领域的应用提供了有价值的参考。

资源推荐

资源详情

资源评论

DifﬁT: Diffusion Vision Transformers for Image Generation

Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat

NVIDIA

{ahatamizadeh, jiamings, guilinl, jkautz, avahdat}@nvidia.com

Figure 1 – Uncurated generated images by latent DifﬁT on ImageNet [13] dataset. Best viewed in color.

Abstract

Diffusion models with their powerful expressivity and

high sample quality have enabled many new applications

and use-cases in various domains. For sample generation,

these models rely on a denoising neural network that gen-

erates images by iterative denoising. Yet, the role of de-

noising network architecture is not well-studied with most

efforts relying on convolutional residual U-Nets. In this

paper, we study the effectiveness of vision transformers in

diffusion-based generative learning. Speciﬁcally, we pro-

pose a new model, denoted as Diffusion Vision Transform-

ers (DifﬁT), which consists of a hybrid hierarchical archi-

tecture with a U-shaped encoder and decoder. We intro-

duce a novel time-dependent self-attention module that al-

lows attention layers to adapt their behavior at different

stages of the denoising process in an efﬁcient manner. We

also introduce latent DifﬁT which consists of transformer

model with the proposed self-attention layers, for high-

resolution image generation. Our results show that Dif-

ﬁT is surprisingly effective in generating high-ﬁdelity im-

ages, and it achieves state-of-the-art (SOTA) benchmarks

on a variety of class-conditional and unconditional synthe-

sis tasks. In the latent space, DifﬁT achieves a new SOTA

FID score of 1.73 on ImageNet-256 dataset. Repository:

https://github.com/NVlabs/DiffiT.

1. Introduction

Diffusion models [

] have revolutionized the do-

main of generative learning, with successful frameworks in

the front line such as DALL

E 3 [

], Imagen [

], Stable

diffusion [

], and eDiff-I [

]. They have enabled gen-

erating diverse complex scenes in high ﬁdelity which were

once considered out of reach for prior models. Speciﬁcally,

synthesis in diffusion models is formulated as an iterative

process in which random image-shaped Gaussian noise is

arXiv:2312.02139v1 [cs.CV] 4 Dec 2023

denoised gradually towards realistic samples [

The core building block in this process is a denoising au-

toencoder network that takes a noisy image and predicts the

denoising direction, equivalent to the score function [

This network, which is shared across different time steps of

the denoising process, is often a variant of U-Net [

]

that consists of convolutional residual blocks as well as self-

attention layers in several resolutions of the network. Al-

though the self-attention layers have shown to be impor-

tant for capturing long-range spatial dependencies, yet there

exists a lack of standard design patterns on how to incor-

porate them. In fact, most denoising networks often lever-

age self-attention layers only in their low-resolution feature

maps [

] to avoid their expensive computational complex-

ity. Recently, several works [

] have observed that

diffusion models exhibit a unique temporal dynamic dur-

ing generation. At the beginning of the denoising process,

when the image contains strong Gaussian noise, the high-

frequency content of the image is completely perturbed, and

the denoising network primarily focuses on predicting the

low-frequency content. However, towards the end of denois-

ing, in which most of the image structure is generated, the

network tends to focus on predicting high-frequency details.

The time dependency of the denoising network is often im-

plemented via simple temporal positional embeddings that

are fed to different residual blocks via arithmetic operations

such as spatial addition. In fact, the convolutional ﬁlters in

the denoising network are not time-dependent and the time

embedding only applies a channel-wise shift and scaling.

Hence, such a simple mechanism may not be able to opti-

mally capture the time dependency of the network during

the entire denoising process.

In this work, we aim to address the issue of lacking ﬁne-

grained control over capturing the time-dependent compo-

nent in self-attention modules for denoising diffusion models.

We introduce a novel Vision Transformer-based model for

image generation, called DifﬁT (pronounced di-feet) which

achieves state-of-the-art performance in terms of FID score

of image generation on CIFAR10 [

] and FFHQ-64 [

]

(image space) as well as ImageNet-256 [

] and ImageNet-

512 [

] (latent space) datasets. Speciﬁcally, DifﬁT proposes

a new paradigm in which temporal dependency is only in-

tegrated into the self-attention layers where the key, query,

and value weights are adapted per time step. This allows the

denoising model to dynamically change its attention mech-

anism for different denoising stages. In an effort to unify

the architecture design patterns, we also propose a hierarchi-

cal transformer-based architecture for latent space synthesis

tasks.

The following summarizes our contributions in this work:

•

We introduce a novel time-dependent self-attention mod-

ule that is speciﬁcally tailored to capture both short- and

long-range spatial dependencies. Our proposed time-

dependent self-attention dynamically adapts its behavior

over sampling time steps.

•

We propose a novel transformer-based architecture, de-

noted as DifﬁT, which uniﬁes the design patterns of de-

noising networks.

•

We show that DifﬁT can achieve state-of-the-art perfor-

mance on a variety of datasets for both image and latent

space generation tasks.

2. Related Work

Transformers in Generative Modeling Transformer-

based models have achieved competitive performance in

different generative learning models in the visual domain [

]. A number of transformer-based archi-

tectures have emerged for GANs [

]. Trans-

GAN [

] proposed to use a pure transformer-based genera-

tor and discriminator architecture for pixel-wise image gen-

eration. Gansformer [

] introduced a bipartite transformer

that encourages the similarity between latent and image fea-

tures. Styleformer [

] uses Linformers [

] to scale the

synthesis to higher resolution images. Recently, a number

of efforts [

] have leveraged Transformer-based

architectures for diffusion models and achieved competitive

performance. In particular, Diffusion Transformer (DiT) [

]

proposed a latent diffusion model in which the regular U-Net

backbone is replaced with a Transformer. In DiT, the condi-

tioning on input noise is done by using Adaptive LayerNorm

(AdaLN) [

] blocks. Using the DiT architecture, Masked

Diffusion Transformer (MDT) [

] introduced a masked

latent modeling approach to effectively capture contextual

information. In comparison to DiT, although MDT achieves

faster learning speed and better FID scores on ImageNet-256

dataset [

], it has a more complex training pipeline. Unlike

DiT and MDT, the proposed DifﬁT does not use shift and

scale, as in AdaLN formulation, for conditioning. Instread,

DifﬁT proposes a time-dependent self-attention (i.e. TMSA)

to jointly learn the spatial and temporal dependencies. In ad-

dition, DifﬁT proposes both image and latent space models

for different image generation tasks with different resolu-

tions with SOTA performance.

Diffusion Image Generation Diffusion models [

] have driven signiﬁcant advances in various domains,

such as text-to-image generation [

], natural language

processing [

], text-to-speech synthesis [

], 3D point

cloud generation [

], time series modeling [

molecular conformal generation [

], and machine learning

security [

]. These models synthesize samples via an itera-

tive denoising process and thus are also known in the com-

munity as noise-conditioned score networks. Since its initial

success on small-scale datasets like CIFAR-10 [

], diffu-

sion models have been gaining popularity compared to other

Tokenizer

 



DiffiT

ResBlock

Downsample

DiffiT

ResBlock

Downsample

DiffiT

ResBlock

Downsample

DiffiT

ResBlock

DiffiT

ResBlock

Upsample

DiffiT

ResBlock

Upsample

DiffiT

ResBlock

Upsample

 



 



 



 



 



 



Head

DiffiT ResBlock

Swish

- GN

Conv 3

3

Time

Embedding

DiffiT

Transformer

Conv 3

3

Head

Conv 3

3

Tokenizer

    

    

    

    

    

    

Figure 2 – Overview of the image-space DifﬁT model. Downsample and Upsample denote convolutional downsampling and upsampling

layers, respectively. Please see the supplementary materials for more information regarding the DifﬁT architecture.

existing families of generative models. Compared with vari-

ational autoencoders [

], diffusion models divide the syn-

thesis procedure into small parts that are easier to optimize,

and have better coverage of the latent space [

]; com-

pared with generative adversarial networks [

], diffusion

models have better training stability and are much easier to

invert [

]. Diffusion models are also well-suited for im-

age restoration, editing and re-synthesis tasks with minimal

modiﬁcations to the existing architecture [

–

], making it well-suited for various downstream

applications.

3. Methodology

3.1. Diffusion Model

Diffusion models [

] are a family of generative

models that synthesize samples via an iterative denoising

process. Given a data distribution as

)

, a family of

random variables

for

t ∈ [0, T ]

are deﬁned by injecting

Gaussian noise to

, i.e.,

) =

q(z

)dz

where

q(z

) = N(z

, σ

is a Gaussian distribution.

Typically,

is chosen as a non-decreasing sequence such

that

= 0

and

being much larger than the data variance.

This is called the “Variance-Exploding” noising schedule

in the literature [

]; for simplicity, we use these notations

throughout the paper, but we note that it can be equiva-

lently converted to other commonly used schedules (such

as “Variance-Preserving” [

]) by simply rescaling the data

with a scaling term, dependent on t [34, 65].

The distributions of these random variables are the

marginal distributions of forward diffusion processes

(Markovian or not [

]) that gradually reduces the “signal-

to-noise” ratio between the data and noise. As a generative

model, diffusion models are trained to approximate the re-

verse diffusion process, that is, to transform from the initial

noisy distribution (that is approximately Gaussian) to a dis-

tribution that is close to the data one.

Training Despite being derived from different perspec-

tives, diffusion models can generally be written as learning

the following denoising autoencoder objective [71]

),t∼p(t),ϵ∼N (0,I)

[λ(t)∥ϵ − ϵ

+ σ

ϵ, t)∥

]. (1)

Intuitively, given a noisy sample from

q(z

)

(generated via

:= z

+ σ

), a neural network

is trained to predict

the amount of noise added (i.e.,

). Equivalently, the neural

network can also be trained to predict

instead [

The above objective is also known as denoising score match-

ing [

], where the goal is to try to ﬁt the data score (i.e.,

∇

log q(z

)

) with a neural network, also known as the

score network

, t)

. The score network can be related to

via the relationship s

, t) := −ϵ

, t)/σ

Sampling Samples from the diffusion model can be simu-

lated by the following family of stochastic differential equa-

tions that solve from t = T to t = 0 [17, 24, 34, 81]:

dz = −( ˙σ

+ β

)σ

(z, t)dt +

2β

dω

, (2)

Time

embedding

Linear

Multi-head

Self-Attention

Linear

DiffiT Transformer Block

Q, K, V

Linear

LN GELU

TMSA MLP

Figure 3 – The DifﬁT Transformer block applies linear projection

to spatial and time-embedding tokens before combining them

together to form query, key, and value vectors for each token.

These vectors are then used to compute multi-head self-attention

activations, followed by two linear layers.

where

is the reverse standard Wiener process, and

is a

function that describes the amount of stochastic noise during

the sampling process. If

= 0

for all

, then the process

becomes a probabilistic ordinary differential equation [

]

(ODE), and can be solved by ODE integrators such as de-

noising diffusion implicit models (DDIM [

]). Otherwise,

solvers for stochastic differential equations (SDE) can be

used, including the one for the original denoising diffusion

probabilistic models (DDPM [

]). Typically, ODE solvers

can converge to high-quality samples in fewer steps and SDE

solvers are more robust to inaccurate score models [34].

3.2. DifﬁT Model

Time-dependent Self-Attention At every layer, our trans-

former block receives

}

, a set of tokens arranged spatially

on a 2D grid in its input. It also receives

, a time token

representing the time step. Similar to Ho et al.

[25]

, we

obtain the time token by feeding positional time embeddings

to a small MLP with swish activation [

]. This time token

is passed to all layers in our denoising network. We intro-

duce our time-dependent multi-head self-attention, which

captures both long-range spatial and temporal dependen-

cies by projecting feature and time token embeddings in a

shared space. Speciﬁcally, time-dependent queries

, keys

and values

in the shared space are computed by a linear

projection of spatial and time embeddings x

and x

via

= x

+ x

, (3)

= x

+ x

, (4)

= x

+ x

, (5)

where

denote spatial

and temporal linear projection weights for their correspond-

ing queries, keys, and values respectively.

We note that the operations listed in Eq. 3 to 5 are equiva-

lent to a linear projection of each spatial token, concatenated

with the time token. As a result, key, query, and value are all

linear functions of both time and spatial tokens and they can

adaptively modify the behavior of attention for different time

steps. We deﬁne

Q := {q

}

K := {k

}

, and

V := {v

}

which are stacked form of query, key, and values in rows of

a matrix. The self-attention is then computed as follows

Attention(Q, K, V) = Softmax



⊤

√

+ B



V. (6)

In which,

is a scaling factor for keys

, and

corresponds

to a relative position bias [

]. For computing the attention,

the relative position bias allows for the encoding of infor-

mation across each attention head. Note that although the

relative position bias is implicitly affected by the input time

embedding, directly integrating it with this component may

result in sub-optimal performance as it needs to capture both

spatial and temporal information. Please see Sec. 5.4 for

more analysis.

DifﬁT Block The DifﬁT transformer block (see Fig. 3) is

a core building block of the proposed architecture and is

deﬁned as

ˆx

= TMSA (LN (x

), x

) + x

, (7)

= MLP (LN (ˆx

)) + ˆx

, (8)

where TMSA denotes time-dependent multi-head self-

attention, as described in the above,

is the time-

embedding token,

is a spatial token, and LN and MLP

denote Layer Norm [

] and multi-layer perceptron (MLP)

respectively.

3.2.1 Image Space

DifﬁT Architecture DifﬁT uses a symmetrical U-Shaped

encoder-decoder architecture in which the contracting and

expanding paths are connected to each other via skip con-

nections at every resolution. Speciﬁcally, each resolution

of the encoder or decoder paths consists of

consecutive

DifﬁT blocks, containing our proposed time-dependent self-

attention modules. In the beginning of each path, for both

the encoder and decoder, a convolutional layer is employed

to match the number of feature maps. In addition, a convo-

lutional upsampling or downsampling layer is also used for

transitioning between each resolution. We speculate that the

use of these convolutional layers embeds inductive image

bias that can further improve the performance. In the remain-

der of this section, we discuss the DifﬁT Transformer block

and our proposed time-dependent self-attention mechanism.

We use our proposed Transformer block as the residual cells

when constructing the U-shaped denoising architecture.

剩余22页未读，继续阅读

评论收藏

内容反馈

muyu_525

粉丝: 2
资源: 19

DiffiT- Diffusion Vision Transformers for Image Generation

最新资源

DiffiT- Diffusion Vision Transformers for Image Generation

Stable-Diffusion-WebUI（秋叶）和Stable-Diffusion–forge

stable-diffusion-webui-extensions 扩展

Stable-Diffusion WEBUI 简体中文语言包（2023.05.30更新）

stable diffusion（stable-diffusion-webui-rembg）抠图模型

stable-diffusion-webui codeformer.pth

AI-绘画的工具准备：Stable-Diffusion使用教程.pdf

Novel Diffusion-Based Models for Image Restoration and Interpolation2019.pdf

stable-diffusion-webui中repositories文件

Stable Diffusion之Ubuntu下部署,使用的 k-diffusion 指定版本

Spatial Ecology via Reaction-Diffusion Equations

Stable-diffusion安装clip-vit-large-patch14

Analysis and Control of Coupled Neural Networks with Reaction-Diffusion Terms

稳定扩散Web工具:stable-diffusion-webui

stable-diffusion-webui-master

stable-diffusion-webui-1.9.4

AI本地作画-Disco-Diffusion-v5.2-jupyterNote

stable-diffusion-webui

stable-diffusion默认vae

stable-diffusion-webui.zip

博客中聚类算法（K-means、FCM、DBSCAN、DPC）的数据集（免积分）

机器学习期末复习题及答案

神经网络回归预测--气温数据集

Mathwork+Matlab+编程手册

中文短信数据集-带标签

时间序列预测模型实战案例(Xgboost)(Python)(机器学习)包括时间序列预测和时间序列分类，点击即可运行！

Ollama软件windows安装包(版本0.3.10)

亚博K210模型训练部署

Plecs电力电子仿真PLECS41.64 电力系统仿真软件免安装版本

hugging face的models-openai-clip-vit-large-patch14文件夹

最新资源