Improved Denoising Diffusion Probabilistic Models

Alex Nichol *1    Prafulla Dhariwal *1

Abstract

Denoising diffusion probabilistic models (DDPM) are a class of generative models which have recently been shown to produce excellent samples. We show that with a few simple modifications, DDPMs can also achieve competitive log-likelihoods while maintaining high sample quality. Additionally, we find that learning variances of the reverse diffusion process allows sampling with an order of magnitude fewer forward passes with a negligible difference in sample quality, which is important for the practical deployment of these models. We additionally use precision and recall to compare how well DDPMs and GANs cover the target distribution. Finally, we show that the sample quality and likelihood of these models scale smoothly with model capacity and training compute, making them easily scalable. We release our code at https://github.com/openai/improved-diffusion.

1. Introduction

Sohl-Dickstein et al. (2015) introduced diffusion probabilistic models, a class of generative models which match a data distribution by learning to reverse a gradual, multi-step noising process. More recently, Ho et al. (2020) showed an equivalence between denoising diffusion probabilistic models (DDPM) and score based generative models (Song & Ermon, 2019; 2020), which learn a gradient of the log-density of the data distribution using denoising score matching (Hyvärinen, 2005). It has recently been shown that this class of models can produce high-quality images (Ho et al., 2020; Song & Ermon, 2020; Jolicoeur-Martineau et al., 2020) and audio (Chen et al., 2020b; Kong et al., 2020), but it has yet to be shown that DDPMs can achieve log-likelihoods competitive with other likelihood-based models such as autoregressive models (van den Oord et al., 2016c) and VAEs (Kingma & Welling, 2013). This raises various questions, such as whether DDPMs are capable of capturing all the modes of a distribution. Furthermore, while Ho et al. (2020) showed extremely good results on the CIFAR-10 (Krizhevsky, 2009) and LSUN (Yu et al., 2015) datasets, it is unclear how well DDPMs scale to datasets with higher diversity such as ImageNet. Finally, while Chen et al. (2020b) found that DDPMs can efficiently generate audio using a small number of sampling steps, it has yet to be shown that the same is true for images.

*Equal contribution. 1OpenAI, San Francisco, USA. Correspondence to: <alex@openai.com>, <prafulla@openai.com>.

In this paper, we show that DDPMs can achieve log-likelihoods competitive with other likelihood-based models, even on high-diversity datasets like ImageNet. To more tightly optimise the variational lower-bound (VLB), we learn the reverse process variances using a simple reparameterization and a hybrid learning objective that combines the VLB with the simplified objective from Ho et al. (2020). We find surprisingly that, with our hybrid objective, our models obtain better log-likelihoods than those obtained by optimizing the log-likelihood directly, and discover that the latter objective has much more gradient noise during training. We show that a simple importance sampling technique reduces this noise and allows us to achieve better log-likelihoods than with the hybrid objective.

After incorporating learned variances into our model, we surprisingly discovered that we could sample in fewer steps from our models with very little change in sample quality. While DDPM (Ho et al., 2020) requires hundreds of forward passes to produce good samples, we can achieve good samples with as few as 50 forward passes, thus speeding up sampling for use in practical applications. In parallel to our work, Song et al. (2020a) develop a different approach to fast sampling, and we compare against their approach, DDIM, in our experiments.

While likelihood is a good metric to compare against other likelihood-based models, we also wanted to compare the distribution coverage of these models with GANs. We use the improved precision and recall metrics (Kynkäänniemi et al., 2019) and discover that diffusion models achieve much higher recall for similar FID, suggesting that they do indeed cover a much larger portion of the target distribution.

Finally, since we expect machine learning models to consume more computational resources in the future, we evaluate the performance of these models as we increase model size and training compute. Similar to (Henighan et al., 2020), we observe trends that suggest predictable improvements in performance as we increase training compute.

arXiv:2102.09672v1 [cs.LG] 18 Feb 2021

2. Denoising Diffusion Probabilistic Models

We briefly review the formulation of DDPMs from Ho et al. (2020). This formulation makes various simplifying assumptions, such as a fixed noising process q which adds diagonal Gaussian noise at each timestep. For a more general derivation, see Sohl-Dickstein et al. (2015).

2.1. Definitions

Given a data distribution x_0 ∼ q(x_0), we define a forward noising process q which produces latents x_1 through x_T by adding Gaussian noise at time t with variance β_t ∈ (0, 1) as follows:

q(x_1, ..., x_T | x_0) := ∏_{t=1}^T q(x_t | x_{t−1})    (1)

q(x_t | x_{t−1}) := N(x_t; √(1 − β_t) x_{t−1}, β_t I)    (2)

Given sufficiently large T and a well behaved schedule of β_t, the latent x_T is nearly an isotropic Gaussian distribution. Thus, if we know the exact reverse distribution q(x_{t−1} | x_t), we can sample x_T ∼ N(0, I) and run the process in reverse to get a sample from q(x_0). However, since q(x_{t−1} | x_t) depends on the entire data distribution, we approximate it using a neural network as follows:

p_θ(x_{t−1} | x_t) := N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))    (3)

The combination of q and p is a variational auto-encoder (Kingma & Welling, 2013), and we can write the variational lower bound (VLB) as follows:

L_vlb := L_0 + L_1 + ... + L_{T−1} + L_T    (4)

L_0 := −log p_θ(x_0 | x_1)    (5)

L_{t−1} := D_KL(q(x_{t−1} | x_t, x_0) || p_θ(x_{t−1} | x_t))    (6)

L_T := D_KL(q(x_T | x_0) || p(x_T))    (7)

Aside from L_0, each term of Equation 4 is a KL divergence between two Gaussians, and can thus be evaluated in closed form. To evaluate L_0 for images, we assume that each color component is divided into 256 bins, and we compute the probability of p_θ(x_0 | x_1) landing in the correct bin (which is tractable using the CDF of the Gaussian distribution). Also note that while L_T does not depend on θ, it will be close to zero if the forward noising process adequately destroys the data distribution so that q(x_T | x_0) ≈ N(0, I).
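For concreteness, the binned probability for a single color component can be computed with the Gaussian CDF. This is a minimal sketch under our own conventions (pixels scaled to [−1, 1], 256 bins of width 1/127.5, edge bins absorbing the tails); the function names are ours, and μ and σ stand in for hypothetical model outputs:

```python
import math

def normal_cdf(x, mu, sigma):
    # CDF of N(mu, sigma^2) evaluated at x.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def discretized_gaussian_prob(x, mu, sigma):
    """Probability that N(mu, sigma^2) lands in the bin centered at pixel x.

    The 256 bin centers sit at -1 + 2i/255, each of full width 2/255 = 1/127.5.
    """
    cdf_lo = 0.0 if x <= -1.0 + 1e-8 else normal_cdf(x - 1.0 / 255.0, mu, sigma)
    cdf_hi = 1.0 if x >= 1.0 - 1e-8 else normal_cdf(x + 1.0 / 255.0, mu, sigma)
    return cdf_hi - cdf_lo
```

The negative log of this probability, summed over components, gives the per-image contribution to L_0; because the bins tile [−1, 1] and the edge bins absorb the tails, the 256 bin probabilities sum to one.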

As noted in (Ho et al., 2020), the noising process defined in Equation 2 allows us to sample an arbitrary step of the noised latents directly conditioned on the input x_0. With α_t := 1 − β_t and ᾱ_t := ∏_{s=0}^t α_s, we can write the marginal

q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t)I)    (8)

x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε    (9)

where ε ∼ N(0, I). Here, 1 − ᾱ_t tells us the variance of the noise for an arbitrary timestep, and we could equivalently use this to define the noise schedule instead of β_t.
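Equation 9 means any noised latent can be drawn in one shot. A minimal NumPy sketch, assuming the linear β schedule from 10⁻⁴ to 0.02 used in Ho et al. (2020) as a stand-in (the variable and function names are our own):

```python
import numpy as np

# Illustrative linear beta schedule (Ho et al., 2020); T here is a
# stand-in, not this paper's T = 4000 setting.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # \bar{alpha}_t from Eq. (8)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I), Eq. (9)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))
xt = q_sample(x0, T - 1, rng)  # near t = T this is close to pure noise
```

Since ᾱ_t decreases toward zero, x_T carries almost no information about x_0, which is exactly the property L_T relies on.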

Using Bayes theorem, one can calculate the posterior q(x_{t−1} | x_t, x_0) in terms of β̃_t and μ̃_t(x_t, x_0), which are defined as follows:

β̃_t := ((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) β_t    (10)

μ̃_t(x_t, x_0) := (√ᾱ_{t−1} β_t / (1 − ᾱ_t)) x_0 + (√α_t (1 − ᾱ_{t−1}) / (1 − ᾱ_t)) x_t    (11)

q(x_{t−1} | x_t, x_0) = N(x_{t−1}; μ̃_t(x_t, x_0), β̃_t I)    (12)
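The posterior quantities in Equations 10 and 11 are simple functions of the schedule. A sketch assuming an illustrative linear β schedule (not this paper's setup; names are our own), with ᾱ_0 taken as 1 so that β̃ is defined at every index:

```python
import numpy as np

# Illustrative linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.append(1.0, alpha_bar[:-1])  # \bar{alpha}_{t-1}

# Eq. (10): the posterior variance, the lower bound on sigma_t^2.
beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas

def posterior_mean(x0, xt, t):
    """Eq. (11): tilde{mu}_t(x_t, x_0), the mean of q(x_{t-1} | x_t, x_0)."""
    coef_x0 = np.sqrt(alpha_bar_prev[t]) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev[t]) / (1.0 - alpha_bar[t])
    return coef_x0 * x0 + coef_xt * xt
```

Note that β̃_t ≤ β_t at every step, matching the claim below that β̃_t and β_t are the lower and upper bounds on the reverse-process variance.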

2.2. Training in Practice

The objective in Equation 4 is a sum of independent terms L_{t−1}, and Equation 9 provides an efficient way to sample from an arbitrary step of the forward noising process and estimate L_{t−1} using the posterior (Equation 12) and prior (Equation 3). We can thus randomly sample t and use the expectation E_{t,x_0,ε}[L_{t−1}] to estimate L_vlb. Ho et al. (2020) uniformly sample t for each image in each mini-batch.

There are many different ways to parameterize μ_θ(x_t, t) in the prior. The most obvious option is to predict μ_θ(x_t, t) directly with a neural network. Alternatively, the network could predict x_0, and this output could be used in Equation 11 to produce μ_θ(x_t, t). The network could also predict the noise ε and use Equations 9 and 11 to derive

μ_θ(x_t, t) = (1/√α_t) (x_t − (β_t / √(1 − ᾱ_t)) ε_θ(x_t, t))    (13)
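Equation 13 can be checked numerically: substituting the true noise ε that generated x_t reproduces the posterior mean μ̃_t(x_t, x_0) of Equation 11. A sketch with an illustrative linear schedule (names our own):

```python
import numpy as np

# Illustrative linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def mu_from_eps(xt, t, eps_pred):
    """Eq. (13): the reverse-process mean implied by a noise prediction."""
    return (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
```

With a learned ε_θ in place of the true noise, this is how a noise-predicting network yields μ_θ(x_t, t) at sampling time.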

Ho et al. (2020) found that predicting ε worked best, especially when combined with a reweighted loss function:

L_simple = E_{t,x_0,ε}[||ε − ε_θ(x_t, t)||²]    (14)
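A minimal Monte Carlo sketch of Equation 14, with `model` a placeholder for any ε-prediction network (this is our illustration with an assumed linear schedule, not the paper's training code):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # illustrative linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def l_simple(model, x0, rng):
    """One sample of L_simple: noise x0 to a random t, regress the noise."""
    t = int(rng.integers(0, T))  # t drawn uniformly, as in Ho et al. (2020)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - model(xt, t)) ** 2)

# A trivial stand-in model that always predicts zero noise:
rng = np.random.default_rng(0)
loss = l_simple(lambda xt, t: np.zeros_like(xt), rng.standard_normal((16, 16)), rng)
```

For the zero-noise stand-in, the loss is just the mean squared norm of the injected noise, close to 1 per dimension.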

This objective can be seen as a reweighted form of L_vlb (without the terms affecting Σ_θ). The authors found that optimizing this reweighted objective resulted in much better sample quality than optimizing L_vlb directly, and explain this by drawing a connection to generative score matching (Song & Ermon, 2019; 2020).

One subtlety is that L_simple provides no learning signal for Σ_θ(x_t, t). This is irrelevant, however, since Ho et al. (2020) achieved their best results by fixing the variance to σ_t² I rather than learning it. They found that they achieve similar sample quality using either σ_t² = β_t or σ_t² = β̃_t, which are the upper and lower bounds on the variance given by q(x_0) being either isotropic Gaussian noise or a delta function, respectively.

3. Improving the Log-likelihood

While Ho et al. (2020) found that DDPMs can generate high-fidelity samples according to FID (Heusel et al., 2017) and Inception Score (Salimans et al., 2016), they were unable to achieve competitive log-likelihoods with these models. Log-likelihood is a widely used metric in generative modeling, and it is generally believed that optimizing log-likelihood forces generative models to capture all of the modes of the data distribution (Razavi et al., 2019). Additionally, recent work (Henighan et al., 2020) has shown that small improvements in log-likelihood can have a dramatic impact on sample quality and learnt feature representations. Thus, it is important to explore why DDPMs seem to perform poorly on this metric, since this may suggest a fundamental shortcoming such as bad mode coverage. This section explores several modifications to the algorithm described in Section 2 that, when combined, allow DDPMs to achieve much better log-likelihoods on image datasets, suggesting that these models enjoy the same benefits as other likelihood-based generative models.

To study the effects of different modifications, we train fixed model architectures with fixed hyperparameters on the ImageNet 64 × 64 (van den Oord et al., 2016b) and CIFAR-10 (Krizhevsky, 2009) datasets. While CIFAR-10 has seen more usage for this class of models, we chose to study ImageNet 64 × 64 as well because it provides a good trade-off between diversity and resolution, allowing us to train models quickly without worrying about overfitting. Additionally, ImageNet 64 × 64 has been studied extensively in the context of generative modeling (van den Oord et al., 2016c; Menick & Kalchbrenner, 2018; Child et al., 2019; Roy et al., 2020), allowing us to compare DDPMs directly to many other generative models.

The setup from Ho et al. (2020) (optimizing L_simple while setting σ_t² = β_t and T = 1000) achieves a log-likelihood of 3.99 (bits/dim) on ImageNet 64 × 64 after 200K training iterations. We found in early experiments that we could get a boost in log-likelihood by increasing T from 1000 to 4000; with this change, the log-likelihood improves to 3.77. For the remainder of this section, we use T = 4000, but we explore this choice in Section 4.

3.1. Learning Σ_θ(x_t, t)

In Ho et al. (2020), the authors set Σ_θ(x_t, t) = σ_t² I, where σ_t is not learned. Oddly, they found that fixing σ_t² to β_t yielded roughly the same sample quality as fixing it to β̃_t.

Figure 1. The ratio β̃_t/β_t for every diffusion step for diffusion processes of different lengths (T = 100, 1000, and 10000 steps).

Figure 2. Terms of the VLB vs. diffusion step. The first few terms contribute most to the NLL.

Considering that β_t and β̃_t represent two opposite extremes, it is reasonable to ask why this choice doesn't affect samples. One clue is given by Figure 1, which shows that β_t and β̃_t are almost equal except near t = 0, i.e. where the model is dealing with imperceptible details. Furthermore, as we increase the number of diffusion steps, β_t and β̃_t seem to remain close to one another for more of the diffusion process. This suggests that, in the limit of infinite diffusion steps, the choice of σ_t might not matter at all for sample quality. In other words, as we add more diffusion steps, the model mean μ_θ(x_t, t) determines the distribution much more than Σ_θ(x_t, t).

While the above argument suggests that fixing σ_t is a reasonable choice for the sake of sample quality, it says nothing about log-likelihood. In fact, Figure 2 shows that the first few steps of the diffusion process contribute the most to the variational lower bound. Thus, it seems likely that we could improve log-likelihood by using a better choice of Σ_θ(x_t, t). To achieve this, we must learn Σ_θ(x_t, t) without the instabilities encountered by Ho et al. (2020).

Since Figure 1 shows that the reasonable range for Σ_θ(x_t, t) is very small, it would be hard for a neural network to predict Σ_θ(x_t, t) directly, even in the log domain, as observed by Ho et al. (2020). Instead, we found it better to parameterize the variance as an interpolation between β_t and β̃_t in the log domain. In particular, our model outputs a vector v containing one component per dimension, and we turn this output into variances as follows:

Σ_θ(x_t, t) = exp(v log β_t + (1 − v) log β̃_t)    (15)

Figure 3. Latent samples from linear (top) and cosine (bottom) schedules respectively at linearly spaced values of t from 0 to T. The latents in the last quarter of the linear schedule are almost purely noise, whereas the cosine schedule adds noise more slowly.

Figure 4. FID when skipping a prefix of the reverse diffusion process on ImageNet 64 × 64, for the cosine and linear schedules.
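A one-line sketch of Equation 15 (the helper name is ours; schedule values are illustrative):

```python
import numpy as np

def interpolated_variance(v, beta_t, beta_tilde_t):
    """Eq. (15): log-domain interpolation between beta_t and beta_tilde_t."""
    return np.exp(v * np.log(beta_t) + (1.0 - v) * np.log(beta_tilde_t))
```

Here v = 1 recovers β_t, v = 0 recovers β̃_t, and v = 0.5 gives their geometric mean; the network only has to resolve where the variance sits within the narrow band between the two bounds.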

We did not apply any constraints on v, theoretically allowing the model to predict variances outside of the interpolated range. However, we did not observe the network doing this in practice, suggesting that the bounds for Σ_θ(x_t, t) are indeed expressive enough.

Since L_simple doesn't depend on Σ_θ(x_t, t), we define a new hybrid objective:

L_hybrid = L_simple + λ L_vlb    (16)

For our experiments, we set λ = 0.001 to prevent L_vlb from overwhelming L_simple. Along this same line of reasoning, we also apply a stop-gradient to the μ_θ(x_t, t) output for the L_vlb term. This way, L_vlb can guide Σ_θ(x_t, t) while L_simple is still the main source of influence over μ_θ(x_t, t).

3.2. Improving the Noise Schedule

We found that while the linear noise schedule used in Ho et al. (2020) worked well for high resolution images, it was sub-optimal for images of resolution 64 × 64 and 32 × 32. In particular, the end of the forward noising process is too noisy, and so doesn't contribute very much to sample quality. This can be seen visually in Figure 3. The result of this effect is studied in Figure 4, where we see that a model trained with the linear schedule does not get much worse (as measured by FID) when we skip up to 20% of the reverse diffusion process.

Figure 5. ᾱ_t throughout diffusion in the linear schedule and our proposed cosine schedule.

To address this problem, we construct a different noise schedule in terms of ᾱ_t:

ᾱ_t = f(t)/f(0),  f(t) = cos²((t/T + s)/(1 + s) · π/2)    (17)

To go from this definition to variances β_t, we note that β_t = 1 − ᾱ_t/ᾱ_{t−1}. In practice, we clip β_t to be no larger than 0.999 to prevent singularities at the end of the diffusion process near t = T.
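Equation 17 and the clipping rule translate directly into code. A sketch (function names are ours; s = 0.008 is the paper's eventual choice, discussed below):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Eq. (17): abar_t = f(t)/f(0), f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def cosine_betas(T, s=0.008):
    """beta_t = 1 - abar_t / abar_{t-1}, clipped to at most 0.999 near t = T."""
    abar = cosine_alpha_bar(T, s)
    return np.minimum(1.0 - abar[1:] / abar[:-1], 0.999)
```

Without the clip, the ratio ᾱ_T/ᾱ_{T−1} approaches zero as cos² vanishes at t = T, sending β_T to 1 and the log-variance terms to a singularity.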

Our cosine schedule is designed to have a linear drop-off of ᾱ_t in the middle of the process, while changing very little near the extremes of t = 0 and t = T to prevent abrupt changes in noise level. Figure 5 shows how ᾱ_t progresses for both schedules. We can see that the linear schedule from Ho et al. (2020) falls towards zero much faster, destroying information more quickly than necessary.

We use a small offset s to prevent β_t from being too small near t = 0, since we found that having tiny amounts of noise at the beginning of the process made it hard for the network to predict ε accurately enough. In particular, we selected s such that √β_0 was slightly smaller than the pixel bin size 1/127.5, which gives s = 0.008. We chose to use cos² in particular because it is a common mathematical function with the shape we were looking for. This choice was arbitrary, and we expect that many other functions with similar shapes would work as well.

3.3. Reducing Gradient Noise

We expected to achieve the best log-likelihoods by optimizing L_vlb directly, rather than by optimizing L_hybrid. However,
