Research and Application of an Image Super-Resolution Method Based on Diffusion Model Inversion



Arbitrary-steps Image Super-resolution via Diffusion Inversion

Zongsheng Yue, Kang Liao, Chen Change Loy

S-Lab, Nanyang Technological University

{zongsheng.yue, kang.liao, ccloy}@ntu.edu.sg

[Figure 1 panel labels. First example, with runtime: DiffBIR-50 (7937ms), Ours-2 (149ms), ResShift-4 (319ms), Ours-3 (176ms), SinSR-1 (138ms), Ours-4 (207ms), OSEDiff-1 (176ms), Ours-5 (244ms), StableSR-50 (3459ms), Ours-1 (117ms). Second example: the same ten labels without runtime. Additional panel: Zoomed LR.]

Figure 1. Qualitative comparisons of our proposed method to recent state-of-the-art diffusion-based approaches on two real-world examples, where the number of sampling steps is annotated in the format “Method name-Steps”. We provide the runtime (in milliseconds), highlighted in red in the sub-captions of the first example, which is tested on the ×4 (128 → 512) SR task on an A100 GPU. Our method offers an efficient and flexible sampling mechanism, allowing users to freely adjust the number of sampling steps based on the degradation type or their specific requirements. In the first example, mainly degraded by blurriness, multi-step sampling is preferable to single-step sampling as it progressively recovers finer details. Conversely, in the second example with severe noise, a single sampling step is sufficient to achieve satisfactory results, whereas additional steps may amplify the noise and introduce unwanted artifacts. (Zoom in for best view.)

Abstract

This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result. Compared to existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates superior or comparable performance to recent state-of-the-art approaches. The code and model are publicly available at https://github.com/zsyOAOA/InvSR.

arXiv:2412.09013v1 [cs.CV] 12 Dec 2024

1. Introduction

Image super-resolution (SR) is a fundamental yet challenging problem in computer vision, aiming to restore a high-resolution (HR) image from a given low-resolution (LR) observation. The main challenge of SR arises from the complexity and often unknown nature of the degradation model in real-world scenarios, making SR an ill-posed problem. Recent breakthroughs in diffusion models [16, 45, 48], particularly large-scale text-to-image (T2I) models, have demonstrated remarkable success in generating high-quality images. Owing to the strong generative capability of these T2I models, recent studies have begun to use them as a reliable prior to alleviate the ill-posedness of SR. This work follows this research line, further exploring the potential of diffusion priors in SR.

The prevailing SR approaches leveraging diffusion priors usually attempt to modify the intermediate features of the diffusion network, either through optimization [7, 23, 56] or fine-tuning [30, 56, 60, 64], to better align them with the given LR observations. In this work, we propose a new technique based on diffusion inversion to harness diffusion priors. Unlike existing approaches, it attempts to find an optimal noise map as the input of the diffusion model, without any modification to the diffusion network itself, thereby maximizing the utility of the diffusion prior.

While considerable advances have been made in generative adversarial network (GAN) [14] inversion for various applications [62, 75], including SR [4, 15, 41], extending these principles to diffusion models presents unique challenges, particularly for SR tasks that demand high fidelity. In particular, the multi-step stochastic sampling process of diffusion models makes inversion non-trivial. The straightforward approach of optimizing a distinct noise map at each diffusion step is expensive and complex. Additionally, the iterative inference mechanism accumulates prediction errors and randomness at each step, which can significantly compromise fidelity. Therefore, recent diffusion inversion methods have mainly focused on tasks with lower fidelity requirements, such as image editing [13, 36].

In this work, we reformulate diffusion inversion for the more challenging task of SR. To enable diffusion inversion for SR, we introduce a deep neural network, called the noise predictor, to estimate the noise map from a given LR image. In addition, a Partial noise Prediction (PnP) strategy is devised to construct an intermediate state for the diffusion model, serving as the starting point for sampling. This is made possible by adding noise to the LR image according to the diffusion model’s forward process, where the noise predictor predicts the added noise instead of sampling it randomly. This approach is driven by the following key motivations:

• Rationality. LR and HR images differ only in high-frequency details. With the addition of appropriate noise, the LR image becomes indistinguishable from its HR counterpart. Thus, the noisy LR can serve as a proxy for deriving the inversion trajectory during reverse diffusion.

• Complexity. Rather than predicting noise maps for all diffusion steps, the PnP strategy simplifies the inversion task by limiting predictions to the starting step, thereby reducing the overall complexity of the inversion process.

• Flexibility. The noise predictor can be trained to predict noise maps for multiple predefined starting steps. During inference, we can freely select a starting step from them and then use any existing sampling algorithm with an arbitrary number of steps, offering favorable flexibility in controlling the sampling process.

• Fidelity. The starting steps during training are carefully selected to have a high signal-to-noise ratio (SNR), ensuring robust fidelity preservation for SR. In practice, we enforce an SNR threshold greater than 1.44, corresponding to the timestep of 250 in Stable Diffusion [43].

• Efficiency. As the sampling process begins from a step earlier than 250 (SNR larger than 1.44), the PnP strategy effectively reduces the number of sampling steps to fewer than five when combined with off-the-shelf accelerated sampling algorithms [22, 46]. This addresses the common inefficiency issue in diffusion-based SR approaches.

Unlike most existing diffusion-based methods that rely on fixed sampling steps, our flexible sampling mechanism offers a versatile solution for handling varying degrees of degradation in SR. In SR, it is common to encounter different types and intensities of corruption. Intuitively, the number of sampling steps should adapt to the specific degradation conditions. For example, as shown in Fig. 1, multi-step sampling is preferable to single-step sampling in the first case, as it effectively reduces blurriness and restores finer details. In contrast, for the second example with severe noise, a single sampling step achieves satisfactory results, while additional steps may amplify the noise and introduce unwanted artifacts. Our method uniquely allows users to adjust sampling to suit different degradation types.

The main contributions of this work are twofold. First, we propose a novel SR approach based on diffusion inversion, which effectively leverages the diffusion prior by integrating an auxiliary noise predictor while keeping the entire diffusion backbone fixed. Second, our method introduces a flexible and efficient sampling mechanism that allows for arbitrary sampling steps, ranging from one to five. Remarkably, even when the steps are reduced to just one, our approach still achieves superior or comparable performance to recent dedicated one-step diffusion methods.

2. Related Work

Diffusion Prior for SR. Existing diffusion prior-based SR approaches can be broadly categorized into two classes. The first class of methods involves re-optimizing the intermediate results of the diffusion model to ensure consistency with the given LR images via pre-defined or estimated degradation models. Representative works include DDRM [23], CCDF [7], and DDNM [56], among others [6, 8, 11, 38, 47, 63, 67]. While effective, these methods are limited by their computational complexity, as they require solving an optimization problem at each diffusion step, leading to slow inference. Furthermore, they often rely on manually defined degradation models and thus cannot handle the blind SR problem in real-world scenarios. The second class directly fine-tunes a pre-trained large T2I model for the SR task. StableSR [53] pioneers this paradigm by incorporating spatial feature transform layers [54] to guide the T2I model toward generating HR outputs. Subsequent works follow by proposing various fine-tuning strategies to exploit diffusion priors, including DiffBIR [30], SeeSR [60], PASD [64], S3Diff [71], and others [27, 40, 49, 59, 61, 66]. These methods have achieved impressive performance, validating the effectiveness of diffusion priors for SR.

Diffusion Inversion. Diffusion inversion focuses on determining the optimal noise map set that, when processed through the diffusion model, reconstructs a given image. DDIM [46] first addressed this by generalizing the diffusion model via a class of non-Markovian processes, thereby establishing a deterministic generation process. Subsequent approaches, such as those by Rinon et al. [12] and Mokady et al. [36], proposed optimizing the text embedding to better align with the desired textual guidance. Recent efforts have further refined the optimization strategies for both the textual and visual prompts [35, 39], as well as for intermediate noise maps [13, 19, 20, 33, 51, 72], leading to notable enhancements in inversion quality. Despite these advances, existing methods mainly focus on image editing and cannot meet the high-fidelity requirements of SR.

In this work, we tailor the diffusion inversion technique for SR. While Chihaoui et al. [5] have recently explored diffusion inversion for image restoration, their method relies on solving an optimization problem at each inversion step, significantly limiting its inference efficiency. In contrast, our approach introduces a noise prediction module that, once trained, enables efficient inversion without requiring iterative optimization during inference. This leads to substantial improvements in both the efficiency and practicality of diffusion inversion for SR tasks.

3. Methodology

In this section, we present the proposed diffusion inversion technique for SR. To maintain consistency with the notations used in diffusion models, we denote the LR image as $y_0$ and the corresponding HR image as $x_0$.

3.1. Motivation

The diffusion model [16, 45] was first introduced as a probabilistic generative model inspired by nonequilibrium thermodynamics. Subsequently, Song et al. [48] reformulated it within the framework of stochastic differential equations (SDEs). In this paper, we propose a general diffusion inversion technique that is applicable to both the probabilistic and SDE-based diffusion formulations. To facilitate understanding, we employ the probabilistic framework of the Denoising Diffusion Probabilistic Model (DDPM) [16] throughout our presentation.

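The DDPM framework [16] is a Markov chain of length $T$, where the forward process is characterized by a Gaussian transition kernel:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \tag{1}$$

where $\beta_t$ is a pre-defined hyper-parameter controlling the variance schedule. Notably, this transition kernel allows the derivation of the marginal distribution $q(x_t \mid x_0)$, i.e.,

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right), \tag{2}$$

where $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ and $\alpha_t = 1 - \beta_t$. The reverse process aims to generate a high-quality image from an initial random noise map $x_T \sim \mathcal{N}(0, I)$, which can be expressed as:

$$x_{t-1} = g_\theta(x_t, t) + \sigma_t z_{t-1}, \quad t = T, \cdots, 1, \tag{3}$$

where

$$g_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right), \tag{4}$$

and $\epsilon_\theta(x_t, t)$ is a pre-trained denoising network parameterized by $\theta$. The noise term $z_t$ satisfies $z_0 = 0$ and $z_t \sim \mathcal{N}(0, I)$ for $t = 1, \cdots, T-1$.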
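As a minimal sketch of Eqs. (2)-(4), the following Python code builds $\bar\alpha_t$ from a variance schedule, draws a sample from the marginal $q(x_t \mid x_0)$, and evaluates $g_\theta$. The linear schedule values, the flat-list image representation, and the names `make_alpha_bar`, `q_sample`, and `g_theta` are illustrative assumptions. The denoiser output `eps_pred` is passed in by hand rather than produced by a real network.

```python
import math
import random

def make_alpha_bar(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T], alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, t, alpha_bar, noise):
    """Sample x_t from the marginal in Eq. (2); t is 1-based."""
    a = alpha_bar[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]

def g_theta(x_t, t, betas, alpha_bar, eps_pred):
    """Deterministic part of the reverse step, Eq. (4), given the denoiser output."""
    alpha_t = 1.0 - betas[t - 1]
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar[t - 1])
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]

# Assumed toy setup: a linear schedule with T = 1000 and a 4-pixel "image".
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = make_alpha_bar(betas)

rng = random.Random(0)
x0 = [0.5, -0.25, 0.1, 0.9]
noise = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = q_sample(x0, 250, alpha_bar, noise)
# Feeding the true noise as the "prediction" gives the deterministic part of step 250 -> 249.
x_prev_mean = g_theta(x_t, 250, betas, alpha_bar, noise)
```

In a real pipeline `eps_pred` would come from the pre-trained network $\epsilon_\theta(x_t, t)$, and the stochastic term $\sigma_t z_{t-1}$ of Eq. (3) would be added on top of the value returned by `g_theta`.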

Equation (3) indicates that the synthesized image $x_0$ is fully determined by the set of noise maps $\mathcal{M} = \{x_T\} \cup \{z_t\}_{t=1}^{T-1}$. In the context of SR, our goal is to generate an HR image $x_0$ conditioned on an LR image $y_0$. To this end, we propose diffusion inversion to find an optimal set of noise maps $\mathcal{M}^*$ that reconstructs the target HR image $x_0$ via the reverse process of Eq. (3). In the following sections, we detail how to achieve this goal by training a noise predictor.

3.2. Diffusion Inversion

To achieve diffusion inversion, we introduce a noise prediction network with parameter $w$, denoted as $f_w$, which takes the LR image $y_0$ and the timestep $t$ as input and outputs the desired noise maps $\mathcal{M}^*$. Unlike the strategy of [5], which directly optimizes $\mathcal{M}^*$ for each testing image, we train such a noise predictor to enable fast sampling during inference, thereby significantly improving the inference efficiency. To ensure the output of $f_w$ conforms to a Gaussian distribution, we adopt the reparameterization trick of VAE [25], which predicts the mean and variance parameters of the Gaussian distribution rather than directly estimating the noise map.
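The reparameterization trick mentioned above can be sketched as follows. This is only an illustration under assumptions: the text does not say whether the predictor outputs a variance or a log-variance, so the log-variance convention common in VAE implementations is assumed here, and `sample_noise_map` and the toy values are hypothetical.

```python
import math
import random

def sample_noise_map(mean, log_var, rng):
    """Reparameterized draw: mean + std * xi with xi ~ N(0, I), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

# Hypothetical predictor output for a 3-pixel LR image at some timestep.
mean = [0.1, -0.2, 0.0]
log_var = [0.0, 0.0, 0.0]  # assumed log-variance, i.e. unit variance

rng = random.Random(0)
noise_map = sample_noise_map(mean, log_var, rng)
```

Because the randomness enters only through `xi`, the sample stays a differentiable function of the predicted mean and variance, which is what allows such a predictor to be trained by gradient descent.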

3.2.1. Problem Simpliﬁcation

Training this noise predictor is inherently challenging. The noise map set $\mathcal{M}$ consists of $T$ noise maps (typically $T = 1000$ in most current diffusion models), corresponding to each step of the diffusion process. Naturally, it is non-trivial to simultaneously estimate such a large number of noise maps using a single, compact network. Worse still, the iterative sampling paradigm of diffusion models can gradually accumulate prediction errors, which may adversely affect the final SR performance.

To address these challenges, we design a Partial noise Prediction (PnP) strategy. Specifically, consider diffusion inversion in the context of SR, where the observed LR image $y_0$ only slightly deviates from the target HR image $x_0$ in most cases, primarily in high-frequency components. This observation inspires us to initiate the sampling process from an intermediate timestep $N$ ($N < T$), effectively reducing the number of noise maps in $\mathcal{M}$ from $T$ to $N$, i.e., $\mathcal{M} = \{z_t\}_{t=1}^{N}$. Furthermore, given the high-fidelity requirements of SR, we constrain $x_N$ to have a relatively high SNR, implying mild noise corruption. This constraint encourages the selection of a smaller $N$, and in practice, we set $N \le 250$, corresponding to an SNR threshold of 1.44 in the widely used Stable Diffusion [43].
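The link between timestep 250 and the SNR value of 1.44 can be checked with a short calculation. The sketch below rests on two assumptions not stated in the text: that the schedule is the "scaled linear" one commonly configured for Stable Diffusion ($\beta$ from 0.00085 to 0.012 over 1000 steps), and that SNR means the amplitude ratio $\sqrt{\bar\alpha_t / (1-\bar\alpha_t)}$ of the two terms in Eq. (2). The helper names are hypothetical.

```python
import math

def sd_alpha_bar(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative products of (1 - beta_t) for a 'scaled linear' schedule (assumed values)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def snr(alpha_bar_t):
    """Amplitude ratio between the signal and noise terms of Eq. (2) (assumed definition)."""
    return math.sqrt(alpha_bar_t / (1.0 - alpha_bar_t))

alpha_bar = sd_alpha_bar()
snr_250 = snr(alpha_bar[250 - 1])  # timestep 250, 1-based indexing
print(f"SNR at t=250: {snr_250:.2f}")
```

Under these assumptions the value at timestep 250 comes out close to the 1.44 quoted above, and smaller timesteps give larger values, which is consistent with choosing $N \le 250$ to keep the starting state only mildly corrupted.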

In addition, we further compress the set of noise maps $\mathcal{M} = \{z_t\}_{t=1}^{N}$ by integrating existing diffusion acceleration algorithms [22, 46]. The common idea of these algorithms is to skip certain steps during inference, which are selected based on specific rules [29], e.g., “linspace” and “trailing”. Combined with this skipping strategy, the noise map set is simplified as follows:

$$\mathcal{M} = \{z_{\kappa_i}\}_{i=1}^{M}, \tag{5}$$

where $\{\kappa_1, \cdots, \kappa_M\} \subseteq \{1, \cdots, N\}$. In practice, we set $M \le 5$, thus largely reducing the prediction burden on the noise predictor and improving the sampling efficiency.
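To make the subset $\{\kappa_1, \cdots, \kappa_M\}$ concrete, the sketch below picks $M$ timesteps out of $\{1, \cdots, N\}$ with two simplified spacing rules. The function `select_timesteps` is hypothetical and only loosely modeled on the "linspace" and "trailing" ideas. It is not the exact rule of [29] or of any scheduler library, and the text does not say which variant the authors use.

```python
def select_timesteps(start, num_steps, rule="trailing"):
    """Pick num_steps ascending timesteps from {1, ..., start}, always ending at `start`.

    Simplified, hypothetical variants of the two spacing rules; not the exact
    scheduler code of any library.
    """
    if not 1 <= num_steps <= start:
        raise ValueError("need 1 <= num_steps <= start")
    if rule == "trailing":
        # Equal strides counted back from the starting timestep.
        return [round(start * i / num_steps) for i in range(1, num_steps + 1)]
    if rule == "linspace":
        # Evenly spaced points between 1 and the starting timestep.
        if num_steps == 1:
            return [start]
        return [round(1 + (start - 1) * i / (num_steps - 1)) for i in range(num_steps)]
    raise ValueError(f"unknown rule: {rule}")

# N = 250 and M = 5 follow the limits quoted in the text.
kappas = select_timesteps(start=250, num_steps=5, rule="trailing")
```

Both variants keep the largest timestep $\kappa_M$ equal to the chosen starting step, so the starting state and the number of steps can be varied independently.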

3.2.2. Inversion Trajectory

Given the set of noise maps $\mathcal{M} = \{z_{\kappa_i}\}_{i=1}^{M}$ and the noise prediction network $f_w$, our goal is to restore the HR image from a given LR observation $y_0$, following an inversion trajectory defined by:

$$x_{\kappa_{i-1}} = g_\theta(x_{\kappa_i}, \kappa_i) + \sigma_{\kappa_i} f_w(y_0, \kappa_{i-1}), \tag{6}$$

where $\kappa_0 = 0$, and $g_\theta(\cdot, \cdot)$ is defined in Eq. (4). The key to initiating this inversion trajectory is constructing the starting state $x_{\kappa_M}$ from the LR image $y_0$.

The marginal distribution $q(x_{\kappa_M} \mid x_0)$, as defined in Eq. (2), suggests obtaining $x_{\kappa_M}$ as follows:

$$x_{\kappa_M} = \sqrt{\bar\alpha_{\kappa_M}}\,x_0 + \sqrt{1-\bar\alpha_{\kappa_M}}\,\xi, \quad \xi \sim \mathcal{N}(0, I). \tag{7}$$

In the context of SR, since the HR image $x_0$ is not accessible during testing, we construct an analogous formulation for $x_{\kappa_M}$ directly from the LR image $y_0$ using the noise predictor $f_w(\cdot)$, namely

$$x_{\kappa_M} = \sqrt{\bar\alpha_{\kappa_M}}\,y_0 + \sqrt{1-\bar\alpha_{\kappa_M}}\,f_w(y_0, \kappa_M). \tag{8}$$

This design is inspired by the observation that the LR image $y_0$ and the HR image $x_0$ become increasingly indistinguishable when perturbed by Gaussian noise with an appropriate magnitude. Therefore, we aim to seek an optimal noise map $f_w(y_0, \kappa_M)$ to perturb $y_0$ in such a way that the pre-trained diffusion model can generate the corresponding $x_0$ from $x_{\kappa_M}$ that is defined in Eq. (8).
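One way to read the relation between Eqs. (7) and (8) is to equate their right-hand sides: the two starting states coincide exactly when the noise map equals $\xi + \sqrt{\bar\alpha_{\kappa_M} / (1-\bar\alpha_{\kappa_M})}\,(x_0 - y_0)$. The sketch below checks this identity numerically. It is an algebraic consequence of the two equations, not a description of what the trained predictor outputs, and the helper names, the toy pixel values, the assumption that the LR image has the HR size, and the chosen $\bar\alpha$ are all hypothetical.

```python
import math
import random

def perturb(image, noise, alpha_bar_k):
    """sqrt(alpha_bar) * image + sqrt(1 - alpha_bar) * noise, the form shared by Eqs. (7) and (8)."""
    s, n = math.sqrt(alpha_bar_k), math.sqrt(1.0 - alpha_bar_k)
    return [s * p + n * e for p, e in zip(image, noise)]

def ideal_noise_map(x0, y0, xi, alpha_bar_k):
    """Noise map that makes Eq. (8) reproduce Eq. (7): xi plus a scaled HR-LR residual."""
    scale = math.sqrt(alpha_bar_k / (1.0 - alpha_bar_k))
    return [e + scale * (x - y) for x, y, e in zip(x0, y0, xi)]

rng = random.Random(0)
x0 = [0.8, 0.2, -0.4, 0.6]   # toy HR pixels
y0 = [0.7, 0.3, -0.3, 0.5]   # toy LR pixels, assumed to be at the HR size already
xi = [rng.gauss(0.0, 1.0) for _ in x0]
alpha_bar_k = 0.675          # assumed value, roughly an amplitude SNR of 1.44

from_hr = perturb(x0, xi, alpha_bar_k)                                        # Eq. (7)
from_lr = perturb(y0, ideal_noise_map(x0, y0, xi, alpha_bar_k), alpha_bar_k)  # Eq. (8)
max_gap = max(abs(a - b) for a, b in zip(from_hr, from_lr))
```

The extra term is proportional to the residual $x_0 - y_0$, so such a noise map is no longer zero-mean and carries image structure.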

To summarize, we establish an inversion trajectory by

combining Eqs. (6) and (8), which can be used to solve the

SR problem via iterative generation along this trajectory.
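The trajectory obtained by combining Eqs. (6) and (8) can be sketched as a short loop. All callables below are stand-ins supplied by the caller: `f_w` for the noise predictor, `g_theta` for the deterministic reverse step of Eq. (4), and `sigma` for the noise scale, whose form is not given in this section. Skipping the noise injection at the final step is an assumption that mirrors $z_0 = 0$ in Eq. (3). The function `run_inversion` is hypothetical and not taken from the released code.

```python
import math

def run_inversion(y0, kappas, alpha_bar, f_w, g_theta, sigma):
    """Follow Eqs. (8) and (6): build x_{kappa_M} from y0, then step down to x_0.

    kappas   : ascending timesteps [kappa_1, ..., kappa_M]
    alpha_bar: callable t -> cumulative alpha at timestep t
    f_w      : callable (y0, t) -> predicted noise map (stand-in for the noise predictor)
    g_theta  : callable (x, t) -> deterministic part of the reverse step, Eq. (4)
    sigma    : callable t -> noise scale of the reverse step
    """
    k_start = kappas[-1]
    a = alpha_bar(k_start)
    noise = f_w(y0, k_start)
    # Eq. (8): perturb the LR image with the predicted noise map.
    x = [math.sqrt(a) * y + math.sqrt(1.0 - a) * n for y, n in zip(y0, noise)]

    schedule = [0] + list(kappas)  # kappa_0 = 0
    for i in range(len(kappas), 0, -1):
        k_i, k_prev = schedule[i], schedule[i - 1]
        x = g_theta(x, k_i)
        if k_prev > 0:
            # Eq. (6): inject the predicted noise for the next state.
            z = f_w(y0, k_prev)
            x = [m + sigma(k_i) * zz for m, zz in zip(x, z)]
        # For k_prev == 0 no noise is added (assumption mirroring z_0 = 0).
    return x

# Toy stand-ins so the loop can run without any trained network.
calls = []

def toy_g(x, t):
    calls.append(t)
    return [0.9 * v for v in x]

result = run_inversion(
    y0=[0.5, -0.5],
    kappas=[50, 150, 250],
    alpha_bar=lambda t: 1.0 - t / 1000.0,
    f_w=lambda y, t: [0.0 for _ in y],
    g_theta=toy_g,
    sigma=lambda t: 0.1,
)
```

The denoiser stand-in is called once per entry of `kappas`, so the cost of sampling grows with $M$ while the starting state is always built by a single call to the noise predictor.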

3.2.3. Model Training

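Given a pre-trained large-scale diffusion model $\epsilon_\theta(\cdot)$, an estimation of the HR image $x_0$ can be obtained from $x_{\kappa_i}$ by taking a reverse diffusion step:

$$\hat{x}_{0 \leftarrow \kappa_i} = \frac{1}{\sqrt{\bar\alpha_{\kappa_i}}}\left(x_{\kappa_i} - \sqrt{1-\bar\alpha_{\kappa_i}}\,\epsilon_\theta(x_{\kappa_i}, \kappa_i)\right), \tag{9}$$

where $x_{\kappa_i}$ is defined by Eq. (8) for $i = M$ and Eq. (6) for $i < M$. It is thus possible to train the noise predictor $f_w(\cdot)$ by minimizing the distance between $\hat{x}_{0 \leftarrow \kappa_i}$ and $x_0$.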

However, directly training with this objective is computationally impractical. Specifically, as shown in Eq. (6), calculating $x_{\kappa_i}$ ($i < M$) necessitates recurrent application of the large-scale diffusion model $\epsilon_\theta$, which leads to prohibitive GPU memory usage. To circumvent this, we adopt an alternative version for $x_{\kappa_i}$ based on the marginal distribution in Eq. (2), i.e.,

$$x_{\kappa_i} = \sqrt{\bar\alpha_{\kappa_i}}\,x_0 + \sqrt{1-\bar\alpha_{\kappa_i}}\,f_w(y_0, \kappa_i), \quad i < M. \tag{10}$$

This modification also aligns better with the training process of the employed diffusion model, allowing for more effective leveraging of the prior knowledge embedded in it.
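A toy version of this objective, using Eq. (10) to build the intermediate state and Eq. (9) to map it back to an estimate of $x_0$, might look as follows. The mean squared error is only a stand-in for the unspecified "distance", the denoiser output is supplied by hand instead of coming from $\epsilon_\theta$, and all names and values are hypothetical.

```python
import math
import random

def intermediate_state(x0, predicted_noise, alpha_bar_k):
    """Eq. (10): noisy state built from the HR image and the predicted noise map."""
    s, n = math.sqrt(alpha_bar_k), math.sqrt(1.0 - alpha_bar_k)
    return [s * x + n * e for x, e in zip(x0, predicted_noise)]

def estimate_x0(x_k, eps_pred, alpha_bar_k):
    """Eq. (9): one-step estimate of x_0 from x_k and the denoiser output."""
    s, n = math.sqrt(alpha_bar_k), math.sqrt(1.0 - alpha_bar_k)
    return [(x - n * e) / s for x, e in zip(x_k, eps_pred)]

def mse(a, b):
    """Mean squared error, used here as a stand-in distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

rng = random.Random(0)
x0 = [0.8, 0.2, -0.4, 0.6]                     # toy HR pixels
noise_map = [rng.gauss(0.0, 1.0) for _ in x0]  # stands in for f_w(y_0, kappa_i)
alpha_bar_k = 0.8                              # assumed value

x_k = intermediate_state(x0, noise_map, alpha_bar_k)
# An idealized denoiser that returns exactly the injected noise recovers x_0.
loss_ideal = mse(estimate_x0(x_k, noise_map, alpha_bar_k), x0)
# A denoiser that is off by a constant leaves a non-zero loss.
biased = [e + 0.1 for e in noise_map]
loss_biased = mse(estimate_x0(x_k, biased, alpha_bar_k), x0)
```

Each training sample then needs only one evaluation of the frozen denoiser per selected timestep, rather than the recurrent chain implied by Eq. (6).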

We now detail the training procedure step by step:

Gaussian Constraint. The pre-trained diffusion model is a powerful denoiser tailored for Gaussian noise with zero mean and varying variances. Hence, it is reasonable to enforce the noise map predicted by $f_w$ to obey a Gaussian distribution. For the initial state $x_{\kappa_M}$, it is observed that the predicted noise map $f_w(y_0, \kappa_M)$ exhibits a mean shift, which is evident when comparing Eqs. (7) and (8), due to the substitution of $y_0$ for $x_0$. Moreover, the visualization presented in Figs. 2 and 3 further validates this observation, illustrating that the predicted noise map is clearly correlated with the LR image. Therefore, we do not consider the Gaussian constraint for $x_{\kappa_M}$
