Drag Your GAN: Interactive Point-based Manipulation on the
Generative Image Manifold
XINGANG PAN, Max Planck Institute for Informatics, Germany and Saarbrücken Research Center for Visual Computing,
Interaction and AI, Germany
AYUSH TEWARI, MIT CSAIL, USA
THOMAS LEIMKÜHLER, Max Planck Institute for Informatics, Germany
LINGJIE LIU, Max Planck Institute for Informatics, Germany and University of Pennsylvania, USA
ABHIMITRA MEKA, Google AR/VR, USA
CHRISTIAN THEOBALT,
Max Planck Institute for Informatics, Germany and Saarbrücken Research Center for Visual
Computing, Interaction and AI, Germany
[Figure 1 panels: Image + User input (1st Edit), Result, 2nd Edit, Result]
Fig. 1. Our approach DragGAN allows users to "drag" the content of any GAN-generated image. Users only need to click a few handle points (red) and target points (blue) on the image, and our approach will move the handle points to precisely reach their corresponding target points. Users can optionally draw a mask of the flexible region (brighter area), keeping the rest of the image fixed. This flexible point-based manipulation enables control of many spatial attributes like pose, shape, expression, and layout across diverse object categories. Project page: https://vcai.mpi-inf.mpg.de/projects/DragGAN/.
Synthesizing visual content that meets users’ needs often requires flexible and precise controllability of the pose, shape, expression, and layout of the generated objects. Existing approaches gain controllability of generative adversarial networks (GANs) via manually annotated training data or a prior 3D model, which often lack flexibility, precision, and generality. In this work, we study a powerful yet much less explored way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner, as shown in Fig. 1. To achieve this, we propose DragGAN, which consists of two main components: 1) a feature-based motion supervision that drives the handle point to move towards the target position, and 2) a new point tracking approach that leverages the discriminative generator features to keep localizing the position of the handle points. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression,
and layout of diverse categories such as animals, cars, humans, landscapes, etc. As these manipulations are performed on the learned generative image manifold of a GAN, they tend to produce realistic outputs even for challenging scenarios such as hallucinating occluded content and deforming shapes that consistently follow the object’s rigidity. Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking. We also showcase the manipulation of real images through GAN inversion.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGGRAPH ’23 Conference Proceedings, August 6–10, 2023, Los Angeles, CA, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0159-7/23/08.
https://doi.org/10.1145/3588432.3591500
CCS Concepts: • Computing methodologies → Computer vision.
Additional Key Words and Phrases: GANs, interactive image manipulation,
point tracking
ACM Reference Format:
Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra
Meka, and Christian Theobalt. 2023. Drag Your GAN: Interactive Point-
based Manipulation on the Generative Image Manifold. In Special Interest
Group on Computer Graphics and Interactive Techniques Conference Conference
Proceedings (SIGGRAPH ’23 Conference Proceedings), August 6–10, 2023, Los
Angeles, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.
1145/3588432.3591500
1 INTRODUCTION
Deep generative models such as generative adversarial networks (GANs) [Goodfellow et al. 2014] have achieved unprecedented success in synthesizing random photorealistic images. In real-world applications, a critical functionality requirement of such learning-based image synthesis methods is the controllability over the synthesized visual content. For example, social-media users might want to adjust the position, shape, expression, and body pose of a human or animal in a casually-captured photo; professional movie pre-visualization and media editing may require efficiently creating sketches of scenes with certain layouts; and car designers may want to interactively modify the shape of their creations. To satisfy these diverse user requirements, an ideal controllable image synthesis approach should possess the following properties: 1) Flexibility: it should be able to control different spatial attributes including position, pose, shape, expression, and layout of the generated objects or animals; 2) Precision: it should be able to control the spatial attributes with high precision; 3) Generality: it should be applicable to different object categories but not limited to a certain category. While previous works only satisfy one or two of these properties, we aim to achieve them all in this work.
Most previous approaches gain controllability of GANs via prior 3D models [Deng et al. 2020; Ghosh et al. 2020; Tewari et al. 2020] or supervised learning that relies on manually annotated data [Abdal et al. 2021; Isola et al. 2017; Ling et al. 2021; Park et al. 2019; Shen et al. 2020]. Thus, these approaches fail to generalize to new object categories, often control a limited range of spatial attributes, or provide little control over the editing process. Recently, text-guided image synthesis has attracted attention [Ramesh et al. 2022; Rombach et al. 2021; Saharia et al. 2022]. However, text guidance lacks precision and flexibility in terms of editing spatial attributes. For example, it cannot be used to move an object by a specific number of pixels.
To achieve flexible, precise, and generic controllability of GANs, in this work, we explore a powerful yet much less explored interactive point-based manipulation. Specifically, we allow users to click any number of handle points and target points on the image, and the goal is to drive the handle points to reach their corresponding target points. As shown in Fig. 1, this point-based manipulation allows users to control diverse spatial attributes and is agnostic to object categories. The approach with the closest setting to ours is UserControllableLT [Endo 2022], which also studies dragging-based manipulation. Compared to it, the problem studied in this paper has two more challenges: 1) we consider the control of more than one point, which their approach does not handle well; 2) we require the handle points to precisely reach the target points while their approach does not. As we will show in experiments, handling more than one point with precise position control enables much more diverse and accurate image manipulation.
To achieve such interactive point-based manipulation, we propose DragGAN, which addresses two sub-problems: 1) supervising the handle points to move towards the targets and 2) tracking the handle points so that their positions are known at each editing step. Our technique is built on the key insight that the feature space of a GAN is sufficiently discriminative to enable both motion supervision and precise point tracking. Specifically, the motion supervision is achieved via a shifted feature patch loss that optimizes the latent code. Each optimization step leads to the handle points shifting closer to the targets; point tracking is then performed through nearest neighbor search in the feature space. This optimization process is repeated until the handle points reach the targets. DragGAN also allows users to optionally draw a region of interest to perform region-specific editing. Since DragGAN does not rely on any additional networks like RAFT [Teed and Deng 2020], it achieves efficient manipulation, only taking a few seconds on a single RTX 3090 GPU in most cases. This allows for live, interactive editing sessions, in which the user can quickly iterate on different layouts until the desired output is achieved.
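To make this loop concrete, the sketch below outlines one way a single editing step could be implemented in PyTorch. It is a minimal illustration under several assumptions: synthesize(w) is a hypothetical callable returning an intermediate generator feature map of shape (1, C, H, W) that is differentiable with respect to the latent code w, and the helper functions, patch radii r1 and r2, L1 objectives, and learning schedule are illustrative choices rather than our released implementation.

import torch
import torch.nn.functional as F

def patch_coords(p, radius, H, W, device):
    # Pixel coordinates of a square patch centred on the point p = (x, y).
    ys, xs = torch.meshgrid(
        torch.arange(-radius, radius + 1, device=device),
        torch.arange(-radius, radius + 1, device=device), indexing="ij")
    x = (p[0] + xs.flatten()).clamp(0, W - 1)
    y = (p[1] + ys.flatten()).clamp(0, H - 1)
    return x, y

def sample_features(feat, x, y):
    # Bilinearly sample a (1, C, H, W) feature map at N pixel locations -> (C, N).
    _, _, H, W = feat.shape
    grid = torch.stack([2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1], dim=-1)
    return F.grid_sample(feat, grid.view(1, 1, -1, 2), align_corners=True)[0, :, 0, :]

def drag_step(synthesize, w, optimizer, handles, targets, templates, r1=3, r2=12):
    # One editing step: motion supervision on the latent code, then point tracking.
    feat = synthesize(w)                                   # assumed differentiable in w
    _, _, H, W = feat.shape

    # Motion supervision: the features around each handle should reappear
    # one small step towards its target (shifted-patch objective).
    loss = 0.0
    for p, t in zip(handles, targets):
        d = (t - p) / (torch.norm(t - p) + 1e-8)           # unit direction towards target
        x, y = patch_coords(p, r1, H, W, feat.device)
        current = sample_features(feat, x, y).detach()     # stop-gradient reference
        shifted = sample_features(feat, x + d[0], y + d[1])
        loss = loss + F.l1_loss(shifted, current)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Point tracking: re-localize each handle by nearest-neighbour search in
    # feature space within a small window around its previous position.
    with torch.no_grad():
        feat = synthesize(w)
        for i, p in enumerate(handles):
            x, y = patch_coords(p, r2, H, W, feat.device)
            candidates = sample_features(feat, x, y)                        # (C, N)
            dist = (candidates - templates[i][:, None]).abs().sum(dim=0)    # L1 distance
            j = int(dist.argmin())
            handles[i] = torch.stack([x[j], y[j]])
    return float(loss)

In an interactive session, the latent code would be set to require gradients (w.requires_grad_(True)), the initial handle features would be stored once as templates (e.g., templates[i] = synthesize(w)[0, :, y_i, x_i].detach()), and drag_step would be called repeatedly until every handle lies within a pixel or so of its target, after which the edited image is decoded from the updated latent code.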
We conduct an extensive evaluation of DragGAN on diverse datasets including animals (lions, dogs, cats, and horses), humans (face and whole body), cars, and landscapes. As shown in Fig. 1, our approach effectively moves the user-defined handle points to the target points, achieving diverse manipulation effects across many object categories. Unlike conventional shape deformation approaches that simply apply warping [Igarashi et al. 2005], our deformation is performed on the learned image manifold of a GAN, which tends to obey the underlying object structures. For example, our approach can hallucinate occluded content, like the teeth inside a lion’s mouth, and can deform following the object’s rigidity, like the bending of a horse leg. We also develop a GUI for users to interactively perform the manipulation by simply clicking on the image. Both qualitative and quantitative comparisons confirm the advantage of our approach over UserControllableLT. Furthermore, our GAN-based point tracking algorithm outperforms existing point tracking approaches such as RAFT [Teed and Deng 2020] and PIPs [Harley et al. 2022] for GAN-generated frames. Finally, by combining with GAN inversion techniques, our approach also serves as a powerful tool for real image editing.
2 RELATED WORK
2.1 Generative Models for Interactive Content Creation
Most current methods use generative adversarial networks (GANs) or diffusion models for controllable image synthesis.
Unconditional GANs. GANs are generative models that transform low-dimensional randomly sampled latent vectors into photorealistic images. They are trained using adversarial learning and can be used to generate high-resolution photorealistic images [Creswell et al. 2018; Goodfellow et al. 2014; Karras et al. 2021, 2019]. Most GAN models like StyleGAN [Karras et al. 2019] do not directly enable controllable editing of the generated images.
Conditional GANs. Several methods have proposed conditional GANs to address this limitation. Here, the network receives a conditional input, such as a segmentation map [Isola et al. 2017; Park et al. 2019] or 3D variables [Deng et al. 2020; Ghosh et al. 2020], in addition to the randomly sampled latent vector to generate photorealistic images. Instead of modeling the conditional distribution, EditGAN [Ling et al. 2021] enables editing by first modeling a joint distribution of images and segmentation maps, and then computing new images corresponding to edited segmentation maps.
Controllability using Unconditional GANs. Several methods have been proposed for editing unconditional GANs by manipulating the input latent vectors. Some approaches find meaningful latent directions via supervised learning from manual annotations or prior 3D models [Abdal et al. 2021; Leimkühler and Drettakis 2021; Patashnik et al. 2021; Shen et al. 2020; Tewari et al. 2020]. Other approaches compute the important semantic directions in the latent space in an unsupervised manner [Härkönen et al. 2020; Shen and Zhou 2020; Zhu et al. 2023]. Recently, controllability over coarse object position has been achieved by introducing intermediate “blobs” [Epstein et al. 2022] or heatmaps [Wang et al. 2022b]. All of these approaches enable editing of either image-aligned semantic attributes such as appearance, or coarse geometric attributes such as object position and pose. While Editing-in-Style [Collins et al. 2020] showcases some spatial attribute editing capability, it can only achieve this by transferring local semantics between different samples. In contrast to these methods, our approach allows users to perform fine-grained control over the spatial attributes using point-based editing.
GANWarping [Wang et al. 2022a] also uses point-based editing; however, it only enables out-of-distribution image editing. A few warped images can be used to update the generative model such that all generated images demonstrate similar warps. However, this method does not ensure that the warps lead to realistic images. Further, it does not enable controls such as changing the 3D pose of the object. Similar to us, UserControllableLT [Endo 2022] enables point-based editing by transforming latent vectors of a GAN. However, this approach only supports editing using a single point being dragged on the image and does not handle multiple-point constraints well. In addition, the control is not precise, i.e., after editing, the target point is often not reached.
3D-aware GANs. Several methods modify the architecture of the GAN to enable 3D control [Chan et al. 2022, 2021; Chen et al. 2022; Gu et al. 2022; Pan et al. 2021; Schwarz et al. 2020; Tewari et al. 2022; Xu et al. 2022]. Here, the model generates 3D representations that can be rendered using a physically-based analytic renderer. However, unlike our approach, control is limited to global pose or lighting.
Diffusion Models. More recently, diffusion models [Sohl-Dickstein et al. 2015] have enabled image synthesis at high quality [Ho et al. 2020; Song et al. 2020, 2021]. These models iteratively denoise a randomly sampled noise to create a photorealistic image. Recent models have shown expressive image synthesis conditioned on text inputs [Ramesh et al. 2022; Rombach et al. 2021; Saharia et al. 2022]. However, natural language does not enable fine-grained control over the spatial attributes of images, and thus, all text-conditional methods are restricted to high-level semantic editing. In addition, current diffusion models are slow since they require multiple denoising steps. While progress has been made toward efficient sampling, GANs are still significantly more efficient.
2.2 Point Tracking
To track points in videos, an obvious approach is through optical flow estimation between consecutive frames. Optical flow estimation is a classic problem that estimates motion fields between two images. Conventional approaches solve optimization problems with hand-crafted criteria [Brox and Malik 2010; Sundaram et al. 2010], while deep learning-based approaches started to dominate the field in recent years due to better performance [Dosovitskiy et al. 2015; Ilg et al. 2017; Teed and Deng 2020]. These deep learning-based approaches typically use synthetic data with ground truth optical flow to train the deep neural networks. Among them, the most widely used method now is RAFT [Teed and Deng 2020], which estimates optical flow via an iterative algorithm. Recently, Harley et al. [2022] combines this iterative algorithm with a conventional “particle video” approach, giving rise to a new point tracking method named PIPs. PIPs considers information across multiple frames and thus handles long-range tracking better than previous approaches.
In this work, we show that point tracking on GAN-generated images can be performed without using any of the aforementioned approaches or additional neural networks. We reveal that the feature spaces of GANs are discriminative enough such that tracking can be achieved simply via feature matching. While some previous works also leverage the discriminative features in semantic segmentation [Tritrong et al. 2021; Zhang et al. 2021], we are the first to connect the point-based editing problem to the intuition of discriminative GAN features and design a concrete method. Getting rid of additional tracking models allows our approach to run much more efficiently to support interactive editing. Despite the simplicity of our approach, we show that it outperforms the state-of-the-art point tracking approaches including RAFT and PIPs in our experiments.
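As a small illustration of this idea, the toy tracker below follows a point through a sequence of GAN feature maps purely by nearest-neighbour feature matching against the first frame. It is a sketch under stated assumptions: the choice of feature layer, the L1 distance, and the search-window radius are illustrative and not necessarily the configuration used in the experiments.

import torch

def track_by_feature_matching(feature_maps, p0, radius=8):
    # feature_maps: list of (C, H, W) tensors, e.g. one intermediate GAN
    # activation per generated frame; p0 = (x0, y0) is the pixel to track.
    x, y = int(p0[0]), int(p0[1])
    template = feature_maps[0][:, y, x].clone()        # reference feature vector
    track = [(x, y)]
    for feat in feature_maps[1:]:
        C, H, W = feat.shape
        # Restrict the nearest-neighbour search to a window around the
        # previous position to keep the matching local and cheap.
        xs, xe = max(0, x - radius), min(W, x + radius + 1)
        ys, ye = max(0, y - radius), min(H, y + radius + 1)
        window = feat[:, ys:ye, xs:xe]                 # (C, h, w)
        dist = (window - template[:, None, None]).abs().sum(dim=0)
        j = int(dist.argmin())
        dy, dx = divmod(j, window.shape[2])
        x, y = xs + dx, ys + dy
        track.append((x, y))
    return track

Because the matching is purely local and requires no extra network, this mechanism can be evaluated directly on GAN-generated frame pairs and compared against learned trackers such as RAFT and PIPs.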
3 METHOD
This work aims to develop an interactive image manipulation method for GANs where users only need to click on the images to define some pairs of (handle point, target point) and drive the handle points to reach their corresponding target points. Our study is based on the StyleGAN2 architecture [Karras et al. 2020]. Here we briefly introduce the basics of this architecture.
StyleGAN Terminology. In the StyleGAN2 architecture, a 512-dimensional latent code 𝒛 ∈ N(0, 𝑰) is mapped to an intermediate latent code 𝒘 ∈ R^512 via a mapping network. The space of 𝒘 is commonly referred to as W. 𝒘 is then sent to the generator 𝐺 to produce the output image I = 𝐺(𝒘). In this process, 𝒘 is copied several times and sent to different layers of the generator 𝐺 to control different levels of attributes. Alternatively, one can also use different 𝒘 for different layers, in which case the input would be 𝒘 ∈ R^(𝑙×512) = W+, where 𝑙 is the number of layers. This less constrained W+ space is shown to be more expressive [Abdal et al. 2019]. As the generator 𝐺 learns a mapping from a low-dimensional latent space to a much higher dimensional image space, it can be seen as modelling an image manifold [Zhu et al. 2016].
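For readers who prefer code, the snippet below sketches these shapes with a stand-in mapping network; the layer count and the tiny two-layer MLP are placeholders for illustration only, not the actual StyleGAN2 implementation.

import torch

num_layers = 14                          # l: assumed layer count, for illustration
mapping = torch.nn.Sequential(           # stand-in for StyleGAN2's mapping network
    torch.nn.Linear(512, 512), torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(512, 512))

z = torch.randn(1, 512)                  # z ~ N(0, I)
w = mapping(z)                           # w in R^512, an element of the W space

# W space: the same w is broadcast to every layer of the generator.
w_layers = w.unsqueeze(1).repeat(1, num_layers, 1)          # shape (1, l, 512)

# W+ space: each layer may receive its own code, e.g. after GAN inversion;
# the shape is the same, but the rows are allowed to differ.
w_plus = w_layers.clone()
w_plus[:, 8:] = mapping(torch.randn(1, 512))                # mix codes per layer

Optimizing per-layer codes of this form is a common choice for latent-space editing, since the extra degrees of freedom often make precise spatial edits easier to realize.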
3.1 Interactive Point-based Manipulation
An overview of our image manipulation pipeline is shown in Fig. 2. For any image I ∈ R^(3×𝐻×𝑊) generated by a GAN with latent code 𝒘, we allow the user to input a number of handle points {𝒑_𝑖 = (𝑥_{𝑝,𝑖}, 𝑦_{𝑝,𝑖}) | 𝑖 = 1, 2, ..., 𝑛} and their corresponding target points {𝒕_𝑖 = (𝑥_{𝑡,𝑖}, 𝑦_{𝑡,𝑖}) | 𝑖 = 1, 2, ..., 𝑛} (i.e., the corresponding target point of 𝒑_𝑖 is 𝒕_𝑖). The goal is to move the object in the image such that the semantic positions of the handle points reach their corresponding target points.