Drag Your GAN: Interactive Point-based Manipulation on the
Generative Image Manifold
XINGANG PAN, Max Planck Institute for Informatics, Germany and Saarbrücken Research Center for Visual Computing,
Interaction and AI, Germany
AYUSH TEWARI, MIT CSAIL, USA
THOMAS LEIMKÜHLER, Max Planck Institute for Informatics, Germany
LINGJIE LIU, Max Planck Institute for Informatics, Germany and University of Pennsylvania, USA
ABHIMITRA MEKA, Google AR/VR, USA
CHRISTIAN THEOBALT,
Max Planck Institute for Informatics, Germany and Saarbrücken Research Center for Visual
Computing, Interaction and AI, Germany
[Figure 1 panels: Image + User input (1st Edit), Result, 2nd Edit, Result]
Fig. 1. Our approach DragGAN allows users to "drag" the content of any GAN-generated image. Users only need to click a few handle points (red) and target points (blue) on the image, and our approach will move the handle points to precisely reach their corresponding target points. Users can optionally draw a mask of the flexible region (brighter area), keeping the rest of the image fixed. This flexible point-based manipulation enables control of many spatial attributes like pose, shape, expression, and layout across diverse object categories. Project page: https://vcai.mpi-inf.mpg.de/projects/DragGAN/.
Synthesizing visual content that meets users’ needs often requires flexible and precise controllability of the pose, shape, expression, and layout of the generated objects. Existing approaches gain controllability of generative adversarial networks (GANs) via manually annotated training data or a prior 3D model, which often lack flexibility, precision, and generality. In this work, we study a powerful yet much less explored way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner, as shown in Fig. 1. To achieve this, we propose DragGAN, which consists of two main components: 1) a feature-based motion supervision that drives the handle point to move towards the target position, and 2) a new point tracking approach that leverages the discriminative generator features to keep localizing the position of the handle points. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression,
and layout of diverse categories such as animals, cars, humans, landscapes, etc. As these manipulations are performed on the learned generative image manifold of a GAN, they tend to produce realistic outputs even for challenging scenarios such as hallucinating occluded content and deforming shapes that consistently follow the object’s rigidity. Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking. We also showcase the manipulation of real images through GAN inversion.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGGRAPH ’23 Conference Proceedings, August 6–10, 2023, Los Angeles, CA, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0159-7/23/08.
https://doi.org/10.1145/3588432.3591500
CCS Concepts: • Computing methodologies → Computer vision.
Additional Key Words and Phrases: GANs, interactive image manipulation,
point tracking
ACM Reference Format:
Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra
Meka, and Christian Theobalt. 2023. Drag Your GAN: Interactive Point-
based Manipulation on the Generative Image Manifold. In Special Interest
Group on Computer Graphics and Interactive Techniques Conference Conference
Proceedings (SIGGRAPH ’23 Conference Proceedings), August 6–10, 2023, Los
Angeles, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.
1145/3588432.3591500
1 INTRODUCTION
Deep generative models such as generative adversarial networks (GANs) [Goodfellow et al. 2014] have achieved unprecedented success in synthesizing random photorealistic images. In real-world applications, a critical functionality requirement of such learning-based image synthesis methods is the controllability over the synthesized visual content. For example, social-media users might want to adjust the position, shape, expression, and body pose of a human or animal in a casually-captured photo; professional movie pre-visualization and media editing may require efficiently creating sketches of scenes with certain layouts; and car designers may want to interactively modify the shape of their creations. To satisfy these diverse user requirements, an ideal controllable image synthesis approach should possess the following properties: 1) Flexibility: it should be able to control different spatial attributes including position, pose, shape, expression, and layout of the generated objects or animals; 2) Precision: it should be able to control the spatial attributes with high precision; 3) Generality: it should be applicable to different object categories but not limited to a certain category. While previous works only satisfy one or two of these properties, we aim to achieve them all in this work.
Most previous approaches gain controllability of GANs via prior 3D models [Deng et al. 2020; Ghosh et al. 2020; Tewari et al. 2020] or supervised learning that relies on manually annotated data [Abdal et al. 2021; Isola et al. 2017; Ling et al. 2021; Park et al. 2019; Shen et al. 2020]. Thus, these approaches fail to generalize to new object categories, often control a limited range of spatial attributes, or provide little control over the editing process. Recently, text-guided image synthesis has attracted attention [Ramesh et al. 2022; Rombach et al. 2021; Saharia et al. 2022]. However, text guidance lacks precision and flexibility in terms of editing spatial attributes. For example, it cannot be used to move an object by a specific number of pixels.
To achieve flexible, precise, and generic controllability of GANs, in this work, we explore a powerful yet much less explored interactive point-based manipulation. Specifically, we allow users to click any number of handle points and target points on the image, and the goal is to drive the handle points to reach their corresponding target points. As shown in Fig. 1, this point-based manipulation allows users to control diverse spatial attributes and is agnostic to object categories. The approach with the closest setting to ours is UserControllableLT [Endo 2022], which also studies dragging-based manipulation. Compared to it, the problem studied in this paper has two more challenges: 1) we consider the control of more than one point, which their approach does not handle well; 2) we require the handle points to precisely reach the target points while their approach does not. As we will show in experiments, handling more than one point with precise position control enables much more diverse and accurate image manipulation.
To achieve such interactive point-based manipulation, we propose DragGAN, which addresses two sub-problems: 1) supervising the handle points to move towards the targets and 2) tracking the handle points so that their positions are known at each editing step. Our technique is built on the key insight that the feature space of a GAN is sufficiently discriminative to enable both motion supervision and precise point tracking. Specifically, the motion supervision is achieved via a shifted feature patch loss that optimizes the latent code. Each optimization step leads to the handle points shifting closer to the targets; point tracking is then performed through nearest neighbor search in the feature space. This optimization process is repeated until the handle points reach the targets. DragGAN also allows users to optionally draw a region of interest to perform region-specific editing. Since DragGAN does not rely on any additional networks like RAFT [Teed and Deng 2020], it achieves efficient manipulation, only taking a few seconds on a single RTX 3090 GPU in most cases. This allows for live, interactive editing sessions, in which the user can quickly iterate on different layouts until the desired output is achieved.
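To make this loop concrete, the sketch below outlines one way a single editing step could be implemented in PyTorch. It is a minimal illustration under several assumptions: synthesize(w) is a hypothetical callable returning an intermediate generator feature map of shape (1, C, H, W) that is differentiable with respect to the latent code w, and the helper functions, patch radii r1 and r2, L1 objectives, and learning schedule are illustrative choices rather than our released implementation.

import torch
import torch.nn.functional as F

def patch_coords(p, radius, H, W, device):
    # Pixel coordinates of a square patch centred on the point p = (x, y).
    ys, xs = torch.meshgrid(
        torch.arange(-radius, radius + 1, device=device),
        torch.arange(-radius, radius + 1, device=device), indexing="ij")
    x = (p[0] + xs.flatten()).clamp(0, W - 1)
    y = (p[1] + ys.flatten()).clamp(0, H - 1)
    return x, y

def sample_features(feat, x, y):
    # Bilinearly sample a (1, C, H, W) feature map at N pixel locations -> (C, N).
    _, _, H, W = feat.shape
    grid = torch.stack([2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1], dim=-1)
    return F.grid_sample(feat, grid.view(1, 1, -1, 2), align_corners=True)[0, :, 0, :]

def drag_step(synthesize, w, optimizer, handles, targets, templates, r1=3, r2=12):
    # One editing step: motion supervision on the latent code, then point tracking.
    feat = synthesize(w)                                   # assumed differentiable in w
    _, _, H, W = feat.shape

    # Motion supervision: the features around each handle should reappear
    # one small step towards its target (shifted-patch objective).
    loss = 0.0
    for p, t in zip(handles, targets):
        d = (t - p) / (torch.norm(t - p) + 1e-8)           # unit direction towards target
        x, y = patch_coords(p, r1, H, W, feat.device)
        current = sample_features(feat, x, y).detach()     # stop-gradient reference
        shifted = sample_features(feat, x + d[0], y + d[1])
        loss = loss + F.l1_loss(shifted, current)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Point tracking: re-localize each handle by nearest-neighbour search in
    # feature space within a small window around its previous position.
    with torch.no_grad():
        feat = synthesize(w)
        for i, p in enumerate(handles):
            x, y = patch_coords(p, r2, H, W, feat.device)
            candidates = sample_features(feat, x, y)                        # (C, N)
            dist = (candidates - templates[i][:, None]).abs().sum(dim=0)    # L1 distance
            j = int(dist.argmin())
            handles[i] = torch.stack([x[j], y[j]])
    return float(loss)

In an interactive session, the latent code would be set to require gradients (w.requires_grad_(True)), the initial handle features would be stored once as templates (e.g., templates[i] = synthesize(w)[0, :, y_i, x_i].detach()), and drag_step would be called repeatedly until every handle lies within a pixel or so of its target, after which the edited image is decoded from the updated latent code.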
We conduct an extensive evaluation of DragGAN on diverse datasets including animals (lions, dogs, cats, and horses), humans (face and whole body), cars, and landscapes. As shown in Fig. 1, our approach effectively moves the user-defined handle points to the target points, achieving diverse manipulation effects across many object categories. Unlike conventional shape deformation approaches that simply apply warping [Igarashi et al. 2005], our deformation is performed on the learned image manifold of a GAN, which tends to obey the underlying object structures. For example, our approach can hallucinate occluded content, like the teeth inside a lion’s mouth, and can deform following the object’s rigidity, like the bending of a horse leg. We also develop a GUI for users to interactively perform the manipulation by simply clicking on the image. Both qualitative and quantitative comparisons confirm the advantage of our approach over UserControllableLT. Furthermore, our GAN-based point tracking algorithm outperforms existing point tracking approaches such as RAFT [Teed and Deng 2020] and PIPs [Harley et al. 2022] for GAN-generated frames. Finally, by combining with GAN inversion techniques, our approach also serves as a powerful tool for real image editing.
2 RELATED WORK
2.1 Generative Models for Interactive Content Creation
Most current methods use generative adversarial networks (GANs) or diffusion models for controllable image synthesis.
Unconditional GANs. GANs are generative models that transform low-dimensional randomly sampled latent vectors into photorealistic images. They are trained using adversarial learning and can be used to generate high-resolution photorealistic images [Creswell et al. 2018; Goodfellow et al. 2014; Karras et al. 2021, 2019]. Most GAN models like StyleGAN [Karras et al. 2019] do not directly enable controllable editing of the generated images.
Conditional GANs. Several methods have proposed conditional GANs to address this limitation. Here, the network receives a conditional input, such as a segmentation map [Isola et al. 2017; Park et al. 2019] or 3D variables [Deng et al. 2020; Ghosh et al. 2020], in addition to the randomly sampled latent vector to generate photorealistic images. Instead of modeling the conditional distribution, EditGAN [Ling et al. 2021] enables editing by first modeling a joint distribution of images and segmentation maps, and then computing new images corresponding to edited segmentation maps.
Controllability using Unconditional GANs. Several methods have been proposed for editing unconditional GANs by manipulating the input latent vectors. Some approaches find meaningful latent directions via supervised learning from manual annotations or prior 3D models [Abdal et al. 2021; Leimkühler and Drettakis 2021; Patashnik et al. 2021; Shen et al. 2020; Tewari et al. 2020]. Other approaches compute the important semantic directions in the latent space in an unsupervised manner [Härkönen et al. 2020; Shen and Zhou 2020; Zhu et al. 2023]. Recently, controllability over coarse object position has been achieved by introducing intermediate “blobs” [Epstein et al. 2022] or heatmaps [Wang et al. 2022b]. All of these approaches enable editing of either image-aligned semantic attributes such as appearance, or coarse geometric attributes such as object position and pose. While Editing-in-Style [Collins et al. 2020] showcases some spatial attribute editing capability, it can only achieve this by transferring local semantics between different samples. In contrast to these methods, our approach allows users to perform fine-grained control over the spatial attributes using point-based editing.
GANWarping [Wang et al. 2022a] also uses point-based editing; however, it only enables out-of-distribution image editing. A few warped images can be used to update the generative model such that all generated images demonstrate similar warps. However, this method does not ensure that the warps lead to realistic images. Further, it does not enable controls such as changing the 3D pose of the object. Similar to us, UserControllableLT [Endo 2022] enables point-based editing by transforming latent vectors of a GAN. However, this approach only supports editing using a single point being dragged on the image and does not handle multiple-point constraints well. In addition, the control is not precise, i.e., after editing, the target point is often not reached.
3D-aware GANs. Several methods modify the architecture of the GAN to enable 3D control [Chan et al. 2022, 2021; Chen et al. 2022; Gu et al. 2022; Pan et al. 2021; Schwarz et al. 2020; Tewari et al. 2022; Xu et al. 2022]. Here, the model generates 3D representations that can be rendered using a physically-based analytic renderer. However, unlike our approach, control is limited to global pose or lighting.
Diffusion Models. More recently, diffusion models [Sohl-Dickstein et al. 2015] have enabled image synthesis at high quality [Ho et al. 2020; Song et al. 2020, 2021]. These models iteratively denoise a randomly sampled noise to create a photorealistic image. Recent models have shown expressive image synthesis conditioned on text inputs [Ramesh et al. 2022; Rombach et al. 2021; Saharia et al. 2022]. However, natural language does not enable fine-grained control over the spatial attributes of images, and thus, all text-conditional methods are restricted to high-level semantic editing. In addition, current diffusion models are slow since they require multiple denoising steps. While progress has been made toward efficient sampling, GANs are still significantly more efficient.
2.2 Point Tracking
To track points in videos, an obvious approach is through optical flow estimation between consecutive frames. Optical flow estimation is a classic problem that estimates motion fields between two images. Conventional approaches solve optimization problems with hand-crafted criteria [Brox and Malik 2010; Sundaram et al. 2010], while deep learning-based approaches started to dominate the field in recent years due to better performance [Dosovitskiy et al. 2015; Ilg et al. 2017; Teed and Deng 2020]. These deep learning-based approaches typically use synthetic data with ground truth optical flow to train the deep neural networks. Among them, the most widely used method now is RAFT [Teed and Deng 2020], which estimates optical flow via an iterative algorithm. Recently, Harley et al. [2022] combines this iterative algorithm with a conventional “particle video” approach, giving rise to a new point tracking method named PIPs. PIPs considers information across multiple frames and thus handles long-range tracking better than previous approaches.
In this work, we show that point tracking on GAN-generated images can be performed without using any of the aforementioned approaches or additional neural networks. We reveal that the feature spaces of GANs are discriminative enough such that tracking can be achieved simply via feature matching. While some previous works also leverage the discriminative features in semantic segmentation [Tritrong et al. 2021; Zhang et al. 2021], we are the first to connect the point-based editing problem to the intuition of discriminative GAN features and design a concrete method. Getting rid of additional tracking models allows our approach to run much more efficiently to support interactive editing. Despite the simplicity of our approach, we show that it outperforms the state-of-the-art point tracking approaches including RAFT and PIPs in our experiments.
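As a small illustration of this idea, the toy tracker below follows a point through a sequence of GAN feature maps purely by nearest-neighbour feature matching against the first frame. It is a sketch under stated assumptions: the choice of feature layer, the L1 distance, and the search-window radius are illustrative and not necessarily the configuration used in the experiments.

import torch

def track_by_feature_matching(feature_maps, p0, radius=8):
    # feature_maps: list of (C, H, W) tensors, e.g. one intermediate GAN
    # activation per generated frame; p0 = (x0, y0) is the pixel to track.
    x, y = int(p0[0]), int(p0[1])
    template = feature_maps[0][:, y, x].clone()        # reference feature vector
    track = [(x, y)]
    for feat in feature_maps[1:]:
        C, H, W = feat.shape
        # Restrict the nearest-neighbour search to a window around the
        # previous position to keep the matching local and cheap.
        xs, xe = max(0, x - radius), min(W, x + radius + 1)
        ys, ye = max(0, y - radius), min(H, y + radius + 1)
        window = feat[:, ys:ye, xs:xe]                 # (C, h, w)
        dist = (window - template[:, None, None]).abs().sum(dim=0)
        j = int(dist.argmin())
        dy, dx = divmod(j, window.shape[2])
        x, y = xs + dx, ys + dy
        track.append((x, y))
    return track

Because the matching is purely local and requires no extra network, this mechanism can be evaluated directly on GAN-generated frame pairs and compared against learned trackers such as RAFT and PIPs.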
3 METHOD
This work aims to develop an interactive image manipulation method for GANs where users only need to click on the images to define some pairs of (handle point, target point) and drive the handle points to reach their corresponding target points. Our study is based on the StyleGAN2 architecture [Karras et al. 2020]. Here we briefly introduce the basics of this architecture.
StyleGAN Terminology. In the StyleGAN2 architecture, a 512-dimensional latent code 𝒛 ∈ N(0, 𝑰) is mapped to an intermediate latent code 𝒘 ∈ R^512 via a mapping network. The space of 𝒘 is commonly referred to as W. 𝒘 is then sent to the generator 𝐺 to produce the output image I = 𝐺(𝒘). In this process, 𝒘 is copied several times and sent to different layers of the generator 𝐺 to control different levels of attributes. Alternatively, one can also use different 𝒘 for different layers, in which case the input would be 𝒘 ∈ R^(𝑙×512) = W+, where 𝑙 is the number of layers. This less constrained W+ space is shown to be more expressive [Abdal et al. 2019]. As the generator 𝐺 learns a mapping from a low-dimensional latent space to a much higher dimensional image space, it can be seen as modelling an image manifold [Zhu et al. 2016].
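For readers who prefer code, the snippet below sketches these shapes with a stand-in mapping network; the layer count and the tiny two-layer MLP are placeholders for illustration only, not the actual StyleGAN2 implementation.

import torch

num_layers = 14                          # l: assumed layer count, for illustration
mapping = torch.nn.Sequential(           # stand-in for StyleGAN2's mapping network
    torch.nn.Linear(512, 512), torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(512, 512))

z = torch.randn(1, 512)                  # z ~ N(0, I)
w = mapping(z)                           # w in R^512, an element of the W space

# W space: the same w is broadcast to every layer of the generator.
w_layers = w.unsqueeze(1).repeat(1, num_layers, 1)          # shape (1, l, 512)

# W+ space: each layer may receive its own code, e.g. after GAN inversion;
# the shape is the same, but the rows are allowed to differ.
w_plus = w_layers.clone()
w_plus[:, 8:] = mapping(torch.randn(1, 512))                # mix codes per layer

Optimizing per-layer codes of this form is a common choice for latent-space editing, since the extra degrees of freedom often make precise spatial edits easier to realize.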
3.1 Interactive Point-based Manipulation
An overview of our image manipulation pipeline is shown in Fig. 2. For any image I ∈ R^(3×𝐻×𝑊) generated by a GAN with latent code 𝒘, we allow the user to input a number of handle points {𝒑_𝑖 = (𝑥_{𝑝,𝑖}, 𝑦_{𝑝,𝑖}) | 𝑖 = 1, 2, ..., 𝑛} and their corresponding target points {𝒕_𝑖 = (𝑥_{𝑡,𝑖}, 𝑦_{𝑡,𝑖}) | 𝑖 = 1, 2, ..., 𝑛} (i.e., the corresponding target point of 𝒑_𝑖 is 𝒕_𝑖). The goal is to move the object in the image such that the semantic positions of the handle points reach their corresponding target points.