# Taming Transformers for High-Resolution Image Synthesis
##### CVPR 2021 (Oral)
![teaser](assets/mountain.jpeg)
[**Taming Transformers for High-Resolution Image Synthesis**](https://compvis.github.io/taming-transformers/)<br/>
[Patrick Esser](https://github.com/pesser)\*,
[Robin Rombach](https://github.com/rromb)\*,
[Björn Ommer](https://hci.iwr.uni-heidelberg.de/Staff/bommer)<br/>
\* equal contribution
**tl;dr** We combine the efficiency of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.
![teaser](assets/teaser.png)
[arXiv](https://arxiv.org/abs/2012.09841) | [BibTeX](#bibtex) | [Project Page](https://compvis.github.io/taming-transformers/)
### News
#### 2022
- More pretrained VQGANs (e.g. an f8-model with only 256 codebook entries) are available in our new work on [Latent Diffusion Models](https://github.com/CompVis/latent-diffusion).
- Added scene synthesis models as proposed in the paper [High-Resolution Complex Scene Synthesis with Transformers](https://arxiv.org/abs/2105.06458), see [this section](#scene-image-synthesis).
#### 2021
- Thanks to [rom1504](https://github.com/rom1504) it is now easy to [train a VQGAN on your own datasets](#training-on-custom-data).
- Included a bugfix for the quantizer. For backward compatibility it is
disabled by default (which corresponds to always training with `beta=1.0`).
Use `legacy=False` in the quantizer config to enable it.
Thanks [richcmwang](https://github.com/richcmwang) and [wcshin-git](https://github.com/wcshin-git)!
- Our paper received an update: See https://arxiv.org/abs/2012.09841v3 and the corresponding changelog.
- Added a pretrained, [1.4B transformer model](https://k00.fr/s511rwcv) trained for class-conditional ImageNet synthesis, which obtains state-of-the-art FID scores among autoregressive approaches and outperforms BigGAN.
- Added pretrained, unconditional models on [FFHQ](https://k00.fr/yndvfu95) and [CelebA-HQ](https://k00.fr/2xkmielf).
- Added accelerated sampling via caching of keys/values in the self-attention operation, used in `scripts/sample_fast.py`.
- Added a checkpoint of a [VQGAN](https://heibox.uni-heidelberg.de/d/2e5662443a6b4307b470/) trained with f8 compression and Gumbel-Quantization.
See also our updated [reconstruction notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/reconstruction_usage.ipynb).
- We added a [colab notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/reconstruction_usage.ipynb) which compares two VQGANs and OpenAI's [DALL-E](https://github.com/openai/DALL-E). See also [this section](#more-resources).
- We now include an overview of pretrained models in [Tab.1](#overview-of-pretrained-models). We added models for [COCO](#coco) and [ADE20k](#ade20k).
- The streamlit demo now supports image completions.
- We now include a couple of examples from the D-RIN dataset so you can run the
[D-RIN demo](#d-rin) without preparing the dataset first.
- You can now jump right into sampling with our [Colab quickstart notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/taming-transformers.ipynb).
## Requirements
A suitable [conda](https://conda.io/) environment named `taming` can be created
and activated with:
```
conda env create -f environment.yaml
conda activate taming
```
## Overview of pretrained models
The following table provides an overview of all models that are currently available.
FID scores were evaluated using [torch-fidelity](https://github.com/toshas/torch-fidelity).
For reference, we also include a link to the recently released autoencoder of the [DALL-E](https://github.com/openai/DALL-E) model.
See the corresponding [colab
notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/reconstruction_usage.ipynb)
for a comparison and discussion of reconstruction capabilities.
| Dataset | FID vs train | FID vs val | Link | Samples (256x256) | Comments |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| FFHQ (f=16) | 9.6 | -- | [ffhq_transformer](https://k00.fr/yndvfu95) | [ffhq_samples](https://k00.fr/j626x093) | |
| CelebA-HQ (f=16) | 10.2 | -- | [celebahq_transformer](https://k00.fr/2xkmielf) | [celebahq_samples](https://k00.fr/j626x093) | |
| ADE20K (f=16) | -- | 35.5 | [ade20k_transformer](https://k00.fr/ot46cksa) | [ade20k_samples.zip](https://heibox.uni-heidelberg.de/f/70bb78cbaf844501b8fb/) [2k] | evaluated on val split (2k images) |
| COCO-Stuff (f=16) | -- | 20.4 | [coco_transformer](https://k00.fr/2zz6i2ce) | [coco_samples.zip](https://heibox.uni-heidelberg.de/f/a395a9be612f4a7a8054/) [5k] | evaluated on val split (5k images) |
| ImageNet (cIN) (f=16) | 15.98/15.78/6.59/5.88/5.20 | -- | [cin_transformer](https://k00.fr/s511rwcv) | [cin_samples](https://k00.fr/j626x093) | different decoding hyperparameters |
| | | | | | |
| FacesHQ (f=16) | -- | -- | [faceshq_transformer](https://k00.fr/qqfl2do8) | | |
| S-FLCKR (f=16) | -- | -- | [sflckr](https://heibox.uni-heidelberg.de/d/73487ab6e5314cb5adba/) | | |
| D-RIN (f=16) | -- | -- | [drin_transformer](https://k00.fr/39jcugc5) | | |
| | | | | | |
| VQGAN ImageNet (f=16), 1024 | 10.54 | 7.94 | [vqgan_imagenet_f16_1024](https://heibox.uni-heidelberg.de/d/8088892a516d4e3baf92/) | [reconstructions](https://k00.fr/j626x093) | Reconstruction-FIDs. |
| VQGAN ImageNet (f=16), 16384 | 7.41 | 4.98 | [vqgan_imagenet_f16_16384](https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/) | [reconstructions](https://k00.fr/j626x093) | Reconstruction-FIDs. |
| VQGAN OpenImages (f=8), 256 | -- | 1.49 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | --- | Reconstruction-FIDs. Available via [latent diffusion](https://github.com/CompVis/latent-diffusion). |
| VQGAN OpenImages (f=8), 16384 | -- | 1.14 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | --- | Reconstruction-FIDs. Available via [latent diffusion](https://github.com/CompVis/latent-diffusion). |
| VQGAN OpenImages (f=8), 8192, GumbelQuantization | 3.24 | 1.49 | [vqgan_gumbel_f8](https://heibox.uni-heidelberg.de/d/2e5662443a6b4307b470/) | --- | Reconstruction-FIDs. |
| | | | | | |
| DALL-E dVAE (f=8), 8192, GumbelQuantization | 33.88 | 32.01 | https://github.com/openai/DALL-E | [reconstructions](https://k00.fr/j626x093) | Reconstruction-FIDs. |
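The FID scores above were computed with [torch-fidelity](https://github.com/toshas/torch-fidelity). A minimal sketch of such an evaluation, assuming samples and reference images have already been written to two folders (the directory names are placeholders):
```
# FID between a folder of generated samples and a folder of reference images
fidelity --gpu 0 --fid --input1 <path/to/samples> --input2 <path/to/reference_images>
```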
## Running pretrained models
The commands below start a streamlit demo which supports sampling at
different resolutions and image completions. To run a non-interactive version
of the sampling process, replace `streamlit run scripts/sample_conditional.py --`
with `python scripts/make_samples.py --outdir <path_to_write_samples_to>` and
keep the remaining command-line arguments.
To sample from unconditional or class-conditional models,
run `python scripts/sample_fast.py -r <path/to/config_and_checkpoint>`.
We describe below how to use this script to sample from the ImageNet, FFHQ, and CelebA-HQ models,
respectively.
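As a sketch of the substitution described above, the interactive and non-interactive invocations for a conditional model differ only in the entry point (`<log_folder>` and `<path_to_write_samples_to>` are placeholders):
```
# interactive streamlit demo
streamlit run scripts/sample_conditional.py -- -r logs/<log_folder>/
# non-interactive sampling with the same remaining arguments
python scripts/make_samples.py --outdir <path_to_write_samples_to> -r logs/<log_folder>/
```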
### S-FLCKR
![teaser](assets/sunset_and_ocean.jpg)
You can also [run this model in a Colab
notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/taming-transformers.ipynb),
which includes all necessary steps to start sampling.
Download the
[2020-11-09T13-31-51_sflckr](https://heibox.uni-heidelberg.de/d/73487ab6e5314cb5adba/)
folder and place it into `logs`. Then, run
```
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-09T13-31-51_sflckr/
```
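To sample from the same model non-interactively, a sketch following the substitution above (the output directory is an arbitrary choice):
```
python scripts/make_samples.py --outdir outputs/sflckr -r logs/2020-11-09T13-31-51_sflckr/
```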
### ImageNet
![teaser](assets/imagenet.png)
Download the [2021-04-03T19-39-50_cin_transformer](https://k00.fr/s511rwcv)
folder and place it into `logs`. Sampling from the class-conditional ImageNet
model does not require any data preparation. To produce samples for each of
the 1000 ImageNet classes, run `scripts/sample_fast.py` with the downloaded checkpoint.
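A minimal sketch of such an invocation, using only the `-r` flag described above (further sampling options are script arguments and not shown here):
```
# class-conditional ImageNet sampling from the downloaded checkpoint
python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/
```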