# How Do Vision Transformers Work?
[[paper](https://openreview.net/forum?id=D78Go4hVcxO), [arxiv](https://arxiv.org/abs/2202.06709), [poster](https://github.com/xxxnell/how-do-vits-work-storage/blob/master/resources/how_do_vits_work_poster_iclr2022.pdf), [slide](https://github.com/xxxnell/how-do-vits-work-storage/blob/master/resources/how_do_vits_work_talk.pdf)]
This repository provides a PyTorch implementation of ["How Do Vision Transformers Work? (ICLR 2022 Spotlight)"](https://openreview.net/forum?id=D78Go4hVcxO). In the paper, we show that the success of multi-head self-attentions (MSAs) for computer vision is ***NOT due to their weak inductive bias and capturing long-range dependency***.
In particular, we address the following three key questions of MSAs and Vision Transformers (ViTs):
***Q1. What properties of MSAs do we need to better optimize NNs?***
A1. MSAs have their pros and cons. MSAs improve NNs by flattening the loss landscapes. A key feature is their data specificity (data dependency), not long-range dependency. On the other hand, ViTs suffer from non-convex losses.
***Q2. Do MSAs act like Convs?***
A2. MSAs and Convs exhibit opposite behaviors; e.g., MSAs are low-pass filters, but Convs are high-pass filters. This suggests that MSAs are shape-biased, whereas Convs are texture-biased. Therefore, MSAs and Convs are complementary.
***Q3. How can we harmonize MSAs with Convs?***
A3. MSAs at the end of a stage (not a model) significantly improve the accuracy. Based on this, we introduce *AlterNet* by replacing Convs at the end of a stage with MSAs. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes.
Let's find the detailed answers below!
### I. What Properties of MSAs Do We Need to Improve Optimization?
<p align="center">
<img src="resources/vit/loss-landscape.png" style="width:90%;">
</p>
MSAs improve not only accuracy but also generalization by flattening the loss landscapes. ***Such improvement is primarily attributable to their data specificity, NOT long-range dependency.*** On the other hand, ViTs suffer from non-convex losses. Their weak inductive bias and long-range dependency produce negative Hessian eigenvalues in small data regimes, and these non-convex points disrupt NN training. Large datasets and loss-landscape smoothing methods alleviate this problem.
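As a rough illustration of the Hessian analysis (this is a sketch, not code from this repository), the dominant Hessian eigenvalue of a loss can be estimated with power iteration on Hessian-vector products; a negative estimate signals the non-convex points discussed above. Here `loss` is assumed to be a scalar computed with a live autograd graph, e.g. `loss = criterion(model(x), y)`.

```python
import torch

def dominant_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest-magnitude Hessian eigenvalue of `loss` w.r.t.
    `params` by power iteration on Hessian-vector products (a sketch; the
    paper reports full eigenvalue spectra, not a single eigenvalue)."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = None
    for _ in range(iters):
        # normalize the probe vector
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / (norm + 1e-12) for vi in v]
        # Hessian-vector product: H v = d/dp (grad(loss) . v)
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v approximates the dominant eigenvalue
        eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```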
### II. Do MSAs Act Like Convs?
<p align="center">
<img src="resources/vit/fourier.png" style="width:90%;">
</p>
MSAs and Convs exhibit opposite behaviors. Therefore, MSAs and Convs are complementary. For example, MSAs are low-pass filters, but Convs are high-pass filters. Likewise, Convs are vulnerable to high-frequency noise, whereas MSAs are vulnerable to low-frequency noise; this suggests that MSAs are shape-biased, whereas Convs are texture-biased. In addition, Convs transform feature maps and MSAs aggregate transformed feature map predictions. Thus, it is effective to place MSAs after Convs.
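The low-pass/high-pass distinction can be probed by comparing the Fourier amplitudes of feature maps, roughly in the spirit of `fourier_analysis.ipynb`. The sketch below is an approximation (the notebook averages over frequency bands) and reports only the log-amplitude difference between the highest and lowest frequencies of a feature map.

```python
import torch

def delta_log_amplitude(feature_map):
    """Log-amplitude difference between the highest and lowest frequency of a
    (B, C, H, W) feature map. Strongly negative values indicate low-pass
    behavior (as observed for MSAs); values closer to zero indicate more
    high-frequency content (as observed for Convs)."""
    f = torch.fft.fftshift(torch.fft.fft2(feature_map.float()), dim=(-2, -1))
    log_amp = torch.log(f.abs() + 1e-12)
    h, w = log_amp.shape[-2:]
    low = log_amp[..., h // 2, w // 2]   # DC (lowest-frequency) component
    high = log_amp[..., 0, 0]            # a highest-frequency corner after fftshift
    return (high - low).mean().item()
```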
### III. How Can We Harmonize MSAs With Convs?
<p align="center">
<img src="resources/vit/architecture.png" style="width:90%;">
</p>
Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage (not the end of a model) play a key role in prediction. Based on these insights, we propose design rules to harmonize MSAs with Convs. NN stages following this design pattern consist of a number of CNN blocks and one (or a few) MSA blocks. The design pattern naturally derives the structure of the canonical Transformer, which has one MLP block for each MSA block.
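As a schematic sketch of this design rule (illustrative only; see `models/alternet.py` for the actual AlterNet blocks), a stage can be built from several Conv blocks followed by a single self-attention block. `ToyMSABlock`, `make_stage`, and the `conv_block` constructor are placeholders, not names from this repository.

```python
import torch.nn as nn

class ToyMSABlock(nn.Module):
    """Minimal pre-norm self-attention block over spatial tokens (illustrative)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, HW, C) tokens
        n = self.norm(t)
        t = t + self.attn(n, n, n)[0]           # residual self-attention
        return t.transpose(1, 2).reshape(b, c, h, w)

def make_stage(conv_block, dim, depth):
    """Design rule: Conv blocks throughout the stage, one MSA block at its end."""
    blocks = [conv_block(dim) for _ in range(depth - 1)]
    blocks.append(ToyMSABlock(dim))
    return nn.Sequential(*blocks)
```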
<br />
<p align="center">
<img src="resources/vit/alternet.png" style="width:90%;">
</p>
Based on these design rules, we introduce AlterNet ([code](https://github.com/xxxnell/how-do-vits-work/blob/transformer/models/alternet.py)) by replacing Conv blocks at the end of a stage with MSA blocks. ***Surprisingly, AlterNet outperforms CNNs not only in large data regimes but also in small data regimes***, e.g., CIFAR. This contrasts with canonical ViTs, which perform poorly on small amounts of data. For more details, see the ["How to Apply MSA to Your Own Model"](#how-to-apply-msa-to-your-own-model) section below.
This repository is based on [the official implementation of "Blurs Behave Like Ensembles: Spatial Smoothings to Improve Accuracy, Uncertainty, and Robustness"](https://github.com/xxxnell/spatial-smoothing). In that paper, we show that a simple (non-trainable) 2 × 2 box blur filter improves accuracy, uncertainty, and robustness simultaneously by ensembling spatially nearby feature maps of CNNs. MSA is not simply a generalized Conv, but rather a generalized (trainable) blur filter that complements Conv. Please check it out!
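For reference, such a non-trainable 2 × 2 box blur can be written as a tiny PyTorch module. This is a sketch of the idea rather than the exact spatial-smoothing implementation.

```python
import torch.nn as nn

# 2 x 2 box blur over feature maps: replication padding keeps the spatial
# size, and stride-1 average pooling performs the uniform 2 x 2 averaging.
box_blur = nn.Sequential(
    nn.ReplicationPad2d((0, 1, 0, 1)),
    nn.AvgPool2d(kernel_size=2, stride=1),
)
```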
## Getting Started
The following packages are required:
* pytorch
* matplotlib
* notebook
* ipywidgets
* timm
* einops
* tensorboard
* seaborn (optional)
We mainly use the Docker image `pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime` for the code.
See [```classification.ipynb```](classification.ipynb) ([Colab notebook](https://colab.research.google.com/github/xxxnell/how-do-vits-work/blob/transformer/classification.ipynb)) for image classification. Run all cells to train and test models on CIFAR-10, CIFAR-100, and ImageNet.
**Metrics.** We provide several metrics for measuring accuracy and uncertainty: Accuracy (Acc, ↑) and Acc for 90% certain results (Acc-90, ↑), negative log-likelihood (NLL, ↓), Expected Calibration Error (ECE, ↓), Intersection-over-Union (IoU, ↑) and IoU for certain results (IoU-90, ↑), Unconfidence (Unc-90, ↓), and Frequency for certain results (Freq-90, ↑). We also define a method to plot a reliability diagram for visualization.
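For instance, ECE can be computed with a standard binned formulation as sketched below (the repository's implementation may differ in details). Here `confidences` are per-sample maximum softmax probabilities and `correctness` is a boolean tensor marking whether each prediction is correct.

```python
import torch

def expected_calibration_error(confidences, correctness, n_bins=15):
    """ECE: |accuracy - confidence| per confidence bin, weighted by the
    fraction of samples falling into that bin."""
    bins = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correctness[mask].float().mean()
            conf = confidences[mask].mean()
            ece = ece + mask.float().mean() * (acc - conf).abs()
    return ece.item()
```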
**Models.** We provide AlexNet, VGG, pre-activation VGG, ResNet, pre-activation ResNet, ResNeXt, WideResNet, ViT, PiT, Swin, MLP-Mixer, and Alter-ResNet by default. timm implementations can also be used.
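For ImageNet-1K, pretrained models can be created with the standard timm API, for example:

```python
import timm

# Pretrained ImageNet-1K models via the standard timm API
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True)
resnet = timm.create_model("resnet50", pretrained=True)
```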
<details>
<summary>
Four pretrained models for CIFAR-100 are also provided: <a href="https://github.com/xxxnell/how-do-vits-work-storage/releases/download/v0.1/resnet_50_cifar100_691cc9a9e4.pth.tar">ResNet-50</a>, <a href="https://github.com/xxxnell/how-do-vits-work-storage/releases/download/v0.1/vit_ti_cifar100_9857b21357.pth.tar">ViT-Ti</a>, <a href="https://github.com/xxxnell/how-do-vits-work-storage/releases/download/v0.1/pit_ti_cifar100_0645889efb.pth.tar">PiT-Ti</a>, and <a href="https://github.com/xxxnell/how-do-vits-work-storage/releases/download/v0.1/swin_ti_cifar100_ec2894492b.pth.tar">Swin-Ti</a>. We recommend using <a href="https://github.com/rwightman/pytorch-image-models">timm</a> for ImageNet-1K (e.g., please refer to <code><a href="https://github.com/xxxnell/how-do-vits-work/blob/transformer/fourier_analysis.ipynb">fourier_analysis.ipynb</a></code>).
</summary>
<br/>
The snippets below show how to (a) load a pretrained model and (b) convert it into a sequence of blocks.
<br/>
```python
# ResNet-50
import torch

import models

# a. download and load a pretrained model for CIFAR-100
url = "https://github.com/xxxnell/how-do-vits-work-storage/releases/download/v0.1/resnet_50_cifar100_691cc9a9e4.pth.tar"
path = "checkpoints/resnet_50_cifar100_691cc9a9e4.pth.tar"
models.download(url=url, path=path)

name = "resnet_50"
num_classes = 100  # CIFAR-100
model = models.get_model(name, num_classes=num_classes,
                         stem=False)  # timm does not provide a ResNet for CIFAR
map_location = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = torch.load(path, map_location=map_location)
model.load_state_dict(checkpoint["state_dict"])

# b. model → blocks. `blocks` is a sequence of blocks
blocks = [
    model.layer0,
    *model.layer1,
    *model.layer2,
    *model.layer3,
    *model.layer4,
    model.classifier,
]
```
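With `blocks` in hand, intermediate feature maps can be collected block by block, e.g. for the Fourier analysis sketched above. This usage sketch continues from the snippet above and feeds a dummy CIFAR-sized input through every block except the classifier head.

```python
import torch

model.eval()
xs = torch.randn(2, 3, 32, 32)      # dummy CIFAR-sized batch
feature_maps = []
with torch.no_grad():
    for block in blocks[:-1]:       # skip the classifier head
        xs = block(xs)
        feature_maps.append(xs)
```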
```python
# ViT-Ti
import