MaxViT: Multi-Axis Vision Transformer (maxvit resource, CSDN Library)

23 files in total

py: 15

png: 3

md: 2

Uploaded 2022-10-26 13:07:35, 1.46 MB ZIP

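MaxViT (Multi-Axis Vision Transformer) is a work from the Google Research team published at ECCV 2022. The paper studies how to combine convolutional neural networks (CNNs) with the Transformer architecture to improve performance on computer vision tasks. Transformers have been highly successful in natural language processing, and MaxViT aims to carry those strengths over to image processing.

The main contribution is a new block design that uses both the local feature extraction of CNNs and the global context modeling of Transformers. A standard Transformer relies entirely on self-attention over all tokens, whose cost grows quadratically with the number of tokens and becomes prohibitive on high-resolution images. CNNs capture local features well but are weaker at modeling long-range dependencies. MaxViT balances the two.

"Multi-axis" refers to the attention mechanism: self-attention is decomposed into a local form (block attention inside non-overlapping windows) and a sparse global form (grid attention over a strided grid of tokens). Because each form attends over a fixed number of tokens, the cost is linear in image size, so the network can mix information globally even in its early, high-resolution stages. Each MaxViT block applies a convolutional (MBConv) layer followed by block attention and grid attention, and the blocks are stacked into a hierarchical backbone whose stages run at progressively lower spatial resolution.

The MaxViT code is public, so researchers and developers can use it directly for experiments and application development. The pretrained weights must be downloaded by the user; starting from a pretrained model speeds up convergence and improves accuracy.

MaxViT can be applied to image classification, object detection, semantic segmentation, and other computer vision tasks. By combining the strengths of CNNs and Transformers, it is a candidate for complex scenes and fine-grained tasks in areas such as autonomous driving and medical image analysis. The work shows that two quite different design philosophies can be merged into one architecture for more efficient image understanding, and it suggests directions for future research.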
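The two attention axes described in the MaxViT paper, local block attention and sparse global grid attention, differ only in how the feature map is split into token groups before self-attention runs inside each group. The NumPy sketch below illustrates that partitioning step under stated assumptions: the function names `block_partition` and `grid_partition` are hypothetical, the input is a single `(H, W, C)` feature map whose sides are divisible by the window or grid size, and this is not the repository's TensorFlow code.

```python
import numpy as np


def block_partition(x, p):
    """Split an (H, W, C) map into non-overlapping p x p windows.

    Returns (num_windows, p * p, C). Attention inside each window is local.
    Hypothetical helper for illustration only.
    """
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/p, w/p, p, p, c)
    return x.reshape(-1, p * p, c)


def grid_partition(x, g):
    """Split an (H, W, C) map into groups of g x g strided tokens.

    Returns (num_groups, g * g, C). Tokens in one group are spaced
    H/g rows and W/g columns apart, so attention inside a group
    spans the whole image (sparse, global).
    Hypothetical helper for illustration only.
    """
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)  # (h/g, w/g, g, g, c)
    return x.reshape(-1, g * g, c)


# Toy 8 x 8 single-channel map whose values are the flat pixel indices.
feature_map = np.arange(64).reshape(8, 8, 1)
windows = block_partition(feature_map, 4)  # shape (4, 16, 1)
grids = grid_partition(feature_map, 4)     # shape (4, 16, 1)
print(windows[0, :, 0])  # pixel indices of the top-left 4 x 4 window
print(grids[0, :, 0])    # pixel indices taken every 2nd row and column
```

In both cases each group holds a fixed number of tokens (here 16), so the number of attention pairs grows linearly with the number of pixels instead of quadratically.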


maxvit-main.zip (23 files)

maxvit-main

maxvit

models

vision.py 3KB

__init__.py 576B

vision_i1k.py 4KB

utils.py 7KB

maxvit.py 42KB

attention_utils.py 5KB

__init__.py 576B

hparams_registry.py 770B

hparams.py 1KB

common_ops.py 5KB

hparam_configs.py 8KB

eval_ckpt.py 11KB

test_maxvit.py 675B

__init__.py 576B

LICENSE 11KB

CONTRIBUTING.md 1KB

requirements.txt 114B

doc

imagenet_results.png 318KB

maxvit_arch.png 212KB

i21k_jft_results.png 261KB

setup.py 1KB

README.md 9KB

MaxViT_tutorial.ipynb 949KB

# MaxViT: Multi-Axis Vision Transformer (ECCV 2022)

[![Paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2204.01697) [![Tutorial In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-research/maxvit/blob/master/MaxViT_tutorial.ipynb) [![video](https://img.shields.io/badge/Video-Presentation-F9D371)](https://youtu.be/WEgB4lAZyKM)

This repository hosts the official TensorFlow implementation of MaxViT models:

__[MaxViT: Multi-Axis Vision Transformer](https://arxiv.org/abs/2204.01697)__. ECCV 2022.\
[Zhengzhong Tu](https://twitter.com/_vztu), [Hossein Talebi](https://scholar.google.com/citations?hl=en&user=UOX9BigAAAAJ), [Han Zhang](https://sites.google.com/view/hanzhang), [Feng Yang](https://sites.google.com/view/feng-yang), [Peyman Milanfar](https://sites.google.com/view/milanfarhome/), [Alan Bovik](https://www.ece.utexas.edu/people/faculty/alan-bovik), and [Yinxiao Li](https://scholar.google.com/citations?user=kZsIU74AAAAJ&hl=en)\
Google Research, University of Texas at Austin

*Disclaimer: This is not an officially supported Google product.*

**News**:

- Oct 12, 2022: Added the remaining ImageNet-1K and -21K checkpoints.
- Oct 4, 2022: A list of updates
  * Added MaxViTTiny and MaxViTSmall checkpoints.
  * Added a Colab tutorial.
- Sep 8, 2022: Our Google AI blog covering both [MaxViT](https://arxiv.org/abs/2204.01697) and [MAXIM](https://github.com/google-research/maxim) is [live](https://ai.googleblog.com/2022/09/a-multi-axis-approach-for-vision.html).
- Sep 7, 2022: [@rwightman](https://github.com/rwightman) released a few small model weights in [timm](https://github.com/rwightman/pytorch-image-models#aug-26-2022). They achieve even better results than our paper. See more [here](https://github.com/rwightman/pytorch-image-models#aug-26-2022).
- Aug 26, 2022: Our MaxViT models have been implemented in [timm (pytorch-image-models)](https://github.com/rwightman/pytorch-image-models#aug-26-2022). Kudos to [@rwightman](https://github.com/rwightman)!
- July 21, 2022: Initial code release of [MaxViT models](https://arxiv.org/abs/2204.01697): accepted to ECCV'22.
- Apr 6, 2022: MaxViT has been implemented by [@lucidrains](https://github.com/lucidrains): [vit-pytorch](https://github.com/lucidrains/vit-pytorch#maxvit) :scream: :exploding_head:
- Apr 4, 2022: Initial uploads to [arXiv](https://arxiv.org/abs/2204.01697)

## MaxViT Models

[MaxViT](https://arxiv.org/abs/2204.01697) is a family of hybrid (CNN + ViT) image classification models that achieve better performance across the board, in both parameter and FLOPs efficiency, than both SoTA ConvNets and Transformers. They also scale well to large datasets such as ImageNet-21K. Notably, due to the linear complexity of the grid attention used, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages.

MaxViT meta-architecture:

<img src = "./doc/maxvit_arch.png" width="80%">

Results on ImageNet-1k train and test:

<img src = "./doc/imagenet_results.png" width="80%">

Results on ImageNet-21k and JFT pre-trained models:

<img src = "./doc/i21k_jft_results.png" width="80%">

## Colab Demo

We have released a Google Colab demo with tutorials on how to run MaxViT on images. Try it here [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-research/maxvit/blob/master/MaxViT_tutorial.ipynb)

## Pretrained MaxViT Checkpoints

We have provided a list of results and checkpoints as follows:

| Name | Resolution | Top1 Acc. | #Params | FLOPs | Model |
| ---------- | ---------| ------ | ------ | ------ | ------ |
| MaxViT-T | 224x224 | 83.62% | 31M | 5.6B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvittiny/i1k/224) |
| MaxViT-T | 384x384 | 85.24% | 31M | 17.7B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvittiny/i1k/384) |
| MaxViT-T | 512x512 | 85.72% | 31M | 33.7B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvittiny/i1k/512) |
| MaxViT-S | 224x224 | 84.45% | 69M | 11.7B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitsmall/i1k/224) |
| MaxViT-S | 384x384 | 85.74% | 69M | 36.1B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitsmall/i1k/384) |
| MaxViT-S | 512x512 | 86.19% | 69M | 67.6B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitsmall/i1k/512) |
| MaxViT-B | 224x224 | 84.95% | 119M | 24.2B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i1k/224) |
| MaxViT-B | 384x384 | 86.34% | 119M | 74.2B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i1k/384) |
| MaxViT-B | 512x512 | 86.66% | 119M | 138.5B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i1k/512) |
| MaxViT-L | 224x224 | 85.17% | 212M | 43.9B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i1k/224) |
| MaxViT-L | 384x384 | 86.40% | 212M | 133.1B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i1k/384) |
| MaxViT-L | 512x512 | 86.70% | 212M | 245.4B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i1k/512) |

Here is a list of ImageNet-21K pretrained and ImageNet-1K finetuned models:

| Name | Resolution | Top1 Acc. | #Params | FLOPs | 21k model | 1k model |
| ---------- | ------ | ------ | ------ | ------ | ------ | --------|
| MaxViT-B | 224x224 | - | 119M | 24.2B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i21k_pt/224) | - |
| MaxViT-B | 384x384 | - | 119M | 74.2B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i21k_i1k/384) |
| MaxViT-B | 512x512 | - | 119M | 138.5B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i21k_i1k/512) |
| MaxViT-L | 224x224 | - | 212M | 43.9B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i21k_pt/224) | - |
| MaxViT-L | 384x384 | - | 212M | 133.1B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i21k_i1k/384) |
| MaxViT-L | 512x512 | - | 212M | 245.4B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i21k_i1k/512) |
| MaxViT-XL | 224x224 | - | 475M | 97.8B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitxlarge/i21k_pt/224) | - |
| MaxViT-XL | 384x384 | - | 475M | 293.7B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitxlarge/i21k_i1k/384) |
| MaxViT-XL | 512x512 | - | 475M | 535.2B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitxlarge/i21k_i1k/512) |

## Citation

Should you find this repository useful, please consider citing:

```
@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}
```

## Other Related Works

* MAXIM: Multi-Axis ML
