Python library | vit-pytorch-0.9.3.tar.gz (CSDN Library resource)


Uploaded 2022-04-19 00:10:22 · 10KB · GZ

16 files in total

py: 8 files

txt: 4 files

pkg-info: 2 files


vit-pytorch-0.9.3.tar.gz (16 files)

vit-pytorch-0.9.3

PKG-INFO 588B

vit_pytorch

t2t.py 3KB

deepvit.py 4KB

distill.py 4KB

vit.py 4KB

__init__.py 32B

mpp.py 6KB

efficient.py 2KB

vit_pytorch.egg-info

PKG-INFO 588B

requires.txt 23B

SOURCES.txt 344B

top_level.txt 12B

dependency_links.txt 1B

setup.cfg 38B

setup.py 757B

README.md 12KB

<img src="./vit.gif" width="500px"></img>

## Vision Transformer - Pytorch

Implementation of <a href="https://openreview.net/pdf?id=YicbFdNTTy">Vision Transformer</a>, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch. Significance is further explained in <a href="https://www.youtube.com/watch?v=TrdevFK_am4">Yannic Kilcher's</a> video. There's really not much to code here, but may as well lay it out for everyone so we expedite the attention revolution.

For a Pytorch implementation with pretrained models, please see Ross Wightman's repository <a href="https://github.com/rwightman/pytorch-image-models">here</a>.

The official Jax repository is <a href="https://github.com/google-research/vision_transformer">here</a>.

## Install

```bash
$ pip install vit-pytorch
```

## Usage

```python
import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)
mask = torch.ones(1, 8, 8).bool() # optional mask, designating which patch to attend to

preds = v(img, mask = mask) # (1, 1000)
```

## Parameters

- `image_size`: int. Image size. If you have rectangular images, make sure your image size is the maximum of the width and height.
- `patch_size`: int. Size of each square patch, in pixels. `image_size` must be divisible by `patch_size`. The number of patches is `n = (image_size // patch_size) ** 2`, and `n` **must be greater than 16**.
- `num_classes`: int. Number of classes to classify.
- `dim`: int. Last dimension of output tensor after linear transformation `nn.Linear(..., dim)`.
- `depth`: int. Number of Transformer blocks.
- `heads`: int. Number of heads in Multi-head Attention layer.
- `mlp_dim`: int. Dimension of the MLP (FeedForward) layer.
- `channels`: int, default `3`. Number of image's channels.
- `dropout`: float between `[0, 1]`, default `0.`. Dropout rate.
- `emb_dropout`: float between `[0, 1]`, default `0`. Embedding dropout rate.
- `pool`: string, either `cls` token pooling or `mean` pooling.

## Distillation

<img src="./distill.png" width="300px"></img>

A recent <a href="https://arxiv.org/abs/2012.12877">paper</a> has shown that use of a distillation token for distilling knowledge from convolutional nets to vision transformer can yield small and efficient vision transformers. This repository offers the means to do distillation easily.

For example, distilling from Resnet50 (or any teacher) to a vision transformer:

```python
import torch
from torchvision.models import resnet50

from vit_pytorch.distill import DistillableViT, DistillWrapper

teacher = resnet50(pretrained = True)

v = DistillableViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

distiller = DistillWrapper(
    student = v,
    teacher = teacher,
    temperature = 3,           # temperature of distillation
    alpha = 0.5                # trade-off between main loss and distillation loss
)

img = torch.randn(2, 3, 256, 256)
labels = torch.randint(0, 1000, (2,))

loss = distiller(img, labels)
loss.backward()

# after lots of training above ...

pred = v(img) # (2, 1000)
```

The `DistillableViT` class is identical to `ViT` except for how the forward pass is handled, so you should be able to load the parameters back to `ViT` after you have completed distillation training.

You can also use the handy `.to_vit` method on the `DistillableViT` instance to get back a `ViT` instance.

```python
v = v.to_vit()
type(v) # <class 'vit_pytorch.vit.ViT'>
```

## Deep ViT

This <a href="https://arxiv.org/abs/2103.11886">paper</a> notes that ViT struggles to attend at greater depths (past 12 layers), and suggests mixing the attention of each head post-softmax as a solution, dubbed Re-attention. The results line up with the <a href="https://github.com/lucidrains/x-transformers#talking-heads-attention">Talking Heads</a> paper from NLP.
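
To make "mixing the attention of each head post-softmax" concrete, here is a minimal sketch of a Re-attention layer. It is an illustration under stated assumptions, not necessarily the code in `deepvit.py`: the class name `ReAttention`, the attribute names `mix` and `mix_norm`, and the choice of `nn.LayerNorm` over the head axis are all hypothetical.

```python
import torch
from torch import nn


class ReAttention(nn.Module):
    """Hypothetical sketch: self-attention whose post-softmax maps are mixed across heads."""

    def __init__(self, dim, heads=8, dim_head=64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        # learnable (heads x heads) matrix that recombines the attention maps
        self.mix = nn.Parameter(torch.randn(heads, heads))
        # normalization over the head axis (an assumption of this sketch)
        self.mix_norm = nn.LayerNorm(heads)
        self.to_out = nn.Linear(inner_dim, dim)

    def forward(self, x):
        b, n, _ = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        # (b, n, heads * dim_head) -> (b, heads, n, dim_head)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in qkv)

        # standard scaled dot-product attention, shape (b, heads, n, n)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)

        # re-attention: every output head g is a learned combination of all input heads h
        attn = torch.einsum('bhij,hg->bgij', attn, self.mix)
        attn = self.mix_norm(attn.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

        out = attn @ v                               # (b, heads, n, dim_head)
        out = out.transpose(1, 2).reshape(b, n, -1)  # merge heads back together
        return self.to_out(out)


x = torch.randn(2, 65, 128)  # 64 patch tokens + 1 cls token, embedding dim 128
layer = ReAttention(dim=128, heads=4, dim_head=32)
y = layer(x)                 # same shape as the input
```

The only difference from ordinary multi-head attention is the `einsum` and normalization applied after the softmax, which lets deeper layers build attention maps that are not near-copies of one another.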
You can use `DeepViT` as follows.

```python
import torch
from vit_pytorch.deepvit import DeepViT

v = DeepViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
```

## Token-to-Token ViT

<img src="./t2t.png" width="400px"></img>

<a href="https://arxiv.org/abs/2101.11986">This paper</a> proposes that the first couple layers should downsample the image sequence by unfolding, leading to overlapping image data in each token as shown in the figure above. You can use this variant of the `ViT` as follows.

```python
import torch
from vit_pytorch.t2t import T2TViT

v = T2TViT(
    dim = 512,
    image_size = 224,
    depth = 5,
    heads = 8,
    mlp_dim = 512,
    num_classes = 1000,
    t2t_layers = ((7, 4), (3, 2), (3, 2)) # tuples of the kernel size and stride of each consecutive layer of the initial token to token module
)

img = torch.randn(1, 3, 224, 224)
v(img) # (1, 1000)
```

## Masked Patch Prediction

Thanks to <a href="https://github.com/zankner">Zach</a>, you can train using the original masked patch prediction task presented in the paper, with the following code.
```python
import torch
from vit_pytorch import ViT
from vit_pytorch.mpp import MPP

model = ViT(
    image_size=256,
    patch_size=32,
    num_classes=1000,
    dim=1024,
    depth=6,
    heads=8,
    mlp_dim=2048,
    dropout=0.1,
    emb_dropout=0.1
)

mpp_trainer = MPP(
    transformer=model,
    patch_size=32,
    dim=1024,
    mask_prob=0.15,          # probability of using token in masked prediction task
    random_patch_prob=0.30,  # probability of randomly replacing a token being used for mpp
    replace_prob=0.50,       # probability of replacing a token being used for mpp with the mask token
)

opt = torch.optim.Adam(mpp_trainer.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = mpp_trainer(images)
    opt.zero_grad()
    loss.backward()
    opt.step()

# save your improved network
torch.save(model.state_dict(), './pretrained-net.pt')
```

## Research Ideas

### Self Supervised Training

You can train this with a near SOTA self-supervised learning technique, <a href="https://github.com/lucidrains/byol-pytorch">BYOL</a>, with the following code.

(1)

```bash
$ pip install byol-pytorch
```

(2)

```python
import torch
from vit_pytorch import ViT
from byol_pytorch import BYOL

model = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048
)

learner = BYOL(
    model,
    image_size = 256,
    hidden_layer = 'to_latent'
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average() # update moving average of target encoder

# save your improved network
torch.save(model.state_dict(), './pretrained-net.pt')
```

A pytorch-lightning script is ready for you to use at the repository link above.
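
The `learner.update_moving_average()` call in the BYOL loop keeps the target encoder as an exponential moving average of the online encoder. The sketch below shows that update in isolation; it is not the code inside `byol-pytorch`, and the helper name `ema_update`, the decay value `beta`, and the toy `nn.Linear` networks are assumptions made for illustration.

```python
import copy

import torch
from torch import nn


@torch.no_grad()
def ema_update(target, online, beta=0.99):
    """Hypothetical helper: target <- beta * target + (1 - beta) * online, per parameter."""
    for t, o in zip(target.parameters(), online.parameters()):
        t.mul_(beta).add_(o, alpha=1 - beta)


# toy stand-ins for the online and target encoders
online = nn.Linear(4, 2)
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)  # the target network is never trained by gradient descent

# pretend one optimizer step shifted the online weights by +1
with torch.no_grad():
    online.weight.add_(1.0)

before = target.weight.clone()
ema_update(target, online, beta=0.9)
# each target weight has now moved 10% of the way towards the online weight
```

A decay close to 1 makes the target change slowly, which is what gives the online network a stable regression target between steps.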
### Efficient Attention

There may be some coming from computer vision who think attention still suffers from quadratic costs. Fortunately, we have a lot of new techniques that may help. This repository offers a way for you to plug in your own sparse attention transformer.

An example with <a href="https://arxiv.org/abs/2102.03902">Nystromformer</a>

```bash
$ pip install nystrom-attention
```

```python
import torch
from vit_pytorch.efficient import ViT
from nystrom_attention import Nystromformer

ef
```
