# CSWin-Transformer, CVPR 2022
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cswin-transformer-a-general-vision/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=cswin-transformer-a-general-vision)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cswin-transformer-a-general-vision/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=cswin-transformer-a-general-vision)
This repo is the official implementation of ["CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows"](https://arxiv.org/pdf/2107.00652.pdf).
## Introduction
**CSWin Transformer** (the name `CSWin` stands for **C**ross-**S**haped **Win**dow), introduced in [arxiv](https://arxiv.org/abs/2107.00652), is a new general-purpose backbone for computer vision. It is a hierarchical Transformer that replaces the traditional full attention with our newly proposed cross-shaped window self-attention. This mechanism computes self-attention in horizontal and vertical stripes in parallel, and the two sets of stripes together form a cross-shaped window; each stripe is obtained by splitting the input feature into stripes of equal width. With CSWin, we can realize global attention at a limited computation cost.
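As a rough illustration of the stripe partition described above, the following sketch lists which positions a single token can attend to. It is plain Python rather than the repository's PyTorch code; the function names `cross_shaped_window` and `render`, and the 8x8 map with stripe width 2, are assumptions made up for this example.

```python
def cross_shaped_window(h, w, sw, row, col):
    """Return the set of (r, c) positions visible to the token at (row, col).

    Assumption: the h x w feature map is split into stripes of width `sw`,
    and the token attends within its horizontal stripe and its vertical
    stripe; the union of the two is the cross-shaped window.
    """
    if h % sw or w % sw:
        raise ValueError("h and w must be divisible by the stripe width")
    top = (row // sw) * sw    # first row of the horizontal stripe
    left = (col // sw) * sw   # first column of the vertical stripe
    horizontal = {(r, c) for r in range(top, top + sw) for c in range(w)}
    vertical = {(r, c) for r in range(h) for c in range(left, left + sw)}
    return horizontal | vertical


def render(h, w, sw, row, col):
    """Draw the window as text: Q = query token, # = attended, . = ignored."""
    window = cross_shaped_window(h, w, sw, row, col)
    lines = []
    for r in range(h):
        lines.append("".join(
            "Q" if (r, c) == (row, col) else "#" if (r, c) in window else "."
            for c in range(w)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(8, 8, 2, 3, 5))
```

In this toy setting the token sees 28 of the 64 positions (16 in its horizontal stripe plus 16 in its vertical stripe, minus the 4 they share), yet the window spans the full height and width of the map.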
CSWin Transformer achieves strong performance on ImageNet classification (87.5 top-1 accuracy on val with only 97G FLOPs) and ADE20K semantic segmentation (`55.7 mIoU` on val), surpassing previous models by a large margin.
![teaser](teaser.png)
## Main Results on ImageNet
| model | pretrain | resolution | acc@1 | #params | FLOPs | 22K model | 1K model |
|:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CSWin-T | ImageNet-1K | 224x224 | 82.8 | 23M | 4.3G | - | [model](https://github.com/microsoft/CSWin-Transformer/releases/download/v0.1.0/cswin_tiny_224.pth) |
| CSWin-S | ImageNet-1K | 224x224 | 83.6 | 35M | 6.9G | - | [model](https://github.com/microsoft/CSWin-Transformer/releases/download/v0.1.0/cswin_small_224.pth) |
| CSWin-B | ImageNet-1K | 224x224 | 84.2 | 78M | 15.0G | - | [model](https://github.com/microsoft/CSWin-Transformer/releases/download/v0.1.0/cswin_base_224.pth) |
| CSWin-B | ImageNet-1K | 384x384 | 85.5 | 78M | 47.0G | - | [model](https://github.com/microsoft/CSWin-Transformer/releases/download/v0.1.0/cswin_base_384.pth) |
| CSWin-L | ImageNet-22K | 224x224 | 86.5 | 173M | 31.5G | [model](https://github.com/microsoft/CSWin-Transformer/releases/download/v0.1.0/cswin_large_22k_224.pth) | [model](https://github.com/microsoft/CSWin-Transformer/releases/download/v0.1.0/cswin_large_224.pth) |
| CSWin-L | ImageNet-22K | 384x384 | 87.5 | 173M | 96.8G | - | [model](https://github.com/microsoft/CSWin-Transformer/releases/download/v0.1.0/cswin_large_384.pth) |
## Main Results on Downstream Tasks
**COCO Object Detection**
| Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs |
|:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CSWin-T | Mask R-CNN | ImageNet-1K | 3x | 49.0 | 43.6 | 42M | 279G |
| CSWin-S | Mask R-CNN | ImageNet-1K | 3x | 50.0 | 44.5 | 54M | 342G |
| CSWin-B | Mask R-CNN | ImageNet-1K | 3x | 50.8 | 44.9 | 97M | 526G |
| CSWin-T | Cascade Mask R-CNN | ImageNet-1K | 3x | 52.5 | 45.3 | 80M | 757G |
| CSWin-S | Cascade Mask R-CNN | ImageNet-1K | 3x | 53.7 | 46.4 | 92M | 820G |
| CSWin-B | Cascade Mask R-CNN | ImageNet-1K | 3x | 53.9 | 46.4 | 135M | 1004G |
**ADE20K Semantic Segmentation (val)**
| Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | mIoU (ms+flip) | #params | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CSWin-T | Semantic FPN | ImageNet-1K | 512x512 | 80K | 48.2 | - | 26M | 202G |
| CSWin-S | Semantic FPN | ImageNet-1K | 512x512 | 80K | 49.2 | - | 39M | 271G |
| CSWin-B | Semantic FPN | ImageNet-1K | 512x512 | 80K | 49.9 | - | 81M | 464G |
| CSWin-T | UPerNet | ImageNet-1K | 512x512 | 160K | 49.3 | 50.7 | 60M | 959G |
| CSWin-S | UPerNet | ImageNet-1K | 512x512 | 160K | 50.4 | 51.5 | 65M | 1027G |
| CSWin-B | UPerNet | ImageNet-1K | 512x512 | 160K | 51.1 | 52.2 | 109M | 1222G |
| CSWin-B | UPerNet | ImageNet-22K | 640x640 | 160K | 51.8 | 52.6 | 109M | 1941G |
| CSWin-L | UPerNet | ImageNet-22K | 640x640 | 160K | 53.4 | 55.7 | 208M | 2745G |
Pretrained models and code can be found in [`segmentation`](segmentation).
## Requirements
The code requires timm==0.3.4, pytorch>=1.4, opencv, ... To install the requirements, run:
```
bash install_req.sh
```
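After installation, a quick check of the pinned versions can catch a mismatched environment early. The snippet below is only a sketch: the helper names are hypothetical, it assumes the pip distribution names `timm` and `torch`, and it does not cover opencv or the other elided requirements.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a pip package, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '1.13.1+cu117' into (1, 13) for a coarse major.minor comparison."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_requirements():
    """Collect human-readable warnings about the pinned requirements."""
    warnings = []
    timm_version = installed_version("timm")
    if timm_version != "0.3.4":
        warnings.append(f"timm is {timm_version}, expected 0.3.4")
    torch_version = installed_version("torch")  # PyTorch's pip name is 'torch'
    if torch_version is None or version_tuple(torch_version) < (1, 4):
        warnings.append(f"torch is {torch_version}, expected >= 1.4")
    return warnings


if __name__ == "__main__":
    for line in check_requirements() or ["requirements look consistent"]:
        print(line)
```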
Apex for mixed precision training is used for finetuning. To install apex, run:
```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
Data preparation: arrange ImageNet in the following folder structure. You can extract ImageNet with this [script](https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4).
```
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
```
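Before starting a long run, it can help to confirm that the data directory matches the layout above. The following is a minimal sketch, not part of the repository: `check_imagenet_layout` is a hypothetical helper that only checks for `train/` and `val/` folders with per-class subfolders containing JPEG files.

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    problems = []
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        # Each class (e.g. n01440764) is expected to be its own subfolder.
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
        if not class_dirs:
            problems.append(f"no class folders in {split_dir}")
        for class_dir in class_dirs:
            has_image = any(
                f.is_file() and f.suffix.lower() in (".jpeg", ".jpg")
                for f in class_dir.iterdir()
            )
            if not has_image:
                problems.append(f"no JPEG images in {class_dir}")
    return problems


if __name__ == "__main__":
    import sys

    issues = check_imagenet_layout(sys.argv[1] if len(sys.argv) > 1 else "imagenet")
    print("\n".join(issues) if issues else "layout looks fine")
```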
## Train
Train the three lite variants: CSWin-Tiny, CSWin-Small and CSWin-Base:
```
bash train.sh 8 --data <data path> --model CSWin_64_12211_tiny_224 -b 256 --lr 2e-3 --weight-decay .05 --amp --img-size 224 --warmup-epochs 20 --model-ema-decay 0.99984 --drop-path 0.2
```
```
bash train.sh 8 --data <data path> --model CSWin_64_24322_small_224 -b 256 --lr 2e-3 --weight-decay .05 --amp --img-size 224 --warmup-epochs 20 --model-ema-decay 0.99984 --drop-path 0.4
```
```
bash train.sh 8 --data <data path> --model CSWin_96_24322_base_224 -b 128 --lr 1e-3 --weight-decay .1 --amp --img-size 224 --warmup-epochs 20 --model-ema-decay 0.99992 --drop-path 0.5
```
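The `--model` names in the commands above appear to encode the architecture. The sketch below decodes them under an assumption that is not stated in this README: the first number is the embedding dimension, the digit string lists the per-stage depths (so `12211` reads as 1, 2, 21, 1), and the last number is the input resolution. `parse_model_name` is a hypothetical helper; check the model definitions in the repository before relying on this reading.

```python
def parse_model_name(name):
    """Split a model name such as 'CSWin_64_12211_tiny_224' into its fields."""
    prefix, dim, stages, variant, resolution = name.split("_")
    if prefix != "CSWin" or len(stages) < 4:
        raise ValueError(f"unexpected model name: {name}")
    # Assumption: the first, second and last digits are single-digit stage
    # depths, and whatever remains in the middle is the depth of stage 3.
    depths = [int(stages[0]), int(stages[1]), int(stages[2:-1]), int(stages[-1])]
    return {
        "embed_dim": int(dim),
        "depths": depths,
        "variant": variant,
        "img_size": int(resolution),
    }


if __name__ == "__main__":
    for model in ("CSWin_64_12211_tiny_224",
                  "CSWin_64_24322_small_224",
                  "CSWin_96_24322_base_224"):
        print(model, parse_model_name(model))
```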
To train CSWin on images with 384x384 resolution, use `--img-size 384`.
If GPU memory is insufficient, use `-b 128 --lr 1e-3 --model-ema-decay 0.99992` or enable [checkpoint](https://pytorch.org/docs/stable/checkpoint.html) with `--use-chk`.
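The reduced-memory settings are numerically consistent with scaling the tiny/small recipe linearly with batch size: halving the batch halves the learning rate and halves the gap between the EMA decay and 1. This is an observation about the numbers, not a rule stated by the authors, and `scale_for_batch` below is a hypothetical helper sketched under that assumption.

```python
import math


def scale_for_batch(base_batch, base_lr, base_ema_decay, new_batch):
    """Rescale lr and EMA decay linearly with the batch-size ratio.

    Assumption: lr scales with batch size, and (1 - ema_decay) scales with
    it too, since a smaller batch means proportionally more update steps.
    """
    ratio = new_batch / base_batch
    lr = base_lr * ratio
    ema_decay = 1.0 - (1.0 - base_ema_decay) * ratio
    return lr, ema_decay


if __name__ == "__main__":
    lr, ema = scale_for_batch(256, 2e-3, 0.99984, 128)
    # These match the fallback flags: --lr 1e-3 --model-ema-decay 0.99992
    print(f"--lr {lr:g} --model-ema-decay {ema:.5f}")
    assert math.isclose(lr, 1e-3) and math.isclose(ema, 0.99992)
```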
## Finetune
Finetune CSWin-Base with 384x384 resolution:
```
bash finetune.sh 8 --data <data path> --model CSWin_96_24322_base_384 -b 32 --lr 5e-6 --min-lr 5e-7 --weight-decay 1e-8 --amp --img-size 384 --warmup-epochs 0 --model-ema-decay 0.9998 --finetune <pretrained 224 model> --epochs 20 --mixup 0.1 --cooldown-epochs 10 --drop-path 0.7 --ema-finetune --lr-scale 1 --cutmix 0.1
```
Finetune ImageNet-22K pretrained CSWin-Large with 224x224 resolution:
```
bash finetune.sh 8 --data <data path> --model CSWin_144_24322_large_224 -b 64 --lr 2.5e-4 --min-lr 5e-7 --weight-decay 1e-8 --amp --img-size 224 --warmup-epochs 0 --model-ema-decay 0.9996 --finetune <22k-pretrained model> --epochs 30 --mixup 0.01 --cooldown-epochs 10 --interpolation bicubic --lr-scale 0.05 --drop-path 0.2 --cutmix 0.3 --use-chk --fine-22k --ema-finetune
```
If GPU memory is insufficient, enable [checkpoint](https://pytorch.org/docs/stable/checkpoint.html) with `--use-chk`.
## Cite CSWin Transformer
```
@misc{dong2021cswin,
title={CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows},
author={Xiaoyi Dong and Jianmin Bao and Dongdong Chen and Weiming Zhang and Nenghai Yu and Lu Yuan and Dong Chen and Baining Guo},
year={2021},
eprint={2107.00652},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## Acknowledgement
This repository is built using the [timm](https://github.com/rwightman/pytorch-image-models) library and the [DeiT](https://github.com/facebookresearch/deit) repository.
## License
This project is licensed under th
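A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, the authors propose CSWin Transformer. CSWin Transformer demonstrates very strong performance on common vision tasks. Specifically, without any extra training data or labels, it achieves 85.4% Top-1 accuracy on ImageNet-1K classification, 53.9 box AP and 46.4 mask AP on COCO detection, and 51.7 mIoU on ADE20K semantic segmentation, all surpassing SwinT. With further pretraining on the larger ImageNet-21K dataset, it reaches 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.7 mIoU.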