# Image classification reference training scripts
This folder contains reference training scripts for image classification.
They serve as a log of how to train specific models, and provide baseline
training and evaluation scripts to quickly bootstrap research.
Unless otherwise noted, all models have been trained on 8x V100 GPUs with
the following parameters:
| Parameter | Value |
| ------------------------ | ------ |
| `--batch-size` | `32` |
| `--epochs` | `90` |
| `--lr` | `0.1` |
| `--momentum` | `0.9` |
| `--wd`, `--weight-decay` | `1e-4` |
| `--lr-step-size` | `30` |
| `--lr-gamma` | `0.1` |
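Read together, `--lr-step-size 30` and `--lr-gamma 0.1` describe a step schedule: the learning rate starts at `0.1` and is multiplied by `0.1` every 30 epochs. The sketch below reproduces that arithmetic in plain Python. `step_lr` is a hypothetical helper written for illustration, not a function from `train.py`.

```python
def step_lr(epoch, base_lr=0.1, step_size=30, gamma=0.1):
    """Learning rate at `epoch` under step decay (defaults mirror the table)."""
    # The rate is multiplied by `gamma` once per completed `step_size` epochs.
    return base_lr * gamma ** (epoch // step_size)


# Print the rate at the start, around each decay boundary, and at the last epoch.
for epoch in (0, 29, 30, 60, 89):
    print(f"epoch {epoch:2d}: lr = {step_lr(epoch):.4f}")
```

With the default 90 epochs, the run therefore spends 30 epochs at each of three learning rates.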
### AlexNet and VGG
Since `AlexNet` and the original `VGG` architectures do not include batch
normalization, the default initial learning rate `--lr 0.1` is too high.
Use `--lr 1e-2` instead:
torchrun --nproc_per_node=8 train.py \
--model $MODEL --lr 1e-2
Here `$MODEL` is one of `alexnet`, `vgg11`, `vgg13`, `vgg16` or `vgg19`. Note
that `vgg11_bn`, `vgg13_bn`, `vgg16_bn`, and `vgg19_bn` include batch
normalization and thus are trained with the default parameters.
### GoogLeNet
The weights of the GoogLeNet model are ported from the original paper rather than trained from scratch.
### Inception V3
The weights of the Inception V3 model are ported from the original paper rather than trained from scratch.
Since it expects tensors with a size of N x 3 x 299 x 299, to validate the model use the following command:
torchrun --nproc_per_node=8 train.py --model inception_v3 \
--test-only --weights Inception_V3_Weights.IMAGENET1K_V1
### ResNet
torchrun --nproc_per_node=8 train.py --model $MODEL
Here `$MODEL` is one of `resnet18`, `resnet34`, `resnet50`, `resnet101` or `resnet152`.
### ResNeXt
torchrun --nproc_per_node=8 train.py \
--model $MODEL --epochs 100
Here `$MODEL` is one of `resnext50_32x4d` or `resnext101_32x8d`.
Note that the above command corresponds to a single node with 8 GPUs. If you use
a different number of GPUs and/or a different batch size, the learning rate
should be scaled accordingly. For example, the pretrained model provided by
`torchvision` was trained on 8 nodes, each with 8 GPUs (for a total of 64 GPUs),
with `--batch-size 16` and `--lr 0.4`, instead of the current defaults of
`--batch-size 32` and `--lr 0.1`.
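The numbers in the note above are consistent with scaling the learning rate linearly with the global batch size. The sketch below assumes that linear rule, which the text implies but does not state. `scaled_lr` is a hypothetical helper, not part of `train.py`.

```python
def scaled_lr(base_lr, base_global_batch, nodes, gpus_per_node, per_gpu_batch):
    """Scale `base_lr` linearly with the global batch size (assumed rule)."""
    global_batch = nodes * gpus_per_node * per_gpu_batch
    return base_lr * global_batch / base_global_batch


# Reference setup from this README: 1 node x 8 GPUs x 32 images per GPU, lr 0.1.
reference_global_batch = 1 * 8 * 32

# Setup used for the released ResNeXt weights: 8 nodes x 8 GPUs x 16 images per GPU.
lr = scaled_lr(0.1, reference_global_batch, nodes=8, gpus_per_node=8, per_gpu_batch=16)
print(f"global batch 1024 -> lr = {lr:.2f}")
```

Under this assumption, the 64-GPU run has four times the reference global batch size (1024 versus 256), which matches the four times larger `--lr 0.4`.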
### MobileNetV2
torchrun --nproc_per_node=8 train.py \
--model mobilenet_v2 --epochs 300 --lr 0.045 --wd 0.00004 \
--lr-step-size 1 --lr-gamma 0.98
### MobileNetV3 Large & Small
torchrun --nproc_per_node=8 train.py \
--model $MODEL --epochs 600 --opt rmsprop --batch-size 128 --lr 0.064 \
--wd 0.00001 --lr-step-size 2 --lr-gamma 0.973 --auto-augment imagenet --random-erase 0.2
Here `$MODEL` is one of `mobilenet_v3_large` or `mobilenet_v3_small`.
Then we averaged the parameters of the last 3 checkpoints that improved the Acc@1. See [#3182](https://github.com/pytorch/vision/pull/3182)
and [#3354](https://github.com/pytorch/vision/pull/3354) for details.
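As a rough illustration of what averaging checkpoint parameters means, the sketch below takes the element-wise mean of parameters that share a name. It is not the procedure from the linked PRs: the checkpoints are toy values, flat lists of floats stand in for tensors, and `average_checkpoints` is a hypothetical helper.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of same-named parameters across checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Toy stand-ins for the last three checkpoints that improved Acc@1.
checkpoints = [
    {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]},
    {"fc.weight": [2.0, 4.0], "fc.bias": [3.0]},
    {"fc.weight": [3.0, 6.0], "fc.bias": [6.0]},
]
averaged = average_checkpoints(checkpoints)
print(averaged)
```

With real checkpoints the same idea would apply to each tensor in the model's state dict. Non-float buffers such as batch-norm step counters would need separate handling.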
### EfficientNet-V1
The weights of the B0-B4 variants are ported from Ross Wightman's [timm repo](https://github.com/rwightman/pytorch-image-models/blob/01cb46a9a50e3ba4be167965b5764e9702f09b30/timm/models/efficientnet.py#L95-L108).
The weights of the B5-B7 variants are ported from Luke Melas' [EfficientNet-PyTorch repo](https://github.com/lukemelas/EfficientNet-PyTorch/blob/1039e009545d9329ea026c9f7541341439712b96/efficientnet_pytorch/utils.py#L562-L564).
All models were trained using bicubic interpolation, and each has its own crop and resize sizes. To validate the models, use the following commands:
torchrun --nproc_per_node=8 train.py --model efficientnet_b0 --test-only --weights EfficientNet_B0_Weights.IMAGENET1K_V1
torchrun --nproc_per_node=8 train.py --model efficientnet_b1 --test-only --weights EfficientNet_B1_Weights.IMAGENET1K_V1
torchrun --nproc_per_node=8 train.py --model efficientnet_b2 --test-only --weights EfficientNet_B2_Weights.IMAGENET1K_V1
torchrun --nproc_per_node=8 train.py --model efficientnet_b3 --test-only --weights EfficientNet_B3_Weights.IMAGENET1K_V1
torchrun --nproc_per_node=8 train.py --model efficientnet_b4 --test-only --weights EfficientNet_B4_Weights.IMAGENET1K_V1
torchrun --nproc_per_node=8 train.py --model efficientnet_b5 --test-only --weights EfficientNet_B5_Weights.IMAGENET1K_V1
torchrun --nproc_per_node=8 train.py --model efficientnet_b6 --test-only --weights EfficientNet_B6_Weights.IMAGENET1K_V1
torchrun --nproc_per_node=8 train.py --model efficientnet_b7 --test-only --weights EfficientNet_B7_Weights.IMAGENET1K_V1
### EfficientNet-V2
torchrun --nproc_per_node=8 train.py \
--model $MODEL --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr \
--lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 \
--label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 \
--train-crop-size $TRAIN_SIZE --model-ema --val-crop-size $EVAL_SIZE --val-resize-size $EVAL_SIZE \
--ra-sampler --ra-reps 4
Here `$MODEL` is one of `efficientnet_v2_s` or `efficientnet_v2_m`.
Note that the Small variant had a `$TRAIN_SIZE` of `300` and an `$EVAL_SIZE` of `384`, while the Medium variant used `384` and `480` respectively.
Note that the above command corresponds to training on a single node with 8 GPUs.
To generate the pre-trained weights, we trained with 4 nodes, each with 8 GPUs (for a total of 32 GPUs),
and `--batch-size 32`.
The weights of the Large variant are ported from the original paper rather than trained from scratch. See the `EfficientNet_V2_L_Weights` entry for their exact preprocessing transforms.
### RegNet
#### Small models
torchrun --nproc_per_node=8 train.py \
--model $MODEL --epochs 100 --batch-size 128 --wd 0.00005 --lr=0.8 \
--lr-scheduler=cosineannealinglr --lr-warmup-method=linear \
--lr-warmup-epochs=5 --lr-warmup-decay=0.1
Here `$MODEL` is one of `regnet_x_400mf`, `regnet_x_800mf`, `regnet_x_1_6gf`, `regnet_y_400mf`, `regnet_y_800mf` and `regnet_y_1_6gf`. Please note we used learning rate 0.4 for `regnet_y_400mf` to get the same Acc@1 as [the paper](https://arxiv.org/abs/2003.13678).
#### Medium models
torchrun --nproc_per_node=8 train.py \
--model $MODEL --epochs 100 --batch-size 64 --wd 0.00005 --lr=0.4 \
--lr-scheduler=cosineannealinglr --lr-warmup-method=linear \
--lr-warmup-epochs=5 --lr-warmup-decay=0.1
Here `$MODEL` is one of `regnet_x_3_2gf`, `regnet_x_8gf`, `regnet_x_16gf`, `regnet_y_3_2gf` and `regnet_y_8gf`.
#### Large models
torchrun --nproc_per_node=8 train.py \
--model $MODEL --epochs 100 --batch-size 32 --wd 0.00005 --lr=0.2 \
--lr-scheduler=cosineannealinglr --lr-warmup-method=linear \
--lr-warmup-epochs=5 --lr-warmup-decay=0.1
Here `$MODEL` is one of `regnet_x_32gf`, `regnet_y_16gf` and `regnet_y_32gf`.
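All three RegNet commands combine a 5-epoch linear warmup with cosine annealing. The sketch below is a closed-form approximation of that schedule, evaluated once per epoch. It assumes that `--lr-warmup-decay` is the starting fraction of the base learning rate and that the cosine phase decays to a minimum of `0`. `warmup_cosine_lr` is a hypothetical helper, not a function from `train.py`.

```python
import math


def warmup_cosine_lr(epoch, base_lr, epochs, warmup_epochs=5,
                     warmup_decay=0.1, min_lr=0.0):
    """Learning rate at `epoch`: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_decay * base_lr up to base_lr.
        factor = warmup_decay + (1.0 - warmup_decay) * epoch / warmup_epochs
        return base_lr * factor
    # Cosine decay over the epochs that remain after warmup.
    progress = (epoch - warmup_epochs) / (epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values from the "Small models" command: --lr=0.8 --epochs 100.
for epoch in (0, 1, 5, 50, 99):
    print(f"epoch {epoch:2d}: lr = {warmup_cosine_lr(epoch, 0.8, 100):.4f}")
```

Under these assumptions the Small models start at one tenth of the base rate, reach `0.8` at epoch 5, and decay towards zero by the end of training.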
### Vision Transformer
#### vit_b_16
torchrun --nproc_per_node=8 train.py \
--model vit_b_16 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
--lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
--lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra \
--clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
Note that the above command corresponds to training on a single node with 8 GPUs.
To generate the pre-trained weights, we trained with 8 nodes, each with 8 GPUs (for a total of 64 GPUs),
and `--batch-size 64`.
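The single-node commands in this README and the multi-node runs used for the released weights work out to the same global batch size, which is presumably why the learning rate is identical in both setups. The check below uses the EfficientNet-V2 and `vit_b_16` numbers from the text. `global_batch_size` is a hypothetical helper.

```python
def global_batch_size(nodes, gpus_per_node, per_gpu_batch):
    """Images processed per optimizer step across all GPUs."""
    return nodes * gpus_per_node * per_gpu_batch


setups = {
    # name: (single-node command, multi-node run used for the weights)
    "efficientnet_v2": (global_batch_size(1, 8, 128), global_batch_size(4, 8, 32)),
    "vit_b_16": (global_batch_size(1, 8, 512), global_batch_size(8, 8, 64)),
}
for name, (single_node, multi_node) in setups.items():
    print(f"{name}: single node = {single_node}, multi node = {multi_node}")
```

When reproducing on different hardware, keeping this product constant is one way to avoid retuning the learning rate.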
#### vit_b_32
torchrun --nproc_per_node=8 train.py \
--model vit_b_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
--lr-scheduler cosineannealinglr --lr-warmup-me
