# bert-multi-gpu
Fine-tune large BERT models with large batch sizes easily. Multi-GPU training and FP16 are supported.
## Dependencies
- TensorFlow
  - `tensorflow >= 1.11.0` (CPU version)
  - `tensorflow-gpu >= 1.11.0` (GPU version)
- NVIDIA Collective Communications Library (NCCL)
## Features
- CPU/GPU/TPU Support
- **Multi-GPU Support**: [`tf.distribute.MirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) is used to achieve multi-GPU support for this project. It replicates (mirrors) the model's variables on each GPU and keeps them in sync via synchronous gradient updates. The maximum batch_size for each GPU is almost the same as for the original [bert](https://github.com/google-research/bert/blob/master/README.md#out-of-memory-issues) repo, so the **global batch_size** depends on how many GPUs there are (see the configuration sketch after this feature list).
  - Assume: num_train_examples = 32000
  - Situation 1 (multi-GPU): train_batch_size = 8, num_gpu_cores = 4, num_train_epochs = 1
    - global_batch_size = train_batch_size * num_gpu_cores = 32
    - iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
  - Situation 2 (single GPU): train_batch_size = 32, num_gpu_cores = 1, num_train_epochs = 4
    - global_batch_size = train_batch_size * num_gpu_cores = 32
    - iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
  - The results after training are equivalent between situations 1 and 2 when gradients are updated synchronously: both perform 4000 optimizer steps with a global batch size of 32.
- **FP16 Support**: [FP16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) allows you to use a larger batch_size, and training speed increases by roughly 70~100% on Volta GPUs, though it may be slower on Pascal GPUs.
- **SavedModel Export**
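As a rough illustration of how the multi-GPU support is wired up, here is a minimal sketch of passing a `MirroredStrategy` to an Estimator `RunConfig` (TF 1.x API). This is not the project's actual code; the function and argument names below are placeholders.

```python
import tensorflow as tf

def build_estimator(model_fn, output_dir):
    """Wraps an Estimator so that training runs on all visible GPUs."""
    # Depending on the TF 1.x minor version, the strategy class lives under
    # tf.distribute or tf.contrib.distribute.
    strategy = tf.distribute.MirroredStrategy()
    run_config = tf.estimator.RunConfig(
        model_dir=output_dir,
        train_distribute=strategy,  # synchronous multi-GPU training
        save_checkpoints_steps=1000)
    return tf.estimator.Estimator(model_fn=model_fn, config=run_config)

# Each replica receives train_batch_size examples per step, so one optimizer
# step consumes train_batch_size * num_gpu_cores examples in total.
```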
## Usage
### Run Classifier
The main parameters are listed below:
- `task_name`: The name of the task to fine-tune on. You can define your own task by implementing a `DataProcessor` subclass (see the sketch after this list).
- `do_lower_case`: Whether to lower case the input text. Should be True for uncased models and False for cased models. Default value is `true`.
- `do_train`: Whether to fine-tune the classifier. Default value is `false`.
- `do_eval`: Whether to evaluate the classifier. Default value is `false`.
- `do_predict`: Whether to run prediction with the classifier restored from the checkpoint. Default value is `false`.
- `save_for_serving`: Whether to export a SavedModel for TensorFlow Serving. Default value is `false`.
- `data_dir`: Your original input data directory.
- `vocab_file`, `bert_config_file`, `init_checkpoint`: Files in BERT model directory.
- `max_seq_length`: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default value is `128`.
- `train_batch_size`: Batch size for [**each GPU**](<https://stackoverflow.com/questions/54327610/does-tensorflow-estimator-take-different-batches-for-workers-when-mirroredstrate/54332773#54332773>). For example, if `train_batch_size` is 16, and `num_gpu_cores` is 4, your **GLOBAL** batch size is 16 * 4 = 64.
- `learning_rate`: Initial learning rate for the Adam optimizer.
- `num_train_epochs`: Number of training epochs.
- `use_gpu`: Whether to use GPUs.
- `num_gpu_cores`: Total number of GPU cores to use, only used if `use_gpu` is True.
- `use_fp16`: Use [`FP16`](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) or not.
- `output_dir`: **Checkpoints** and **SavedModel(.pb) files** will be saved in this directory.
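To define your own task, implement a processor following the `DataProcessor` interface that this project inherits from the upstream BERT `run_classifier.py`. Below is a minimal sketch; the file names, TSV layout, and label set are illustrative assumptions, and the import path depends on where `DataProcessor` and `InputExample` live in your copy of the code.

```python
import os

from run_custom_classifier import DataProcessor, InputExample  # adjust import


class MyTaskProcessor(DataProcessor):
    """Illustrative processor for a binary single-sentence task.

    Assumes data_dir contains train.tsv / dev.tsv / test.tsv with two
    tab-separated columns: label and text.
    """

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        return ["0", "1"]  # replace with your task's label set

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            guid = "%s-%d" % (set_type, i)
            # line[0] is the label, line[1] is the input text.
            examples.append(InputExample(
                guid=guid, text_a=line[1], text_b=None, label=line[0]))
        return examples
```

As in the upstream script, the new processor also has to be registered in the `processors` dictionary of the main script so that `--task_name` resolves to it. A full example command: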
```shell
python run_custom_classifier.py \
--task_name=QQP \
--do_lower_case=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--save_for_serving=true \
--data_dir=/cfs/data/glue/QQP \
--vocab_file=/cfs/models/bert-large-uncased/vocab.txt \
--bert_config_file=/cfs/models/bert-large-uncased/bert_config.json \
--init_checkpoint=/cfs/models/bert-large-uncased/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--use_gpu=true \
--num_gpu_cores=4 \
--use_fp16=false \
--output_dir=/cfs/outputs/bert-large-uncased-qqp
```
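When `save_for_serving=true`, a SavedModel for TensorFlow Serving is exported under `output_dir` in addition to the checkpoints. The exported model can also be loaded locally for quick testing with TF 1.x's `tf.contrib.predictor`. In the sketch below, the timestamped export subdirectory and the feature names (`input_ids`, `input_mask`, `segment_ids`) are assumptions based on the standard BERT inputs; inspect the real signature with `saved_model_cli show --dir <export_dir> --all`.

```python
import tensorflow as tf

# Placeholder path: SavedModels are typically exported into a timestamped
# subdirectory of the output directory.
export_dir = "/cfs/outputs/bert-large-uncased-qqp/<timestamp>"

predict_fn = tf.contrib.predictor.from_saved_model(export_dir)

max_seq_length = 128
# Toy input: a 6-token example padded to max_seq_length. The feature names
# are assumptions; match them to the exported serving signature.
result = predict_fn({
    "input_ids": [[101, 2023, 2003, 1037, 7099, 102] + [0] * 122],
    "input_mask": [[1] * 6 + [0] * 122],
    "segment_ids": [[0] * max_seq_length],
})
print(result)
```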
A shell script is also available (see `run_custom_classifier.sh`):
- Optional parameters can be passed through the command line.
- `CUDA_VISIBLE_DEVICES` can be set and exported as an environment variable when multiple GPUs are used.
```shell
# show the parameter abbreviations
bash run_custom_classifier.sh -h
# output
current params setting:
-s max_seq_length, default val is: 128
-g num_gpu_cores, default val is: 4
-b train_batch_size, default val is: 32
-l learning_rate, default val is: 2e-5
-e num_train_epochs, default val is: 3.0
-c CUDA_VISIBLE_DEVICES, default val is: 0,1,2,3
# example of passing parameters
bash run_custom_classifier.sh -s 512 -b 8 -l 3e-5 -e 1 -g 2 -c 2,3
```
### Run Multi-label Classification
Use case: in some situations one example can belong to several groups at once, e.g. a movie can be tagged as romantic, commercial, and boring, reflecting different aspects. Since such labels are not mutually exclusive (a target vector may look like `[1, 1, 0]`), multi-label classification should be applied rather than multi-class classification.
One additional parameter, **`num_labels`**, is required; the other parameters are the same as for the basic classifier.
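The essential modeling difference from the basic classifier is the loss: each label gets an independent sigmoid instead of a single softmax over all classes. A minimal sketch of the idea (not the project's exact code):

```python
import tensorflow as tf

def multi_label_loss(logits, labels):
    """Mean sigmoid cross-entropy over independent binary labels.

    logits: [batch_size, num_labels] unnormalized scores.
    labels: [batch_size, num_labels] with 0/1 entries, e.g. [1, 1, 0].
    """
    per_label = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.cast(labels, tf.float32), logits=logits)
    return tf.reduce_mean(per_label)

# At prediction time, threshold each label's probability independently:
#   probs = tf.nn.sigmoid(logits)
#   predictions = tf.cast(probs > 0.5, tf.int32)
```

An example command: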
```shell
python run_custom_classifier_mlabel.py \
--num_labels=10 \
--task_name=Mlabel \
--do_lower_case=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--save_for_serving=true \
--data_dir=/cfs/data/Mlabel \
--vocab_file=/cfs/models/bert-large-uncased/vocab.txt \
--bert_config_file=/cfs/models/bert-large-uncased/bert_config.json \
--init_checkpoint=/cfs/models/bert-large-uncased/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--use_gpu=true \
--num_gpu_cores=4 \
--use_fp16=false \
--output_dir=/cfs/outputs/bert-large-uncased-mlabel
```
### Run Sequence Labeling
Most parameters are the same as for the classifier; the main ones are listed below:
- `task_name`: The name of the task to fine-tune on. You can define your own task by implementing a `DataProcessor` subclass (see the sketch in the Run Classifier section).
- `do_lower_case`: Whether to lower case the input text. Should be True for uncased models and False for cased models. Default value is `true`.
- `do_train`: Whether to fine-tune the model. Default value is `false`.
- `do_eval`: Whether to evaluate the model. Default value is `false`.
- `do_predict`: Whether to run prediction with the model restored from the checkpoint. Default value is `false`.
- `save_for_serving`: Whether to export a SavedModel for TensorFlow Serving. Default value is `false`.
- `data_dir`: Your original input data directory.
- `vocab_file`, `bert_config_file`, `init_checkpoint`: Files in BERT model directory.
- `max_seq_length`: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default value is `128`.
- `train_batch_size`: Batch size for [**each GPU**](<https://stackoverflow.com/questions/54327610/does-tensorflow-estimator-take-different-batches-for-workers-when-mirroredstrate/54332773#54332773>). For example, if `train_batch_size` is 16, and `num_gpu_cores` is 4, your **GLOBAL** batch size is 16 * 4 = 64.
- `learning_rate`: Initial learning rate for the Adam optimizer.
- `num_train_epochs`: Number of training epochs.
- `use_gpu`: Whether to use GPUs.
- `num_gpu_cores`: Total number of GPU cores to use, only used if `use_gpu` is True.
- `use_fp16`: Use [`FP16`](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) or not.
- `output_dir`: **Checkpoints** and **SavedModel(.pb) files** will be saved in this directory.
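Sequence labeling predicts a label for every token instead of a single label per example, so the classification head is applied to BERT's full sequence output rather than to the pooled `[CLS]` vector. A minimal sketch of such a per-token head (illustrative, not the project's exact code):

```python
import tensorflow as tf

def token_classification_head(sequence_output, num_labels):
    """Per-token logits and predictions.

    sequence_output: [batch_size, max_seq_length, hidden_size], e.g. from
    model.get_sequence_output() in the BERT modeling code.
    """
    # One dense projection shared across all token positions.
    logits = tf.layers.dense(sequence_output, num_labels)
    # predictions: [batch_size, max_seq_length]
    predictions = tf.argmax(logits, axis=-1)
    return logits, predictions

# The training loss is a per-token softmax cross-entropy, masked so that
# padding positions do not contribute.
```

An example command: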
```shell
python run_seq_labeling.py \
--task_name=PUNCT \
--do_lower_case=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--save_for_serving