# bert-multi-gpu
Fine-tune large BERT models with large batch sizes easily. Multi-GPU training and FP16 are supported.
## Dependencies
- TensorFlow
  - `tensorflow >= 1.11.0` (CPU version)
  - `tensorflow-gpu >= 1.11.0` (GPU version)
- NVIDIA Collective Communications Library (NCCL)
## Features
- CPU/GPU/TPU Support
- **Multi-GPU Support**: [`tf.distribute.MirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) is used to achieve multi-GPU support for this project. It replicates (mirrors) the model's variables on each GPU and keeps them in sync via synchronous gradient updates. The maximum batch_size for each GPU is almost the same as for the original [bert](https://github.com/google-research/bert/blob/master/README.md#out-of-memory-issues) repo, so the **global batch_size** depends on how many GPUs there are (see the configuration sketch after this feature list).
  - Assume: num_train_examples = 32000
  - Situation 1 (multi-GPU): train_batch_size = 8, num_gpu_cores = 4, num_train_epochs = 1
    - global_batch_size = train_batch_size * num_gpu_cores = 32
    - iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
  - Situation 2 (single GPU): train_batch_size = 32, num_gpu_cores = 1, num_train_epochs = 4
    - global_batch_size = train_batch_size * num_gpu_cores = 32
    - iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
  - The results after training are equivalent between situations 1 and 2 when gradients are updated synchronously: both perform 4000 optimizer steps with a global batch size of 32.
- **FP16 Support**: [FP16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) allows you to use a larger batch_size, and training speed increases by roughly 70~100% on Volta GPUs, though it may be slower on Pascal GPUs.
- **SavedModel Export**
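As a rough illustration of how the multi-GPU support is wired up, here is a minimal sketch of passing a `MirroredStrategy` to an Estimator `RunConfig` (TF 1.x API). This is not the project's actual code; the function and argument names below are placeholders.

```python
import tensorflow as tf

def build_estimator(model_fn, output_dir):
    """Wraps an Estimator so that training runs on all visible GPUs."""
    # Depending on the TF 1.x minor version, the strategy class lives under
    # tf.distribute or tf.contrib.distribute.
    strategy = tf.distribute.MirroredStrategy()
    run_config = tf.estimator.RunConfig(
        model_dir=output_dir,
        train_distribute=strategy,  # synchronous multi-GPU training
        save_checkpoints_steps=1000)
    return tf.estimator.Estimator(model_fn=model_fn, config=run_config)

# Each replica receives train_batch_size examples per step, so one optimizer
# step consumes train_batch_size * num_gpu_cores examples in total.
```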
## Usage
### Run Classifier
The main parameters are listed below:
- `task_name`: The name of the task to fine-tune on. You can define your own task by implementing a `DataProcessor` subclass (see the sketch after this list).
- `do_lower_case`: Whether to lower case the input text. Should be True for uncased models and False for cased models. Default value is `true`.
- `do_train`: Whether to fine-tune the classifier. Default value is `false`.
- `do_eval`: Whether to evaluate the classifier. Default value is `false`.
- `do_predict`: Whether to run prediction with the classifier restored from the checkpoint. Default value is `false`.
- `save_for_serving`: Whether to export a SavedModel for TensorFlow Serving. Default value is `false`.
- `data_dir`: Your original input data directory.
- `vocab_file`, `bert_config_file`, `init_checkpoint`: Files in BERT model directory.
- `max_seq_length`: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default value is `128`.
- `train_batch_size`: Batch size for [**each GPU**](<https://stackoverflow.com/questions/54327610/does-tensorflow-estimator-take-different-batches-for-workers-when-mirroredstrate/54332773#54332773>). For example, if `train_batch_size` is 16, and `num_gpu_cores` is 4, your **GLOBAL** batch size is 16 * 4 = 64.
- `learning_rate`: Initial learning rate for the Adam optimizer.
- `num_train_epochs`: Number of training epochs.
- `use_gpu`: Whether to use GPUs.
- `num_gpu_cores`: Total number of GPU cores to use, only used if `use_gpu` is True.
- `use_fp16`: Use [`FP16`](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) or not.
- `output_dir`: **Checkpoints** and **SavedModel(.pb) files** will be saved in this directory.
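To define your own task, implement a processor following the `DataProcessor` interface that this project inherits from the upstream BERT `run_classifier.py`. Below is a minimal sketch; the file names, TSV layout, and label set are illustrative assumptions, and the import path depends on where `DataProcessor` and `InputExample` live in your copy of the code.

```python
import os

from run_custom_classifier import DataProcessor, InputExample  # adjust import


class MyTaskProcessor(DataProcessor):
    """Illustrative processor for a binary single-sentence task.

    Assumes data_dir contains train.tsv / dev.tsv / test.tsv with two
    tab-separated columns: label and text.
    """

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        return ["0", "1"]  # replace with your task's label set

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            guid = "%s-%d" % (set_type, i)
            # line[0] is the label, line[1] is the input text.
            examples.append(InputExample(
                guid=guid, text_a=line[1], text_b=None, label=line[0]))
        return examples
```

As in the upstream script, the new processor also has to be registered in the `processors` dictionary of the main script so that `--task_name` resolves to it. A full example command: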
```shell
python run_custom_classifier.py \
--task_name=QQP \
--do_lower_case=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--save_for_serving=true \
--data_dir=/cfs/data/glue/QQP \
--vocab_file=/cfs/models/bert-large-uncased/vocab.txt \
--bert_config_file=/cfs/models/bert-large-uncased/bert_config.json \
--init_checkpoint=/cfs/models/bert-large-uncased/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--use_gpu=true \
--num_gpu_cores=4 \
--use_fp16=false \
--output_dir=/cfs/outputs/bert-large-uncased-qqp
```
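When `save_for_serving=true`, a SavedModel for TensorFlow Serving is exported under `output_dir` in addition to the checkpoints. The exported model can also be loaded locally for quick testing with TF 1.x's `tf.contrib.predictor`. In the sketch below, the timestamped export subdirectory and the feature names (`input_ids`, `input_mask`, `segment_ids`) are assumptions based on the standard BERT inputs; inspect the real signature with `saved_model_cli show --dir <export_dir> --all`.

```python
import tensorflow as tf

# Placeholder path: SavedModels are typically exported into a timestamped
# subdirectory of the output directory.
export_dir = "/cfs/outputs/bert-large-uncased-qqp/<timestamp>"

predict_fn = tf.contrib.predictor.from_saved_model(export_dir)

max_seq_length = 128
# Toy input: a 6-token example padded to max_seq_length. The feature names
# are assumptions; match them to the exported serving signature.
result = predict_fn({
    "input_ids": [[101, 2023, 2003, 1037, 7099, 102] + [0] * 122],
    "input_mask": [[1] * 6 + [0] * 122],
    "segment_ids": [[0] * max_seq_length],
})
print(result)
```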
A shell script is also available (see `run_custom_classifier.sh`):
- Optional parameters can be passed through the command line.
- `CUDA_VISIBLE_DEVICES` can be set and exported as an environment variable when multiple GPUs are used.
```shell
# show the parameter abbreviations
bash run_custom_classifier.sh -h
# output
current params setting:
-s max_seq_length, default val is: 128
-g num_gpu_cores, default val is: 4
-b train_batch_size, default val is: 32
-l learning_rate, default val is: 2e-5
-e num_train_epochs, default val is: 3.0
-c CUDA_VISIBLE_DEVICES, default val is: 0,1,2,3
# example of passing parameters
bash run_custom_classifier.sh -s 512 -b 8 -l 3e-5 -e 1 -g 2 -c 2,3
```
### Run Multi-label Classification
Use case: in some situations one example can belong to several groups at once, e.g. a movie can be tagged as romantic, commercial, and boring, reflecting different aspects. Since such labels are not mutually exclusive (a target vector may look like `[1, 1, 0]`), multi-label classification should be applied rather than multi-class classification.
One additional parameter, **`num_labels`**, is required; the other parameters are the same as for the basic classifier.
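The essential modeling difference from the basic classifier is the loss: each label gets an independent sigmoid instead of a single softmax over all classes. A minimal sketch of the idea (not the project's exact code):

```python
import tensorflow as tf

def multi_label_loss(logits, labels):
    """Mean sigmoid cross-entropy over independent binary labels.

    logits: [batch_size, num_labels] unnormalized scores.
    labels: [batch_size, num_labels] with 0/1 entries, e.g. [1, 1, 0].
    """
    per_label = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.cast(labels, tf.float32), logits=logits)
    return tf.reduce_mean(per_label)

# At prediction time, threshold each label's probability independently:
#   probs = tf.nn.sigmoid(logits)
#   predictions = tf.cast(probs > 0.5, tf.int32)
```

An example command: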
```shell
python run_custom_classifier_mlabel.py \
--num_labels=10 \
--task_name=Mlabel \
--do_lower_case=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--save_for_serving=true \
--data_dir=/cfs/data/Mlabel \
--vocab_file=/cfs/models/bert-large-uncased/vocab.txt \
--bert_config_file=/cfs/models/bert-large-uncased/bert_config.json \
--init_checkpoint=/cfs/models/bert-large-uncased/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--use_gpu=true \
--num_gpu_cores=4 \
--use_fp16=false \
--output_dir=/cfs/outputs/bert-large-uncased-mlabel
```
### Run Sequence Labeling
Most parameters are the same as for the classifier; the main ones are listed below:
- `task_name`: The name of the task to fine-tune on. You can define your own task by implementing a `DataProcessor` subclass (see the sketch in the Run Classifier section).
- `do_lower_case`: Whether to lower case the input text. Should be True for uncased models and False for cased models. Default value is `true`.
- `do_train`: Whether to fine-tune the model. Default value is `false`.
- `do_eval`: Whether to evaluate the model. Default value is `false`.
- `do_predict`: Whether to run prediction with the model restored from the checkpoint. Default value is `false`.
- `save_for_serving`: Whether to export a SavedModel for TensorFlow Serving. Default value is `false`.
- `data_dir`: Your original input data directory.
- `vocab_file`, `bert_config_file`, `init_checkpoint`: Files in BERT model directory.
- `max_seq_length`: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default value is `128`.
- `train_batch_size`: Batch size for [**each GPU**](<https://stackoverflow.com/questions/54327610/does-tensorflow-estimator-take-different-batches-for-workers-when-mirroredstrate/54332773#54332773>). For example, if `train_batch_size` is 16, and `num_gpu_cores` is 4, your **GLOBAL** batch size is 16 * 4 = 64.
- `learning_rate`: Initial learning rate for the Adam optimizer.
- `num_train_epochs`: Number of training epochs.
- `use_gpu`: Whether to use GPUs.
- `num_gpu_cores`: Total number of GPU cores to use, only used if `use_gpu` is True.
- `use_fp16`: Use [`FP16`](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) or not.
- `output_dir`: **Checkpoints** and **SavedModel(.pb) files** will be saved in this directory.
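Sequence labeling predicts a label for every token instead of a single label per example, so the classification head is applied to BERT's full sequence output rather than to the pooled `[CLS]` vector. A minimal sketch of such a per-token head (illustrative, not the project's exact code):

```python
import tensorflow as tf

def token_classification_head(sequence_output, num_labels):
    """Per-token logits and predictions.

    sequence_output: [batch_size, max_seq_length, hidden_size], e.g. from
    model.get_sequence_output() in the BERT modeling code.
    """
    # One dense projection shared across all token positions.
    logits = tf.layers.dense(sequence_output, num_labels)
    # predictions: [batch_size, max_seq_length]
    predictions = tf.argmax(logits, axis=-1)
    return logits, predictions

# The training loss is a per-token softmax cross-entropy, masked so that
# padding positions do not contribute.
```

An example command: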
```shell
python run_seq_labeling.py \
--task_name=PUNCT \
--do_lower_case=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--save_for_serving