## Full-Parameter Fine-Tuning
First, generate the device resource information configuration file (RANK_TABLE_FILE) used for distributed communication on Ascend chips. The Ascend HCCL RANK_TABLE_FILE provides the cluster information required by an Ascend distributed training job.
```
# Example: generate the rank_table_file for 8 devices
> python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"
start ./mindformers/tools/hccl_tools.py
visible_devices:['0', '1', '2', '3', '4', '5', '6', '7']
server_id:192.168.1.196
device_num_list: [0, 1, 2, 3, 4, 5, 6, 7]
rank_id:0, device_id:0, device_ip:192.168.100.101
rank_id:1, device_id:1, device_ip:192.168.101.101
rank_id:2, device_id:2, device_ip:192.168.102.101
rank_id:3, device_id:3, device_ip:192.168.103.101
rank_id:4, device_id:4, device_ip:192.168.100.100
rank_id:5, device_id:5, device_ip:192.168.101.100
rank_id:6, device_id:6, device_ip:192.168.102.100
rank_id:7, device_id:7, device_ip:192.168.103.100
Completed: hccl file was save in : /root/workspace/code/mindformers/hccl_8p_01234567_192.168.1.196.json
```
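The command writes an HCCL rank table JSON to the path shown in the last output line. Its structure is roughly as sketched below; the values are taken from the output above, and the exact field layout can differ slightly across CANN / MindFormers versions, so treat this as an illustrative, abridged example rather than the literal file contents.
```
{
  "version": "1.0",
  "server_count": "1",
  "server_list": [
    {
      "server_id": "192.168.1.196",
      "device": [
        {"device_id": "0", "device_ip": "192.168.100.101", "rank_id": "0"},
        {"device_id": "1", "device_ip": "192.168.101.101", "rank_id": "1"},
        ...
        {"device_id": "7", "device_ip": "192.168.103.100", "rank_id": "7"}
      ],
      "host_nic_ip": "reserve"
    }
  ],
  "status": "completed"
}
```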
### Modify the Configuration
```
cd /root/workspace/code/mindformers
vim configs/glm/run_glm_6b_finetune.yaml
```
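Which entries need to change depends on your environment. As a minimal, illustrative sketch (the key names below follow the usual MindFormers config layout but should be verified against the YAML in your checkout, and the dataset path is a hypothetical placeholder), you typically point the config at the converted GLM-6B weights and your fine-tuning dataset, and adjust the runner settings:
```
# Illustrative excerpt of run_glm_6b_finetune.yaml -- verify key names in your version.
load_checkpoint: '/root/workspace/model/chatglm-convert/ms_glm_6b.ckpt'  # converted GLM-6B weights
runner_config:
  epochs: 1        # matches "Training epochs:1" in the log below
  batch_size: 8    # per-card batch size
  sink_size: 4
train_dataset: &train_dataset
  data_loader:
    dataset_dir: "/path/to/your/finetune_dataset"  # hypothetical path -- point at your processed data
```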
### Launch the Training Task
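run_distribute.sh lives in the scripts/ directory of the MindFormers repository and takes four positional arguments: the HCCL rank table JSON generated above, the model YAML configuration, the device range for this job, and the run mode. The sketch below shows the generic pattern with placeholder argument names; the relative ../configs path in the concrete command implies the working directory is scripts/.
```
cd /root/workspace/code/mindformers/scripts
# Usage: bash run_distribute.sh RANK_TABLE_FILE CONFIG_PATH DEVICE_RANGE RUN_MODE
```
The concrete invocation used in this walkthrough, and the per-rank startup messages it prints, are: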
```
> bash run_distribute.sh /root/workspace/code/mindformers/hccl_8p_01234567_192.168.1.196.json ../configs/glm/run_glm_6b_finetune.yaml '[0,8]' finetune
start training for rank 0, device 0
start training for rank 1, device 1
start training for rank 2, device 2
start training for rank 3, device 3
start training for rank 4, device 4
start training for rank 5, device 5
start training for rank 6, device 6
start training for rank 7, device 7
```
Part of the training log is shown below:
```
...
[INFO] 2023-07-11 10:35:39,223 [run_mindformer.py:71] main: moe config is: <mindformers.modules.transformer.moe.MoEConfig object at 0xffff10297b10>
[INFO] 2023-07-11 10:35:39,223 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:75] __init__: Now Running Task is: text_generation, Model is: glm_6b
[INFO] 2023-07-11 10:35:39,224 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:177] _check_global_batch_size_for_auto_parallel: The current parallel mode is semi_auto_parallel, full batch is True,so global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_interleave_num = 8 * 1 * 1 = 8
[INFO] 2023-07-11 10:35:39,224 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:514] training_process: .........Build Dataset For Train..........
[INFO] 2023-07-11 10:35:39,224 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:268] create_train_dataset: .........Build Dataset From Config..........
[INFO] 2023-07-11 10:35:39,224 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/dataset/causal_language_model_dataset.py:98] __new__: Now Create Causal Language Model Dataset.
[INFO] 2023-07-11 10:35:39,224 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/dataset/base_dataset.py:50] init_dataset_config: Now the semi auto parallel mode is used and full_batch is True,and the shuffle of the dataset is required to be False,so as to ensure that the data loaded on each card is consistent and to avoid the problem of non-convergence of loss.
[INFO] 2023-07-11 10:35:39,231 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/utils.py:133] check_runner_config: Will be Training epochs:1, sink_size:4
[INFO] 2023-07-11 10:35:39,231 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/utils.py:134] check_runner_config: Create training dataset finish, dataset size:125
[INFO] 2023-07-11 10:35:39,231 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:521] training_process: .........Build Net For Train..........
[INFO] 2023-07-11 10:35:39,231 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:282] create_network: .........Build Network From Config..........
[INFO] 2023-07-11 10:38:43,280 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/models/base_model.py:80] load_checkpoint: weights in /root/workspace/model/chatglm-convert/ms_glm_6b.ckpt are loaded
[INFO] 2023-07-11 10:38:43,299 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:425] count_parameters: Network Parameters: 6707 M.
[INFO] 2023-07-11 10:38:43,299 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:544] training_process: .........Build Optimizer For Train..........
[INFO] 2023-07-11 10:38:43,299 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:321] create_optimizer_scheduler: .........Build Optimizer From Config..........
[INFO] 2023-07-11 10:38:43,299 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:354] create_lr_scheduler: .........Build LR Schedule From Config..........
[WARNING] 2023-07-11 10:38:43,306 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/optimizer_grouped_parameters.py:74] get_optimizer_grouped_parameters: dynamic_lr_schedule will be reset and invalid when layer_scale is False.
...
[INFO] 2023-07-11 10:38:43,568 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:550] training_process: .........Build Running Wrapper From Config For Train..........
[INFO] 2023-07-11 10:38:43,568 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:391] create_model_wrapper: .........Build Model Wrapper for Train From Config..........
[INFO] 2023-07-11 10:38:43,582 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:562] training_process: .........Starting Init Train Model..........
[INFO] 2023-07-11 10:38:43,583 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:581] training_process: .........Build Callbacks For Train..........
[INFO] 2023-07-11 10:38:43,583 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:400] create_callbacks: .........Build Callbacks for Train From Config..........
[INFO] 2023-07-11 10:38:43,584 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/core/callback/callback.py:340] __init__: Integrated_save is changed to False when using auto_parallel.
[INFO] 2023-07-11 10:38:43,585 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:609] training_process: .........Starting Training Model..........
[INFO] 2023-07-11 10:38:43,585 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/trainer/base_trainer.py:610] training_process: .........Model Compiling, Please Wait a Moment...........
[INFO] 2023-07-11 10:47:36,427 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/core/callback/callback.py:269] print_output_info: Epoch:[ 1/ 1], step:[ 4/ 125], loss:[2.244/2.244], time:507844.205 ms, lr:[0.], overflow cond: True, loss_scale: 268435460.0
[INFO] 2023-07-11 10:47:37,342 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/core/callback/callback.py:146] epoch_end: Per sink_size step time: 533756.177 ms, per step time: 133439.044 ms, avg loss: 2.244
[INFO] 2023-07-11 10:47:44,861 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/core/callback/callback.py:269] print_output_info: Epoch:[ 1/ 1], step:[ 8/ 125], loss:[2.499/2.499], time:7480.938 ms, lr:[0.], overflow cond: True, loss_scale: 16777216.0
[INFO] 2023-07-11 10:47:44,874 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/core/callback/callback.py:146] epoch_end: Per sink_size step time: 7518.224 ms, per step time: 1879.556 ms, avg loss: 2.499
...
[INFO] 2023-07-11 10:48:35,199 [/root/workspace/code/mindformers/scripts/mf_parallel1/mindformers/core/callback/callback.py:146] epoch_end: Per sink_size st
```
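A few points when reading the log: the first sink step is much slower (507,844 ms) because it includes graph compilation, as flagged by the "Model Compiling" message; subsequent sink steps take about 7,518 ms for sink_size = 4 steps, i.e. roughly 7518 / 4 ≈ 1880 ms per step. The reported global batch size follows the formula printed in the log: global_batch_size = batch_size * data_parallel * micro_batch_interleave_num = 8 * 1 * 1 = 8.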