## pytorch-distributed-training
Distributed DataParallel (DDP) training on PyTorch
### Features
* Simple examples for learning DDP training
* You can copy this code directly for a quick start
* Learning notes (checked items are finished):
- [x] [Basic Theory](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/0.%20Basic%20Theory.md)
- [x] [Pytorch Gradient Accumulation](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/1.%20Gradient%20Accumulation.md)
- [x] [More Details of DDP Training](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/2.%20DDP%20Training%20Details.md)
- [x] [DDP training with apex](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/4.%20DDP%20with%20apex.md)
- [ ] [Accelerate-on-Accelerate DDP Training Tricks](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/3.%20DDP%20Training%20Tricks.md)
  - [ ] [DP and DDP Source Code Walkthrough](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/5.%20DP%20and%20DDP.md)
### Good Notes
Some high-quality notes from around the web:
- [Distributed Training (Theory)](https://zhuanlan.zhihu.com/p/129912419)
- [Parallel Training Methods Every Graduate Student Should Know (Single Machine, Multi-GPU)](https://zhuanlan.zhihu.com/p/98535650)
### TODO
- [ ] Finish the DP and DDP source code walkthrough notes (currently about 50% done)
- [ ] Polish the code details and reproduce the experimental results
### Quick start
To run the examples directly and check the results, use the commands below. Be sure to pass `--ip` and `--port` to specify the master node's IP address and a free port, otherwise the script may fail to start.
- [dataparallel.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/dataparallel.py)
```bash
$ python dataparallel.py --gpu 0,1,2,3
```
- [distributed.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
```
- [distributed_mp.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_mp.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py
```
- [distributed_apex.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_apex.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_apex.py
```
- `--ip=str`, e.g. `--ip='10.24.82.10'`, specifies the IP address of the master process
- `--port=int`, e.g. `--port=23456`, specifies the port to listen on
- `--batch_size=int`, e.g. `--batch_size=128`, sets the training batch size
- [distributed_gradient_accumulation.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_gradient_accumulation.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_gradient_accumulation.py
```
- `--ip=str`, e.g. `--ip='10.24.82.10'`, specifies the IP address of the master process
- `--port=int`, e.g. `--port=23456`, specifies the port to listen on
- `--grad_accu_steps=int`, e.g. `--grad_accu_steps=4`, sets the number of gradient accumulation steps (see the sketch below)
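For reference, a minimal sketch of how gradient accumulation typically looks inside a DDP training loop. Variable names such as `model`, `criterion`, `optimizer`, `trainloader`, and `grad_accu_steps` are illustrative assumptions, not copied from `distributed_gradient_accumulation.py`:
```python
# Illustrative sketch: take an optimizer step once every grad_accu_steps mini-batches.
optimizer.zero_grad()
for batch_idx, (images, target) in enumerate(trainloader):
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    loss = criterion(model(images), target) / grad_accu_steps  # scale so the accumulated gradient matches one large batch
    loss.backward()                                            # gradients accumulate in param.grad
    if (batch_idx + 1) % grad_accu_steps == 0:
        optimizer.step()        # update with the accumulated gradient
        optimizer.zero_grad()   # clear for the next accumulation window
```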
### Comparison
These numbers are rough; results may vary considerably depending on GPU state.
All runs use `SyncBatchNorm` by default, which is somewhat slower because extra inter-process communication is needed to compute the `BatchNorm` statistics, but it helps preserve accuracy (see the sketch after the table below).
#### Concepts
- [apex](https://github.com/NVIDIA/apex)
- DP: `DataParallel`
- DDP: `DistributedDataParallel`
#### Environments
- 4 × 2080Ti
|model|dataset|training method|time (seconds/epoch)|Top-1 accuracy|
|:---:|:---:|:---:|:---:|:---:|
|resnet18|cifar100|DP|20s| |
|resnet18|cifar100|DP+apex|18s| |
|resnet18|cifar100|DDP|16s| |
|resnet18|cifar100|DDP+apex|14.5s| |
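For reference, `SyncBatchNorm` can be enabled by converting the model's `BatchNorm` layers before wrapping it in DDP; a minimal sketch using the standard PyTorch API (`model` and `local_rank` are assumed to be defined as in the usage section below):
```python
import torch

# Replace every BatchNorm layer with SyncBatchNorm so running statistics
# are synchronized across all processes, then wrap the model with DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = model.cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```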
### Basic Concept
- group: the process group; by default there is only one process group.
- world size: the total number of processes
  - e.g. 16 GPUs with `one process per GPU`: world size = 16
  - `8 GPUs driven by a single process`: world size = 1
  - the program only starts running once the number of connected processes equals the world size
- rank: the index of a process, used for inter-process communication; `rank=0` denotes the `master process`
- local_rank: the `GPU` index within a node; it is not set by the user but assigned internally by `torch.distributed.launch`; `rank=3, local_rank=0` means `GPU 0` used by process `3`
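These values can be queried at runtime once the process group has been initialized; a minimal sketch using the standard `torch.distributed` API:
```python
import torch.distributed as dist

# Valid only after dist.init_process_group(...) has been called in this process.
world_size = dist.get_world_size()  # total number of processes in the default group
rank = dist.get_rank()              # global index of this process; 0 is the master process
if rank == 0:
    print(f'process group initialized with {world_size} processes')
```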
### Usage (single machine, multi-GPU)
#### 1. Get the index of the current process
PyTorch's `torch.distributed.launch` launcher runs a `.py` file as multiple distributed processes from the command line, passing the index of the current process to each Python process as an argument:
```python
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()
print(args.local_rank)
```
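When launched with `torch.distributed.launch`, one process is started per GPU and each receives a different `--local_rank`, so the snippet above prints the values `0` through `nproc_per_node - 1`, one per process. For example, assuming the snippet is saved as `print_local_rank.py` (a hypothetical file name):
```bash
$ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 print_local_rank.py
```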
#### 2. 定义 main_worker 函数
The main training logic lives in the main_worker function, which takes three arguments (the last one is optional):
```python
def main_worker(local_rank, nprocs, args):
    ...  # training logic goes here
```
- local_rank: the rank of the current process; on a single machine with multiple GPUs it is also the index of the GPU to use
- nprocs: the number of processes
- args: your own extra arguments
main_worker is essentially the function every process runs (every process executes the same function body; only the `local_rank` passed in differs). A spawn-based launch sketch follows below.
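For the `distributed_mp.py`-style launch, the per-process index can instead be supplied by `torch.multiprocessing.spawn`, which calls the target function with the process index as its first argument; a minimal sketch (not the repo's exact code, `args` is assumed to come from an argparse parser and `main_worker` is defined as above):
```python
import torch
import torch.multiprocessing as mp

def main(args):
    nprocs = torch.cuda.device_count()
    # Start nprocs processes; process i runs main_worker(i, nprocs, args),
    # so the first argument plays the role of local_rank.
    mp.spawn(main_worker, args=(nprocs, args), nprocs=nprocs, join=True)
```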
#### 3. main_worker函数中的整体流程
The complete training flow inside main_worker:
```python
import torch
import torch.distributed as dist
import torch.backends.cudnn as cudnn


def main_worker(local_rank, nprocs, args):
    args.local_rank = local_rank

    # Distributed initialization: every process must call init_process_group
    cudnn.benchmark = True
    dist.init_process_group(backend='nccl', init_method='tcp://ip:port',
                            world_size=nprocs, rank=local_rank)

    # Define the model, loss function and optimizer
    model = ...
    criterion = ...
    optimizer = ...

    # Bind this process to its GPU
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)

    # Wrap the model with DistributedDataParallel
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Datasets: use DistributedSampler so each process sees a different shard
    mini_batch_size = args.batch_size // nprocs  # split the global batch_size into per-process mini-batches
    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=mini_batch_size,
                                              num_workers=..., pin_memory=...,
                                              sampler=train_sampler)
    test_dataset = ...
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
    testloader = torch.utils.data.DataLoader(test_dataset, batch_size=mini_batch_size,
                                             num_workers=..., pin_memory=...,
                                             sampler=test_sampler)

    # Regular training loop
    for epoch in range(300):
        model.train()
        train_sampler.set_epoch(epoch)  # reshuffle the data sharding every epoch
        for batch_idx, (images, target) in enumerate(trainloader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            pred = model(images)
            loss = criterion(pred, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
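Because each process only sees its own shard of the data, metrics such as loss and accuracy are usually averaged across processes before logging; a minimal sketch of such a helper built on `dist.all_reduce` (a generic utility, not copied from this repo's `utils/`):
```python
import torch.distributed as dist

def reduce_mean(tensor, nprocs):
    """Average a tensor (e.g. a scalar loss) over all processes."""
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)  # sum the per-process values in place
    rt /= nprocs                               # divide by the number of processes to get the mean
    return rt
```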
#### 4. 定义main函数
```python
import argparse

import torch

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size', '--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)


def main_worker(local_rank, nprocs, args):
    ...


def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # run main_worker in this process; torch.distributed.launch starts one such process per GPU
    main_worker(args.local_rank, args.nprocs, args)


if __name__ == '__main__':
    main()
```
#### 5. Launch from the command line
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
```
- `--ip=str`, e.g. `--ip='10.24.82.10'`, specifies the IP address of the master process
- `--port=int`, e.g. `--port=23456`, specifies the port to listen on (see the sketch below for how these flags can feed `init_process_group`)
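For reference, a minimal sketch of how `--ip` and `--port` can be combined into the `init_method` URL passed to `dist.init_process_group` (the argument names and defaults here are illustrative assumptions, not copied from the repo's scripts):
```python
import argparse

import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--ip', default='127.0.0.1', type=str)   # IP address of the master process (illustrative default)
parser.add_argument('--port', default=23456, type=int)       # a free port on the master node
parser.add_argument('--local_rank', default=-1, type=int)
args = parser.parse_args()

# Build the TCP rendezvous address from the flags and join the process group.
dist.init_process_group(backend='nccl',
                        init_method=f'tcp://{args.ip}:{args.port}',
                        world_size=4,              # must match --nproc_per_node in the launch command
                        rank=args.local_rank)
```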