GPipe: An Effective Solution for Large-Scale Model-Parallel Training
This document introduces GPipe, a library that uses micro-batch pipeline parallelism to scale large neural networks efficiently. GPipe can decompose any deep neural network expressible as a sequence of layers and execute the partitions on different accelerators. It introduces a novel pipeline parallelism algorithm with batch splitting that keeps gradient updates synchronous across multiple devices, achieving high hardware utilization while preserving training stability. The paper reports successful experiments applying GPipe to image classification and multilingual neural machine translation, demonstrating both its performance and its flexibility. For researchers and practitioners alike, it can substantially improve the efficiency of training deep models, especially at very large scale.
Target audience: researchers focused on neural networks, and members of application teams that need large-scale models.
Use cases and goals: scenarios that require going beyond the memory limit of a single accelerator to build larger and more complex machine learning models, especially for practitioners who want to scale training with GPU clusters or other accelerators. The goal is to maximize neural network capacity under limited hardware.
Additional notes: compared with existing approaches such as SPMD or other pipeline schemes, GPipe offers broader task applicability and lower communication overhead. Note, however, that the current version assumes each individual layer still fits within the memory of a single accelerator.
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Yanping Huang (huangyp@google.com)
Youlong Cheng (ylc@google.com)
Ankur Bapna (ankurbpn@google.com)
Orhan Firat (orhanf@google.com)
Mia Xu Chen (miachen@google.com)
Dehao Chen (dehao@google.com)
HyoukJoong Lee (hyouklee@google.com)
Jiquan Ngiam (jngiam@google.com)
Quoc V. Le (qvl@google.com)
Yonghui Wu (yonghui@google.com)
Zhifeng Chen (zhifengc@google.com)
Abstract
Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012; (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.
1 Introduction
Deep learning has seen great progress over the last decade, partially thanks to the development of
methods that have facilitated scaling the effective capacity of neural networks. This trend has been
most visible for image classification, as demonstrated by the accuracy improvements on ImageNet
with the increase in model capacity (Figure 1a). A similar phenomenon can also be observed in
the context of natural language processing (Figure 1b) where simple shallow models of sentence
representations [1, 2] are outperformed by their deeper and larger counterparts [3, 4].
While larger models have brought remarkable quality improvements to several fields, scaling neural
networks introduces significant practical challenges. Hardware constraints, including memory
limitations and communication bandwidths on accelerators (GPU or TPU), force users to divide larger
models into partitions and to assign different partitions to different accelerators. However, efficient model parallelism algorithms are extremely hard to design and implement, which often requires the practitioner to make difficult choices among scaling capacity, flexibility (or specificity to particular tasks and architectures), and training efficiency. As a result, most efficient model-parallel algorithms are architecture- and task-specific. With the growing number of applications of deep learning, there is an ever-increasing demand for reliable and flexible infrastructure that allows researchers to easily scale neural networks for a large variety of machine learning tasks.

Figure 1: (a) Strong correlation between top-1 accuracy on the ImageNet 2012 validation dataset [5] and model size for representative state-of-the-art image classification models in recent years [6, 7, 8, 9, 10, 11, 12]. There has been a 36× increase in the model capacity. The red dot depicts 84.4% top-1 accuracy for the 550M-parameter AmoebaNet model. (b) Average improvement in translation quality (BLEU) compared against bilingual baselines on our massively multilingual in-house corpus, with increasing model size. Each point, T(L, H, A), depicts the performance of a Transformer with L encoder and L decoder layers, a feed-forward hidden dimension of H, and A attention heads. The red dot depicts the performance of a 128-layer, 6B-parameter Transformer.
To address these challenges, we introduce GPipe, a flexible library that enables efficient training of
large neural networks. GPipe allows scaling arbitrary deep neural network architectures beyond the
memory limitations of a single accelerator by partitioning the model across different accelerators and
supporting re-materialization on every accelerator [13, 14]. With GPipe, each model can be specified
as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is
then placed on a separate accelerator. Based on this partitioned setup, we propose a novel pipeline
parallelism algorithm with batch splitting. We first split a mini-batch of training examples into
smaller micro-batches, then pipeline the execution of each set of micro-batches over cells. We apply
synchronous mini-batch gradient descent for training, where gradients are accumulated across all
micro-batches in a mini-batch and applied at the end of a mini-batch. Consequently, gradient updates
using GPipe are consistent regardless of the number of partitions, allowing researchers to easily train
increasingly large models by deploying more accelerators. GPipe can also be complemented with
data parallelism to further scale training.
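As a sanity check on this consistency property, the following minimal NumPy sketch (illustrative only, not GPipe code; the toy linear model and all names are invented) shows that summing per-micro-batch gradients under fixed parameters reproduces the full mini-batch gradient exactly, so the resulting update does not depend on the number of micro-batches:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 32, 8, 4                      # mini-batch size, feature dim, number of micro-batches
x, y = rng.normal(size=(N, D)), rng.normal(size=N)
w = rng.normal(size=D)                  # toy linear model standing in for a partitioned network

def grad_sum(xb, yb, w):
    """Gradient of the summed squared error 0.5 * sum((xb @ w - yb) ** 2) w.r.t. w."""
    return xb.T @ (xb @ w - yb)

# Gradient of the mean loss over the whole mini-batch.
full_grad = grad_sum(x, y, w) / N

# Micro-batch accumulation: every micro-batch uses the same parameters w,
# gradients are summed and applied once at the end of the mini-batch.
acc = np.zeros_like(w)
for xb, yb in zip(np.split(x, M), np.split(y, M)):
    acc += grad_sum(xb, yb, w)
acc /= N

assert np.allclose(full_grad, acc)      # identical update, independent of M
```

The same argument carries over to any loss that is a sum or average over examples, which is why the synchronous updates described above remain consistent as the number of partitions and micro-batches changes.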
We demonstrate the flexibility and efficiency of GPipe on image classification and machine translation.
For image classification, we train the AmoebaNet model on 480 × 480 input from the ImageNet 2012 dataset. By increasing the model width, we scale up the number of parameters to 557 million and
achieve a top-1 validation accuracy of 84.4%. On machine translation, we train a single 128-layer
6-billion-parameter multilingual Transformer model on 103 languages (102 languages to English).
We show that this model is capable of outperforming the individually trained 350-million-parameter
bilingual Transformer Big [15] models on 100 language pairs.
2 The GPipe Library
We now describe the interface and the main design features of GPipe. This open-source library is implemented under the Lingvo [16] framework. The core design features of GPipe are generally applicable and can be implemented for other frameworks [17, 18, 19].
Figure 2: (a) An example neural network with sequential layers is partitioned across four accelerators.
F_k is the composite forward computation function of the k-th cell. B_k is the back-propagation function, which depends on both B_{k+1} from the upper layer and F_k. (b) The naive model parallelism strategy leads to severe under-utilization due to the sequential dependency of the network. (c) Pipeline parallelism divides the input mini-batch into smaller micro-batches, enabling different accelerators to work on different micro-batches simultaneously. Gradients are applied synchronously at the end.
2.1 Interface
Any deep neural network can be defined as a sequence of L layers. Each layer L_i is composed of a forward computation function f_i, and a corresponding set of parameters w_i. GPipe additionally allows the user to specify an optional computation cost estimation function, c_i. With a given number of partitions K, the sequence of L layers can be partitioned into K composite layers, or cells. Let p_k consist of consecutive layers between layers i and j. The set of parameters corresponding to p_k is equivalent to the union of w_i, w_{i+1}, ..., w_j, and its forward function would be F_k = f_j ∘ ... ∘ f_{i+1} ∘ f_i. The corresponding back-propagation function B_k can be computed from F_k using automatic symbolic differentiation. The cost estimator C_k is set to Σ_{l=i}^{j} c_l.
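As a toy illustration of these definitions (a sketch with invented layer definitions, not the GPipe implementation), a cell's composite forward function can be built by ordinary function composition and its cost estimate by summing the per-layer estimates:

```python
from functools import reduce

# A "layer" here is just (forward_fn, params, cost_estimate), mirroring (f_i, w_i, c_i).
layers = [
    (lambda x: x + 1, {"w1": 1.0}, 2.0),
    (lambda x: x * 3, {"w2": 3.0}, 5.0),
    (lambda x: x - 4, {"w3": 4.0}, 1.0),
]

def make_cell(cell_layers):
    """Build F_k = f_j ∘ ... ∘ f_i, the union of the parameters, and C_k = sum of c_l."""
    forward = lambda x: reduce(lambda acc, layer: layer[0](acc), cell_layers, x)
    params = [p for _, p, _ in cell_layers]
    cost = sum(c for _, _, c in cell_layers)
    return forward, params, cost

F_k, w_k, C_k = make_cell(layers)
assert F_k(2.0) == ((2.0 + 1) * 3) - 4   # layers applied in order: f_1, then f_2, then f_3
assert C_k == 8.0
```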
The GPipe interface is extremely simple and intuitive, requiring the user to specify: (i) the number of model partitions K, (ii) the number of micro-batches M, and (iii) the sequence and definitions of L layers that define the model. Please refer to supplementary material for examples.
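The concrete API is deferred to the supplementary material, so the snippet below is only a hypothetical sketch of an interface taking the same three inputs; the `PipelinedModel` class and its argument names are invented for illustration and are not the real GPipe/Lingvo API:

```python
# Hypothetical interface sketch; `PipelinedModel` is not the actual GPipe/Lingvo API.
class PipelinedModel:
    def __init__(self, layers, num_partitions, num_micro_batches, cost_fns=None):
        self.layers = list(layers)      # (iii) ordered layer definitions (f_i, w_i)
        self.K = num_partitions         # (i) number of model partitions
        self.M = num_micro_batches      # (ii) number of micro-batches
        self.cost_fns = cost_fns        # optional per-layer cost estimators c_i

# Three toy layers standing in for real network layers.
toy_layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
model = PipelinedModel(toy_layers, num_partitions=2, num_micro_batches=4)
```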
2.2 Algorithm
Once the user defines the sequence of layers in their network in terms of model parameters w_i, forward computation function f_i, and the cost estimation function c_i, GPipe partitions the network into K cells and places the k-th cell on the k-th accelerator. Communication primitives are automatically inserted at partition boundaries to allow data transfer between neighboring partitions. The partitioning algorithm minimizes the variance in the estimated costs of all cells in order to maximize the efficiency of the pipeline by syncing the computation time across all partitions.
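The excerpt states only the objective (minimizing the variance of the estimated cell costs), not the concrete procedure, so the following greedy scan is just an illustrative stand-in that cuts the layer sequence into K contiguous cells with roughly balanced cost estimates:

```python
def partition_layers(costs, K):
    """Greedy stand-in for cost-balanced partitioning: split the layer sequence
    into K contiguous cells whose total estimated costs stay close to the average."""
    target = sum(costs) / K
    cells, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        remaining_layers = len(costs) - i - 1
        # Close a cell once it reaches the target, keeping at least one layer
        # available for each remaining cell.
        if len(cells) < K - 1 and acc >= target and remaining_layers >= K - len(cells) - 1:
            cells.append(current)
            current, acc = [], 0.0
    cells.append(current)
    return cells

# Example: 8 layers with uneven cost estimates split into K = 4 cells.
print(partition_layers([1, 1, 4, 2, 2, 1, 3, 2], K=4))   # [[0, 1, 2], [3, 4], [5, 6], [7]]
```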
During the forward pass, GPipe first divides every mini-batch of size N into M equal micro-batches, which are pipelined through the K accelerators. During the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass. At the end of each mini-batch, gradients from all M micro-batches are accumulated and applied to update the model parameters across all accelerators. This sequence of operations is illustrated in Figure 2c.
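To make the schedule concrete, the small simulation below (illustrative only, not GPipe code) enumerates the forward-pass pipeline: at clock step t, cell k works on micro-batch t − k, so after a fill-up phase of K − 1 steps all K accelerators are busy, and the M micro-batches finish in M + K − 1 steps rather than the M × K micro-batch steps a fully sequential execution would take:

```python
def forward_schedule(num_micro_batches, num_cells):
    """Return, for each clock step, which micro-batch each cell works on (None = idle)."""
    M, K = num_micro_batches, num_cells
    steps = []
    for t in range(M + K - 1):               # pipeline fills, streams, then drains
        steps.append([t - k if 0 <= t - k < M else None for k in range(K)])
    return steps

for t, row in enumerate(forward_schedule(num_micro_batches=4, num_cells=3)):
    labels = ["idle" if mb is None else f"mb{mb}" for mb in row]
    print(f"step {t}: " + "  ".join(f"cell{k}:{s}" for k, s in enumerate(labels)))
```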
If batch normalization [20] is used in the network, the sufficient statistics of inputs during training are computed over each micro-batch and over replicas if necessary [21]. We also track the moving average of the sufficient statistics over the entire mini-batch to be used during evaluation.
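The excerpt does not show how this bookkeeping is implemented, so the sketch below makes assumptions (a single scalar feature, an exponential moving average, invented names): during training each micro-batch is normalized with its own statistics, while statistics computed over the entire mini-batch are folded into running averages that would be used at evaluation time:

```python
import numpy as np

running_mean, running_var, momentum = 0.0, 1.0, 0.9

def train_step(micro_batches):
    """Normalize each micro-batch with its own statistics (used in the forward pass),
    then fold the whole mini-batch's statistics into the evaluation-time moving averages."""
    global running_mean, running_var
    normalized = []
    for mb in micro_batches:
        mu, var = mb.mean(), mb.var()              # sufficient statistics per micro-batch
        normalized.append((mb - mu) / np.sqrt(var + 1e-5))
    full = np.concatenate(micro_batches)           # statistics over the entire mini-batch
    running_mean = momentum * running_mean + (1 - momentum) * full.mean()
    running_var = momentum * running_var + (1 - momentum) * full.var()
    return normalized

rng = np.random.default_rng(0)
mini_batch = rng.normal(loc=2.0, scale=3.0, size=32)
train_step(np.split(mini_batch, 4))
print(running_mean, running_var)                   # used in place of batch statistics at eval time
```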