# Analyzer
# Overview
The Analyzer is a collection of static graph utilities, including Colossal-AI FX. Its features include:
- ``MetaTensor`` -- enabling:
  - Ahead-of-time profiling
  - Shape propagation
  - Ideal FLOP counting
- ``symbolic_trace()``
  - Robust control-flow tracing / recompile
  - Robust activation checkpoint tracing / CodeGen
  - Easy-to-define bias-addition split
- ``symbolic_profile()``
  - Support for ``MetaTensorMode``, where all tensor operations are executed symbolically
  - Shape inference across devices and a unified ``MetaInfo``
  - Ideal FLOP counter (https://dev-discuss.pytorch.org/t/the-ideal-pytorch-flop-counter-with-torch-dispatch/505)
# Quickstart
## Analyzer.FX
**Reference:**
https://pytorch.org/docs/stable/fx.html [[paper](https://arxiv.org/pdf/2112.08429)]
torch.FX is a toolkit for developers to transform ``nn.Module`` instances. FX consists of three main components: a symbolic tracer, an intermediate representation, and Python code generation. ``FX.Tracer`` hacks ``__torch_function__`` and uses a ``Proxy`` object to propagate through the forward function of any ``torch.nn.Module``.
![image](https://user-images.githubusercontent.com/78588128/212531495-bbb934dd-dbbb-4578-8869-6171973f7dd8.png)
Colossal-AI FX is modified from torch.FX, with the extra capability of ahead-of-time profiling enabled by its ``MetaTensor`` subclass.
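For instance, here is a minimal sketch of what symbolic execution under ``MetaTensorMode`` looks like (the import path is an assumption; see ``_subclasses/meta_tensor.py``): tensors carry shape and dtype but no storage, so nothing is allocated on any device.
```python
import torch

from colossalai._analyzer._subclasses import MetaTensorMode  # assumed import path

with MetaTensorMode():
    x = torch.rand(100, 3, 224, 224)   # no real memory is allocated
    w = torch.rand(64, 3, 7, 7)
    y = torch.nn.functional.conv2d(x, w, stride=2, padding=3)

print(y.shape)  # torch.Size([100, 64, 112, 112]) -- shapes propagate symbolically
```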
### Analyzer.FX.symbolic_trace()
A drawback of the original torch.FX implementation is that it handles control flow poorly. Control flow is not a PyTorch-native operand, and deciding which branch to execute requires concrete tensor instances. For example,
```python
import torch
from torch import nn

class MyModule(nn.Module):
    def forward(self, x):
        if x.dim() == 3:
            return x * 2 + 1
        else:
            return x - 5
```
The above function has the computation graph of
![image](https://user-images.githubusercontent.com/78588128/212532631-dba30734-577b-4418-8dc9-004d7983abc5.png)
However, since a ``Proxy`` carries no concrete data, calling ``x.dim()`` on it returns nothing meaningful. In the context of an auto-parallel system, at least the control-flow dependencies on tensor shape should be removed, since any searched strategy can only auto-parallelize a specific computation graph with a fixed tensor shape. It is natural to attach concrete data to a ``Proxy`` and propagate it through control flow.
![image](https://user-images.githubusercontent.com/78588128/212533403-1b620986-1c3a-420a-87c6-d08c9702135d.png)
With ``MetaTensor``, the computation during shape propagation can be virtualized. This speeds up tracing by avoiding allocating actual memory on devices.
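For example, passing ``meta_args`` to the Analyzer's ``symbolic_trace`` (see the Quickstart below) resolves the branch at trace time. A minimal sketch, with an assumed import path:
```python
from colossalai._analyzer.fx import symbolic_trace  # assumed import path

# A 3-D meta input concretizes ``x.dim()``, so the tracer takes the first
# branch without allocating real data.
meta_args = dict(x=torch.rand(3, 224, 224))
gm = symbolic_trace(MyModule(), meta_args=meta_args)
print(gm.code)  # the generated forward contains only the ``x * 2 + 1`` branch
```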
#### Remarks
There is no free lunch for PyTorch in unifying all operands across its own repo and the other repos in its ecosystem. For example, the einops library currently has no intention of supporting torch.FX (see https://github.com/arogozhnikov/einops/issues/188). To support different PyTorch-based libraries without modifying their source code, a good practice is to let users register implementations that substitute for functions torch.FX does not support, or to avoid entering incompatible submodules altogether.
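With vanilla torch.FX, one such escape hatch is ``torch.fx.wrap``, which records a call as a single opaque ``call_function`` node instead of tracing into it. A minimal sketch (the Analyzer may provide its own registration mechanism on top of this):
```python
import torch
import torch.fx

def rearrange_like_op(x):
    # stand-in for a function (e.g. from einops) that torch.FX cannot trace through
    return x.flatten(1)

torch.fx.wrap("rearrange_like_op")  # keep this call as a leaf in the graph

class Wrapper(torch.nn.Module):
    def forward(self, x):
        return rearrange_like_op(x)

gm = torch.fx.symbolic_trace(Wrapper())
print(gm.graph)  # rearrange_like_op appears as a single call_function node
```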
### Analyzer.FX.symbolic_profile()
``symbolic_profile`` is another important feature of Colossal-AI's auto-parallel system. Profiling a DNN can be costly, as it requires allocating memory and executing on real devices. However, for auto-parallel purposes it is enough to detect when and where the intermediate activations (i.e., tensors) are generated, so the whole procedure can be profiled without actually executing it. ``symbolic_profile``, as its name implies, profiles the whole network with symbolic information only.
```python
with MetaTensorMode():
    model = MyModule().cuda()
    sample = torch.rand(100, 3, 224, 224).cuda()
    meta_args = dict(
        x=sample,
    )
    gm = symbolic_trace(model, meta_args=meta_args)
    gm = symbolic_profile(gm, sample)
```
``symbolic_profile`` is enabled by ``ShapeProp`` and ``GraphProfiler``.
#### ShapeProp
Both the Tensor Parallel and Activation Checkpoint solvers need to know the shape information ahead of time. Unlike PyTorch's implementation, this ``ShapeProp`` can be executed under ``MetaTensorMode``. With this, all the preparation for the auto-parallel solvers can be done in milliseconds.
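For reference, PyTorch's stock shape propagation follows the pattern below; the Analyzer's ``ShapeProp`` is assumed to expose a similar interface while additionally accepting meta tensors:
```python
from torch.fx.passes.shape_prop import ShapeProp  # PyTorch's reference implementation

ShapeProp(gm).propagate(sample)
for node in gm.graph.nodes:
    # after propagation, each node carries shape/dtype metadata
    print(node.name, node.meta.get('tensor_meta'))
```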
Meanwhile, it is easy to keep track of the memory usage of each node during shape propagation. However, a drawback of FX is that not every ``call_function`` saves its input for backward, and different tensors flowing through one ``FX.Graph`` can actually share the same layout. This raises problems for fine-grained profiling.
![image](https://user-images.githubusercontent.com/78588128/215312957-7eb6cbc3-61b2-49cf-95a4-6b859149eb8d.png)
To address this problem, I came up with a simulated environment enabled by ``torch.autograd.graph.saved_tensors_hooks`` and a fake ``data_ptr`` (check ``_subclasses/meta_tensor.py`` for more details of the ``data_ptr`` updates).
```python
from typing import Optional

import torch
from torch.autograd.graph import saved_tensors_hooks


class sim_env(saved_tensors_hooks):
    """
    A simulation of memory allocation and deallocation in the forward pass
    using ``saved_tensors_hooks``.

    Attributes:
        ctx (Dict[int, torch.Tensor]): A dictionary that maps the
            data pointer of a tensor to the tensor itself. This is used
            to track the memory allocation and deallocation.

        param_ctx (Dict[int, torch.Tensor]): A dictionary that maps the
            data pointer of all model parameters to the parameter itself.
            This avoids overestimating the memory usage of the intermediate activations.
    """

    def __init__(self, module: Optional[torch.nn.Module] = None):
        super().__init__(self.pack_hook, self.unpack_hook)
        self.ctx = {}
        # guard against ``module=None`` so both dicts stay consistent
        self.param_ctx = {param.data_ptr(): param for param in module.parameters()} if module else {}
        self.buffer_ctx = {buffer.data_ptr(): buffer for buffer in module.buffers()} if module else {}

    def pack_hook(self, tensor: torch.Tensor):
        # only count tensors that are neither parameters nor buffers
        if tensor.data_ptr() not in self.param_ctx and tensor.data_ptr() not in self.buffer_ctx:
            self.ctx[tensor.data_ptr()] = tensor
        return tensor

    def unpack_hook(self, tensor):
        return tensor
```
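A minimal usage sketch (the model and input are placeholders): run a forward pass under the hooks, then read the saved activations off ``ctx``:
```python
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())  # any nn.Module
env = sim_env(module=model)
with env:
    out = model(torch.rand(4, 3, 224, 224))

saved_bytes = sum(t.numel() * t.element_size() for t in env.ctx.values())
print(f"saved activations: {saved_bytes / 2**20:.2f} MiB")
```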
The ``ctx`` variable keeps track of all saved tensors with a unique identifier. Since ``nn.Parameter`` instances would otherwise also be counted in ``ctx``, which is not desired, ``param_ctx`` keeps track of all parameters in the model; ``buffer_ctx`` likewise keeps track of all buffers. The ``local_ctx`` attached to each ``Node`` marks the memory usage of the stage to which the node belongs. With simple ``intersect``, ``union`` and ``subtract`` operations over these contexts, we can derive any memory-related information (see the sketch after the next code block).

For non-profileable nodes, you can add customized profile rules to simulate the memory allocation. If a ``Graph`` is modified with non-PyTorch functions, such as fused operators, you can register the shape propagation rule with the decorator.
```python
@register_shape_impl(fuse_conv_bn)
def fuse_conv_bn_shape_impl(*args, **kwargs):
    # infer output shape here
    return torch.empty(output_shape, device=output_device)
```
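To illustrate the set algebra mentioned above, here is a hypothetical sketch that assumes each node's ``local_ctx`` maps a ``data_ptr`` to the saved tensor:
```python
from typing import Dict

import torch

def newly_saved_bytes(prev_ctx: Dict[int, torch.Tensor],
                      curr_ctx: Dict[int, torch.Tensor]) -> int:
    """Bytes saved by the current stage but not the previous one (``subtract``)."""
    fresh = curr_ctx.keys() - prev_ctx.keys()
    return sum(curr_ctx[ptr].numel() * curr_ctx[ptr].element_size() for ptr in fresh)

def coexisting_bytes(a_ctx: Dict[int, torch.Tensor],
                     b_ctx: Dict[int, torch.Tensor]) -> int:
    """Bytes alive if both stages' activations coexist (``union``)."""
    merged = {**a_ctx, **b_ctx}
    return sum(t.numel() * t.element_size() for t in merged.values())
```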
Note that ``ShapeProp`` attaches additional information to the graph, which is exactly the input of the ``Profiler``.
#### GraphProfiler
``GraphProfiler`` executes at the node level, and profiles both forward and backward within one node. For example, ``FlopProfiler`` will profile the forward and backward FLOPs of a node, and ``CommunicationProfiler`` will profile the forward and backward communication cost of a node. The ``GraphProfiler`` will attach the profiling results to the ``Node``. These procedures are decoupled for better extensibility.
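Since the attachment keys are not specified here, the following inspection is purely illustrative (``'fwd_flop'`` and ``'bwd_flop'`` are assumed names):
```python
# Hypothetical: the key names are illustrative only, not the Analyzer's actual API.
for node in gm.graph.nodes:
    print(node.name, node.meta.get('fwd_flop'), node.meta.get('bwd_flop'))
```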
To get a general view of the profiled results, you can set ``verbose=True`` to print a summary as well.
```python
import torch
import torchvision.models as tm

model = tm.resnet18()
sample = torch.rand(100, 3, 224, 224)
meta_args = dict(x=sample)
gm = symbolic_trace(model, meta_args=meta_args)
gm = symbolic_profile(gm, sample, verbose=True)
```