# Paddle HIgh reusability operator library (PHI) Design
Paddle HIgh reusability operator library (PHI), also called the 'functional operator library', supports implementing new operator kernels based on existing operator kernel functions and the 'Kernel Primitives API (KPS)', and supports plug-in access to new hardware and new acceleration libraries.
To solve the problems of the original operator library of the Paddle Fluid framework, such as unclear operator interfaces, high operator-reuse cost, and poor scheduling performance, we refactored the operator library of the Paddle framework and designed a flexible and efficient functional paradigm.
The PHI operator library can implement new operators by composing calls to functional operator interfaces, which greatly reduces the development cost of native and custom operators.
## 1. Background
> Introduce the problems to be solved in designing and building the PHI operator library
The PHI operator library project was initially launched to support the refactoring of the Paddle dynamic graph architecture, reducing scheduling overhead and improving the reusability of OpKernel development. However, we subsequently decided to take this opportunity to establish an operator library that can be used in both training and inference scenarios (including server-side and mobile-side scenarios) and to reduce the long-term cost of infrastructure development and operator maintenance in the Paddle ecosystem, so we expanded the scope of the project.
Specifically, the PHI operator library project aims to solve the following problems in Paddle.
### 1.1 Poor reusability between Op&OpKernel
Before version 2.3, the reusability between operators (Ops) in Paddle was relatively poor. Only a few backward Ops reused simple Ops by calling `SetType` in the `GradOpMaker` implementation. In most cases where an existing Op implementation could have been reused, the code was instead duplicated by copy-and-paste.
The root cause of poor reusability is the inflexibility of the original Op architecture design:
1. When an Op reuses the `OpKernel::Compute` method of another Op, an `ExecutionContext` needs to be constructed first, making the reuse method relatively cumbersome
> It would be much more convenient if the Kernel could be called directly as a function
2. Because of the overhead introduced by constructing additional data structures and scheduling an independent Op, it was better for computing performance to copy the calculation code directly when reusing an Op. This led us to gradually abandon the earlier principle of backward Ops reusing forward Ops and to implement a separate Kernel for each backward Op, leaving Paddle to maintain a large amount of backward OpKernel implementation code internally.
> Only when the overhead of reusing Ops is small enough can reusing existing Ops to implement new Ops be widely adopted
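The contrast above can be sketched as follows. The names are hypothetical, not PHI's actual signatures, but they show how function-form kernels make reuse a direct call with no `ExecutionContext` construction in between:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: when a kernel is a plain function, another kernel
// can reuse it with a direct call.
std::vector<float> AddKernel(const std::vector<float>& x,
                             const std::vector<float>& y) {
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + y[i];
  return out;
}

// A hypothetical composite kernel: out = x + y + z, implemented by
// reusing AddKernel instead of copying its loop.
std::vector<float> AddNKernel(const std::vector<float>& x,
                              const std::vector<float>& y,
                              const std::vector<float>& z) {
  return AddKernel(AddKernel(x, y), z);
}
```

When the reuse overhead is just a function call, there is no performance pressure to duplicate the loop body, which is the precondition for backward kernels reusing forward kernels at scale.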
### 1.2 Conciseness and fine-grained execution scheduling
#### 1.2.1 Dynamic graph
After the release of Paddle 2.0, we received much feedback from internal and external users that dynamic graph performance was several times lower than that of competing products when executing small models on CPU.
The main reason for this problem is that the C++-side execution path of the Paddle dynamic graph is relatively long and its scheduling overhead is relatively heavy. This is related to the early design of the dynamic graph, which was kept compatible with the static graph and inherited many of the object construction processes of static graph Ops.
Therefore, the dynamic graph needs to be upgraded to a function-based scheduling architecture, abandoning the original complex Op architecture. This in turn depends on the OpKernels being rewritten in a functional form.
#### 1.2.2 Static graph + IR
Our current static graph mode is not "static" enough. At present, the static graph mode still performs a lot of dynamic selection at runtime, for example, selecting the OpKernel and judging whether to copy data across devices. However, these decisions can actually be made while compiling the static graph network, fixing the execution process as a deterministic sequence of OpKernel executions with no dynamic judgment or selection, thereby further improving execution efficiency.
These optimizations rely on fine-grained OpKernels themselves: the existing complex, monolithic OpKernels need to be decoupled into small Kernels for specific scenarios and specific devices.
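The idea can be sketched as follows (hypothetical names, not the actual PHI registry): fine-grained kernels are registered per op and per backend, so kernel selection can be resolved once at graph-compilation time, and execution just calls the resolved function.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

using Kernel = std::function<float(float)>;

// Hypothetical registry keyed by "op/backend": each entry is a small,
// device-specific kernel rather than one large OpKernel that branches
// internally at every step.
std::map<std::string, Kernel>& KernelRegistry() {
  static std::map<std::string, Kernel> registry{
      {"relu/cpu", [](float x) { return x > 0.0f ? x : 0.0f; }},
      {"scale2/cpu", [](float x) { return 2.0f * x; }},
  };
  return registry;
}

// "Compile" step: resolve the kernel once; the runtime then invokes the
// returned function directly, with no per-step dynamic selection.
Kernel Resolve(const std::string& op, const std::string& backend) {
  return KernelRegistry().at(op + "/" + backend);
}
```

The design point is that once kernels are small and keyed by device, the lookup can move out of the hot execution loop entirely.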
### 1.3 Ease of use improvement requirements for custom operators
The new custom C++ external operator paradigm released in early 2021 is relatively intuitive at the level of interface and function writing. However, because we lacked C++ APIs for basic operations, developers implementing the computation logic of a custom Op, such as basic addition, subtraction, multiplication, division, and matrix operations, still had to reimplement them from scratch; Paddle's existing, optimized basic operations could not be reused, so development costs remained relatively high. In order to reuse the basic operations inside Paddle, the Op paradigm must be upgraded to a functional paradigm, and the corresponding C++ API system must be built.
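A minimal sketch of what such reuse looks like, using hypothetical stand-ins for a framework's basic C++ operations (not Paddle's actual API): the custom op's logic becomes a composition of existing primitives rather than hand-written arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a framework's reusable basic C++ operations.
std::vector<float> Add(const std::vector<float>& x,
                       const std::vector<float>& y) {
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + y[i];
  return out;
}

std::vector<float> Scale(const std::vector<float>& x, float s) {
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = s * x[i];
  return out;
}

// A custom op's computation written by composing basic operations:
// out = a * x + y, with no re-implemented loops in the custom op itself.
std::vector<float> AxpyCustomOp(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
  return Add(Scale(x, a), y);
}
```

With a real C++ API system, `Add` and `Scale` would be the framework's optimized operations, so the custom op inherits their performance for free.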
### 1.4 Build an integrated training and inference operator library to reduce the maintenance cost of inference operators
For a long time, because the Paddle and Paddle-Lite operators were maintained separately, a new Paddle operator had to be manually reimplemented in Paddle-Lite if Paddle-Lite needed it. Moreover, when a Paddle operator was upgraded, Paddle-Lite did not perceive the change in time, which could directly lead to bugs when Lite executed an inference model. This introduced high maintenance costs, and only a unified operator library can solve the problem in the long run.
Therefore, this functional operator library will be jointly constructed by the training and inference teams and will serve as an independent compilation component and underlying infrastructure (not yet independently split out), serving the training, server-side inference, and mobile-side inference execution systems at the same time.
### 1.5 Adaptation to the new inference runtime design `infrt`
The inference team designed a new runtime, `infrt`, which is expected to unify the execution systems of Paddle-Inference and Paddle-Lite and will need to directly call the operators in the jointly built PHI operator library. Therefore, adaptation to `infrt` needs to be considered in the design. (Currently the `infrt` project is temporarily on hold.)
### 1.6 Op and Kernel parameter normalization
The Python 2.0 API project in 2020 standardized the argument lists of Paddle's Python-side APIs, making them concise, easy to use, and standard. However, due to cost considerations, the argument lists at the Op level were not standardized, so many operators developed early on differ greatly in arguments from their Python APIs. For example, the `conv` Python API has only 8 arguments, but the corresponding C++ `Conv` Op has 29. An API and an Op are essentially the same layer of concepts; both describe an operation, and their arguments should be consistent. To alleviate this problem, the 'operator definition enhancement project' was launched, adding `AsExtra` and `AsQuant` declarations to some unnecessary arguments, but it did not fundamentally solve the problem; that is what the construction of the PHI operator library hopes to do.
We hope to align the arguments across the three layers of Python API -> Op (C++ API) -> Kernel API, so that the overall structure is clear and the reuse relationship between layers is explicit. Maintaining one set of official Python API documents can then satisfy the common reference requirements of all three API tiers, so we no longer need to maintain additional documentation systems, reducing maintenance costs.