# Paddle HIgh reusability operator library (PHI) Design
Paddle HIgh reusability operator library (PHI), also called the 'functional operator library', supports implementing new operator kernels based on existing operator kernel functions and the 'Kernel Primitives API (KPS)', and supports plug-in access to new hardware and new acceleration libraries.
To solve the problems of the original operator library of the Paddle Fluid framework, namely unclear operator interfaces, high operator-reuse cost, and poor scheduling performance, we refactored the operator library of the Paddle framework and designed a flexible and efficient functional paradigm.
The PHI operator library can implement new operators by combining calls to functional operator interfaces, which greatly reduces the development cost of both native and custom operators.
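To make the idea concrete, here is a minimal sketch of such a composition, assuming simplified PHI-style kernel signatures (`MatmulKernel`, `AddKernel`, and the header paths are illustrative and may differ from the actual library):

```cpp
// Illustrative header paths; actual locations may differ.
#include "paddle/phi/core/dense_tensor.h"
#include "paddle/phi/kernels/elementwise_add_kernel.h"
#include "paddle/phi/kernels/matmul_kernel.h"

namespace phi {

// A hypothetical "linear" kernel built by reusing existing functional
// kernels: out = x * w + bias.
template <typename T, typename Context>
void LinearKernel(const Context& dev_ctx,
                  const DenseTensor& x,
                  const DenseTensor& w,
                  const DenseTensor& bias,
                  DenseTensor* out) {
  DenseTensor tmp;
  // Kernels are plain functions, so reuse is a direct call; no Op or
  // ExecutionContext needs to be constructed.
  MatmulKernel<T, Context>(
      dev_ctx, x, w, /*transpose_x=*/false, /*transpose_y=*/false, &tmp);
  AddKernel<T, Context>(dev_ctx, tmp, bias, out);
}

}  // namespace phi
```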
## 1. Background
> Introduce the problems to be solved in designing and building the PHI operator library
The PHI operator library project was initially launched to support the refactoring of the Paddle dynamic graph architecture, reducing scheduling overhead and improving the reusability of OpKernel development. However, we subsequently decided to take this opportunity to establish an operator library that can be used in both training and inference scenarios (including server-side and mobile-side scenarios) and to reduce the long-term cost of infrastructure development and operator maintenance in the Paddle ecosystem, so we expanded the target scope of the project.
Specifically, the PHI operator library project is expected to solve the following problems of Paddle.
### 1.1 Poor reusability between Op&OpKernel
Before version 2.3, the reusability between Operators (Ops) in Paddle was relatively poor. Only a few backward Ops reused simple Ops by calling `SetType` in their `GradOpMaker` implementation. In most cases where an existing Op implementation could have been reused, the code was duplicated by copy-and-paste instead.
The root cause of poor reusability is the inflexibility of the original Op architecture design:
1. When an Op reuses the `OpKernel::Compute` method of another Op, an `ExecutionContext` needs to be constructed first, so the reuse method is relatively cumbersome (see the sketch after this list)
> It would be much more convenient to call the Kernel directly in the form of a function
2. Due to the overhead introduced by additional data structure construction and independent Op scheduling, it is better for computing performance to copy the calculation code directly when reusing an Op. This led us to gradually abandon the earlier principle that backward Ops reuse forward Ops and to implement a separate Kernel for each backward Op, so Paddle internally maintains a large amount of backward OpKernel implementation code.
> Only when the overhead of reusing Ops is small enough can reusing existing Ops to implement new Ops be widely adopted
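To show the difference concretely, the sketch below contrasts the two reuse styles; the names and signatures are simplified assumptions rather than the exact Paddle interfaces:

```cpp
// Old fluid style: reusing another Op's kernel means first assembling
// inputs/outputs/attributes into an ExecutionContext (sketched here as
// pseudocode), then invoking Compute() through that context.
void SomeGradOpKernelCompute(const framework::ExecutionContext& ctx) {
  // ... build a scope, remap variables, construct another
  // ExecutionContext, then call the other OpKernel's Compute() ...
}

// PHI functional style: the kernel is a plain function, so reuse is a
// direct call (hypothetical gradient kernel computing dx = -dout).
template <typename T, typename Context>
void SomeGradKernel(const Context& dev_ctx,
                    const DenseTensor& dout,
                    DenseTensor* dx) {
  ScaleKernel<T, Context>(dev_ctx, dout, /*scale=*/-1.0f, /*bias=*/0.0f,
                          /*bias_after_scale=*/true, dx);
}
```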
### 1.2 Conciseness and fine-grained execution scheduling
#### 1.2.1 Dynamic graph
After the release of Paddle 2.0, many internal and external users reported that the performance of the dynamic graph is several times lower than that of competing products when executing small models on CPU.
The main reason for this problem is that the C++-side execution path of the Paddle dynamic graph is relatively long and its scheduling overhead is heavy. This is related to the early design of the dynamic graph, which kept compatibility with the static graph and inherited many of the object construction processes of static graph Ops.
Therefore, the dynamic graph needs to be upgraded to a function-based scheduling architecture that abandons the original complex Op architecture, which in turn depends on OpKernels being rewritten in a functional form.
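As a minimal sketch of what this functional writing method looks like (modeled on PHI's `scale` kernel; exact details may vary), an OpKernel becomes a free function that takes the device context, inputs, attributes, and output pointers directly, so the dynamic graph can invoke it without building an Op object:

```cpp
namespace phi {

// PHI-style functional kernel declaration: the device context comes
// first, inputs are const references, attributes are plain values, and
// outputs are pointers.
template <typename T, typename Context>
void ScaleKernel(const Context& dev_ctx,
                 const DenseTensor& x,
                 const Scalar& scale,
                 float bias,
                 bool bias_after_scale,
                 DenseTensor* out);

}  // namespace phi
```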
#### 1.2.2 Static graph + IR
Our current static graph mode is not "static" enough. It still performs a lot of dynamic selection at runtime, for example selecting the OpKernel and deciding whether to copy data across devices. However, these decisions can actually be made while compiling the static graph network, fixing the execution process as a sequence of OpKernel executions with no dynamic selection, thereby further improving execution efficiency.
This relies on fine-grained OpKernels themselves: decoupling the existing large, complex OpKernels into small Kernels for specific scenarios and specific devices, as sketched below.
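For instance, once kernels are fine-grained and registered per backend, a compiled static graph can bind each operation to one specific (backend, layout, dtype) kernel ahead of time. The registration sketch below uses PHI's `PD_REGISTER_KERNEL` macro with illustrative arguments that may not match the actual `scale` registrations:

```cpp
// One registration per backend: the CPU and GPU kernels are separate,
// small, device-specific entries rather than branches inside one large
// OpKernel, so a static graph compiler can select one at compile time.
PD_REGISTER_KERNEL(
    scale, CPU, ALL_LAYOUT, phi::ScaleKernel, float, double, int, int64_t) {}
PD_REGISTER_KERNEL(
    scale, GPU, ALL_LAYOUT, phi::ScaleKernel, float, double, int, int64_t) {}
```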
### 1.3 Ease of use improvement requirements for custom operators
The new custom C++ external operator paradigm released in early 2021 is relatively intuitive at the level of interface and function writing. However, because we lacked C++ APIs for basic operations, even basic addition, subtraction, multiplication, division, and matrix operations had to be reimplemented when writing the logic of a custom Op; Paddle's existing, optimized basic operations could not be reused, so development costs remained relatively high. In order to reuse the basic operations inside Paddle, the Op paradigm must be upgraded to a functional paradigm, and the corresponding C++ API system must be built.
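As a hedged sketch of the goal, a custom operator could then be written by composing exposed C++ APIs; `paddle::matmul` and `paddle::add` here are illustrative assumptions about what such an API system would expose:

```cpp
#include "paddle/extension.h"  // custom-operator extension header

// Hypothetical custom op body that reuses built-in C++ APIs instead of
// reimplementing the math by hand: y = x * w + b.
std::vector<paddle::Tensor> LinearForward(const paddle::Tensor& x,
                                          const paddle::Tensor& w,
                                          const paddle::Tensor& b) {
  auto y = paddle::add(paddle::matmul(x, w), b);
  return {y};
}
```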
### 1.4 Build an integrated training and inference operator library to reduce the maintenance cost of inference operators
For a long time, Paddle and Paddle-Lite operators have been maintained separately: a new Paddle operator, if Paddle-Lite needs it, must be manually reimplemented in Paddle-Lite, and when a Paddle operator is upgraded, Paddle-Lite does not perceive it in time, which directly leads to bugs when Lite executes the inference model and introduces high maintenance costs. Only a unified operator library can solve this problem in the long term.
Therefore, this functional operator library will be jointly built by the training and inference teams, and will serve as an independent compilation component and underlying infrastructure (not yet independently split out), serving the training, server-side inference, and mobile-side inference execution systems at the same time.
### 1.5 Op and Kernel parameter normalization
The Python 2.0 API project in 2020 standardized the argument lists of Paddle's Python-side APIs, making them concise, easy to use, and standard. However, due to cost considerations, the argument lists at the Op level were not standardized, so many operators developed early on differ greatly in arguments from their Python APIs. For example, the `conv` Python API has only 8 arguments, but the corresponding C++ `Conv` Op has 29. API and Op are essentially the same layer of concepts; both are descriptions of an operation, and their arguments should be consistent. To mitigate this problem, 'the operator definition enhancement project' was launched, and the `AsExtra` and `AsQuant` declarations were added to some unnecessary arguments, but this did not fundamentally solve the problem, which is what the construction of the PHI operator library hopes to do.
We hope to keep the arguments of the three layers Python API -> Op (C++ API) -> Kernel API consistent, so that the overall structure is clear and the reuse relationship between layers is explicit. Then maintaining one set of official Python API documents can satisfy the common reference needs of all three API tiers; we no longer need to maintain additional documentation systems, reducing maintenance costs.
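For example, for a `scale`-like operation the three layers would line up as follows (a simplified sketch; the exact argument lists are illustrative):

```cpp
// Python API layer (for reference):
//   paddle.scale(x, scale=1.0, bias=0.0, bias_after_scale=True)
//
// C++ API (Op) layer -- mirrors the Python arguments:
//   Tensor scale(const Tensor& x, const Scalar& scale, float bias,
//                bool bias_after_scale);
//
// Kernel API layer -- the same arguments, with the device context
// prepended and the output appended:
//   void ScaleKernel(const Context& dev_ctx, const DenseTensor& x,
//                    const Scalar& scale, float bias,
//                    bool bias_after_scale, DenseTensor* out);
```

Keeping the argument order and semantics identical across the three layers is what makes one set of Python API documents sufficient for all of them.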
## 2. Design Overview
### 2.1 Location
The PHI code directory is inside the paddle directory, at the same level as fluid, rather than inside the fluid directory. PHI is a basic component called by various upper-layer runtimes such as fluid and lite, and it will later be compiled as a separate dynamic library; therefore, PHI is not suitable as a submodule of fluid.
### 2.2 Directory Structure