# InfMoE
An inference framework for MoE-based models, built on a TensorRT custom plugin named `MoELayerPlugin` (with a Python binding) that runs inference of MoE layers with arbitrary sub-layers on NVIDIA GPUs with minimal memory consumption.
## Installation
Dependencies:
* CUDA (>=10.2)
* cuDNN (>=8.0, corresponding to CUDA version)
* TensorRT (>=7.0, corresponding to CUDA & cuDNN version)
* zlib (to read `npz` files)
* meson & ninja (building system)
### Python (recommended)
To use TensorRT in Python, you need to first install:
* TensorRT pip package (either from downloaded TensorRT package or from PyPI as `nvidia-tensorrt`, see <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing-pip>)
* PyCUDA
You can simply run `python3 -m pip install -r requirements.txt` to install both.
Note: if you install `nvidia-tensorrt` from PyPI (rather than from a downloaded TensorRT package), you **MUST** ensure that the TensorRT version `MoELayerPlugin` links against matches the version used by the pip package (see `site-packages/tensorrt/`). Otherwise the plugin will not work correctly.
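You can verify the pip package version from Python:

```python
import tensorrt
# this must match the TensorRT version that MoELayerPlugin links against
print(tensorrt.__version__)
```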
Then build this plugin:
```bash
cd python
# if you have cuDNN & TensorRT installed in the search path:
python3 setup.py build_ext
# or, if you need to specify the TensorRT / cuDNN install location
# (CUDA can only be searched automatically by meson):
python3 setup.py build_ext --tensorrt-prefix=/path/to/tensorrt --cudnn-prefix=/path/to/cudnn
python3 setup.py install
```
You can also use `bdist_wheel` or other commands provided by `setuptools`. Pass `--debug` to `build_ext` to enable verbose logging and keep symbols for debugging purposes.
### C++ only (advanced)
```bash
cd plugin
# if you have cuDNN & TensorRT installed in the search path:
make builddir && make compile
# or, if you need to specify the TensorRT / cuDNN install location
# (CUDA can only be searched automatically by meson):
meson setup builddir -DWITH_TENSORRT=/path/to/tensorrt -DWITH_CUDNN=/path/to/cudnn
ninja -C builddir # or just run `make`
```
If everything goes well, you can find `libtrtmoelayer.so` in `builddir`. Similarly you can pass `-DDEBUG=true` to `meson setup` for debugging.
## Plugin attributes
When initializing `MoELayerPlugin` in TensorRT (either C++ or Python), the following attributes must be specified (see the Python sketch after this list):
* `expert_count`: INT32, number of experts (sub-layers)
* `embedding_size`: INT32, the input & output size of expert network
* `hidden_size`: INT32, the intermediate size of feed forward network (might not be used by sub-layer)
* `max_concurrency`: INT32, maximum number of experts kept in GPU memory concurrently (defaults to 2); setting it too large will lead to OOM
* `expert_centroids`: FLOAT32 array, weights for dispatching tokens to experts; must have shape `(d_model, expert_count)`, where `d_model` is the last dimension of the input tensor (i.e. the embedding size)
* `expert_weight_file`: null-terminated CHAR array, path to the expert weight file, read by the sub-layer implementation
* `expert_sublayer_type`: null-terminated CHAR array, type of sub-layer to use, currently `T5_FF` or `Identity` (see the Sub-layer section below)
* `moe_variant`: null-terminated CHAR array, variant type of MoE layer, used to decide different behaviours (can be `cpm_2`, `base_layer` or `default`)
* `layernorm_weight`: FLOAT32 array, weight of the layer norm applied to the input before calculating expert affiliation / score; must be provided when `moe_variant` is `cpm_2`
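For illustration, here is a minimal, untested Python sketch of how these attributes could be supplied through the TensorRT plugin registry. The plugin version string, tensor sizes and file names below are assumptions, and the built `libtrtmoelayer.so` must be loaded first so the plugin registers itself:

```python
import ctypes
import numpy as np
import tensorrt as trt

ctypes.CDLL('./libtrtmoelayer.so')  # load the plugin library (path assumed)

d_model, n_experts = 768, 32  # illustrative sizes
creator = trt.get_plugin_registry().get_plugin_creator('MoELayerPlugin', '1')  # version assumed

def char_field(name: str, value: str) -> trt.PluginField:
    # these attributes are documented as null-terminated CHAR arrays
    data = np.frombuffer(value.encode() + b'\0', dtype=np.int8)
    return trt.PluginField(name, data, trt.PluginFieldType.CHAR)

fields = trt.PluginFieldCollection([
    trt.PluginField('expert_count', np.array([n_experts], np.int32), trt.PluginFieldType.INT32),
    trt.PluginField('embedding_size', np.array([d_model], np.int32), trt.PluginFieldType.INT32),
    trt.PluginField('hidden_size', np.array([4 * d_model], np.int32), trt.PluginFieldType.INT32),
    trt.PluginField('max_concurrency', np.array([2], np.int32), trt.PluginFieldType.INT32),
    trt.PluginField('expert_centroids', np.zeros((d_model, n_experts), np.float32),
                    trt.PluginFieldType.FLOAT32),
    char_field('expert_weight_file', 'experts.npz'),
    char_field('expert_sublayer_type', 'T5_FF'),
    char_field('moe_variant', 'default'),
])
plugin = creator.create_plugin('moe_layer', fields)
```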
## Usage
Currently InfMoE can only handle MoE layers with FP32 parameters, input & output. To run inference with a full network, you should slice it before and after any MoE layer:
* For non-MoE layers, just save them in `onnx` / UFF format and use TensorRT to parse them into a network ([Python](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#import_onnx_python) / [C++](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#import_onnx_c)). Or you can use the TensorRT API to construct the network manually ([Python](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#network_python) / [C++](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#create_network_c)).
* For MoE layers, dump expert centroids and weights of each expert separately (in the format mentioned below), create a layer using `MoELayerPlugin` with Python or C++ (see examples).
Then you can concatenate MoE / non-MoE layers to obtain the full network (or replace any specific 'placeholder' layer with an MoE layer), which can then be built into a TensorRT CUDA engine and used to run inference, or serialized and dumped to a file.
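As a rough sketch of that flow (input shape, workspace size and layer names below are illustrative, not taken from the repo's examples):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# illustrative input: (batch, seq_len, d_model)
inp = network.add_input('input', trt.float32, (1, 128, 768))

# `plugin` is the IPluginV2 instance created as in the sketch above
moe = network.add_plugin_v2([inp], plugin)
network.mark_output(moe.get_output(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB
engine = builder.build_engine(network, config)  # can also be serialized to a file
```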
We provide several Python examples in `python/examples` showing how to do the aforementioned work. You can run them after installing this plugin. You are encouraged to read [TensorRT documentation](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html) to understand its workflow prior to using this plugin.
## Error handling
InfMoE requires that none of the following tensors contains `NaN` values:
* layer input
* expert centroids
* weight of layer norm (if applicable)
It also checks the shape and data type of all parameters and input & output tensors. If any misconfiguration is found, it prints an error message to `stderr` and aborts the whole process.
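Since a single `NaN` aborts the whole process, it can be worth validating these tensors on the host first. A minimal `numpy` sketch (shapes illustrative):

```python
import numpy as np

def assert_no_nan(name: str, arr: np.ndarray) -> None:
    # InfMoE aborts on NaN, so fail early with a recoverable Python error instead
    if np.isnan(arr).any():
        raise ValueError(f'{name} contains NaN values')

# replace these with your real layer input, centroids and layer-norm weight
assert_no_nan('layer input', np.zeros((1, 128, 768), dtype=np.float32))
assert_no_nan('expert centroids', np.zeros((768, 32), dtype=np.float32))
```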
## Scheduling
See the CPM-2 paper for scheduling details; the scheduling implementation will be ported to the public source code soon.
## Sub-layer
We provide some sub-layers in `plugin/sublayers`. To implement your own sub-layer, you need to:
* Extend `MoESubLayer` class
* Add your layer name and initialization code to `MoELayerPlugin.h` (in `sublayer_type`) and `MoELayerPlugin.cc` (in `MoELayerPlugin::createSublayer()`)
* Add your source file (`.cpp` only) to `meson.build`
* Rebuild the plugin
### T5FFLayer (`T5_FF`)
This project includes a sub-layer implementation of the feed-forward layer in the T5 network. It is defined as:
```text
hs := hs + dense_relu_dense(layer_norm(hs))
layer_norm(hs) := wl * hs / sqrt(mean(pow(hs, 2)) + eps)
dense_relu_dense(hs) := (gelu(hs @ wi_0^T) * (hs @ wi_1^T)) @ wo^T
```
where `wi_0`, `wi_1` and `wo` are linear layers with no bias; the first two project the input tensor to 4 times its last dimension, and `wo` projects it back.
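For clarity, a `numpy` transcription of these formulas might look like the following (the `eps` value and the tanh approximation of `gelu` are assumptions; the authoritative implementation is `plugin/sublayers/T5FFLayer.cc`):

```python
import numpy as np

def layer_norm(hs, wl, eps=1e-6):
    # RMS-style norm per the formula above: no mean subtraction, no bias
    return wl * hs / np.sqrt(np.mean(hs ** 2, axis=-1, keepdims=True) + eps)

def gelu(x):
    # common tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def t5_ff(hs, wl, wi_0, wi_1, wo):
    h = layer_norm(hs, wl)
    return hs + (gelu(h @ wi_0.T) * (h @ wi_1.T)) @ wo.T
```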
The given `expert_weight_file` must be an `npz` file containing the following variables (with `n` ranging from `0` to `expert_count - 1`): `n/layer_norm_weight`, `n/wi_0_weight`, `n/wi_1_weight`, `n/wo_weight`.
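A hedged sketch of producing such a file with `numpy` (sizes illustrative; the weight shapes are inferred from the formulas above, so verify them against `T5FFLayer.cc`):

```python
import numpy as np

d_model, d_ff, n_experts = 768, 3072, 32  # illustrative; d_ff = 4 * d_model

weights = {}
for n in range(n_experts):
    weights[f'{n}/layer_norm_weight'] = np.ones(d_model, dtype=np.float32)
    # hs @ wi^T maps d_model -> d_ff; hs @ wo^T maps d_ff -> d_model
    weights[f'{n}/wi_0_weight'] = np.zeros((d_ff, d_model), dtype=np.float32)
    weights[f'{n}/wi_1_weight'] = np.zeros((d_ff, d_model), dtype=np.float32)
    weights[f'{n}/wo_weight'] = np.zeros((d_model, d_ff), dtype=np.float32)

np.savez('experts.npz', **weights)
```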
### IdentityLayer (`Identity`)
This layer **DOES NOTHING** (and thus uses none of the plugin attributes above); it simply copies the input directly to the output. It is intended for debugging purposes only.