# FasterTransformer
This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA.
## Table Of Contents
- [FasterTransformer](#fastertransformer)
  - [Table Of Contents](#table-of-contents)
  - [Model overview](#model-overview)
    - [Configuration support matrix](#configuration-support-matrix)
    - [Model architecture](#model-architecture)
      - [Encoder](#encoder)
      - [Effective Transformer](#effective-transformer)
      - [Decoder](#decoder)
      - [Decoding](#decoding)
      - [Decoder and Decoding](#decoder-and-decoding)
  - [Setup](#setup)
    - [Requirements](#requirements)
  - [Quick Start Guide](#quick-start-guide)
    - [Build the FasterTransformer](#build-the-fastertransformer)
    - [Execute the encoder demos](#execute-the-encoder-demos)
    - [Execute the decoder/decoding demos](#execute-the-decoderdecoding-demos)
    - [Translation demos](#translation-demos)
  - [Advanced](#advanced)
    - [Scripts and sample codes](#scripts-and-sample-codes)
    - [Command-line options](#command-line-options)
    - [Inference process](#inference-process)
      - [Encoder process](#encoder-process)
      - [Decoder and decoding process](#decoder-and-decoding-process)
      - [Translation process](#translation-process)
  - [Performance](#performance)
    - [Encoder performance](#encoder-performance)
      - [Encoder performance on T4 and TensorFlow](#encoder-performance-on-t4-and-tensorflow)
      - [Encoder performance on V100 and TensorFlow](#encoder-performance-on-v100-and-tensorflow)
      - [Effective Transformer performance on V100 and TensorFlow](#effective-transformer-performance-on-v100-and-tensorflow)
      - [Encoder performance on T4 and PyTorch](#encoder-performance-on-t4-and-pytorch)
      - [Encoder performance on V100 and PyTorch](#encoder-performance-on-v100-and-pytorch)
      - [Performance on application codes of TensorFlow](#performance-on-application-codes-of-tensorflow)
      - [Performance on application codes of PyTorch](#performance-on-application-codes-of-pytorch)
    - [Decoder performance](#decoder-performance)
      - [Decoder performance on T4 and TensorFlow](#decoder-performance-on-t4-and-tensorflow)
      - [Decoder performance on V100 and TensorFlow](#decoder-performance-on-v100-and-tensorflow)
    - [Decoding performance](#decoding-performance)
      - [Decoding performance on T4 and TensorFlow](#decoding-performance-on-t4-and-tensorflow)
      - [Decoding performance on V100 and TensorFlow](#decoding-performance-on-v100-and-tensorflow)
      - [Decoder and decoding performance on T4 and PyTorch](#decoder-and-decoding-performance-on-t4-and-pytorch)
      - [Decoder and decoding performance on V100 and PyTorch](#decoder-and-decoding-performance-on-v100-and-pytorch)
    - [TensorFlow performance on translation](#tensorflow-performance-on-translation)
    - [PyTorch performance on translation](#pytorch-performance-on-translation)
  - [Release notes](#release-notes)
    - [Changelog](#changelog)
    - [Known issues](#known-issues)
    - [TODO](#todo)
## Model overview
In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and decoder for inference. On Volta and Turing GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.
In FasterTransformer 1.0, we implemented a highly optimized BERT transformer layer, which is used in the encoder.
In FasterTransformer 2.0, we added a highly optimized decoder and decoding models based on OpenNMT-TF, an open-source library. Here, the decoder is the model that contains some transformer layers, while decoding refers to the whole translation process, including the embedding lookup, position encoding, a decoder, and beam search.
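To make the decoding flow concrete, the snippet below is a minimal, illustrative Python/PyTorch loop. The toy decoder and greedy token selection stand in for the real transformer stack and beam search; it only shows how embedding lookup, position encoding, the decoder, and token selection fit together, and is not the FasterTransformer implementation.

```python
# Conceptual sketch of the decoding loop: embedding lookup -> position encoding
# -> decoder layers -> token selection. Shapes and the toy decoder are illustrative only.
import torch

vocab_size, hidden_dim, max_len, batch = 32, 16, 8, 2
embedding = torch.nn.Embedding(vocab_size, hidden_dim)
pos_encoding = torch.randn(max_len, hidden_dim)           # stand-in for sinusoidal position encoding
project_to_vocab = torch.nn.Linear(hidden_dim, vocab_size)

def toy_decoder(x):                                       # stand-in for the transformer decoder stack
    return torch.tanh(x)

tokens = torch.zeros(batch, 1, dtype=torch.long)          # start-of-sentence id = 0
for step in range(max_len):
    x = embedding(tokens[:, -1]) + pos_encoding[step]     # embedding lookup + position encoding
    logits = project_to_vocab(toy_decoder(x))             # decoder layers + output projection
    next_token = logits.argmax(dim=-1, keepdim=True)      # greedy pick; beam search would keep the top-k hypotheses
    tokens = torch.cat([tokens, next_token], dim=1)
print(tokens)
```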
In FasterTransformer 2.1, we added several important features. The first is PyTorch support: since PyTorch has more and more users, we hope they can also use FasterTransformer in their applications and research. The second is support for the [effective transformer](https://github.com/bytedance/effective_transformer), an idea proposed by ByteDance that removes the useless padding of the encoder input to reduce the computing cost. Third, in addition to decoding with beam search, we also provide a decoding-with-sampling module. Finally, we optimized many kernels of the encoder, decoder, and beam search to improve the speed of FasterTransformer.
In FasterTransformer 3.0, we implemented INT8 quantization for the encoder (including support for the [effective transformer](https://github.com/bytedance/effective_transformer)). With INT8 quantization, we can take advantage of the powerful INT8 Tensor Cores on Turing GPUs to achieve better inference performance (INT8 quantization in FT 3.0 is only supported on devices with SM >= 7.5). We also provide quantization tools for TensorFlow.
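To illustrate what the quantization scales represent, here is a minimal sketch of symmetric per-tensor INT8 quantization. The scale computation and the kernels FasterTransformer actually uses differ (calibration, per-channel scales, fused GEMMs), so treat this only as an illustration of the idea.

```python
# Minimal illustration of symmetric INT8 quantization with a per-tensor scale.
# Not the FasterTransformer implementation; it only shows the quantize/de-quantize idea.
import torch

x = torch.randn(4, 8)                        # FP32 activation
scale = x.abs().max() / 127.0                # per-tensor scale from the dynamic range
x_int8 = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
x_dequant = x_int8.float() * scale           # de-quantize back to FP32
print((x - x_dequant).abs().max())           # quantization error
```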
The following figure demonstrates the model architecture.
![](images/encoder-decoding-2.png)
FasterTransformer is built on top of CUDA, cuBLAS, and cuBLASLt, and provides a C++ API as well as TensorFlow/PyTorch OPs. Users can integrate them into TensorFlow, PyTorch, or other inference services built in native C++. We also provide simple sample code that demonstrates how to use the encoder and decoder and how to carry out decoding in C++, TensorFlow, and PyTorch.
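As a rough illustration of the integration path, the sketch below loads compiled op libraries into TensorFlow and PyTorch. The `.so` paths are placeholders, not the actual build outputs; use the libraries produced by your own build (see the Quick Start Guide) and the loading mechanism shown in the sample codes.

```python
# Hedged sketch: loading custom op libraries produced by the build step.
# The paths below are hypothetical placeholders.
import tensorflow as tf
import torch

ft_tf = tf.load_op_library('./lib/libtf_fastertransformer.so')   # hypothetical path
torch.ops.load_library('./lib/libpyt_fastertransformer.so')      # hypothetical path
# The loaded modules expose the encoder, decoder, and decoding ops to the framework.
```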
### Configuration support matrix
The following configurations are supported in the FasterTransformer encoder.
- Batch size (B<sub>1</sub>): less than or equal to 512
- Sequence length (S): less than or equal to 1024. For the INT8 data type, the sequence length should be a multiple of 32.
- Head number (H) and size per head (N):
  - 16 heads * 64 per head
  - 12 heads * 64 per head
  - 4 heads * 32 per head
  - 8 heads * 96 per head
- Data type: FP32, FP16 and INT8
- Any number of layers (N<sub>1</sub>) if the memory is enough
The following configurations are supported in the FasterTransformer decoder and decoding.
- Batch size (B<sub>1</sub>) * beam width (B<sub>2</sub>): less than 1024
- Sequence length (S): less than 1024
- Head number (H): 8 and 12
- Size per head (N): 64
- Vocabulary size (V): from 64 to 40000
- Data type: FP32 and FP16
- Any number of layers (N<sub>2</sub>) if the memory is enough
### Model architecture
#### Encoder
The arguments, inputs, and outputs of the encoder:
* Arguments:
  1. Head number (H)
  2. Size per head (N)
  3. Remove padding flag: A bool value that determines whether to use the effective transformer.
  4. INT8 mode flag: An integer value that determines which INT8 mode is used.
  5. Layer number: The number of layers.
  6. Layer index: An integer value indicating which layer this is.
* Inputs:
  1. An input tensor. The shape is \[ B<sub>1</sub>, S, H x N\].
  2. An attention mask.
  3. The weights of all parameters.
  4. A sequence id offset vector, used to compute the offset of each sentence for the effective transformer.
  5. A list of scales, used to quantize and de-quantize the activations in INT8 quantization.
* Outputs:
  1. The encoder output feature. The shape is \[ B<sub>1</sub>, S, H x N \].
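The shapes of these inputs can be illustrated with a short PyTorch snippet. The values and the way the attention mask is built below are assumptions made for illustration only; the actual op invocation follows the sample codes shipped with the repository.

```python
# Illustrative shapes for the encoder inputs listed above (values are arbitrary).
# This only prepares the tensors; it does not call the FasterTransformer op.
import torch

B1, S, H, N = 8, 32, 12, 64                                      # batch, seq length, heads, size per head
hidden = H * N
input_tensor = torch.randn(B1, S, hidden, dtype=torch.float16)   # [B1, S, H x N]
lengths = torch.randint(1, S + 1, (B1,))                         # valid length of each sentence
attention_mask = (torch.arange(S)[None, :] < lengths[:, None])   # True at real tokens, False at padding
attention_mask = attention_mask[:, None, :].expand(B1, S, S).to(torch.float16)
# With the remove-padding flag set, a sequence id offset vector locates each valid
# token inside the packed buffer (see the Effective Transformer section below).
```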
#### Effective Transformer
Effective Transformer was proposed by [ByteDance](https://github.com/bytedance/effective_transformer). Our implementation is based on the FasterTransformer encoder.
The main idea is to remove the padding of each sentence so that no computation is spent on useless padded tokens. This method can save a lot of time when the average sequence length of a batch is much smaller than the maximum sequence length.
Using the Effective Transformer requires some additional kernels; the details are demonstrated in the sample codes.
![](images/effective_transformer.png)
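The padding-removal idea can be sketched directly in PyTorch. The snippet below packs the valid tokens of a padded batch and records an offset vector so the padded layout can be restored afterwards; this is a conceptual illustration of what the additional kernels do, not the actual GPU implementation.

```python
# Conceptual sketch of padding removal: pack valid tokens contiguously and keep
# an offset vector that records where each token came from. Illustrative only.
import torch

B1, S, hidden = 3, 6, 4
lengths = torch.tensor([2, 6, 3])                       # valid tokens per sentence
x = torch.randn(B1, S, hidden)                          # padded input [B1, S, hidden]

valid = torch.arange(S)[None, :] < lengths[:, None]     # True where a token is real
offsets = valid.flatten().nonzero(as_tuple=True)[0]     # flat positions of real tokens
packed = x.reshape(B1 * S, hidden)[offsets]             # [sum(lengths), hidden] without padding

restored = torch.zeros(B1 * S, hidden)
restored[offsets] = packed                              # scatter back to the padded layout
print(packed.shape)                                     # torch.Size([11, 4])
```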
#### Decoder
The arguments, inputs, and outputs of the decoder:
* Arguments:
1.