# FasterTransformer
This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA.
## Table Of Contents
- [FasterTransformer](#fastertransformer)
  - [Table Of Contents](#table-of-contents)
  - [Model overview](#model-overview)
    - [Configuration support matrix](#configuration-support-matrix)
    - [Model architecture](#model-architecture)
      - [Encoder](#encoder)
      - [Effective Transformer](#effective-transformer)
      - [Decoder](#decoder)
      - [Decoding](#decoding)
      - [Decoder and Decoding](#decoder-and-decoding)
  - [Setup](#setup)
    - [Requirements](#requirements)
  - [Quick Start Guide](#quick-start-guide)
    - [Build the FasterTransformer](#build-the-fastertransformer)
    - [Execute the encoder demos](#execute-the-encoder-demos)
    - [Execute the decoder/decoding demos](#execute-the-decoderdecoding-demos)
    - [Translation demos](#translation-demos)
  - [Advanced](#advanced)
    - [Scripts and sample codes](#scripts-and-sample-codes)
    - [Command-line options](#command-line-options)
    - [Inference process](#inference-process)
      - [Encoder process](#encoder-process)
      - [Decoder and decoding process](#decoder-and-decoding-process)
      - [Translation process](#translation-process)
  - [Performance](#performance)
    - [Encoder performance](#encoder-performance)
      - [Encoder performance on T4 and TensorFlow](#encoder-performance-on-t4-and-tensorflow)
      - [Encoder performance on V100 and TensorFlow](#encoder-performance-on-v100-and-tensorflow)
      - [Effective Transformer performance on V100 and TensorFlow](#effective-transformer-performance-on-v100-and-tensorflow)
      - [Encoder performance on T4 and PyTorch](#encoder-performance-on-t4-and-pytorch)
      - [Encoder performance on V100 and PyTorch](#encoder-performance-on-v100-and-pytorch)
      - [Performance on application codes of TensorFlow](#performance-on-application-codes-of-tensorflow)
      - [Performance on application codes of PyTorch](#performance-on-application-codes-of-pytorch)
    - [Decoder performance](#decoder-performance)
      - [Decoder performance on T4 and TensorFlow](#decoder-performance-on-t4-and-tensorflow)
      - [Decoder performance on V100 and TensorFlow](#decoder-performance-on-v100-and-tensorflow)
    - [Decoding performance](#decoding-performance)
      - [Decoding performance on T4 and TensorFlow](#decoding-performance-on-t4-and-tensorflow)
      - [Decoding performance on V100 and TensorFlow](#decoding-performance-on-v100-and-tensorflow)
      - [Decoder and decoding performance on T4 and PyTorch](#decoder-and-decoding-performance-on-t4-and-pytorch)
      - [Decoder and decoding performance on V100 and PyTorch](#decoder-and-decoding-performance-on-v100-and-pytorch)
    - [TensorFlow performance on translation](#tensorflow-performance-on-translation)
    - [PyTorch performance on translation](#pytorch-performance-on-translation)
  - [Release notes](#release-notes)
    - [Changelog](#changelog)
    - [Known issues](#known-issues)
    - [TODO](#todo)
## Model overview
In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and decoder for inference. On Volta and Turing GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.
In FasterTransformer 1.0, we implemented a highly optimized BERT transformer layer, which is used in the encoder.
In FasterTransformer 2.0, we added a highly optimized decoder and decoding models based on OpenNMT-TF, an open-source library. Here, the decoder is the model that contains some transformer layers, while decoding refers to the whole translation process, including the embedding lookup, position encoding, a decoder, and beam search.
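To make the decoding flow concrete, the snippet below is a minimal, illustrative Python/PyTorch loop. The toy decoder and greedy token selection stand in for the real transformer stack and beam search; it only shows how embedding lookup, position encoding, the decoder, and token selection fit together, and is not the FasterTransformer implementation.

```python
# Conceptual sketch of the decoding loop: embedding lookup -> position encoding
# -> decoder layers -> token selection. Shapes and the toy decoder are illustrative only.
import torch

vocab_size, hidden_dim, max_len, batch = 32, 16, 8, 2
embedding = torch.nn.Embedding(vocab_size, hidden_dim)
pos_encoding = torch.randn(max_len, hidden_dim)           # stand-in for sinusoidal position encoding
project_to_vocab = torch.nn.Linear(hidden_dim, vocab_size)

def toy_decoder(x):                                       # stand-in for the transformer decoder stack
    return torch.tanh(x)

tokens = torch.zeros(batch, 1, dtype=torch.long)          # start-of-sentence id = 0
for step in range(max_len):
    x = embedding(tokens[:, -1]) + pos_encoding[step]     # embedding lookup + position encoding
    logits = project_to_vocab(toy_decoder(x))             # decoder layers + output projection
    next_token = logits.argmax(dim=-1, keepdim=True)      # greedy pick; beam search would keep the top-k hypotheses
    tokens = torch.cat([tokens, next_token], dim=1)
print(tokens)
```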
In FasterTransformer 2.1, we added several important features. The first is PyTorch support: since PyTorch has more and more users, we hope they can also use FasterTransformer in their applications and research. The second is support for the [effective transformer](https://github.com/bytedance/effective_transformer), an idea proposed by ByteDance that removes the useless padding of the encoder input to reduce the computing cost. Third, in addition to decoding with beam search, we also provide a decoding-with-sampling module. Finally, we optimized many kernels of the encoder, decoder, and beam search to improve the speed of FasterTransformer.
In FasterTransformer 3.0, we implemented INT8 quantization for the encoder (including support for the [effective transformer](https://github.com/bytedance/effective_transformer)). With INT8 quantization, we can take advantage of the powerful INT8 Tensor Cores on Turing GPUs to achieve better inference performance (INT8 quantization in FT 3.0 is only supported on devices with SM >= 7.5). We also provide quantization tools for TensorFlow.
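To illustrate what the quantization scales represent, here is a minimal sketch of symmetric per-tensor INT8 quantization. The scale computation and the kernels FasterTransformer actually uses differ (calibration, per-channel scales, fused GEMMs), so treat this only as an illustration of the idea.

```python
# Minimal illustration of symmetric INT8 quantization with a per-tensor scale.
# Not the FasterTransformer implementation; it only shows the quantize/de-quantize idea.
import torch

x = torch.randn(4, 8)                        # FP32 activation
scale = x.abs().max() / 127.0                # per-tensor scale from the dynamic range
x_int8 = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
x_dequant = x_int8.float() * scale           # de-quantize back to FP32
print((x - x_dequant).abs().max())           # quantization error
```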
The following figure demonstrates the model architecture.
![](images/encoder-decoding-2.png)
FasterTransformer is built on top of CUDA, cuBLAS, and cuBLASLt, and provides a C++ API as well as TensorFlow/PyTorch OPs. Users can integrate them into TensorFlow, PyTorch, or other inference services built in native C++. We also provide simple sample code that demonstrates how to use the encoder and decoder and how to carry out decoding in C++, TensorFlow, and PyTorch.
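As a rough illustration of the integration path, the sketch below loads compiled op libraries into TensorFlow and PyTorch. The `.so` paths are placeholders, not the actual build outputs; use the libraries produced by your own build (see the Quick Start Guide) and the loading mechanism shown in the sample codes.

```python
# Hedged sketch: loading custom op libraries produced by the build step.
# The paths below are hypothetical placeholders.
import tensorflow as tf
import torch

ft_tf = tf.load_op_library('./lib/libtf_fastertransformer.so')   # hypothetical path
torch.ops.load_library('./lib/libpyt_fastertransformer.so')      # hypothetical path
# The loaded modules expose the encoder, decoder, and decoding ops to the framework.
```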
### Configuration support matrix
The following configurations are supported in the FasterTransformer encoder.
- Batch size (B<sub>1</sub>): less than or equal to 512
- Sequence length (S): less than or equal to 1024. For the INT8 data type, the sequence length should be a multiple of 32.
- Head number (H) and size per head (N):
  - 16 heads * 64 per head
  - 12 heads * 64 per head
  - 4 heads * 32 per head
  - 8 heads * 96 per head
- Data type: FP32, FP16 and INT8
- Any number of layers (N<sub>1</sub>) if the memory is enough
The following configurations are supported in the FasterTransformer decoder and decoding.
- Batch size (B<sub>1</sub>) * beam width (B<sub>2</sub>): less than 1024
- Sequence length (S): less than 1024
- Head number (H): 8 and 12
- Size per head (N): 64
- Vocabulary size (V): from 64 to 40000
- Data type: FP32 and FP16
- Any number of layers (N<sub>2</sub>) if the memory is enough
### Model architecture
#### Encoder
The arguments, inputs, and outputs of the encoder:
* Arguments:
  1. Head number (H)
  2. Size per head (N)
  3. Remove padding flag: A bool value that determines whether to use the effective transformer.
  4. INT8 mode flag: An integer value that determines which INT8 mode is used.
  5. Layer number: The number of layers.
  6. Layer index: An integer value indicating which layer this is.
* Inputs:
  1. An input tensor. The shape is \[ B<sub>1</sub>, S, H x N\].
  2. An attention mask.
  3. The weights of all parameters.
  4. A sequence id offset vector, used to compute the offset of each sentence for the effective transformer.
  5. A list of scales, used to quantize and de-quantize the activations in INT8 quantization.
* Outputs:
  1. The encoder output feature. The shape is \[ B<sub>1</sub>, S, H x N \].
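The shapes of these inputs can be illustrated with a short PyTorch snippet. The values and the way the attention mask is built below are assumptions made for illustration only; the actual op invocation follows the sample codes shipped with the repository.

```python
# Illustrative shapes for the encoder inputs listed above (values are arbitrary).
# This only prepares the tensors; it does not call the FasterTransformer op.
import torch

B1, S, H, N = 8, 32, 12, 64                                      # batch, seq length, heads, size per head
hidden = H * N
input_tensor = torch.randn(B1, S, hidden, dtype=torch.float16)   # [B1, S, H x N]
lengths = torch.randint(1, S + 1, (B1,))                         # valid length of each sentence
attention_mask = (torch.arange(S)[None, :] < lengths[:, None])   # True at real tokens, False at padding
attention_mask = attention_mask[:, None, :].expand(B1, S, S).to(torch.float16)
# With the remove-padding flag set, a sequence id offset vector locates each valid
# token inside the packed buffer (see the Effective Transformer section below).
```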
#### Effective Transformer
Effective Transformer was proposed by [ByteDance](https://github.com/bytedance/effective_transformer). Our implementation is based on the FasterTransformer encoder.
The main idea is to remove the padding of each sentence so that no computation is spent on useless padded tokens. This method can save a lot of time when the average sequence length of a batch is much smaller than the maximum sequence length.
Using the Effective Transformer requires some additional kernels; the details are demonstrated in the sample codes.
![](images/effective_transformer.png)
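The padding-removal idea can be sketched directly in PyTorch. The snippet below packs the valid tokens of a padded batch and records an offset vector so the padded layout can be restored afterwards; this is a conceptual illustration of what the additional kernels do, not the actual GPU implementation.

```python
# Conceptual sketch of padding removal: pack valid tokens contiguously and keep
# an offset vector that records where each token came from. Illustrative only.
import torch

B1, S, hidden = 3, 6, 4
lengths = torch.tensor([2, 6, 3])                       # valid tokens per sentence
x = torch.randn(B1, S, hidden)                          # padded input [B1, S, hidden]

valid = torch.arange(S)[None, :] < lengths[:, None]     # True where a token is real
offsets = valid.flatten().nonzero(as_tuple=True)[0]     # flat positions of real tokens
packed = x.reshape(B1 * S, hidden)[offsets]             # [sum(lengths), hidden] without padding

restored = torch.zeros(B1 * S, hidden)
restored[offsets] = packed                              # scatter back to the padded layout
print(packed.shape)                                     # torch.Size([11, 4])
```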
#### Decoder
The arguments, inputs, and outputs of the decoder:
* Arguments:
1.