# BERT For PyTorch
This repository provides a script and recipe to train the BERT model for PyTorch to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
- [Model overview](#model-overview)
    * [Model architecture](#model-architecture)
    * [Default configuration](#default-configuration)
    * [Feature support matrix](#feature-support-matrix)
        * [Features](#features)
    * [Mixed precision training](#mixed-precision-training)
        * [Enabling mixed precision](#enabling-mixed-precision)
        * [Enabling TF32](#enabling-tf32)
    * [Glossary](#glossary)
- [Setup](#setup)
    * [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
    * [Scripts and sample code](#scripts-and-sample-code)
    * [Parameters](#parameters)
        * [Pre-training parameters](#pre-training-parameters)
        * [Fine-tuning parameters](#fine-tuning-parameters)
        * [Multi-node](#multi-node)
    * [Command-line options](#command-line-options)
    * [Getting the data](#getting-the-data)
        * [Dataset guidelines](#dataset-guidelines)
        * [Multi-dataset](#multi-dataset)
    * [Training process](#training-process)
        * [Pre-training](#pre-training)
        * [Fine-tuning](#fine-tuning)
    * [Inference process](#inference-process)
        * [Fine-tuning inference](#fine-tuning-inference)
    * [Deploying BERT using NVIDIA Triton Inference Server](#deploying-bert-using-nvidia-triton-inference-server)
- [Performance](#performance)
    * [Benchmarking](#benchmarking)
        * [Training performance benchmark](#training-performance-benchmark)
        * [Inference performance benchmark](#inference-performance-benchmark)
    * [Results](#results)
        * [Training accuracy results](#training-accuracy-results)
            * [Pre-training loss results: NVIDIA DGX A100 (8x A100 40GB)](#pre-training-loss-results-nvidia-dgx-a100-8x-a100-40gb)
            * [Pre-training loss results: NVIDIA DGX-2H V100 (16x V100 32GB)](#pre-training-loss-results-nvidia-dgx-2h-v100-16x-v100-32gb)
            * [Pre-training loss results](#pre-training-loss-results)
            * [Pre-training loss curves](#pre-training-loss-curves)
            * [Fine-tuning accuracy results: NVIDIA DGX A100 (8x A100 40GB)](#fine-tuning-accuracy-results-nvidia-dgx-a100-8x-a100-40gb)
            * [Fine-tuning accuracy results: NVIDIA DGX-2 (16x V100 32G)](#fine-tuning-accuracy-results-nvidia-dgx-2-16x-v100-32g)
            * [Fine-tuning accuracy results: NVIDIA DGX-1 (8x V100 16G)](#fine-tuning-accuracy-results-nvidia-dgx-1-8x-v100-16g)
        * [Training stability test](#training-stability-test)
            * [Pre-training stability test](#pre-training-stability-test)
            * [Fine-tuning stability test](#fine-tuning-stability-test)
        * [Training performance results](#training-performance-results)
            * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
                * [Pre-training NVIDIA DGX A100 (8x A100 40GB)](#pre-training-nvidia-dgx-a100-8x-a100-40gb)
                * [Fine-tuning NVIDIA DGX A100 (8x A100 40GB)](#fine-tuning-nvidia-dgx-a100-8x-a100-40gb)
            * [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32g)
                * [Pre-training NVIDIA DGX-2 With 32G](#pre-training-nvidia-dgx-2-with-32g)
                * [Pre-training on multiple NVIDIA DGX-2H With 32G](#pre-training-on-multiple-nvidia-dgx-2h-with-32g)
                * [Fine-tuning NVIDIA DGX-2 With 32G](#fine-tuning-nvidia-dgx-2-with-32g)
            * [Training performance: NVIDIA DGX-1 (8x V100 32G)](#training-performance-nvidia-dgx-1-8x-v100-32g)
                * [Pre-training NVIDIA DGX-1 With 32G](#pre-training-nvidia-dgx-1-with-32g)
                * [Fine-tuning NVIDIA DGX-1 With 32G](#fine-tuning-nvidia-dgx-1-with-32g)
            * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
                * [Pre-training NVIDIA DGX-1 With 16G](#pre-training-nvidia-dgx-1-with-16g)
                * [Pre-training on multiple NVIDIA DGX-1 With 16G](#pre-training-on-multiple-nvidia-dgx-1-with-16g)
                * [Fine-tuning NVIDIA DGX-1 With 16G](#fine-tuning-nvidia-dgx-1-with-16g)
        * [Inference performance results](#inference-performance-results)
            * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
                * [Fine-tuning inference on NVIDIA DGX A100 (1x A100 40GB)](#fine-tuning-inference-on-nvidia-dgx-a100-1x-a100-40gb)
            * [Inference performance: NVIDIA DGX-2 (1x V100 32G)](#inference-performance-nvidia-dgx-2-1x-v100-32g)
                * [Fine-tuning inference on NVIDIA DGX-2 with 32G](#fine-tuning-inference-on-nvidia-dgx-2-with-32g)
            * [Inference performance: NVIDIA DGX-1 (1x V100 32G)](#inference-performance-nvidia-dgx-1-1x-v100-32g)
                * [Fine-tuning inference on NVIDIA DGX-1 with 32G](#fine-tuning-inference-on-nvidia-dgx-1-with-32g)
            * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16g)
                * [Fine-tuning inference on NVIDIA DGX-1 with 16G](#fine-tuning-inference-on-nvidia-dgx-1-with-16g)
- [Release notes](#release-notes)
    * [Changelog](#changelog)
    * [Known issues](#known-issues)
## Model overview
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) paper. NVIDIA's implementation of BERT is an optimized version of the [Hugging Face implementation](https://github.com/huggingface/pytorch-pretrained-BERT), leveraging mixed precision arithmetic and Tensor Cores on Volta V100 and Ampere A100 GPUs for faster training times while maintaining target accuracy.
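As a brief, hedged illustration of what "pre-training" means here: BERT's masked-LM objective corrupts a fraction of the input tokens and trains the model to recover them. The sketch below follows the 80/10/10 masking rule from the paper; the vocabulary size and `[MASK]` token ID are assumptions matching the standard uncased WordPiece vocabulary, not values taken from this repository's scripts.

```python
import torch

VOCAB_SIZE = 30522  # assumed: standard BERT uncased WordPiece vocabulary size
MASK_ID = 103       # assumed: [MASK] token ID in that vocabulary

def mask_tokens(input_ids: torch.Tensor, mlm_prob: float = 0.15):
    """Corrupt a batch of token IDs for the masked-LM objective.

    Returns (corrupted_inputs, labels); positions not selected for
    prediction get label -100 so cross-entropy ignores them.
    """
    labels = input_ids.clone()
    # Select ~15% of positions to predict.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100

    corrupted = input_ids.clone()
    # 80% of selected positions become [MASK] ...
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    corrupted[to_mask] = MASK_ID
    # ... 10% become a random token, and the remaining 10% stay unchanged.
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    corrupted[to_random] = torch.randint(VOCAB_SIZE, input_ids.shape)[to_random]
    return corrupted, labels
```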
This repository contains scripts to interactively launch data download, training, benchmarking and inference routines in a Docker container for both pre-training and fine-tuning for tasks such as question answering. The major differences between the original implementation of the paper and this version of BERT are as follows:
- Scripts to download Wikipedia and BookCorpus datasets
- Scripts to preprocess downloaded data or a custom corpus into inputs and targets for pre-training in a modular fashion
- Fused [LAMB](https://arxiv.org/pdf/1904.00962.pdf) optimizer to support training with larger batches
- Fused Adam optimizer for fine-tuning tasks
- Fused CUDA kernels for LayerNorm for better performance
- Automatic mixed precision (AMP) training support (see the sketch after this list)
- Scripts to launch training on multiple nodes
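To illustrate the AMP bullet above, here is a minimal sketch of mixed precision training with native PyTorch's `torch.cuda.amp`. The toy model and random batch are placeholders, and this repository's own scripts may enable mixed precision differently (for example, through NVIDIA Apex).

```python
import torch

# Placeholder model and optimizer; in this repo these would be BERT and FusedLAMB/Adam.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = model(x).float().pow(2).mean()
    scaler.scale(loss).backward()     # backward on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then calls optimizer.step()
    scaler.update()                   # adjust the loss scale for the next iteration
```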
Other publicly available implementations of BERT include:
1. [NVIDIA TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT)
2. [Hugging Face](https://github.com/huggingface/pytorch-pretrained-BERT)
3. [codertimo](https://github.com/codertimo/BERT-pytorch)
4. [gluon-nlp](https://github.com/dmlc/gluon-nlp/tree/master/scripts/bert)
5. [Google's implementation](https://github.com/google-research/bert)
This model trains with mixed precision using Tensor Cores on Volta and Ampere GPUs and provides a push-button solution to pre-training on a corpus of your choice. As a result, researchers can get results 4x faster than training without Tensor Cores. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
The BERT model uses the same architecture as the encoder of the Transformer.
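To make the encoder-stack shape concrete, here is a minimal sketch using stock `torch.nn` modules with the published BERT-Large hyperparameters (24 layers, hidden size 1024, 16 attention heads, feed-forward size 4096, GELU activations). It is an illustration only, not the repository's modeling code, which additionally implements the token/position/segment embeddings and the pre-training heads.

```python
import torch

# One Transformer encoder block with BERT-Large dimensions.
layer = torch.nn.TransformerEncoderLayer(
    d_model=1024,          # hidden size H
    nhead=16,              # attention heads A
    dim_feedforward=4096,  # FFN inner size, 4 * H
    activation="gelu",
)
# BERT-Large stacks 24 such blocks.
encoder = torch.nn.TransformerEncoder(layer, num_layers=24)

embeddings = torch.randn(512, 8, 1024)  # (seq_len, batch, hidden) layout
hidden_states = encoder(embeddings)     # output keeps the input shape
print(hidden_states.shape)              # torch.Size([512, 8, 1024])
```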