TinyBERT
========
TinyBERT is 7.5x smaller and 9.4x faster at inference than BERT-base and achieves competitive performance on natural language understanding tasks. It performs a novel Transformer distillation at both the pre-training and task-specific learning stages. An overview of TinyBERT learning is illustrated below:
<br />
<br />
<img src="tinybert_overview.png" width="800" height="210"/>
<br />
<br />
For more details about the techniques of TinyBERT, refer to our paper:
[TinyBERT: Distilling BERT for Natural Language Understanding](https://arxiv.org/abs/1909.10351)
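For intuition, the Transformer-layer distillation matches each student layer to a chosen teacher layer by penalizing the difference between their attention matrices and between their (linearly projected) hidden states. Below is a minimal PyTorch-style sketch of that per-layer objective; the tensor layout and the projection module `proj` are illustrative assumptions, not the exact code in this repository.

```python
import torch.nn.functional as F

def transformer_layer_distill_loss(student_attn, teacher_attn,
                                   student_hidden, teacher_hidden, proj):
    """Sketch of the per-layer TinyBERT objective (illustrative, not the repo's code).

    student_attn / teacher_attn: [batch, heads, seq_len, seq_len] attention matrices
    student_hidden:              [batch, seq_len, d_student] hidden states
    teacher_hidden:              [batch, seq_len, d_teacher] hidden states
    proj:                        learnable linear map from d_student to d_teacher
    """
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    hidden_loss = F.mse_loss(proj(student_hidden), teacher_hidden)
    return attn_loss + hidden_loss
```

The full objective in the paper additionally distills the embedding layer and, at the task-specific stage, the prediction layer (a sketch of that term appears in the Task-specific Distillation section below).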
Release Notes
=============
First version: 2019/11/26
Installation
============
Run the command below to install the environment (**using python3**):
```bash
pip install -r requirements.txt
```
General Distillation
====================
In general distillation, we use the original BERT-base without fine-tuning as the teacher and a large-scale text corpus as the learning data. By performing Transformer distillation on text from the general domain, we obtain a general TinyBERT that provides a good initialization for the task-specific distillation.
General distillation has two steps: (1) generate the JSON-format corpus; (2) run the Transformer distillation.
Step 1: use `pregenerate_training_data.py` to produce the JSON-format corpus
```bash
# ${BERT_BASE_DIR} includes the BERT-base teacher model.
python pregenerate_training_data.py --train_corpus ${CORPUS_RAW} \
                  --bert_model ${BERT_BASE_DIR} \
                  --reduce_memory --do_lower_case \
                  --epochs_to_generate 3 \
                  --output_dir ${CORPUS_JSON_DIR}
```
Step 2: use `general_distill.py` to run the general distillation
```bash
# ${STUDENT_CONFIG_DIR} includes the config file of the student model.
python general_distill.py --pregenerated_data ${CORPUS_JSON} \
                  --teacher_model ${BERT_BASE} \
                  --student_model ${STUDENT_CONFIG_DIR} \
                  --reduce_memory --do_lower_case \
                  --train_batch_size 256 \
                  --output_dir ${GENERAL_TINYBERT_DIR}
```
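The `${STUDENT_CONFIG_DIR}` above is expected to contain a BERT-style `bert_config.json` describing the student architecture. The snippet below sketches what a 4layer-312dim student configuration might look like, with sizes taken from the TinyBERT_4 setting in the paper; treat the exact fields and values as an assumption and check them against the config files shipped with the released General_TinyBERT models.

```json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
```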
We also provide the general TinyBERT models here, so users can skip the general distillation step.
**1st version (to reproduce our results in the paper):**

* [General_TinyBERT(4layer-312dim)](https://drive.google.com/uc?export=download&id=1dDigD7QBv1BmE6pWU71pFYPgovvEqOOj)
* [General_TinyBERT(6layer-768dim)](https://drive.google.com/uc?export=download&id=1wXWR00EHK-Eb7pbyw0VP234i2JTnjJ-x)

**2nd version (2019/11/18, trained with more data (Books + Wikipedia) and no `[MASK]` corpus):**

* [General_TinyBERT_v2(4layer-312dim)](https://drive.google.com/open?id=1PhI73thKoLU2iliasJmlQXBav3v33-8z)
* [General_TinyBERT_v2(6layer-768dim)](https://drive.google.com/open?id=1r2bmEsQe4jUBrzJknnNaBJQDgiRKmQjF)
Data Augmentation
=================
Data augmentation aims to expand the task-specific training set. By learning from more task-related examples, the generalization capability of the student model can be further improved. We combine the pre-trained BERT language model and GloVe embeddings to perform word-level replacement for data augmentation (a rough sketch follows the usage notes below).
Use `data_augmentation.py` to run data augmentation; the augmented dataset `train_aug.tsv` is automatically saved into the corresponding `${GLUE_DIR}/${TASK_NAME}` directory.
```bash
python data_augmentation.py --pretrained_bert_model ${BERT_BASE_DIR} \
                  --glove_embs ${GLOVE_EMB} \
                  --glue_dir ${GLUE_DIR} \
                  --task_name ${TASK_NAME}
```
Before running data augmentation on GLUE tasks, you should download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to some directory `GLUE_DIR`. `TASK_NAME` can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, or RTE.
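Conceptually, the augmentation does word-level replacement as sketched below: for words the BERT tokenizer keeps as a single word piece, replacement candidates come from BERT's masked-LM predictions at that position; for other words, nearest neighbours in the GloVe embedding space are used, and each word is swapped with a sampled candidate with some probability. The helper callables and default values here are illustrative assumptions, not the actual interface of `data_augmentation.py`.

```python
import random

def augment_sentence(words, is_single_piece, mlm_candidates, glove_neighbors,
                     replace_prob=0.4, num_candidates=15):
    """Word-level replacement sketch; all helpers are hypothetical stand-ins.

    is_single_piece(word)         -> True if the BERT tokenizer keeps the word as one piece
    mlm_candidates(words, idx, k) -> k words predicted by BERT's masked LM at position idx
    glove_neighbors(word, k)      -> k nearest words in the GloVe embedding space
    """
    augmented = list(words)
    for idx, word in enumerate(words):
        if random.random() > replace_prob:
            continue  # keep the original word
        candidates = (mlm_candidates(words, idx, num_candidates)
                      if is_single_piece(word)
                      else glove_neighbors(word, num_candidates))
        if candidates:
            augmented[idx] = random.choice(candidates)
    return augmented
```

Repeating such replacement several times per training example is how the expanded `train_aug.tsv` is built up.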
Task-specific Distillation
==========================
In task-specific distillation, we re-perform the proposed Transformer distillation to further improve TinyBERT by focusing on learning task-specific knowledge.
Task-specific distillation includes two steps: (1) intermediate layer distillation; (2) prediction layer distillation.
Step 1: use `task_distill.py` to run the intermediate layer distillation.
```bash
# ${FT_BERT_BASE_DIR} contains the fine-tuned BERT-base model.
python task_distill.py --teacher_model ${FT_BERT_BASE_DIR} \
                  --student_model ${GENERAL_TINYBERT_DIR} \
                  --data_dir ${TASK_DIR} \
                  --task_name ${TASK_NAME} \
                  --output_dir ${TMP_TINYBERT_DIR} \
                  --max_seq_length 128 \
                  --train_batch_size 32 \
                  --num_train_epochs 10 \
                  --aug_train \
                  --do_lower_case
```
Step 2: use `task_distill.py` to run the prediction layer distillation.
```bash
python task_distill.py --pred_distill \
                  --teacher_model ${FT_BERT_BASE_DIR} \
                  --student_model ${TMP_TINYBERT_DIR} \
                  --data_dir ${TASK_DIR} \
                  --task_name ${TASK_NAME} \
                  --output_dir ${TINYBERT_DIR} \
                  --aug_train \
                  --do_lower_case \
                  --learning_rate 3e-5 \
                  --num_train_epochs 3 \
                  --eval_step 100 \
                  --max_seq_length 128 \
                  --train_batch_size 32
```
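The prediction-layer distillation in Step 2 trains the student to mimic the fine-tuned teacher's output logits with a soft cross-entropy loss. A minimal sketch is shown below; the `temperature` argument is an illustrative assumption (the paper reports that a temperature of 1 works well).

```python
import torch.nn.functional as F

def prediction_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between teacher and student classification logits.

    student_logits, teacher_logits: [batch, num_labels]
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```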
Here we also provide the distilled TinyBERT (both 4layer-312dim and 6layer-768dim) for all GLUE tasks for evaluation. Each task has its own folder where the corresponding model is saved.
* [TinyBERT(4layer-312dim)](https://drive.google.com/uc?export=download&id=1_sCARNCgOZZFiWTSgNbE7viW_G5vIXYg)
* [TinyBERT(6layer-768dim)](https://drive.google.com/uc?export=download&id=1Vf0ZnMhtZFUE0XoD3hTXc6QtHwKr_PwS)
Evaluation
==========================
`task_distill.py` also provides evaluation; run the following command:
```bash
# ${TINYBERT_DIR} includes the config file, student model and vocab file.
python task_distill.py --do_eval \
                  --student_model ${TINYBERT_DIR} \
                  --data_dir ${TASK_DIR} \
                  --task_name ${TASK_NAME} \
                  --output_dir ${OUTPUT_DIR} \
                  --do_lower_case \
                  --eval_batch_size 32 \
                  --max_seq_length 128
```
To Dos
=========================
* Evaluate TinyBERT on Chinese tasks.
* Tiny*: use NEZHA or ALBERT as the teacher in TinyBERT learning.
* Release better general TinyBERTs.