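AI project materials: an optimized model for short-text classification, with prediction up to 10x faster and accuracy essentially unchanged (.zip resource, CSDN Library)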

84 files in total

py: 24

jpg: 9

pyc: 6

Uploaded 2024-02-08 12:00:05, 183.01MB ZIP

基于短文本分类的优化模型，预测速度可提升10倍，准确度基本不降.zip (84 files)

资料总结

run_classifier_lcqmc.sh 2KB

classifier_utils.py 30KB

init_checkpoint

checkpoint 91B

albert_model.ckpt.meta 184KB

albert_config_tiny.json 562B

albert_model.ckpt.data-00000-of-00001 16.38MB

albert_model.ckpt.index 1KB

vocab.txt 107KB

optimization_finetuning.py 6KB

tokenization_google.py 15KB

resources

albert_performance.jpg 118KB

add_data_removing_dropout.jpg 96KB

create_pretraining_data_roberta.py 26KB

albert_large_zh_parameters.jpg 211KB

xlarge_loss.jpg 81KB

shell_scripts

create_pretrain_data_batch_webtext.sh 416B

albert_configuration.jpg 90KB

albert_tiny_compare_s_old.jpg 47KB

crmc2018_compare_s.jpg 62KB

state_of_the_art.jpg 118KB

albert_tiny_compare_s.jpg 149KB

modeling_google_fast.py 46KB

data

dev.txt 674KB

eval_results_albert_zh.txt 855B

test.txt 61KB

train.txt 15.74MB

bert_utils.py 4KB

run_classifier_clue.sh 3KB

similarity.py 11KB

run_pretraining.py 19KB

output

model.ckpt-35000.data-00000-of-00001 47.79MB

checkpoint 277B

graph.pbtxt 1.86MB

model.ckpt-36000.meta 815KB

model.ckpt-37307.index 3KB

label2id.pkl 28B

model.ckpt-37000.index 3KB

model.ckpt-34000.meta 815KB

model.ckpt-37000.meta 815KB

events.out.tfevents.1594989175.tensorflow-23 2.98MB

eval.tf_record 4.58MB

model.ckpt-35000.meta 815KB

model.ckpt-36000.index 3KB

model.ckpt-37307.meta 815KB

model.ckpt-34000.index 3KB

classification_model.pb 16.01MB

eval

events.out.tfevents.1595043971.tensorflow-23 584KB

model.ckpt-36000.data-00000-of-00001 47.79MB

predict.tf_record 527KB

model.ckpt-34000.data-00000-of-00001 47.79MB

model.ckpt-37000.data-00000-of-00001 47.79MB

test_results.tsv 22KB

model.ckpt-35000.index 3KB

model.ckpt-37307.data-00000-of-00001 47.79MB

run_classifier_clue.py 37KB

optimization_google.py 7KB

run_pretraining_google_fast.py 21KB

freeze_graph.py 7KB

albert_config

bert_config.json 518B

albert_config_tiny.json 562B

albert_config_small_google.json 482B

albert_config_tiny_google.json 483B

vocab.txt 107KB

modeling.py 49KB

optimization.py 12KB

test_changes.py 3KB

create_pretrain_data.sh 339B

modeling_google.py 42KB

tokenization.py 13KB

create_pretraining_data.py 43KB

args.py 911B

run_pretraining_google.py 21KB

lamb_optimizer_google.py 5KB

__pycache__

bert_utils.cpython-36.pyc 4KB

run_classifier.cpython-36.pyc 24KB

tokenization.cpython-36.pyc 10KB

args.cpython-36.pyc 789B

modeling.cpython-36.pyc 31KB

optimization_finetuning.cpython-36.pyc 4KB

create_pretraining_data_google.py 23KB

README.md 28KB

mobile_svr

bertsvr.sh 624B

run_classifier.py 35KB

run_classifier_sp_google.py 38KB

# albert_zh

An Implementation of <a href="https://arxiv.org/pdf/1909.11942.pdf">A Lite BERT for Self-supervised Learning of Language Representations</a> with TensorFlow

ALBERT is based on BERT, with some improvements. It achieves state-of-the-art performance on the main benchmarks with 30% fewer parameters.

albert_base_zh has only about ten percent of the parameters of the original BERT model, and most of the accuracy is retained.

Different versions of the ALBERT pre-trained model for Chinese, including TensorFlow, PyTorch and Keras, are available now.

ALBERT models pre-trained on a massive Chinese corpus: fewer parameters, better results. Even a small pre-trained model can handle 13 NLP tasks; ALBERT's three main modifications took it to the top of the GLUE benchmark.

For one-command runs over 10 datasets and 9 baseline models, and a detailed comparison of model performance across tasks, see the <a href="http://www.CLUEbenchmark.com">CLUE benchmark for Chinese tasks</a>.

<img src="https://github.com/brightmart/albert_zh/blob/master/resources/albert_tiny_compare_s.jpg" width="90%" height="70%" />

Run the CLUE Chinese tasks with one command: 6 Chinese classification or sentence-pair tasks (new)
---------------------------------------------------------------------

Usage:

1. Clone the project:

       git clone https://github.com/brightmart/albert_zh.git

2. Run the one-command script (GPU mode). It automatically downloads the model and all task data, then starts running:

       bash run_classifier_clue.sh

   The script downloads all task data, finds the best model for each task, and then runs the test set to produce the results for submission.

Download Pre-trained Models of Chinese
-----------------------------------------------

1. <a href="https://storage.googleapis.com/albert_zh/albert_tiny.zip">albert_tiny_zh</a>, <a href="https://storage.googleapis.com/albert_zh/albert_tiny_489k.zip">albert_tiny_zh (trained longer, 2 billion examples seen in total)</a>; file size 16M, 4M parameters.

   Training and inference are about 10x faster with accuracy largely retained, and the model is 1/25 the size of BERT. It reaches 85.4% on the test set of the LCQMC semantic similarity dataset, only 1.5 points below bert_base.

   LCQMC training uses the following parameters: --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5

   albert_tiny is trained on the same large-scale Chinese corpus, but has only 4 layers, and the hidden size and other vector dimensions are greatly reduced. Try the following learning rates for better results: {2e-5, 6e-5, 1e-4}

   [Use cases] Relatively simple tasks or tasks with strict real-time requirements, such as sentence-pair tasks (for example semantic similarity) and classification tasks. For harder tasks such as reading comprehension, use one of the larger models.

   For example, the model can be deployed on mobile devices with [Tensorflow Lite](https://www.tensorflow.org/lite); a [later section](#use_tflite) covers this, including how to convert the model to Tensorflow Lite format and how to benchmark it.

   Run albert_tiny_zh with one command (Linux, LCQMC task):

       1) git clone https://github.com/brightmart/albert_zh
       2) cd albert_zh
       3) bash run_classifier_lcqmc.sh

   1.1. <a href="https://storage.googleapis.com/albert_zh/albert_tiny_zh_google.zip">albert_tiny_google_zh (1 billion examples seen in total, Google version)</a>; model size 16M, performance on par with albert_tiny_zh.

   1.2. <a href="https://storage.googleapis.com/albert_zh/albert_small_zh_google.zip">albert_small_google_zh (1 billion examples seen in total, Google version)</a>; 4x faster than bert_base; only 0.9 points below BERT on the LCQMC test set; model size 18.5M with the Adam variables removed. For usage, see the section "Fine-tuning on Downstream Task".

2. <a href="https://storage.googleapis.com/albert_zh/albert_large_zh.zip">albert_large_zh</a>; 24 layers, file size 64M.

   Parameter count and model size are one sixth of bert_base. On the test set of LCQMC, a similarity dataset of colloquial descriptions, it scores 0.2 points above bert_base.

3. <a href="https://storage.googleapis.com/albert_zh/albert_base_zh_additional_36k_steps.zip">albert_base_zh (trained on an additional 150 million instances, i.e. 36k steps * batch_size 4096)</a>; <a href="https://storage.googleapis.com/albert_zh/albert_base_zh.zip">albert_base_zh (small trial version)</a>; 12M parameters, 12 layers, size 40M.

   Parameter count is one tenth of bert_base, and so is model size. On the LCQMC test set it is about 0.6 to 1 point below bert_base. Compared with no pre-training, albert_base gains 14 points.

4. <a href="https://storage.googleapis.com/albert_zh/albert_xlarge_zh_177k.zip">albert_xlarge_zh_177k</a>; <a href="https://storage.googleapis.com/albert_zh/albert_xlarge_zh_183k.zip">albert_xlarge_zh_183k (try this one first)</a>; 24 layers, file size 230M.

   Parameter count and model size are half of bert_base. It needs a large GPU. A full test comparison will be added later. The batch_size must not be too small, or accuracy may suffer.

### Quick loading

Built on [Huggingface-Transformers 2.2.2](https://github.com/huggingface/transformers), the models above can be loaded easily.

```
# Import the auto classes used below
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")
```

The values of `MODEL_NAME` are listed below:

| Model | MODEL_NAME |
| - | - |
| albert_tiny_google_zh | voidful/albert_chinese_tiny |
| albert_small_google_zh | voidful/albert_chinese_small |
| albert_base_zh (from google) | voidful/albert_chinese_base |
| albert_large_zh (from google) | voidful/albert_chinese_large |
| albert_xlarge_zh (from google) | voidful/albert_chinese_xlarge |
| albert_xxlarge_zh (from google) | voidful/albert_chinese_xxlarge |

More <a href='https://huggingface.co/models?search=albert_chinese'>examples</a> of using ALBERT through transformers.

Pre-training
-----------------------------------------------

#### Generate tfrecords Files

Run the following command. The project includes a sample text file (data/news_zh_1.txt).

    bash create_pretrain_data.sh

If you have many text files, you can pass them as arguments to generate multiple tfrecords files.

###### Support English and Other Non-Chinese Language:

If you are pre-training on English or another non-Chinese language, set the hyperparameter non_chinese to True in create_pretraining_data.py; otherwise, by default, it runs Chinese pre-training with Chinese whole word masking.

#### Run pre-training on GPU/TPU using the command

GPU (brightmart version, tiny model):

    export BERT_BASE_DIR=./albert_tiny_zh
    nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord \
    --output_dir=./my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/albert_config_tiny.json \
    --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=51 \
    --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
    --save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &

GPU (Google version, small model):

    export BERT_BASE_DIR=./albert_small_zh_google
    nohup python3 run_pretraining_google.py --input_file=./data/tf*.tfrecord --eval_batch_size=64 \
    --output_dir=./my_new_model_path --do_train=True --do_eval=True --albert_config_file=$BERT_BASE_DIR/albert_config_small_google.json --export_dir=./my_new_model_path_export \
    --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=20 \
    --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
    --save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt

TPU, add something like this:

    --use_tpu=True --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a

Note: if you train from scratch, you can leave init_checkpoint unset; if you continue training from an existing model, set the BERT_BASE_DIR path and make sure bert_config_file and init_checkpoi
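As a rough illustration of the tiny-model GPU command above, the following Python sketch rebuilds the same argument list from a dictionary and derives two quantities implied by the flags: the number of training examples seen (steps times batch size) and the warmup fraction. The flag names and values are copied from that command. The helper names `build_command`, `examples_seen` and `warmup_fraction` are hypothetical and are not part of albert_zh.

```python
import shlex

# Flag values copied from the tiny-model GPU command in this README.
PRETRAIN_FLAGS = {
    "input_file": "./data/tf*.tfrecord",
    "output_dir": "./my_new_model_path",
    "do_train": True,
    "do_eval": True,
    "bert_config_file": "./albert_tiny_zh/albert_config_tiny.json",
    "train_batch_size": 4096,
    "max_seq_length": 512,
    "max_predictions_per_seq": 51,
    "num_train_steps": 125000,
    "num_warmup_steps": 12500,
    "learning_rate": 0.00176,
    "save_checkpoints_steps": 2000,
}


def build_command(script, flags, init_checkpoint=None):
    """Return an argv list of the form: python3 <script> --name=value ...

    Hypothetical helper: init_checkpoint is only appended when given,
    mirroring the note that it can be omitted when training from scratch.
    """
    argv = ["python3", script]
    argv += [f"--{name}={value}" for name, value in flags.items()]
    if init_checkpoint is not None:
        argv.append(f"--init_checkpoint={init_checkpoint}")
    return argv


def examples_seen(flags):
    """Total training examples processed: steps multiplied by batch size."""
    return flags["num_train_steps"] * flags["train_batch_size"]


def warmup_fraction(flags):
    """Share of training steps spent in learning-rate warmup."""
    return flags["num_warmup_steps"] / flags["num_train_steps"]


if __name__ == "__main__":
    argv = build_command("run_pretraining.py", PRETRAIN_FLAGS)
    # Quote each argument so the glob in --input_file reaches the script unexpanded.
    print(" ".join(shlex.quote(arg) for arg in argv))
    print("examples seen:", examples_seen(PRETRAIN_FLAGS))
    print("warmup fraction:", warmup_fraction(PRETRAIN_FLAGS))
```

With these flags the run covers 125000 * 4096 = 512,000,000 examples with a warmup fraction of 0.1. The same arithmetic applied to the albert_base_zh note (36k steps at batch size 4096) gives 147,456,000 instances, which the README rounds to 150 million.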
