Python: "Text Classification Template Based on Pretrained Models (BERT, BERT-wwm), CCF BDCI News Sentiment Analysis" + Source Code + Design Materials

131 files in total

py: 70

pyc: 44

txt: 5


Tags: python, artificial intelligence, bert, natural language processing

152 views, uploaded 2024-04-17 13:15:36, 2.73MB ZIP


Package contents (131 files):

train.csv 2.42MB

test.csv 2.37MB

submit_example.csv 136KB

LICENSE 11KB

README.md 3KB

test_sentencepiece.model 247KB

modeling_bert.py 67KB

modeling_xlnet.py 64KB

modeling_transfo_xl.py 58KB

modeling_utils.py 47KB

modeling_xlm.py 44KB

modeling_gpt2.py 36KB

modeling_openai.py 34KB

tokenization_utils.py 32KB

run_xlnet.py 28KB

run_bert.py 27KB

modeling_common_test.py 26KB

tokenization_transfo_xl.py 21KB

tokenization_bert.py 20KB

modeling_roberta.py 18KB

bert_hubconf.py 16KB

modeling_auto.py 15KB

modeling_bert_test.py 14KB

modeling_xlnet_test.py 14KB

modeling_transfo_xl_utilities.py 13KB

modeling_xlm_test.py 12KB

tokenization_xlm.py 11KB

modeling_roberta_test.py 10KB

file_utils.py 9KB

convert_roberta_checkpoint_to_pytorch.py 9KB

optimization.py 8KB

modeling_transfo_xl_test.py 8KB

tokenization_gpt2.py 8KB

tokenization_roberta.py 8KB

tokenization_xlnet.py 8KB

gpt_hubconf.py 8KB

tokenization_openai.py 7KB

xlnet_hubconf.1.py 7KB

__main__.py 7KB

gpt2_hubconf.py 7KB

xlm_hubconf.py 6KB

optimization_test.py 6KB

tokenization_tests_commons.py 6KB

tokenization_auto.py 6KB

transformer_xl_hubconf.py 6KB

convert_transfo_xl_checkpoint_to_pytorch.py 5KB

tokenization_bert_test.py 5KB

tokenization_xlnet_test.py 5KB

convert_pytorch_checkpoint_to_tf.py 4KB

convert_xlnet_checkpoint_to_pytorch.py 4KB

tokenization_roberta_test.py 4KB

__init__.py 3KB

convert_openai_checkpoint_to_pytorch.py 3KB

tokenization_xlm_test.py 3KB

convert_gpt2_checkpoint_to_pytorch.py 3KB

convert_xlm_checkpoint_to_pytorch.py 3KB

setup.py 3KB

tokenization_transfo_xl_test.py 3KB

tokenization_gpt2_test.py 3KB

tokenization_openai_test.py 3KB

convert_tf_checkpoint_to_pytorch.py 3KB

modeling_openai_test.py 2KB

modeling_gpt2_test.py 2KB

final_changed.py 2KB

tokenization_utils_test.py 2KB

tokenization_auto_test.py 2KB

modeling_auto_test.py 2KB

preprocess.py 1KB

ensemble.py 956B

analysis.py 932B

main.py 829B

hubconf.py 723B

combine.py 679B

conftest.py 511B

statistc.py 461B

__init__.py 0B

modeling_bert.cpython-36.pyc 56KB

modeling_bert.cpython-37.pyc 56KB

modeling_xlnet.cpython-36.pyc 47KB

modeling_xlnet.cpython-37.pyc 46KB

modeling_transfo_xl.cpython-36.pyc 39KB

modeling_transfo_xl.cpython-37.pyc 39KB

modeling_utils.cpython-36.pyc 39KB

modeling_utils.cpython-37.pyc 39KB

modeling_xlm.cpython-36.pyc 34KB

modeling_xlm.cpython-37.pyc 34KB

modeling_gpt2.cpython-36.pyc 31KB

modeling_gpt2.cpython-37.pyc 31KB

modeling_openai.cpython-36.pyc 30KB

modeling_openai.cpython-37.pyc 29KB

tokenization_utils.cpython-37.pyc 28KB

tokenization_utils.cpython-36.pyc 28KB

modeling_roberta.cpython-36.pyc 17KB

modeling_roberta.cpython-37.pyc 17KB

tokenization_transfo_xl.cpython-36.pyc 16KB

tokenization_transfo_xl.cpython-37.pyc 16KB

tokenization_bert.cpython-37.pyc 16KB

tokenization_bert.cpython-36.pyc 16KB

modeling_auto.cpython-37.pyc 13KB

modeling_auto.cpython-36.pyc 13KB


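Author: ustc-linhw

This project is a template for text classification tasks. It currently supports the following features:

- k-fold splitting of the training set
- Inspecting statistics of the training set
- Text classification with pretrained models
  - roberta_wwm_ext_large
  - roberta_large
  - xlnet_large (to do)
- Voting ensemble over the results of different models
- Automatic saving of the model, configuration, and output results once training finishes

Main files and directories:

- backup-models: automatic archive directory; output models and results are archived here automatically
- data: data directory holding the training data; data analysis and k-fold splitting are run from here
- pretrained_model: holds the pretrained models
- run_xxxxx.sh: bash script used to train a given model
- run_xxxx.py: the actual training code
- ensemble_submits: merges the output result files by voting

Usage:

1. For a different classification task, the files below may need to be changed. The current setup is binary classification.
   - preprocess.py
   - run_bert.py
     - labels
     - number of classes
     - class loss
   - combine.py
2. cd data && python analysis.py to inspect the dataset.
3. python preprocess.py to preprocess the data and split it into k folds.
4. Edit run_xxxx.sh to set the parameters.

   Note: the model cuts each text into k segments, feeds each segment into the language model separately, and combines the segment outputs with a GRU on top. This makes it possible to use a small max_length with a larger k to reduce GPU memory use, because memory grows quadratically with sequence length but only linearly with k.

   1) Actual length = max_seq_length * split_num
   2) Actual batch size = per_gpu_train_batch_size * number of GPUs
   3) The reported results were obtained on 4 GPUs, so the batch size was 4. With a single GPU, set per_gpu_train_batch_size to 4 and use a smaller max_length.
   4) If GPU memory is too small, set gradient_accumulation_steps. For example, with gradient_accumulation_steps=2 and batch size=4, each update runs 2 passes with batch size 2 and accumulates the gradients before updating. This is equivalent to batch size=4 but about twice as slow, and the number of iterations must be doubled as well, i.e. set train_steps to 10000.

   The actual batch size is shown in the run-time log, for example:

       09/06/2019 21:03:41 - INFO - __main__ - ***** Running training *****
       09/06/2019 21:03:41 - INFO - __main__ - Num examples = 5872
       09/06/2019 21:03:41 - INFO - __main__ - Batch size = 4
       09/06/2019 21:03:41 - INFO - __main__ - Num steps = 5000

5. The final output is written to result.csv. The model is saved in the corresponding model folder, and the backup folder automatically keeps a copy of it.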
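The arithmetic in notes 1) to 4) can be checked with a small sketch. This is not code from the package: the function names plan_run and attention_cost_ratio are hypothetical, and the sample values max_seq_length=128 and split_num=3 are assumptions chosen only for illustration. The gradient accumulation behaviour follows the description above (the batch is divided into smaller passes and train_steps is scaled up) and has not been verified against run_bert.py.

```python
def plan_run(max_seq_length, split_num, per_gpu_train_batch_size, n_gpu,
             gradient_accumulation_steps=1, base_train_steps=5000):
    """Derive the effective settings described in notes 1) to 4).

    Hypothetical helper: it mirrors the README's description, not the
    package's actual argument handling.
    """
    values = (max_seq_length, split_num, per_gpu_train_batch_size,
              n_gpu, gradient_accumulation_steps, base_train_steps)
    if min(values) < 1:
        raise ValueError("all arguments must be positive integers")

    # Note 2: actual batch size = per-GPU batch size * number of GPUs.
    batch_size = per_gpu_train_batch_size * n_gpu
    if batch_size % gradient_accumulation_steps != 0:
        raise ValueError("batch size must be divisible by "
                         "gradient_accumulation_steps")

    return {
        # Note 1: actual length = max_seq_length * split_num.
        "effective_length": max_seq_length * split_num,
        "batch_size": batch_size,
        # Note 4: each pass sees a smaller batch; gradients are accumulated.
        "micro_batch_size": batch_size // gradient_accumulation_steps,
        # Note 4: the iteration count is scaled up by the same factor.
        "train_steps": base_train_steps * gradient_accumulation_steps,
    }


def attention_cost_ratio(max_seq_length, split_num):
    """Ratio of quadratic attention cost: one long sequence vs. k segments.

    One sequence of length k*L costs (k*L)^2; k segments of length L cost
    k * L^2. The ratio simplifies to k, which is why splitting saves memory.
    """
    full = (max_seq_length * split_num) ** 2
    split = split_num * max_seq_length ** 2
    return full / split


if __name__ == "__main__":
    # 4 GPUs, one example per GPU, no accumulation.
    print(plan_run(128, 3, per_gpu_train_batch_size=1, n_gpu=4))
    # 1 GPU with limited memory: accumulate over 2 passes.
    print(plan_run(128, 3, per_gpu_train_batch_size=4, n_gpu=1,
                   gradient_accumulation_steps=2))
    print(attention_cost_ratio(128, 3))
```

Both calls give the same batch size of 4 and an effective length of 384; the second halves the per-pass batch and doubles train_steps from 5000 to 10000, matching the example in note 4).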
