Python: Paper Code for Fine-tuning BERT for Extractive Summarization

36 files in total: 20 .py, 7 .txt, 5 .gitignore

Uploaded 2019-08-10 06:30:23 · 14.99 MB ZIP

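BERT (Bidirectional Encoder Representations from Transformers) has drawn wide attention in natural language processing for its strong performance. This project, "Python: paper code for fine-tuning BERT for extractive summarization", is a Python implementation that fine-tunes a pre-trained BERT model for extractive summarization, that is, selecting the most important sentences of a document rather than generating new text. The notes below cover how BERT works, how it is fine-tuned, and how it is applied to automatic summarization.

BERT is a pre-trained model built on the Transformer architecture and introduced by Google in 2018. Its core idea is to use the Transformer's self-attention mechanism so that every token's representation is conditioned on both its left and right context. Compared with earlier RNN or CNN models, this gives it a better grasp of the meaning of a text as a whole. Fine-tuning BERT for summarization involves the following steps:

1. **Data preprocessing**: The raw text is converted into the input format BERT expects: token IDs, segment IDs, and an attention mask. This usually means tokenizing, adding the special tokens [CLS] and [SEP], and padding sequences so that all inputs have the same length. In the BertSum paper, a [CLS] token is placed before every sentence and segment IDs alternate between sentences; the `-use_interval true` flag in the README's training commands corresponds to this.
2. **Loading the pre-trained model**: The README lists `pytorch_pretrained_bert`, the predecessor of Hugging Face's Transformers library, among the required packages, and notes that the first training run downloads the BERT model. The pre-trained model has already been trained on large amounts of unlabeled text and provides rich language representations.
3. **Task-specific layers**: Because the task is extractive, the layers added on top of BERT score sentences instead of decoding new text. The README offers three choices through the `-encoder` flag: a simple classifier, a Transformer, and an RNN (reported as BERTSUM+LSTM in the results table).
4. **Loss function and optimization**: Each sentence carries a binary label (include it in the summary or not), so training uses a binary cross-entropy loss over sentences. Parameters are updated with the Adam optimizer, and the README's training commands use `-decay_method noam` with `-warmup_steps 10000`, which is a learning-rate warmup followed by decay.
5. **Training and evaluation**: Training uses source documents paired with reference summaries and updates the model parameters by backpropagation. Sentence labels are derived from the reference summaries during preprocessing (`-oracle_mode greedy` or `combination` in the README). Evaluation uses ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which measures n-gram overlap between the produced summary and the reference summary.
6. **Post-processing**: The summary is assembled from the highest-scoring sentences. Special tokens such as [CLS] and [SEP] belong only to the model input and must not appear in the final summary, and overly long output is trimmed. The README's evaluation command also passes `-block_trigram true`; trigram blocking, as described in the paper, skips a candidate sentence that shares a trigram with the sentences already selected, which reduces redundancy.

The `BertSum-master` project implements these steps. It contains data preprocessing scripts (`src/preprocess.py`, `src/prepro/data_builder.py`), model definitions (`src/models/`), and training and evaluation code (`src/train.py`). Reading this code is a practical way to learn how BERT is applied to a real natural language processing task, summarization in particular. Fine-tuning BERT for summarization is a demanding task that draws on natural language processing, deep learning, and sequence modeling, and this project gives developers a hands-on way to deepen their understanding of all three.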
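The input format described in the preprocessing step can be sketched in a few lines. This is a minimal illustration only, not the repository's `data_builder.py`: the function name `build_bertsum_inputs`, the toy vocabulary `TOY_VOCAB`, and the rule of dropping sentences that do not fit are all assumptions made for the example, and a real pipeline would use BERT's WordPiece tokenizer and vocabulary.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real run would use BERT's WordPiece vocabulary.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "cat": 5, "sat": 6, "it": 7, "slept": 8,
}


def build_bertsum_inputs(
    sentences: List[List[str]], vocab: Dict[str, int], max_len: int
) -> Dict[str, List[int]]:
    """Format a tokenized document as one padded input with a [CLS] per sentence."""
    tokens: List[str] = []
    segment_ids: List[int] = []
    cls_positions: List[int] = []

    for i, sentence in enumerate(sentences):
        piece = ["[CLS]"] + sentence + ["[SEP]"]
        if len(tokens) + len(piece) > max_len:
            break  # assumption: drop sentences that do not fit
        cls_positions.append(len(tokens))      # where this sentence's [CLS] sits
        tokens.extend(piece)
        segment_ids.extend([i % 2] * len(piece))  # alternate 0/1 per sentence

    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(token_ids)

    # Pad every sequence to the same fixed length.
    pad = max_len - len(token_ids)
    token_ids += [vocab["[PAD]"]] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad

    return {
        "token_ids": token_ids,
        "segment_ids": segment_ids,
        "attention_mask": attention_mask,
        "cls_positions": cls_positions,
    }


doc = [["the", "cat", "sat"], ["it", "slept"]]
inputs = build_bertsum_inputs(doc, TOY_VOCAB, max_len=12)
print(inputs["cls_positions"])  # [0, 5]
```

The `cls_positions` list is what a sentence-scoring layer would use to pick one vector per sentence out of BERT's output.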


Archive contents: Python-微调BERT用于提取摘要的论文代码.zip (36 files)

BertSum-master/
  json_data/
    cnndm_sample.train.0.json 83KB
  results/
    .gitignore 13B
  bert_data/
    .gitignore 13B
  src/
    distributed.py 4KB
    prepro/
      __init__.py 0B
      utils.py 870B
      data_builder.py 12KB
      smart_common_words.txt 4KB
    models/
      trainer.py 16KB
      reporter.py 5KB
      neural.py 8KB
      model_builder.py 4KB
      __init__.py 0B
      optimizers.py 9KB
      rnn.py 4KB
      data_loader.py 7KB
      stats.py 3KB
      encoder.py 5KB
    others/
      __init__.py 0B
      utils.py 4KB
      pyrouge.py 23KB
      logging.py 692B
    preprocess.py 2KB
    train.py 13KB
  models/
    .gitignore 13B
  LICENSE 11KB
  bert_config_uncased_base.json 313B
  README.md 5KB
  raw_data/
    .gitignore 13B
  logs/
    .gitignore 13B
  urls/
    cnn_mapping_valid.txt 139KB
    mapping_test.txt 2.01MB
    mapping_valid.txt 2.32MB
    cnn_mapping_train.txt 9.95MB
    cnn_mapping_test.txt 130KB
    mapping_train.txt 44.27MB

# BertSum

**This code is for paper `Fine-tune BERT for Extractive Summarization`**(https://arxiv.org/pdf/1903.10318.pdf)

**Please email the author for a pre-trained model**

Results on CNN/Dailymail (25/3/2019):

| Models | ROUGE-1 | ROUGE-2 | ROUGE-L |
| :--- | :--- | :--- | :--- |
| Transformer Baseline | 40.9 | 18.02 | 37.17 |
| BERTSUM+Classifier | 43.23 | 20.22 | 39.60 |
| BERTSUM+Transformer | 43.25 | 20.24 | 39.63 |
| BERTSUM+LSTM | 43.22 | 20.17 | 39.59 |

**Python version**: This code is in Python3.6

**Package Requirements**: pytorch pytorch_pretrained_bert tensorboardX multiprocess pyrouge

Some code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py)

## Data Preparation For CNN/Dailymail

### Option 1: download the processed data

Download https://drive.google.com/open?id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6

Unzip the zipfile and put all `.pt` files into `bert_data`

### Option 2: process the data yourself

#### Step 1. Download Stories

Download and unzip the `stories` directories from [here](http://cs.nyu.edu/~kcho/DMQA/) for both CNN and Daily Mail. Put all `.story` files in one directory (e.g. `../raw_stories`)

#### Step 2. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following command to your bash_profile:

```
export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
```

replacing `/path/to/` with the path to where you saved the `stanford-corenlp-full-2017-06-09` directory.

#### Step 3. Sentence Splitting and Tokenization

```
python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH
```

* `RAW_PATH` is the directory containing story files (`../raw_stories`), `TOKENIZED_PATH` is the target directory to save the tokenized files (`../merged_stories_tokenized`)

#### Step 4. Format to Simpler Json Files

```
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -map_path MAP_PATH -lower
```

* `RAW_PATH` is the directory containing tokenized files (`../merged_stories_tokenized`), `JSON_PATH` is the target directory to save the generated json files (`../json_data/cnndm`), `MAP_PATH` is the directory containing the urls files (`../urls`)

#### Step 5. Format to PyTorch Files

```
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log
```

* `JSON_PATH` is the directory containing json files (`../json_data`), `BERT_DATA_PATH` is the target directory to save the generated binary files (`../bert_data`)
* `-oracle_mode` can be `greedy` or `combination`, where `combination` is more accurate but takes much longer to process

## Model Training

**First run**: The first time, you should use a single GPU so that the code can download the BERT model. Change ``-visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3`` to ``-visible_gpus 0 -gpu_ranks 0 -world_size 1``. After the download completes, you can kill the process and rerun the code with multiple GPUs.
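The training commands in this section all pass `-decay_method noam` together with `-lr 2e-3` and `-warmup_steps 10000`. As a rough picture of what such a schedule does, here is a minimal sketch of the Noam learning-rate rule from the original Transformer paper, with the model-dimension factor left out. Whether `optimizers.py` in this repository computes exactly this expression is an assumption, and `noam_lr` is a hypothetical helper name used only for illustration.

```python
def noam_lr(step: int, base_lr: float = 2e-3, warmup_steps: int = 10000) -> float:
    """Noam-style schedule: linear warmup, then inverse-square-root decay.

    Assumption: the base rate is scaled by min(step^-0.5, step * warmup^-1.5),
    without the model-dimension factor used in the Transformer paper.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


# Print the rate at a few points of a 50000-step run.
for step in (1, 5000, 10000, 40000):
    print(step, noam_lr(step))
```

Under these assumptions the rate rises linearly to about 2e-5 at step 10000 and then falls off with the inverse square root of the step, which would explain why the nominal `-lr 2e-3` is much larger than the rates usually quoted for BERT fine-tuning.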
To train the BERT+Classifier model, run:

```
python train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000
```

To train the BERT+Transformer model, run:

```
python train.py -mode train -encoder transformer -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_transformer -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_transformer -use_interval true -warmup_steps 10000 -ff_size 2048 -inter_layers 2 -heads 8
```

To train the BERT+RNN model, run:

```
python train.py -mode train -encoder rnn -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_rnn -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_rnn -use_interval true -warmup_steps 10000 -rnn_size 768 -dropout 0.1
```

* `-mode` can be {`train, validate, test`}, where `validate` will inspect the model directory and evaluate the model for each newly saved checkpoint, and `test` needs to be used with `-test_from`, indicating the checkpoint you want to use

## Model Evaluation

After training has finished, run

```
python train.py -mode validate -bert_data_path ../bert_data/cnndm -model_path MODEL_PATH -visible_gpus 0 -gpu_ranks 0 -batch_size 30000 -log_file LOG_FILE -result_path RESULT_PATH -test_all -block_trigram true
```

* `MODEL_PATH` is the directory of saved checkpoints
* `RESULT_PATH` is where you want to put decoded summaries (default `../results/cnndm`)
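The evaluation command above enables `-block_trigram true`. The idea behind trigram blocking, as the paper describes it, is to walk through sentences from highest to lowest score and skip any sentence that shares a trigram with the summary built so far. The sketch below illustrates that idea under simplifying assumptions: whitespace tokenization, lowercase matching, and a three-sentence limit are choices made for the example, and `select_with_trigram_blocking` is a hypothetical name, not a function from this repository.

```python
from typing import List, Sequence, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(tokens: Sequence[str]) -> Set[Trigram]:
    """Return the set of consecutive three-token windows (empty if too short)."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(
    sentences: Sequence[str], scores: Sequence[float], max_sentences: int = 3
) -> List[int]:
    """Pick sentence indices by descending score, skipping trigram repeats."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    seen: Set[Trigram] = set()
    for idx in ranked:
        candidate = trigrams(sentences[idx].lower().split())
        if candidate & seen:
            continue  # overlaps the summary so far: block this sentence
        chosen.append(idx)
        seen |= candidate
        if len(chosen) == max_sentences:
            break
    return chosen  # indices in selection order, not document order


sentences = [
    "the cat sat on the mat",
    "a dog barked loudly today",
    "the cat sat by the door",
    "rain fell all night long",
]
scores = [0.9, 0.4, 0.8, 0.6]
print(select_with_trigram_blocking(sentences, scores))  # [0, 3, 1]
```

In this toy run the third sentence has the second-highest score but is skipped because it repeats the trigram "the cat sat" from the first selected sentence.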
