# Text-Summarizer-Pytorch-Chinese
![Python application](https://github.com/LowinLi/Text-Summarizer-Pytorch-Chinese/workflows/Python%20application/badge.svg)
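+ Provides a Chinese abstractive summarization service.
+ Provides a reference for the complete workflow, from data to training to deployment.
## Motivation
While looking for a Chinese abstractive summarization model for work, I found almost no usable open-source projects in the community.
This project builds on the English abstractive summarization project [Text-Summarizer-Pytorch](https://github.com/rohithreddy024/Text-Summarizer-Pytorch) (a pointer-generator network), adds jieba word segmentation, and runs the full training pipeline on the [LCSTS](http://icrc.hitsz.edu.cn/Article/show/139.html) dataset. I ran into plenty of pitfalls along the way, so the complete code is open-sourced here for reference.
The repository covers two uses: downloading the trained model and deploying it as a service, and reusing the code to run the whole training pipeline as a baseline.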
## Results
Metrics on the validation and test sets:
| Metric | Validation set | Test set |
| ---- | ---- | ---- |
| ROUGE-1 | 0.3553 |0.3396 |
| ROUGE-2 | 0.1843 |0.1668 |
| ROUGE-L | 0.3481 |0.3320 |
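The model has not been carefully tuned; it comes from a single end-to-end run of the pipeline and is for reference only.
[See the sample cases at the bottom of the README](#摘要效果示例)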
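For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute a ROUGE-N F1 score from n-gram overlap. It is an illustration only: the function names are hypothetical, the inputs are assumed to be already segmented into space-separated tokens (for example by jieba), and the project's own evaluation script may compute ROUGE differently.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 between two space-separated token strings (illustrative only)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Setting `n=1` or `n=2` gives the unigram and bigram variants; ROUGE-L is based on the longest common subsequence instead and is not covered by this sketch.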
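## Deploying the service
+ Trained model:
Link: https://pan.baidu.com/s/1NKMIAsaE8H7GiCpP7Jovig Extraction code: d7pr
+ There is no need to unpack demo.tar
+ Matching vocabulary file:
Link: https://pan.baidu.com/s/1A3vzYYYenu7vfNQgRX9NHA Extraction code: 8ti6
Place both downloaded files in the project root directory.
+ Deploy: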
```bash
sudo docker-compose up
```
+ Test:
```bash
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST \
  -d '{"text":"e公司讯,龙力退(002604)7月14日晚间公告,公司股票已被深交所决定终止上市,并在退市整理期交易30个交易日,最后交易日为7月14日,将在2020年7月15日被摘牌。公司股票终止上市后,将进入股转系统进行股份转让。"}' \
  http://localhost:5000/abstract
```
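The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above (same URL, headers, and `text` field). The helper names `build_request` and `summarize` are hypothetical, and because the response format is not documented here, the body is returned as raw text rather than parsed.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host/port to your deployment.
DEFAULT_URL = "http://localhost:5000/abstract"


def build_request(text, url=DEFAULT_URL):
    """Build a JSON POST request carrying the article text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize(text, url=DEFAULT_URL, timeout=30):
    """Send the text to the running service and return the raw response body."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the service running, `print(summarize("..."))` prints whatever the `/abstract` endpoint returns for the given article text.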
## Training the model
#### Download the PreLCSTS dataset
Link: https://pan.baidu.com/s/12fTxxMhRcSHCSv_EcFXyHQ Password: 39gn
1. Place the downloaded folder in the project root directory
2. Preprocess the data
```bash
pip install -r requirements.txt
python make_data_files.py
```
3. Start training
```bash
sh train.sh
```
4. Evaluate on the validation set
```bash
sh eval.sh
```
5. Pick the best-performing model, edit the shell script to point to it, and run a second round of training
```bash
sh train_lr.sh
```
6. Pick the best-performing model, edit the shell script to point to it, and run the test
```bash
sh test.sh
```
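Steps 5 and 6 both depend on choosing the best checkpoint from the evaluation output. Assuming the evaluation log contains lines of the form `0005000.tar rouge_1:0.2338 rouge_2:0.0837 rouge_l:0.2338` (as in the validation log in this README), a small helper can do the ranking. The function names and the choice of ROUGE-L as the default metric are assumptions, not part of the project's scripts.

```python
import re

# Matches e.g. "0005000.tar rouge_1:0.2338 rouge_2:0.0837 rouge_l:0.2338"
EVAL_RE = re.compile(
    r"(?P<ckpt>\S+\.tar)\s+"
    r"rouge_1:(?P<r1>[\d.]+)\s+"
    r"rouge_2:(?P<r2>[\d.]+)\s+"
    r"rouge_l:(?P<rl>[\d.]+)"
)


def parse_eval_log(lines):
    """Map each checkpoint name to its ROUGE scores; other lines are skipped."""
    scores = {}
    for line in lines:
        match = EVAL_RE.search(line)
        if match:
            scores[match.group("ckpt")] = {
                "rouge_1": float(match.group("r1")),
                "rouge_2": float(match.group("r2")),
                "rouge_l": float(match.group("rl")),
            }
    return scores


def best_checkpoint(scores, metric="rouge_l"):
    """Return the checkpoint name with the highest value for the chosen metric."""
    return max(scores, key=lambda name: scores[name][metric])
```

Passing the open log file to `parse_eval_log` and the result to `best_checkpoint` gives the checkpoint name to put into the shell script for the next step.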
## Intermediate training output
#### Loss decreasing during training
```
2020-07-04 12:58:38,434 - data_util.log - INFO - Bucket queue size: 0, Input queue size: 0
2020-07-04 12:59:38,499 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 65600
2020-07-04 13:00:30,526 - data_util.log - INFO - iter:50 mle_loss:6.481 reward:0.0000
2020-07-04 13:00:38,552 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-04 13:01:07,877 - data_util.log - INFO - iter:100 mle_loss:6.031 reward:0.0000
2020-07-04 13:01:38,612 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-04 13:01:46,061 - data_util.log - INFO - iter:150 mle_loss:5.883 reward:0.0000
2020-07-04 13:02:24,137 - data_util.log - INFO - iter:200 mle_loss:5.790 reward:0.0000
2020-07-04 13:02:38,620 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-04 13:03:03,381 - data_util.log - INFO - iter:250 mle_loss:5.740 reward:0.0000
2020-07-04 13:03:38,633 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-04 13:03:42,141 - data_util.log - INFO - iter:300 mle_loss:5.690 reward:0.0000
2020-07-04 13:04:20,430 - data_util.log - INFO - iter:350 mle_loss:5.623 reward:0.0000
2020-07-04 13:04:38,692 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-04 13:04:59,155 - data_util.log - INFO - iter:400 mle_loss:5.592 reward:0.0000
2020-07-04 13:05:37,330 - data_util.log - INFO - iter:450 mle_loss:5.531 reward:0.0000
2020-07-04 13:05:38,752 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-04 13:06:16,069 - data_util.log - INFO - iter:500 mle_loss:5.473 reward:0.0000
2020-07-04 13:06:38,812 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-04 13:06:55,706 - data_util.log - INFO - iter:550 mle_loss:5.459 reward:0.0000
2020-07-04 13:07:33,658 - data_util.log - INFO - iter:600 mle_loss:5.366 reward:0.0000
2020-07-04 13:07:38,873 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
...
2020-07-06 09:24:01,484 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 299732
2020-07-06 09:24:04,571 - data_util.log - INFO - iter:206800 mle_loss:2.631 reward:0.0000
2020-07-06 09:24:42,961 - data_util.log - INFO - iter:206850 mle_loss:2.639 reward:0.0000
2020-07-06 09:25:01,511 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-06 09:25:20,772 - data_util.log - INFO - iter:206900 mle_loss:2.653 reward:0.0000
2020-07-06 09:26:00,946 - data_util.log - INFO - iter:206950 mle_loss:2.657 reward:0.0000
2020-07-06 09:26:01,571 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-06 09:26:38,844 - data_util.log - INFO - iter:207000 mle_loss:2.666 reward:0.0000
2020-07-06 09:27:01,600 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-06 09:27:16,921 - data_util.log - INFO - iter:207050 mle_loss:2.634 reward:0.0000
2020-07-06 09:27:54,971 - data_util.log - INFO - iter:207100 mle_loss:2.658 reward:0.0000
2020-07-06 09:28:01,661 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-06 09:28:32,825 - data_util.log - INFO - iter:207150 mle_loss:2.620 reward:0.0000
2020-07-06 09:29:01,721 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-06 09:29:10,510 - data_util.log - INFO - iter:207200 mle_loss:2.665 reward:0.0000
2020-07-06 09:29:50,878 - data_util.log - INFO - iter:207250 mle_loss:2.656 reward:0.0000
2020-07-06 09:30:01,782 - data_util.log - INFO - Bucket queue size: 1000, Input queue size: 300000
2020-07-06 09:30:29,142 - data_util.log - INFO - iter:207300 mle_loss:2.662 reward:0.0000
```
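To follow the trend without reading the raw log, the `iter` and `mle_loss` fields can be pulled out with a regular expression. This is a minimal sketch that assumes the line format shown above; `parse_losses` is a hypothetical helper, not part of the project.

```python
import re

# Matches the "iter:<n> mle_loss:<x>" part of a training log line.
LOSS_RE = re.compile(r"iter:(\d+)\s+mle_loss:([\d.]+)")


def parse_losses(lines):
    """Return (iteration, mle_loss) pairs; queue-size lines are ignored."""
    points = []
    for line in lines:
        match = LOSS_RE.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points
```

Fed the log lines above, it yields pairs such as `(50, 6.481)` and `(207300, 2.662)`, which can then be plotted or averaged.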
#### Step 1: validation-set evaluation (0200000.tar is the best model)
```
2020-07-13 14:12:15,025 - data_util.log - INFO - 0005000.tar rouge_1:0.2338 rouge_2:0.0837 rouge_l:0.2338
2020-07-13 14:12:15,060 - data_util.log - INFO -
2020-07-13 14:12:21,874 - data_util.log - INFO - 0010000.tar rouge_1:0.2782 rouge_2:0.1240 rouge_l:0.2699
2020-07-13 14:12:21,908 - data_util.log - INFO -
2020-07-13 14:12:28,677 - data_util.log - INFO - 0015000.tar rouge_1:0.2833 rouge_2:0.1211 rouge_l:0.2751
2020-07-13 14:12:28,712 - data_util.log - INFO -
2020-07-13 14:12:35,460 - data_util.log - INFO - 0020000.tar rouge_1:0.3009 rouge_2:0.1351 rouge_l:0.2898
2020-07-13 14:12:35,496 - data_util.log - INFO -
2020-07-13 14:12:42,312 - data_util.log - INFO - 0025000.tar rouge_1:0.3121 rouge_2:0.1397 rouge_l:0.3074
2020-07-13 14:12:42,348 - data_util.log - INFO -
2020-07-13 14:12:49,107 - data_util.log - INFO - 0030000.tar rouge_1:0.2947 rouge_2:0.1264 rouge_l:0.2898
2020-07-13 14:12:49,142 - data_util.log - INFO -
2020-07-13 14:12:56,030 - data_util.log - INFO - 0035000.tar rouge_1:0.2959 rouge_2:0.1317 rouge_l:0.2870
2020-07-13 14:12:56,067 - data_util.log - INFO -
2020-07-13 14:13:02,853 - data_util.log - INFO - 0040000.tar rouge_1:0.3200 rouge_2:0.1388 rouge_l:0.3082
2020-07-13 14:13:02,887 - data_util.log - INFO -
2020-07-13 14:13:09,660 - data_util.log - INFO - 0045000.tar rouge_1:0.2928 rouge_2:0.1212 rouge_l:0.2851
2020-07-13 14:13:09,695 - data_util.log - INFO -
2020-07-13 14:13:16,485 - data_util.log - INFO - 0050000.tar rouge_1:0.2910 rouge_2:0.1332 rouge_l:0.2883
2020-07-13 14:13:16,519 - data_util.log - INFO -
2020-07-13 14:13:23,372 - data_util.log - INFO - 0055000.tar rouge_1:0.3003 rou