A Chinese Scientific Literature Annotation Method Based on Large Language Models


Uploaded 2023-08-31 22:41:17 · 14.58 MB ZIP


ACSL-main.zip (74 files)

ACSL-main

statistic

subject_area_count.csv 76B

length.csv 157B

calculate_length.py 3KB

statistic.py 3KB

count.csv 206B

baseline

__init__.py 2B

tools

__init__.py 2B

finetuning_argparse.py 10KB

plot.py 3KB

common.py 12KB

download_clue_data.py 3KB

convert_albert_tf_checkpoint_to_pytorch.py 3KB

metrics

__init__.py 2B

ner_metrics.py 4KB

run_ner_softmax.py 25KB

run_ner_crf.py 28KB

LICENSE 1KB

run_ner_span.py 27KB

callback

__init__.py 0B

modelcheckpoint.py 4KB

adversarial.py 4KB

optimizater

__init__.py 0B

nadam.py 4KB

lars.py 4KB

planradam.py 3KB

adafactor.py 8KB

ralamb.py 4KB

novograd.py 3KB

adamw.py 4KB

ralars.py 5KB

lamb.py 4KB

adabound.py 6KB

lookahead.py 4KB

radam.py 4KB

sgdw.py 3KB

progressbar.py 3KB

lr_scheduler.py 21KB

trainingmonitor.py 2KB

datasets

acsl

test.json 1.21MB

train.json 9.52MB

dev.json 1.2MB

losses

__init__.py 2B

label_smoothing.py 861B

focal_loss.py 696B

models

__init__.py 2B

layers

__init__.py 0B

linears.py 1KB

crf.py 20KB

bert_for_ner.py 6KB

processors

__init__.py 2B

ner_span.py 11KB

utils_ner.py 6KB

ner_seq.py 10KB

README.md 3KB

scripts

run_ner_softmax.sh 791B

run_ner_crf.sh 824B

run_ner_span.sh 800B

.gitattributes 59B

dataset

ner

test.json 1.21MB

train.json 9.52MB

csl_camera_annotated_1_10000_result_fileter_format_cleaned.json 11.92MB

dev.json 1.2MB

csl_camera_annotated_1_10000.json 17.16MB

process

ner

process_json.py 8KB

process.py 2KB

generate_qa_dataset.py 1KB

text

process_text.py 3KB

chat_scie.py 10KB

handle_error.py 2KB

requirements.txt 406B

.gitignore 1KB

README.md 5KB

sample_prompt.txt 2KB

config.py 176B

# A Chinese Scientific Literature Annotation Method Based on Large Language Models

This repository contains the code for our paper: 基于大语言模型的中文科技文献标注方法 - *A Chinese scientific literature annotation method based on large language model*.

## ACSL: an entity annotation dataset for Chinese scientific literature

Because of GitHub's file size limit, the dataset is hosted at:

- [Google Drive](https://drive.google.com/drive/folders/1e9aveXUsz6qe0C6MqdXR7-TyvxSzJvy_?usp=sharing)
- [Baidu Netdisk](https://pan.baidu.com/s/123rp_t6fCucMuz5iRXNMfQ?pwd=f0ie)
  - Extraction code: f0ie

Data files:

- `csl_camera_readly.tsv`: the [Chinese Scientific Literature dataset (CSL)](https://github.com/ydli-ai/CSL)
- `csl_camera_readly_filter.tsv`: the CSL dataset after filtering out some disciplines
- `csl_camera_readly_filter_cleaned.tsv`: the CSL dataset after cleaning
- `csl_camera_annotated_1_10000.json`: the raw dataset annotated by the large language model
- `csl_camera_annotated_1_10000_result_fileter_format_cleaned.json`: the ACSL dataset, i.e. the annotations after data cleaning

Usage:

- Place the dataset files in the `dataset/raw_data` folder.

## Project structure

```text
.
├── baseline        # BERT-based baseline models
├── README.txt
├── chat_scie.py    # LLM-based annotation of Chinese scientific literature
├── config.py       # Configuration
├── dataset         # CSL and ACSL datasets
├── handle_error.py # Handling of incorrectly annotated samples
├── output          # Output files
├── process         # Data processing
└── statistic       # Dataset statistics
```

## Benchmark

### Dataset

1. acsl: baseline/datasets/acsl

### Models

- `chinese-roberta-wwm-ext`: https://huggingface.co/hfl/chinese-roberta-wwm-ext

1. BERT+Softmax
2. BERT+CRF
3. BERT+Span

### Requirements

1. 1.1.0 <= PyTorch < 1.5.0
2. CUDA 9.0
3. Python 3.6+

### Input data format

Each record is a single JSON object. The `label` field maps an entity type to its mentions, and each mention to a list of `[start, end]` character offsets into `text`; in the sample below the end offset is inclusive.

```json
{"text_id":0,"text":"为了使联合收割机具有自动测产功能,提出了一种基于变权分层激活扩散的产量预测误差剔除模型,并使用单片机设计了联合收获机测产系统.测产系统的主要功能是:在田间进行作业时,收割机可以测出当前的运行速度,....","label":{"研究问题":{"联合收割机具有自动测产功能":[[3,15]]},"研究方法":{"基于变权分层激活扩散的产量预测误差剔除模型":[[22,42]]},"研究材料":{"单片机":[[47,49]],"霍尔传感器":[[118,122]],"电容压力传感器":[[124,130]],"ADC0804差分式A/D转换芯片":[[150,166]]},"研究成果":{"将系统应用在了收割机上,通过测试得到了谷物产量的测量值,并与真实值进行比较,验证了系统的可靠性":[[250,296]]}},"title":"谷物联合收获机自动测产系统设计-基于变权分层激活扩散模型","keywords":"联合收割机_测产系统_变权分层_激活扩散","subject":"农业工程","subject_area":"工学"}
```

### Running the code

1. Enter the `baseline` directory.
2. Edit the configuration in `run_ner_xxx.py` or `scripts/run_ner_xxx.sh`.
3. Run `sh scripts/run_ner_xxx.sh`.

**NOTE**: directory structure of the pretrained model

```text
├── model
|  └── bert_base
|  |  └── pytorch_model.bin
|  |  └── config.json
|  |  └── vocab.txt
|  |  └── ......
```

### Baseline results on ACSL

Performance of the baseline models on the **dev** set:

**Precision (P):**

| Baseline     | Problem | Method | Metric | Material | Result | Overall |
|--------------|---------|--------|--------|----------|--------|---------|
| BERT+Softmax | 0.397   | 0.394  | 0.198  | 0.332    | 0.244  | 0.327   |
| BERT+CRF     | 0.421   | 0.415  | 0.207  | 0.355    | 0.253  | 0.344   |
| BERT+Span    | 0.492   | 0.471  | 0.305  | 0.415    | 0.309  | 0.413   |

**Recall (R):**

| Baseline     | Problem | Method | Metric | Material | Result | Overall |
|--------------|---------|--------|--------|----------|--------|---------|
| BERT+Softmax | 0.327   | 0.380  | 0.195  | 0.282    | 0.210  | 0.289   |
| BERT+CRF     | 0.357   | 0.408  | 0.216  | 0.319    | 0.235  | 0.319   |
| BERT+Span    | 0.331   | 0.387  | 0.157  | 0.278    | 0.185  | 0.282   |

**F1:**

| Baseline     | Problem | Method | Metric | Material | Result | Overall |
|--------------|---------|--------|--------|----------|--------|---------|
| BERT+Softmax | 0.359   | 0.387  | 0.196  | 0.305    | 0.226  | 0.307   |
| BERT+CRF     | 0.387   | 0.412  | 0.212  | 0.336    | 0.244  | 0.331   |
| BERT+Span    | 0.396   | 0.425  | 0.207  | 0.333    | 0.231  | 0.335   |

## Related projects

- [Chinese Scientific Literature Dataset](https://github.com/ydli-ai/CSL)
- [Chinese NER using Bert](https://github.com/lonePatient/BERT-NER-Pytorch)
- [Chinese BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm)
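As a supplement to the input data format in the README above, the following is a minimal sketch of how one record's span annotations could be turned into per-character tags for a sequence-labeling baseline. It assumes a character-level BIO scheme and treats the end offset as inclusive, which matches the README sample (for example `[3,15]` covers the 13 characters of 联合收割机具有自动测产功能). The function name `spans_to_bio` and the shortened `record` are illustrative and are not taken from the repository; the processors under `baseline/processors` may use a different scheme.

```python
from typing import Dict, List

# Shape of the "label" field: entity type -> mention text -> list of [start, end] spans.
Label = Dict[str, Dict[str, List[List[int]]]]


def spans_to_bio(text: str, label: Label) -> List[str]:
    """Return one BIO tag per character of `text` (hypothetical helper)."""
    tags = ["O"] * len(text)
    for entity_type, mentions in label.items():
        for mention, spans in mentions.items():
            for start, end in spans:
                # Assumption: `end` is inclusive, as in the README sample.
                if text[start:end + 1] != mention:
                    raise ValueError(
                        f"span [{start}, {end}] does not match {mention!r}"
                    )
                tags[start] = f"B-{entity_type}"
                for i in range(start + 1, end + 1):
                    tags[i] = f"I-{entity_type}"
    # Note: if two spans overlap, the one processed later overwrites the earlier one.
    return tags


# Shortened version of the README sample: first sentence and first three entities only.
record = {
    "text": "为了使联合收割机具有自动测产功能,提出了一种基于变权分层激活扩散的产量预测误差剔除模型,并使用单片机设计了联合收获机测产系统.",
    "label": {
        "研究问题": {"联合收割机具有自动测产功能": [[3, 15]]},
        "研究方法": {"基于变权分层激活扩散的产量预测误差剔除模型": [[22, 42]]},
        "研究材料": {"单片机": [[47, 49]]},
    },
}

tags = spans_to_bio(record["text"], record["label"])

# Print only the characters that belong to an entity.
for char, tag in zip(record["text"], tags):
    if tag != "O":
        print(char, tag)
```

The check `text[start:end + 1] != mention` doubles as a sanity test on the data: a record whose offsets drifted during cleaning raises an error instead of silently producing misaligned tags.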
