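Text correction with Kenlm, T5, MacBERT, ChatGLM3, LLaMA and other models applied to correction scenarios: out-of-the-box source code plus detailed run instructions.zip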

184 files in total

py: 111

txt: 22

png: 15

Uploaded 2024-05-08 10:42:20, 10.77 MB, ZIP

Archive contents (184 files):

CITATION.cff 331B

Dockerfile 564B

.dockerignore 228B

en.json.gz 575KB

framework_context.jpeg 513KB

we_image.jpeg 216KB

wechat.jpeg 40KB

macbert_result.jpg 770KB

macbert_network.jpg 114KB

macbert_mask_strategies.jpg 94KB

train.json 790KB

test.json 351KB

train_sharegpt.jsonl 331KB

test_sharegpt.jsonl 214KB

README.md 32KB

README.md 9KB

README.md 8KB

README_EN.md 7KB

README.md 3KB

README.md 2KB

correction_solution.md 2KB

README.md 2KB

CONTRIBUTING.md 529B

eng_correction.md 368B

基于深度学习的中文文本自动校对研究与实现.pdf 1.77MB

error_type.png 687KB

RTD.png 507KB

long_text.png 455KB

bert_result.png 425KB

wechat_zhifu.png 286KB

peoplecorpus.png 212KB

short_result.png 163KB

macbert_network_old.png 156KB

arch1.png 136KB

convseq2seq_ret.png 124KB

hf.png 109KB

ernie_result.png 105KB

erweima.png 93KB

docker.png 70KB

pycorrector.png 5KB

zh_wiki.py 148KB

gpt_model.py 31KB

conv_seq2seq_model.py 24KB

gpt_utils.py 20KB

detector.py 19KB

deepcontext_utils.py 15KB

get_file.py 13KB

corrector.py 12KB

train.py 11KB

deepcontext_model.py 10KB

proper_corrector.py 9KB

evaluate_util.py 9KB

langconv.py 8KB

base_model.py 7KB

en_spell_corrector.py 7KB

lr_scheduler.py 7KB

predict.py 6KB

softmaskedbert4csc.py 6KB

ngram_util.py 6KB

model.py 6KB

tokenizer.py 6KB

ner_error_test.py 6KB

predict_sighan.py 6KB

error_correct_test.py 5KB

macbert_corrector_test.py 5KB

error_utils.py 5KB

kenlm_test.py 5KB

utils.py 5KB

detector_test.py 5KB

conv_seq2seq_utils.py 5KB

evaluate_utils.py 5KB

text_utils.py 5KB

macbert_corrector.py 5KB

train.py 5KB

sighan_evaluate.py 5KB

t5_corrector.py 4KB

deepcontext_corrector.py 4KB

en_spell_correct_test.py 4KB

conv_seq2seq_corrector.py 4KB

test_confusion.py 4KB

training_llama_demo.py 4KB

training_chatglm_demo.py 4KB

defaults.py 4KB

gpt_corrector.py 4KB

evaluate_models.py 3KB

char_error_test.py 3KB

math_utils.py 3KB

merge_peft_adapter.py 3KB

predict_ckpt.py 3KB

tokenizer_test.py 3KB

en_spell_dict_test.py 3KB

corrector_test.py 3KB

confusion_corrector.py 3KB

reader.py 3KB

train.py 3KB


[**🇨🇳中文**](https://github.com/shibing624/pycorrector/blob/master/README.md) | [**🌐English**](https://github.com/shibing624/pycorrector/blob/master/README_EN.md) | [**📖文档/Docs**](https://github.com/shibing624/pycorrector/wiki) | [**🤖模型/Models**](https://huggingface.co/shibing624)

<div align="center">
  <a href="https://github.com/shibing624/pycorrector">
    <img src="https://github.com/shibing624/pycorrector/blob/master/docs/pycorrector.png" alt="Logo" height="156">
  </a>
</div>

-----------------

# pycorrector: useful python text correction toolkit

[![PyPI version](https://badge.fury.io/py/pycorrector.svg)](https://badge.fury.io/py/pycorrector)
[![Downloads](https://static.pepy.tech/badge/pycorrector)](https://pepy.tech/project/pycorrector)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_vesion](https://img.shields.io/badge/Python-3.8%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/issues)
[![Wechat Group](https://img.shields.io/badge/wechat-group-green.svg?logo=wechat)](#Contact)

**pycorrector**: a Chinese text correction toolkit. It corrects phonetically similar, visually similar, and grammatical errors in Chinese text, and is developed with Python 3.8.

**pycorrector** implements text correction with several models, including Kenlm, ConvSeq2Seq, BERT, MacBERT, ELECTRA, ERNIE, and Transformer, and evaluates each of them on the SIGHAN dataset.

**Guide**

- [Features](#Features)
- [Evaluation](#Evaluation)
- [Usage](#usage)
- [Dataset](#Dataset)
- [Contact](#Contact)
- [References](#references)

## Introduction

Common error types in the Chinese text correction task:

<img src="https://github.com/shibing624/pycorrector/blob/master/docs/git_image/error_type.png" width="600" />

Not every error type appears in every business scenario. Pinyin input methods and speech-recognition proofreading mainly deal with phonetically similar errors; Wubi input methods and OCR proofreading mainly deal with visually similar errors; search-engine query correction has to handle all error types. This project focuses on phonetically similar, visually similar, grammatical, and proper-noun errors.

## News

[2023/11/07] v1.0.0: added GPT models such as ChatGLM3/LLaMA2 for Chinese text correction; released the spelling and grammar correction model [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora), based on ChatGLM3-6B; rewrote the implementations of the DeepContext, ConvSeq2Seq, and T5 models. See [Release-v1.0.0](https://github.com/shibing624/pycorrector/releases/tag/1.0.0).

## Features

* [Kenlm model](https://github.com/shibing624/pycorrector/tree/master/examples/kenlm): a Chinese n-gram language model trained with the Kenlm statistical language modeling toolkit. Combined with rule-based methods and a confusion set, it corrects Chinese spelling errors. It is fast and easy to extend, with moderate accuracy.
* [DeepContext model](https://github.com/shibing624/pycorrector/tree/master/examples/deepcontext): a PyTorch implementation of the DeepContext model for text correction. The architecture follows Stanford University's NLC model, which took first place in a 2014 English correction competition. Moderate accuracy.
* [Seq2Seq model](https://github.com/shibing624/pycorrector/tree/master/examples/seq2seq): a PyTorch implementation of the ConvSeq2Seq model for Chinese text correction. As a single model it placed third in the NLPCC-2018 Chinese grammatical error correction competition. It trains in parallel and converges quickly, with moderate accuracy.
* [T5 model](https://github.com/shibing624/pycorrector/tree/master/examples/t5): a PyTorch implementation of the T5 model for Chinese text correction, fine-tuned from the pretrained Langboat/mengzi-t5-base model on Chinese correction datasets. It leaves considerable room for adaptation and has good accuracy.
* [ERNIE_CSC model](https://github.com/shibing624/pycorrector/tree/master/examples/ernie_csc): a PaddlePaddle implementation of the ERNIE_CSC model for Chinese text correction, fine-tuned from ERNIE-1.0 with an architecture adapted to Chinese spelling correction. Good accuracy.
* [MacBERT model](https://github.com/shibing624/pycorrector/tree/master/examples/macbert) (recommended): a PyTorch implementation of the MacBERT4CSC model for Chinese text correction. The model adds error detection and correction networks and is adapted to Chinese spelling correction. Good accuracy.
* [GPT model](https://github.com/shibing624/pycorrector/tree/master/examples/gpt): a PyTorch implementation of ChatGLM/LLaMA models for Chinese text correction, fine-tuned on Chinese CSC and grammatical error correction datasets. Good accuracy.

- Further reading: [Chinese text correction in practice and in principle](https://github.com/shibing624/pycorrector/blob/master/docs/correction_solution.md)

## Demo

- Official demo: https://www.mulanai.com/product/corrector/
- Colab online demo: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zvSyCdiLK_rglfXcIgc539K_Z7bIMpu0?usp=sharing)
- HuggingFace demo: https://huggingface.co/spaces/shibing624/pycorrector

![](docs/hf.png)

Run the example [examples/macbert/gradio_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/gradio_demo.py) to see the demo:

```shell
python examples/macbert/gradio_demo.py
```

## Evaluation

An evaluation script is provided: [examples/evaluate_models/evaluate_models.py](https://github.com/shibing624/pycorrector/blob/master/examples/evaluate_models/evaluate_models.py).

- Evaluation set: the SIGHAN2015 test set, [pycorrector/data/sighan2015_test.tsv](https://github.com/shibing624/pycorrector/blob/master/pycorrector/data/sighan2015_test.tsv), already converted to Simplified Chinese.
- Metric: correction precision and recall, computed at strict sentence level. A model output counts as correct only if the corrected sentence is exactly identical to the reference sentence; otherwise it counts as wrong.

### Evaluation results

Evaluation dataset: SIGHAN2015 test set

GPU: Tesla V100, 32 GB memory

| Model Name | Model Link | Base Model | GPU | Precision | Recall | F1 | QPS |
|:----------------|:--------------------------------------------------------------------------------------------------------------------|:--------------------------|:----|:-----------|:-----------|:-----------|:--------|
| Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | CPU | 0.6860 | 0.1529 | 0.2500 | 9 |
| BART-CSC | [shibing624/bart4csc-base-chinese](https://huggingface.co/shibing624/bart4csc-base-chinese) | fnlp/bart-base-chinese | GPU | 0.6984 | 0.6354 | 0.6654 | 58 |
| Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | GPU | **0.8321** | 0.6390 | 0.7229 | 214 |
| **MacBERT-CSC** | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | GPU | 0.8254 | **0.7311** | **0.7754** | **224** |
| ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | GPU | 0.5574 | 0.4917 | 0.5225 | 4 |

### Conclusion

- The best-performing Chinese spelling correction model is **MacBERT-CSC**; the model name is *shibing624/macbert4csc-base-chinese*, huggingface model: https://huggingface.co/shibing624/macbert4csc-base-chinese
- The best-performing Chinese grammatical error correction model is **Mengzi-T5-CSC**; the model name is *shibing624/mengzi-t5-base-chinese-correction*
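
The strict sentence-level criterion described under Evaluation can be illustrated with a small sketch. This is not the project's evaluation script. The function name `sentence_level_prf`, the `(source, target, prediction)` triple format, and the sample sentences are assumptions made for illustration. It also assumes one common convention for counting: a sentence that contains an error is a true positive only when the prediction exactly equals the reference, and an already-correct sentence that the model alters is a false positive.

```python
from typing import Iterable, Tuple


def sentence_level_prf(
    samples: Iterable[Tuple[str, str, str]]
) -> Tuple[float, float, float]:
    """Strict sentence-level precision, recall and F1 (illustrative sketch).

    Each sample is (source, target, prediction):
      source     - the input sentence, possibly containing errors
      target     - the reference (gold) sentence
      prediction - the model's corrected sentence
    """
    tp = fp = fn = 0
    for source, target, prediction in samples:
        if source == target:
            # The input had no error: any change by the model is a false positive.
            if prediction != target:
                fp += 1
        else:
            # The input had an error: only an exact match with the reference counts.
            if prediction == target:
                tp += 1
            else:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical samples, one per outcome.
samples = [
    ("少先队员因该为老人让坐", "少先队员应该为老人让座", "少先队员应该为老人让座"),  # fixed exactly: TP
    ("今天新情很好", "今天心情很好", "今天新情很好"),  # error left in place: FN
    ("你好", "你好", "你好"),  # correct input left unchanged: TN
    ("天气不错", "天气不错", "天汽不错"),  # correct input altered: FP
]
print(sentence_level_prf(samples))  # (0.5, 0.5, 0.5)
```

Because a sentence must match the reference in full, a prediction that fixes one of two errors in a sentence still counts as a miss, which is why sentence-level scores are usually lower than character-level ones.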
