Text correction with KenLM, T5, MacBERT, ChatGLM3, LLaMA and other models: ready-to-use source code + detailed run instructions (.zip)

183 files in total, including 111 py, 22 txt and 15 png files.

Tags: source code, graduation project

Uploaded 2024-05-09 20:02:50 · 10.77 MB ZIP

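Text correction is a key task in natural language processing (NLP): detecting and fixing spelling, grammar, and word-choice errors in text. This project is a ready-to-use solution that covers several models, including KenLM, T5, MacBERT, ChatGLM3, and LLaMA, applied to text correction. Each model in turn:

1. **KenLM**: KenLM is an efficient statistical language model toolkit written by Kenneth Heafield. It builds and queries n-gram language models, which give text correction its probabilistic basis. During correction, KenLM scores each candidate word sequence so that the most probable replacement can be chosen.
2. **T5**: T5 (Text-to-Text Transfer Transformer) is a pretrained language model from Google that casts every NLP task into a single text-to-text format. Because it conditions on the full context and generates a complete output sentence, it suits correcting both misspelled characters and grammatical errors.
3. **MacBERT**: MacBERT (MLM as correction BERT) is a BERT variant pretrained for Chinese. BERT is a Transformer-based pretrained model that captures context well; MacBERT changes the masking strategy by replacing masked words with similar words rather than a `[MASK]` token, which brings pretraining closer to the correction task.
4. **ChatGLM3**: ChatGLM3 is a dialogue language model from Zhipu AI and Tsinghua University (released under THUDM), built on the GLM (General Language Model) architecture. For text correction it is fine-tuned to rewrite an input sentence into its corrected form.
5. **LLaMA**: LLaMA (Large Language Model Meta AI) is a family of large language models released by Meta. Like ChatGLM3, it can be fine-tuned on correction data to find and fix errors in text.

The source code in this archive implements text correction with these models so that developers can deploy and test them quickly. Running the models requires a Python environment; most of them depend on PyTorch (the ERNIE_CSC example uses PaddlePaddle). The code covers model loading, input processing, prediction, and post-processing of the results. The detailed run instructions explain how to set up the environment, install dependencies, run the scripts, and interpret the output.

For a graduation project or any other project, this toolkit is a solid starting point for understanding and practising text correction. Comparing the models shows their strengths and weaknesses and suggests how to improve correction quality. The models can also be combined with other NLP techniques, such as syntactic parsing or semantic understanding, to improve overall accuracy and efficiency.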
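The KenLM item above describes choosing a replacement by scoring candidate word sequences with an n-gram language model. The following is a minimal sketch of that idea using a hand-rolled, add-one smoothed bigram model in pure Python. It does not use the KenLM library itself; the corpus, the candidate sentences, and the helper names `train_bigram`, `log_prob`, and `pick_best` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus (hypothetical); a real model is trained on far more text.
CORPUS = [
    "the weather is nice today",
    "the weather is cold today",
    "whether he comes is unknown",
    "the weather was nice",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, adding <s> and </s> boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log10 probability of a sentence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Counter returns 0 for unseen keys, so smoothing handles unseen bigrams.
        total += math.log10((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_best(candidates, unigrams, bigrams):
    """Return the candidate sentence with the highest language-model score."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


UNIGRAMS, BIGRAMS = train_bigram(CORPUS)
CANDIDATES = ["the whether is nice today", "the weather is nice today"]
BEST = pick_best(CANDIDATES, UNIGRAMS, BIGRAMS)
print(BEST)
```

A production system would replace the toy model with a trained n-gram model and generate the candidates from a confusion set of similar-sounding or similar-looking characters; the selection step stays the same.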

Archive contents (183 files):

CITATION.cff 331B

Dockerfile 564B

.dockerignore 228B

en.json.gz 575KB

framework_context.jpeg 513KB

we_image.jpeg 216KB

wechat.jpeg 40KB

macbert_result.jpg 770KB

macbert_network.jpg 114KB

macbert_mask_strategies.jpg 94KB

train.json 790KB

test.json 351KB

train_sharegpt.jsonl 331KB

test_sharegpt.jsonl 214KB

README.md 32KB

README.md 9KB

README.md 8KB

README_EN.md 7KB

README.md 3KB

README.md 2KB

correction_solution.md 2KB

README.md 2KB

CONTRIBUTING.md 529B

eng_correction.md 368B

error_type.png 687KB

RTD.png 507KB

long_text.png 455KB

bert_result.png 425KB

wechat_zhifu.png 286KB

peoplecorpus.png 212KB

short_result.png 163KB

macbert_network_old.png 156KB

arch1.png 136KB

convseq2seq_ret.png 124KB

hf.png 109KB

ernie_result.png 105KB

erweima.png 93KB

docker.png 70KB

pycorrector.png 5KB

zh_wiki.py 148KB

gpt_model.py 31KB

conv_seq2seq_model.py 24KB

gpt_utils.py 20KB

detector.py 19KB

deepcontext_utils.py 15KB

get_file.py 13KB

corrector.py 12KB

train.py 11KB

deepcontext_model.py 10KB

proper_corrector.py 9KB

evaluate_util.py 9KB

langconv.py 8KB

base_model.py 7KB

en_spell_corrector.py 7KB

lr_scheduler.py 7KB

predict.py 6KB

softmaskedbert4csc.py 6KB

ngram_util.py 6KB

model.py 6KB

tokenizer.py 6KB

ner_error_test.py 6KB

predict_sighan.py 6KB

error_correct_test.py 5KB

macbert_corrector_test.py 5KB

error_utils.py 5KB

kenlm_test.py 5KB

utils.py 5KB

detector_test.py 5KB

conv_seq2seq_utils.py 5KB

evaluate_utils.py 5KB

text_utils.py 5KB

macbert_corrector.py 5KB

train.py 5KB

sighan_evaluate.py 5KB

t5_corrector.py 4KB

deepcontext_corrector.py 4KB

en_spell_correct_test.py 4KB

conv_seq2seq_corrector.py 4KB

test_confusion.py 4KB

training_llama_demo.py 4KB

training_chatglm_demo.py 4KB

defaults.py 4KB

gpt_corrector.py 4KB

evaluate_models.py 3KB

char_error_test.py 3KB

math_utils.py 3KB

merge_peft_adapter.py 3KB

predict_ckpt.py 3KB

tokenizer_test.py 3KB

en_spell_dict_test.py 3KB

corrector_test.py 3KB

confusion_corrector.py 3KB

reader.py 3KB

train.py 3KB

train.py 2KB


[**🇨🇳中文**](https://github.com/shibing624/pycorrector/blob/master/README.md) | [**🌐English**](https://github.com/shibing624/pycorrector/blob/master/README_EN.md) | [**📖Docs**](https://github.com/shibing624/pycorrector/wiki) | [**🤖Models**](https://huggingface.co/shibing624)

<div align="center">
  <a href="https://github.com/shibing624/pycorrector">
    <img src="https://github.com/shibing624/pycorrector/blob/master/docs/pycorrector.png" alt="Logo" height="156">
  </a>
</div>

-----------------

# pycorrector: useful python text correction toolkit

[![PyPI version](https://badge.fury.io/py/pycorrector.svg)](https://badge.fury.io/py/pycorrector)
[![Downloads](https://static.pepy.tech/badge/pycorrector)](https://pepy.tech/project/pycorrector)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_vesion](https://img.shields.io/badge/Python-3.8%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/pycorrector.svg)](https://github.com/shibing624/pycorrector/issues)
[![Wechat Group](https://img.shields.io/badge/wechat-group-green.svg?logo=wechat)](#Contact)

**pycorrector**: a Chinese text correction toolkit. It corrects phonetically similar characters, visually similar characters, and grammatical errors in Chinese text, and is developed with Python 3.8.

**pycorrector** implements text correction with Kenlm, ConvSeq2Seq, BERT, MacBERT, ELECTRA, ERNIE, Transformer and other models, and evaluates each model on the SIGHAN dataset.

**Guide**

- [Features](#Features)
- [Evaluation](#Evaluation)
- [Usage](#usage)
- [Dataset](#Dataset)
- [Contact](#Contact)
- [References](#references)

## Introduction

Common error types in the Chinese text correction task:

<img src="https://github.com/shibing624/pycorrector/blob/master/docs/git_image/error_type.png" width="600" />

Not every error type occurs in every business scenario. Proofreading for pinyin input methods and speech recognition focuses on phonetically similar errors; proofreading for the Wubi input method and OCR focuses on visually similar errors; query correction in search engines has to handle all error types.

This project focuses on phonetically similar, visually similar, grammatical, and proper-noun errors.

## News

[2023/11/07] v1.0.0: added GPT-style models such as ChatGLM3 and LLaMA2 for Chinese text correction; released the spelling and grammar correction model [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora), based on ChatGLM3-6B; rewrote the implementations of DeepContext, ConvSeq2Seq, T5 and other models. See [Release-v1.0.0](https://github.com/shibing624/pycorrector/releases/tag/1.0.0).

## Features

* [Kenlm model](https://github.com/shibing624/pycorrector/tree/master/examples/kenlm): a Chinese n-gram language model trained with the Kenlm statistical language model toolkit. Combined with rules and a confusion set, it corrects Chinese spelling errors. Fast and easy to extend; moderate accuracy.
* [DeepContext model](https://github.com/shibing624/pycorrector/tree/master/examples/deepcontext): a PyTorch implementation of the DeepContext model for text correction. Its structure follows Stanford University's NLC model, which took first place in a 2014 English correction competition. Moderate accuracy.
* [Seq2Seq model](https://github.com/shibing624/pycorrector/tree/master/examples/seq2seq): a PyTorch implementation of the ConvSeq2Seq model for Chinese text correction. As a single model it took third place in the NLPCC-2018 Chinese grammatical error correction task. It trains in parallel and converges quickly; moderate accuracy.
* [T5 model](https://github.com/shibing624/pycorrector/tree/master/examples/t5): a PyTorch implementation of T5 for Chinese text correction, fine-tuned from the Langboat/mengzi-t5-base pretrained model on Chinese correction datasets. It leaves considerable room for further adaptation; good accuracy.
* [ERNIE_CSC model](https://github.com/shibing624/pycorrector/tree/master/examples/ernie_csc): a PaddlePaddle implementation of the ERNIE_CSC model for Chinese text correction, fine-tuned from ERNIE-1.0, with a model structure adapted to Chinese spelling correction. Good accuracy.
* [MacBERT model](https://github.com/shibing624/pycorrector/tree/master/examples/macbert) (recommended): a PyTorch implementation of the MacBERT4CSC model for Chinese text correction. It adds error detection and correction networks and is adapted to Chinese spelling correction. Good accuracy.
* [GPT model](https://github.com/shibing624/pycorrector/tree/master/examples/gpt): a PyTorch implementation of ChatGLM/LLaMA models for Chinese text correction, fine-tuned on Chinese CSC and grammatical error correction datasets. Good accuracy.

- Further reading: [Chinese text correction in practice and in principle](https://github.com/shibing624/pycorrector/blob/master/docs/correction_solution.md)

## Demo

- Official demo: https://www.mulanai.com/product/corrector/
- Colab online demo: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zvSyCdiLK_rglfXcIgc539K_Z7bIMpu0?usp=sharing)
- HuggingFace demo: https://huggingface.co/spaces/shibing624/pycorrector

![](docs/hf.png)

Run the example [examples/macbert/gradio_demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/macbert/gradio_demo.py) to see the demo:

```shell
python examples/macbert/gradio_demo.py
```

## Evaluation

An evaluation script is provided: [examples/evaluate_models/evaluate_models.py](https://github.com/shibing624/pycorrector/blob/master/examples/evaluate_models/evaluate_models.py).

- Evaluation set: the SIGHAN2015 test set, [pycorrector/data/sighan2015_test.tsv](https://github.com/shibing624/pycorrector/blob/master/pycorrector/data/sighan2015_test.tsv), already converted to Simplified Chinese.
- Metric: correction precision and recall, computed strictly at sentence level. A prediction counts as correct only if the corrected sentence is exactly identical to the reference sentence; otherwise it counts as wrong.

### Evaluation results

Evaluation dataset: SIGHAN2015 test set

GPU: Tesla V100, 32 GB memory

| Model Name | Model Link | Base Model | GPU | Precision | Recall | F1 | QPS |
|:----------------|:--------------------------------------------------------------------------------------------------------------------|:--------------------------|:----|:-----------|:-----------|:-----------|:--------|
| Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | CPU | 0.6860 | 0.1529 | 0.2500 | 9 |
| BART-CSC | [shibing624/bart4csc-base-chinese](https://huggingface.co/shibing624/bart4csc-base-chinese) | fnlp/bart-base-chinese | GPU | 0.6984 | 0.6354 | 0.6654 | 58 |
| Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | GPU | **0.8321** | 0.6390 | 0.7229 | 214 |
| **MacBERT-CSC** | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | GPU | 0.8254 | **0.7311** | **0.7754** | **224** |
| ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | GPU | 0.5574 | 0.4917 | 0.5225 | 4 |

### Conclusion

- The best Chinese spelling correction model is **MacBERT-CSC**, model name *shibing624/macbert4csc-base-chinese*, HuggingFace model: https://huggingface.co/shibing624/macbert4csc-base-chinese
- The best Chinese grammatical error correction model is **Mengzi-T5-CSC**, model name *shibing624/mengzi-t5-base-chinese-correction*
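
The Evaluation section describes a strict sentence-level metric: a prediction is correct only when the corrected sentence matches the reference exactly. A minimal sketch of one common way to count this follows. The function name `sentence_level_prf`, the way true and false positives are assigned, and the four toy sentences are assumptions for illustration; the project's own evaluation script may count differently.

```python
def sentence_level_prf(sources, targets, predictions):
    """Strict sentence-level precision, recall and F1 for text correction.

    A sentence is a positive sample when the source differs from the target
    (it contains at least one error). Any prediction that is not exactly
    equal to the target counts as wrong.
    """
    tp = fp = fn = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        if src != tgt:          # sentence has an error that should be fixed
            if pred == tgt:
                tp += 1         # fixed exactly
            else:
                fn += 1         # missed or fixed incorrectly
        elif pred != tgt:
            fp += 1             # correct sentence was changed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): two erroneous sentences and two correct ones.
SOURCES = ["少先队员因该为老人让坐", "今天天气很好", "他门去学校了", "我喜欢读书"]
TARGETS = ["少先队员应该为老人让座", "今天天气很好", "他们去学校了", "我喜欢读书"]
PREDICTIONS = ["少先队员应该为老人让座", "今天天汽很好", "他门去学校了", "我喜欢读书"]

P, R, F1 = sentence_level_prf(SOURCES, TARGETS, PREDICTIONS)
print(P, R, F1)

# F1 is the harmonic mean of precision and recall; the MacBERT-CSC row of
# the table (P = 0.8254, R = 0.7311) gives about 0.7754.
MACBERT_F1 = 2 * 0.8254 * 0.7311 / (0.8254 + 0.7311)
print(round(MACBERT_F1, 4))
```

Because the published precision and recall are themselves rounded, recomputing F1 from them can differ from the table in the last digit for some rows.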
