【免费】基于Python的中文内容纠错算法-课程设计

共11个文件

txt：3个

ipynb：2个

pdf：2个

python

中文纠错

机器学习

需积分: 0 102 浏览量更新于2023-06-20 3 收藏 3.81MB ZIP 举报

本项目是基于 Python 的中文文本内容纠错算法，基于jieba分词和中文词典技术实现。中文文本纠错是针对中文文本拼写错误进行检测与纠正的一项工作，中文的文本纠错，应用场景很多，诸如输入法纠错、输入预测、ASR 后纠错等等，例如：写作辅助：在内容写作平台上内嵌纠错模块，可在作者写作时自动检查并提示错别字情况。从而降低因疏忽导致的错误表述，有效提升作者的文章写作质量，同时给用户更好的阅读体验。公文纠错：针对公文写作场景，提供字词、标点、专名、数值内容纠错，包含领导人姓名、领导人职位、数值一致性等内容的检查与纠错，辅助进行公文审阅校对。搜索纠错：用户在搜索时经常输入错误，通过分析搜索query的形式和特征，可自动纠正搜索query并提示用户，进而给出更符合用户需求的搜索结果，有效屏蔽错别字对用户真实需求的影响。

收起资源包目录

基于Python的中文内容纠错算法-课程设计.zip （11个子文件）

Autochecker4Chinese

AutoChecker4Chinese.pdf 337KB

Autochecker4Chinese.py 4KB

readme.md 7KB

cn_dict.txt 14KB

token_pinyin%4040k_sogou.txt 1.61MB

AutoChecker4Chinese.ipynb 11KB

result.png 357KB

readme.pdf 337KB

.ipynb_checkpoints

AutoChecker4Chinese-checkpoint.ipynb 11KB

words.dic 1.69MB

token_freq_pos%40350k_jieba.txt 4.84MB

资源推荐

资源预览

资源评论

## Solutions of autochecker for chinese ### How to use : - run in the terminal : python Autochecker4Chinese.py - You will get the following result : ![](./result.png) ### 1. Make a detecter - Construct a dict to detect the misspelled chinese phrase，key is the chinese phrase, value is its corresponding frequency appeared in corpus. - You can finish this step by collecting corpus from the internet, or you can choose a more easy way, load some dicts already created by others. Here we choose the second way, construct the dict from file. - The detecter works in this way: for any phrase not appeared in this dict, the detecter will detect it as a mis-spelled phrase. ```python def construct_dict( file_path ): word_freq = {} with open(file_path, "r") as f: for line in f: info = line.split() word = info[0] frequency = info[1] word_freq[word] = frequency return word_freq ``` ```python FILE_PATH = "./token_freq_pos%40350k_jieba.txt" phrase_freq = construct_dict( FILE_PATH ) ``` ```python print( type(phrase_freq) ) print( len(phrase_freq) ) ``` <type 'dict'> 349045 ### 2. Make an autocorrecter - Make an autocorrecter for the misspelled phrase, we use the edit distance to make a correct-candidate list for the mis-spelled phrase - We sort the correct-candidate list according to the likelyhood of being the correct phrase, based on the following rules: - If the candidate's pinyin matches exactly with misspelled phrase's pinyin, we put the candidate in first order, which means they are the most likely phrase to be selected. - Else if candidate first word's pinyin matches with misspelled phrase's first word's pinyin, we put the candidate in second order. - Otherwise, we put the candidate in third order. ```python import pinyin ``` ```python # list for chinese words # read from the words.dic def load_cn_words_dict( file_path ): cn_words_dict = "" with open(file_path, "r") as f: for word in f: cn_words_dict += word.strip().decode("utf-8") return cn_words_dict ``` ```python # function calculate the edite distance from the chinese phrase def edits1(phrase, cn_words_dict): "All edits that are one edit away from `phrase`." phrase = phrase.decode("utf-8") splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)] deletes = [L + R[1:] for L, R in splits if R] transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1] replaces = [L + c + R[1:] for L, R in splits if R for c in cn_words_dict] inserts = [L + c + R for L, R in splits for c in cn_words_dict] return set(deletes + transposes + replaces + inserts) ``` ```python # return the phrease exist in phrase_freq def known(phrases): return set(phrase for phrase in phrases if phrase.encode("utf-8") in phrase_freq) ``` ```python # get the candidates phrase of the error phrase # we sort the candidates phrase's importance according to their pinyin # if the candidate phrase's pinyin exactly matches with the error phrase, we put them into first order # if the candidate phrase's first word pinyin matches with the error phrase first word, we put them into second order # else we put candidate phrase into the third order def get_candidates( error_phrase ): candidates_1st_order = [] candidates_2nd_order = [] candidates_3nd_order = [] error_pinyin = pinyin.get(error_phrase, format="strip", delimiter="/").encode("utf-8") cn_words_dict = load_cn_words_dict( "./cn_dict.txt" ) candidate_phrases = list( known(edits1(error_phrase, cn_words_dict)) ) for candidate_phrase in candidate_phrases: candidate_pinyin = pinyin.get(candidate_phrase, format="strip", delimiter="/").encode("utf-8") if candidate_pinyin == error_pinyin: candidates_1st_order.append(candidate_phrase) elif candidate_pinyin.split("/")[0] == error_pinyin.split("/")[0]: candidates_2nd_order.append(candidate_phrase) else: candidates_3nd_order.append(candidate_phrase) return candidates_1st_order, candidates_2nd_order, candidates_3nd_order ``` ```python def auto_correct( error_phrase ): c1_order, c2_order, c3_order = get_candidates(error_phrase) # print c1_order, c2_order, c3_order if c1_order: return max(c1_order, key=phrase_freq.get ) elif c2_order: return max(c2_order, key=phrase_freq.get ) else: return max(c3_order, key=phrase_freq.get ) ``` ```python # test for the auto_correct error_phrase_1 = "呕涂" # should be "呕吐" error_phrase_2 = "东方之朱" # should be "东方之珠" error_phrase_3 = "沙拢" # should be "沙龙" print error_phrase_1, auto_correct( error_phrase_1 ) print error_phrase_2, auto_correct( error_phrase_2 ) print error_phrase_3, auto_correct( error_phrase_3 ) ``` 呕涂呕吐东方之朱东方之珠沙拢沙龙 ### 3. Correct the misspelled phrase in a sentance - For any given sentence, use jieba do the segmentation, - Get segment list after segmentation is done, check if the remain phrase exists in word_freq dict, if not, then it is a misspelled phrase - Use auto_correct function to correct the misspelled phrase - Output the correct sentence ```python import jieba import string import re ``` ```python PUNCTUATION_LIST = string.punctuation PUNCTUATION_LIST += "。，？：；｛｝［］‘“”《》／！％……（）" ``` ```python def auto_correct_sentence( error_sentence, verbose=True): jieba_cut = jieba.cut(err_test.decode("utf-8"), cut_all=False) seg_list = "\t".join(jieba_cut).split("\t") correct_sentence = "" for phrase in seg_list: correct_phrase = phrase # check if item is a punctuation if phrase not in PUNCTUATION_LIST.decode("utf-8"): # check if the phrase in our dict, if not then it is a misspelled phrase if phrase.encode("utf-8") not in phrase_freq.keys(): correct_phrase = auto_correct(phrase.encode("utf-8")) if verbose : print phrase, correct_phrase correct_sentence += correct_phrase if verbose: print correct_sentence return correct_sentence ``` ```python err_sent = '机七学习是人工智能领遇最能体现智能的一个分知！' correct_sent = auto_correct_sentence( err_sent ) ``` 机七机器领遇领域分知分枝机器学习是人工智能领域最能体现智能的一个分枝！ ```python print correct_sent ``` 机器学习是人工智能领域最能体现智能的一个分枝！ ```python ```