## Solutions of autochecker for chinese
### How to use :
- run in the terminal : python Autochecker4Chinese.py
- You will get the following result : ![](./result.png)
### 1. Make a detecter
- Construct a dict to detect the misspelled chinese phrase,key is the chinese phrase, value is its corresponding frequency appeared in corpus.
- You can finish this step by collecting corpus from the internet, or you can choose a more easy way, load some dicts already created by others. Here we choose the second way, construct the dict from file.
- The detecter works in this way: for any phrase not appeared in this dict, the detecter will detect it as a mis-spelled phrase.
```python
def construct_dict( file_path ):
word_freq = {}
with open(file_path, "r") as f:
for line in f:
info = line.split()
word = info[0]
frequency = info[1]
word_freq[word] = frequency
return word_freq
```
```python
FILE_PATH = "./token_freq_pos%40350k_jieba.txt"
phrase_freq = construct_dict( FILE_PATH )
```
```python
print( type(phrase_freq) )
print( len(phrase_freq) )
```
<type 'dict'>
349045
### 2. Make an autocorrecter
- Make an autocorrecter for the misspelled phrase, we use the edit distance to make a correct-candidate list for the mis-spelled phrase
- We sort the correct-candidate list according to the likelyhood of being the correct phrase, based on the following rules:
- If the candidate's pinyin matches exactly with misspelled phrase's pinyin, we put the candidate in first order, which means they are the most likely phrase to be selected.
- Else if candidate first word's pinyin matches with misspelled phrase's first word's pinyin, we put the candidate in second order.
- Otherwise, we put the candidate in third order.
```python
import pinyin
```
```python
# list for chinese words
# read from the words.dic
def load_cn_words_dict( file_path ):
cn_words_dict = ""
with open(file_path, "r") as f:
for word in f:
cn_words_dict += word.strip().decode("utf-8")
return cn_words_dict
```
```python
# function calculate the edite distance from the chinese phrase
def edits1(phrase, cn_words_dict):
"All edits that are one edit away from `phrase`."
phrase = phrase.decode("utf-8")
splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
deletes = [L + R[1:] for L, R in splits if R]
transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
replaces = [L + c + R[1:] for L, R in splits if R for c in cn_words_dict]
inserts = [L + c + R for L, R in splits for c in cn_words_dict]
return set(deletes + transposes + replaces + inserts)
```
```python
# return the phrease exist in phrase_freq
def known(phrases): return set(phrase for phrase in phrases if phrase.encode("utf-8") in phrase_freq)
```
```python
# get the candidates phrase of the error phrase
# we sort the candidates phrase's importance according to their pinyin
# if the candidate phrase's pinyin exactly matches with the error phrase, we put them into first order
# if the candidate phrase's first word pinyin matches with the error phrase first word, we put them into second order
# else we put candidate phrase into the third order
def get_candidates( error_phrase ):
candidates_1st_order = []
candidates_2nd_order = []
candidates_3nd_order = []
error_pinyin = pinyin.get(error_phrase, format="strip", delimiter="/").encode("utf-8")
cn_words_dict = load_cn_words_dict( "./cn_dict.txt" )
candidate_phrases = list( known(edits1(error_phrase, cn_words_dict)) )
for candidate_phrase in candidate_phrases:
candidate_pinyin = pinyin.get(candidate_phrase, format="strip", delimiter="/").encode("utf-8")
if candidate_pinyin == error_pinyin:
candidates_1st_order.append(candidate_phrase)
elif candidate_pinyin.split("/")[0] == error_pinyin.split("/")[0]:
candidates_2nd_order.append(candidate_phrase)
else:
candidates_3nd_order.append(candidate_phrase)
return candidates_1st_order, candidates_2nd_order, candidates_3nd_order
```
```python
def auto_correct( error_phrase ):
c1_order, c2_order, c3_order = get_candidates(error_phrase)
# print c1_order, c2_order, c3_order
if c1_order:
return max(c1_order, key=phrase_freq.get )
elif c2_order:
return max(c2_order, key=phrase_freq.get )
else:
return max(c3_order, key=phrase_freq.get )
```
```python
# test for the auto_correct
error_phrase_1 = "呕涂" # should be "呕吐"
error_phrase_2 = "东方之朱" # should be "东方之珠"
error_phrase_3 = "沙拢" # should be "沙龙"
print error_phrase_1, auto_correct( error_phrase_1 )
print error_phrase_2, auto_correct( error_phrase_2 )
print error_phrase_3, auto_correct( error_phrase_3 )
```
呕涂 呕吐
东方之朱 东方之珠
沙拢 沙龙
### 3. Correct the misspelled phrase in a sentance
- For any given sentence, use jieba do the segmentation,
- Get segment list after segmentation is done, check if the remain phrase exists in word_freq dict, if not, then it is a misspelled phrase
- Use auto_correct function to correct the misspelled phrase
- Output the correct sentence
```python
import jieba
import string
import re
```
```python
PUNCTUATION_LIST = string.punctuation
PUNCTUATION_LIST += "。,?:;{}[]‘“”《》/!%……()"
```
```python
def auto_correct_sentence( error_sentence, verbose=True):
jieba_cut = jieba.cut(err_test.decode("utf-8"), cut_all=False)
seg_list = "\t".join(jieba_cut).split("\t")
correct_sentence = ""
for phrase in seg_list:
correct_phrase = phrase
# check if item is a punctuation
if phrase not in PUNCTUATION_LIST.decode("utf-8"):
# check if the phrase in our dict, if not then it is a misspelled phrase
if phrase.encode("utf-8") not in phrase_freq.keys():
correct_phrase = auto_correct(phrase.encode("utf-8"))
if verbose :
print phrase, correct_phrase
correct_sentence += correct_phrase
if verbose:
print correct_sentence
return correct_sentence
```
```python
err_sent = '机七学习是人工智能领遇最能体现智能的一个分知!'
correct_sent = auto_correct_sentence( err_sent )
```
机七 机器
领遇 领域
分知 分枝
机器学习是人工智能领域最能体现智能的一个分枝!
```python
print correct_sent
```
机器学习是人工智能领域最能体现智能的一个分枝!
```python
```
基于Python的中文内容纠错算法-课程设计
需积分: 0 102 浏览量
更新于2023-06-20
3
收藏 3.81MB ZIP 举报
本项目是基于 Python 的中文文本内容纠错算法,基于jieba分词和中文词典技术实现。
中文文本纠错是针对中文文本拼写错误进行检测与纠正的一项工作,中文的文本纠错,应用场景很多,诸如输入法纠错、输入预测、ASR 后纠错等等,例如:
写作辅助:在内容写作平台上内嵌纠错模块,可在作者写作时自动检查并提示错别字情况。从而降低因疏忽导致的错误表述,有效提升作者的文章写作质量,同时给用户更好的阅读体验。
公文纠错:针对公文写作场景,提供字词、标点、专名、数值内容纠错,包含领导人姓名、领导人职位、数值一致性等内容的检查与纠错,辅助进行公文审阅校对。
搜索纠错:用户在搜索时经常输入错误,通过分析搜索query的形式和特征,可自动纠正搜索query并提示用户,进而给出更符合用户需求的搜索结果,有效屏蔽错别字对用户真实需求的影响。
Python极客之家
- 粉丝: 1w+
- 资源: 79
最新资源
- 城镇老旧小区改造(加装电梯)考评内容和评价标准表.docx
- 城镇老旧小区改造及既有住宅加装电梯赋分权重.docx
- 底板隐蔽前监理检查记录.docx
- 出差审批单(表格模板).docx
- 第三方技术服务机构消防验收项目情况工作月汇报表.docx
- 电梯质量安全风险管控清单(安装(含修理).docx
- 飞机舱位代码表.docx
- 顶板隐蔽前监理检查记录表.docx
- 高危妊娠产前评分标准表.docx
- 高温中暑病例报告卡表格.docx
- 个体工商户营业执照颁发及归档记录表.doc
- 更换输液流程表.docx
- 公务接待审批单(表格模板).docx
- 古今地名对照表.docx
- 固定资产验收单、移交清单、处置清单.docx
- 骨关节损伤鉴定标准条款表.docx