Integrating Ngram Model and Case-based Learning for Chinese Word Segmentation

Chunyu Kit†  Zhiming Xu†‡  Jonathan J. Webster†

†Department of Chinese, Translation and Linguistics, City University of Hong Kong
Tat Chee Ave., Kowloon, Hong Kong
{ctckit, ctxuzm, ctjjw}@cityu.edu.hk

‡School of Computer Science and Technology, Harbin Institute of Technology
Heilongjiang Province, P. R. China
Abstract

This paper presents our recent work for participation in the First International Chinese Word Segmentation Bakeoff (ICWSB-1). It is based on a general-purpose ngram model for word segmentation and a case-based learning approach to disambiguation. The system excels in identifying in-vocabulary (IV) words, achieving a recall of around 96-98%. Here we present our strategies for language model training and disambiguation rule learning, analyze the system's performance, and discuss areas for further improvement, e.g., out-of-vocabulary (OOV) word discovery.
1 Introduction

After about two decades of research on Chinese word segmentation, ICWSB-1 (henceforth, the bakeoff) is the first effort to evaluate and compare different approaches and systems on common datasets. We participated in the bakeoff with a segmentation system designed to integrate a general-purpose ngram model for probabilistic segmentation with a case- or example-based learning approach (Kit et al., 2002) for disambiguation.
The ngram model, with words extracted from the training corpora, is trained with the EM algorithm (Dempster et al., 1977) on unsegmented training text. It was originally developed to enhance word segmentation accuracy so as to facilitate Chinese-English word alignment for our ongoing EBMT project, where only unsegmented texts are available for training. It is expected to be robust enough to handle novel texts, independent of any segmented texts for training. To simplify the EM training, we used a unigram model for the bakeoff and relied on the Viterbi algorithm (Viterbi, 1967) to find the most probable segmentation, instead of exhausting all possible segmentations of each sentence as a full version of EM training would require.
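To illustrate, here is a minimal sketch of this training scheme in Python. It is our illustration, not the bakeoff system's actual code: the names, the floor probability for unseen single characters, and the fixed iteration count are all assumptions on our part. viterbi_segment finds the single best segmentation under a unigram model by dynamic programming, and train_unigram_em implements the simplified, Viterbi-style (hard) EM loop described above, re-estimating word probabilities from the best segmentations rather than summing over all possible ones.

```python
import math
from collections import Counter

def viterbi_segment(sentence, word_prob, max_word_len=6, floor=1e-8):
    """Most probable segmentation of `sentence` under a unigram model.
    Costs are negative log-probabilities; `floor` (an assumption here)
    backs off unseen single characters so every sentence is segmentable."""
    n = len(sentence)
    best = [0.0] + [float("inf")] * n   # best[i]: min cost of sentence[:i]
    back = [0] * (n + 1)                # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            if word in word_prob:
                cost = best[j] - math.log(word_prob[word])
            elif i - j == 1:
                cost = best[j] - math.log(floor)
            else:
                continue
            if cost < best[i]:
                best[i], back[i] = cost, j
    words, i = [], n
    while i > 0:                        # recover words from back pointers
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]

def train_unigram_em(corpus, vocab, iterations=5):
    """Viterbi-style (hard) EM: re-estimate unigram probabilities from the
    single best segmentation of each unsegmented sentence."""
    word_prob = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iterations):
        counts = Counter()
        for sent in corpus:
            counts.update(viterbi_segment(sent, word_prob))
        total = sum(counts.values())
        word_prob = {w: c / total for w, c in counts.items()}
    return word_prob
```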
The case-based learning works in a straightforward way. It first extracts case-based knowledge, as a set of context-dependent transformation rules, from the segmented training corpus, and then applies them to ambiguous strings in the test corpus according to the similarity of their contexts. The similarity is computed empirically as the length of the relevant common affixes of the context strings.
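The exact scoring is not spelled out at this point in the paper; the sketch below shows one plausible reading, with all names and the additive score being our assumptions: a stored rule's left context is matched against the test context by their longest common suffix, its right context by their longest common prefix, and the rule with the highest total is applied.

```python
from collections import namedtuple

# A transformation rule recorded from the segmented training corpus:
# within contexts (left, right), segment `ambiguous` as `segmentation`.
Rule = namedtuple("Rule", "left ambiguous right segmentation")

def common_suffix_len(a, b):
    """Longest common suffix length (for matching left contexts)."""
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

def common_prefix_len(a, b):
    """Longest common prefix length (for matching right contexts)."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def context_similarity(rule, left, right):
    """Score a rule against the left/right context of an ambiguous
    string in the test corpus: longer shared affixes, higher score."""
    return (common_suffix_len(rule.left, left)
            + common_prefix_len(rule.right, right))
```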
The effectiveness of this integrated approach is verified by its outstanding performance on IV word identification. Its IV recall, ranging from 96% to 98%, ranks at or near the top in all the closed tests in which we participated. Unfortunately, its overall performance does not hold at the same level, owing to the lack of a module for OOV word detection.
This paper presents the implementation of the system and analyzes its performance and problems, with the aim of exploring directions for further improvement. The remaining sections are organized as follows. Section 2 presents the ngram model and its training with the EM algorithm, and Section 3 presents the case-based learning for disambiguation.