Integrating Ngram Model and Case-based Learning for Chinese Word Segmentation

Chunyu Kit†  Zhiming Xu†‡  Jonathan J. Webster†

†Department of Chinese, Translation and Linguistics, City University of Hong Kong
Tat Chee Ave., Kowloon, Hong Kong
{ctckit, ctxuzm, ctjjw}@cityu.edu.hk

‡School of Computer Science and Technology, Harbin Institute of Technology
Heilongjiang Province, P. R. China
Abstract

This paper presents our recent work for participation in the First International Chinese Word Segmentation Bakeoff (ICWSB-1). It is based on a general-purpose ngram model for word segmentation and a case-based learning approach to disambiguation. The system excels in identifying in-vocabulary (IV) words, achieving a recall of around 96-98%. Here we present our strategies for language model training and disambiguation rule learning, analyze the system's performance, and discuss areas for further improvement, e.g., out-of-vocabulary (OOV) word discovery.
1 Introduction

After about two decades of research on Chinese word segmentation, ICWSB-1 (henceforth, the bakeoff) is the first effort to evaluate and compare different approaches and systems on common datasets. We participated in the bakeoff with a segmentation system designed to integrate a general-purpose ngram model for probabilistic segmentation with a case- or example-based learning approach (Kit et al., 2002) for disambiguation.
The ngram model, with words extracted from the training corpora, is trained with the EM algorithm (Dempster et al., 1977) on unsegmented training text. It was originally developed to enhance word segmentation accuracy so as to facilitate Chinese-English word alignment for our ongoing EBMT project, where only unsegmented texts are available for training. It is expected to be robust enough to handle novel texts, independent of any segmented texts for training. To simplify the EM training, we used a unigram model for the bakeoff and relied on the Viterbi algorithm (Viterbi, 1967) to find the most probable segmentation, instead of exhausting all possible segmentations of each sentence as a full version of EM training would require.
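To illustrate, here is a minimal sketch of this training scheme in Python. It is our illustration, not the bakeoff system's actual code: the names, the floor probability for unseen single characters, and the fixed iteration count are all assumptions on our part. viterbi_segment finds the single best segmentation under a unigram model by dynamic programming, and train_unigram_em implements the simplified, Viterbi-style (hard) EM loop described above, re-estimating word probabilities from the best segmentations rather than summing over all possible ones.

```python
import math
from collections import Counter

def viterbi_segment(sentence, word_prob, max_word_len=6, floor=1e-8):
    """Most probable segmentation of `sentence` under a unigram model.
    Costs are negative log-probabilities; `floor` (an assumption here)
    backs off unseen single characters so every sentence is segmentable."""
    n = len(sentence)
    best = [0.0] + [float("inf")] * n   # best[i]: min cost of sentence[:i]
    back = [0] * (n + 1)                # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            if word in word_prob:
                cost = best[j] - math.log(word_prob[word])
            elif i - j == 1:
                cost = best[j] - math.log(floor)
            else:
                continue
            if cost < best[i]:
                best[i], back[i] = cost, j
    words, i = [], n
    while i > 0:                        # recover words from back pointers
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]

def train_unigram_em(corpus, vocab, iterations=5):
    """Viterbi-style (hard) EM: re-estimate unigram probabilities from the
    single best segmentation of each unsegmented sentence."""
    word_prob = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iterations):
        counts = Counter()
        for sent in corpus:
            counts.update(viterbi_segment(sent, word_prob))
        total = sum(counts.values())
        word_prob = {w: c / total for w, c in counts.items()}
    return word_prob
```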
The case-based learning works in a straightforward way. It first extracts case-based knowledge, as a set of context-dependent transformation rules, from the segmented training corpus, and then applies them to ambiguous strings in the test corpus according to the similarity of their contexts. The similarity is computed empirically as the length of the relevant common affixes of the context strings.
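The exact scoring is not spelled out at this point in the paper; the sketch below shows one plausible reading, with all names and the additive score being our assumptions: a stored rule's left context is matched against the test context by their longest common suffix, its right context by their longest common prefix, and the rule with the highest total is applied.

```python
from collections import namedtuple

# A transformation rule recorded from the segmented training corpus:
# within contexts (left, right), segment `ambiguous` as `segmentation`.
Rule = namedtuple("Rule", "left ambiguous right segmentation")

def common_suffix_len(a, b):
    """Longest common suffix length (for matching left contexts)."""
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

def common_prefix_len(a, b):
    """Longest common prefix length (for matching right contexts)."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def context_similarity(rule, left, right):
    """Score a rule against the left/right context of an ambiguous
    string in the test corpus: longer shared affixes, higher score."""
    return (common_suffix_len(rule.left, left)
            + common_prefix_len(rule.right, right))
```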
The effectiveness of this integrated approach is verified by its outstanding performance on IV word identification. Its IV recall, ranging from 96% to 98%, ranks at or near the top in all the closed tests in which we participated. Unfortunately, its overall performance does not hold at the same level, owing to the lack of a module for OOV word detection.
This paper presents the implementation of the system and analyzes its performance and problems, with the aim of exploring directions for further improvement. The remaining sections are organized as follows. Section 2 presents the ngram model and its training with the EM algorithm, and Section 3 presents the case-based learning for disambiguation.