Abstract
Word segmentation and part-of-speech (POS) tagging are fundamental problems
in natural language processing (NLP) and form the foundation of other NLP tasks.
Their quality can determine the final results of those tasks. An efficient word
segmentation and POS tagging system therefore has both academic and practical
value.
This work focuses on building an efficient, high-performance word segmentation
and POS tagging system. The main research topics are: a method for Chinese word
segmentation and POS tagging that combines a statistical model with a dictionary;
optimization of system efficiency; performance improvement; and incremental
training based on the perceptron algorithm.
By combining a statistical model with a dictionary, we achieve competitive
performance in word segmentation and POS tagging. Integrating dictionary
information into the statistical model also enables domain adaptation for Chinese
word segmentation and improves the efficiency of POS tagging. We then implement
a parallel training algorithm for the perceptron, which significantly improves
training efficiency without a large loss in performance. We also reduce memory
requirements and speed up testing by compressing the model file, and we improve
POS tagging performance by using large-scale unlabeled data. Building on the
advantages of online learning algorithms, we propose three incremental training
methods based on the perceptron algorithm and validate them experimentally.
Finally, we analyze in depth the causes of failure in cross-domain Chinese word
segmentation and apply a stacked learning framework to this task.
Experimental results show that our system achieves state-of-the-art performance
in Chinese word segmentation and POS tagging, and that parallel training greatly
improves training efficiency. The results also show that the proposed incremental
training methods are effective on same-domain data for Chinese word segmentation
and POS tagging, and that the stacked learning framework is effective for
cross-domain Chinese word segmentation.
Keywords: word segmentation, part-of-speech tagging, perceptron algorithm,
parallel training, incremental training