HMM.rar_hmm汉语 (HMM for Chinese) resource - CSDN Library

11 files in total

gif: 8

htm: 2

css: 1


34 views · uploaded 2022-09-23 08:44:56 · 8KB RAR

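**Applications of the Hidden Markov Model (HMM) in Chinese language processing**

The hidden Markov model is a basic and important tool in statistical natural language processing. In Chinese processing in particular, it is widely used for part-of-speech tagging, speech recognition, machine translation, and other tasks. The "HMM.rar_hmm汉语" archive presumably contains research material from the Center for Chinese Linguistics at Peking University (北京大学汉语语言学研究中心) on using HMMs for Chinese part-of-speech tagging.

**I. Basic principles of HMMs**

1. **States and observations**: An HMM is a directed graphical model in which each hidden node represents a "hidden state". The states cannot be observed directly, but each state generates an "observation". In Chinese part-of-speech tagging, the hidden states are typically the part-of-speech tags, and the observations are the words (or characters) that actually appear in the text.

2. **Transition probabilities**: The probability of moving from one state to another. In tagging, this is the probability that one part-of-speech tag follows another.

3. **Emission probabilities**: The probability of producing a particular observation in a given state, that is, the probability that a particular word or character occurs under a given part of speech.

**II. HMMs in Chinese part-of-speech tagging**

1. **Forward-backward algorithm**: Computes the probability of the observation sequence, summed over all possible state sequences, and the posterior probability of each state at each position. This indicates how likely each part-of-speech tag is at each word.

2. **Viterbi algorithm**: Finds the single most probable state sequence for an observation sequence, that is, the most probable part-of-speech tag sequence.

3. **Baum-Welch algorithm (an instance of EM)**: Used for parameter estimation. It iteratively re-estimates the HMM's transition and emission probabilities so that they fit the data better.

**III. Limitations of HMMs and improvements**

1. **Limitations**: A first-order HMM assumes that the current state depends only on the previous state, which may be insufficient for complex linguistic structures.

2. **Improvements**: More expressive models, such as conditional random fields (CRF) and bidirectional LSTMs, can overcome this limitation by taking more context into account.

**IV. HMMs and the characteristics of Chinese**

1. **Part-of-speech characteristics of Chinese**: Chinese has relatively little inflectional variation, which makes HMMs comparatively effective for part-of-speech tagging.

2. **Word combination**: Chinese words combine in many different ways, so an HMM needs to be paired with a lexical knowledge base to improve performance.

3. **Out-of-vocabulary words**: New words appear in Chinese all the time, so an HMM needs to be combined with a dynamic model or an unknown-word handling strategy.

**V. HMM_Tagger**

The "HMM_Tagger" folder in the archive is probably an HMM-based part-of-speech tagger. It may include model training code, preprocessing scripts, test data, and tagged output. With such a tool, researchers can tag new Chinese text, evaluate the model's performance, and adjust parameters to suit their needs.

HMMs play an important role in Chinese language processing, especially in part-of-speech tagging. Understanding and applying HMMs helps in analysing the structure and characteristics of Chinese and supports natural language processing more broadly. This material from the Center for Chinese Linguistics at Peking University is a valuable resource for understanding how HMMs are applied to Chinese processing.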
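To make the Viterbi step described above concrete, here is a minimal sketch in Python. It is not taken from the archive: the function name `viterbi`, the three tags (`r` pronoun, `v` verb, `n` noun), and all probabilities are illustrative assumptions, and the `floor` parameter is only a crude stand-in for proper smoothing of unseen events.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable hidden-state sequence for `observations`.

    Scores are kept in log space to avoid numeric underflow. Missing or
    zero probabilities are replaced by `floor` (an assumption of this
    sketch, not a real smoothing method).
    """
    if not observations:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + lp(trans_p[r].get(s, 0.0))) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lp(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model: every number below is made up for illustration.
states = ["r", "v", "n"]
start_p = {"r": 0.6, "v": 0.1, "n": 0.3}
trans_p = {
    "r": {"r": 0.1, "v": 0.7, "n": 0.2},
    "v": {"r": 0.2, "v": 0.1, "n": 0.7},
    "n": {"r": 0.1, "v": 0.5, "n": 0.4},
}
emit_p = {
    "r": {"我": 1.0},
    "v": {"爱": 0.8, "学习": 0.2},
    "n": {"北京": 0.5, "爱": 0.1, "学习": 0.4},
}

path = viterbi(["我", "爱", "北京"], states, start_p, trans_p, emit_p)
print(path)  # ['r', 'v', 'n']
```

The input here is already segmented into words. Real Chinese text would first need word segmentation, which this sketch does not attempt.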


HMM.rar (11 files)

HMM_Tagger/
    Part of Speech Taggers.files/
        previous_motif.gif 220B
        up_motif.gif 145B
        next_motif.gif 172B
        contents_motif.gif 225B
        foot_motif.gif 87B
    Part of Speech Taggers.htm 8KB
    Part-of-Speech Tagging.htm 7KB
    Part-of-Speech Tagging.files/
        previous_motif.gif 220B
        pos_phrase_nice.css 666B
        up_motif.gif 145B
        next_motif.gif 172B

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">  <!Converted with LaTeX2HTML 95.1 (Fri Jan 20 1995) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds ><HTML><HEAD><TITLE>Part of Speech Taggers</TITLE> <META content="text/html; charset=gb2312" http-equiv=Content-Type> <META content="MSHTML 5.00.2920.0" name=GENERATOR></HEAD> <BODY> <META name=description value=" Part of Speech Taggers"> <META name=keywords value="wg5rep1"> <META name=resource-type value="document"> <META name=distribution value="global"> <HR> <A href="http://nl.ijs.si/telri/wg5rep1/node7.html" name=tex2html127><IMG align=bottom alt=next src="Part of Speech Taggers.files/next_motif.gif"></A> <A href="http://nl.ijs.si/telri/wg5rep1/wg5rep1.html" name=tex2html125><IMG align=bottom alt=up src="Part of Speech Taggers.files/up_motif.gif"></A> <A href="http://nl.ijs.si/telri/wg5rep1/node5.html" name=tex2html119><IMG align=bottom alt=previous src="Part of Speech Taggers.files/previous_motif.gif"></A> <A href="http://nl.ijs.si/telri/wg5rep1/node1.html" name=tex2html129><IMG align=bottom alt=contents src="Part of Speech Taggers.files/contents_motif.gif"></A> Next: <A href="http://nl.ijs.si/telri/wg5rep1/node7.html" name=tex2html128>Aligners</A> Up: <A href="http://nl.ijs.si/telri/wg5rep1/wg5rep1.html" name=tex2html126>TELRI</A> Previous: <A href="http://nl.ijs.si/telri/wg5rep1/node5.html" name=tex2html120>SGML software</A> <HR> <H1><A name=SECTION00060000000000000000>Part of Speech Taggers</A></H1> Many corpora are, in addition to structural and bibliographic information, annotated with linguistic knowledge. The most basic and common form this annotation takes is marking up the running words in the corpus with their part of speech tags. This adds value to the corpus because, for example, searches can be performed not only on the word-forms as strings but also on whether they belong to a certain linguistic category. 
Such tags are typically taken to be atomic labels attached to words, denoting the part of speech of the word, together with shallow morphosyntactic information, e.g. they specify the word as a proper singular noun, or a plural comparative adjective. For English and other Western European languages, for which most such annotated corpora have been produced, the tagset size ranges from about forty to several hundred distinct categories. To label the words in the corpus with their PoS, we first need a lexicon or morphological analyser that gives all the possible tags of a given word-form. Part-of-speech taggers then take as their input all these possible morphosyntactic interpretations of the word-form and output the correct interpretation, given the context in which the word-form appears. There has recently been an increased interest in statistically based part-of-speech taggers, which use the local context of a word-form for morphosyntactic disambiguation. Such taggers have the advantage of being fast and can be automatically trained on a pre-tagged corpus. Their success rate depends on many factors, but is usually, for tagsets of about 100 tags and for Western European languages, at or below 96%. The most widely implemented approach to stochastic tagging is the one that uses Hidden Markov Models (HMM). The theory behind this approach is explained e.g. in [<A href="http://nl.ijs.si/telri/wg5rep1/node11.html#Charniak94">6</A>]. Such taggers have the putative advantage that they can be trained on an untagged text, using the so-called Baum-Welch algorithm. However, experience shows (cf. [<A href="http://nl.ijs.si/telri/wg5rep1/node11.html#elworthy94">11</A>]) that the accuracy obtained with such training is low, and that training on even a small pre-tagged corpus gives better results. 
Therefore the recommended strategy for training is to hand-tag a small corpus, use this data to train a tagger and automatically tag a bigger corpus, which is then hand-corrected and used for re-training the tagger. This effort should, however, not be underestimated, as hand-tagging a corpus is extremely slow work, and large datasets are needed to adequately train a tagger. While a large number of HMM-based tagger implementations exist, not very many are publicly available. A few sites offer automatic tagging of text which can be submitted via WWW or email; however, this service is restricted to English, or at most some other Western European languages. The taggers that have source code and usually accompanying documentation available via the Internet are the following:<A href="http://nl.ijs.si/telri/wg5rep1/footnode.html#218" name=tex2html16><IMG align=bottom alt=gif src="Part of Speech Taggers.files/foot_motif.gif"></A> <UL> <LI>The <A href="ftp://parcftp.xerox.com/pub/tagger/" name=tex2html17>Xerox tagger</A> is described in [<A href="http://nl.ijs.si/telri/wg5rep1/node11.html#XeroxTag92">8</A>] and implemented in Common Lisp. An advantage of this tagger is that it also contains a tokenizer, and can thus handle plain, i.e. not pre-processed text. <LI><A href="ftp://issco-ftp.unige.ch/pub/multext/" name=tex2html18>MULTEXT tagger</A> is a re-implementation of the Xerox tagger in C, developed in the scope of the MULTEXT project. <LI><A href="http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html" name=tex2html19>IMS TreeTagger</A> is an HMM-based tagger which uses a decision-tree based method for learning, thus leading to improved tagging results. It is available only as an executable for SunOS 4.1.3 and Sparc workstations, but executables for other Unix machines might be available by contacting the author. <LI><A href="http://www.ltg.hcrc.ed.ac.uk/" name=tex2html20>LT PS tagger</A> is (soon to be) available from the Language Technology Group at Edinburgh. 
The advance information on this tagger indicates that it should accept plain text and also text with SGML mark-up. </LI></UL> Other approaches besides HMM have been tried for automatic taggers. The best known is <A href="http://www.cs.jhu.edu/~brill" name=tex2html21>Brill's rule-based tagger</A> [<A href="http://nl.ijs.si/telri/wg5rep1/node11.html#Brill92">3</A>,<A href="http://nl.ijs.si/telri/wg5rep1/node11.html#Brill95">4</A>]. In the training phase, this tagger makes an initial hypothesis about the correct tags. In an iterative fashion it then improves its performance with regard to the training corpus by postulating context dependent tag rewrite rules. The advantage of Brill's tagger in comparison with HMM taggers is that the rule-set it generates is more perspicuous than the transition-weight tables of the HMM taggers. Namely, it often turns out to be advantageous to manually correct the automatically induced knowledge of the tagger and it is simpler and more obvious how to change explicit tag rewriting rules than it is to change tables of numbers. Brill's tagger is written in C, with source code and documentation available. <HR> <A href="http://nl.ijs.si/telri/wg5rep1/node7.html" name=tex2html127><IMG align=bottom alt=next src="Part of Speech Taggers.files/next_motif.gif"></A> <A href="http://nl.ijs.si/telri/wg5rep1/wg5rep1.html" name=tex2html125><IMG align=bottom alt=up src="Part of Speech Taggers.files/up_motif.gif"></A> <A href="http://nl.ijs.si/telri/wg5rep1/node5.html" name=tex2html119><IMG align=bottom alt=previous src="Part of Speech Taggers.files/previous_motif.gif"></A> <A href="http://nl.ijs.si/telri/wg5rep1/node1.html" name=tex2html129><IMG align=bottom alt=contents src="Part of Speech Taggers.files/contents_motif.gif"></A> Ne
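
The supervised route described in the page above (hand-tag a small corpus, then train on it) can be illustrated with a short Python sketch that estimates HMM parameters by relative frequency. The function name `estimate_hmm`, the toy corpus, and the tag names are assumptions made for illustration; none of them come from the TELRI report or from the taggers it lists. No smoothing is applied, so a usable tagger would need more than this.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Estimate start, transition and emission probabilities by counting.

    `tagged_sentences` is a list of sentences, each a list of (word, tag)
    pairs. Probabilities are plain relative frequencies.
    """
    start = Counter()             # tag of the first word in each sentence
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]

    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            emit[tag][word] += 1
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (
        normalize(start),
        {tag: normalize(c) for tag, c in trans.items()},
        {tag: normalize(c) for tag, c in emit.items()},
    )


# A made-up, hand-tagged toy corpus.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

start_p, trans_p, emit_p = estimate_hmm(corpus)
print(start_p)
print(trans_p)
print(emit_p["NOUN"])
```

In the bootstrapping loop the text recommends, tables like these would be re-estimated each time the automatically tagged and hand-corrected corpus grows.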
