<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0041)http://nl.ijs.si/telri/wg5rep1/node6.html -->
<!-- Converted with LaTeX2HTML 95.1 (Fri Jan 20 1995) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds --><HTML><HEAD><TITLE>Part of Speech Taggers</TITLE>
<META content="text/html; charset=gb2312" http-equiv=Content-Type>
<META content="MSHTML 5.00.2920.0" name=GENERATOR></HEAD>
<BODY>
<META name=description value=" Part of Speech Taggers">
<META name=keywords value="wg5rep1">
<META name=resource-type value="document">
<META name=distribution value="global">
<P><BR>
<HR>
<A href="http://nl.ijs.si/telri/wg5rep1/node7.html" name=tex2html127><IMG
align=bottom alt=next src="Part of Speech Taggers.files/next_motif.gif"></A> <A
href="http://nl.ijs.si/telri/wg5rep1/wg5rep1.html" name=tex2html125><IMG
align=bottom alt=up src="Part of Speech Taggers.files/up_motif.gif"></A> <A
href="http://nl.ijs.si/telri/wg5rep1/node5.html" name=tex2html119><IMG
align=bottom alt=previous
src="Part of Speech Taggers.files/previous_motif.gif"></A> <A
href="http://nl.ijs.si/telri/wg5rep1/node1.html" name=tex2html129><IMG
align=bottom alt=contents
src="Part of Speech Taggers.files/contents_motif.gif"></A> <BR><B>Next:</B> <A
href="http://nl.ijs.si/telri/wg5rep1/node7.html" name=tex2html128>Aligners</A>
<B>Up:</B> <A href="http://nl.ijs.si/telri/wg5rep1/wg5rep1.html"
name=tex2html126>TELRI</A> <B>Previous:</B> <A
href="http://nl.ijs.si/telri/wg5rep1/node5.html" name=tex2html120>SGML
software</A> <BR>
<HR>
<P>
<H1><A name=SECTION00060000000000000000>Part of Speech Taggers</A></H1>
<P>Many corpora are, in addition to structural and bibliographic information,
annotated with linguistic knowledge. The most basic and common form this
annotation takes is marking up the running words in the corpus with their
<EM>part of speech tags</EM>. This adds value to the corpus because, for
example, searches can be performed not only on the word-forms as strings but
also on whether they belong to a certain linguistic category. Such tags are
typically taken to be atomic labels attached to words, denoting the part of
speech of the word together with shallow morphosyntactic information; for
example, they specify the word as a singular proper noun or a plural comparative adjective.
For English and other Western European languages, for which most such annotated
corpora have been produced, the tagset size ranges from about forty to several
hundred distinct categories.
<P>To label the words in the corpus with their PoS, we first need a lexicon or
morphological analyser that gives all the possible tags of a given word-form.
Part-of-speech taggers then take as their input all these possible
morphosyntactic interpretations of the word-form and output the correct
interpretation, given the context in which the word-form appears.
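<P>As a minimal sketch of this division of labour, the Python fragment below
shows a lexicon lookup that returns every candidate tag for each word-form; this
list of candidates is what a tagger would then disambiguate. The lexicon entries,
the tag names and the function names <CODE>candidate_tags</CODE> and
<CODE>ambiguity_classes</CODE> are illustrative assumptions and are not taken
from any of the tools discussed here.
<PRE>
```python
# Hypothetical toy lexicon: each word-form maps to every tag it may carry.
LEXICON = {
    "the": ["DET"],
    "can": ["MODAL", "NOUN", "VERB"],
    "rusts": ["NOUN", "VERB"],
}


def candidate_tags(word, default=("NOUN",)):
    """Return all possible tags for a word-form.

    Unknown word-forms fall back to a default guess (an assumption of this
    sketch; a morphological analyser would do better).
    """
    return list(LEXICON.get(word.lower(), default))


def ambiguity_classes(tokens):
    """Pair each token with its candidate tags; this is the tagger's input."""
    return [(tok, candidate_tags(tok)) for tok in tokens]


for token, tags in ambiguity_classes(["The", "can", "rusts"]):
    print(token, tags)
```
</PRE>
<P>Only word-forms with more than one candidate tag, such as "can" above, need
contextual disambiguation.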
<P>There has recently been an increased interest in statistically based
part-of-speech taggers, which use the local context of a word-form for
morphosyntactic disambiguation. Such taggers have the advantage of being fast
and can be automatically trained on a pre-tagged corpus. Their accuracy
depends on many factors, but for tagsets of about 100 tags and for
Western European languages it is usually at or below 96%.
<P>The most widely implemented approach to stochastic tagging is the one that
uses Hidden Markov Models (HMM). The theory behind this approach is explained
e.g. in [<A href="http://nl.ijs.si/telri/wg5rep1/node11.html#Charniak94">6</A>].
Such taggers have the putative advantage that they can be trained on untagged
text, using the so-called Baum-Welch algorithm. However, experience shows (cf.
[<A href="http://nl.ijs.si/telri/wg5rep1/node11.html#elworthy94">11</A>]) that
the accuracy obtained with such training is low, and that training on even a
small pre-tagged corpus gives better results. The recommended training strategy
is therefore to hand-tag a small corpus, use this data to train a tagger, and
automatically tag a bigger corpus, which is then hand-corrected and used for
re-training the tagger. This effort should, however, not be underestimated, as
hand-tagging a corpus is extremely slow work, and large datasets are needed to
adequately train a tagger.
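<P>To make the disambiguation step concrete, the following sketch shows Viterbi
decoding, the standard way of finding the most probable tag sequence under a
bigram HMM. All probabilities are invented for illustration; a real tagger would
estimate them from a pre-tagged corpus or with Baum-Welch, and would typically
work with log probabilities to avoid numerical underflow. The function name, the
table layout and the tagset are assumptions of this sketch, and the function
assumes a non-empty token list.
<PRE>
```python
def viterbi(tokens, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for tokens under a bigram HMM."""
    # best[i][t] = (probability of the best path ending in tag t at position i,
    #               tag chosen at position i - 1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(tokens[0], 0.0), None) for t in tags}]
    for tok in tokens[1:]:
        column = {}
        for t in tags:
            # Pick the predecessor tag that maximises the path probability.
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(tok, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace the best path backwards from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Toy model with invented numbers (not estimated from any corpus).
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9},
    "NOUN": {"can": 0.3, "rusts": 0.1},
    "VERB": {"can": 0.4, "rusts": 0.5},
}

print(viterbi(["the", "can", "rusts"], TAGS, START_P, TRANS_P, EMIT_P))
# prints ['DET', 'NOUN', 'VERB']
```
</PRE>
<P>In this toy model "can" is more likely a verb than a noun in isolation, yet
the preceding determiner makes the noun reading win, which is the kind of
local-context effect such taggers exploit.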
<P>While a large number of HMM-based tagger implementations exist, not very many
are publicly available. A few sites offer automatic tagging of text submitted
via WWW or email; however, this service is restricted to English, or at
most some other Western European languages. The taggers whose source code,
usually with accompanying documentation, is available via the Internet are the
following:<A href="http://nl.ijs.si/telri/wg5rep1/footnode.html#218"
name=tex2html16><IMG align=bottom alt=gif
src="Part of Speech Taggers.files/foot_motif.gif"></A>
<UL>
<LI>The <A href="ftp://parcftp.xerox.com/pub/tagger/" name=tex2html17><B>Xerox
tagger</B></A> is described in [<A
href="http://nl.ijs.si/telri/wg5rep1/node11.html#XeroxTag92">8</A>] and
implemented in Common Lisp. An advantage of this tagger is that it also
contains a tokenizer, and can thus handle plain, i.e. not pre-processed, text.
<P></P>
<LI><A href="ftp://issco-ftp.unige.ch/pub/multext/" name=tex2html18><B>MULTEXT
tagger</B></A> is a re-implementation of the Xerox tagger in C, developed in
the scope of the MULTEXT project.
<P></P>
<LI><A href="http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html"
name=tex2html19><B>IMS TreeTagger</B></A> is an HMM-based tagger which uses a
decision-tree-based method for learning, leading to improved tagging
results. It is available only as an executable for Sparc workstations running
SunOS 4.1.3, but executables for other Unix machines might be obtained by
contacting the author.
<P></P>
<LI><A href="http://www.ltg.hcrc.ed.ac.uk/" name=tex2html20><B>LT PS
tagger</B></A> is (soon to be) available from the Language Technology Group at
Edinburgh. The advance information on this tagger indicates that it should
accept plain text and also text with SGML mark-up.
<P></P></LI></UL>
<P>Other approaches besides HMMs have been tried for automatic taggers. The best
known is <A href="http://www.cs.jhu.edu/~brill" name=tex2html21><B>Brill's rule
based tagger</B></A> [<A
href="http://nl.ijs.si/telri/wg5rep1/node11.html#Brill92">3</A>,<A
href="http://nl.ijs.si/telri/wg5rep1/node11.html#Brill95">4</A>]. In the
training phase, this tagger makes an initial hypothesis about the correct tags.
It then iteratively improves its performance on the
training corpus by postulating context-dependent tag rewrite rules. The
advantage of Brill's tagger in comparison with HMM taggers is that the rule set
it generates is more perspicuous than the transition-weight tables of the HMM
taggers. This matters because it often turns out to be advantageous to manually
correct the automatically induced knowledge of the tagger, and it is simpler and
more obvious how to change explicit tag rewriting rules than it is to change tables of
numbers. Brill's tagger is written in C, with source code and documentation
available.
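<P>The idea of explicit rewrite rules can be sketched as follows. The rule format
(change one tag to another when the preceding tag has a given value) is only one
kind of contextual rule such a tagger might use, and the names <CODE>Rule</CODE>
and <CODE>apply_rules</CODE>, the tags and the example rule are hypothetical
rather than taken from Brill's distribution. The sketch covers only the
application of already learned rules, not their induction from a training corpus.
<PRE>
```python
from collections import namedtuple

# Hypothetical rule format: rewrite from_tag as to_tag when the tag of the
# preceding word is prev_tag.
Rule = namedtuple("Rule", ["from_tag", "to_tag", "prev_tag"])


def apply_rules(tags, rules):
    """Apply each rewrite rule in order, scanning the tags left to right."""
    tags = list(tags)  # work on a copy so the initial hypothesis is preserved
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


# Assumed initial hypothesis for "the can rusts".
initial = ["DET", "MODAL", "VERB"]
rules = [Rule("MODAL", "NOUN", "DET")]  # a modal right after a determiner is a noun

print(apply_rules(initial, rules))
# prints ['DET', 'NOUN', 'VERB']
```
</PRE>
<P>Correcting the tagger by hand then amounts to adding, removing or reordering
entries in the rule list, which is what makes this representation easier to
inspect than a table of weights.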
<P><BR>
<HR>
<BR><B>Ne