7/23/2016    Using Wikicorpus & NLTK to build a Spanish part-of-speech tagger | CLiPS

We want to learn general rules in the form of, for example, "any proper noun followed by a verb" instead of "Puerto Rico followed by a verb":

    sentences = wikicorpus(words=1000000)

    ANONYMOUS = "anonymous"
    for s in sentences:
        for i, (w, tag) in enumerate(s):
            if tag == "NP":  # NP = proper noun in Parole tagset.
                s[i] = (ANONYMOUS, "NP")

We can then train NLTK's FastBrillTaggerTrainer. It is based on a unigram tagger, which is simply a lexicon of known words and their part-of-speech tag. It will then boost the accuracy with a set of contextual rules that change a word's part-of-speech tag depending on the surrounding words.

    from nltk.tag import UnigramTagger
    from nltk.tag import FastBrillTaggerTrainer

    from nltk.tag.brill import SymmetricProximateTokensTemplate
    from nltk.tag.brill import ProximateTokensTemplate
    from nltk.tag.brill import ProximateTagsRule
    from nltk.tag.brill import ProximateWordsRule

    ctx = [  # Context = surrounding words and tags.
        SymmetricProximateTokensTemplate(ProximateTagsRule,  (1, 1)),
        SymmetricProximateTokensTemplate(ProximateTagsRule,  (1, 2)),
        SymmetricProximateTokensTemplate(ProximateTagsRule,  (1, 3)),
        SymmetricProximateTokensTemplate(ProximateTagsRule,  (2, 2)),
        SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 1)),
        SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 2)),
        ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1, 1))
    ]

    tagger = UnigramTagger(sentences)
    tagger = FastBrillTaggerTrainer(tagger, ctx, trace=0)
    tagger = tagger.train(sentences, max_rules=100)

    #print tagger.evaluate(wikicorpus(10000, start=1))

Brill's algorithm uses an iterative approach to learn contextual rules. In short, this means that it tries different combinations of interesting rules to find a subset that produces the best tagging accuracy. This process is time-consuming (minutes or hours), so we want to store the final subset for reuse.

Brill's algorithm in NLTK defines context using indices.
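One way to picture such an index pair is as a window of offsets relative to the current token. The following is a minimal Python 3 sketch, not NLTK code: the helpers `window`, `condition_holds` and `symmetric` and the toy tag sequence are hypothetical, and the sketch assumes that a condition holds when at least one token inside the window has the given value, and that a symmetric template also covers the mirrored window.

```python
def window(sequence, i, offsets):
    """Return the items at positions i+start .. i+end that exist in the sequence."""
    start, end = offsets
    return [sequence[j]
            for j in range(i + start, i + end + 1)
            if 0 <= j < len(sequence)]


def condition_holds(tags, i, offsets, value):
    """True if at least one tag inside the window equals the given value."""
    return value in window(tags, i, offsets)


def symmetric(offsets):
    """A window and its mirror image on the other side of the current token."""
    start, end = offsets
    return [(start, end), (-end, -start)]


# Toy sequence of Parole-style tags (illustrative only).
tags = ["DA", "NCS", "VMI", "SP", "DA", "NCS"]

after = window(tags, 2, (1, 2))     # the one or two tags after position 2
before = window(tags, 2, (-2, -1))  # the one or two tags before position 2
mirrored = symmetric((1, 2))
next_is_article = condition_holds(tags, 2, (1, 2), "DA")
```

Here `after` holds the tags at positions 3 and 4, and `before` holds the tags at positions 0 and 1.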
For example, (1, 2) in the previous script means one or two words (or tags) after the current word. Brill's original implementation uses commands to describe context, e.g., NEXT1OR2WORD or NEXT1OR2TAG. Pattern also uses these commands, so we need to map NLTK's indices to the command set:

    from codecs import BOM_UTF8

    ctx = []

    for rule in tagger.rules():
        a = rule.original_tag
        b = rule.replacement_tag
        c = rule._conditions
        x = c[0][2]
        r = c[0][:2]
        if len(c) != 1:  # More complex rules are ignored in this script.
            continue
        if isinstance(rule, ProximateTagsRule):
            if r == (-1, -1): cmd = "PREVTAG"
            if r == (+1, +1): cmd = "NEXTTAG"
            if r == (-2, -1): cmd = "PREV1OR2TAG"
            if r == (+1, +2): cmd = "NEXT1OR2TAG"
            if r == (-3, -1): cmd = "PREV1OR2OR3TAG"
            if r == (+1, +3): cmd = "NEXT1OR2OR3TAG"
            if r == (-2, -2): cmd = "PREV2TAG"
            if r == (+2, +2): cmd = "NEXT2TAG"
        if isinstance(rule, ProximateWordsRule):
            if r == (-1, -1): cmd = "PREVWD"
            if r == (+1, +1): cmd = "NEXTWD"
            if r == (-2, -1): cmd = "PREV1OR2WD"
            if r == (+1, +2): cmd = "NEXT1OR2WD"
        ctx.append("%s %s %s %s" % (a, b, cmd, x))

    open("es-context.txt", "w").write(BOM_UTF8 + "\n".join(ctx).encode("utf-8"))

We end up with a file es-context.txt with 100 contextual rules in a format usable with Pattern.

4. Rules for unknown words based on word suffixes

By default, unknown words (= not in lexicon) will be tagged as nouns. We can improve this with morphological rules, in other words, rules based on word prefixes and suffixes. For example, English words ending in -ly are usually adverbs: really, extremely, and so on. Similarly, Spanish words that end in -mente are adverbs. Spanish words ending in -ando or -iendo are verbs in the present participle: hablando, escribiendo, and so on.

    from collections import defaultdict

    # {"mente": {"RG": 4860, "SP": 8, "VMS": 7}}
    suffix = defaultdict(lambda: defaultdict(int))

    for sentence in wikicorpus(1000000):
        for w, tag in sentence:
            x = w[-5:]  # Last 5 characters.
            if len(x) < len(w) and tag != "NP":
                suffix[x][tag] += 1

    top = []
    for x, tags in suffix.items():
        tag = max(tags, key=tags.get)  # RG
        f1  = sum(tags.values())       # 4860 + 8 + 7
        f2  = tags[tag] / float(f1)    # 4860 / 4875
        top.append((f1, f2, x, tag))

    top = sorted(top, reverse=True)
    top = filter(lambda (f1, f2, x, tag): f1 >= 10 and f2 > 0.8, top)
    top = filter(lambda (f1, f2, x, tag): tag != "NCS", top)
    top = top[:100]
    top = ["%s %s fhassuf %s %s" % ("NCS", x, len(x), tag) for f1, f2, x, tag in top]

    open("es-morphology.txt", "w").write(BOM_UTF8 + "\n".join(top).encode("utf-8"))

We end up with a file es-morphology.txt with 100 suffix rules in a format usable with Pattern. To clarify this, examine the table below. We read 1 million words from Wikicorpus, of which 4,875 words end in -mente. 98% of those are tagged as RG (Parole tag for adverb, RB in Penn tagset). It was also tagged SP (preposition) 8 times and VMS (verb) 7 times.

The above script has two constraints for rule selection: f1 >= 10 will discard rules that match fewer than 10 words, and f2 > 0.8 will discard rules for which the most frequent tag falls below 80%. This means that unknown words that end in -mente will be tagged as adverbs, since we consider the other cases negligible. We can experiment with different settings to see if the accuracy of the tagger improves.

    FREQUENCY  SUFFIX  PARTS-OF-SPEECH           EXAMPLE
    5986       -ación  99% NCS + 1% SP           derivación
    4875       -mente  98% RG + 1% SP + 1% VMS   correctamente
    3276       -iones  99% NCP + 1% VMS          dimensiones
    1824       -bién   100% RG                   también
    1247       -embre  99% W + 1% NCS            septiembre
    1134       -dades  99% NCP + 1% SP           posibilidades

5. Subclassing the pattern.text Parser class

In summary, we constructed an es-lexicon.txt with the part-of-speech tags of known words (steps 1-2) together with an es-context.txt (step 3) and an es-morphology.txt (step 4).

We can use these to create a parser for Spanish by subclassing the base Parser in the pattern.text module. The pattern.text module has base classes for Parser, Lexicon, Morphology, etc. Take a moment to review the source code, and the source code of other parsers in Pattern. You'll notice that all parsers follow the same simple steps. A template for new parsers is included in pattern.text.xx.

The Parser base class has the following methods with default behavior:

- Parser.find_tokens() finds sentence markers (.?!) and splits punctuation marks from words,
- Parser.find_tags() finds word part-of-speech tags,
- Parser.find_chunks() finds words that belong together (e.g., the black cats),
- Parser.find_labels() finds word roles in the sentence (e.g., subject and object),
- Parser.find_lemmata() finds word base forms (cats → cat),
- Parser.parse() executes the above steps on a given string.

We can create an instance of the SpanishParser and feed it our data. We will need to redefine find_tags() to map Parole tags to Penn Treebank tags (which all other parsers in Pattern use as well).

    from pattern.text import Parser

    PAROLE = {
         "CC": "CC",
        "NCS": "NN",
        "VAN": "MD",
        "VMN": "VB",
         "RG": "RB",
        # ... (other Parole tags)
    }

    def parole2penntreebank(token, tag):
        return token, PAROLE.get(tag, tag)

    class SpanishParser(Parser):

        def find_tags(self, tokens, **kwargs):
            # Parser.find_tags() can take an optional map(token, tag) function
            # which returns an updated (token, tag)-tuple.
            kwargs.setdefault("map", parole2penntreebank)
            return Parser.find_tags(self, tokens, **kwargs)

Load the lexicon and the rules in an instance of SpanishParser:

    from pattern.text import Lexicon

    lexicon = Lexicon(
              path = "es-lexicon.txt",
        morphology = "es-morphology.txt",
           context = "es-context.txt",
          language = "es"
    )

    parser = SpanishParser(
         lexicon = lexicon,
         default = ("NCS", "NP", "Z"),
        language = "es"
    )

    def parse(s, *args, **kwargs):
        return parser.parse(s, *args, **kwargs)

It is still missing features (notably lemmatization) but our Spanish parser is essentially ready for use:

    print parse(u"El gato se sentó en la alfombra")

    El  gato  se   sentó  en  la  alfombra
    DT  NN    PRP  VB     IN  DT  NN

6. Testing the accuracy of the parser

The following script can be used to test the accuracy of the parser against Wikicorpus. We used 1.5 million words with 300 contextual and 100 morphological rules for an accuracy of about 91%. So we lost 9%, but the parser is also fast and compact: the data files are about 1MB in size. Note how we pass map=None to the parse() command. This parameter is in turn passed to SpanishParser.find_tags() so that the original Parole tags are returned, which we can compare to the tags in Wikicorpus:

    i = 0
    n = 0
    for s1 in wikicorpus(100000, start=1):
        s2 = " ".join(w for w, tag in s1)
        s2 = parse(s2, tags=True, chunks=False, map=None).split()[0]
        for (w1, tag1), (w2, tag2) in zip(s1, s2):
            if tag1 == tag2:
                i += 1
            n += 1

    print float(i) / n

http://www.clips.ua.ac.be/pages/using-wikicorpus-nltk-to-build-a-spanish-part-of-speech-tagger
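The effect of map=None can be tried without Pattern installed. The sketch below is written for Python 3 and uses hypothetical stand-ins: a two-entry PAROLE excerpt and a simplified `find_tags` function, neither of which is Pattern's own code. It assumes, as the description of step 6 implies, that the mapping step is skipped when map is None. The point is that `setdefault` only installs the Parole-to-Penn mapping when the caller did not pass a `map` argument at all.

```python
# Hypothetical excerpt of the Parole-to-Penn mapping, for illustration only.
PAROLE = {"RG": "RB", "NCS": "NN"}


def parole2penntreebank(token, tag):
    """Map a Parole tag to Penn Treebank, leaving unknown tags unchanged."""
    return token, PAROLE.get(tag, tag)


def find_tags(tagged, **kwargs):
    """Simplified stand-in for the overridden find_tags()."""
    # setdefault() leaves an explicitly passed map=None untouched.
    kwargs.setdefault("map", parole2penntreebank)
    mapper = kwargs["map"]
    if mapper is None:
        return list(tagged)
    return [mapper(token, tag) for token, tag in tagged]


tagged = [("correctamente", "RG"), ("gato", "NCS")]

penn = find_tags(tagged)              # default: tags are converted
parole = find_tags(tagged, map=None)  # explicit None: tags stay in Parole
```

With no `map` argument the adverb comes back as RB; with map=None it comes back as RG, which is what the comparison against Wikicorpus needs.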

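The accuracy figure in step 6 is the share of tokens whose predicted tag equals the corpus tag. A minimal Python 3 sketch on toy data follows; the `tag_accuracy` helper and the two example sentences are hypothetical and are not part of Pattern or Wikicorpus.

```python
def tag_accuracy(gold_sentences, predicted_sentences):
    """Fraction of aligned tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    return float(correct) / total if total else 0.0


# Toy data with Parole-style tags (illustrative only).
gold = [
    [("El", "DA"), ("gato", "NCS"), ("duerme", "VMI")],
    [("Corre", "VMI"), ("correctamente", "RG")],
]
predicted = [
    [("El", "DA"), ("gato", "NCS"), ("duerme", "NCS")],  # one wrong tag
    [("Corre", "VMI"), ("correctamente", "RG")],
]

accuracy = tag_accuracy(gold, predicted)
```

Note that zip() stops at the shorter of its inputs, so a comparison like this is only meaningful when both sides are tokenized into the same number of tokens per sentence.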