Stanford University English Part-of-Speech Tagger

20 files in total

txt: 5

jar: 4

java: 2

ZIP archive, 24.99 MB, uploaded 2017-12-28 20:47:30

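The Stanford University English part-of-speech tagger is a widely used natural language processing (NLP) tool for part-of-speech (POS) tagging of English text. Developed by the Stanford Natural Language Processing Group, it is used in academic research, teaching, information retrieval, machine translation, sentiment analysis, and other areas. When processing large volumes of English text, product reviews in particular, it exposes the grammatical structure of each sentence so that key information can be extracted for further text mining.

POS tagging is one of the basic tasks in NLP. It labels each word with a part of speech such as noun, verb, adjective, or adverb, which helps identify sentence constituents and interpret meaning. For example, in the product review "This phone is fast and reliable.", tagging marks "phone" as a noun, "is" as a verb, and "fast" and "reliable" as adjectives. This information is essential for analyzing the sentiment of the review.

The Stanford tagger uses a maximum entropy (log-linear) model, a statistical sequence labeling approach that predicts each word's tag from features of the word and its context, so it can capture how neighboring words influence a word's part of speech. The package ships two English models: left3words, the faster one, and bidirectional, which is slightly more accurate but significantly slower.

Before using the tool, download the stanford-postagger-2017-06-09 archive. After unpacking it you will find these core files:

1. `stanford-postagger.jar`: the tagger itself, a Java archive containing all the classes needed to run it.
2. `models/english-left3words-distsim.tagger`: a pre-trained English model, trained on a large body of POS-annotated English text.
3. `models/english-left3words-distsim.tagger.props`: the properties file that was used to train that model. It can serve as an example when writing your own properties file.

You can use the tool through its Java API or from the command line. For example, to tag a file from the command line, run the following from the unpacked directory:

```
java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-left3words-distsim.tagger -textFile input.txt > output.txt
```

Here `input.txt` is the text file to tag, and `output.txt` receives the tagged result, which the tagger prints to standard output.

The Stanford English POS tagger offers an efficient and accurate way to analyze English text. Whether you are a researcher, a developer, or a teacher, it can help you explore the structure and meaning of English text in more depth.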
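To make the product-review use case concrete, here is a minimal sketch of post-processing tagged text. It assumes the tagger's default output format, in which each token is written as `word_TAG`, and Penn Treebank tags, where adjective tags start with `JJ`. The class `TaggedOutputParser` and its methods are hypothetical helpers written for this illustration and are not part of the Stanford package; the sample line is hand-written, not captured from a real run.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedOutputParser {

    /** Splits "word_TAG word_TAG ..." into {word, tag} pairs. */
    public static List<String[]> parse(String taggedLine) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : taggedLine.trim().split("\\s+")) {
            // Use the last underscore so words that contain '_' stay intact.
            int sep = token.lastIndexOf('_');
            if (sep <= 0) {
                continue; // skip anything that is not in word_TAG form
            }
            pairs.add(new String[] {token.substring(0, sep), token.substring(sep + 1)});
        }
        return pairs;
    }

    /** Returns the words whose tag starts with the given prefix, e.g. "JJ" for adjectives. */
    public static List<String> wordsWithTagPrefix(String taggedLine, String prefix) {
        List<String> words = new ArrayList<>();
        for (String[] pair : parse(taggedLine)) {
            if (pair[1].startsWith(prefix)) {
                words.add(pair[0]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written sample in word_TAG format (an assumption, not real tagger output).
        String tagged = "This_DT phone_NN is_VBZ fast_JJ and_CC reliable_JJ ._.";
        System.out.println(wordsWithTagPrefix(tagged, "JJ")); // prints [fast, reliable]
    }
}
```

Matching on a tag prefix rather than an exact tag also picks up comparative and superlative adjectives (`JJR`, `JJS`), which are common in reviews.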

stanford-postagger-2017-06-09.zip (20 files)

stanford-postagger-2017-06-09/
    sample-output.txt  619B
    stanford-postagger.jar  3.5MB
    models/
        english-left3words-distsim.tagger.props  1KB
        english-left3words-distsim.tagger  11.83MB
        english-bidirectional-distsim.tagger  15.06MB
        README-Models.txt  4KB
        english-bidirectional-distsim.tagger.props  2KB
    LICENSE.txt  18KB
    stanford-postagger-3.8.0-sources.jar  2.85MB
    TaggerDemo2.java  2KB
    stanford-postagger-3.8.0-javadoc.jar  4.14MB
    stanford-postagger.sh  262B
    TaggerDemo.java  1005B
    README.txt  11KB
    sample-input.txt  379B
    stanford-postagger-3.8.0.jar  3.5MB
    build.xml  6KB
    stanford-postagger.bat  246B
    stanford-postagger-gui.bat  158B
    stanford-postagger-gui.sh  100B

Stanford POS Tagger, v3.8.0 - 2017-06-09
Copyright (c) 2002-2012 The Board of Trustees of
The Leland Stanford Junior University. All Rights Reserved.

Original tagger author: Kristina Toutanova
Code contributions: Christopher Manning, Dan Klein, William Morgan, Huihsin Tseng, Anna Rafferty, John Bauer
Major rewrite for version 2.0 by Michel Galley.
Current release prepared by: Jason Bolton

This package contains a Maximum Entropy part of speech tagger.

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other tokens), such as noun, verb, adjective, etc. Generally computational applications use more fine-grained POS tags like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech (POS) taggers described in:

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003 pages 252-259.

The system requires Java 1.8+ to be installed. About 60 MB of memory is required to run a trained tagger, depending on the OS, tagging model chosen, etc. (i.e., you may need to give to java an option like java -mx120m). Plenty of memory is needed to train a tagger. It depends on the complexity of the model but at least 1GB is recommended (java -mx1g). Two trained tagger models for English are included with the tagger, along with some caseless versions, and we provide models for some other languages. The tagger can be retrained on other languages based on POS-annotated training text.

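As background for the term, a log-linear (maximum entropy) tagger of the kind named above assigns a probability to each candidate tag given its context. The general textbook form is sketched below. The symbols h (context), t (tag), f_j (feature functions), and λ_j (weights) are notation chosen for this sketch, not taken from the README, and the concrete feature sets are the ones described in the cited papers.

```latex
p(t \mid h) = \frac{\exp\left(\sum_{j} \lambda_j f_j(h, t)\right)}
                   {\sum_{t'} \exp\left(\sum_{j} \lambda_j f_j(h, t')\right)}
```

Each f_j is a feature of the context and the candidate tag (for example, the current word paired with the tag), and training chooses the weights λ_j to fit the annotated data.
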
If you really want to use this software under Java 1.4, look into RetroWeaver:

  http://retroweaver.sourceforge.net/


QUICKSTART
-----------------------------------------------

The Stanford POS Tagger is designed to be used from the command line or programmatically via its API. There is a GUI interface, but it is for demonstration purposes only; most features of the tagger can only be accessed via the command line. To run the demonstration GUI you should be able to use either of the following 2 methods:

1)
java -mx200m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTaggerGUI models/wsj-0-18-left3words-distsim.tagger

2) Running the appropriate script for your operating system:
    stanford-postagger-gui.bat
    ./stanford-postagger-gui.sh

To run the tagger from the command line, you can start with the provided script appropriate for your operating system:
    ./stanford-postagger.sh models/wsj-0-18-left3words-distsim.tagger sample-input.txt
    stanford-postagger models\wsj-0-18-left3words-distsim.tagger sample-input.txt
The output should match what is found in sample-output.txt

The tagger has three modes: tagging, training, and testing. Tagging allows you to use a pretrained model (two English models are included) to assign part of speech tags to unlabeled text. Training allows you to save a new model based on a set of tagged data that you provide. Testing allows you to see how well a tagger performs by tagging labeled data and evaluating the results against the correct tags.

Many options are available for training, tagging, and testing. These options can be set using a properties file. To start, you can generate a default properties file by:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops > myPropsFile.prop

This will create the file myPropsFile.prop with descriptions of each option for the tagger and the default values for these options specified.

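Because the README treats properties and command-line flags as interchangeable, the relationship can be sketched as a simple mapping from each `key = value` line to a `-key value` pair. The following is an illustrative sketch only: `PropsToFlags` and `toFlags` are hypothetical names, the property keys are assumed to match the `-model` and `-textFile` flags used in this README, and the model path follows the file names in this archive's `models` directory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

public class PropsToFlags {

    /** Turns each "key = value" property into the pair "-key", "value". */
    public static List<String> toFlags(Properties props) {
        List<String> flags = new ArrayList<>();
        // Sort the keys so the resulting order is deterministic.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            flags.add("-" + key);
            flags.add(props.getProperty(key));
        }
        return flags;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative file content; the keys are assumed to mirror the flag names.
        String fileContent =
            "# options for a tagging run\n"
            + "model = models/english-left3words-distsim.tagger\n"
            + "textFile = sample-input.txt\n";
        Properties props = new Properties();
        props.load(new StringReader(fileContent));
        // Prints the equivalent command-line arguments on one line.
        System.out.println(String.join(" ", toFlags(props)));
    }
}
```

Keeping options in a properties file makes a training or tagging run easier to repeat, while flags are handy for one-off overrides.
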
Any properties you can specify in a properties file can be specified on the command line or vice versa. For further information, please consult the Javadocs (start with the entry for MaxentTagger, which includes a table of all options which may be set to configure the tagger and descriptions of those options).


To tag a file using the pre-trained bidirectional model
=======================================================

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile sample-input.txt > sample-tagged.txt

Tagged output will be printed to standard out, which you can redirect as above. Note that the bidirectional model is slightly more accurate but significantly slower than the left3words model.

To train a simple model
=======================

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -prop propertiesFile -model modelFile -trainFile trainingFile

To test a model
===============

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -prop propertiesFile -model modelFile -testFile testFile


CONTENTS
-----------------------------------------------

README.txt

  This file.

LICENSE.txt

  Stanford POS Tagger is licensed under the GNU General Public License (v2+).

stanford-postagger.jar
stanford-postagger-YYYY-MM-DD.jar

  This is a JAR file containing all the Stanford classes necessary to run the Stanford POS Tagger. The two jar files are identical. You can use either the one with a version (date) indication or without, as you prefer.

src

  A directory containing the Java 1.8 source code for the Stanford POS Tagger distribution.

build.xml, Makefile

  Files for building the distribution (with ant and make, respectively)

models

  A directory containing trained POS taggers; the taggers end in ".tagger" and the props file used to make the taggers end in ".props". The ".props" files cannot be directly used on your own machine as they use paths on the Stanford NLP machines, but they may serve as examples for your own properties files. Included in the full version are other English taggers, a German tagger, an Arabic tagger, and a Chinese tagger. If you chose to download the smaller version of the tagger, you have only two English taggers (left3words is faster but slightly less accurate than bidirectional-distsim) - feel free to download any other taggers you need from the POS tagger website. More information about the models can be found in the README-Models.txt file in this directory.

sample-input.txt

  A sample text file that you can tag to demonstrate the tagger.

sample-output.txt

  Tagged output of the tagger (using the left3words model)

stanford-postagger-gui.sh
stanford-postagger-gui.bat

  Scripts for invoking the GUI demonstration version of the tagger.

stanford-postagger.sh
stanford-postagger.bat

  Scripts for running the command-line version of the tagger.

javadoc

  Javadocs for the distribution. In particular, look at the javadocs for the class edu.stanford.nlp.tagger.maxent.MaxentTagger.

TaggerDemo.java

  A sample file for how to call the tagger in your own program. You should be able to compile and run it with:

  javac -cp stanford-postagger.jar TaggerDemo.java
  java -cp ".:stanford-postagger.jar" TaggerDemo models/wsj-0-18-left3words-distsim.tagger sample-input.txt

  (If you are on Windows, you need to replace the ":" with a ";" in the -cp argument, and should use a "\" in place of the "/" in the filename....)


THANKS
-----------------------------------------------

Thanks to the members of the Stanford Natural Language Processing Lab for great collaborative work on Java libraries for natural language processing.

  http://nlp.stanford.edu/javanlp/


CHANGES
-----------------------------------------------

2017-06-09    3.8.0     new Spanish and French UD models

2016-10-31    3.7.0     Update for compatibility, German UD model

2015-12-09    3.6.0     Updated for compatibility

2015-04-20    3.5.2     Update for
