
word-1.0 Chinese word segmentation tool: a Java Chinese word segmentation component with multiple dictionary-based segmentation algorithms

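word is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can identify out-of-vocabulary words such as person names, place names, and organization names. Plugins for Lucene, Solr, and ElasticSearch are also provided.

Usage:

1. Quick start

Run the script demo-word.bat in the project root directory to try out segmentation.
Usage: command [text] [input] [output]
The command can be one of: demo, text, file

    demo
    text 杨尚川是APDPlat应用级产品开发平台的作者
    file d:/text.txt d:/word.txt
    exit

2. Segmenting text

Remove stop words:

    List<Word> words = WordSegmenter.seg("杨尚川是APDPlat应用级产品开发平台的作者");

Keep stop words:

    List<Word> words = WordSegmenter.segWithStopWords("杨尚川是APDPlat应用级产品开发平台的作者");
    System.out.println(words);

Output:

    Stop words removed: [杨尚川, apdplat, 应用级, 产品, 开发平台, 作者]
    Stop words kept: [杨尚川, 是, apdplat, 应用级, 产品, 开发平台, 的, 作者]

3. Segmenting files

    String input = "d:/text.txt";
    String output = "d:/word.txt";

Remove stop words:

    WordSegmenter.seg(new File(input), new File(output));

Keep stop words:

    WordSegmenter.segWithStopWords(new File(input), new File(output));

4. Custom configuration file

The default configuration file is word.conf on the classpath, packaged inside word-x.x.jar.
The custom configuration file is word.local.conf on the classpath and must be supplied by the user.
If a custom setting has the same key as a default setting, the custom setting overrides the default.
Configuration files are encoded in UTF-8.

5. Custom user dictionary

The custom user dictionary consists of one or more directories or files, given as absolute or relative paths.
A user dictionary is made up of multiple dictionary files encoded in UTF-8.
Each dictionary file is a plain text file with one word per line.
The paths can be set through a system property or a configuration file, with multiple paths separated by commas.
Dictionary files on the classpath need the prefix classpath: in front of the relative path.

There are three ways to specify the paths.

Method 1, programmatically (high priority):

    WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");
    DictionaryFactory.reload(); // reload the dictionary after changing the dictionary path

Method 2, JVM startup parameter (medium priority):

    java -Ddic.path=classpath:dic.txt,d:/custom_dic

Method 3, configuration file (low priority):

Use the file word.local.conf on the classpath to specify the setting:

    dic.path=classpath:dic.txt,d:/custom_dic

If nothing is specified, the dictionary file dic.txt on the classpath is used by default.

6. Custom stop word dictionary

This works the same way as the custom user dictionary. The configuration item is:

    stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic

7. Automatic detection of dictionary changes

Changes to the custom user dictionary and the custom stop word dictionary are detected automatically.
This covers files and directories on the classpath as well as absolute and relative paths outside the classpath, for example:

    classpath:dic.txt,classpath:custom_dic_dir,
    d:/dic_more.txt,d:/DIC_DIR,D:/DIC2_DIR,my_dic_dir,my_dic_file.txt

    classpath:stopwords.txt,classpath:custom_stopwords_dic_dir,
    d:/stopwords_more.txt,d:/STOPWORDS_DIR,d:/STOPWORDS2_DIR,stopwords_dir,remove.txt

8. Explicitly specifying the segmentation algorithm

When segmenting text, a specific segmentation algorithm can be selected explicitly, for example:

    WordSegmenter.seg("APDPlat应用级产品开发平台", SegmentationAlgorithm.BidirectionalMaximumMatching);

The available values of SegmentationAlgorithm are:

    Forward maximum matching: MaximumMatching
    Reverse maximum matching: ReverseMaximumMatching
    Forward minimum matching: MinimumMatching
    Reverse minimum matching: ReverseMinimumMatching
    Bidirectional maximum matching: BidirectionalMaximumMatching
    Bidirectional minimum matching: BidirectionalMinimumMatching
    Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching

Lucene plugin:

1. Construct a word analyzer, ChineseWordAnalyzer

    Analyzer analyzer = new ChineseWordAnalyzer();

2. Use the word analyzer to tokenize text

    TokenStream tokenStream = analyzer.tokenStream("text", "杨尚川是APDPlat应用级产品开发平台的作者");
    tokenStream.reset(); // the Lucene 4.x TokenStream contract requires reset() before incrementToken()
    while (tokenStream.incrementToken()) {
        CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
        System.out.println(charTermAttribute.toString() + " " + offsetAttribute.startOffset());
    }
    tokenStream.close();

3. Use the word analyzer to build a Lucene index

    Directory directory = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, config);

4. Use the word analyzer to query a Lucene index

    // indexSearcher is an IndexSearcher opened on the index built above
    QueryParser queryParser = new QueryParser(Version.LUCENE_47, "text", analyzer);
    Query query = queryParser.parse("text:杨尚川");
    TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);

Solr plugin:

1. Create the directory solr-4.7.1/example/solr/lib and copy word-1.0.jar into it.

2. Configure the schema to use the tokenizer

In the file solr-4.7.1/example/solr/collection1/conf/schema.xml, replace every occurrence of

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

and

    <tokenizer class="solr.StandardTokenizerFactory"/>

with

    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>

and remove all filter tags.

3. To use a specific segmentation algorithm:

    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"/>

The available values of segAlgorithm are:

    Forward maximum matching: MaximumMatching
    Reverse maximum matching: ReverseMaximumMatching
    Forward minimum matching: MinimumMatching
    Reverse minimum matching: ReverseMinimumMatching
    Bidirectional maximum matching: BidirectionalMaximumMatching
    Bidirectional minimum matching: BidirectionalMinimumMatching
    Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching

If not specified, bidirectional maximum matching (BidirectionalMaximumMatching) is used by default.

4. To use a specific configuration file:

    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"
        conf="C:/solr-4.7.0/example/solr/nutch/conf/word.local.conf"/>

If not specified, the default configuration file is used, which is the word.conf file inside word-1.0.jar.

ElasticSearch plugin:

1. Create the directory elasticsearch-1.1.0/plugins/word.

2. Copy the Chinese word segmentation library word-1.0.jar and the logging libraries it depends on

    slf4j-api-1.6.4.jar
    logback-core-0.9.28.jar
    logback-classic-0.9.28.jar

into the word directory just created.

3. Edit the file elasticsearch-1.1.0/config/elasticsearch.yml and add the following settings:

    index.analysis.analyzer.default.type : "word"
    index.analysis.tokenizer.default.type : "word"

4. Start ElasticSearch and test the result by opening this URL in Chrome:

    http://localhost:9200/_analyze?analyzer=word&text=杨尚川是APDPlat应用级产品开发平台的作者

5. Custom configuration

Copy word.local.conf into the directory elasticsearch-1.1.0/plugins/word.

6. Specifying the segmentation algorithm

Edit the file elasticsearch-1.1.0/config/elasticsearch.yml and add the following settings:

    index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"
    index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"

The available values of segAlgorithm here are:

    Forward maximum matching: MaximumMatching
    Reverse maximum matching: ReverseMaximumMatching
    Forward minimum matching: MinimumMatching
    Reverse minimum matching: ReverseMinimumMatching
    Bidirectional maximum matching: BidirectionalMaximumMatching
    Bidirectional minimum matching: BidirectionalMinimumMatching
    Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching

If not specified, bidirectional maximum matching (BidirectionalMaximumMatching) is used by default.

Articles on segmentation algorithms:

1. Chinese word segmentation algorithms: dictionary-based forward maximum matching
2. Chinese word segmentation algorithms: dictionary-based reverse maximum matching
3. Chinese word segmentation algorithms: performance optimization and testing of the dictionary mechanism
4. Chinese word segmentation algorithms: dictionary-based forward minimum matching
5. Chinese word segmentation algorithms: dictionary-based reverse minimum matching
6. The Java open source project cws_evaluation: evaluating the segmentation quality of Chinese word segmenters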
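
The dictionary-based forward and reverse maximum matching algorithms named above can be illustrated with a minimal, self-contained sketch. This is not the word library's own implementation: the class MaxMatchSketch, its methods forward and reverse, the maxLen parameter, and the toy dictionary are all hypothetical names chosen for illustration. An unspaced ASCII string stands in for Chinese text so that the listing does not depend on source file encoding. The sketch also leaves out the n-gram disambiguation step that the description mentions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; all names here are hypothetical, not the word library's API.
public class MaxMatchSketch {

    // Forward maximum matching: scan left to right and, at each position,
    // take the longest dictionary word of at most maxLen characters.
    // If nothing matches, emit a single character and move on.
    public static List<String> forward(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxLen);
            while (end > start + 1 && !dict.contains(text.substring(start, end))) {
                end--; // shrink the candidate from the right
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    // Reverse maximum matching: the same idea, scanning right to left.
    public static List<String> reverse(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxLen);
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++; // shrink the candidate from the left
            }
            words.add(text.substring(start, end));
            end = start;
        }
        Collections.reverse(words); // words were collected back to front
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary (an assumption for the demo).
        Set<String> dict = Set.of("the", "theme", "men", "dine", "here", "in");
        String text = "themendinehere";
        System.out.println(forward(text, dict, 5)); // prints [theme, n, dine, here]
        System.out.println(reverse(text, dict, 5)); // prints [the, men, dine, here]
    }
}
```

The two directions disagree on this input, which is the kind of ambiguity a bidirectional algorithm has to resolve by choosing between the candidate segmentations.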

Rating: 5

Views: 1013

Resource size: 10.41 MB

Upload date: 2014-11-05

Points required: 50

Love_Hachi
