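A Chinese word segmentation plugin built on NLP techniques, with markedly higher accuracy than commonly used tokenizers; provides both ElasticSearch and OpenSearch plugins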

536 files in total

java: 465

txt: 30

bin: 20

Chinese word segmentation

Software development

Uploaded 2023-11-15 16:27:39 · 267.09 MB · ZIP

Archive contents (536 files):

cws.bin 265.16MB

pos.bin 157.19MB

cws.bin 94.3MB

pos.bin 58.3MB

pos.bin 58.06MB

ner.bin 44.7MB

cws.bin 27.11MB

CoreNatureDictionary.ngram.txt.table.bin 22.92MB

CustomDictionary.txt.bin 15.64MB

ner.txt.bin 14.59MB

cws.txt.bin 11.7MB

pos.txt.bin 8.59MB

CoreNatureDictionary.txt.bin 5.85MB

ner.bin 3.36MB

pinyin.txt.bin 2.57MB

nr.txt.bin 1.56MB

nt.txt.bin 1.33MB

CharTable.txt.bin 128KB

CharType.bin 22KB

stopwords.txt.bin 19KB

TagPKU98.csv 16KB

nrj.txt.trie.dat 1.44MB

nrf.txt.trie.dat 909KB

nrj.txt.value.dat 67KB

.gitignore 34B

OrganizationDictionary.java 156KB

Pinyin.java 93KB

FeatureExtractor.java 49KB

MDAG.java 49KB

DoubleArrayTrie.java 39KB

MutableDoubleArrayTrieInteger.java 35KB

ArcEagerBeamTrainer.java 32KB

HanLP.java 31KB

Segment.java 28KB

AhoCorasickDoubleArrayTrie.java 27KB

AbstractLexicalAnalyzer.java 26KB

TaggerImpl.java 24KB

DynamicCustomDictionary.java 24KB

KBeamArcEagerParser.java 24KB

String2PinyinConverter.java 23KB

Args.java 23KB

IOUtil.java 22KB

Word2VecTraining.java 20KB

WordBasedSegment.java 19KB

Options.java 18KB

MDAGNode.java 18KB

BinTrie.java 18KB

Utility.java 18KB

TextUtility.java 17KB

Nature.java 17KB

Preconditions.java 17KB

Encoder.java 15KB

HanLPDemo.java 15KB

ClusterAnalyzer.java 15KB

ParseThread.java 14KB

Occurrence.java 14KB

DoubleArrayBuilder.java 14KB

CRFModel.java 14KB

DoubleArrayTrieInteger.java 14KB

LinearModel.java 14KB

CoreDictionary.java 13KB

Mcsrch.java 13KB

Sentence.java 13KB

CoNLLReader.java 13KB

Vertex.java 13KB

EncoderFeatureIndex.java 13KB

PerceptronTrainer.java 13KB

MaxEntModel.java 13KB

DawgBuilder.java 12KB

AveragedPerceptron.java 11KB

WordNet.java 11KB

NTDictionaryMaker.java 11KB

SimpleMDAGNode.java 10KB

CRFSegment.java 10KB

DictionaryMaker.java 10KB

MutableDoubleArrayTrie.java 10KB

Viterbi.java 10KB

CoreBiGramTableDictionary.java 10KB

NRDictionaryMaker.java 10KB

CommonSynonymDictionary.java 10KB

HiddenMarkovModel.java 10KB

LbfgsOptimizer.java 10KB

LogLinearModel.java 10KB

FeatureIndex.java 9KB

TextRankSentence.java 9KB

Trie.java 9KB

CustomDictionary.java 9KB

ByteUtil.java 9KB

CharacterBasedGenerativeModel.java 9KB

POSInstance.java 9KB

BaseChineseDictionary.java 9KB

ViterbiSegment.java 9KB

CWSInstance.java 9KB

PinyinDictionary.java 8KB

NShortPath.java 8KB

NShortSegment.java 8KB

PersonDictionary.java 8KB

PerceptronClassifier.java 8KB

SecondOrderHiddenMarkovModel.java 8KB

TfIdfCounter.java 8KB


ideaseg is a Chinese tokenizer based on the latest [hanlp](https://github.com/hankcs/hanlp/tree/1.x) natural language processing toolkit. It includes the latest model data and removes the [neuralnetworkparser](https://github.com/hankcs/hanlp/issues/644) code and data that hanlp ships under a license that does not permit commercial use.

Compared with other tokenizers such as ik and jcseg, hanlp greatly improves tokenization accuracy but sacrifices speed. By optimizing and configuring hanlp, ideaseg aims for the best balance between accuracy and tokenization speed.

Compared with other plugins based on hanlp, ideaseg tracks the latest hanlp code and data, removes the content that cannot be used commercially, configures itself automatically, and bundles the model data so that nothing has to be downloaded separately. This makes it simple and convenient to use.

ideaseg provides three modules:

1. `core` ~ core tokenizer module
2. `elasticsearch` ~ ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
3. `opensearch` ~ ideaseg tokenizer plugin for opensearch (default version 2.4.1)

**Note about the version of elasticsearch. Since version 7.11.1, elastic has modified the license of es and changed the permission policy for plugins, which are no longer allowed to read and write files. Because the hanlp model data is very large, hanlp speeds up loading by generating cache files in the plugin's data directory. So if you are using elasticsearch, please use version 7.10.2 or lower; opensearch is the recommended choice.**

The `data` folder contains the hanlp model data. The model data is large (400-500 MB after packaging), elasticsearch plugins are strictly bound to the version of the engine itself, and there are many engine versions. For these reasons this project does not provide pre-compiled binaries, and you need to download the source code and build it yourself.
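The version constraint in the note above can be checked mechanically before you start a build. The following is a minimal Python sketch and is not part of ideaseg: the names `parse_version` and `is_supported_es_version` are hypothetical, and the 7.10.2 ceiling is simply the value stated in the note.

```python
import re

# Highest elasticsearch release the note above recommends for this plugin.
MAX_SUPPORTED_ES = (7, 10, 2)


def parse_version(text):
    """Turn '7.10.2' (optionally followed by a suffix) into the tuple (7, 10, 2)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_supported_es_version(text):
    """Return True when the version is 7.10.2 or lower.

    Versions are compared as integer tuples, so 7.9.3 correctly sorts
    below 7.10.2 (a plain string comparison would get this wrong).
    """
    return parse_version(text) <= MAX_SUPPORTED_ES


if __name__ == "__main__":
    for candidate in ("7.9.3", "7.10.2", "7.11.1"):
        verdict = "ok" if is_supported_es_version(candidate) else "consider opensearch"
        print(f"{candidate}: {verdict}")
```

Comparing integer tuples rather than strings is the one design point here; everything else is ordinary parsing.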
### Building

The following is the process of building the plugin. Before starting, please install git, java, maven and the other related tools.

First, determine the exact version of your elasticsearch or opensearch. Assuming you are using elasticsearch 7.10.2, open the `ideaseg/elasticsearch/pom.xml` file with a text editor and change the value of `elasticsearch.version` to `7.10.2` (for opensearch, modify `opensearch/pom.xml` instead).

Save the file, open a command line window, and run the following commands to start the build:

```shell
$ git clone https://gitee.com/indexea/ideaseg
$ cd ideaseg
$ mvn install
```

When the build completes, a plugin file named `ideaseg.zip` is generated in each of `elasticsearch/target` and `opensearch/target`.

### Installation

After the build completes, install the plugin with the plugin management tool provided by elasticsearch or opensearch. For elasticsearch the tool is `<elasticsearch>/bin/elasticsearch-plugin`; for opensearch it is `<opensearch>/bin/opensearch-plugin`. `<elasticsearch>` and `<opensearch>` are the installation directories of the two services.

#### Install ideaseg plugin for elasticsearch

```shell
$ bin/elasticsearch-plugin install file:///<ideaseg>/elasticsearch/target/ideaseg.zip
```

#### Install ideaseg plugin for opensearch

```shell
$ bin/opensearch-plugin install file:///<ideaseg>/opensearch/target/ideaseg.zip
```

Here `<ideaseg>` is the path to the `ideaseg` source code. Note that the path must be prefixed with `file://`. On Windows the prefix is `file:///`, for example `file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip`.

During installation the plugin tool prompts for permissions; press Enter to confirm and complete the installation, then restart the service.
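Getting the `file://` prefix right is the step most likely to go wrong, so here is a small Python sketch that derives the URI from an absolute path. It is an illustration only and not part of ideaseg: `plugin_file_uri` and `install_command` are hypothetical helper names, and the example paths are made up. The sketch emits forward slashes for Windows drive paths (`file:///d:/...`), which is the form the elasticsearch plugin documentation uses; UNC paths are not handled.

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import quote


def plugin_file_uri(path):
    """Build a file:// URI from an absolute PurePosixPath or PureWindowsPath."""
    if not path.is_absolute():
        raise ValueError(f"an absolute path is required: {path}")
    posix = path.as_posix()  # backslashes become forward slashes
    if not posix.startswith("/"):
        # Windows drive path such as 'd:/workdir/...' needs a leading slash.
        posix = "/" + posix
    # Percent-encode spaces and similar characters; keep '/' and ':' readable.
    return "file://" + quote(posix, safe="/:")


def install_command(tool, zip_path):
    """Compose the install command line for the given plugin tool."""
    return f"{tool} install {plugin_file_uri(zip_path)}"


if __name__ == "__main__":
    # Hypothetical checkout locations, used only for illustration.
    linux_zip = PurePosixPath("/opt/src/ideaseg/opensearch/target/ideaseg.zip")
    windows_zip = PureWindowsPath(r"d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip")
    print(install_command("bin/opensearch-plugin", linux_zip))
    print(install_command("bin/elasticsearch-plugin", windows_zip))
```

Using the pure path classes keeps the sketch runnable on any operating system, because they manipulate path strings without touching the file system.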
Next, you can test the plugin with the word segmentation test API as follows:

```
POST _analyze
{
  "analyzer": "ideaseg",
  "text": "你好，我用的是 ideaseg 分词插件。"
}
```

`ideaseg` provides two segmentation modes, `standard` and `pinyin`. The default is `standard`, and its `analyzer` value is `ideaseg`. To use `pinyin` mode, change the `analyzer` value to `ideaseg_pinyin`. In pinyin mode, the segmentation result converts the Chinese to pinyin while also retaining the original Chinese.

For more information on word segmentation testing, please refer to the [ElasticSearch Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html).

### Feedback

If you have any questions about using `ideaseg`, please raise them via [Issues](https://gitee.com/indexea/ideaseg/issues).

### Special thanks

https://github.com/KennFalcon/elasticsearch-analysis-hanlp
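Returning to the `_analyze` test request and the two analyzer names described earlier, the request can also be issued from a script. The Python sketch below is an illustration and not part of ideaseg: the helper names are hypothetical, and the base URL you pass in (for example a local, unauthenticated node) is an assumption about your setup. It relies only on the `analyzer` values named in this README and on the `tokens` array that the `_analyze` API returns.

```python
import json
import urllib.request

# Analyzer names taken from the README: one per segmentation mode.
ANALYZERS = {"standard": "ideaseg", "pinyin": "ideaseg_pinyin"}


def analyze_body(text, mode="standard"):
    """Build the JSON body for POST _analyze in the chosen mode."""
    return {"analyzer": ANALYZERS[mode], "text": text}


def post_analyze(base_url, body):
    """Send the body to <base_url>/_analyze and return the decoded response.

    base_url is an assumption about your deployment; authentication and TLS
    settings are deliberately left out of this sketch.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens(response_json):
    """Extract the token strings from an _analyze response."""
    return [entry["token"] for entry in response_json.get("tokens", [])]


if __name__ == "__main__":
    # Only prints the request bodies; no network call is made here.
    for mode in ANALYZERS:
        print(json.dumps(analyze_body("你好，我用的是 ideaseg 分词插件。", mode), ensure_ascii=False))
```

To run it against a live node, call `tokens(post_analyze(base_url, analyze_body(text, "pinyin")))` with the address of your own service.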
