ideaseg is a Chinese tokenizer released by Indexea and built on the latest [hanlp](https://github.com/hankcs/hanlp/tree/1.x) natural language processing toolkit. It includes the latest model data and removes the [neuralnetworkparser](https://github.com/hankcs/hanlp/issues/644) code and data, which hanlp ships under a license that is not commercial-friendly.
Compared with other tokenizers such as ik and jcseg, hanlp greatly improves tokenization accuracy at the cost of speed. By optimizing and configuring hanlp, ideaseg achieves a good balance between accuracy and tokenization speed.
Compared with other hanlp-based plugins, ideaseg tracks the latest hanlp code and data, removes content that cannot be used commercially, configures itself automatically, and bundles the model data so that nothing has to be downloaded separately.
ideaseg provides three modules:
1. `core`: the core tokenizer module
2. `elasticsearch`: the ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
3. `opensearch`: the ideaseg tokenizer plugin for opensearch (default version 2.4.1)
**Note on elasticsearch versions: since version 7.11.1, elastic has changed the license of elasticsearch and the permission policy for plugins, which are no longer allowed to read and write files. Because the hanlp model data is very large, hanlp speeds up loading by generating cache files in the plugin's data directory. If you use elasticsearch, please use version 7.10.2 or lower; opensearch is the recommended choice.**
In addition, the `data` folder contains the hanlp model data.
Because the model data is large (400-500 MB after packaging), and because elasticsearch plugins are strictly bound to the engine version, of which there are many, this project does not provide pre-built binaries. You need to download the source code and build it yourself.
### Building
The following describes how to build the plugin. Before starting, install git, java, maven, and any related tools.
First, determine the exact version of your elasticsearch or opensearch. Assuming you are using elasticsearch 7.10.2,
open the `ideaseg/elasticsearch/pom.xml` file in a text editor and set the value of `elasticsearch.version` to `7.10.2`
(for opensearch, edit `opensearch/pom.xml` instead).
Save the file, open a command-line window, and run the following commands to start the build:
```shell
$ git clone https://gitee.com/indexea/ideaseg
$ cd ideaseg
$ mvn install
```
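If you prefer not to edit `pom.xml` by hand, the version change described above can be scripted after cloning. The following is a minimal sketch: `set_pom_property` is a hypothetical helper, not part of ideaseg, and it assumes the property is declared on a single line as `<elasticsearch.version>…</elasticsearch.version>`.

```shell
# Hypothetical helper (not part of ideaseg): rewrite one property value in a pom.xml.
# Assumption: the property appears on a single line as <name>value</name>.
# Usage: set_pom_property <pom file> <property name> <new value>
set_pom_property() {
  pom="$1"
  name="$2"
  value="$3"
  # Write to a temporary file first so the original is replaced only on success.
  sed "s|<${name}>[^<]*</${name}>|<${name}>${value}</${name}>|" "$pom" > "$pom.tmp" \
    && mv "$pom.tmp" "$pom"
}
```

For example, `set_pom_property elasticsearch/pom.xml elasticsearch.version 7.10.2`, run from the `ideaseg` directory before `mvn install`. Check the resulting file before building, since the actual layout of the property in the pom may differ from the assumption.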
After the build completes, a plugin file named `ideaseg.zip` is generated in each of `elasticsearch/target` and `opensearch/target`.
### Installation
Once the build is complete, install the plugin with the plugin management tool that ships with elasticsearch or opensearch.
For elasticsearch, the tool is `<elasticsearch>/bin/elasticsearch-plugin`;
for opensearch, it is `<opensearch>/bin/opensearch-plugin`.
Here `<elasticsearch>` and `<opensearch>` are the installation directories of the two services.
#### Install ideaseg plugin for elasticsearch
```shell
$ bin/elasticsearch-plugin install file:///<ideaseg>/elasticsearch/target/ideaseg.zip
```
#### Install ideaseg plugin for opensearch
```shell
$ bin/opensearch-plugin install file:///<ideaseg>/opensearch/target/ideaseg.zip
```
Here `<ideaseg>` is the path to the `ideaseg` source code. Note that the path must be prefixed with `file://`. On Windows, use the `file:///` prefix, for example `file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip`.
During installation, the plugin tool asks you to confirm the permissions the plugin requires. Confirm the prompt to complete the installation, then restart the service.
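The `file://` prefix rule above is easy to get wrong, so it can help to build the URL with a small function. This is only a sketch: `plugin_url` is a hypothetical helper, and treating any path that starts with a drive letter as a Windows path is an assumption of this example.

```shell
# Hypothetical helper: turn a path to ideaseg.zip into the URL form the install command expects.
plugin_url() {
  case "$1" in
    /*)         printf 'file://%s\n' "$1" ;;           # absolute POSIX path
    [A-Za-z]:*) printf 'file:///%s\n' "$1" ;;          # Windows drive path (assumed form)
    *)          printf 'file://%s/%s\n' "$PWD" "$1" ;; # relative path, resolved against $PWD
  esac
}
```

It could then be used as `bin/elasticsearch-plugin install "$(plugin_url /path/to/ideaseg/elasticsearch/target/ideaseg.zip)"`, where the path is a placeholder for your own checkout.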
Next, you can test the plugin with the `_analyze` API as follows:
```
POST _analyze
{
  "analyzer": "ideaseg",
  "text": "你好,我用的是 ideaseg 分词插件。"
}
```
`ideaseg` provides two tokenization modes, `standard` and `pinyin`. The default is `standard` mode, for which the `analyzer` value is `ideaseg`. To use `pinyin` mode, change the `analyzer` value to `ideaseg_pinyin`.
In pinyin mode, the tokenization result converts the Chinese to pinyin while retaining the original Chinese.
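To compare the two modes from the command line, the same `_analyze` request can be sent with `curl`. The sketch below assumes `curl` is installed; `analyze_body` and `analyze` are hypothetical helper names, and the base URL you pass in depends on your own setup.

```shell
# Build the JSON body for an _analyze request.
# Assumption: the text contains no double quotes or backslashes (no JSON escaping is done).
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}\n' "$1" "$2"
}

# Send the request to a running node.
# Usage: analyze <base url> <analyzer> <text>
analyze() {
  curl -s -X POST "$1/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$2" "$3")"
}
```

For instance, `analyze http://localhost:9200 ideaseg_pinyin '你好,我用的是 ideaseg 分词插件。'` sends the pinyin-mode request, assuming a node is listening on `localhost:9200` without authentication. Swap in `ideaseg` to see the standard-mode result.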
For more information on testing analyzers, see the [elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/test-analyzer.html).
### Feedback
If you have any questions about using `ideaseg`, please raise them via [Issues](https://gitee.com/indexea/ideaseg/issues).
### Special thanks
https://github.com/KennFalcon/elasticsearch-analysis-hanlp