elasticsearch-analysis-ik-7.9.3.tar.gz (resource, CSDN Library)

Uploaded: 2024-04-11 00:37:55, size: 3.12 MB, format: GZ

47 files in total, including 25 `.java` files, 11 `.dic` files, and 3 `.xml` files.

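The Elasticsearch analysis plugin IK is a widely used Chinese word-segmentation tool, and it is often paired with Spring Data Elasticsearch for full-text search. This description covers the IK analyzer and how to use it in a Linux environment.

Elasticsearch is a distributed full-text search engine widely used for data analysis, log collection, and full-text retrieval. It supports many analyzers, including the standard analyzer, the keyword analyzer, and the IK analyzer discussed here. IK is maintained by the open-source community and aims to segment Chinese text more accurately, especially complex word combinations and phrases.

Spring Data Elasticsearch is a Spring module that integrates with Elasticsearch and simplifies working with it from Java applications. It covers index creation, document create/read/update/delete operations, and query building, and it lets you use the IK analyzer for full-text search.

The downloaded `elasticsearch-analysis-ik-7.9.3.tar.gz` is the IK analyzer packaged as a tar.gz archive for Elasticsearch 7.9.3. Make sure Elasticsearch itself is installed correctly first, then deploy IK as follows:

1. Extract the downloaded tar.gz file: `tar -zxvf elasticsearch-analysis-ik-7.9.3.tar.gz`
2. Enter the extracted directory: `cd elasticsearch-analysis-ik-7.9.3`
3. Place the plugin under the `plugins` directory of Elasticsearch in a folder named `ik`. Note that the archive, as shown in the file listing below, is the plugin's source tree (`pom.xml`, `src/`, `config/`) rather than a prebuilt `ik` directory, so build it first with `mvn package` and then unzip the result, for example: `unzip target/releases/elasticsearch-analysis-ik-*.zip -d /path/to/your/elasticsearch/plugins/ik`

After these steps, restart the Elasticsearch service so that the plugin is loaded. You can check the IK analyzer through the Elasticsearch `_analyze` API by sending the following request:

```json
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "这是一个测试文本"
}
```

This uses the `ik_max_word` mode of IK, which splits the text into as many words as possible. The response lists the resulting tokens.

To configure the IK analyzer in Spring Data Elasticsearch, specify the analyzer in the `@Field` annotation, for example:

```java
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

@Document(indexName = "my_index")
public class MyDocument {
    @Id
    private String id;

    // Analyze this text field with IK's fine-grained mode
    @Field(type = FieldType.Text, analyzer = "ik_max_word")
    private String content;

    // ...other fields and methods
}
```

When you save or search `MyDocument` documents through Spring Data, Elasticsearch analyzes the `content` field with the IK analyzer.

IK is a key component for Chinese full-text search in Elasticsearch because it segments Chinese text more accurately than the default analyzers. Combined with Spring Data Elasticsearch, it makes Chinese full-text search straightforward to add to a Java application, and the steps above are enough to install and use it on Linux.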
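The `_analyze` check described above can also be scripted from Java. The following is a minimal sketch that uses only the JDK's `java.net.http` client (Java 11 or later). The class name `IkAnalyzeSketch`, its helper methods, and the unsecured `http://localhost:9200` address are assumptions made for illustration; they are not part of the IK plugin or of Spring Data Elasticsearch.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class; not part of IK or Spring Data.
public class IkAnalyzeSketch {

    // Minimal JSON string escaping: handles backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the request body expected by the _analyze API.
    static String analyzeBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    // POSTs the body to <baseUrl>/_analyze and returns the raw JSON response.
    static String analyze(String baseUrl, String analyzer, String text)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(analyzeBody(analyzer, text)))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            // Assumes an unsecured local node with the IK plugin installed.
            System.out.println(analyze("http://localhost:9200", "ik_max_word", "这是一个测试文本"));
        } catch (IOException e) {
            System.err.println("Could not reach Elasticsearch: " + e.getMessage());
        }
    }
}
```

Swapping `ik_max_word` for `ik_smart` in the call is a quick way to compare the two modes on the same text.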


elasticsearch-analysis-ik-7.9.3.tar.gz (47 files)

elasticsearch-analysis-ik-7.9.3
  .travis.yml 187B
  pom.xml 11KB
  LICENSE.txt 11KB
  src
    main
      resources
        plugin-descriptor.properties 2KB
        plugin-security.policy 125B
      assemblies
        plugin.xml 1KB
      java
        org
          wltea
            analyzer
              core
                CharacterUtil.java 3KB
                CN_QuantifierSegmenter.java 7KB
                IKSegmenter.java 4KB
                IKArbitrator.java 5KB
                LetterSegmenter.java 9KB
                QuickSortSet.java 6KB
                LexemePath.java 6KB
                ISegmenter.java 1KB
                CJKSegmenter.java 4KB
                AnalyzeContext.java 12KB
                Lexeme.java 6KB
              dic
                Hit.java 3KB
                Monitor.java 3KB
                DictSegment.java 9KB
                Dictionary.java 18KB
              lucene
                IKTokenizer.java 4KB
                IKAnalyzer.java 2KB
              cfg
                Configuration.java 2KB
              help
                PrefixPluginLogger.java 2KB
                CharacterHelper.java 2KB
                Sleep.java 1019B
                ESPluginLoggerFactory.java 875B
          elasticsearch
            index
              analysis
                IkAnalyzerProvider.java 1KB
                IkTokenizerFactory.java 1KB
            plugin
              analysis
                AnalysisIkPlugin.java 1KB
  .gitignore 81B
  README.md 8KB
  licenses
    lucene-LICENSE.txt 24KB
    lucene-NOTICE.txt 9KB
  config
    main.dic 2.92MB
    stopword.dic 164B
    IKAnalyzer.cfg.xml 625B
    extra_single_word.dic 62KB
    quantifier.dic 2KB
    suffix.dic 192B
    extra_main.dic 4.98MB
    extra_single_word_full.dic 62KB
    extra_single_word_low_freq.dic 11KB
    surname.dic 752B
    extra_stopword.dic 156B
    preposition.dic 123B

IK Analysis for Elasticsearch
=============================

The IK Analysis plugin integrates the Lucene IK analyzer (http://code.google.com/p/ik-analyzer/) into Elasticsearch and supports customized dictionaries.

Analyzer: `ik_smart`, `ik_max_word`

Tokenizer: `ik_smart`, `ik_max_word`

Versions
--------

IK version | ES version
-----------|-----------
master | 7.x -> master
6.x | 6.x
5.x | 5.x
1.10.6 | 2.4.6
1.9.5 | 2.3.5
1.8.1 | 2.2.1
1.7.0 | 2.1.1
1.5.0 | 2.0.0
1.2.6 | 1.0.0
1.2.5 | 0.90.x
1.1.3 | 0.20.x
1.0.0 | 0.16.2 -> 0.19.0

Install
-------

1. Download or compile

* Option 1: download a pre-built package from https://github.com/medcl/elasticsearch-analysis-ik/releases

    create the plugin folder: `cd your-es-root/plugins/ && mkdir ik`

    unzip the plugin into the folder `your-es-root/plugins/ik`

* Option 2: use elasticsearch-plugin to install (supported from version v5.5.1):

    ```
    ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
    ```

   NOTE: replace `6.3.0` with your own Elasticsearch version

2. Restart Elasticsearch

#### Quick Example

1. Create an index

```bash
curl -XPUT http://localhost:9200/index
```

2. Create a mapping

```bash
curl -XPOST http://localhost:9200/index/_mapping -H 'Content-Type:application/json' -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
}'
```

3. Index some docs

```bash
curl -XPOST http://localhost:9200/index/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'
```

```bash
curl -XPOST http://localhost:9200/index/_create/2 -H 'Content-Type:application/json' -d'
{"content":"公安部：各地校车将享最高路权"}
'
```

```bash
curl -XPOST http://localhost:9200/index/_create/3 -H 'Content-Type:application/json' -d'
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}
'
```

```bash
curl -XPOST http://localhost:9200/index/_create/4 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首"}
'
```

4. Query with highlighting

```bash
curl -XPOST http://localhost:9200/index/_search -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'
```

Result

```json
{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 2,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首 "
                    ]
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "3",
                "_score": 2,
                "_source": {
                    "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
                },
                "highlight": {
                    "content": [
                        "均每天扣1艘<tag1>中国</tag1>渔船 "
                    ]
                }
            }
        ]
    }
}
```

### Dictionary Configuration

`IKAnalyzer.cfg.xml` can be located at `{conf}/analysis-ik/config/IKAnalyzer.cfg.xml` or `{plugins}/elasticsearch-analysis-ik-*/config/IKAnalyzer.cfg.xml`

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<entry key="remote_ext_dict">location</entry>
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
```

### Hot-updating the IK dictionary

The plugin currently supports hot updates of the IK dictionary through the following entries in the IK configuration file mentioned above:

```xml
	<entry key="remote_ext_dict">location</entry>
	<entry key="remote_ext_stopwords">location</entry>
```

Here `location` is a URL, for example `http://yoursite.com/getCustomDict`. The endpoint only has to satisfy the following two requirements for hot updates to work.

1. The HTTP response must include two headers, `Last-Modified` and `ETag`. Both are strings; whenever either of them changes, the plugin fetches the new word list and updates its dictionary.

2. The response body contains one term per line, with `\n` as the line separator.

Once both requirements are met, the dictionary is hot-updated without restarting the ES instance.

You can put the hot words that should be updated automatically into a UTF-8 encoded .txt file and serve it from nginx or another simple HTTP server. When the .txt file is modified, the HTTP server automatically returns the corresponding Last-Modified and ETag when a client requests the file. You can also build a separate tool that extracts relevant terms from your business systems and updates this .txt file.

have fun.

FAQ
-------

1. Why does my custom dictionary not take effect?

Make sure the text of your extension dictionary is UTF-8 encoded.

2. How do I install the plugin manually?

```bash
git clone https://github.com/medcl/elasticsearch-analysis-ik
cd elasticsearch-analysis-ik
git checkout tags/{version}
mvn clean
mvn compile
mvn package
```

Copy the release file `#{project_path}/elasticsearch-analysis-ik/target/releases/elasticsearch-analysis-ik-*.zip` and unzip it into your Elasticsearch plugin directory, e.g. `plugins/ik`, then restart Elasticsearch.

3. The tokenization test fails

Call the analyze API under a specific index rather than calling it directly, for example:

```bash
curl -XGET "http://localhost:9200/your_index/_analyze" -H 'Content-Type: application/json' -d'
{
   "text":"中华人民共和国MN","tokenizer": "my_ik"
}'
```

4. What is the difference between ik_max_word and ik_smart?

ik_max_word: splits the text at the finest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination. It is suited to Term Query.

ik_smart: splits the text at the coarsest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国,国歌”. It is suited to Phrase queries.

Changes
------

*Since v5.0.0*

- Removed the analyzer and tokenizer named `ik`; use `ik_smart` and `ik_max_word` instead

Thanks
------

YourKit supports IK Analysis for ElasticSearch project with its full-featured Java Profiler.
YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products:
<a href="http://www.yourkit.com/java/profiler/index.jsp">YourKit Java Profiler</a> and
<a href="http://www.yourkit.com/.net/profiler/index.jsp">YourKit .NET Profiler</a>.
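
The hot-update contract described in the README (a URL that returns `Last-Modified` and `ETag` headers plus a body with one term per `\n`-terminated line) can be illustrated with a small endpoint. This is only a sketch under stated assumptions: it uses the JDK's built-in `com.sun.net.httpserver` server and `java.net.http` client (Java 11 or later), the class `HotDictServerSketch` and its helpers are hypothetical names, the `/getCustomDict` path is borrowed from the README's example URL, and the hash-based ETag is one simple choice among many. It is not how the plugin itself or any particular production setup works.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of the IK plugin.
public class HotDictServerSketch {

    // HTTP-date format used for the Last-Modified header.
    static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
            .withZone(ZoneOffset.UTC);

    // One term per line, each terminated by "\n".
    static String body(List<String> terms) {
        return String.join("\n", terms) + "\n";
    }

    // Content-derived ETag: changes when the word list changes (barring hash collisions).
    static String etag(String body) {
        return "\"" + Integer.toHexString(body.hashCode()) + "\"";
    }

    // Serves the word list stored in `dict` at /getCustomDict on the given port.
    static HttpServer start(Path dict, int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/getCustomDict", exchange -> {
            try {
                String text = body(Files.readAllLines(dict, StandardCharsets.UTF_8));
                byte[] payload = text.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
                exchange.getResponseHeaders().set("ETag", etag(text));
                exchange.getResponseHeaders().set("Last-Modified",
                        HTTP_DATE.format(Files.getLastModifiedTime(dict).toInstant()));
                if ("HEAD".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(200, -1); // headers only, no body
                } else {
                    exchange.sendResponseHeaders(200, payload.length);
                    exchange.getResponseBody().write(payload);
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        // Demo: write a temporary word list, serve it, fetch it once, then shut down.
        Path dict = Files.createTempFile("hot_words", ".txt");
        Files.write(dict, List.of("中国", "渔船")); // written as UTF-8
        HttpServer server = start(dict, 0);        // port 0 = any free port
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/getCustomDict");
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(uri).build(), HttpResponse.BodyHandlers.ofString());
            System.out.println("ETag: " + response.headers().firstValue("ETag").orElse("(missing)"));
            System.out.println("Last-Modified: " + response.headers().firstValue("Last-Modified").orElse("(missing)"));
            System.out.print(response.body());
        } finally {
            server.stop(0);
            Files.delete(dict);
        }
    }
}
```

For a long-running setup you would call `start(...)` with a fixed port and a real dictionary file, then point `remote_ext_dict` at that URL. As the README notes, a static file served by nginx provides the same two headers without any custom code.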
