Elasticsearch Tutorial: Installing the IK Analyzer Plugin
Introduction
IK Analyzer is an open-source, lightweight Chinese word-segmentation toolkit developed in Java. It began as a Chinese analysis component built around the open-source Lucene project, combining dictionary-based segmentation with grammatical analysis. Since version 3.0, IK has evolved into a general-purpose Java segmentation component independent of Lucene, while still providing an optimized default implementation for Lucene. IK also implements a simple disambiguation algorithm, marking its evolution from pure dictionary-based segmentation toward semantics-aware segmentation.
Prerequisites
1. The environment builds on the previous two posts in this series. The IK plugin version must match your Elasticsearch version exactly, or the plugin will fail with an error.
2. Maven must be installed.
Download
git clone https://github.com/medcl/elasticsearch-analysis-ik.git
There are two ways to get the source: clone it directly with the git command above, or download the archive on Windows, upload it to the server, and extract it there.
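Because the plugin version must match Elasticsearch exactly (see the note above), it is worth checking out the release tag for your ES version before building. A minimal sketch, assuming the repository's usual vX.Y.Z tag naming:
cd elasticsearch-analysis-ik
# check out the tag matching the Elasticsearch version used in this tutorial
git checkout v6.4.0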
Build and Package
Once the source is downloaded, package the project by running the following command (Maven must already be installed; that setup is not covered here, as many online posts describe it):
mvn package
When the build finishes, change into target/releases under the project directory and locate the zip package; in my case it is elasticsearch-analysis-ik-6.4.0.zip.
Copy the zip file into /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik (the ik directory is one you create yourself) and unzip it there. A sketch of those steps, using the paths above:
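# run from the project root where mvn package was executed
mkdir -p /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik
cp target/releases/elasticsearch-analysis-ik-6.4.0.zip /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik/
cd /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik
Then unzip the archive: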
unzip elasticsearch-analysis-ik-6.4.0.zip
Restart Elasticsearch
systemctl restart elasticsearch.service
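To confirm that the plugin was picked up, you can list installed plugins through the _cat API (assuming Elasticsearch listens on localhost:9200):
curl http://localhost:9200/_cat/plugins
# expected output resembles: <node-name> analysis-ik 6.4.0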
Testing the IK Analyzer
IK provides two analyzers:
ik_max_word: performs the finest-grained segmentation, splitting out as many terms as possible.
ik_smart: performs the coarsest-grained segmentation; characters already consumed by one term are not reused by another.
For more details, see the official test cases in the project's README.
Note: newer versions require the request format to be set in the Content-Type header; otherwise the request fails with:
"error" : "Content-Type header [application/x-www-form-urlencoded] is not supported"
Also, newer versions no longer support the string field type; use text instead. Declaring a string field produces the error below:
org.elasticsearch.index.mapper.MapperParsingException: No handler for type [string] declared on field [content]
at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:274) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:199) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.mapper.RootObjectMapper$TypeParser.parse(RootObjectMapper.java:131) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:112) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:92) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.mapper.MapperService.parse(MapperService.java:626) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:263) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:229) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0.jar:6.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
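For reference, here is a minimal sketch of creating an index whose content field is declared as text with the IK analyzers; the index name index and type fulltext are placeholders that match the highlighting example later (this follows the mapping shown in the plugin's README):
curl -XPUT http://localhost:9200/index
curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type: application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'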
A correct ik_smart request looks like this (paste it directly into Xshell and press Enter):
curl -H "Content-Type: application/json" 'http://XXX.xx.xx.xx:9200/index/_analyze?pretty=true' -d '
{
"analyzer": "ik_smart",
"text": "*万岁万岁万万岁"
}'
Response:
{
"tokens" : [
{
"token" : "*",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "万岁",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "万岁",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "万万岁",
"start_offset" : 11,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 3
}
]
}
A correct ik_max_word request looks like this:
curl -H "Content-Type: application/json" 'http://XXX.XXX.xxx:9200/index/_analyze?pretty=true' -d '
{
"analyzer": "ik_max_word",
"text": "*万岁万岁万万岁"
}'
Response:
{
"tokens" : [
{
"token" : "*",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "人民*",
"start_offset" : 2,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "*",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "国",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "万岁",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "万",
"start_offset" : 7,
"end_offset" : 8,
"type" : "TYPE_CNUM",
"position" : 10
},
{
"token" : "岁",
"start_offset" : 8,
"end_offset" : 9,
"type" : "COUNT",
"position" : 11
},
{
"token" : "万岁",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "万",
"start_offset" : 9,
"end_offset" : 10,
"type" : "TYPE_CNUM",
"position" : 13
},
{
"token" : "岁",
"start_offset" : 10,
"end_offset" : 11,
"type" : "COUNT",
"position" : 14
},
{
"token" : "万万岁",
"start_offset" : 11,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 15
},
{
"token" : "万万",
"start_offset" : 11,
"end_offset" : 13,
"type" : "TYPE_CNUM",
"position" : 16
},
{
"token" : "万岁",
"start_offset" : 12,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 17
},
{
"token" : "岁",
"start_offset" : 13,
"end_offset" : 14,
"type" : "COUNT",
"position" : 18
}
]
}
Result Highlighting
There are also examples in the official GitHub repository; see the link above for details.
curl -XPOST http://localhost:9200/index/fulltext/1 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'
curl -XPOST http://localhost:9200/index/fulltext/_search -H 'Content-Type:application/json' -d'
{
"query" : { "match" : { "content" : "中国" }},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : {
"content" : {}
}
}
}'
Response:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5753642,
"hits": [{
"_index": "index",
"_type": "fulltext",
"_id": "1",
"_score": 0.5753642,
"_source": {
"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
},
"highlight": {
"content": ["<tag1>中</tag1><tag1>国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"]
}
}]
}
}
Extension Configuration File
IKAnalyzer.cfg.xml can be located at {conf}/analysis-ik/config/IKAnalyzer.cfg.xml or {plugins}/elasticsearch-analysis-ik-*/config/IKAnalyzer.cfg.xml. In other words, the file may live at either location (in my case it ships in /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik/config, together with the bundled extension dictionaries). Its contents are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- configure your own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<!-- configure your own extension stopword dictionaries here -->
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!-- configure remote extension dictionaries here -->
<entry key="remote_ext_dict">location</entry>
<!-- configure remote extension stopword dictionaries here -->
<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
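A local extension dictionary is simply a UTF-8 text file with one term per line. A minimal sketch, assuming the custom/mydict.dic path configured above (local dictionaries are loaded at startup, so a restart is needed):
cd /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik/config
mkdir -p custom
# one term per line, UTF-8 encoded; these terms are placeholders
printf '万万岁\n自定义新词\n' > custom/mydict.dic
systemctl restart elasticsearch.service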
Hot Updates for the IK Dictionary
The plugin currently supports hot updates of the IK dictionary, configured through the following entries in the IK configuration file shown above:
<!-- configure remote extension dictionaries here -->
<entry key="remote_ext_dict">location</entry>
<!-- configure remote extension stopword dictionaries here -->
<entry key="remote_ext_stopwords">location</entry>
Here location is a URL, for example http://yoursite.com/getCustomDict. The request only needs to satisfy two requirements for hot dictionary updates to work:
- The HTTP response must include two headers: Last-Modified and ETag. Both are strings; whenever either one changes, the plugin fetches the content again and updates its lexicon.
- The HTTP response body must contain one term per line, using \n as the line separator.
Once these two requirements are met, hot dictionary updates work without restarting the ES instance.
You can put the hot words to be updated automatically into a UTF-8 encoded .txt file served by nginx or another simple HTTP server; when the .txt file changes, the HTTP server automatically returns up-to-date Last-Modified and ETag headers when clients request the file. You can also build a separate tool that extracts relevant terms from your business systems and updates this .txt file.
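To confirm that a dictionary endpoint satisfies the header requirement, you can inspect its response headers with curl (the URL is the placeholder from above):
curl -I http://yoursite.com/getCustomDict
# the response headers should include Last-Modified and ETag, e.g.
# Last-Modified: <timestamp>
# ETag: "<value>"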