Elasticsearch Study Notes, Part 11: The Analyze API and the IK Analyzer
程序员文章站
2024-02-21 22:46:58
Analyze API
The analyze API shows how a piece of full text is analyzed. In the request body, we specify the analyzer to use and the text to analyze:
GET /_analyze
{
"analyzer": "standard",
"text": "歌唱我们亲爱的祖国从今走向走向繁荣富强 "
}
The analysis result is as follows:
{
"tokens": [
{
"token": "歌",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "唱",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "我",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "们",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "亲",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "爱",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "的",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "祖",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "国",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "从",
"start_offset": 9,
"end_offset": 10,
"type": "<IDEOGRAPHIC>",
"position": 9
},
{
"token": "今",
"start_offset": 10,
"end_offset": 11,
"type": "<IDEOGRAPHIC>",
"position": 10
},
{
"token": "走",
"start_offset": 11,
"end_offset": 12,
"type": "<IDEOGRAPHIC>",
"position": 11
},
{
"token": "向",
"start_offset": 12,
"end_offset": 13,
"type": "<IDEOGRAPHIC>",
"position": 12
},
{
"token": "走",
"start_offset": 13,
"end_offset": 14,
"type": "<IDEOGRAPHIC>",
"position": 13
},
{
"token": "向",
"start_offset": 14,
"end_offset": 15,
"type": "<IDEOGRAPHIC>",
"position": 14
},
{
"token": "繁",
"start_offset": 15,
"end_offset": 16,
"type": "<IDEOGRAPHIC>",
"position": 15
},
{
"token": "荣",
"start_offset": 16,
"end_offset": 17,
"type": "<IDEOGRAPHIC>",
"position": 16
},
{
"token": "富",
"start_offset": 17,
"end_offset": 18,
"type": "<IDEOGRAPHIC>",
"position": 17
},
{
"token": "强",
"start_offset": 18,
"end_offset": 19,
"type": "<IDEOGRAPHIC>",
"position": 18
}
]
}
This is clearly not the result we want: the standard analyzer emits every Chinese character as a separate token.
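The same request can also be issued from a script. The following is a minimal sketch using only the Python standard library; the host `http://localhost:9200` and the helper names `build_analyze_body`, `token_strings` and `analyze` are assumptions made for illustration and are not part of the original notes.

```python
import json
import urllib.request


def build_analyze_body(analyzer, text):
    """Return the JSON body for the _analyze API, encoded as bytes."""
    return json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")


def token_strings(response):
    """Extract the token texts, in order, from an _analyze response dict."""
    return [t["token"] for t in response["tokens"]]


def analyze(analyzer, text, host="http://localhost:9200"):
    """Send the request to a running node (the host is an assumed default)."""
    req = urllib.request.Request(
        host + "/_analyze",
        data=build_analyze_body(analyzer, text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against a running node, `token_strings(analyze("standard", "歌唱我们"))` should list one character per token, in line with the response shown above.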
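IK Analyzer
As the example above shows, the default standard analyzer in Elasticsearch splits Chinese words into individual characters, which is not what we want. To solve this, we need a Chinese analysis plugin for Elasticsearch: es-ik (the IK analyzer).
Component type | Supported |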
---|---|
Analyzer | ik_smart , ik_max_word |
Tokenizer | ik_smart , ik_max_word |
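Of the two, `ik_max_word` produces the finest-grained split and `ik_smart` the coarsest, so a common pattern is to index with the former and search with the latter. Below is a minimal sketch of such a mapping body, built in Python. The field name `content` and the type name `_doc` are hypothetical, and the layout assumes Elasticsearch 6.x, where mappings are nested under a type name.

```python
import json


def ik_mapping(field, doc_type="_doc"):
    """Build a 6.x-style mapping: index with ik_max_word, search with ik_smart."""
    return {
        "mappings": {
            doc_type: {  # assumed type name; 7.x and later drop this level
                "properties": {
                    field: {
                        "type": "text",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_smart",
                    }
                }
            }
        }
    }


# Print the body that would be sent when creating the index.
print(json.dumps(ik_mapping("content"), indent=2))
```

The printed JSON can be pasted as the body of an index-creation request once the plugin is installed.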
IK analyzer version support
IK version | ES version |
---|---|
master | 6.x -> master |
6.3.0 | 6.3.0 |
6.2.4 | 6.2.4 |
6.1.3 | 6.1.3 |
5.6.8 | 5.6.8 |
5.5.3 | 5.5.3 |
Installation
Download or build
Option 1
Download the version you need from the following URL:
https://github.com/medcl/elasticsearch-analysis-ik/releases
Create the installation directory:
cd your-es-root/plugins/ && mkdir ik
Unzip the downloaded zip package into your-es-root/plugins/ik
Option 2
Install with the elasticsearch-plugin tool:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
Restart Elasticsearch so that the plugin is loaded:
./bin/elasticsearch -d
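Because the IK release must match the Elasticsearch version, the install command can be derived from the version string. The sketch below assumes that other releases follow the same URL pattern as the 6.3.0 link above, which is not verified here; check the releases page before relying on it. The helper names `ik_plugin_url` and `install_command` are hypothetical.

```python
RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_plugin_url(es_version):
    """Build the release URL, assuming the pattern of the 6.3.0 link holds."""
    return "{0}/v{1}/elasticsearch-analysis-ik-{1}.zip".format(RELEASES, es_version)


def install_command(es_version):
    """Return the elasticsearch-plugin command for the matching IK release."""
    return "./bin/elasticsearch-plugin install " + ik_plugin_url(es_version)


# Print the command for the version used in these notes.
print(install_command("6.3.0"))
```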
IK analyzer in action
GET _analyze
{
"analyzer": "ik_smart",
"text": "歌唱我们亲爱的祖国从今走向走向繁荣富强"
}
The result is as follows:
{
"tokens": [
{
"token": "歌唱",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "我们",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "亲爱的",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "祖国",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 3
},
{
"token": "从今",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 4
},
{
"token": "走向",
"start_offset": 11,
"end_offset": 13,
"type": "CN_WORD",
"position": 5
},
{
"token": "走向",
"start_offset": 13,
"end_offset": 15,
"type": "CN_WORD",
"position": 6
},
{
"token": "繁荣富强",
"start_offset": 15,
"end_offset": 19,
"type": "CN_WORD",
"position": 7
}
]
}
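The `start_offset` and `end_offset` fields index into the original text, so the response can be checked against the input. The following sketch uses the `ik_smart` tokens listed above; the helper names are hypothetical, and it assumes every character lies in the Basic Multilingual Plane so that the offsets line up with Python string indices.

```python
TEXT = "歌唱我们亲爱的祖国从今走向走向繁荣富强"

# (token, start_offset, end_offset) triples copied from the ik_smart response.
IK_SMART = [
    ("歌唱", 0, 2), ("我们", 2, 4), ("亲爱的", 4, 7), ("祖国", 7, 9),
    ("从今", 9, 11), ("走向", 11, 13), ("走向", 13, 15), ("繁荣富强", 15, 19),
]


def offsets_match(text, tokens):
    """True if every token equals the slice of text its offsets point to."""
    return all(text[start:end] == tok for tok, start, end in tokens)


def covers_text(text, tokens):
    """True if the tokens tile the text with no gaps or overlaps."""
    pos = 0
    for _, start, end in sorted(tokens, key=lambda t: t[1]):
        if start != pos:
            return False
        pos = end
    return pos == len(text)


print(offsets_match(TEXT, IK_SMART), covers_text(TEXT, IK_SMART))
```

Here `ik_smart` yields 8 word-level tokens for the 19-character sentence, whereas the standard analyzer yields 19 single-character tokens.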