Elasticsearch Analyzers
程序员文章站
2022-07-04 22:12:53
Built-in analyzers
Standard
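Chinese text is split into single characters. English text is split on word boundaries (roughly, whitespace and punctuation) and is automatically lowercased.
Example request: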
GET 172.16.5.33:9200/_analyze
{
"text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
"analyzer": "standard"
}
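To make the behavior described above concrete, here is a rough Python approximation. It is not the real tokenizer (the standard analyzer follows Unicode text segmentation rules, which handle many more cases); the function name approx_standard and the regular expression are assumptions made for illustration only.

```python
import re
from typing import List

# Rough stand-in for the standard analyzer: each CJK ideograph becomes its
# own token, and runs of ASCII letters/digits become one lowercased token.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def approx_standard(text: str) -> List[str]:
    """Approximate the standard analyzer for simple Chinese/English input."""
    return [m.group(0).lower() for m in _TOKEN.finditer(text)]


print(approx_standard("虹桥路2451号, I like it."))
# ['虹', '桥', '路', '2451', '号', 'i', 'like', 'it']
```

For authoritative results, send the text to the _analyze endpoint rather than relying on this sketch.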
Whitespace
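Splits on whitespace only. Chinese text is no longer split into characters, punctuation stays attached to the neighboring token, and English keeps its original case.
Example request: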
GET 172.16.5.33:9200/_analyze
{
"text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
"analyzer": "whitespace"
}
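The whitespace behavior is close to what Python's str.split() does, so a minimal sketch is enough to predict the tokens. The function name approx_whitespace is hypothetical, and the real tokenizer may treat some unusual whitespace characters differently.

```python
from typing import List


def approx_whitespace(text: str) -> List[str]:
    """Split on runs of whitespace; case and punctuation are left untouched."""
    return text.split()


sample = "上海市长宁区虹桥路2451号格林东方酒店, I like it very much."
print(approx_whitespace(sample))
# ['上海市长宁区虹桥路2451号格林东方酒店,', 'I', 'like', 'it', 'very', 'much.']
```

Note that the comma and the final period remain part of their tokens.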
Simple
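Splits on any character that is not a letter (whitespace, digits, punctuation) and lowercases English. Runs of Chinese characters are not split into single characters, but digits such as 2451 break a run in two and are themselves discarded.
Example request: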
GET 172.16.5.33:9200/_analyze
{
"text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
"analyzer": "simple"
}
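A small Python sketch can approximate this: keep runs of letters, drop everything else, then lowercase. The name approx_simple and the regular expression are assumptions for illustration; edge cases such as unusual numeric characters may differ from the real analyzer.

```python
import re
from typing import List

# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", which in practice is a letter (CJK ideographs included).
_LETTERS = re.compile(r"[^\W\d_]+")


def approx_simple(text: str) -> List[str]:
    """Keep runs of letters and lowercase them; digits and punctuation split."""
    return [m.group(0).lower() for m in _LETTERS.finditer(text)]


sample = "上海市长宁区虹桥路2451号格林东方酒店, I like it very much."
print(approx_simple(sample))
# ['上海市长宁区虹桥路', '号格林东方酒店', 'i', 'like', 'it', 'very', 'much']
```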
Stop
Builds on the Simple analyzer and additionally removes English stop words such as "the" and "a".
Example request:
GET 172.16.5.33:9200/_analyze
{
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"analyzer": "stop"
}
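The effect of stop word removal can be sketched in Python as letter-based tokenization followed by a filter. The names approx_stop and STOP_WORDS are hypothetical, and the stop word set below is only an illustrative subset, not the full default English list.

```python
import re
from typing import List

_LETTERS = re.compile(r"[^\W\d_]+")

# Illustrative subset only; the real default English stop word list is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def approx_stop(text: str) -> List[str]:
    """Tokenize like the Simple analyzer, then drop stop words."""
    tokens = (m.group(0).lower() for m in _LETTERS.finditer(text))
    return [t for t in tokens if t not in STOP_WORDS]


print(approx_stop("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```

Both occurrences of "the" disappear, the digit 2 is dropped, and the apostrophe splits "dog's" into "dog" and "s".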
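IK Chinese analyzer
The analyzers built into ES do not support Chinese well, so the third-party IK Chinese analyzer is used to search Chinese text.
Installation
Go to the Elasticsearch installation directory and run the following command: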
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip
Replace 7.10.1 with the version of your installed Elasticsearch.
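If the installation is scripted, the version substitution can be automated. The sketch below simply fills the version into the URL pattern taken from the command above; the helper name ik_install_command is hypothetical, and it does not check that a release actually exists for the given version.

```python
# URL pattern copied from the install command above, with the version
# number turned into a placeholder.
IK_RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_install_command(es_version: str) -> str:
    """Build the plugin install command for a given Elasticsearch version."""
    url = IK_RELEASE_URL.format(version=es_version)
    return "bin/elasticsearch-plugin install " + url


print(ik_install_command("7.10.1"))
```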
Usage
The IK plugin provides two analyzers that split text at different granularities:
ik_smart
ik_smart splits the text at the coarsest granularity, which suits Phrase queries. Example:
GET 172.16.5.33:9200/_analyze
{
"text": "上海市长宁区虹桥路2451号格林东方酒店。",
"analyzer": "ik_smart"
}
Response:
{
"tokens": [
{
"token": "上海市",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "长宁区",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": "虹桥路",
"start_offset": 6,
"end_offset": 9,
"type": "CN_WORD",
"position": 2
},
{
"token": "2451号",
"start_offset": 9,
"end_offset": 14,
"type": "TYPE_CQUAN",
"position": 3
},
{
"token": "格林",
"start_offset": 14,
"end_offset": 16,
"type": "CN_WORD",
"position": 4
},
{
"token": "东方",
"start_offset": 16,
"end_offset": 18,
"type": "CN_WORD",
"position": 5
},
{
"token": "酒店",
"start_offset": 18,
"end_offset": 20,
"type": "CN_WORD",
"position": 6
}
]
}
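The same request can be prepared from Python using only the standard library. This is a minimal sketch: the helper names build_analyze_request and token_strings are hypothetical, the host is the one used in the examples above, and POST is used because the _analyze endpoint accepts it and urllib sends request bodies most naturally that way.

```python
import json
import urllib.request
from typing import List


def build_analyze_request(base_url: str, analyzer: str, text: str) -> urllib.request.Request:
    """Prepare (but do not send) an _analyze request with a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response: dict) -> List[str]:
    """Pull just the token text out of an _analyze response."""
    return [t["token"] for t in response["tokens"]]


req = build_analyze_request("http://172.16.5.33:9200", "ik_smart", "上海市长宁区虹桥路2451号格林东方酒店。")

# First two entries of the ik_smart response shown above.
sample_response = {"tokens": [
    {"token": "上海市", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0},
    {"token": "长宁区", "start_offset": 3, "end_offset": 6, "type": "CN_WORD", "position": 1},
]}
print(token_strings(sample_response))  # ['上海市', '长宁区']
```

To actually send it, pass req to urllib.request.urlopen and decode the body with json.load; that step needs a reachable cluster with the IK plugin installed.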
ik_max_word
ik_max_word splits the text at the finest granularity, exhausting the possible word combinations, which suits Term Query. Example:
GET 172.16.5.33:9200/_analyze
{
"text": "上海市长宁区虹桥路2451号格林东方酒店。",
"analyzer": "ik_max_word"
}
Response:
{
"tokens": [
{
"token": "上海市",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "上海",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "海市",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "市长",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "长宁区",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
},
{
"token": "长宁",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 5
},
{
"token": "区",
"start_offset": 5,
"end_offset": 6,
"type": "CN_CHAR",
"position": 6
},
{
"token": "虹桥路",
"start_offset": 6,
"end_offset": 9,
"type": "CN_WORD",
"position": 7
},
{
"token": "虹桥",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 8
},
{
"token": "路",
"start_offset": 8,
"end_offset": 9,
"type": "CN_CHAR",
"position": 9
},
{
"token": "2451",
"start_offset": 9,
"end_offset": 13,
"type": "ARABIC",
"position": 10
},
{
"token": "号",
"start_offset": 13,
"end_offset": 14,
"type": "COUNT",
"position": 11
},
{
"token": "格林",
"start_offset": 14,
"end_offset": 16,
"type": "CN_WORD",
"position": 12
},
{
"token": "林东",
"start_offset": 15,
"end_offset": 17,
"type": "CN_WORD",
"position": 13
},
{
"token": "东方",
"start_offset": 16,
"end_offset": 18,
"type": "CN_WORD",
"position": 14
},
{
"token": "酒店",
"start_offset": 18,
"end_offset": 20,
"type": "CN_WORD",
"position": 15
}
]
}
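One way to see what "finest granularity" means is to look for tokens whose character ranges overlap: ik_max_word emits many of them, while the ik_smart output above has none. The sketch below uses the first five tokens of the response above; the function name overlapping_pairs is hypothetical.

```python
from typing import Dict, List, Tuple


def overlapping_pairs(tokens: List[Dict]) -> List[Tuple[str, str]]:
    """Return pairs of tokens whose [start_offset, end_offset) ranges intersect."""
    pairs = []
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            if a["start_offset"] < b["end_offset"] and b["start_offset"] < a["end_offset"]:
                pairs.append((a["token"], b["token"]))
    return pairs


# First five tokens of the ik_max_word response above.
max_word_head = [
    {"token": "上海市", "start_offset": 0, "end_offset": 3},
    {"token": "上海", "start_offset": 0, "end_offset": 2},
    {"token": "海市", "start_offset": 1, "end_offset": 3},
    {"token": "市长", "start_offset": 2, "end_offset": 4},
    {"token": "长宁区", "start_offset": 3, "end_offset": 6},
]

for pair in overlapping_pairs(max_word_head):
    print(pair)
```

Running the same function on the ik_smart tokens returns an empty list, because each of those tokens starts exactly where the previous one ends.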