Elasticsearch Analyzers

程序员文章站 2022-07-04 22:12:53

Built-in analyzers

Standard

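Chinese text is split into individual characters; English is split on word boundaries (whitespace and most punctuation) and automatically lowercased.
Example request: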

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "standard"
}
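
As a rough illustration of the behavior described above, the following Python sketch imitates the Standard analyzer on mixed Chinese and English text. It is only an approximation under simplifying assumptions: a regular expression stands in for the Unicode text segmentation rules used by the real tokenizer, and only the range U+4E00 to U+9FFF is treated as Chinese. The name approx_standard is hypothetical and not part of Elasticsearch.

```python
import re

# Each CJK ideograph becomes its own token; runs of ASCII letters/digits form one token.
# Assumption: this regex is a stand-in for the real Unicode segmentation rules.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def approx_standard(text):
    """Rough imitation of the Standard analyzer: tokenize, then lowercase."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


print(approx_standard("虹桥路2451号, I like it"))
```

For authoritative results, send the text to the _analyze endpoint of a real node rather than relying on this sketch.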

Whitespace

Splits on whitespace only; Chinese text is not segmented any further, and English keeps its original case.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "whitespace"
}
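
The whitespace behavior is simple enough to imitate in one line of Python. This is a minimal sketch, assuming that Python's notion of whitespace is close enough to the tokenizer's for ordinary text; approx_whitespace is a hypothetical name.

```python
def approx_whitespace(text):
    """Split on runs of whitespace only; case and punctuation are left untouched."""
    return text.split()


tokens = approx_whitespace("上海市长宁区虹桥路2451号格林东方酒店, I like it very much.")
print(tokens)
```

Note that punctuation stays attached to its neighbor: the Chinese address keeps its trailing comma and "much." keeps its period.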

Simple

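Splits on any non-letter character (whitespace, digits, punctuation), so a contiguous run of Chinese characters stays a single token and digits are discarded; English is lowercased.
Example request: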

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "simple"
}
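
The following Python sketch approximates the Simple analyzer: it keeps runs of letters, drops everything else, and lowercases the result. It assumes that Python's Unicode letter matching is close enough to the analyzer's for this example; approx_simple is a hypothetical name.

```python
import re

# [^\W\d_]+ matches runs of Unicode letters (word characters minus digits and underscore).
_LETTERS = re.compile(r"[^\W\d_]+")


def approx_simple(text):
    """Rough imitation of the Simple analyzer: letter runs only, lowercased."""
    return [tok.lower() for tok in _LETTERS.findall(text)]


print(approx_simple("上海市长宁区虹桥路2451号格林东方酒店, I like it very much."))
```

In this sketch the digits 2451 act as a separator, so the address is cut into the part before and the part after the number.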

Stop

Builds on the Simple analyzer and additionally removes stop words such as "the" and "a".
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    "analyzer": "stop"
}
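
To make the stop-word step concrete, here is a minimal Python sketch that applies the Simple-style tokenization and then filters out stop words. The stop-word set below is a small illustrative subset chosen for this example, not the analyzer's full default list, and approx_stop is a hypothetical name.

```python
import re

# Runs of Unicode letters, as in the Simple analyzer sketch.
_LETTERS = re.compile(r"[^\W\d_]+")

# Assumption: an illustrative subset of English stop words, not the full default list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def approx_stop(text):
    """Lowercased letter runs with stop words removed."""
    tokens = (tok.lower() for tok in _LETTERS.findall(text))
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(approx_stop("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

Both occurrences of "the" are dropped, while the digit 2, the hyphen, and the apostrophe only act as separators.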

IK Chinese analyzer

Elasticsearch's built-in analyzers do not handle Chinese well, so the third-party IK Chinese analyzer is used to search Chinese text.

Installation

Go to the Elasticsearch installation directory and run the following command:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip

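Replace 7.10.1 with the version of Elasticsearch you have installed; the plugin version must match it. Restart Elasticsearch afterwards so that the plugin is loaded.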
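
Since the version appears twice in the download URL, it is easy to update one occurrence and forget the other. The small Python helper below builds the command from a single version string. It assumes that other releases follow the same naming pattern as the v7.10.1 URL above, which is not guaranteed for every version; ik_install_command is a hypothetical name.

```python
# Assumption: release assets follow the same naming pattern as the v7.10.1 URL.
IK_URL_TEMPLATE = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_install_command(es_version):
    """Return the plugin install command for the given Elasticsearch version."""
    return "bin/elasticsearch-plugin install " + IK_URL_TEMPLATE.format(version=es_version)


print(ik_install_command("7.10.1"))
```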

Usage

The IK analyzer provides two levels of segmentation granularity:

ik_smart

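ik_smart performs the coarsest-grained segmentation, producing non-overlapping tokens, which makes it suitable for phrase queries. Example request and response: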

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_smart"
}

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "2451号",
            "start_offset": 9,
            "end_offset": 14,
            "type": "TYPE_CQUAN",
            "position": 3
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 6
        }
    ]
}
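
The same request can be issued from a script. The following Python sketch, using only the standard library, builds the _analyze request and reduces the response to a plain list of token strings. The function names are hypothetical, the host is the one used in the examples above, and POST is used instead of GET because some HTTP clients do not send a body with GET (the _analyze endpoint accepts both).

```python
import json
import urllib.request


def build_analyze_request(host, text, analyzer):
    """Build (but do not send) an _analyze request for the given host."""
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body):
    """Return just the token strings from an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


def analyze(host, text, analyzer):
    """Send the request to a running node and return the token strings."""
    with urllib.request.urlopen(build_analyze_request(host, text, analyzer)) as resp:
        return extract_tokens(resp.read().decode("utf-8"))


# Example (requires a reachable node with the IK plugin installed):
# analyze("172.16.5.33:9200", "上海市长宁区虹桥路2451号格林东方酒店。", "ik_smart")
```

Applied to the response shown above, extract_tokens would yield the seven tokens from 上海市 to 酒店 in order.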

ik_max_word

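ik_max_word performs the finest-grained segmentation, exhaustively emitting every dictionary word it can find, including overlapping ones, which makes it suitable for term queries. Example request and response: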

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_max_word"
}

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "上海",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "海市",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "市长",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "长宁",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "区",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "虹桥",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "路",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_CHAR",
            "position": 9
        },
        {
            "token": "2451",
            "start_offset": 9,
            "end_offset": 13,
            "type": "ARABIC",
            "position": 10
        },
        {
            "token": "号",
            "start_offset": 13,
            "end_offset": 14,
            "type": "COUNT",
            "position": 11
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 12
        },
        {
            "token": "林东",
            "start_offset": 15,
            "end_offset": 17,
            "type": "CN_WORD",
            "position": 13
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 14
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 15
        }
    ]
}
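
One commonly used way to combine the two modes is to index a field with ik_max_word, so that as many words as possible are searchable, and to analyze queries with ik_smart. The Python sketch below builds such a mapping body, intended for a request like PUT <index>/_mapping. The field name content is a hypothetical example, ik_text_mapping is a hypothetical helper, and whether this split suits a given workload is an assumption to verify against your own queries.

```python
import json


def ik_text_mapping(field_name):
    """Mapping body that indexes with ik_max_word and analyzes queries with ik_smart."""
    return {
        "properties": {
            field_name: {
                "type": "text",
                "analyzer": "ik_max_word",        # used at index time
                "search_analyzer": "ik_smart",    # used at query time
            }
        }
    }


# "content" is a hypothetical field name chosen for illustration.
print(json.dumps(ik_text_mapping("content"), indent=2))
```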
