Elasticsearch Analyzers

程序员文章站 2022-07-04 22:12:53

Built-in analyzers


Standard


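Chinese text is split into single characters; English text is split on word boundaries (whitespace and most punctuation) and converted to lowercase.
Example request: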

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "standard"
}
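The same request can be issued from a script. The following is a minimal sketch using only the Python standard library. It assumes the node address from the example is reachable without authentication, and the helper names `build_analyze_request` and `analyze` are illustrative, not part of any official client. Because `urllib` switches to POST whenever a body is attached, the sketch relies on the `_analyze` endpoint accepting POST as well as GET.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer):
    """Build (but do not send) an HTTP request for the _analyze endpoint."""
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,  # attaching a body makes urllib use POST
        headers={"Content-Type": "application/json"},
    )


def analyze(base_url, text, analyzer):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(base_url, text, analyzer)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [item["token"] for item in payload["tokens"]]
```

With a running cluster, `analyze("http://172.16.5.33:9200", "I like it very much.", "standard")` would return the token strings produced by the Standard analyzer.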

Whitespace


Splits on whitespace only; Chinese text is not segmented any further, and English keeps its original case.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "whitespace"
}
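To make the behaviour concrete, here is a rough pure-Python approximation of what the Whitespace analyzer does to the example text. It is a sketch for illustration, not Lucene's implementation: Python's notion of whitespace may differ from Java's for a few unusual characters, and the function name `whitespace_tokens` is made up for this example.

```python
def whitespace_tokens(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


sample = "上海市长宁区虹桥路2451号格林东方酒店, I like it very much."
print(whitespace_tokens(sample))
# ['上海市长宁区虹桥路2451号格林东方酒店,', 'I', 'like', 'it', 'very', 'much.']
```

Note that punctuation stays attached to the neighbouring token and the whole Chinese address remains a single token, which is why this analyzer is rarely useful for Chinese search.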

Simple

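Splits on any non-letter character (whitespace, digits, punctuation); a run of Chinese characters is not segmented any further, and English is converted to lowercase.
Example request: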

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "simple"
}
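The rule can be approximated in a few lines of Python. This is only a sketch under the assumption that "letter" means what `str.isalpha` reports; it is not the actual Lucene code, and `simple_tokens` is a hypothetical helper name.

```python
from itertools import groupby


def simple_tokens(text):
    """Approximate the simple analyzer: keep runs of letters and lowercase them."""
    return [
        "".join(run).lower()
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


sample = "上海市长宁区虹桥路2451号格林东方酒店, I like it very much."
print(simple_tokens(sample))
# ['上海市长宁区虹桥路', '号格林东方酒店', 'i', 'like', 'it', 'very', 'much']
```

The digits `2451` are not letters, so they are discarded and split the Chinese address into two tokens.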

Stop


Builds on the Simple analyzer and additionally removes stop words such as "the" and "a".
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    "analyzer": "stop"
}
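A rough Python approximation of this analyzer is letter-based splitting, lowercasing, and then filtering against a stop word list. The list below is believed to match the default English stop words shipped with Lucene, but treat it as an assumption and check it against your Elasticsearch version; the helper name `stop_tokens` is illustrative.

```python
from itertools import groupby

# Assumed default English stop words; verify against your Elasticsearch version.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def stop_tokens(text, stop_words=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: letter runs, lowercased, stop words removed."""
    words = (
        "".join(run).lower()
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    )
    return [word for word in words if word not in stop_words]


print(stop_tokens("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```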

IK Chinese analyzer


Elasticsearch's built-in analyzers do not handle Chinese well, so the third-party IK Chinese analyzer is used to search Chinese text.

Installation


Go to the Elasticsearch installation directory and run the following command:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip

Replace 7.10.1 with the version of your Elasticsearch installation; the plugin version must match it.
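Since the version appears twice in the download URL, it is easy to update one occurrence and miss the other. A small helper can build the URL from a single version string. This sketch assumes the URL pattern of the 7.10.1 link holds for other versions, which is not guaranteed for every release; `ik_plugin_url` is a hypothetical name.

```python
import re


def ik_plugin_url(es_version):
    """Build the IK plugin download URL, assuming the pattern of the 7.10.1 link."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", es_version):
        raise ValueError(f"expected a version like 7.10.1, got {es_version!r}")
    return (
        "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
        f"v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"
    )


print(ik_plugin_url("7.10.1"))
```

The printed URL can then be passed to `bin/elasticsearch-plugin install`.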



Usage

The IK plugin provides analyzers at two levels of segmentation granularity:

ik_smart

ik_smart performs the coarsest-grained segmentation of the text, which suits phrase queries. Example:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_smart"
}

Response:

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "2451号",
            "start_offset": 9,
            "end_offset": 14,
            "type": "TYPE_CQUAN",
            "position": 3
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 6
        }
    ]
}
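The `start_offset` and `end_offset` fields are character positions in the original text. The sketch below copies the tokens from the ik_smart response and checks that slicing the text with each pair of offsets reproduces the token. It assumes every character is in the Basic Multilingual Plane, where Python string indices and the offsets reported by Elasticsearch coincide; the helper names are illustrative.

```python
text = "上海市长宁区虹桥路2451号格林东方酒店。"

# (token, start_offset, end_offset) copied from the ik_smart response.
ik_smart_tokens = [
    ("上海市", 0, 3), ("长宁区", 3, 6), ("虹桥路", 6, 9), ("2451号", 9, 14),
    ("格林", 14, 16), ("东方", 16, 18), ("酒店", 18, 20),
]


def offsets_match(text, tokens):
    """Check that text[start:end] equals the token for every entry."""
    return all(text[start:end] == token for token, start, end in tokens)


def uncovered_text(text, tokens):
    """Return the characters of text that no token covers."""
    covered = set()
    for _, start, end in tokens:
        covered.update(range(start, end))
    return "".join(ch for i, ch in enumerate(text) if i not in covered)


print(offsets_match(text, ik_smart_tokens))   # True
print(uncovered_text(text, ik_smart_tokens))  # 。
```

The only character left uncovered is the trailing full stop, which the analyzer drops.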

ik_max_word

ik_max_word performs the finest-grained segmentation of the text, emitting every word it can find, which suits term queries. Example:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_max_word"
}

Response:

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "上海",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "海市",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "市长",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "长宁",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "区",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "虹桥",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "路",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_CHAR",
            "position": 9
        },
        {
            "token": "2451",
            "start_offset": 9,
            "end_offset": 13,
            "type": "ARABIC",
            "position": 10
        },
        {
            "token": "号",
            "start_offset": 13,
            "end_offset": 14,
            "type": "COUNT",
            "position": 11
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 12
        },
        {
            "token": "林东",
            "start_offset": 15,
            "end_offset": 17,
            "type": "CN_WORD",
            "position": 13
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 14
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 15
        }
    ]
}
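One way to see the difference between the two granularities is to look at whether token spans overlap. The following sketch uses the `(start_offset, end_offset)` pairs copied from the two responses in this article; the function name `overlapping_pairs` is made up for illustration.

```python
# (start_offset, end_offset) pairs copied from the two responses in this article.
ik_smart_spans = [(0, 3), (3, 6), (6, 9), (9, 14), (14, 16), (16, 18), (18, 20)]
ik_max_word_spans = [
    (0, 3), (0, 2), (1, 3), (2, 4), (3, 6), (3, 5), (5, 6), (6, 9),
    (6, 8), (8, 9), (9, 13), (13, 14), (14, 16), (15, 17), (16, 18), (18, 20),
]


def overlapping_pairs(spans):
    """Return every pair of spans that share at least one character position."""
    return [
        (a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
        if a[0] < b[1] and b[0] < a[1]
    ]


print(overlapping_pairs(ik_smart_spans))               # []
print(len(overlapping_pairs(ik_max_word_spans)) > 0)   # True
```

ik_smart tiles the text with non-overlapping tokens, while ik_max_word emits overlapping ones, including "市长" at offsets 2 to 4, which straddles the boundary between "上海市" and "长宁区". This improves recall for term lookups at the cost of occasional false matches.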
