Elasticsearch Study Notes, Part 11: The Analyze API and the IK Analyzer

程序员文章站 2024-02-21 22:46:58

Analyze API

The analyze API shows how a piece of full text is analyzed. In the request body we specify the analyzer and the text to analyze:

GET /_analyze
{
  "analyzer": "standard",
  "text": "歌唱我们亲爱的祖国从今走向走向繁荣富强 "
}
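
The same request can also be sent from a script. The following is a minimal sketch in Python using only the standard library. The base URL http://localhost:9200 and the helper names build_analyze_body and analyze are assumptions made for illustration; they do not come from the original notes.

```python
import json
import urllib.request


def build_analyze_body(analyzer: str, text: str) -> dict:
    """Build the JSON body for the _analyze endpoint (analyzer plus text)."""
    return {"analyzer": analyzer, "text": text}


def analyze(analyzer: str, text: str, base_url: str = "http://localhost:9200") -> dict:
    """POST the body to <base_url>/_analyze and return the parsed JSON response.

    base_url is an assumption; point it at your own cluster.
    """
    data = json.dumps(build_analyze_body(analyzer, text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a node running, analyze("standard", "歌唱我们亲爱的祖国") should return a dictionary with a "tokens" list shaped like the response shown in these notes.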

The analysis result is as follows:

{
  "tokens": [
    {
      "token": "歌",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "唱",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "我",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "们",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "亲",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "爱",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "的",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "祖",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "从",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "今",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    },
    {
      "token": "走",
      "start_offset": 11,
      "end_offset": 12,
      "type": "<IDEOGRAPHIC>",
      "position": 11
    },
    {
      "token": "向",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    },
    {
      "token": "走",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<IDEOGRAPHIC>",
      "position": 13
    },
    {
      "token": "向",
      "start_offset": 14,
      "end_offset": 15,
      "type": "<IDEOGRAPHIC>",
      "position": 14
    },
    {
      "token": "繁",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<IDEOGRAPHIC>",
      "position": 15
    },
    {
      "token": "荣",
      "start_offset": 16,
      "end_offset": 17,
      "type": "<IDEOGRAPHIC>",
      "position": 16
    },
    {
      "token": "富",
      "start_offset": 17,
      "end_offset": 18,
      "type": "<IDEOGRAPHIC>",
      "position": 17
    },
    {
      "token": "强",
      "start_offset": 18,
      "end_offset": 19,
      "type": "<IDEOGRAPHIC>",
      "position": 18
    }
  ]
}

This is clearly not the result we want: every Chinese character has become a separate token.
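
To make the shape of this output concrete, the sketch below imitates it in plain Python: one token per character, with offsets and positions counted from zero. It only reproduces the result shown above for this input and is not the actual algorithm of the standard tokenizer; the function name one_token_per_character is hypothetical.

```python
def one_token_per_character(text: str) -> list:
    """Imitate the output above: each non-space character becomes its own token."""
    tokens = []
    for offset, char in enumerate(text):
        if char.isspace():
            continue  # whitespace produces no token
        tokens.append({
            "token": char,
            "start_offset": offset,
            "end_offset": offset + 1,
            "type": "<IDEOGRAPHIC>",  # type label copied from the response above
            "position": len(tokens),
        })
    return tokens


# Same text as in the request, including its trailing space.
tokens = one_token_per_character("歌唱我们亲爱的祖国从今走向走向繁荣富强 ")
print(len(tokens))
print(tokens[-1])
```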

IK Analyzer

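As the example above shows, the default standard analyzer in Elasticsearch splits Chinese words into individual characters, which is not what we want. To solve this we need a Chinese analysis plugin for ES, es-ik.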

Component   Supported
Analyzer    ik_smart, ik_max_word
Tokenizer   ik_smart, ik_max_word

IK Analyzer version compatibility

IK version ES version
master 6.x -> master
6.3.0 6.3.0
6.2.4 6.2.4
6.1.3 6.1.3
5.6.8 5.6.8
5.5.3 5.5.3
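
Because the plugin version has to match the Elasticsearch version, it can help to derive the download URL from the version string. This is a minimal sketch, assuming the release URL pattern used by the install command in these notes also holds for the other versions in the table. The helper ik_release_url is hypothetical.

```python
# Versions listed in the table above, where the IK version equals the ES version.
SUPPORTED_VERSIONS = {"6.3.0", "6.2.4", "6.1.3", "5.6.8", "5.5.3"}

# Assumed URL prefix, taken from the elasticsearch-plugin install command in these notes.
RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_release_url(es_version: str) -> str:
    """Return the assumed download URL of the IK release matching es_version."""
    if es_version not in SUPPORTED_VERSIONS:
        raise ValueError(f"no matching IK release listed for ES {es_version}")
    return f"{RELEASES}/v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


print(ik_release_url("6.3.0"))
```

Rejecting unlisted versions is a deliberate choice here: a mismatched plugin version is a configuration error best caught before anything is downloaded.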

Installation

Download or build

Option 1

Download the version you need from the following URL:
https://github.com/medcl/elasticsearch-analysis-ik/releases
Create the installation directory:

cd your-es-root/plugins/ && mkdir ik

Unzip the downloaded zip package into your-es-root/plugins/ik.
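
The two manual steps of Option 1 (create the directory, unzip into it) can be scripted. The following is a rough sketch using the Python standard library; the function name unpack_ik_plugin and its parameters are hypothetical, and the zip path and Elasticsearch root must be supplied by you.

```python
import zipfile
from pathlib import Path


def unpack_ik_plugin(zip_path: str, es_root: str) -> Path:
    """Extract the downloaded zip into <es_root>/plugins/ik and return that directory."""
    target = Path(es_root) / "plugins" / "ik"
    # Equivalent to the mkdir step above; does nothing if the directory already exists.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, unpack_ik_plugin("elasticsearch-analysis-ik-6.3.0.zip", "your-es-root") would place the archive contents under your-es-root/plugins/ik.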

Option 2

Install with the elasticsearch-plugin tool:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip

Restart Elasticsearch

./bin/elasticsearch -d

IK Analyzer in action

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "歌唱我们亲爱的祖国从今走向走向繁荣富强"
}

The result is as follows:

{
  "tokens": [
    {
      "token": "歌唱",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "我们",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "亲爱的",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "祖国",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "从今",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "走向",
      "start_offset": 11,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "走向",
      "start_offset": 13,
      "end_offset": 15,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "繁荣富强",
      "start_offset": 15,
      "end_offset": 19,
      "type": "CN_WORD",
      "position": 7
    }
  ]
}
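
The offsets in the response index into the original text, so each token can be recovered by slicing. The small sketch below checks this for the ik_smart result above. The response is reduced to the fields used, and tokens_match_offsets is a hypothetical helper name.

```python
TEXT = "歌唱我们亲爱的祖国从今走向走向繁荣富强"

# (token, start_offset, end_offset) triples copied from the ik_smart result above.
IK_SMART_TOKENS = [
    ("歌唱", 0, 2), ("我们", 2, 4), ("亲爱的", 4, 7), ("祖国", 7, 9),
    ("从今", 9, 11), ("走向", 11, 13), ("走向", 13, 15), ("繁荣富强", 15, 19),
]


def tokens_match_offsets(text: str, tokens: list) -> bool:
    """Return True if every token equals text[start_offset:end_offset].

    Python indexes strings by code point; for the characters used here this
    coincides with the offsets in the response.
    """
    return all(text[start:end] == token for token, start, end in tokens)


print(tokens_match_offsets(TEXT, IK_SMART_TOKENS))
print(len(IK_SMART_TOKENS))
```

The ik_smart result contains 8 word-level tokens covering the same 19 characters that the standard analyzer split into 19 single-character tokens.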