4、Analyzers in ElasticSearch

1、Analysis and Analyzer

Analysis (text analysis) is the process of converting full text into a sequence of terms/tokens, commonly called tokenization. Analysis is carried out by an Analyzer; you can use one of ElasticSearch's built-in analyzers or define a custom one as needed.

Analyzers are not only applied when documents are indexed: when a Query is executed, the query string must be analyzed with the same analyzer.
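
For instance, the analyzer of a text field is declared in the index mapping; unless a separate search_analyzer is configured, the same analyzer is applied to the query string at search time. A minimal sketch (the index name my_index and the field title are placeholders, not from the original article):

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}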

An Analyzer consists of three parts (a combined example follows this list):

  • Character Filters: pre-process the raw text, e.g. stripping HTML tags
  • Tokenizer: splits the text into terms according to a set of rules
  • Token Filter: post-processes the terms, e.g. lowercasing, removing stop words, adding synonyms
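
The three parts can also be passed directly to the _analyze API introduced below. A minimal sketch that strips HTML with a character filter, splits with the standard tokenizer, and lowercases with a token filter:

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["<b>Hello World</b>"]
}
## Expected tokens: hello, world
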
2、ElasticSearch's Built-in Analyzers

Prerequisite: the _analyze API is used to test analyzers.
For example, tokenizing a sentence with the default analyzer:

POST _analyze
{
  "text": ["I'm studing now"],
  "analyzer": "standard"
}
## Response
{
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "studing",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "now",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

2.1、Standard Analyzer

The default analyzer: splits text into terms on word boundaries and lowercases them (a configuration sketch follows the list below).

  • Tokenizer: standard
  • Token Filters
    • standard
    • lower case
    • stop (disabled by default)
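
Although the stop filter is disabled by default, a standard analyzer with a stop word list can be declared in the index settings. A minimal sketch (my_index and my_standard are placeholder names, not from the original article):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
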
2.2、Simple Analyzer
  • Splits on any non-letter character; non-letter characters are discarded
  • Lowercases the terms
  • Tokenizer: lowercase

Example: analyzing "I'm studying 11" (note that the digits 11 are dropped because they are not letters)

POST _analyze
{
  "text": ["I'm studying 11"],
  "analyzer": "simple"
}
## Response
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "studying",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    }
  ]
}
2.3、Stop Analyzer

Compared with the Simple Analyzer it adds a stop token filter, which removes stop words such as the/a/is/in.

  • Tokenizer: lowercase
  • Token Filters: stop

Example: analyzing "I'm studying in the room"

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "stop"
}
## Response
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "studying",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "room",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "word",
      "position" : 5
    }
  ]
}
2.4、WhiteSpace Analyzer

Splits the text on whitespace only.
Tokenizer: whitespace

Example: analyzing "I'm studying in the room"

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "whitespace"
}
## Response
{
  "tokens" : [
    {
      "token" : "I'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "studying",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "in",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "the",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "room",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "word",
      "position" : 4
    }
  ]
}
2.5、Keyword Analyzer

Does not tokenize at all; the entire input is emitted as a single term.
Tokenizer: keyword

Example: analyzing "I'm studying in the room"

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "keyword"
}
## Response
{
  "tokens" : [
    {
      "token" : "I'm studying in the room",
      "start_offset" : 0,
      "end_offset" : 24,
      "type" : "word",
      "position" : 0
    }
  ]
}
2.6、Pattern Analyzer
  • Splits the text with a regular expression
  • The default pattern is \W+, i.e. it splits on any non-word character
  • Tokenizer: pattern
  • Token Filters: lowercase / stop
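
The pattern analyzer can be tried with the same _analyze API; with the default \W+ pattern it behaves much like the simple analyzer, since the apostrophe and the spaces are all non-word characters:

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "pattern"
}
## Expected tokens: i, m, studying, in, the, room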

2.7、English Analyzer

A language-specific analyzer. Besides lowercasing, it removes English stop words and stems the terms, which is why "studying" becomes "studi" in the response below.
Example: analyzing "I'm studying in the room"

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "english"
}
## Response
{
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "studi",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "room",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

2.8、Chinese Tokenization

The main difficulties of Chinese tokenization:

  • A Chinese sentence must be split into words, not into individual characters
  • The same sentence can be understood differently in different contexts

Commonly used Chinese analyzers:

  • ICU Analyzer: adds Unicode support and handles Asian languages better
  • IK: supports custom dictionaries and hot reloading of the tokenization dictionary
  • THULAC

This section uses the IK analysis plugin for Chinese tokenization; how to install the plugin is covered in detail in the next section, so it is not repeated here.

The IK plugin provides two analyzers, ik_smart and ik_max_word; both are shown below.

Example: analyzing "这个苹果不大好吃" with ik_max_word

POST _analyze
{
  "text": ["这个苹果不大好吃"],
  "analyzer": "ik_max_word"
}
## Response
{
  "tokens" : [
    {
      "token" : "这个",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "苹果",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "不大好",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "不大",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "大好",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "好吃",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
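
For comparison, ik_smart produces a coarser segmentation without overlapping terms. The request below is a sketch; the exact tokens depend on the IK dictionary version:

POST _analyze
{
  "text": ["这个苹果不大好吃"],
  "analyzer": "ik_smart"
}
## Typical tokens: 这个, 苹果, 不大, 好吃
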
2.9、Custom Analyzers

My understanding: you combine your own choice of Character Filters, Tokenizer, and Token Filters so that the resulting analyzer meets your needs (a sketch of registering such an analyzer in the index settings follows the example below).
Requirement: e.g. "do not tokenize; output the input as a single term in its lowercase form"

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": ["Mastering Elasticsearch"]
}
## Response
{
  "tokens" : [
    {
      "token" : "Mastering Elasticsearch",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
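
The same combination can be registered as a named custom analyzer in the index settings and then referenced from a field mapping. A minimal sketch (my_index and lowercase_keyword are placeholder names, not from the original article):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "lowercase_keyword"
      }
    }
  }
}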