ES Day 9: Tokenization, Custom Analyzers, the IK Chinese Analyzer, and IK-Based Full-Text Search
What is tokenization?
Tokenization is the process of turning a piece of text into a sequence of words. It is also called text analysis, and in Elasticsearch it is known as Analysis.
Example: 我是中国人 --> 我/是/中国人
What is a tokenizer?
1. As the name suggests, a tokenizer is what does the tokenizing (which sounds like stating the obvious).
2. Beyond that, natural language is rich enough that one thing often has many names. In English, for example, mother/mum and father/dad are synonyms. We usually want the user's query to match no matter which variant they type. Handling this, i.e. synonym normalization, is also part of the analysis work; the overall goal is to improve recall, the proportion of relevant results that can actually be found (see the sketch just below).
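As a taste of what synonym handling looks like, here is a minimal sketch of a synonym token filter (the index name synonym_demo and the word list are illustrative, not from the original):
PUT /synonym_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",    // expand each term to its synonym group
          "synonyms": [
            "mother, mum, mom",
            "father, dad, daddy"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
GET /synonym_demo/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "my mum"
}
With rules like these, a search for father can also match documents that say dad, which is exactly the recall improvement described above.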
Analyzers in ES
In ES there is in fact an analyzer abstraction: besides the tokenizer itself, it also coordinates pre-processing, synonym normalization, and similar steps.
It consists of three parts:
1. Character filter
Pre-processes the text before tokenization: filters out useless characters, tags, and so on, and applies substitutions such as & => and or 《Elasticsearch》 => Elasticsearch.
1) HTML Strip Character Filter (html_strip): removes HTML tags.
Parameter: escaped_tags, the HTML tags to keep.
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",    // strip HTML tags
          "escaped_tags": ["a"]    // tags to keep
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <a>happy</a>!</p>"
}
2) Mapping Character Filter (type: mapping): replaces occurrences of the mapped keys with their corresponding values.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
Result: My license plate is 25015
3) Pattern Replace Character Filter (type: pattern_replace): replaces content matched by a regular expression.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
Result: My credit card is 123_456_789
2. Token filter
Handles stop words, tense normalization, case conversion, synonym expansion, filler words, and so on.
For example: has => have, him => he, apples => apple, and removal of stop words such as the/oh/a.
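The tense and plural examples above are closest to what the built-in stemmer token filter does. A minimal sketch (note that stemming reduces words to stems, which are not always dictionary words: apples becomes appl rather than apple):
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stemmer"],    // stemmer defaults to English
  "text": "He has two apples"
}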
Lowercase conversion with the lowercase token filter:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE Quick FoX JUMPs"
}
Lowercase only the terms that are shorter than 5 characters after tokenization (here, THE and FOX), using a condition token filter:
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["lowercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
Stop words (stopwords token filter):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"    // use the built-in English stop word list
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
PUT /my_index6
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": ["restroom", "is"]    // stop "restroom" and "is"
        }
      }
    }
  }
}
GET my_index6/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
3. Tokenizer
The tokenizer does the actual splitting into terms.
ES 7.6 ships with 15 built-in tokenizers. Here are four:
① standard analyzer: the default. Its Chinese support is poor, as it splits Chinese character by character. (In practice, you generally use either the Chinese IK analyzer or this default.)
1) max_token_length: the maximum token length. If a token exceeds this length, it is split every max_token_length characters. Defaults to 255. (A configuration sketch follows the example below.)
GET /my_index/_analyze
{
  "text": "*如此多娇,小姐姐哪里可以撩",
  "analyzer": "standard"
}
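As promised above, a minimal configuration sketch for max_token_length (the index name my_index2 and the limit of 5 are illustrative, not from the original): with a limit of 5, a six-letter token such as jumped should be split into jumpe and d.
PUT /my_index2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_std_analyzer": {
          "type": "standard",
          "max_token_length": 5    // split any token longer than 5 characters
        }
      }
    }
  }
}
GET /my_index2/_analyze
{
  "analyzer": "my_std_analyzer",
  "text": "The quick brown foxes jumped"
}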
② Pattern Tokenizer: splits text into terms at separators matched by a regular expression.
③ Simple Pattern Tokenizer: matches the terms themselves with a regular expression; faster than the Pattern Tokenizer.
④ whitespace analyzer: splits on whitespace only, so a term like Tim_cookie stays intact (see the sketch below).
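A quick sketch of the whitespace analyzer (the sample text is illustrative): because it splits on whitespace only, Tim_cookie survives as a single token.
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Tim_cookie likes good-looking UI"
}
The expected tokens are roughly Tim_cookie, likes, good-looking, UI.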
Custom analyzers
From the sections above we know an analyzer consists of roughly three parts. Let's customize each of them and assemble the pieces into one analyzer:
PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is", "in", "at", "the", "a", "for"]
        }
      },
      "tokenizer": {
        // split into terms on punctuation, matched by regex
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          // Setting type to custom tells Elasticsearch we are defining a
          // custom analyzer. Compare this with configuring a built-in
          // analyzer, where type is set to the analyzer's name, such as
          // standard or simple.
          "type": "custom",
          "char_filter": [
            "html_strip",
            "test_char_filter"
          ],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "test_stopwords"]
        }
      }
    }
  }
}
GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}
The _analyze API
Specify the standard analyzer and tokenize some English text:
GET /_analyze
{
  "analyzer": "standard",
  "text": "hello world"
}
Result:
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Next, tokenize some Chinese text:
GET /_analyze
{
  "analyzer": "standard",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}
This is clearly not the tokenization we want for Chinese.
Chinese tokenization
The difficulty with Chinese is that, unlike English where spaces act as separators, there are no explicit word boundaries; splitting in the wrong place produces ambiguity.
For example:
我/爱/炒肉丝
我/爱/炒/肉丝
Common Chinese tokenizers include IK, jieba, and THULAC; IK is the recommended one.
IK Analyzer is an open-source, lightweight Chinese tokenization toolkit written in Java. Since version 1.0 was released in December 2006, IK Analyzer has gone through three major versions. Originally it was a Chinese tokenization component built around the open-source Lucene project, combining dictionary-based tokenization with grammar analysis. IK Analyzer 3.0 then evolved into a standalone Java tokenization component, independent of Lucene while still providing an optimized default integration for it. It uses a proprietary "forward iterative finest-granularity splitting algorithm" with a throughput of about 800,000 characters per second, and a multi-subprocessor analysis model that supports Latin letters (IP addresses, emails, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese vocabulary (personal and place names), among others. Its dictionary storage is optimized for a small memory footprint.
IK tokenizer plugin for Elasticsearch: https://github.com/medcl/elasticsearch-analysis-ik
#Installation: unzip the downloaded elasticsearch-analysis-ik-7.6.2.zip into the elasticsearch/plugins/ik directory.
mkdir es/plugins/ik
cp elasticsearch-analysis-ik-7.6.2.zip ./es/plugins/ik
#unzip (the plugin version must match your ES version)
cd es/plugins/ik && unzip elasticsearch-analysis-ik-7.6.2.zip
#restart es
./bin/elasticsearch
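After the restart, a simple way to confirm the plugin loaded (not in the original; the exact output columns vary by version):
GET /_cat/plugins?v
The response should include an entry for the IK analysis plugin (typically named analysis-ik).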
Testing the tokenization
IK tokenizes at two granularities:
1. ik_max_word: fine-grained
2. ik_smart: coarse-grained
ik_max_word (most commonly used)
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
ik_smart
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
As you can see, the tokenization is now quite reasonable.
Verifying full-text search with Chinese tokenization
Create the itcast2 index, with the hobby field configured to use the IK analyzer:
PUT /itcast2
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      },
      "mail": {
        "type": "keyword"
      },
      "hobby": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}
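As a quick sanity check (not in the original), you can confirm that hobby is mapped with ik_max_word before indexing any data:
GET /itcast2/_mapping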
Bulk-insert some data:
POST /itcast2/_bulk
{"index":{"_index":"itcast2"}}
{"name":"张三","age": 20,"mail": "[email protected]","hobby":"羽毛球、乒乓球、足球"}
{"index":{"_index":"itcast2"}}
{"name":"李四","age": 21,"mail": "[email protected]","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{"_index":"itcast2"}}
{"name":"王五","age": 22,"mail": "[email protected]","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{"_index":"itcast2"}}
{"name":"赵六","age": 23,"mail": "[email protected]","hobby":"跑步、游泳、篮球"}
{"index":{"_index":"itcast2"}}
{"name":"孙七","age": 24,"mail": "[email protected]","hobby":"听音乐、看电影、羽毛球"}
First, look at how “羽毛球” is tokenized:
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "羽毛球"
}
Result:
{
  "tokens" : [
    {
      "token" : "羽毛球",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "羽毛",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "球",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    }
  ]
}
Then run a full-text match query for “羽毛球” against the hobby field:
GET /itcast2/_search
{
  "query": {
    "match": {
      "hobby": "羽毛球"
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}
Result:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 0.9608413,
    "hits" : [
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "VtQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.9608413,
        "_source" : {
          "name" : "张三",
          "age" : 20,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、乒乓球、足球"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、乒乓<em>球</em>、足球"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "V9QRLXIB0mJFQ4Okj0ar",
        "_score" : 0.9134824,
        "_source" : {
          "name" : "李四",
          "age" : 21,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、乒乓球、足球、篮球"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、乒乓<em>球</em>、足球、篮球"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "WNQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.80493593,
        "_source" : {
          "name" : "王五",
          "age" : 22,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、篮球、游泳、听音乐"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、篮球、游泳、听音乐"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "WtQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.80493593,
        "_source" : {
          "name" : "孙七",
          "age" : 24,
          "mail" : "[email protected]",
          "hobby" : "听音乐、看电影、羽毛球"
        },
        "highlight" : {
          "hobby" : [
            "听音乐、看电影、<em>羽毛球</em>"
          ]
        }
      }
    ]
  }
}
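Note in the highlights that the lone character 球 inside 乒乓球 also matched, because the query string 羽毛球 was itself analyzed with ik_max_word into 羽毛球/羽毛/球. A common refinement, shown here only as a sketch (the index name itcast3 is illustrative, not from the original), is to index fine-grained with ik_max_word but analyze queries with the coarser ik_smart via the search_analyzer mapping parameter:
PUT /itcast3
{
  "mappings": {
    "properties": {
      "hobby": {
        "type": "text",
        "analyzer": "ik_max_word",       // fine-grained tokens at index time
        "search_analyzer": "ik_smart"    // coarser tokens at query time
      }
    }
  }
}
This keeps the index rich in tokens while making queries less noisy.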