
ES Day 9: Tokenization, Custom Analyzers, the IK Chinese Analyzer, and Full-Text Search with IK


What is tokenization?

Tokenization is the process of turning a piece of text into a sequence of words. It is also called text analysis, and in Elasticsearch it is referred to as Analysis.
Example: 我是中国人 --> 我/是/中国人

What is a tokenizer?

1. As the name suggests, a tokenizer is the component that splits text into tokens.
2. Beyond splitting, languages have many ways of saying the same thing. In English, for example, mother/mum and daddy/father are synonym pairs, and we want a search to match no matter which form the user types. Normalizing such synonyms is part of the analysis work as well (a sketch of a synonym filter follows below); the overall goal is to improve recall, i.e. the proportion of relevant results that can actually be found.
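For instance, Elasticsearch ships a synonym token filter that performs exactly this kind of normalization. Here is a minimal sketch (the index name synonym_demo and the names family_synonyms / family_analyzer are made up for this example):

PUT /synonym_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "family_synonyms": {
          "type": "synonym",
          "synonyms": [
            "mum, mom, mother",
            "dad, daddy, father"
          ]
        }
      },
      "analyzer": {
        "family_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "family_synonyms"]
        }
      }
    }
  }
}
GET /synonym_demo/_analyze
{
  "analyzer": "family_analyzer",
  "text": "my mum and my father"
}

Each term in a synonym group is expanded to its equivalents at the same position, so a search for any of the spellings can match documents that only mention another one.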

Analyzers in ES

Elasticsearch in fact has the notion of an analyzer: besides the tokenizer itself, it coordinates preprocessing and normalization steps such as synonym handling.
An analyzer consists of three parts:

1. Character filter

Preprocessing applied before tokenization: stripping useless characters and tags, and converting text, e.g. & => and, 《Elasticsearch》 => Elasticsearch.

1. HTML Strip Character Filter (html_strip): removes HTML tags.
Parameter: escaped_tags — the HTML tags that should be kept.

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip", //去除标签
          "escaped_tags": ["a"] //要保留的标签
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <a>happy</a>!</p>"
}
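Result: with escaped_tags set to ["a"], the <a> tag is kept while <p> is stripped; since the keyword tokenizer emits the whole input as one token, the output should be roughly the single token \nI'm so <a>happy</a>!\n.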

2. Mapping Character Filter (type: mapping): replaces the configured characters or strings with their mapped values.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
Result: My license plate is 25015

3. Pattern Replace Character Filter (type: pattern_replace): replaces content matched by a regular expression.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
Result: My credit card is 123_456_789

2. Token filter

Token filters handle stop words, tense normalization, case conversion, synonym conversion, filler words, and so on.
For example: has => have, him => he, apples => apple; dropping stop words such as the/oh/a.

Lowercasing: the lowercase token filter

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}

Lowercase only the terms shorter than 5 characters, using a conditional token filter:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
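Result: THE and FOX are shorter than 5 characters and get lowercased, while QUICK and BROWN are left as-is, so the tokens should be roughly the, QUICK, BROWN, fox.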

Stop words: the stopwords setting

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "type":"standard",
          "stopwords":"_english_" //对英语中的谓词进行停用
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
PUT /my_index6
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "type":"standard",
          "stopwords":[ "restroom","is" //restroom,is进行停用
            ]
        }
      }
    }
  }
}
GET my_index6/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}

3. Tokenizer

This is the component that actually splits the text.
ES 7.6 ships with 15 built-in tokenizers.
Here are four (a short demo of max_token_length and the whitespace analyzer follows this list):
① standard analyzer: the default. It handles Chinese poorly, splitting it character by character. (In practice you either use the Chinese IK analyzer or this default.)
1) max_token_length: the maximum token length. Tokens longer than this are split at max_token_length intervals; the default is 255.

GET /my_index/_analyze
{
  "text": "*如此多娇,小姐姐哪里可以撩",
  "analyzer": "standard"
}

② Pattern Tokenizer: splits text into terms wherever a regular expression matches a separator.
③ Simple Pattern Tokenizer: matches the terms themselves with a regular expression; faster than the Pattern Tokenizer.
④ whitespace analyzer: splits on whitespace only, so a token like Tim_cookie stays intact.
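As a minimal illustration of the points above (the index name std_demo and the analyzer name short_std are made up for this sketch, and the expected outputs are approximate): the first request defines a standard analyzer that cuts tokens at 5 characters, and the second shows that the whitespace analyzer keeps Tim_cookie as a single token.

PUT /std_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "short_std": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
POST /std_demo/_analyze
{
  "analyzer": "short_std",
  "text": "Elasticsearch"
}
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Tim_cookie likes Elasticsearch"
}

The first call should return roughly elast / icsea / rch, while the second keeps Tim_cookie whole because only whitespace acts as a separator.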

Custom analyzer

As described above, an analyzer consists of roughly three parts. Let's customize each one and then combine them into a single analyzer:

PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is","in","at","the","a","for"]
        }
      },
      "tokenizer": {
        // split terms on punctuation, using a regex
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          // Setting type to custom tells Elasticsearch we are defining a custom analyzer. Compare this with configuring a built-in analyzer, where type is set to the built-in analyzer's name, such as standard or simple.
          "type": "custom", 
          "char_filter": [
            "html_strip",
            "test_char_filter"
          ],
          "tokenizer": "punctuation",
          "filter": ["lowercase","test_stopwords"]
        }
      }
    }
  }
}

GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}
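Result (roughly): the mapping char filter turns & into and and | into or, the punctuation tokenizer splits on spaces and on . , ! ?, and the lowercase and stop filters then lowercase everything and drop is, giving tokens like teacher, ma, and, zhang, also, thinks, [mother's, friends], good, or, nice (the square brackets survive because they are not in the tokenizer's pattern).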

The analyze API

Specify the standard analyzer and tokenize some English text:

GET /_analyze
{
  "analyzer":"standard",
  "text":"hello world"
}

Result:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Next, tokenize some Chinese text:

GET /_analyze
{
  "analyzer":"standard",
  "text":"我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

This character-by-character splitting is probably not what we want for Chinese.

Chinese word segmentation

The difficulty with Chinese segmentation is that Chinese has no explicit word boundaries (in English, spaces serve as separators), and an incorrect split changes the meaning.
For example:
我/爱/炒肉丝
我/爱/炒/肉丝
Commonly used Chinese segmenters include IK, jieba and THULAC; IK is the recommended one.

IK Analyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. Since version 1.0 was released in December 2006, three major versions have been published. It started out as a Chinese segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammatical analysis. IK Analyzer 3.0 evolved into a general-purpose Java segmentation component, independent of the Lucene project, while still providing a default optimized implementation for Lucene. It uses its own "forward iterative finest-granularity segmentation algorithm", processing around 800,000 characters per second, and a multi-sub-processor analysis mode that handles letters (IP addresses, email, URLs), numbers (dates, common Chinese quantity words, Roman numerals, scientific notation) and Chinese vocabulary (including person and place names). Its optimized dictionary storage keeps the memory footprint small.

IK analyzer Elasticsearch plugin: https://github.com/medcl/elasticsearch-analysis-ik

# Installation: unzip the downloaded elasticsearch-analysis-ik-7.6.2.zip into elasticsearch/plugins/ik.
mkdir es/plugins/ik
cp elasticsearch-analysis-ik-7.6.2.zip ./es/plugins/ik
# unzip it inside that directory
unzip ./es/plugins/ik/elasticsearch-analysis-ik-7.6.2.zip -d ./es/plugins/ik
# restart es
./bin/elasticsearch
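After the restart, it is worth confirming that the plugin actually loaded (a quick sanity check, not part of the original steps); the _cat API lists the plugins on every node:

GET /_cat/plugins?v

The response should include analysis-ik with the installed version; alternatively, ./bin/elasticsearch-plugin list prints the installed plugins from the command line.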

Testing the segmentation

IK provides two granularities:
1. ik_max_word: fine-grained
2. ik_smart: coarse-grained

ik_max_word (commonly used)

POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

ik_smart

POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

As you can see, the segmentation now looks much better.

Verifying full-text search with Chinese segmentation

Create the itcast2 index and configure the hobby field to use the IK analyzer:

PUT /itcast2
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
  
      "properties": {
        "name": {
          "type": "text"
        },
        "age": {
          "type": "integer"
        },
        "mail": {
          "type": "keyword"
        },
        "hobby": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      
    }
  }
}
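To confirm that the hobby field really picked up the IK analyzer (a small extra check, with 听音乐 as an arbitrary test string), the _analyze API can be pointed at the field itself:

GET /itcast2/_analyze
{
  "field": "hobby",
  "text": "听音乐"
}

If the mapping took effect, the tokens come from ik_max_word, e.g. a word like 音乐 rather than one token per character.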

Bulk-insert some documents:

POST /itcast2/_bulk
{"index":{"_index":"itcast2"}}
{"name":"张三","age": 20,"mail": "[email protected]","hobby":"羽毛球、乒乓球、足球"}
{"index":{"_index":"itcast2"}}
{"name":"李四","age": 21,"mail": "[email protected]","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{"_index":"itcast2"}}
{"name":"王五","age": 22,"mail": "[email protected]","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{"_index":"itcast2"}}
{"name":"赵六","age": 23,"mail": "[email protected]","hobby":"跑步、游泳、篮球"}
{"index":{"_index":"itcast2"}}
{"name":"孙七","age": 24,"mail": "[email protected]","hobby":"听音乐、看电影、羽毛球"}

First, look at how "羽毛球" is tokenized:

POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "羽毛球"
}

Result:

{
  "tokens" : [
    {
      "token" : "羽毛球",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "羽毛",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "球",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    }
  ]
}

Then run a full-text search for "羽毛球" against the hobby field:

GET /itcast2/_search
{
  "query": {
    "match": {
      "hobby": "羽毛球"
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}

Result:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 0.9608413,
    "hits" : [
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "VtQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.9608413,
        "_source" : {
          "name" : "张三",
          "age" : 20,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、乒乓球、足球"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、乒乓<em>球</em>、足球"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "V9QRLXIB0mJFQ4Okj0ar",
        "_score" : 0.9134824,
        "_source" : {
          "name" : "李四",
          "age" : 21,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、乒乓球、足球、篮球"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、乒乓<em>球</em>、足球、篮球"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "WNQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.80493593,
        "_source" : {
          "name" : "王五",
          "age" : 22,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、篮球、游泳、听音乐"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、篮球、游泳、听音乐"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "WtQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.80493593,
        "_source" : {
          "name" : "孙七",
          "age" : 24,
          "mail" : "[email protected]",
          "hobby" : "听音乐、看电影、羽毛球"
        },
        "highlight" : {
          "hobby" : [
            "听音乐、看电影、<em>羽毛球</em>"
          ]
        }
      }
    ]
  }
}
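Note that in the highlights above 乒乓球 gets a partial hit on 球, because the query string is also analyzed with ik_max_word, which emits the single-character token 球. A common way to tighten this (a hedged suggestion, not covered by the original example; itcast3 is a hypothetical index) is to index with ik_max_word but search with ik_smart via the search_analyzer mapping parameter:

PUT /itcast3
{
  "mappings": {
    "properties": {
      "hobby": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

With this mapping the query 羽毛球 is analyzed into the single term 羽毛球 at search time, so only documents that actually contain that word are matched.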