ES Day 9: Tokenization, Custom Analyzers, the IK Chinese Analyzer, and IK-Based Full-Text Search
What is tokenization?
Tokenization is the process of turning a piece of text into a sequence of words. It is also called text analysis, and in Elasticsearch it is known as Analysis.
Example: 我是中国人 --> 我/是/中国人
What is a tokenizer?
1. As the name suggests, a tokenizer is what does the tokenizing (which sounds like stating the obvious).
2. Beyond that, natural language is rich enough that one thing often has many names. In English, for example, mother/mum and father/dad are synonyms. We usually want the user's query to match no matter which variant they type. Handling this, i.e. synonym normalization, is also part of the analysis work; the overall goal is to improve recall, the proportion of relevant results that can actually be found (see the sketch just below).
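As a taste of what synonym handling looks like, here is a minimal sketch of a synonym token filter (the index name synonym_demo and the word list are illustrative, not from the original):
PUT /synonym_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",    // expand each term to its synonym group
          "synonyms": [
            "mother, mum, mom",
            "father, dad, daddy"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
GET /synonym_demo/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "my mum"
}
With rules like these, a search for father can also match documents that say dad, which is exactly the recall improvement described above.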
Analyzers in ES
In ES there is in fact an analyzer abstraction: besides the tokenizer itself, it also coordinates pre-processing, synonym normalization, and similar steps.
It consists of three parts:
1. Character filter
Pre-processes the text before tokenization: filters out useless characters, tags, and so on, and applies substitutions such as & => and or 《Elasticsearch》 => Elasticsearch.
1) HTML Strip Character Filter (html_strip): removes HTML tags.
Parameter: escaped_tags, the HTML tags to keep.
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",    // strip HTML tags
          "escaped_tags": ["a"]    // tags to keep
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <a>happy</a>!</p>"
}
2) Mapping Character Filter (type: mapping): replaces occurrences of the mapped keys with their corresponding values.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
Result: My license plate is 25015
3) Pattern Replace Character Filter (type: pattern_replace): replaces content matched by a regular expression.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
Result: My credit card is 123_456_789
2. Token filter
Handles stop words, tense normalization, case conversion, synonym expansion, filler words, and so on.
For example: has => have, him => he, apples => apple, and removal of stop words such as the/oh/a.
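The tense and plural examples above are closest to what the built-in stemmer token filter does. A minimal sketch (note that stemming reduces words to stems, which are not always dictionary words: apples becomes appl rather than apple):
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stemmer"],    // stemmer defaults to English
  "text": "He has two apples"
}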
Lowercase conversion with the lowercase token filter:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE Quick FoX JUMPs"
}
Lowercase only the terms that are shorter than 5 characters after tokenization (here, THE and FOX), using a condition token filter:
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["lowercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
Stop words (stopwords token filter):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"    // use the built-in English stop word list
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
PUT /my_index6
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": ["restroom", "is"]    // stop "restroom" and "is"
        }
      }
    }
  }
}
GET my_index6/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
3. Tokenizer
The tokenizer does the actual splitting into terms.
ES 7.6 ships with 15 built-in tokenizers. Here are four:
① standard analyzer: the default. Its Chinese support is poor, as it splits Chinese character by character. (In practice, you generally use either the Chinese IK analyzer or this default.)
1) max_token_length: the maximum token length. If a token exceeds this length, it is split every max_token_length characters. Defaults to 255. (A configuration sketch follows the example below.)
GET /my_index/_analyze
{
  "text": "*如此多娇,小姐姐哪里可以撩",
  "analyzer": "standard"
}
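As promised above, a minimal configuration sketch for max_token_length (the index name my_index2 and the limit of 5 are illustrative, not from the original): with a limit of 5, a six-letter token such as jumped should be split into jumpe and d.
PUT /my_index2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_std_analyzer": {
          "type": "standard",
          "max_token_length": 5    // split any token longer than 5 characters
        }
      }
    }
  }
}
GET /my_index2/_analyze
{
  "analyzer": "my_std_analyzer",
  "text": "The quick brown foxes jumped"
}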
② Pattern Tokenizer: splits text into terms at separators matched by a regular expression.
③ Simple Pattern Tokenizer: matches the terms themselves with a regular expression; faster than the Pattern Tokenizer.
④ whitespace analyzer: splits on whitespace only, so a term like Tim_cookie stays intact (see the sketch below).
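A quick sketch of the whitespace analyzer (the sample text is illustrative): because it splits on whitespace only, Tim_cookie survives as a single token.
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Tim_cookie likes good-looking UI"
}
The expected tokens are roughly Tim_cookie, likes, good-looking, UI.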
Custom analyzers
From the sections above we know an analyzer consists of roughly three parts. Let's customize each of them and assemble the pieces into one analyzer:
PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is", "in", "at", "the", "a", "for"]
        }
      },
      "tokenizer": {
        // split into terms on punctuation, matched by regex
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          // Setting type to custom tells Elasticsearch we are defining a
          // custom analyzer. Compare this with configuring a built-in
          // analyzer, where type is set to the analyzer's name, such as
          // standard or simple.
          "type": "custom",
          "char_filter": [
            "html_strip",
            "test_char_filter"
          ],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "test_stopwords"]
        }
      }
    }
  }
}
GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}
The _analyze API
Specify the standard analyzer and tokenize some English text:
GET /_analyze
{
  "analyzer": "standard",
  "text": "hello world"
}
Result:
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Next, tokenize some Chinese text:
GET /_analyze
{
  "analyzer": "standard",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}
This is clearly not the tokenization we want for Chinese.
Chinese tokenization
The difficulty with Chinese is that, unlike English where spaces act as separators, there are no explicit word boundaries; splitting in the wrong place produces ambiguity.
For example:
我/爱/炒肉丝
我/爱/炒/肉丝
Common Chinese tokenizers include IK, jieba, and THULAC; IK is the recommended one.
IK Analyzer is an open-source, lightweight Chinese tokenization toolkit written in Java. Since version 1.0 was released in December 2006, IK Analyzer has gone through three major versions. Originally it was a Chinese tokenization component built around the open-source Lucene project, combining dictionary-based tokenization with grammar analysis. IK Analyzer 3.0 then evolved into a standalone Java tokenization component, independent of Lucene while still providing an optimized default integration for it. It uses a proprietary "forward iterative finest-granularity splitting algorithm" with a throughput of about 800,000 characters per second, and a multi-subprocessor analysis model that supports Latin letters (IP addresses, emails, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese vocabulary (personal and place names), among others. Its dictionary storage is optimized for a small memory footprint.
IK tokenizer plugin for Elasticsearch: https://github.com/medcl/elasticsearch-analysis-ik
#Installation: unzip the downloaded elasticsearch-analysis-ik-7.6.2.zip into the elasticsearch/plugins/ik directory.
mkdir es/plugins/ik
cp elasticsearch-analysis-ik-7.6.2.zip ./es/plugins/ik
#unzip (the plugin version must match your ES version)
cd es/plugins/ik && unzip elasticsearch-analysis-ik-7.6.2.zip
#restart es
./bin/elasticsearch
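After the restart, a simple way to confirm the plugin loaded (not in the original; the exact output columns vary by version):
GET /_cat/plugins?v
The response should include an entry for the IK analysis plugin (typically named analysis-ik).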
Testing the tokenization
IK tokenizes at two granularities:
1. ik_max_word: fine-grained
2. ik_smart: coarse-grained
ik_max_word (most commonly used)
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
ik_smart
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
As you can see, the tokenization is now quite reasonable.
Verifying full-text search with Chinese tokenization
Create the itcast2 index, with the hobby field configured to use the IK analyzer:
PUT /itcast2
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      },
      "mail": {
        "type": "keyword"
      },
      "hobby": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}
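As a quick sanity check (not in the original), you can confirm that hobby is mapped with ik_max_word before indexing any data:
GET /itcast2/_mapping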
Bulk-insert some data:
POST /itcast2/_bulk
{"index":{"_index":"itcast2"}}
{"name":"张三","age": 20,"mail": "[email protected]","hobby":"羽毛球、乒乓球、足球"}
{"index":{"_index":"itcast2"}}
{"name":"李四","age": 21,"mail": "[email protected]","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{"_index":"itcast2"}}
{"name":"王五","age": 22,"mail": "[email protected]","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{"_index":"itcast2"}}
{"name":"赵六","age": 23,"mail": "[email protected]","hobby":"跑步、游泳、篮球"}
{"index":{"_index":"itcast2"}}
{"name":"孙七","age": 24,"mail": "[email protected]","hobby":"听音乐、看电影、羽毛球"}
First, look at how “羽毛球” is tokenized:
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "羽毛球"
}
Result:
{
  "tokens" : [
    {
      "token" : "羽毛球",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "羽毛",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "球",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    }
  ]
}
Then run a full-text match query for “羽毛球” against the hobby field:
GET /itcast2/_search
{
  "query": {
    "match": {
      "hobby": "羽毛球"
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}
Result:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 0.9608413,
    "hits" : [
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "VtQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.9608413,
        "_source" : {
          "name" : "张三",
          "age" : 20,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、乒乓球、足球"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、乒乓<em>球</em>、足球"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "V9QRLXIB0mJFQ4Okj0ar",
        "_score" : 0.9134824,
        "_source" : {
          "name" : "李四",
          "age" : 21,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、乒乓球、足球、篮球"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、乒乓<em>球</em>、足球、篮球"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "WNQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.80493593,
        "_source" : {
          "name" : "王五",
          "age" : 22,
          "mail" : "[email protected]",
          "hobby" : "羽毛球、篮球、游泳、听音乐"
        },
        "highlight" : {
          "hobby" : [
            "<em>羽毛球</em>、篮球、游泳、听音乐"
          ]
        }
      },
      {
        "_index" : "itcast2",
        "_type" : "_doc",
        "_id" : "WtQRLXIB0mJFQ4Okj0ar",
        "_score" : 0.80493593,
        "_source" : {
          "name" : "孙七",
          "age" : 24,
          "mail" : "[email protected]",
          "hobby" : "听音乐、看电影、羽毛球"
        },
        "highlight" : {
          "hobby" : [
            "听音乐、看电影、<em>羽毛球</em>"
          ]
        }
      }
    ]
  }
}
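Note in the highlights that the lone character 球 inside 乒乓球 also matched, because the query string 羽毛球 was itself analyzed with ik_max_word into 羽毛球/羽毛/球. A common refinement, shown here only as a sketch (the index name itcast3 is illustrative, not from the original), is to index fine-grained with ik_max_word but analyze queries with the coarser ik_smart via the search_analyzer mapping parameter:
PUT /itcast3
{
  "mappings": {
    "properties": {
      "hobby": {
        "type": "text",
        "analyzer": "ik_max_word",       // fine-grained tokens at index time
        "search_analyzer": "ik_smart"    // coarser tokens at query time
      }
    }
  }
}
This keeps the index rich in tokens while making queries less noisy.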