4、Analyzers in ElasticSearch
1、Analysis and Analyzer
Analysis (text analysis) is the process of converting full text into a series of terms (tokens); it is also called tokenization. Analysis is carried out by an Analyzer; you can use one of ElasticSearch's built-in analyzers or define a custom analyzer as needed.
Besides being applied when documents are written and converted into terms, the same analyzer must also be used to analyze the query string when matching Query statements.
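As a quick illustration of where analyzers are configured (a sketch, not from the original text): in a field mapping, analyzer sets the index-time analyzer and search_analyzer the query-time one; the index name my_index and the field title below are hypothetical.
# analyzer is applied at index time, search_analyzer at query time (both standard here)
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}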
An Analyzer is made up of three parts (a sketch combining all three follows this list):
- Character Filters: preprocess the raw text, e.g. stripping HTML
- Tokenizer: splits the text into terms according to rules
- Token Filters: post-process the terms, e.g. lowercasing, removing stop words, adding synonyms
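A minimal sketch of how the three parts are combined into a custom analyzer in the index settings; the names my_index and my_analyzer are made up for illustration, while html_strip, standard, lowercase and stop are built-in components:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}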
2、ElasticSearch's Built-in Analyzers
Prerequisite: the _analyze API.
For example, analyze text with the default analyzer:
POST _analyze
{
"text": ["I'm studing now"],
"analyzer": "standard"
}
## Response
{
"tokens" : [
{
"token" : "i'm",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "studing",
"start_offset" : 4,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "now",
"start_offset" : 12,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
2.1、Standard Analyzer
The default analyzer; splits text into words and lowercases terms.
- Tokenizer: standard
- Token Filters
  - standard
  - lowercase
  - stop (disabled by default; see the sketch after this list for how to enable it)
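The disabled stop filter can be turned on by configuring the standard analyzer with its stopwords parameter. A minimal sketch, assuming a hypothetical index my_index and analyzer name std_with_stop; _english_ is the built-in English stop word list:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_with_stop": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}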
2.2、Simple Analyzer
- Splits on any character that is not a letter; the non-letter characters are discarded
- Lowercases terms
- Tokenizer: lowercase
Example: analyze "I'm studying 11"
POST _analyze
{
"text": ["I'm studying 11"],
"analyzer": "simple"
}
## Response
{
"tokens" : [
{
"token" : "i",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "m",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "studying",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}
]
}
2.3、Stop Analyzer
Compared with the Simple Analyzer, it adds a stop token filter, which removes stop words such as the/a/is/in.
- Tokenizer: lowercase
- Token Filters: stop
Example: analyze "I'm studying in the room"
POST _analyze
{
"text": ["I'm studying in the room"],
"analyzer": "stop"
}
## Response
{
"tokens" : [
{
"token" : "i",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "m",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "studying",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
},
{
"token" : "room",
"start_offset" : 20,
"end_offset" : 24,
"type" : "word",
"position" : 5
}
]
}
2.4、WhiteSpace Analyzer
Splits text on whitespace.
Tokenizer: whitespace
Example: analyze "I'm studying in the room"
POST _analyze
{
"text": ["I'm studying in the room"],
"analyzer": "whitespace"
}
## Response
{
"tokens" : [
{
"token" : "I'm",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "studying",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "in",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "the",
"start_offset" : 16,
"end_offset" : 19,
"type" : "word",
"position" : 3
},
{
"token" : "room",
"start_offset" : 20,
"end_offset" : 24,
"type" : "word",
"position" : 4
}
]
}
2.5、Keyword Analyzer
Does not tokenize; the entire input is output as a single term.
Tokenizer: keyword
Example: analyze "I'm studying in the room"
POST _analyze
{
"text": ["I'm studying in the room"],
"analyzer": "keyword"
}
## Response
{
"tokens" : [
{
"token" : "I'm studying in the room",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 0
}
]
}
2.6、Pattern Analyzer
- Splits text into terms using a regular expression (an example request follows this list)
- The default pattern is \W+, i.e. it splits on non-word characters
- Tokenizer: pattern
- Token Filters: lowercase / stop (stop is disabled by default)
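The pattern analyzer has no example above, so here is a request you can try yourself. With the default \W+ pattern, the apostrophe and the spaces act as split points, so the expected terms would be roughly i, m, studying, in, the, room (lowercased; the stop filter is disabled by default):
POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "pattern"
}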
2.7、English Analyzer
A language analyzer. For English text it stems terms (e.g. studying → studi) and removes English stop words.
Example: analyze "I'm studying in the room"
POST _analyze
{
"text": ["I'm studying in the room"],
"analyzer": "english"
}
## Response
{
"tokens" : [
{
"token" : "i'm",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "studi",
"start_offset" : 4,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "room",
"start_offset" : 20,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
2.8、Chinese Word Segmentation
Difficulties of Chinese word segmentation:
- A Chinese sentence must be split into words, not into individual characters
- The same Chinese sentence can be understood differently in different contexts
Commonly used Chinese analyzers:
- ICU Analyzer: provides Unicode support and better handling of Asian languages
- IK: supports custom dictionaries and hot updates of the segmentation dictionary
- THULAC
This walkthrough uses the IK analysis plugin for Chinese segmentation; how to install the plugin is described in detail in the next section, so it is not repeated here.
The IK plugin provides two analyzers: ik_smart and ik_max_word.
Example: analyze "这个苹果不大好吃" with ik_max_word
POST _analyze
{
"text": ["这个苹果不大好吃"],
"analyzer": "ik_max_word"
}
## Response
{
"tokens" : [
{
"token" : "这个",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "苹果",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "不大好",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "不大",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "大好",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "好吃",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 5
}
]
}
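For comparison, the same sentence can be sent to ik_smart, which produces a coarser-grained segmentation than ik_max_word; the exact terms depend on the installed IK dictionary version, so no response is reproduced here:
POST _analyze
{
  "text": ["这个苹果不大好吃"],
  "analyzer": "ik_smart"
}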
2.9、Custom Analysis
My understanding: you specify your own combination of Character Filters, Tokenizer, and Token Filters so that analysis meets your own requirements.
Requirement: for example, "do not tokenize; output the input as a single term in lowercase"
POST _analyze
{
"tokenizer": "keyword",
"filter": ["lowercase"],
"text": ["Mastering Elasticsearch"]
}
## Response
{
"tokens" : [
{
"token" : "Mastering Elasticsearch",
"start_offset" : 0,
"end_offset" : 23,
"type" : "word",
"position" : 0
}
]
}
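To reuse this keyword + lowercase combination when indexing documents, it can be registered as a named custom analyzer in the index settings. A minimal sketch; the index name my_index and the analyzer name lowercase_keyword are assumptions:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}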