Elasticsearch (Part 2)
1. Analysis and analyzers
Analysis is the process of converting full text into a series of terms, also known as tokenization. Analysis is performed by an analyzer; you can use one of the analyzers built into Elasticsearch or define your own. Besides converting text into terms when documents are written, the same analyzer should also be applied to the query string at search time.
An analyzer is made up of three parts. Take the following text as an example:
Hello a World, the world is beautiful
1. Character Filter: strips markup such as HTML tags from the text.
2. Tokenizer: splits the text into tokens according to a set of rules; for English this essentially means splitting on whitespace.
3. Token Filter: removes stop words (a, an, the, is, ...) and lowercases the tokens.
![analysis](images2/analysis.png)
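To make the three stages concrete, here is a minimal sketch of a custom analyzer that chains an html_strip character filter, the standard tokenizer, and the lowercase and stop token filters (the index name my_index and the analyzer name my_custom_analyzer are illustrative):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>Hello a World, the world is beautiful</p>"
}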
1.1 Built-in analyzers
Analyzer | Behavior
---|---
Standard Analyzer | The default; splits on word boundaries and lowercases
Simple Analyzer | Splits on anything that is not a letter (symbols are discarded) and lowercases
Stop Analyzer | Lowercases and removes stop words (the, a, this, ...)
Whitespace Analyzer | Splits on whitespace; does not lowercase
Keyword Analyzer | No tokenization; the whole input is emitted as a single token
Pattern Analyzer | Splits on a regular expression, \W+ (non-word characters) by default
1.2 Built-in analyzer examples
A. Standard Analyzer
GET _analyze
{
"analyzer": "standard",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
B. Simple Analyzer
GET _analyze
{
"analyzer": "simple",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
C. Stop Analyzer
GET _analyze
{
"analyzer": "stop",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
D. Whitespace Analyzer
GET _analyze
{
"analyzer": "whitespace",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
E. Keyword Analyzer
GET _analyze
{
"analyzer": "keyword",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
F. Pattern Analyzer
GET _analyze
{
"analyzer": "pattern",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
1.3 Chinese word segmentation
Chinese word segmentation is a hard problem for every search engine. A Chinese sentence has to be split into individual words, and the same sentence can be understood differently depending on context, for example:
这个苹果,不大好吃 / 这个苹果,不大,好吃 ("this apple is not very tasty" vs. "this apple is small, and tasty")
1.3.1 The IK analyzer
The IK analyzer supports custom dictionaries and hot reloading of the segmentation dictionary; the project is at https://github.com/medcl/elasticsearch-analysis-ik. It can be installed directly with the plugin tool:
elasticsearch-plugin.bat install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
Manual installation steps:
- Download the zip from https://github.com/medcl/elasticsearch-analysis-ik/releases
- Create a directory named analysis-ik under the Elasticsearch plugins directory and unzip the package into it
- From the command prompt, go to the Elasticsearch bin directory and run elasticsearch-plugin.bat list to verify that the plugin is listed
The IK plugin provides the following analyzers (compared in section 1.4):
- ik_smart, coarse-grained segmentation
- ik_max_word, fine-grained segmentation
1.3.2 HanLP
Installation steps:
- Download the zip from https://pan.baidu.com/s/1mFPNJXgiTPzZeqEjH_zifw#list/path=%2F (password: i0o7)
- Create a directory named analysis-hanlp under the Elasticsearch plugins directory and unzip the package into it
- Download the dictionaries from https://github.com/hankcs/HanLP/releases
- Delete the data directory under analysis-hanlp, then unzip the dictionaries into the analysis-hanlp directory
The HanLP plugin provides the following analyzers:
- hanlp, the default segmentation
- hanlp_standard, standard segmentation
- hanlp_index, index-oriented segmentation
- hanlp_nlp, NLP segmentation
- hanlp_n_short, N-shortest-path segmentation
- hanlp_dijkstra, shortest-path segmentation
- hanlp_speed, high-speed dictionary-based segmentation
1.3.3 The pinyin analyzer
Installation steps:
- Download the zip from https://github.com/medcl/elasticsearch-analysis-pinyin/releases
- Create a directory named analysis-pinyin under the Elasticsearch plugins directory and unzip the package into it
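Once installed, the plugin registers an analyzer named pinyin that can be tried directly with _analyze (a quick check; the exact tokens depend on the plugin version and its default parameters):

GET _analyze
{
  "analyzer": "pinyin",
  "text": "刘德华"
}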
1.4 Chinese segmentation examples
ik_smart
GET _analyze
{
"analyzer": "ik_smart",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
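For comparison, ik_max_word segments the same sentence at a finer granularity, producing more (and overlapping) terms:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}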
hanlp
GET _analyze
{
"analyzer": "hanlp",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
hanlp_standard
GET _analyze
{
"analyzer": "hanlp_standard",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
hanlp_speed
GET _analyze
{
"analyzer": "hanlp_speed",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
1.5 Using analyzers in practice
Quite a few analyzers have been listed above; how are they actually used?
1.5.1 Setting the mapping
To use an analyzer, first specify which analyzer should be applied to which field, as shown below:
PUT customers
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_standard"
}
}
}
}
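To confirm that the field is analyzed as intended, _analyze can be pointed at the mapped field (a quick check; the sample text is arbitrary):

GET customers/_analyze
{
  "field": "content",
  "text": "进行找回密码操作"
}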
1.5.2 Indexing data
POST customers/_bulk
{"index":{}}
{"content":"如不能登录,请在百端登录百度首页,点击【登录遇到问题】,进行找回密码操作"}
{"index":{}}
{"content":"网盘客户端访问隐藏空间需要输入密码方可进入。"}
{"index":{}}
{"content":"剑桥的网盘不好用"}
1.5.3 Querying
GET customers/_search
{
"query": {
"match": {
"content": "密码"
}
}
}
1.6 Searching with pinyin
During search we may want to look things up by pinyin. The pinyin analyzer was introduced in the Chinese segmentation section above; how is it used in practice?
1.6.1 Index settings
PUT /medcl
{
"settings" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_separate_first_letter" : false,
"keep_full_pinyin" : true,
"keep_original" : true,
"limit_first_letter_length" : 16,
"lowercase" : true,
"remove_duplicated_term" : true
}
}
}
}
}
As shown above, we define an analyzer named pinyin_analyzer on top of a customized pinyin tokenizer. The available parameters are documented at https://github.com/medcl/elasticsearch-analysis-pinyin
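A quick sanity check of the new analyzer (the token output depends on the parameters chosen above):

GET medcl/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华"
}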
1.6.2 Setting the mapping
PUT medcl/_mapping
{
"properties": {
"name": {
"type": "keyword",
"fields": {
"pinyin": {
"type": "text",
"analyzer": "pinyin_analyzer",
"boost": 10
}
}
}
}
}
1.6.3 Indexing data
POST medcl/_bulk
{"index":{}}
{"name": "刘德华"}
{"index":{}}
{"name": "张学友"}
{"index":{}}
{"name": "四大天王"}
{"index":{}}
{"name": "柳岩"}
{"index":{}}
{"name": "angel baby"}
1.6.4 Querying
GET medcl/_search
{
"query": {
"match": {
"name.pinyin": "ldh"
}
}
}
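The query above matches 刘德华 by its first-letter abbreviation ldh. Because keep_full_pinyin and keep_original are enabled, the full pinyin and the original Chinese should match as well; a sketch, with the caveat that behaviour depends on the plugin version and the parameters above:

GET medcl/_search
{
  "query": {
    "match": {
      "name.pinyin": "liudehua"
    }
  }
}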
1.7 Mixed Chinese and pinyin search
1.7.1 Index settings
PUT goods
{
"settings": {
"analysis": {
"analyzer": {
"hanlp_standard_pinyin":{
"type": "custom",
"tokenizer": "hanlp_standard",
"filter": ["my_pinyin"]
}
},
"filter": {
"my_pinyin": {
"type" : "pinyin",
"keep_separate_first_letter" : false,
"keep_full_pinyin" : true,
"keep_original" : true,
"limit_first_letter_length" : 16,
"lowercase" : true,
"remove_duplicated_term" : true
}
}
}
}
}
1.7.2 Setting the mapping
PUT goods/_mapping
{"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_standard_pinyin"
}
}
}
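The combined analyzer can be inspected before any data is indexed; the tokens should include both the HanLP words and their pinyin forms (a quick check, assuming the plugins above are installed):

GET goods/_analyze
{
  "analyzer": "hanlp_standard_pinyin",
  "text": "找回密码操作"
}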
1.7.3 Indexing data
POST goods/_bulk
{"index":{}}
{"content":"如不能登录,请在百端登录百度首页,点击【登录遇到问题】,进行找回密码操作"}
{"index":{}}
{"content":"网盘客户端访问隐藏空间需要输入密码方可进入。"}
{"index":{}}
{"content":"剑桥的网盘不好用"}
1.7.4 Querying
GET goods/_search
{
"query": {
"match": {
"content": "caozuo"
}
},
"highlight": {
"pre_tags": "<em>",
"post_tags": "</em>",
"fields": {
"content": {}
}
}
}
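Here the pinyin string caozuo is expected to match 操作 in the first document, and the highlighter wraps the matched words in <em> tags. Since keep_original is enabled, searching the Chinese word directly works the same way:

GET goods/_search
{
  "query": {
    "match": {
      "content": "密码"
    }
  }
}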
2. Integrating Spring Boot with Elasticsearch
2.1 Adding the dependency
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
2.2 Configuration
spring:
elasticsearch:
rest:
uris: http://localhost:9200
2.3 Obtaining an ElasticsearchTemplate
Note that the configuration below builds a TransportClient against port 9300 (the transport protocol), which is independent of the REST URI configured above.
@Configuration
public class ElasticsearchConfig extends ElasticsearchConfigurationSupport {
@Bean
public Client elasticsearchClient() throws UnknownHostException {
Settings settings = Settings.builder().put("cluster.name", "my-application").build();
TransportClient client = new PreBuiltTransportClient(settings);
client.addTransportAddress(new TransportAddress(InetAddress.getByName("127.0.0.1"), 9300));
return client;
}
@Bean(name = {"elasticsearchOperations", "elasticsearchTemplate"})
public ElasticsearchTemplate elasticsearchTemplate() throws UnknownHostException {
return new ElasticsearchTemplate(elasticsearchClient(), entityMapper());
}
// use the ElasticsearchEntityMapper
@Bean
@Override
public EntityMapper entityMapper() {
ElasticsearchEntityMapper entityMapper = new ElasticsearchEntityMapper(elasticsearchMappingContext(),
new DefaultConversionService());
entityMapper.setConversions(elasticsearchCustomConversions());
return entityMapper;
}
}
2.4 Defining the POJO
@Document(indexName = "movies", type = "_doc")
public class Movie {
private String id;
private String title;
private Integer year;
private List<String> genre;
// setters and getters
}
2.5 Queries
A. Paged query
// Paged query; note that PageRequest.of uses zero-based page numbers
@RequestMapping("/page")
public Object pageQuery(
@RequestParam(required = false, defaultValue = "10") Integer size,
@RequestParam(required = false, defaultValue = "1") Integer page) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withPageable(PageRequest.of(page, size))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
B. Range query
// Single-condition range query: find all movies released between 2016 and 2018
@RequestMapping("/range")
public Object rangeQuery() {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new RangeQueryBuilder("year").from(2016).to(2018))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
C. Match query
// Single-field match query: the title only needs to contain one of the search terms
@RequestMapping("/match")
public Object singleCriteriaQuery(String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MatchQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
D. Multi-condition paged query
@RequestMapping("/match/multiple")
public Object multiplePageQuery(
@RequestParam(required = true) String searchText,
@RequestParam(required = false, defaultValue = "10") Integer size,
@RequestParam(required = false, defaultValue = "1") Integer page) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(
new BoolQueryBuilder()
.must(new MatchQueryBuilder("title", searchText))
.must(new RangeQueryBuilder("year").from(2016).to(2018))
).withPageable(PageRequest.of(page, size))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
E. Multi-condition OR query
// Multi-condition OR query: either clause may match (should clauses)
@RequestMapping("/match/or/multiple")
public Object multipleOrQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(
new BoolQueryBuilder()
.should(new MatchQueryBuilder("title", searchText))
.should(new RangeQueryBuilder("year").from(2016).to(2018))
).build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
F. Exact match on a single term
// Matches titles that contain the given term; the query text is not analyzed, so it must be a single term
@RequestMapping("/term")
public Object termQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new TermQueryBuilder("title", searchText)).build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
Exact match on several terms
// Matches titles that contain any of the given terms
@RequestMapping("/terms")
public Object termsQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new TermsQueryBuilder("title", searchText.split("\\s+"))).build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
G. Phrase match
@RequestMapping("/phrase")
public Object phraseQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MatchPhraseQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
H. Returning only selected fields
@RequestMapping("/source")
public Object sourceQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withSourceFilter(new FetchSourceFilter(
new String[]{"title", "year", "id"}, new String[]{}))
.withQuery(new MatchPhraseQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
I. Multi-field match
@RequestMapping("/multiple/field")
public Object allTermsQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MultiMatchQueryBuilder(searchText, "title", "genre")
.type(MultiMatchQueryBuilder.Type.MOST_FIELDS))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
J. Requiring all words to match
// All words in the query string must appear in the title (AND operator)
@RequestMapping("/also/include")
public Object alsoInclude(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new QueryStringQueryBuilder(searchText)
.field("title").defaultOperator(Operator.AND))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
3. Importing MySQL data with Logstash
input {
jdbc {
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/es?useSSL=false&serverTimezone=UTC"
jdbc_user => "es"
jdbc_password => "123456"
# Enable value tracking; if true, tracking_column must be specified
use_column_value => false
# The column to track
tracking_column => "id"
# Type of the tracked column; only numeric and timestamp are supported, numeric is the default
tracking_column_type => "numeric"
# Record the value of the last run
record_last_run => true
# Where the last-run value is stored
last_run_metadata_path => "mysql-position.txt"
statement => "SELECT * FROM news where tags is not null"
# Run every day at 17:57 (seconds minutes hours day month weekday)
schedule => "0 57 17 * * *"
}
}
filter {
mutate {
split => { "tags" => ","}
}
}
output {
elasticsearch {
document_id => "%{id}"
document_type => "_doc"
index => "news"
hosts => ["http://localhost:9200"]
}
stdout{
codec => rubydebug
}
}
4. A search example
4.1 Custom analyzer
PUT news
{
"settings": {
"analysis": {
"analyzer": {
"hanlp_standard_pinyin":{
"type": "custom",
"tokenizer": "hanlp_standard",
"filter": ["my_pinyin"]
}
},
"filter": {
"my_pinyin": {
"type" : "pinyin",
"keep_separate_first_letter" : false,
"keep_full_pinyin" : true,
"keep_original" : true,
"limit_first_letter_length" : 16,
"lowercase" : true,
"remove_duplicated_term" : true
}
}
}
}
}
4.2 Defining the mapping
PUT news/_mapping
{
"dynamic": false,
"properties": {
"id": {
"type": "long"
},
"title": {
"type": "text",
"analyzer": "hanlp_standard"
},
"content": {
"type": "text",
"analyzer": "hanlp_standard"
},
"tags": {
"type": "completion",
"analyzer": "hanlp_standard",
"fields": {
"tag_pinyin": {
"type": "completion",
"analyzer": "hanlp_standard_pinyin"
}
}
}
}
}
4.3 Importing the MySQL dataset
D:\logstash-datas\bin>logstash.bat -f ../config/logstash-mysql.conf
The Logstash configuration is the one from section 3; the database script is news.sql.
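With the data imported, the completion fields defined in 4.2 can back a completion suggester. The request below is a minimal sketch (the suggester name tag_suggest and the prefix value are illustrative; the field names follow the mapping above):

POST news/_search
{
  "suggest": {
    "tag_suggest": {
      "prefix": "mei",
      "completion": {
        "field": "tags.tag_pinyin",
        "skip_duplicates": true,
        "size": 10
      }
    }
  }
}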
Appendix:
- When defining a mapping you can set "dynamic": false, which means that fields not declared in the mapping are still stored with the document when data is imported, but they are not indexed and therefore not searchable.
- When using suggesters, "skip_duplicates": true means that duplicate suggestions are collapsed so that only one of them is returned.