Elasticsearch keyword search with & and #

程序员文章站 2022-07-09 12:40:34

Original link: https://www.hexianwei.com/2019/04/21/关键字搜索/

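Project background: the project needs a keyword search feature backed by Elasticsearch (ES). In the search box, & means logical AND and # means logical OR. Expressions are evaluated from left to right, and & and # have no precedence relative to each other. Parentheses can be used to set precedence explicitly.

Examples (万科 is the company Vanke, 房地产 means real estate, 科技 means technology):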

万科#房地产  =>  万科 | 房地产
万科&房地产  =>  万科 & 房地产
万科#房地产&科技  =>  万科&科技 | 房地产&科技
万科&房地产#科技  =>  万科&房地产 | 科技
万科&(房地产#科技)  =>  万科&房地产 | 万科&科技

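At first glance this looks simple, but because # and & have no precedence relative to each other, the usual approach for parsing arithmetic expressions does not apply directly.

PS: getting the overall feature into a usable state took most of a day.

Update 2019-05-06 22:02:53: fixed a bug in the parentheses logic.

Techniques: regular expressions, a stack, splitting, recursion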

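Solution

The final result is used to search ES. As the parsed results above show, every expression is broken down into alternatives joined by |.

For example, 万科#房地产 parses to 万科 | 房地产, which in ES becomes multiple bool should clauses.

For 万科#房地产&科技 the parsed result is 万科&科技 | 房地产&科技, which in ES is two bool should clauses, each containing two must clauses.

So the approach is to parse the user's input into a List<List<String>>. The outer list represents |, and each inner list represents &.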

//万科#房地产&科技
[[万科, 科技], [房地产, 科技]]

//万科&房地产#科技
[[万科, 房地产], [科技]]

//万科&(房地产#科技)
[[万科,房地产],[万科,科技]]

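The problem splits into three cases:

Only & operators
With parentheses
Without parentheses

Only & operators

If the keywords are combined with & only, they are all ANDed together, so it is enough to split on & and return the result.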

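String[] strings = params.split("#");
// If the input contains only & operators, splitting on # yields exactly one element.
if (strings.length == 1) {
    // Strip the parentheses: with only & they carry no meaning.
    // String is immutable, so the result has to be assigned back.
    params = params.replaceAll("(\\()|(\\))", "");

    strings = params.split("&");
    List<String> list = new ArrayList<>();
    for (String string : strings) {
        list.add(string);
    }
    resultList.add(list);
    return resultList;
}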

Example:

万科&(房地产&科技)

// parsed result
[[万科, 房地产, 科技]]

// ES query
{
    "bool": {
        "should": [
            {
                "bool": {
                    "must": [
                        {
                            "multi_match": {
                                "query": "万科",
                                "minimum_should_match": "100%",
                                "fields": [
                                    "company_name_cn",
                                    "company_name_en",
                                    "company_shortname_cn",
                                    "company_shortname_en",
                                    "company_description_cn",
                                    "product_name"
                                ],
                                "type": "best_fields"
                            }
                        },
                        {
                            "multi_match": {
                                "query": "房地产",
                                "minimum_should_match": "100%",
                                "fields": [
                                    "company_name_cn",
                                    "company_name_en",
                                    "company_shortname_cn",
                                    "company_shortname_en",
                                    "company_description_cn",
                                    "product_name"
                                ],
                                "type": "best_fields"
                            }
                        },
                        {
                            "multi_match": {
                                "query": "科技",
                                "minimum_should_match": "100%",
                                "fields": [
                                    "company_name_cn",
                                    "company_name_en",
                                    "company_shortname_cn",
                                    "company_shortname_en",
                                    "company_description_cn",
                                    "product_name"
                                ],
                                "type": "best_fields"
                            }
                        }
                    ]
                }
            }
        ]
    }
}
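
The List<List<String>> result maps onto this query shape mechanically: one should clause per inner list and one must clause per keyword. The following is a minimal sketch of that mapping, not the article's actual query-building code. The class BoolQuerySketch and its methods are hypothetical names, the query is built as plain nested maps rather than through an ES client API, and the keywords in main are ASCII stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the article's EsKeyWords class.
final class BoolQuerySketch {

    // Field names copied from the example query in this article.
    static final List<String> FIELDS = Arrays.asList(
            "company_name_cn", "company_name_en",
            "company_shortname_cn", "company_shortname_en",
            "company_description_cn", "product_name");

    // Builds {key: value} as an insertion-ordered map.
    static Map<String, Object> wrap(String key, Object value) {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put(key, value);
        return map;
    }

    // One multi_match clause for a single keyword.
    static Map<String, Object> multiMatch(String keyword) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("query", keyword);
        body.put("minimum_should_match", "100%");
        body.put("fields", FIELDS);
        body.put("type", "best_fields");
        return wrap("multi_match", body);
    }

    // Outer list -> should clauses, inner list -> must clauses.
    static Map<String, Object> toBoolQuery(List<List<String>> groups) {
        List<Object> should = new ArrayList<>();
        for (List<String> group : groups) {
            List<Object> must = new ArrayList<>();
            for (String keyword : group) {
                must.add(multiMatch(keyword));
            }
            should.add(wrap("bool", wrap("must", must)));
        }
        return wrap("bool", wrap("should", should));
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.println(toBoolQuery(groups));
    }
}
```

The nested map can then be serialized with whatever JSON library the project already uses; that step is left out here.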

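Without parentheses

If the input contains only #, it is easy to handle. A mix of # and & is more troublesome. There is no separate branch for the all-# case because it is the same situation as the mixed one: both produce multiple should clauses, i.e. the outer list of the List<List<String>> has size > 1.

Take 万科#房地产&科技 as the example.
The final result is [[万科,科技],[房地产,科技]].

The first step is still to split on # into a list.

That gives [[万科],[房地产&科技]].
The inner lists are related by |, and evaluation goes from left to right.
So parsing proceeds from left to right: if an element contains &, it is split on &, and the keywords after the & are multiplied into (appended to) every group before it. That is: [[万科],[房地产&科技]] => [[万科,科技],[房地产,科技]]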

public List<List<String>> noParentheses(String params) {
    String[] strings = params.split("#");
    List<List<String>> resultList = new ArrayList<>();
    for (String s : strings) {
        // the user may type several # in a row, e.g. ##
        if (s.isEmpty()) {
            continue;
        }
        List<String> temp = new ArrayList<>();
        temp.add(s);
        resultList.add(temp);
    }

    // handle segments that contain &: this is the "multiply" step
    for (int i = 0; i < resultList.size(); i++) {
        List<String> tempList = resultList.get(i);
        if (tempList.get(0).contains("&")) {
            String[] strs = tempList.get(0).split("&");
            // every keyword after the first & also applies to all groups to the left
            for (int j = 0; j < i; j++) {
                List<String> temp = resultList.get(j);
                for (int k = 1; k < strs.length; k++) {
                    temp.add(strs[k]);
                }
            }
            // the segment itself becomes one AND group of all its keywords
            tempList.clear();
            for (String str : strs) {
                tempList.add(str);
            }
        }
    }
    return resultList;
}

万科#房地产&科技

// parsed result
[[万科, 科技], [房地产, 科技]]


// ES query
{
    "bool": {
        "should": [
            {
                "bool": {
                    "must": [
                        {
                            "multi_match": {
                                "query": "万科",
                                "minimum_should_match": "100%",
                                "fields": [
                                    "company_name_cn",
                                    "company_name_en",
                                    "company_shortname_cn",
                                    "company_shortname_en",
                                    "company_description_cn",
                                    "product_name"
                                ],
                                "type": "best_fields"
                            }
                        },
                        {
                            "multi_match": {
                                "query": "科技",
                                "minimum_should_match": "100%",
                                "fields": [
                                    "company_name_cn",
                                    "company_name_en",
                                    "company_shortname_cn",
                                    "company_shortname_en",
                                    "company_description_cn",
                                    "product_name"
                                ],
                                "type": "best_fields"
                            }
                        }
                    ]
                }
            },
            {
                "bool": {
                    "must": [
                        {
                            "multi_match": {
                                "query": "房地产",
                                "minimum_should_match": "100%",
                                "fields": [
                                    "company_name_cn",
                                    "company_name_en",
                                    "company_shortname_cn",
                                    "company_shortname_en",
                                    "company_description_cn",
                                    "product_name"
                                ],
                                "type": "best_fields"
                            }
                        },
                        {
                            "multi_match": {
                                "query": "科技",
                                "minimum_should_match": "100%",
                                "fields": [
                                    "company_name_cn",
                                    "company_name_en",
                                    "company_shortname_cn",
                                    "company_shortname_en",
                                    "company_description_cn",
                                    "product_name"
                                ],
                                "type": "best_fields"
                            }
                        }
                    ]
                }
            }
        ]
    }
}
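
For input without parentheses, the left-to-right rule can also be written as a single pass over the string: a keyword that follows # starts a new group, and a keyword that follows & is appended to every group built so far. The sketch below illustrates that reading of the rule and is not the article's noParentheses method. The class name LeftToRightFold is hypothetical, and the keywords are ASCII stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's EsKeyWords class.
final class LeftToRightFold {

    static List<List<String>> fold(String params) {
        List<List<String>> groups = new ArrayList<>();
        char pendingOp = '#'; // the first keyword always starts a new group
        int start = 0;
        for (int i = 0; i <= params.length(); i++) {
            boolean atEnd = i == params.length();
            if (!atEnd && params.charAt(i) != '&' && params.charAt(i) != '#') {
                continue; // still inside a keyword
            }
            String keyword = params.substring(start, i);
            // empty keywords (for example from "##") are skipped
            if (!keyword.isEmpty()) {
                if (pendingOp == '#' || groups.isEmpty()) {
                    // OR: open a new group
                    List<String> group = new ArrayList<>();
                    group.add(keyword);
                    groups.add(group);
                } else {
                    // AND: multiply the keyword into every existing group
                    for (List<String> group : groups) {
                        group.add(keyword);
                    }
                }
            }
            if (!atEnd) {
                pendingOp = params.charAt(i);
            }
            start = i + 1;
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(fold("a#b&c")); // [[a, c], [b, c]]
        System.out.println(fold("a&b#c")); // [[a, b], [c]]
    }
}
```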

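With parentheses

If the input contains parentheses, the first step is to validate it, i.e. check that the parentheses are paired.

Validating the parentheses

The logic is simple and uses a stack. Push on a left parenthesis, pop on a right parenthesis, and ignore everything else. At the end, check whether the stack is empty; if it is, the parentheses are paired.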

public boolean isValidKeyWords(String keys) {
    char[] chars = keys.toCharArray();
    Stack stack = new Stack();
    for (char aChar : chars) {
        switch (aChar) {
            case '(':
                stack.push(aChar);
                break;
            case ')':
                if (stack.empty() || !stack.peek().equals('(')) {
                    return false;
                } else {
                    stack.pop();
                }
                break;
            default:
                break;
        }
    }
    return stack.isEmpty();
}
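
Since only one kind of bracket is involved, the same check can be done with a depth counter instead of a stack. This is a minimal sketch of that variant; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of the article's EsKeyWords class.
final class ParenBalance {

    static boolean isBalanced(String keys) {
        int depth = 0; // number of currently open parentheses
        for (int i = 0; i < keys.length(); i++) {
            char c = keys.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth < 0) {
                return false; // a ')' appeared before its '('
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(a#(b&c))#d")); // true
        System.out.println(isBalanced("a)#(b"));       // false
    }
}
```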

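Parsing

Parentheses set the precedence.

For example:

万科&房地产#科技 => 万科&房地产 | 科技

万科&(房地产#科技) => 万科&房地产 | 万科&科技

Core code: on reaching a (, extract the content between the matching parentheses. If that content itself contains a (, call formatParams recursively; otherwise call noParentheses.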

public List<List<String>> hasParentheses(String params) {
    List<List<String>> resultList = new ArrayList<>();
    char[] chars = params.toCharArray();
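    // collects the characters found inside the current parentheses
    List<String> subList = new LinkedList<>();
    int index = 0;
    // true while the scan is inside parentheses
    Boolean flag = Boolean.FALSE;
    // nesting depth; back at 0 when the outermost parenthesis closes
    int flagEnd = 0;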
    for (int i = 0; i < chars.length; i++) {
        switch (chars[i]) {
            case '#':
                if (!flag) {
                    if (index == i) {
                        String subParams = String.valueOf(Arrays.copyOfRange(chars, index + 1, chars.length));
                        resultList.addAll(formatParams(subParams));
                    } else {
                        List<String> tempList = new ArrayList<>();
                        String subParams = String.valueOf(Arrays.copyOfRange(chars, index, i));
                        tempList.add(subParams);
                        resultList.add(tempList);
                        index = i;
                    }
                } else {
                    subList.add(String.valueOf(chars[i]));
                }
                break;
            case '(':
                flagEnd++;
                if (flag) {
                    subList.add(String.valueOf(chars[i]));
                } else {
                    flag = Boolean.TRUE;
                    index = i;
                }
                break;
            case ')':
                flagEnd--;
                if (flag && flagEnd == 0) {
                    flag = Boolean.FALSE;
                    if (!subList.isEmpty()) {
                        String localKey = String.join("", subList);
                        if (localKey.contains("(")) {
                            resultList.addAll(formatParams(localKey));
                        } else {
                            resultList.addAll(noParentheses(localKey));
                        }
                        subList.clear();
                        index = i + 1;
                    }
                }else{
                    subList.add(String.valueOf(chars[i]));
                }
                break;
            default:
                if (flag) {
                    subList.add(String.valueOf(chars[i]));
                }
                break;
        }
    }
    return resultList;
}

Two examples with parentheses (信息 means information):

//(万科#(房地产&科技))#信息
[[万科], [房地产, 科技], [信息]]
//万科#(房地产&科技)
[[万科], [房地产, 科技]]
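
Both examples above combine parentheses with # only. For an input such as 万科&(房地产#科技), the keyword in front of the parenthesis also has to be multiplied into every group inside it, and tracing hasParentheses by hand suggests it does not carry that keyword over. One way to cover that case is a small recursive parser that ORs by concatenating group lists and ANDs by taking their cross product. The sketch below is an alternative formulation under that assumption, not the article's implementation. KeywordExprParser is a hypothetical name, and the keywords are ASCII stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's EsKeyWords class.
final class KeywordExprParser {
    private final String input;
    private int pos = 0;

    private KeywordExprParser(String input) {
        this.input = input;
    }

    // Parses the whole expression into OR-of-AND groups.
    static List<List<String>> parse(String input) {
        KeywordExprParser parser = new KeywordExprParser(input);
        List<List<String>> result = parser.parseExpr();
        if (parser.pos != input.length()) {
            throw new IllegalArgumentException("unexpected ')' in: " + input);
        }
        return result;
    }

    // expr := term (('&' | '#') term)*, combined strictly left to right
    private List<List<String>> parseExpr() {
        List<List<String>> left = parseTerm();
        while (pos < input.length() && input.charAt(pos) != ')') {
            char op = input.charAt(pos++);
            if (op != '&' && op != '#') {
                throw new IllegalArgumentException("expected & or # in: " + input);
            }
            List<List<String>> right = parseTerm();
            left = (op == '&') ? and(left, right) : or(left, right);
        }
        return left;
    }

    // term := keyword | '(' expr ')'
    private List<List<String>> parseTerm() {
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++; // consume '('
            List<List<String>> inner = parseExpr();
            if (pos >= input.length() || input.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing ')' in: " + input);
            }
            pos++; // consume ')'
            return inner;
        }
        int start = pos;
        while (pos < input.length() && "&#()".indexOf(input.charAt(pos)) < 0) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("missing keyword in: " + input);
        }
        List<String> group = new ArrayList<>();
        group.add(input.substring(start, pos));
        List<List<String>> result = new ArrayList<>();
        result.add(group);
        return result;
    }

    // AND: cross product of the two group lists.
    private static List<List<String>> and(List<List<String>> a, List<List<String>> b) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> x : a) {
            for (List<String> y : b) {
                List<String> group = new ArrayList<>(x);
                group.addAll(y);
                out.add(group);
            }
        }
        return out;
    }

    // OR: concatenation of the two group lists.
    private static List<List<String>> or(List<List<String>> a, List<List<String>> b) {
        List<List<String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("a&(b#c)"));     // [[a, b], [a, c]]
        System.out.println(parse("(a#(b&c))#d")); // [[a], [b, c], [d]]
    }
}
```

Because unbalanced parentheses surface as an IllegalArgumentException during parsing, this variant does not need a separate validation pass.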

Full code

import java.util.*;

/**
 * @author beer
 * @date 2019-04-21 20:23
 * @description: ES keyword search
 */
public class EsKeyWords {

    public static void main(String[] args) {
        EsKeyWords esKeyWords = new EsKeyWords();
        System.out.println(esKeyWords.formatParams("(万科#(房地产&科技))#信息"));
        System.out.println(esKeyWords.formatParams("万科#(房地产&科技)"));
    }

    public List<List<String>> formatParams(String params) {
        String[] strings = params.split("#");
        List<List<String>> resultList = new ArrayList<>();
        // only & operators: splitting on # yields a single element
        if (strings.length == 1) {
            params = params.replaceAll("(\\()|(\\))", "");
            strings = params.split("&");
            List list = new ArrayList();
            for (String string : strings) {
                list.add(string);
            }
            resultList.add(list);
            return resultList;
        }
        if (params.contains("(")) {
            // validate that "(" and ")" are paired
            Boolean isValid = isValidKeyWords(params);
            if (!isValid) {
                throw new IllegalArgumentException("illegal keywords : " + params);
            }
            resultList = hasParentheses(params);
        } else {
            resultList = noParentheses(params);
        }
        return resultList;
    }

    public List<List<String>> hasParentheses(String params) {
        List<List<String>> resultList = new ArrayList<>();
        char[] chars = params.toCharArray();
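        // collects the characters found inside the current parentheses
        List<String> subList = new LinkedList<>();
        int index = 0;
        // true while the scan is inside parentheses
        Boolean flag = Boolean.FALSE;
        // nesting depth; back at 0 when the outermost parenthesis closes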
        int flagEnd = 0;
        for (int i = 0; i < chars.length; i++) {
            switch (chars[i]) {
                case '#':
                    if (!flag) {
                        if (index == i) {
                            String subParams = String.valueOf(Arrays.copyOfRange(chars, index + 1, chars.length));
                            resultList.addAll(formatParams(subParams));
                        } else {
                            List<String> tempList = new ArrayList<>();
                            String subParams = String.valueOf(Arrays.copyOfRange(chars, index, i));
                            tempList.add(subParams);
                            resultList.add(tempList);
                            index = i;
                        }
                    } else {
                        subList.add(String.valueOf(chars[i]));
                    }
                    break;
                case '(':
                    flagEnd++;
                    if (flag) {
                        subList.add(String.valueOf(chars[i]));
                    } else {
                        flag = Boolean.TRUE;
                        index = i;
                    }
                    break;
                case ')':
                    flagEnd--;
                    if (flag && flagEnd == 0) {
                        flag = Boolean.FALSE;
                        if (!subList.isEmpty()) {
                            String localKey = String.join("", subList);
                            if (localKey.contains("(")) {
                                resultList.addAll(formatParams(localKey));
                            } else {
                                resultList.addAll(noParentheses(localKey));
                            }
                            subList.clear();
                            index = i + 1;
                        }
                    }else{
                        subList.add(String.valueOf(chars[i]));
                    }
                    break;
                default:
                    if (flag) {
                        subList.add(String.valueOf(chars[i]));
                    }
                    break;
            }
        }
        return resultList;
    }


    public List<List<String>> noParentheses(String params) {
        String[] strings = params.split("#");
        List<List<String>> resultList = new ArrayList<>();
        for (String s : strings) {
            // the user may type several # in a row, e.g. ##
            if (s.isEmpty()) {
                continue;
            }
            List temp = new ArrayList();
            temp.add(s);
            resultList.add(temp);
        }
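        // handle segments that contain &: this is the "multiply" step
        for (int i = 0; i < resultList.size(); i++) {
            List<String> tempList = resultList.get(i);
            if (tempList.get(0).contains("&")) {
                String[] strs = tempList.get(0).split("&");
                // every keyword after the first & also applies to all groups to the left
                for (int j = 0; j < i; j++) {
                    List<String> temp = resultList.get(j);
                    for (int k = 1; k < strs.length; k++) {
                        temp.add(strs[k]);
                    }
                }
                // the segment itself becomes one AND group of all its keywords
                tempList.clear();
                tempList.addAll(Arrays.asList(strs));
            }
        }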
        return resultList;
    }

    public boolean isValidKeyWords(String keys) {
        char[] chars = keys.toCharArray();
        Stack stack = new Stack();
        for (char aChar : chars) {
            switch (aChar) {
                case '(':
                    stack.push(aChar);
                    break;
                case ')':
                    if (stack.empty() || !stack.peek().equals('(')) {
                        return false;
                    } else {
                        stack.pop();
                    }
                    break;
                default:
                    break;
            }
        }
        return stack.isEmpty();
    }
}
