Hyperscan: notes on matching performance, flag settings, and database generation

程序员文章站 2022-06-16 11:38:17

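Hyperscan is an open-source, high-performance regular-expression matching library. It supports rule sets with hundreds of thousands of patterns and is fairly easy to use. For usage details, see the official documentation.

Git repository: https://github.com/intel/hyperscan

Developer reference: http://intel.github.io/hyperscan/dev-reference/

My own notes from using it follow.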

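Generating a Hyperscan database is quick for small rule sets: a few hundred rules compile in a few seconds to a few tens of seconds. About 10,000 rules take roughly 110 seconds, 20,000 rules roughly 664 seconds, and 200,000 rules around 8 hours. The exact times probably depend on the machine and on how complex the regular expressions are.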
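To make the growth easier to see, the compile times quoted above can be converted into an average cost per rule. The following is only a sketch of that arithmetic; it treats "around 8 hours" as exactly 8 × 3600 seconds (an assumption), and the class and method names are hypothetical.

```java
class CompileTimeScaling {
    /** Average compile cost per rule, in milliseconds. */
    static double msPerRule(long totalSeconds, int ruleCount) {
        return totalSeconds * 1000.0 / ruleCount;
    }

    public static void main(String[] args) {
        // Figures quoted in the text; 8 hours is assumed to be exactly 8 * 3600 s
        System.out.println("10k rules:  " + msPerRule(110, 10_000) + " ms/rule");
        System.out.println("20k rules:  " + msPerRule(664, 20_000) + " ms/rule");
        System.out.println("200k rules: " + msPerRule(8 * 3600, 200_000) + " ms/rule");
    }
}
```

On these figures the per-rule cost comes out at about 11 ms, 33 ms and 144 ms respectively, so compile time grows faster than linearly with the number of rules in this data set.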

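Matching speed against the database depends on a few flags that are set when the database is generated, namely 8Hi. 8 stands for UTF-8; H stands for single match (each expression is reported at most once; without it, a pattern such as .* keeps reporting matches all the way to the end of the input, which is very inefficient); i stands for case-insensitive matching.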
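The letter-to-flag mapping described above can be sketched as a tiny parser. This is only an illustration: the `Flag` enum is a local stand-in defined here (the Java wrapper ships its own `ExpressionFlag` type), the class name is hypothetical, and only the three letters mentioned in the text are handled.

```java
import java.util.EnumSet;

class FlagLetters {
    // Local stand-in for illustration only, not the wrapper's ExpressionFlag
    enum Flag { UTF8, SINGLEMATCH, CASELESS }

    /** Translates a flag string such as "8Hi" into a set of flags. */
    static EnumSet<Flag> parse(String letters) {
        EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
        for (char c : letters.toCharArray()) {
            switch (c) {
                case '8': flags.add(Flag.UTF8); break;         // UTF-8 mode
                case 'H': flags.add(Flag.SINGLEMATCH); break;  // report each expression at most once
                case 'i': flags.add(Flag.CASELESS); break;     // ignore case
                default:
                    throw new IllegalArgumentException("Unsupported flag letter: " + c);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(parse("8Hi")); // prints [UTF8, SINGLEMATCH, CASELESS]
    }
}
```

Rejecting unknown letters, rather than ignoring them, keeps a typo in the flag string from silently changing how the database behaves.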

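There is a Java wrapper for the library: https://github.com/gliwka/hyperscan-java

Code for generating the pattern database:

in is the pattern file, one expression per line; no ID prefix is needed. For example: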

(A|b|c)|(你好 Hyperscan)

(1|b|c)|(你好 Hyperscan.*?)

out is the output path; any file extension will do, e.g. HyDB

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import com.gliwka.hyperscan.wrapper.CompileErrorException;
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Expression;
import com.gliwka.hyperscan.wrapper.ExpressionFlag;

public static void gen(String in, String out) {
    long start = System.currentTimeMillis();
    System.out.println("Loading patterns, start: " + start);
    List<Expression> expressions = new ArrayList<>(250000);
    // 8Hi: UTF-8, case-insensitive, single match per expression
    EnumSet<ExpressionFlag> flags = EnumSet.of(
            ExpressionFlag.UTF8, ExpressionFlag.CASELESS, ExpressionFlag.SINGLEMATCH);

    // Read the pattern file through a 5 MB buffer; try-with-resources closes the reader
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(in), StandardCharsets.UTF_8), 5 * 1024 * 1024)) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            try {
                expressions.add(new Expression(line, flags));
            } catch (Exception e) {
                // Skip an expression that cannot be constructed, but say so
                System.err.println("Skipping expression: " + line);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        return; // nothing useful to compile if the pattern file could not be read
    }

    // Compile all expressions into one database and serialize it to the output file
    try (Database db = Database.compile(expressions);
         OutputStream outs = new FileOutputStream(out)) {
        db.save(outs);
        long end = System.currentTimeMillis();
        System.out.println("End: " + end);
        System.out.println("Elapsed: " + (end - start) / 1000 + " s");
    } catch (CompileErrorException ce) {
        // Reports which expression failed to compile
        System.out.println("error : " + ce.getFailedExpression());
    } catch (IOException ie) {
        ie.printStackTrace();
    }
}

Loading the database is simpler:

dbPath is the path of the generated database

input is the file containing the target text

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Match;
import com.gliwka.hyperscan.wrapper.Scanner;

public static void regx(String input, String dbPath) {
    long startLoad = System.currentTimeMillis();
    System.out.println("Loading database, start: " + startLoad);
    // try-with-resources releases the stream, the database and the scanner
    try (InputStream fi = new FileInputStream(dbPath);
         Database db = Database.load(fi);
         Scanner scanner = new Scanner()) {
        // Decode the whole file in one step so that multi-byte UTF-8 characters
        // are never split across buffer boundaries
        String text = new String(Files.readAllBytes(Paths.get(input)), StandardCharsets.UTF_8);
        scanner.allocScratch(db);
        long endLoad = System.currentTimeMillis();
        System.out.println("Database loaded: " + endLoad
                + ", elapsed: " + (endLoad - startLoad) + " ms");

        long regxStart = System.currentTimeMillis();
        List<Match> matches = scanner.scan(db, text);
        long regxEnd = System.currentTimeMillis();
        System.out.println("Matches found: " + matches.size());
        // Report milliseconds: scans finish in well under a second for small rule sets
        System.out.println("Matching finished: " + regxEnd
                + ", elapsed: " + (regxEnd - regxStart) + " ms");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
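One detail worth checking when reading the target file: if the bytes are read in fixed-size chunks and each chunk is converted to a String on its own, a multi-byte UTF-8 character that straddles a chunk boundary is corrupted before Hyperscan ever sees it. The sketch below illustrates the effect with plain JDK classes; the class and method names are hypothetical, and the chunk size of 4 is chosen only to force a split.

```java
import java.nio.charset.StandardCharsets;

class ChunkedDecoding {
    /** Decodes each fixed-size chunk independently (the fragile approach). */
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Decodes the full byte array in one step. */
    static String decodeWhole(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4F60\u597D" is the two-character greeting used in the sample patterns;
        // each character takes 3 bytes in UTF-8
        String text = "\u4F60\u597D Hyperscan";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodePerChunk(bytes, 4).equals(text)); // false
        System.out.println(decodeWhole(bytes).equals(text));       // true
    }
}
```

Decoding the complete byte array, or wrapping the stream in an InputStreamReader with an explicit charset, avoids the problem.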

Add the dependency to pom.xml:

<dependency>
    <groupId>com.gliwka.hyperscan</groupId>
    <artifactId>hyperscan</artifactId>
    <version>1.0.0</version>
</dependency>

Demo code: https://download.csdn.net/download/airyearth/14890010

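Matching is fairly fast: with 10,000 rules, 33 matches took 74 ms; with 20,000 rules, 68 matches took 86 ms; with 30,000 rules, 103 matches took 391 ms; with 200,000 rules, 873 matches took 2,719 ms. To be continued.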
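Timings of this size need millisecond resolution: dividing a millisecond count by 1000 with integer arithmetic turns 74 ms into 0 s. A minimal timing helper, assuming the work to be measured can be passed as a Runnable (the class and method names are hypothetical), might look like this:

```java
class ScanTimer {
    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long elapsedMillis(Runnable task) {
        long t0 = System.nanoTime(); // monotonic clock, suited to measuring intervals
        task.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = elapsedMillis(() -> {
            // Placeholder workload; a real measurement would call the scan here
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("elapsed: " + ms + " ms");
        System.out.println(74 / 1000); // prints 0: integer division hides sub-second timings
    }
}
```

A single run like this is only a rough measurement; repeated runs after a warm-up give steadier numbers on the JVM.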

Original article: https://blog.csdn.net/airyearth/article/details/112871300
