A Detailed Guide to Calling Lucene through the Java API in Spring Boot

程序员文章站 2022-03-18 10:17:14

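Lucene began as a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but an architecture for building one. It provides a complete query engine and indexing engine, plus part of a text analysis engine (for two Western languages, English and German). Lucene aims to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.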

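Overview of full-text search

Suppose a folder or a disk holds many files (text notes, Word, Excel, PDF) and we want to find the files that contain a given keyword. For example, if we enter lucene, every file whose content contains lucene is returned. This is what is meant by full-text search.

The natural idea, then, is to build a mapping between keywords and files. The figure below, borrowed from a slide deck, shows clearly how such a mapping is implemented.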

Inverted index

[Figure: inverted index]

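With this mapping in place, let's look at Lucene's architecture.

The figure below appears in almost every introduction to Lucene, and it also sums up the essence of the library.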

[Figure: Lucene architecture]

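As the figure shows, using Lucene comes down to two steps:

1 Create the index: IndexWriter indexes the different files and saves the result in the location where the index files are stored.

2 Use the index to look up the documents related to a keyword.

Lucene uses exactly this "inverted index" technique to implement the mapping.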
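To make the two steps concrete, here is a minimal plain-Java sketch of an inverted index. It is not Lucene code: the class name InvertedIndexSketch, the file names, and the simple split-on-punctuation tokenization are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): lowercased word -> names of files containing it. */
class InvertedIndexSketch {

    // term -> sorted set of file names that contain the term
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Step 1: index a file by splitting its text on anything that is not a letter or digit. */
    void addFile(String fileName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    /** Step 2: look the keyword up in the index instead of scanning every file. */
    Set<String> search(String keyword) {
        return postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addFile("notes.txt", "Lucene is a full-text search library");
        index.addFile("report.doc", "Quarterly report, nothing about indexing");
        index.addFile("slides.pdf", "Intro to Lucene and inverted indexes");

        System.out.println(index.search("lucene")); // [notes.txt, slides.pdf]
    }
}
```

The point of the structure is that a search touches only the entry for one term, however many files were indexed.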

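Lucene's mathematical model

Documents, fields, and terms

A document is the atomic unit of indexing and searching in Lucene. A document is a container for one or more fields, and the fields in turn hold the "real" content being searched. Field values are run through tokenization to produce one or more terms.

For example, the information about a novel (斗破苍穹) can be treated as one document. That document contains several fields, such as the title (斗破苍穹), the author, a summary, and the last-updated time. Tokenizing the title field yields one or more terms (斗, 破, 苍, 穹).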
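The document, field, and term relationship can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class DocumentModelSketch, the "summary" field value, and the one-character-per-term tokenizer are hypothetical and stand in for Lucene's Document, Field, and Analyzer classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of document -> field -> term (hypothetical, not Lucene's classes). */
class DocumentModelSketch {

    /** Assumed tokenizer: one term per Unicode code point, whitespace skipped. */
    static List<String> tokenize(String fieldValue) {
        List<String> terms = new ArrayList<>();
        fieldValue.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .forEach(cp -> terms.add(new String(Character.toChars(cp))));
        return terms;
    }

    public static void main(String[] args) {
        // A document is a container of named fields.
        Map<String, String> document = new LinkedHashMap<>();
        document.put("title", "斗破苍穹");
        document.put("summary", "placeholder summary text");

        // Tokenizing one field value yields the terms of that field.
        System.out.println(tokenize(document.get("title"))); // [斗, 破, 苍, 穹]
    }
}
```

A real analyzer would usually group characters into words rather than emit one term per character; the single-character split simply mirrors the example in the text.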

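Lucene file structure

Hierarchy

Index
An index is stored in one directory.

Segment
An index can contain several segments. Segments are independent of each other; adding new documents may create a new segment, and different segments can be merged into a new one.

Document
The document is the basic unit of indexing. Documents are stored in segments, and one segment can contain many documents.

Field
A document contains different kinds of information, which can be split into fields and indexed separately.

Term
The term is the smallest unit of the index: the data that remains after lexical analysis and language processing.

Forward information

The containment relationship from index down to term is stored level by level: index-->segment-->document-->field-->term.

Reverse information

The reverse information stores the mapping from the term dictionary to the postings lists: term-->document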
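As a rough illustration of forward information, reverse information, and segment merging, here is a plain-Java sketch. It is a simplification, not Lucene's on-disk format: the class SegmentSketch is hypothetical, fields are omitted, and document ids are assumed to be unique across segments (a real merge would renumber them).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy segment (hypothetical): keeps forward information and derives the reverse mapping. */
class SegmentSketch {

    // Forward information: document id -> terms contained in that document.
    final Map<Integer, List<String>> forward = new TreeMap<>();

    void addDocument(int docId, String... terms) {
        forward.put(docId, Arrays.asList(terms));
    }

    /** Reverse information: term -> ids of the documents that contain it. */
    Map<String, Set<Integer>> invert() {
        Map<String, Set<Integer>> inverted = new TreeMap<>();
        forward.forEach((docId, terms) -> {
            for (String term : terms) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        });
        return inverted;
    }

    /** Merging two independent segments yields one new segment holding all their documents. */
    static SegmentSketch merge(SegmentSketch a, SegmentSketch b) {
        SegmentSketch merged = new SegmentSketch();
        merged.forward.putAll(a.forward);
        merged.forward.putAll(b.forward);
        return merged;
    }

    public static void main(String[] args) {
        SegmentSketch first = new SegmentSketch();
        first.addDocument(1, "apache", "spark");
        SegmentSketch second = new SegmentSketch();
        second.addDocument(2, "apache", "lucene");

        System.out.println(merge(first, second).invert());
        // {apache=[1, 2], lucene=[2], spark=[1]}
    }
}
```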

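IndexWriter
One of the most important classes in Lucene. It adds documents to the index and controls a number of parameters of the indexing process.

Analyzer
The analyzer processes the text the search engine encounters. Commonly used ones include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.

Directory
Where the index is stored. Lucene offers two kinds of location, disk and memory, through the FSDirectory and RAMDirectory classes respectively. The index is normally kept on disk.

Document
A Document is a unit to be indexed. Any file that is to be indexed must first be converted into a Document object.

Field
A field of a document.

IndexSearcher
The most basic search tool in Lucene. Every search goes through an IndexSearcher.

Query
Lucene supports fuzzy queries, phrase queries, combined (boolean) queries, and more, through classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery. (RangeQuery comes from older releases; in the Lucene 7.1.0 release used below, range queries are built with classes such as TermRangeQuery or with IntPoint.newRangeQuery.)

QueryParser
A tool for parsing user input: it scans the string the user entered and produces a Query object.

Hits
Once a search completes, the results must be returned and shown to the user; only then has the search served its purpose. In older Lucene releases the result set was represented by an instance of the Hits class. Hits was removed in Lucene 3.0, and in the 7.1.0 release used below a search returns a TopDocs object instead.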

Test cases

GitHub code

I have put the code on GitHub; import the spring-boot-lucene-demo project.

GitHub

Add dependencies

<!-- Tokenization, indexing, and query parsing -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- Highlighting -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-highlighter</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- smartcn Chinese analyzer (SmartChineseAnalyzer); requires the Lucene dependency and must match the Lucene version -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-smartcn</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- ik-analyzer Chinese analyzer -->
<dependency>
  <groupId>cn.bestwu</groupId>
  <artifactId>ik-analyzers</artifactId>
  <version>5.1.0</version>
</dependency>

<!-- mmseg4j analyzer -->
<dependency>
  <groupId>com.chenlb.mmseg4j</groupId>
  <artifactId>mmseg4j-solr</artifactId>
  <version>2.4.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>

Configure Lucene

private Directory directory;

private IndexReader indexReader;

private IndexSearcher indexSearcher;

@Before
public void setUp() throws IOException {
  // Location of the index: the indexdir/ folder under the current directory
  directory = FSDirectory.open(Paths.get("indexdir/"));

  // Create the index reader (an index must already exist in this directory)
  indexReader = DirectoryReader.open(directory);

  // Create an index searcher to search the index
  indexSearcher = new IndexSearcher(indexReader);
}

@After
public void tearDown() throws Exception {
  indexReader.close();
}

/**
 * Run the query and print the number of matching records
 *
 * @param query the query to run
 * @throws IOException if the index cannot be read
 */
public void executeQuery(Query query) throws IOException {

  TopDocs topDocs = indexSearcher.search(query, 100);

  // Print the number of matching records
  System.out.println("Found " + topDocs.totalHits + " document(s) in total");
  for (ScoreDoc scoreDoc : topDocs.scoreDocs) {

    // Fetch the matching document
    Document document = indexSearcher.doc(scoreDoc.doc);
    System.out.println("id: " + document.get("id"));
    System.out.println("title: " + document.get("title"));
    System.out.println("content: " + document.get("content"));
  }
}

/**
 * Print the tokens an analyzer produces
 *
 * @param analyzer the analyzer to test
 * @param text the text to tokenize
 * @throws IOException if tokenization fails
 */
public void printAnalyzerDoc(Analyzer analyzer, String text) throws IOException {

  TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
  CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
  try {
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      System.out.println(charTermAttribute.toString());
    }
    tokenStream.end();
  } finally {
    tokenStream.close();
    analyzer.close();
  }
}

Create the index

@Test
public void indexWriterTest() throws IOException {
  long start = System.currentTimeMillis();

  // Location of the index: the indexdir/ folder under the current directory
  Directory directory = FSDirectory.open(Paths.get("indexdir/"));

  // In versions 6.6 and later, Version is no longer required, and there is a no-argument constructor, so the default StandardAnalyzer can be used directly.
  Version version = Version.LUCENE_7_1_0;

  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  // Create the Document object that holds the data to index

  Document doc = new Document();

  int id = 1;

  // Add the fields to doc: IntPoint makes id searchable, StoredField makes it retrievable
  doc.add(new IntPoint("id", id));
  doc.add(new StringField("title", "spark", Field.Store.YES));
  doc.add(new TextField("content", "apache spark 是专为大规模数据处理而设计的快速通用的计算引擎", Field.Store.YES));
  doc.add(new StoredField("id", id));

  // Save the doc object to the index
  indexWriter.addDocument(doc);

  indexWriter.commit();
  // Close the writer
  indexWriter.close();

  long end = System.currentTimeMillis();
  System.out.println("Indexing took " + (end - start) + " ms");
}

Response

17:58:14.655 [main] DEBUG org.wltea.analyzer.dic.Dictionary - 加载扩展词典：ext.dic
17:58:14.660 [main] DEBUG org.wltea.analyzer.dic.Dictionary - 加载扩展停止词典：stopword.dic
Indexing took 879 ms

Delete documents

@Test
public void deleteDocumentsTest() throws IOException {
  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  // Delete the documents whose title field contains the term "spark".
  // The return value is the sequence number of the operation, not the number of deleted documents.
  long seqNo = indexWriter.deleteDocuments(new Term("title", "spark"));

  // IndexWriter also provides the following methods:
  // deleteDocuments(Query... queries): delete one or more documents matching the queries
  // deleteDocuments(Term... terms): delete one or more documents containing the terms
  // deleteAll(): delete all documents

  // When documents are deleted through IndexWriter they are not removed immediately. The delete is buffered and is only really applied on indexWriter.commit() or indexWriter.close().

  indexWriter.commit();
  indexWriter.close();

  System.out.println("Delete complete, sequence number: " + seqNo);
}

Response

Delete complete, sequence number: 1

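Update documents

/**
 * Test updating.
 * An update is really a delete followed by adding a new document.
 *
 * @throws IOException if the index cannot be written
 */
@Test
public void updateDocumentTest() throws IOException {
  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  Document doc = new Document();

  int id = 1;

  doc.add(new IntPoint("id", id));
  doc.add(new StringField("title", "spark", Field.Store.YES));
  doc.add(new TextField("content", "apache spark 是专为大规模数据处理而设计的快速通用的计算引擎", Field.Store.YES));
  doc.add(new StoredField("id", id));

  // id is indexed as an IntPoint, which produces no terms, so new Term("id", "1") would
  // match nothing and the call would only add a document. Match on the title term instead.
  // The return value is the sequence number of the operation, not a document count.
  long seqNo = indexWriter.updateDocument(new Term("title", "spark"), doc);
  System.out.println("Update complete, sequence number: " + seqNo);
  indexWriter.close();
}

Response

Update complete, sequence number: 1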
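The two behaviours described above (deletes are buffered until commit, and an update is a delete followed by an add) can be modelled with a small plain-Java sketch. This is a toy under stated assumptions, not how IndexWriter is implemented: the class BufferedWriterSketch is hypothetical and a "document" is reduced to its title string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy writer (hypothetical): operations are buffered and only applied on commit. */
class BufferedWriterSketch {

    private final List<String> committed = new ArrayList<>();               // what a reader sees
    private final List<Consumer<List<String>>> pending = new ArrayList<>(); // buffered operations

    void addDocument(String title) {
        pending.add(docs -> docs.add(title));
    }

    void deleteDocuments(String title) {
        pending.add(docs -> docs.removeIf(title::equals));
    }

    /** An update is a delete of the matching documents followed by an add. */
    void updateDocument(String oldTitle, String newTitle) {
        deleteDocuments(oldTitle);
        addDocument(newTitle);
    }

    /** Apply the buffered operations in the order they were issued. */
    void commit() {
        pending.forEach(op -> op.accept(committed));
        pending.clear();
    }

    List<String> visibleDocuments() {
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        BufferedWriterSketch writer = new BufferedWriterSketch();
        writer.addDocument("spark");
        writer.addDocument("lucene");
        writer.commit();

        writer.deleteDocuments("spark");
        System.out.println(writer.visibleDocuments()); // [spark, lucene] (delete still buffered)
        writer.commit();
        System.out.println(writer.visibleDocuments()); // [lucene]
    }
}
```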

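Search by term

/**
 * Search by term
 * <p>
 * TermQuery is the simplest and also the most commonly used Query. It can be thought of as a "term search":
 * the most basic search in a search engine is to look up a single term in the index, and TermQuery does exactly that.
 * In Lucene the term is the basic unit of search. In essence a term is a name/value pair,
 * where the "name" is the field name and the "value" is a keyword contained in that field.
 *
 * @throws IOException if the index cannot be read
 */
@Test
public void termQueryTest() throws IOException {

  String searchField = "title";
  // Build the query condition: the title field must contain the term "spark"
  TermQuery query = new TermQuery(new Term(searchField, "spark"));

  // Run the query and print the number of matching records
  executeQuery(query);
}

Response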