Full-Text Search with Lucene (Part 2): Maintaining the Index

程序员文章站 (Programmer Article Site) 2022-07-09 11:51:59

Maintaining the index
1. Creating the index
2. Deleting from the index
3. Updating the index
4. Optimizing the index

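Some recommendations for index settings:
1) Keep stored content to a minimum; do not store what you do not need
2) Do not index content that never needs to be searched
3) Convert non-text formats to text beforehand
4) Do not tokenize content that must be kept as a whole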

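Converting between data and Document/Field
In application code, data is represented as objects, while a database holds it as table rows, so the two have to be converted back and forth. The same applies to the index, which holds Document instances.
For each entity that should be searchable, we write a matching utility class with two methods: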

Document Object2Document(Object object); // object -> Document
Object Document2Object(Document doc); // Document -> object

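During conversion, each property of the object maps to a Field in the Document. Lucene only handles text, so every property value must be converted to a string before it is stored. Use the constructor Field(String name, String value, Store store, Index index).
Store: whether the field's data is saved in the index.
Index: whether the field's data can be searched (whether the term dictionary is updated).
Store and Index are both enum types. Store specifies whether the original content of the property value is saved in the index. If it is stored (YES), the property has its original value when the matching data is retrieved; if it is not stored (NO), the property comes back as null. Index specifies whether an index (term dictionary) is built for the field; only indexed fields can be searched. A field cannot be both unstored and unindexed, since that would be meaningless.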
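
To make the difference between storing and indexing concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene: the class StoreIndexSketch, its methods, and the whitespace tokenizer are all assumptions made for illustration. It only shows that stored values are what you get back, indexed terms are what you can search on, and a field that is neither is rejected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model (not Lucene): keeps "stored" values apart from "indexed" terms. */
public class StoreIndexSketch {
    // docId -> (field name -> original value); filled only when store == true
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();
    // "field:term" -> sorted doc ids; filled only when index == true
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void addField(int docId, String name, String value, boolean store, boolean index) {
        if (!store && !index) {
            throw new IllegalArgumentException("a field must be stored, indexed, or both");
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(name, value);
        }
        if (index) {
            // Naive whitespace tokenizer standing in for a real analyzer
            for (String term : value.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents whose field contains the term. */
    public List<Integer> search(String name, String term) {
        TreeSet<Integer> ids = postings.get(name + ":" + term.toLowerCase());
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    /** Returns the original value, or null if the field was not stored. */
    public String get(int docId, String name) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(name);
    }

    public static void main(String[] args) {
        StoreIndexSketch idx = new StoreIndexSketch();
        idx.addField(1, "title", "Lucene in Action", true, true);    // stored and indexed
        idx.addField(1, "content", "full text search", false, true); // indexed only
        System.out.println(idx.search("content", "search")); // found through the index
        System.out.println(idx.get(1, "content"));           // null: value was not stored
        System.out.println(idx.get(1, "title"));             // original value is available
    }
}
```

In this model the content field can be found by a search but its text cannot be read back, which mirrors the Store.NO behaviour described above.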

NumericUtils and DateTools
If a property is not a string, it has to be converted first: use NumericUtils for numeric types and DateTools for date types.
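
One reason a plain toString() is not enough is that index terms are compared as strings, so "10" sorts before "9". The sketch below uses only plain Java to show the underlying idea of a string form whose order matches the value order. The class SortableStringSketch and its methods are hypothetical, and the encodings are not the ones Lucene's helpers actually use.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Plain-Java illustration (not Lucene's actual encoding) of sortable string forms. */
public class SortableStringSketch {

    /** Zero-pads a non-negative int to 10 digits so string order matches numeric order. */
    public static String encodeInt(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative values");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    /** Formats a date as yyyyMMdd in GMT so string order matches chronological order. */
    public static String encodeDay(Date date) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // Naive toString(): "10" sorts before "9" because '1' < '9'
        System.out.println("10".compareTo("9") < 0);                   // true
        // The fixed-width form restores the numeric order
        System.out.println(encodeInt(9).compareTo(encodeInt(10)) < 0); // true
        // The epoch formatted at day resolution
        System.out.println(encodeDay(new Date(0L)));                   // 19700101
    }
}
```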

Code example:
Article.java

package com.my.bean;

public class Article {

    private Integer id;
    private String title;
    private String content;

    public Integer getId() {
        return id;
    }
    public void setId(Integer id) {
        this.id = id;
    }
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getContent() {
        return content;
    }
    public void setContent(String content) {
        this.content = content;
    }



}

QueryPageResult.java

package com.my.bean;

import java.util.List;

public class QueryPageResult {
    private int count;
    private List list;

    public QueryPageResult(int count, List list) {
        this.count = count;
        this.list = list;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    public List getList() {
        return list;
    }

    public void setList(List list) {
        this.list = list;
    }

}

ArticleDocumentUtils.java

package com.my.utils;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

import com.my.bean.Article;

public class ArticleDocumentUtils {

    public static Document article2Document(Article article) {
        Document doc = new Document();

        String idStr = article.getId().toString(); // the Integer id must be converted to a String
        doc.add(new Field("id", idStr, Store.YES, Index.NOT_ANALYZED));
        doc.add(new Field("title", article.getTitle(), Store.YES, Index.ANALYZED));
        doc.add(new Field("content", article.getContent(), Store.YES, Index.ANALYZED));

        return doc;
    }

    public static Article document2Article(Document doc) {
        Article article = new Article();

        Integer id = Integer.parseInt(doc.get("id")); // convert the String value back to an Integer
        article.setId(id);
        article.setTitle(doc.get("title"));
        article.setContent(doc.get("content"));

        return article;
    }

}

Configuration.java

package com.my.utils;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Configuration {

    private static Directory directory;
    private static Analyzer analyzer;

    static {
        // Read the configuration file and initialize the settings (only simulated here)
        try {
            directory = FSDirectory.open(new File("./indexDir"));
            analyzer = new StandardAnalyzer(Version.LUCENE_30);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static Directory getDirectory() {
        return directory;
    }

    public static Analyzer getAnalyzer() {
        return analyzer;
    }

}

LuceneUtils.java

package com.my.utils;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;

public class LuceneUtils {

    private static IndexWriter indexWriter;

    static {
        // Initialized once, when the class is loaded
        try {
            indexWriter = new IndexWriter(Configuration.getDirectory(), Configuration.getAnalyzer(), MaxFieldLength.LIMITED);
            System.out.println("-- IndexWriter initialized --");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }

        // Close the writer before the program exits
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() { // the JVM runs this run() method before it exits
                try {
                    indexWriter.close();
                    System.out.println("-- IndexWriter closed --");
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
    }

    public static IndexWriter getIndexWriter() {
        return indexWriter;
    }


}

ArticleIndexDao.java

package com.my.indexdao;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

import com.my.bean.Article;
import com.my.bean.QueryPageResult;
import com.my.utils.ArticleDocumentUtils;
import com.my.utils.Configuration;
import com.my.utils.LuceneUtils;

public class ArticleIndexDao {

    /**
     * Create an index entry (save to the index)
     * 
     * @param article
     */
    public void save(Article article) {
        // 1. Convert the Article to a Document
        Document doc = ArticleDocumentUtils.article2Document(article);

        // 2. Add it to the index
        try {
            LuceneUtils.getIndexWriter().addDocument(doc); // save
            LuceneUtils.getIndexWriter().commit(); // commit the change
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Delete from the index
     * 
     * Term: a keyword within a given field (a keyword that appears in the term dictionary)
     * 
     * @param id
     */
    public void delete(Integer id) {
        try {
            Term term = new Term("id", id.toString());

            LuceneUtils.getIndexWriter().deleteDocuments(term); // delete every document that contains the given term
            LuceneUtils.getIndexWriter().commit(); // commit the change
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Update the index
     * 
     * @param article
     */
    public void update(Article article) {
        try {
            Term term = new Term("id", article.getId().toString());
            Document doc = ArticleDocumentUtils.article2Document(article);

            LuceneUtils.getIndexWriter().updateDocument(term, doc); // update
            LuceneUtils.getIndexWriter().commit(); // commit the change

            // // An update is a delete followed by an add
            // indexWriter.deleteDocuments(term);
            // indexWriter.addDocument(doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }


}

IndexDaoTest.java

package com.my.indexdao;

import java.util.List;

import org.junit.Test;

import com.my.bean.Article;
import com.my.bean.QueryPageResult;

public class IndexDaoTest {

    private ArticleIndexDao articleIndexDao = new ArticleIndexDao();

    @Test
    public void testSave_1() {
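        // Simulate a record that has just been saved to the database
        Article article = new Article();
        article.setId(1);
        article.setTitle("Lucene is a full-text search framework");
        article.setContent("If an information retrieval system only went to the Internet to look for answers after the user issued a search request, it could not return results within a limited time.");

        // Build the index
        articleIndexDao.save(article);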
    }

    @Test
    public void testSave_25() {
        for (int i = 1; i <= 25; i++) {
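            // Simulate a record that has just been saved to the database
            Article article = new Article();
            article.setId(i);
            article.setTitle("Lucene is a full-text search framework");
            article.setContent("If an information retrieval system only went to the Internet to look for answers after the user issued a search request, it could not return results within a limited time.");

            // Build the index
            articleIndexDao.save(article);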
        }
    }

    @Test
    public void testDelete() {
        articleIndexDao.delete(1);
    }

    @Test
    public void testUpdate() {
        // Simulate a detached record
        Article article = new Article();
        article.setId(1);
        article.setTitle("Lucene is a full-text search framework");
        article.setContent("This is the updated content");

        articleIndexDao.update(article);
    }



}

Index optimization:
1. Merging index files
Core API

IndexWriter.optimize()
IndexWriter.setMergeFactor(int)

Code example
OptimizeTest.java

package com.my.lucene;

import org.apache.lucene.document.Document;
import org.junit.Test;

import com.my.bean.Article;
import com.my.utils.ArticleDocumentUtils;
import com.my.utils.LuceneUtils;

public class OptimizeTest {

    // Merge the many small files in the index into one large file
    @Test
    public void testOptimize() throws Exception {
        LuceneUtils.getIndexWriter().optimize();
    }

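    // Automatic merging: once the number of small files reaches a threshold, they are merged into one large file
    @Test
    public void testAutoOptimize() throws Exception {
        // Set how many small files trigger an automatic merge. This must never be less than 2. The default value is 10.
        LuceneUtils.getIndexWriter().setMergeFactor(3); // takes effect only after the next add, delete, or update

        // Add one document to the index
        Article article = new Article();
        article.setId(1);
        article.setTitle("Lucene is a full-text search framework");
        article.setContent("If an information retrieval system only went to the Internet to look for answers after the user issued a search request, it could not return results within a limited time.");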
        Document doc = ArticleDocumentUtils.article2Document(article);
        LuceneUtils.getIndexWriter().addDocument(doc);
    }


}
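
The effect of the merge factor can be pictured with a small simulation in plain Java. This is a deliberately simplified model, not Lucene's merge policy: it assumes every added document is flushed as its own one-document segment and that a merge happens as soon as the last mergeFactor segments have the same size. The class MergeFactorSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified simulation (not Lucene's real merge policy) of merge-factor behaviour. */
public class MergeFactorSketch {
    private final int mergeFactor;
    private final List<Integer> segments = new ArrayList<>(); // document count per segment

    public MergeFactorSketch(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        this.mergeFactor = mergeFactor;
    }

    /** Simulates flushing one single-document segment, then merging while possible. */
    public void flushOneDocument() {
        segments.add(1);
        while (canMergeTail()) {
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += segments.remove(segments.size() - 1);
            }
            segments.add(merged);
        }
    }

    /** True when the last mergeFactor segments all hold the same number of documents. */
    private boolean canMergeTail() {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        int size = segments.get(n - 1);
        for (int i = n - mergeFactor; i < n; i++) {
            if (segments.get(i) != size) {
                return false;
            }
        }
        return true;
    }

    /** Returns a copy of the current segment sizes, oldest first. */
    public List<Integer> getSegments() {
        return new ArrayList<>(segments);
    }

    public static void main(String[] args) {
        MergeFactorSketch sketch = new MergeFactorSketch(3);
        for (int i = 0; i < 10; i++) {
            sketch.flushOneDocument();
        }
        System.out.println(sketch.getSegments()); // [9, 1]
    }
}
```

A smaller merge factor keeps the number of files low at the cost of more merge work while writing; a larger one does the opposite.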

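2. Using RAMDirectory
Lucene's API is designed to be fairly generic, and its input and output structures closely resemble a database's table ==> record ==> field hierarchy, so many traditional file-based or database-based applications map easily onto Lucene's storage structures and interfaces. Broadly speaking, Lucene can be treated at first as a database system that supports full-text indexing.
The index storage location is defined by an abstraction (the abstract class Directory), so different concrete storage back ends can be implemented as subclasses, for example the file system, memory, or a database. Lucene provides two commonly used subclasses: FSDirectory and RAMDirectory.
1. FSDirectory: lives in the file system, with real directories and files.
2. RAMDirectory: lives in memory, with simulated directories and files. Compared with FSDirectory: (1) it is fast because there is no disk I/O; (2) because it lives in memory, the index data is gone once the program exits.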

Code example:
DirectoryTest.java

package com.my.lucene;

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;

import com.my.bean.Article;
import com.my.utils.ArticleDocumentUtils;
import com.my.utils.Configuration;

public class DirectoryTest {

    @Test
    public void testRAMDirectory() throws Exception {
        // Step 1: at startup, load the index into a RAMDirectory (just pass the source directory to the constructor)
        Directory fsDir = FSDirectory.open(new File("./indexDir/")); // real directories and files on disk: slow, but persistent
        Directory ramDir = new RAMDirectory(fsDir); // virtual directories and files in memory: fast, but gone when the program exits

        // ========== While the program runs, it works against the RAMDirectory ==========
        IndexWriter ramIndexWriter = new IndexWriter(ramDir, Configuration.getAnalyzer(), MaxFieldLength.LIMITED);

        // Add one document to ramDir
        Article article = new Article();
        article.setId(1);
        article.setTitle("Lucene is a full-text search framework");
        article.setContent("If an information retrieval system only went to the Internet to look for answers after the user issued a search request, it could not return results within a limited time.");
        Document doc = ArticleDocumentUtils.article2Document(article);
        ramIndexWriter.addDocument(doc); // add
        ramIndexWriter.close();
        // ========== The program is about to exit; work on the RAMDirectory is done ==========

        // Step 2: before exiting, save the RAMDirectory contents back to the FSDirectory
        // This constructor appends if an index already exists and creates one if it does not:
        // IndexWriter fsIndexWriter = new IndexWriter(fsDir, Configuration.getAnalyzer(), MaxFieldLength.LIMITED);
        // Third parameter, create: true to create the index or overwrite the existing one; false to append to the existing index (an error is raised if it does not exist)
        IndexWriter fsIndexWriter = new IndexWriter(fsDir, Configuration.getAnalyzer(), true, MaxFieldLength.LIMITED);
        fsIndexWriter.addIndexesNoOptimize(ramDir); // add the data from the given index to the current index
        // fsIndexWriter.optimize();
        fsIndexWriter.close();
    }

}
