Lucene Quick Start

程序员文章站 2022-03-30 10:12:58


The terms used below are introduced in another blog post:

https://mp.csdn.net/postedit/85889608

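Index concepts

The basic unit of data in an index is the document, and each document contains a number of fields. An index can hold any kind of data: structured and unstructured data are both converted into documents before being stored in the index.

Unstructured data: data with no fixed structure, which covers most resources on the web, such as videos, music, blog posts, and websites.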

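Writing structured data to the index

Structured data can be written to the index with a tool or with Java code. This approach is mainly used for "site search".

Site search: search in which every result comes from the company's own database, as on Tmall, JD.com, or Taobao.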

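Writing unstructured data to the index

A crawler collects the unstructured data and writes it to the index. Typical use case: web search engines such as Baidu.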

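Querying unstructured data

1. Sequential scan: check the documents one by one (impractical for anything but tiny collections).

2. Inverted index: first build an inverted index over the documents, then search against it. The term is the basic unit of retrieval.

Example: build an index over the content of document 2 (doc2): When do you come back from Rome? Each entry in the resulting index has the following parts: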


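term: a term, i.e. one of the individual units produced by splitting the document's text.

doc: the document that contains the term.

freq: how many times the term occurs in that document.

pos: the position(s) at which the term occurs in that document.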
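
The term/doc/freq/pos structure can be sketched in plain Java. This is only a toy model of the idea, not how Lucene stores its index: the class name, the method names, the crude tokenizer, and the sentence used for doc1 are all assumptions made for illustration. Only the doc2 sentence comes from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> (docId -> positions). Hypothetical names, not Lucene internals.
class InvertedIndexSketch {

    // term -> docId -> 0-based token positions of the term in that document
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Lowercase and split on anything that is not a letter or digit:
    // a crude stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Record every token of the document together with its position.
    void addDocument(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // doc -> positions for a term; empty if the term was never indexed.
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeMap<>());
    }

    // freq: number of occurrences of the term in one document.
    int freq(String term, int docId) {
        List<Integer> positions = lookup(term).get(docId);
        return positions == null ? 0 : positions.size();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "I come from Rome.");                 // doc1 is made up for this sketch
        index.addDocument(2, "When do you come back from Rome?");  // doc2 from the example above
        System.out.println("rome -> " + index.lookup("rome"));
        System.out.println("freq(come, doc2) = " + index.freq("come", 2));
    }
}
```

A lookup goes straight from the term to the documents that contain it, which is why this beats scanning every document in turn.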

Lucene is a search toolkit built on full-text search.

Full-text search: retrieving documents by means of an inverted index.

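Three yes/no decisions when writing unstructured data to the index

1. Indexed or not: if a field is indexed, documents can be found by querying that field.

2. Tokenized or not: if a field is tokenized, documents are matched on individual terms; if it is not, the query has to equal the field's entire content to match.

3. Stored or not: storing keeps the original content in the index. Store a field when its value has to be displayed on the results page.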
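
The difference between a tokenized and an untokenized field can be illustrated without Lucene. The sketch below is plain Java with hypothetical class and method names; it only mimics the matching behaviour described above. In Lucene itself, the usual counterparts are TextField (indexed and tokenized) and StringField (indexed as one single term), with Field.Store.YES or Field.Store.NO deciding whether the value is stored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of "tokenized" vs "not tokenized" matching.
// Not Lucene code: the class and method names are hypothetical.
class TokenizedMatchSketch {

    // Tokenized field: the value is split into lowercase terms, and any one of them can match.
    static boolean matchesTokenized(String fieldValue, String queryTerm) {
        Set<String> terms = new HashSet<>(Arrays.asList(
                fieldValue.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        return terms.contains(queryTerm.toLowerCase(Locale.ROOT));
    }

    // Untokenized field: the whole value is one term, so only an exact match hits.
    static boolean matchesUntokenized(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    public static void main(String[] args) {
        String value = "When do you come back from Rome?";
        System.out.println(matchesTokenized(value, "rome"));    // true: "rome" is one of the terms
        System.out.println(matchesUntokenized(value, "rome"));  // false: not the entire value
        System.out.println(matchesUntokenized(value, value));   // true: exact match
    }
}
```

As a rule of thumb, free text such as a title or body is tokenized, while identifiers and paths that are only ever looked up whole are left untokenized.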

Common field types in Lucene


Writing data to the index

1. Add the Maven dependencies to pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>6.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>6.6.1</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    
    <!-- HanLP analyzer integration -->
    <dependency>
        <groupId>com.hankcs</groupId>
        <artifactId>hanlp</artifactId>
        <version>portable-1.6.3</version>
    </dependency>
    <dependency>
        <groupId>com.hankcs.nlp</groupId>
        <artifactId>hanlp-lucene-plugin</artifactId>
        <version>1.1.3</version>
    </dependency>
</dependencies>

2. Write the indexing code

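import com.hankcs.lucene.HanLPAnalyzer;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

public class IndexWriterDemo {

    public void Test01() throws Exception {
        // Pick one analyzer for the index:
        // standard analyzer
        // IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
        // smart Chinese analyzer (needs the lucene-analyzers-smartcn artifact)
        // IndexWriterConfig conf = new IndexWriterConfig(new SmartChineseAnalyzer());
        // HanLP analyzer
        IndexWriterConfig conf = new IndexWriterConfig(new HanLPAnalyzer());

        // Open the index directory and create the writer; both are closed automatically
        try (Directory d = FSDirectory.open(Paths.get("D:\\lucene\\index"));
             IndexWriter indexWriter = new IndexWriter(d, conf)) {
            // Folder that holds the source files
            File sourceFile = new File("D:\\lucene\\source");
            // listFiles() returns null if the folder does not exist or cannot be read
            File[] files = sourceFile.listFiles();
            if (files == null) {
                return;
            }
            for (File file : files) {
                if (!file.isFile()) {
                    continue; // skip sub-directories
                }
                // File name, used as the document title
                String fileName = file.getName();
                // File content, read as UTF-8
                String fileContent = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
                // File size in bytes
                long fileSize = FileUtils.sizeOf(file);
                // File path
                String filepath = file.getPath();

                // Turn the values above into fields
                Field fName = new TextField("fileName", fileName, Field.Store.YES);
                Field fContent = new TextField("fileContent", fileContent, Field.Store.YES);
                // LongPoint makes fileSize searchable by range; StoredField makes it retrievable
                Field fSizePoint = new LongPoint("fileSize", fileSize);
                Field fSize = new StoredField("fileSize", fileSize);
                Field filePath = new StoredField("filePath", filepath);

                // Build the document
                Document document = new Document();
                document.add(fName);
                document.add(fContent);
                document.add(fSizePoint);
                document.add(fSize);
                document.add(filePath);
                // Write the document to the index
                indexWriter.addDocument(document);
            }
        }
    }
}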

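Analyzers

1. Standard analyzer
IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());

2. Smart Chinese analyzer
IndexWriterConfig conf = new IndexWriterConfig(new SmartChineseAnalyzer());
Note: it handles Chinese well but is hard to extend; custom dictionaries, stop-word lists, and synonym lists are awkward to manage. It ships in the separate lucene-analyzers-smartcn artifact, which is not in the dependency list above.

3. HanLP analyzer
IndexWriterConfig conf = new IndexWriterConfig(new HanLPAnalyzer());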

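Common query types

1. MatchAllDocsQuery
Matches every document in the index directory.

2. LongPoint range query
Matches documents by numeric range.

3. BooleanQuery
Combines several conditions.

4. QueryParser
Tokenized query against a single field.

5. MultiFieldQueryParser
Tokenized query against multiple fields.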
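
The way a combined query narrows a result set can be pictured as set operations on document numbers. The sketch below is plain Java with hypothetical names, not Lucene's implementation: each MUST clause intersects the result and each MUST_NOT clause subtracts from it. It also assumes, as in Lucene, that a combination with no positive clause matches nothing, which is why exclusions are usually paired with a match-all clause.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Set-based picture of MUST / MUST_NOT combination. Hypothetical names, not Lucene code.
class BooleanCombineSketch {

    // Each set holds the document numbers matched by one clause.
    static Set<Integer> combine(List<Set<Integer>> must, List<Set<Integer>> mustNot) {
        if (must.isEmpty()) {
            return new TreeSet<>(); // nothing positive to start from, so nothing matches
        }
        Set<Integer> result = new TreeSet<>(must.get(0));
        for (Set<Integer> clause : must.subList(1, must.size())) {
            result.retainAll(clause); // MUST: keep only documents matched by every clause
        }
        for (Set<Integer> clause : mustNot) {
            result.removeAll(clause); // MUST_NOT: drop documents matched by the clause
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> allDocs = new TreeSet<>(Arrays.asList(0, 1, 2, 3)); // "match all"
        Set<Integer> withTerm = new TreeSet<>(Arrays.asList(1, 3));      // docs containing some term
        // All documents except those containing the term
        System.out.println(combine(Collections.singletonList(allDocs),
                                   Collections.singletonList(withTerm))); // prints [0, 2]
    }
}
```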

Code for querying the index

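import com.hankcs.lucene.HanLPAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexSearchDemo {

    public void indexReader() throws Exception {
        // Open the index directory and a reader on it; both are closed automatically
        try (Directory d = FSDirectory.open(Paths.get("D:\\lucene\\index"));
             IndexReader indexReader = DirectoryReader.open(d)) {
            // Wrap the reader in a searcher, which provides the search API
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            // Match-all query
            Query query = new MatchAllDocsQuery();
            // Single-field term query: the text is not analyzed, so it must equal an indexed term
            Query query1 = new TermQuery(new Term("fileName", "中国人"));
            // Range query (inclusive); only matches if fileSize was indexed as a LongPoint
            Query query2 = LongPoint.newRangeQuery("fileSize", 6L, 666L);
            // Combined query: all documents except those matched by query1
            BooleanClause bc1 = new BooleanClause(query, BooleanClause.Occur.MUST);
            BooleanClause bc2 = new BooleanClause(query1, BooleanClause.Occur.MUST_NOT);
            Query query3 = new BooleanQuery.Builder().add(bc1).add(bc2).build();

            // Single-field query parser (tokenized).
            // The analyzer given to the parser must be the same one used when writing the index.
            QueryParser parser = new QueryParser("fileName", new HanLPAnalyzer());
            Query query4 = parser.parse("Spring是一个很强大的框架");

            // Multi-field query parser (tokenized)
            MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
                    new String[]{"fileName", "fileContent"}, new HanLPAnalyzer());
            Query query5 = queryParser.parse("Spring是一个很强大的框架");

            // Run the search: first argument is the query, second the maximum number of hits
            TopDocs topDocs = indexSearcher.search(query5, 10);
            // Total number of matching documents
            int totalHits = topDocs.totalHits;
            System.out.println("------------- total hits: " + totalHits);
            // The returned hits, each carrying a document number
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc scoreDoc : scoreDocs) {
                // Document number of this hit
                int doc = scoreDoc.doc;
                System.out.println("------------------ document number: " + doc);
                // Load the stored fields of the document
                Document document = indexSearcher.doc(doc);
                // Print the stored values
                System.out.println(document.get("fileName"));
                System.out.println(document.get("fileContent"));
                System.out.println(document.get("fileSize"));
                System.out.println(document.get("filePath"));
            }
        }
    }
}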

Original article: https://blog.csdn.net/qq_42949806/article/details/85889900
