
Lucene Tokenization

Lucene Maven dependencies

<properties>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.compiler.compilerVersion>1.8</maven.compiler.compilerVersion>
        <lucene.version>7.6.0</lucene.version>
</properties>
 <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-smartcn -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queries</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-highlighter -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>${lucene.version}</version>
        </dependency>
</dependencies>
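The IK tokenizer adapted later in this article is not an Apache Lucene module, so it needs its own dependency inside the <dependencies> block above. One commonly used repackaging on Maven Central is shown below; the exact coordinates and version are an assumption, and any IK Analyzer 2012 jar works the same way:

        <!-- Assumed coordinates for the IK analyzer used in the IK section below -->
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>

This jar targets an older Lucene TokenStream API, which is exactly why the IKTokenizer6x/IKAnalyzer6x adapter classes later in this article are needed to run it on Lucene 6.x/7.x.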

Lucene analyzers

  • StopAnalyzer (stop-word analysis)
    StopAnalyzer filters specific strings and words out of the token stream and lowercases all tokens (see the sketch after this list for supplying a custom stop-word set).

  • StandardAnalyzer (standard analysis)
    StandardAnalyzer splits text on whitespace and punctuation, and handles numbers, letters, e-mail addresses, IP addresses, and Chinese characters. It also accepts a stop-word list, so it can cover what StopAnalyzer does.

  • WhitespaceAnalyzer (whitespace analysis)
    WhitespaceAnalyzer splits the input into tokens using whitespace as the only delimiter.

  • SimpleAnalyzer (simple analysis)
    SimpleAnalyzer is an analyzer for Western text; when producing tokens it treats every non-letter character as a delimiter.

  • CJKAnalyzer (bigram analysis)
    CJKAnalyzer internally calls the CJKTokenizer, which segments Chinese text into overlapping two-character tokens.

  • KeywordAnalyzer (keyword analysis)
    KeywordAnalyzer treats the entire input as one single token, which makes it easy to index and search special kinds of text. It is particularly convenient for building index terms from fields such as postal codes and addresses.
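A minimal sketch of the stop-word filtering described above, assuming the Lucene 7.6 APIs from the pom (where StopAnalyzer accepts a CharArraySet); the stop words chosen here are arbitrary examples:

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.core.StopAnalyzer;

public class StopWordsSketch {
    public static void main(String[] args) {
        // Example stop words (hypothetical); 'true' makes matching case-insensitive
        CharArraySet stopWords = new CharArraySet(Arrays.asList("is", "a", "the"), true);
        // Tokens found in the set are dropped; all other tokens are lowercased
        Analyzer analyzer = new StopAnalyzer(stopWords);
        // ... pass the analyzer to tokenStream(...) exactly as in printAnalyzer() below
        analyzer.close();
    }
}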

Example with several analyzers

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class VariousAnalyzers {
    // Sample sentence to tokenize
    private static String str = "中华人名共和国简称中国, 是一个有13亿人口的国家。";

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        System.out.println("Standard analysis: " + analyzer.getClass());
        printAnalyzer(analyzer);
        analyzer = new WhitespaceAnalyzer();
        System.out.println("Whitespace analysis: " + analyzer.getClass());
        printAnalyzer(analyzer);
        analyzer = new SimpleAnalyzer();
        System.out.println("Simple analysis: " + analyzer.getClass());
        printAnalyzer(analyzer);
        analyzer = new CJKAnalyzer();
        System.out.println("Bigram (CJK) analysis: " + analyzer.getClass());
        printAnalyzer(analyzer);
        analyzer = new KeywordAnalyzer();
        System.out.println("Keyword analysis: " + analyzer.getClass());
        printAnalyzer(analyzer);
        analyzer = new StopAnalyzer();
        System.out.println("Stop-word analysis: " + analyzer.getClass());
        printAnalyzer(analyzer);
        analyzer = new SmartChineseAnalyzer();
        System.out.println("Chinese analysis: " + analyzer.getClass());
        printAnalyzer(analyzer);
    }

    public static void printAnalyzer(Analyzer analyzer) throws IOException {
        StringReader reader = new StringReader(str);
        // The first argument is the field name; any placeholder works for a demo
        TokenStream tokenStream = analyzer.tokenStream("content", reader);
        tokenStream.reset();
        CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.print(termAttribute.toString() + "|");
        }
        tokenStream.end();
        tokenStream.close();
        System.out.println("\n");
        analyzer.close();
    }
}

Tokenization results

Standard analysis: class org.apache.lucene.analysis.standard.StandardAnalyzer
中|华|人|名|共|和|国|简|称|中|国|是|一|个|有|13|亿|人|口|的|国|家|

Whitespace analysis: class org.apache.lucene.analysis.core.WhitespaceAnalyzer
中华人名共和国简称中国,|是一个有13亿人口的国家。|

Simple analysis: class org.apache.lucene.analysis.core.SimpleAnalyzer
中华人名共和国简称中国|是一个有|亿人口的国家|

Bigram (CJK) analysis: class org.apache.lucene.analysis.cjk.CJKAnalyzer
中华|华人|人名|名共|共和|和国|国简|简称|称中|中国|是一|一个|个有|13|亿人|人口|口的|的国|国家|

Keyword analysis: class org.apache.lucene.analysis.core.KeywordAnalyzer
中华人名共和国简称中国, 是一个有13亿人口的国家。|

Stop-word analysis: class org.apache.lucene.analysis.core.StopAnalyzer
中华人名共和国简称中国|是一个有|亿人口的国家|

Chinese analysis: class org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer
中华|人名|共和国|简称|中国|是|一个|有|13|亿|人口|的|国家|
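As the output shows, StandardAnalyzer breaks Chinese down to single characters and CJKAnalyzer to overlapping bigrams; only SmartChineseAnalyzer produces dictionary-level words. The IK analyzer configured below is another popular choice for dictionary-based Chinese segmentation, with user-extensible dictionaries.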

IK analyzer configuration

IKTokenizer6x

import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
public class IKTokenizer6x extends Tokenizer {
    // Underlying IK segmenter that performs the dictionary-based segmentation
    private IKSegmenter _IKImplement;
    private final CharTermAttribute termAttribute;
    private final OffsetAttribute offsetAttribute;
    private final TypeAttribute typeAttribute;
    // End offset of the last lexeme, reported as the final offset in end()
    private int endPosition;

    public IKTokenizer6x(boolean useSmart) {
        super();
        offsetAttribute = addAttribute(OffsetAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class);
        typeAttribute = addAttribute(TypeAttribute.class);
        _IKImplement = new IKSegmenter(input, useSmart);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        Lexeme nextLexeme = _IKImplement.next();
        if (nextLexeme != null) {
            // Copy the lexeme's text, offsets, and type into Lucene's attributes
            termAttribute.append(nextLexeme.getLexemeText());
            termAttribute.setLength(nextLexeme.getLength());
            offsetAttribute.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
            endPosition = nextLexeme.getEndPosition();
            typeAttribute.setType(nextLexeme.getLexemeTypeString());
            return true;
        }
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Re-bind the segmenter to the new input reader
        _IKImplement.reset(input);
    }

    @Override
    public final void end() throws IOException {
        super.end();
        int finalOffset = correctOffset(this.endPosition);
        offsetAttribute.setOffset(finalOffset, finalOffset);
    }
}

IKAnalyzer6x

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public class IKAnalyzer6x extends Analyzer {
    // true = smart (coarse-grained) segmentation; false = finest-grained segmentation
    private boolean useSmart;

    public boolean isUseSmart() {
        return useSmart;
    }

    public IKAnalyzer6x() {
        this(false);
    }

    public IKAnalyzer6x(boolean useSmart) {
        super();
        this.useSmart = useSmart;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Wrap the adapter tokenizer so IK runs on the Lucene 6.x/7.x Analyzer API
        Tokenizer _IKTokenizer = new IKTokenizer6x(this.isUseSmart());
        return new TokenStreamComponents(_IKTokenizer);
    }
}
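A minimal usage sketch (an assumption, not part of the original source): it reuses the printAnalyzer helper from VariousAnalyzers above, so the output format matches the earlier results; the actual tokens depend on the IK dictionaries found on the classpath.

public class IKDemo {
    public static void main(String[] args) throws Exception {
        // true = IK "smart" mode (coarse-grained); false = finest-grained splitting
        IKAnalyzer6x analyzer = new IKAnalyzer6x(true);
        System.out.println("IK analysis: " + analyzer.getClass());
        VariousAnalyzers.printAnalyzer(analyzer);
    }
}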

IK extension dictionary configuration

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <!--<entry key="ext_dict">com/org/config/my.dic;com/org/config/mine.dic;</entry>-->
    <!-- Configure your own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">stopword.dic; ext_stopword.dic</entry>

    <entry key="ext_dict">ext.dic</entry>
</properties>
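The dictionary files named above (ext.dic, stopword.dic, ext_stopword.dic) are plain UTF-8 text files with one entry per line, placed on the classpath next to this configuration file (conventionally named IKAnalyzer.cfg.xml at the classpath root). The entries below are only illustrative:

ext.dic:
云计算
物联网

ext_stopword.dic:
的
了

Words listed in ext_dict are always kept as single tokens; words listed in ext_stopwords are dropped from the token stream.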