Lucene Analyzers (Tokenizers)

程序员文章站 2022-07-01 15:31:36
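Lucene's main English analyzer is StandardAnalyzer. For Chinese, the usual choice is MMAnalyzer from the JE (极易) analysis library, which requires adding a separate jar, je-analysis-1.5.3.jar, to the classpath.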

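English analysis steps: 1. split the text into terms -> 2. remove stop words (is, of) -> 3. reduce words to their base form (-ing, -ed, plurals, etc.) -> 4. convert to lowercase. Note that StandardAnalyzer on its own does not perform step 3; that step needs a stemming filter such as PorterStemFilter.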
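
The four English steps can be sketched in plain Java, without Lucene, to make the order of operations concrete. This is only an illustration under stated assumptions: the class name `EnglishPipelineSketch`, the tiny stop-word list, and the naive suffix-stripping rules are all hypothetical and are not what Lucene ships. Stop words are compared case-insensitively here because lowercasing is listed as the last step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class EnglishPipelineSketch {

    // Tiny illustrative stop-word list (an assumption, not Lucene's built-in list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "of", "or", "be", "a", "an"));

    // Step 1: split on any run of characters that are not letters or digits.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 3: naive suffix stripping. Real stemmers (e.g. Porter) apply many more rules.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.length() > 4 && lower.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (lower.length() > 3 && lower.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (lower.length() > 3 && lower.endsWith("s") && !lower.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Runs steps 1 to 4 in the order listed in the text.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : split(text)) {
            // Step 2: drop stop words (compared case-insensitively).
            if (STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            String stemmed = stem(token);               // Step 3
            out.add(stemmed.toLowerCase(Locale.ROOT));  // Step 4
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [pgp, signature, can, verifi, us, pgp, gpg]
        System.out.println(analyze("The PGP signatures can be verified using PGP or GPG. "));
    }
}
```

The crude stems ("verifi", "us") show why step 3 is usually delegated to a proper stemming algorithm rather than hand-written suffix rules.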

Chinese analysis steps: 1. split the text into terms -> 2. remove stop words (的, 着)
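
Chinese text has no spaces between words, so step 1 is typically dictionary-driven. One common approach is forward maximum matching: at each position, take the longest dictionary word that matches, and fall back to a single character if none does. The sketch below illustrates that idea in plain Java. Whether MMAnalyzer works exactly this way is an assumption, and the class name `ForwardMaxMatchSketch`, the small dictionary, and the maximum word length are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForwardMaxMatchSketch {

    // Toy dictionary (an assumption; a real analyzer ships a far larger one).
    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "世界", "发达国家", "发达", "国家", "居民", "消费", "电能", "费用", "全国", "平均", "工资"));

    // Stop words taken from the example in the text.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("的", "着"));

    // Longest dictionary entry we try to match, in characters.
    static final int MAX_WORD_LENGTH = 4;

    // Step 1: forward maximum matching segmentation.
    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = Math.min(MAX_WORD_LENGTH, text.length() - pos);
            // Try the longest candidate first and shrink until the dictionary
            // matches or only a single character is left.
            while (len > 1 && !DICTIONARY.contains(text.substring(pos, pos + len))) {
                len--;
            }
            words.add(text.substring(pos, pos + len));
            pos += len;
        }
        return words;
    }

    // Steps 1 and 2: segment, then drop stop words.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String word : segment(text)) {
            if (!STOP_WORDS.contains(word)) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [发达国家, 居民, 消费, 电能]
        System.out.println(analyze("发达国家居民消费的电能"));
    }
}
```

Because the longest match wins, "发达国家" is kept as one term even though "发达" and "国家" are also in the dictionary.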




import java.io.StringReader;

import jeasy.analysis.MMAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerTest {

    static String enText = "The PGP signatures can be verified using PGP or GPG. ";
    static String chText = "世界发达国家居民消费1000度的电能的费用占全国月平均工资的6.79%";
    static Analyzer en1 = new StandardAnalyzer();
    static Analyzer en2 = new SimpleAnalyzer();
    // MMAnalyzer comes from je-analysis-1.5.3.jar, which must be on the classpath.
    static Analyzer ch1 = new MMAnalyzer();

    public static void main(String[] args) throws Exception {
        // Pass enText with en1 or en2 instead to compare the English analyzers.
        new AnalyzerTest().analyze(chText, ch1);
    }

    /** Prints every token that the given analyzer produces for the text. */
    public void analyze(String text, Analyzer analyzer) throws Exception {
        // No field name is needed here, so null is passed.
        TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
        // Lucene 2.x API: next(Token) reuses the token and returns null at the end of the stream.
        for (Token token = new Token(); (token = tokenStream.next(token)) != null;) {
            System.out.println(token);
        }
    }

}