Lucene Analyzers
程序员文章站
2022-07-01 15:31:36
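For English text, Lucene's main analyzer is StandardAnalyzer. For Chinese, the usual choice here is MMAnalyzer from JE-Analysis (极易分词), which is not part of Lucene and requires adding a separate jar, je-analysis-1.5.3.jar, to the classpath.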
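English analysis: 1. split the text into tokens -> 2. remove stop words (is, of) -> 3. reduce words to their base form (-ing, -ed, plurals, etc.) -> 4. convert to lowercase. Note that StandardAnalyzer on its own only tokenizes, lowercases, and removes stop words; step 3 needs an additional stemming filter such as PorterStemFilter.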
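The four English steps can be sketched in plain Java without any Lucene dependency. This is only an illustration under stated assumptions: the class name `EnglishPipelineSketch`, the tiny stop-word list, and the suffix-stripping `stem` method are all hypothetical stand-ins, not Lucene's implementation. The sketch lowercases before the stop-word check so that "The" matches "the".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class EnglishPipelineSketch {
    // Hypothetical stop-word list, far smaller than a real one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "of", "can", "be", "or"));

    // Toy suffix stripping; a stand-in for a real stemmer, not a Porter implementation.
    static String stem(String word) {
        if (word.length() > 4 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Step 1: split on anything that is not a letter or digit.
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Step 4 is applied early so the stop-word check is case-insensitive.
            String lower = raw.toLowerCase(Locale.ROOT);
            // Step 2: drop stop words.
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            // Step 3: reduce the word to a rough base form.
            terms.add(stem(lower));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The PGP signatures can be verified using PGP or GPG. "));
    }
}
```

A real stemmer handles far more cases than these three suffix rules; the point is only to show where each step sits in the chain.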
Chinese analysis: 1. split the text into words -> 2. remove stop words (的, 着)
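The two Chinese steps can be illustrated with a dictionary-based segmenter. The sketch below uses forward maximum matching, which is an assumption chosen for illustration and is not confirmed here to be what MMAnalyzer does internally. The class name `ChineseSegmentationSketch`, the toy dictionary, and the four-character word limit are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChineseSegmentationSketch {
    // Hypothetical toy dictionary; a real segmenter ships a large lexicon.
    private static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "世界", "发达国家", "国家", "居民", "消费", "电能", "费用"));

    // Stop words taken from the example in the text.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("的", "着"));

    // Assumed upper bound on word length, in characters.
    private static final int MAX_WORD_LENGTH = 4;

    static List<String> segment(String text) {
        List<String> terms = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Step 1: start with the longest candidate and shrink it until it is
            // a dictionary word or a single character.
            int end = Math.min(pos + MAX_WORD_LENGTH, text.length());
            while (end > pos + 1 && !DICTIONARY.contains(text.substring(pos, end))) {
                end--;
            }
            String word = text.substring(pos, end);
            // Step 2: drop stop words.
            if (!STOP_WORDS.contains(word)) {
                terms.add(word);
            }
            pos = end;
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(segment("世界发达国家居民消费的费用"));
    }
}
```

Because the longest match wins, "发达国家" is emitted as one word even though "国家" is also in the dictionary.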
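import java.io.StringReader;

// MMAnalyzer comes from je-analysis-1.5.3.jar, not from Lucene itself.
import jeasy.analysis.MMAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;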

// Written against the pre-3.0 Lucene API (no-argument analyzer constructors, TokenStream.next(Token)).
public class AnalyzerTest {
    static String enText = "The PGP signatures can be verified using PGP or GPG. ";
    static String chText = "世界发达国家居民消费1000度的电能的费用占全国月平均工资的6.79%";

    static Analyzer en1 = new StandardAnalyzer();
    static Analyzer en2 = new SimpleAnalyzer();
    static Analyzer ch1 = new MMAnalyzer();

    public static void main(String[] args) throws Exception {
        // Swap in (enText, en1) or (enText, en2) to compare the English analyzers.
        new AnalyzerTest().analyze(chText, ch1);
    }

    // Runs the text through the analyzer and prints every token it produces.
    public void analyze(String text, Analyzer analyzer) throws Exception {
        TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
        // next(Token) reuses the passed-in Token and returns null at the end of the stream.
        for (Token token = new Token(); (token = tokenStream.next(token)) != null;) {
            System.out.println(token);
        }
        tokenStream.close();
    }
}