nutch-JE分词
程序员文章站
2022-05-16 09:08:49
...
先下载Nutch 1.0的源文件:
co http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0 ./nutch-1.0
更改查询语法解析部分:
改变tokenize的方式(原来为中文单字识别)
modify “src/java/org/apache/nutch/analysis/NutchAnalysis.jj”
line 130:
| <SIGRAM: <CJK> >
change to:
| <SIGRAM: (<CJK>)+ >
run “javacc”
cd nutch-1.0/src/java/org/apache/nutch/analysis
/usr/local/javacc-3.2/bin/javacc NutchAnalysis.jj
/usr/local/javacc-3.2/bin/javacc NutchAnalysis.jj
3 files will be regenerated:
中文分析部分(查询和索引):
将analyzer更换为JE中文分析器
a). copy “je-analysis-1.5.3.jar” to lib/
b). modify NutchDocumentAnalyzer.javaIndex: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
===================================================================
--- src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java (revision 764668)
+++ src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java (working copy)
@@ -27,6 +27,8 @@
import org.apache.lucene.analysis.Token;
import org.apache.hadoop.conf.Configuration;
+import jeasy.analysis.*;
+
/**
* The analyzer used for Nutch documents. Uses the JavaCC-defined lexical
* analyzer {@link NutchDocumentTokenizer}, with no stop list. This keeps it
@@ -65,8 +67,14 @@
/** Constructs a {@link NutchDocumentTokenizer}. */
public TokenStream tokenStream(String field, Reader reader) {
- return this.commonGrams.getFilter(new NutchDocumentTokenizer(reader),
- field);
+ if ("content".equals(field) || "title".equals(field) || "DEFAULT".equals(field)) {
+ MMAnalyzer analyzer=new MMAnalyzer();
+ return analyzer.tokenStream(field, reader);
+ }
+ else {
+ return this.commonGrams.getFilter(new NutchDocumentTokenizer(reader),
+ field);
+ }
}
}
重新编译Nutch:
在build.xml添加一条指令(在第195行的下面加入一行),使的编译war文件的时候加入je-analysis的jar文件。
build.xml<include name="lucene*.jar"/>
<include name="taglibs-*.jar"/>
<include name="hadoop-*.jar"/>
<include name="dom4j-*.jar"/>
<include name="xerces-*.jar"/>
<include name="tika-*.jar"/>
<include name="apache-solr-*.jar"/>
<include name="commons-httpclient-*.jar"/>
<include name="commons-codec-*.jar"/>
<include name="commons-collections-*.jar"/>
<include name="commons-beanutils-*.jar"/>
<include name="commons-cli-*.jar"/>
<include name="commons-lang-*.jar"/>
<include name="commons-logging-*.jar"/>
<include name="log4j-*.jar"/>
<include name="je-analysis-*.jar"/> <!-- add this line -->
</lib>
compile:
cd nutch-1.0
export ANT_HOME=/usr/local/apache-ant-1.7.1
/usr/local/apache-ant-1.7.1/bin/ant
/usr/local/apache-ant-1.7.1/bin/ant war
export ANT_HOME=/usr/local/apache-ant-1.7.1
/usr/local/apache-ant-1.7.1/bin/ant
/usr/local/apache-ant-1.7.1/bin/ant war
使用新生成的含中文分词功能的模块:
只用到刚才编译生成的下面三个文件,替换Nutch 1.0的tarball解压后的对应文件上一篇: lucene-使用Highlighter高亮显示查询项
下一篇: lucene入门-复杂索引建立