欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

nutch-JE分词

程序员文章站 2022-05-16 09:08:49
...

先下载Nutch 1.0的源文件:

co http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0 ./nutch-1.0

更改查询语法解析部分:

改变tokenize的方式(原来为中文单字识别)

modify “src/java/org/apache/nutch/analysis/NutchAnalysis.jj”

line 130:

| <SIGRAM: <CJK> >
change to:

| <SIGRAM: (<CJK>)+ >
run “javacc”

cd nutch-1.0/src/java/org/apache/nutch/analysis
/usr/local/javacc-3.2/bin/javacc NutchAnalysis.jj


3 files will be regenerated:

中文分析部分(查询和索引):

将analyzer更换为JE中文分析器

a). copy “je-analysis-1.5.3.jar” to lib/

b). modify NutchDocumentAnalyzer.java

Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
===================================================================
--- src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java (revision 764668)
+++ src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java (working copy)
@@ -27,6 +27,8 @@
import org.apache.lucene.analysis.Token;
import org.apache.hadoop.conf.Configuration;

+import jeasy.analysis.*;
+
/**
* The analyzer used for Nutch documents. Uses the JavaCC-defined lexical
* analyzer {@link NutchDocumentTokenizer}, with no stop list. This keeps it
@@ -65,8 +67,14 @@

/** Constructs a {@link NutchDocumentTokenizer}. */
public TokenStream tokenStream(String field, Reader reader) {
- return this.commonGrams.getFilter(new NutchDocumentTokenizer(reader),
- field);
+ if ("content".equals(field) || "title".equals(field) || "DEFAULT".equals(field)) {
+ MMAnalyzer analyzer=new MMAnalyzer();
+ return analyzer.tokenStream(field, reader);
+ }
+ else {
+ return this.commonGrams.getFilter(new NutchDocumentTokenizer(reader),
+ field);
+ }
}
}

重新编译Nutch:

在build.xml添加一条指令(在第195行的下面加入一行),使的编译war文件的时候加入je-analysis的jar文件。

build.xml

<include name="lucene*.jar"/>
<include name="taglibs-*.jar"/>
<include name="hadoop-*.jar"/>
<include name="dom4j-*.jar"/>
<include name="xerces-*.jar"/>
<include name="tika-*.jar"/>
<include name="apache-solr-*.jar"/>
<include name="commons-httpclient-*.jar"/>
<include name="commons-codec-*.jar"/>
<include name="commons-collections-*.jar"/>
<include name="commons-beanutils-*.jar"/>
<include name="commons-cli-*.jar"/>
<include name="commons-lang-*.jar"/>
<include name="commons-logging-*.jar"/>
<include name="log4j-*.jar"/>
<include name="je-analysis-*.jar"/> <!-- add this line -->
</lib>

compile:
cd nutch-1.0
export ANT_HOME=/usr/local/apache-ant-1.7.1
/usr/local/apache-ant-1.7.1/bin/ant
/usr/local/apache-ant-1.7.1/bin/ant war

使用新生成的含中文分词功能的模块:

只用到刚才编译生成的下面三个文件,替换Nutch 1.0的tarball解压后的对应文件