Lucene Study Notes, Part 10: Lucene's Analyzer
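1. The abstract class Analyzer
It declares two main methods, both of which produce a TokenStream:
- TokenStream tokenStream(String fieldName, Reader reader);
- TokenStream reusableTokenStream(String fieldName, Reader reader);
A TokenStream, covered in more detail later, is a stream of the Tokens produced by analysis; each call advances it to the next Token.
reusableTokenStream exists for performance: within one thread, the previously created TokenStream can be reused instead of allocating a new one.
For this purpose Analyzer has the member variable CloseableThreadLocal<Object> tokenStreams = new CloseableThreadLocal<Object>(); which stores the TokenStream previously created by the current thread. It is set with setPreviousTokenStream and read with getPreviousTokenStream.
An implementation of reusableTokenStream typically fetches the old TokenStream with getPreviousTokenStream and resets it, so that it starts producing Tokens again from the new input.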
Here is one of the simplest Analyzers:
public final class SimpleAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // LowerCaseTokenizer splits the text at non-letter characters and lowercases each token.
    return new LowerCaseTokenizer(reader);
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    // Fetch the TokenStream used last time. If there is none, create one and
    // store it with setPreviousTokenStream so that the next call can reuse it.
    Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
    if (tokenizer == null) {
      tokenizer = new LowerCaseTokenizer(reader);
      setPreviousTokenStream(tokenizer);
    } else {
      // A TokenStream already exists: point it at the new reader.
      tokenizer.reset(reader);
    }
    return tokenizer;
  }
}
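2. The abstract class TokenStream
TokenStream has two main methods:
- boolean incrementToken() advances to the next Token.
- public void reset() lets this TokenStream start returning tokens from the beginning again.
Earlier versions of TokenStream returned a Token object. In Lucene 3.0 it no longer does, so where is the information about the next Token kept?
In Lucene 3.0, TokenStream extends AttributeSource, which holds a Map from attribute class to attribute object and can therefore store values of different types.
The attribute objects used most often in a TokenStream are TermAttributeImpl, which holds the Token text; PositionIncrementAttributeImpl, which holds position information; and OffsetAttributeImpl, which holds offsets.
So when a TokenStream is constructed, it typically calls TermAttribute termAtt = addAttribute(TermAttribute.class), which adds a TermAttributeImpl to the Map, and keeps the result in a member variable.
incrementToken() writes the information for the next Token into that termAtt, and the consumer then calls termAtt.term() to get the Token text.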
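The class-to-object map described above can be sketched in plain Java. This is only an illustration of the idea, not Lucene's AttributeSource: the names AttributeMapSketch, TermAttr and OffsetAttr are hypothetical, and the real class maps attribute interfaces to implementation objects rather than instantiating the requested class directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for AttributeSource: one shared instance per attribute class.
final class AttributeMapSketch {
  // Hypothetical attribute holding the text of the current token.
  static final class TermAttr {
    String term = "";
  }

  // Hypothetical attribute holding the offsets of the current token.
  static final class OffsetAttr {
    int start;
    int end;
  }

  private final Map<Class<?>, Object> attributes = new LinkedHashMap<>();

  // Returns the existing instance for the class, creating it on first request.
  <T> T addAttribute(Class<T> type) {
    Object existing = attributes.get(type);
    if (existing == null) {
      try {
        existing = type.getDeclaredConstructor().newInstance();
      } catch (ReflectiveOperationException e) {
        throw new IllegalArgumentException("cannot instantiate " + type, e);
      }
      attributes.put(type, existing);
    }
    return type.cast(existing);
  }

  public static void main(String[] args) {
    AttributeMapSketch source = new AttributeMapSketch();
    TermAttr producerView = source.addAttribute(TermAttr.class);
    TermAttr consumerView = source.addAttribute(TermAttr.class);
    producerView.term = "lucene";           // the producer overwrites the token in place
    System.out.println(consumerView.term);  // the consumer reads the same instance
  }
}
```

Because both sides hold the same instance, advancing the stream only needs to overwrite fields; no per-token object has to be allocated.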
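3. Some concrete TokenStreams
When a field is added at indexing time, you can either specify an Analyzer that produces the TokenStream or pass a TokenStream directly:
public Field(String name, TokenStream tokenStream);
Two TokenStreams that are used on their own are described below.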
3.1 NumericTokenStream
The previous part, on NumericRangeQuery, mentioned that a NumericField uses a NumericTokenStream internally. Its incrementToken is:
public boolean incrementToken() {
  if (valSize == 0)
    throw new IllegalStateException("call set???Value() before usage");
  if (shift >= valSize)
    return false;
  clearAttributes();
  // NumericTokenStream carries a number, but a Lucene Token can only hold text,
  // so the number is encoded as a string before it is written to the index.
  final char[] buffer;
  switch (valSize) {
    // Size the term buffer first, then encode the number into it.
    case 64:
      buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_LONG);
      termAtt.setTermLength(NumericUtils.longToPrefixCoded(value, shift, buffer));
      break;
    case 32:
      buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_INT);
      termAtt.setTermLength(NumericUtils.intToPrefixCoded((int) value, shift, buffer));
      break;
    default:
      throw new IllegalArgumentException("valSize must be 32 or 64");
  }
  typeAtt.setType((shift == 0) ? TOKEN_TYPE_FULL_PREC : TOKEN_TYPE_LOWER_PREC);
  posIncrAtt.setPositionIncrement((shift == 0) ? 1 : 0);
  shift += precisionStep;
  return true;
}
NumericUtils.intToPrefixCoded is:
public static int intToPrefixCoded(final int val, final int shift, final char[] buffer) {
  if (shift>31 || shift<0)
    throw new IllegalArgumentException("Illegal shift value, must be 0..31");
  int nChars = (31-shift)/7 + 1, len = nChars+1;
  buffer[0] = (char)(SHIFT_START_INT + shift);
  int sortableBits = val ^ 0x80000000;
  sortableBits >>>= shift;
  while (nChars>=1) {
    // Each group of 7 bits becomes one char (a single byte in UTF-8), most significant
    // group first, so comparing the encoded strings gives the same order as comparing the ints.
    buffer[nChars--] = (char)(sortableBits & 0x7f);
    sortableBits >>>= 7;
  }
  return len;
}
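The ordering claim in the comment above can be checked with a small standalone sketch. The method mirrors the quoted intToPrefixCoded but returns a String; the class name PrefixCodedSketch is hypothetical, and the value 0x60 for SHIFT_START_INT is an assumption, since the excerpt uses the constant without showing its definition.

```java
final class PrefixCodedSketch {
  // Assumed value of the shift marker; the excerpt does not show the constant.
  static final char SHIFT_START_INT = 0x60;

  // Same steps as the quoted intToPrefixCoded, returning a String for easy comparison.
  static String encode(int val, int shift) {
    if (shift > 31 || shift < 0) {
      throw new IllegalArgumentException("Illegal shift value, must be 0..31");
    }
    int nChars = (31 - shift) / 7 + 1;
    char[] buffer = new char[nChars + 1];
    buffer[0] = (char) (SHIFT_START_INT + shift);
    int sortableBits = val ^ 0x80000000;  // flip the sign bit so negative values sort first
    sortableBits >>>= shift;              // drop the low bits for lower-precision terms
    while (nChars >= 1) {
      buffer[nChars--] = (char) (sortableBits & 0x7f);  // 7 bits per char
      sortableBits >>>= 7;
    }
    return new String(buffer);
  }

  public static void main(String[] args) {
    int[] values = {-1000, -1, 0, 1, 42, 1000, Integer.MAX_VALUE};
    for (int i = 1; i < values.length; i++) {
      String prev = encode(values[i - 1], 0);
      String cur = encode(values[i], 0);
      // Each line reports whether the encoded strings keep the numeric order.
      System.out.println(values[i - 1] + " < " + values[i] + " : " + (prev.compareTo(cur) < 0));
    }
  }
}
```

With a non-zero shift, values that differ only in the dropped low bits encode to the same term, which is what lets one lower-precision term stand for a whole range of numbers.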
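3.2 SingleTokenTokenStream
SingleTokenTokenStream, as the name suggests, is a TokenStream that contains exactly one Token. It is mostly used for information that occurs once per document, such as an id or a time. The value is usually stored in the payload of the postings of one special Token (for example term ID in field ID, or term TIME in field TIME), so that the skip list can be used to speed up access.
The Token returned by SingleTokenTokenStream is therefore not the id or time itself but the special Token; the id or time value goes into its payload.
// At indexing time
int id = 0; // the application's own document id
String tokenstring = "ID";
byte[] value = idToBytes(id); // idToBytes: user-supplied helper that converts the id to a byte array
Token token = new Token(tokenstring, 0, tokenstring.length());
token.setPayload(new Payload(value));
SingleTokenTokenStream tokenstream = new SingleTokenTokenStream(token);
Document doc = new Document();
doc.add(new Field("ID", tokenstream));
……
// At search time: given Lucene's document number docid, read the application's id without building a Document
TermPositions tp = reader.termPositions(new Term("ID", "ID"));
boolean ret = tp.skipTo(docid);
tp.nextPosition();
int payloadlength = tp.getPayloadLength();
byte[] payloadBuffer = new byte[payloadlength];
tp.getPayload(payloadBuffer, 0);
int userId = bytesToID(payloadBuffer); // bytesToID: user-supplied helper that converts the payload back to the id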
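The example above leaves the two conversion helpers to the reader. One possible shape for them, assuming the id is an int stored as four big-endian bytes, is sketched here; the class name PayloadIdCodecSketch and the encoding are assumptions, and any reversible encoding would serve equally well.

```java
import java.nio.ByteBuffer;

final class PayloadIdCodecSketch {
  // Hypothetical idToBytes: encode the id as 4 bytes (ByteBuffer is big-endian by default).
  static byte[] idToBytes(int id) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(id).array();
  }

  // Hypothetical bytesToID: decode a 4-byte payload back into the id.
  static int bytesToID(byte[] payload) {
    if (payload.length != Integer.BYTES) {
      throw new IllegalArgumentException("expected " + Integer.BYTES + " bytes, got " + payload.length);
    }
    return ByteBuffer.wrap(payload).getInt();
  }

  public static void main(String[] args) {
    byte[] payload = idToBytes(123456);
    System.out.println(bytesToID(payload));  // round-trips the original id
  }
}
```

A fixed-width encoding keeps every payload the same size, so the reader can allocate its buffer without inspecting the value first.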
4. A Tokenizer is also a TokenStream
public abstract class Tokenizer extends TokenStream {
  protected Reader input;

  protected Tokenizer(Reader input) {
    this.input = CharReader.get(input);
  }

  public void reset(Reader input) throws IOException {
    this.input = input;
  }
}
The important Tokenizers are listed below; several of them are examined in the following subsections:
- CharTokenizer
  - LetterTokenizer
    - LowerCaseTokenizer
  - WhitespaceTokenizer
- ChineseTokenizer
- CJKTokenizer
- EdgeNGramTokenizer
- KeywordTokenizer
- NGramTokenizer
- SentenceTokenizer
- StandardTokenizer
4.1 CharTokenizer
CharTokenizer is an abstract class that splits text character by character.
Its constructor adds two attributes, TermAttribute and OffsetAttribute, so each token carries its offsets as well as its text.
offsetAtt = addAttribute(OffsetAttribute.class);
termAtt = addAttribute(TermAttribute.class);
Its incrementToken is:
public final boolean incrementToken() throws IOException {
  clearAttributes();
  int length = 0;
  int start = bufferIndex;
  char[] buffer = termAtt.termBuffer();
  while (true) {
    // Refill ioBuffer from the reader whenever it has been consumed.
    if (bufferIndex >= dataLen) {
      offset += dataLen;
      dataLen = input.read(ioBuffer);
      if (dataLen == -1) {
        dataLen = 0;
        if (length > 0) break;
        else return false;
      }
      bufferIndex = 0;
    }
    // Then walk through the buffered characters one at a time.
    final char c = ioBuffer[bufferIndex++];
    // A token character is normalized and appended; any other character ends the current token.
    if (isTokenChar(c)) {
      if (length == 0)
        start = offset + bufferIndex - 1;
      else if (length == buffer.length)
        buffer = termAtt.resizeTermBuffer(1+length);
      buffer[length++] = normalize(c);
      if (length == MAX_WORD_LEN)
        break;
    } else if (length > 0)
      break;
  }
  termAtt.setTermLength(length);
  offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));
  return true;
}
The isTokenChar and normalize methods of CharTokenizer are supplied by its subclasses.
The subclass WhitespaceTokenizer implements isTokenChar as follows:
// Whitespace ends the current token.
protected boolean isTokenChar(char c) {
  return !Character.isWhitespace(c);
}
The subclass LetterTokenizer implements isTokenChar as follows:
protected boolean isTokenChar(char c) {
  return Character.isLetter(c);
}
LowerCaseTokenizer, a subclass of LetterTokenizer, implements normalize to convert each character to lowercase:
protected char normalize(char c) {
  return Character.toLowerCase(c);
}
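The division of labour between the base class and its subclasses can be tried without Lucene. The following is a simplified sketch under stated assumptions: CharSplitSketch is a hypothetical name, it works on a String instead of a Reader, and it leaves out offsets, buffering and the MAX_WORD_LEN limit.

```java
import java.util.ArrayList;
import java.util.List;

// Template-method sketch: the base class owns the loop, subclasses decide what a token character is.
abstract class CharSplitSketch {
  abstract boolean isTokenChar(char c);

  // Default normalization keeps the character unchanged.
  char normalize(char c) {
    return c;
  }

  List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (isTokenChar(c)) {
        current.append(normalize(c));
      } else if (current.length() > 0) {  // a non-token character ends the current token
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  // Counterpart of WhitespaceTokenizer.
  static CharSplitSketch whitespace() {
    return new CharSplitSketch() {
      @Override boolean isTokenChar(char c) { return !Character.isWhitespace(c); }
    };
  }

  // Counterpart of LowerCaseTokenizer (letters only, lowercased).
  static CharSplitSketch lowerCaseLetters() {
    return new CharSplitSketch() {
      @Override boolean isTokenChar(char c) { return Character.isLetter(c); }
      @Override char normalize(char c) { return Character.toLowerCase(c); }
    };
  }

  public static void main(String[] args) {
    System.out.println(whitespace().tokenize("Hello, World 42"));
    System.out.println(lowerCaseLetters().tokenize("Hello, World 42"));
  }
}
```

The whitespace variant keeps punctuation and digits attached to their tokens, while the letter-based variant drops them, which is why the choice of tokenizer changes what ends up in the index.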
4.2 ChineseTokenizer
On initialization it adds TermAttribute and OffsetAttribute.
Its incrementToken is implemented as follows:
public boolean incrementToken() throws IOException {
  clearAttributes();
  length = 0;
  start = offset;
  while (true) {
    final char c;
    offset++;
    if (bufferIndex >= dataLen) {
      dataLen = input.read(ioBuffer);
      bufferIndex = 0;
    }
    if (dataLen == -1) return flush();
    else c = ioBuffer[bufferIndex++];
    switch (Character.getType(c)) {
      // Digits and upper- or lowercase letters belong to the same Token: push them into the buffer.
      case Character.DECIMAL_DIGIT_NUMBER:
      case Character.LOWERCASE_LETTER:
      case Character.UPPERCASE_LETTER:
        push(c);
        if (length == MAX_WORD_LEN) return flush();
        break;
      // Chinese characters are OTHER_LETTER. When one appears, the pending Token (if any) ends first;
      // the character itself is then returned as a Token of its own.
      case Character.OTHER_LETTER:
        if (length>0) {
          bufferIndex--;
          offset--;
          return flush();
        }
        push(c);
        return flush();
      default:
        if (length>0) return flush();
        break;
    }
  }
}
4.3 KeywordTokenizer
KeywordTokenizer returns the entire input as a single Token.
Its incrementToken is:
public final boolean incrementToken() throws IOException {
  if (!done) {
    clearAttributes();
    done = true;
    int upto = 0;
    char[] buffer = termAtt.termBuffer();
    // Read the whole input into the buffer, then return it.
    while (true) {
      final int length = input.read(buffer, upto, buffer.length-upto);
      if (length == -1) break;
      upto += length;
      if (upto == buffer.length)
        buffer = termAtt.resizeTermBuffer(1+buffer.length);
    }
    termAtt.setTermLength(upto);
    finalOffset = correctOffset(upto);
    offsetAtt.setOffset(correctOffset(0), finalOffset);
    return true;
  }
  return false;
}
4.4 CJKTokenizer
Its incrementToken is:
public boolean incrementToken() throws IOException {
  clearAttributes();
  while (true) {
    int length = 0;
    int start = offset;
    while (true) {
      // Read the current character and the Unicode block it belongs to.
      char c;
      Character.UnicodeBlock ub;
      offset++;
      if (bufferIndex >= dataLen) {
        dataLen = input.read(ioBuffer);
        bufferIndex = 0;
      }
      if (dataLen == -1) {
        if (length > 0) {
          if (preIsTokened == true) {
            length = 0;
            preIsTokened = false;
          }
          break;
        } else {
          return false;
        }
      } else {
        c = ioBuffer[bufferIndex++];
        ub = Character.UnicodeBlock.of(c);
      }
      // The current character is ASCII or a half-width/full-width form.
      if ((ub == Character.UnicodeBlock.BASIC_LATIN)
          || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS)) {
        if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
          int i = (int) c;
          if (i >= 65281 && i <= 65374) {
            // Map full-width forms to the corresponding ASCII characters.
            i = i - 65248;
            c = (char) i;
          }
        }
        // A letter, a digit, or one of '_', '+', '#'.
        if (Character.isLetterOrDigit(c) || ((c == '_') || (c == '+') || (c == '#'))) {
          if (length == 0) {
            start = offset - 1;
          } else if (tokenType == DOUBLE_TOKEN_TYPE) {
            offset--;
            bufferIndex--;
            if (preIsTokened == true) {
              length = 0;
              preIsTokened = false;
              break;
            } else {
              break;
            }
          }
          // Append the current character to the buffer.
          buffer[length++] = Character.toLowerCase(c);
          tokenType = SINGLE_TOKEN_TYPE;
          if (length == MAX_WORD_LEN) {
            break;
          }
        } else if (length > 0) {
          if (preIsTokened == true) {
            length = 0;
            preIsTokened = false;
          } else {
            break;
          }
        }
      } else {
        // A non-ASCII character.
        if (Character.isLetter(c)) {
          if (length == 0) {
            start = offset - 1;
            buffer[length++] = c;
            tokenType = DOUBLE_TOKEN_TYPE;
          } else {
            if (tokenType == SINGLE_TOKEN_TYPE) {
              offset--;
              bufferIndex--;
              break;
            } else {
              // Non-ASCII characters: every two adjacent characters form one Token
              // (for example "中华人民共和国" becomes "中华", "华人", "人民", "民共", "共和", "和国").
              buffer[length++] = c;
              tokenType = DOUBLE_TOKEN_TYPE;
              if (length == 2) {
                offset--;
                bufferIndex--;
                preIsTokened = true;
                break;
              }
            }
          }
        } else if (length > 0) {
          if (preIsTokened == true) {
            length = 0;
            preIsTokened = false;
          } else {
            break;
          }
        }
      }
    }
    if (length > 0) {
      termAtt.setTermBuffer(buffer, 0, length);
      offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));
      typeAtt.setType(TOKEN_TYPE_NAMES[tokenType]);
      return true;
    } else if (dataLen == -1) {
      return false;
    }
  }
}
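Stripped of the buffer bookkeeping, the treatment of a run of CJK characters amounts to emitting overlapping two-character windows. A minimal sketch of just that step follows; BigramSketch is a hypothetical name, and the code assumes the input is a single uninterrupted run of characters that each fit in one Java char.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
  // Emits every pair of adjacent characters; a run of length 1 is emitted as is.
  static List<String> bigrams(String run) {
    List<String> tokens = new ArrayList<>();
    if (run.length() == 1) {
      tokens.add(run);
      return tokens;
    }
    for (int i = 0; i + 1 < run.length(); i++) {
      tokens.add(run.substring(i, i + 2));  // window [i, i+2) overlaps the previous one by one char
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(bigrams("科学发展观"));
  }
}
```

Overlapping windows avoid the need for a dictionary: any two-character word in the text is guaranteed to appear as a token, at the cost of indexing pairs that are not words.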
4.5 SentenceTokenizer
It splits text into sentences at the following punctuation marks, in both their full-width and half-width forms: "。,!?;,!?;"
Consider the following example:
String s = "据纽约时报周三报道称,苹果已经超过微软成为美国最有价值的 科技公司。这是一个不容忽视的转折点。";
StringReader sr = new StringReader(s);
SentenceTokenizer tokenizer = new SentenceTokenizer(sr);
boolean hasnext = tokenizer.incrementToken();
while (hasnext) {
  TermAttribute ta = tokenizer.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = tokenizer.incrementToken();
}
The output begins:
据纽约时报周三报道称,
Its incrementToken is:
public boolean incrementToken() throws IOException {
  clearAttributes();
  buffer.setLength(0);
  int ci;
  char ch, pch;
  boolean atBegin = true;
  tokenStart = tokenEnd;
  ci = input.read();
  ch = (char) ci;
  while (true) {
    if (ci == -1) {
      break;
    } else if (PUNCTION.indexOf(ch) != -1) {
      // A punctuation mark ends the current sentence: append it and return the Token.
      buffer.append(ch);
      tokenEnd++;
      break;
    } else if (atBegin && Utility.SPACES.indexOf(ch) != -1) {
      tokenStart++;
      tokenEnd++;
      ci = input.read();
      ch = (char) ci;
    } else {
      buffer.append(ch);
      atBegin = false;
      tokenEnd++;
      pch = ch;
      ci = input.read();
      ch = (char) ci;
      // Two consecutive whitespace characters (two spaces, or \r\n) also end the current sentence.
      if (Utility.SPACES.indexOf(ch) != -1 && Utility.SPACES.indexOf(pch) != -1) {
        tokenEnd++;
        break;
      }
    }
  }
  if (buffer.length() == 0)
    return false;
  else {
    termAtt.setTermBuffer(buffer.toString());
    offsetAtt.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
    typeAtt.setType("sentence");
    return true;
  }
}
5. A TokenFilter is also a TokenStream
It filters the Tokens produced by a Tokenizer and follows the decorator pattern: each filter wraps another TokenStream.
public abstract class TokenFilter extends TokenStream {
  protected final TokenStream input;

  protected TokenFilter(TokenStream input) {
    super(input);
    this.input = input;
  }
}
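The decorator arrangement can be sketched with plain strings in place of attributes. Everything here is a simplified stand-in: TokenSource and FilterChainSketch are hypothetical names, a null return plays the role of incrementToken() returning false, and the two wrappers only imitate what a lowercasing filter and a length filter do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

final class FilterChainSketch {
  // Stand-in for TokenStream: returns the next token, or null when the stream is exhausted.
  interface TokenSource {
    String next();
  }

  // Innermost stage: splits on whitespace.
  static TokenSource whitespace(String text) {
    String trimmed = text.trim();
    List<String> parts = trimmed.isEmpty()
        ? Collections.<String>emptyList()
        : Arrays.asList(trimmed.split("\\s+"));
    Iterator<String> it = parts.iterator();
    return () -> it.hasNext() ? it.next() : null;
  }

  // Decorator that rewrites each token it pulls from the wrapped source.
  static TokenSource lowerCase(TokenSource input) {
    return () -> {
      String token = input.next();
      return token == null ? null : token.toLowerCase(Locale.ROOT);
    };
  }

  // Decorator that skips tokens until one has a length within [min, max].
  static TokenSource lengthBetween(TokenSource input, int min, int max) {
    return () -> {
      String token;
      while ((token = input.next()) != null) {
        if (token.length() >= min && token.length() <= max) {
          return token;
        }
      }
      return null;
    };
  }

  static List<String> drain(TokenSource source) {
    List<String> tokens = new ArrayList<>();
    for (String t = source.next(); t != null; t = source.next()) {
      tokens.add(t);
    }
    return tokens;
  }

  public static void main(String[] args) {
    TokenSource chain = lengthBetween(lowerCase(whitespace("A it HAS This there string english analyzer")), 4, 7);
    System.out.println(drain(chain));
  }
}
```

Each stage only knows the stage it wraps, so stages can be stacked in any order; an Analyzer is essentially a fixed choice of such a stack.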
5.1 ChineseFilter
Its incrementToken is:
public boolean incrementToken() throws IOException {
  while (input.incrementToken()) {
    char text[] = termAtt.termBuffer();
    int termLength = termAtt.termLength();
    // Only tokens that are not in the stop-word table are considered.
    if (!stopTable.contains(text, 0, termLength)) {
      switch (Character.getType(text[0])) {
        // English text counts as a Token only if it is longer than one character.
        case Character.LOWERCASE_LETTER:
        case Character.UPPERCASE_LETTER:
          if (termLength>1) {
            return true;
          }
          break;
        // A Chinese character always counts as a Token.
        case Character.OTHER_LETTER:
          return true;
      }
    }
  }
  return false;
}
Example:
String s = "Javaeye: IT外企那点儿事。1.外企也就那么会儿事。";
StringReader sr = new StringReader(s);
ChineseTokenizer ct = new ChineseTokenizer(sr);
ChineseFilter filter = new ChineseFilter(ct);
boolean hasnext = filter.incrementToken();
while (hasnext) {
  TermAttribute ta = filter.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = filter.incrementToken();
}
The output begins:
javaeye
5.2 LengthFilter
Its incrementToken is:
public final boolean incrementToken() throws IOException {
  while (input.incrementToken()) {
    int len = termAtt.termLength();
    // Return the token only if its length lies within the configured range.
    if (len >= min && len <= max) {
      return true;
    }
  }
  return false;
}
Example:
String s = "a it has this there string english analyzer";
StringReader sr = new StringReader(s);
WhitespaceTokenizer wt = new WhitespaceTokenizer(sr);
LengthFilter filter = new LengthFilter(wt, 4, 7);
boolean hasnext = filter.incrementToken();
while (hasnext) {
  TermAttribute ta = filter.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = filter.incrementToken();
}
The output begins:
this
5.3 LowerCaseFilter
Its incrementToken is:
public final boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    for (int i=0;i<length;i++)
      // Convert to lowercase in place.
      buffer[i] = Character.toLowerCase(buffer[i]);
    return true;
  } else
    return false;
}
5.4 NumericPayloadTokenFilter
Its incrementToken is:
public final boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    if (typeAtt.type().equals(typeMatch))
      // Attach the payload to tokens of the matching type.
      payloadAtt.setPayload(thePayload);
    return true;
  } else {
    return false;
  }
}
5.5 PorterStemFilter
It has a member variable PorterStemmer stemmer, which implements the well-known Porter Stemming Algorithm. The algorithm's home page is http://tartarus.org/~martin/PorterStemmer/, and its definition is at http://tartarus.org/~martin/PorterStemmer/def.txt.
It can be tried out on the page Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]:
cars -> car
driving -> drive
tokenization -> token
Its incrementToken is:
public final boolean incrementToken() throws IOException {
  if (!input.incrementToken())
    return false;
  if (stemmer.stem(termAtt.termBuffer(), 0, termAtt.termLength()))
    termAtt.setTermBuffer(stemmer.getResultBuffer(), 0, stemmer.getResultLength());
  return true;
}
Example:
String s = "Tokenization is the process of breaking a stream of text up into meaningful elements called tokens.";
StringReader sr = new StringReader(s);
LowerCaseTokenizer lt = new LowerCaseTokenizer(sr);
PorterStemFilter filter = new PorterStemFilter(lt);
boolean hasnext = filter.incrementToken();
while (hasnext) {
  TermAttribute ta = filter.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = filter.incrementToken();
}
The output begins:
token
5.6 ReverseStringFilter
Its incrementToken, and the reverse helper it relies on, are:
public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    int len = termAtt.termLength();
    if (marker != NOMARKER) {
      len++;
      termAtt.resizeTermBuffer(len);
      termAtt.termBuffer()[len - 1] = marker;
    }
    // Reverse the token in place.
    reverse( termAtt.termBuffer(), len );
    termAtt.setTermLength(len);
    return true;
  } else {
    return false;
  }
}
public static void reverse( char[] buffer, int start, int len ){
  if( len <= 1 ) return;
  int num = len>>1;
  for( int i = start; i < ( start + num ); i++ ){
    char c = buffer[i];
    buffer[i] = buffer[start * 2 + len - i - 1];
    buffer[start * 2 + len - i - 1] = c;
  }
}
Example:
String s = "Tokenization is the process of breaking a stream of text up into meaningful elements called tokens.";
StringReader sr = new StringReader(s);
LowerCaseTokenizer lt = new LowerCaseTokenizer(sr);
ReverseStringFilter filter = new ReverseStringFilter(lt);
boolean hasnext = filter.incrementToken();
while (hasnext) {
  TermAttribute ta = filter.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = filter.incrementToken();
}
The output begins:
noitazinekot
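The index expression start * 2 + len - i - 1 in reverse is the mirror image of position i within the window [start, start + len). A standalone copy of the loop makes this easy to try on part of a buffer; ReverseSketch is a hypothetical class name and the sample strings are chosen only for illustration.

```java
final class ReverseSketch {
  // Reverses buffer[start .. start+len-1] in place by swapping each position with its mirror.
  static void reverse(char[] buffer, int start, int len) {
    if (len <= 1) {
      return;
    }
    int half = len >> 1;
    for (int i = start; i < start + half; i++) {
      int mirror = start * 2 + len - i - 1;  // equals start + (start + len - 1 - i)
      char c = buffer[i];
      buffer[i] = buffer[mirror];
      buffer[mirror] = c;
    }
  }

  public static void main(String[] args) {
    char[] whole = "tokenization".toCharArray();
    reverse(whole, 0, whole.length);
    System.out.println(new String(whole));

    char[] part = "abcdef".toCharArray();
    reverse(part, 1, 3);  // only "bcd" is reversed
    System.out.println(new String(part));
  }
}
```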
5.7 SnowballFilter
It has a member variable SnowballProgram stemmer. SnowballProgram is an abstract class whose subclasses include EnglishStemmer and PorterStemmer.
public final boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    String originalTerm = termAtt.term();
    stemmer.setCurrent(originalTerm);
    stemmer.stem();
    String finalTerm = stemmer.getCurrent();
    if (!originalTerm.equals(finalTerm))
      termAtt.setTermBuffer(finalTerm);
    return true;
  } else {
    return false;
  }
}
Example:
String s = "Tokenization is the process of breaking a stream of text up into meaningful elements called tokens.";
StringReader sr = new StringReader(s);
LowerCaseTokenizer lt = new LowerCaseTokenizer(sr);
SnowballFilter filter = new SnowballFilter(lt, new EnglishStemmer());
boolean hasnext = filter.incrementToken();
while (hasnext) {
  TermAttribute ta = filter.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = filter.incrementToken();
}
The output begins:
token
5.8 TeeSinkTokenFilter
TeeSinkTokenFilter lets all or some of the Tokens that have already been produced be saved, so that they can form another TokenStream to be stored in a different field.
A TeeSinkTokenFilter is created like this:
TeeSinkTokenFilter source = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));
A SinkTokenStream is then created with newSinkTokenStream() or newSinkTokenStream(SinkFilter filter):
TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();
newSinkTokenStream(SinkFilter filter) stores the newly created SinkTokenStream in the sinks member variable of the TeeSinkTokenFilter.
The incrementToken of TeeSinkTokenFilter is:
public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    // For every Token, iterate over the member variable sinks.
    AttributeSource.State state = null;
    for (WeakReference<SinkTokenStream> ref : sinks) {
      // Ask each SinkTokenStream whether it accepts the Token; if it does, add the Token to it as well.
      final SinkTokenStream sink = ref.get();
      if (sink != null) {
        if (sink.accept(this)) {
          if (state == null) {
            state = this.captureState();
          }
          sink.addState(state);
        }
      }
    }
    return true;
  }
  return false;
}
SinkTokenStream.accept calls SinkFilter.accept. The default, ACCEPT_ALL_FILTER, accepts every Token:
private static final SinkFilter ACCEPT_ALL_FILTER = new SinkFilter() {
  @Override
  public boolean accept(AttributeSource source) {
    return true;
  }
};
The SinkTokenStream therefore ends up holding every Token produced by the WhitespaceTokenizer.
With a complex analysis system, analyzing a document can take a long time, and analyzing it a second time when the tokens are needed again would be wasteful. The approach above avoids this by keeping the tokens in a TokenStream.
For example:
String s = "this is a book";
StringReader reader = new StringReader(s);
TeeSinkTokenFilter source = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));
TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();
boolean hasnext = source.incrementToken();
while (hasnext) {
  TermAttribute ta = source.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = source.incrementToken();
}
System.out.println("---------------------------------------------");
hasnext = sink.incrementToken();
while (hasnext) {
  TermAttribute ta = sink.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = sink.incrementToken();
}
The output begins:
this
Sometimes we only want to extract certain entities from the Tokens and keep those.
For example:
String s = "Japan will always balance its national interests between China and America.";
StringReader reader = new StringReader(s);
TeeSinkTokenFilter source = new TeeSinkTokenFilter(new LowerCaseTokenizer(reader));
// A set holding all country names.
final HashSet<String> countryset = new HashSet<String>();
countryset.add("japan");
countryset.add("china");
countryset.add("america");
countryset.add("korea");
SinkFilter countryfilter = new SinkFilter() {
  @Override
  public boolean accept(AttributeSource source) {
    TermAttribute ta = source.getAttribute(TermAttribute.class);
    // Keep the token only if it is in the list of country names.
    if (countryset.contains(ta.term())) {
      return true;
    }
    return false;
  }
};
TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream(countryfilter);
// LowerCaseTokenizer splits the sentence; the country names in it are saved in the SinkTokenStream.
boolean hasnext = source.incrementToken();
while (hasnext) {
  TermAttribute ta = source.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = source.incrementToken();
}
System.out.println("---------------------------------------------");
hasnext = sink.incrementToken();
while (hasnext) {
  TermAttribute ta = sink.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = sink.incrementToken();
}
The output begins:
japan
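The tee-and-sink mechanism can be reduced to a few lines of plain Java. This is a sketch under simplifying assumptions, not the Lucene class: TeeSinkSketch and Sink are hypothetical names, tokens are plain strings rather than captured attribute states, and sinks are held by strong references instead of WeakReference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class TeeSinkSketch {
  // A sink caches the tokens its predicate accepts.
  static final class Sink {
    private final Predicate<String> accept;
    private final List<String> cached = new ArrayList<>();

    Sink(Predicate<String> accept) {
      this.accept = accept;
    }

    List<String> tokens() {
      return cached;
    }
  }

  private final Iterator<String> input;
  private final List<Sink> sinks = new ArrayList<>();

  TeeSinkSketch(Iterator<String> input) {
    this.input = input;
  }

  Sink newSink(Predicate<String> accept) {
    Sink sink = new Sink(accept);
    sinks.add(sink);
    return sink;
  }

  // Returns the next token (or null at the end) and copies it into every sink that accepts it.
  String next() {
    if (!input.hasNext()) {
      return null;
    }
    String token = input.next();
    for (Sink sink : sinks) {
      if (sink.accept.test(token)) {
        sink.cached.add(token);
      }
    }
    return token;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList(
        "japan will always balance its national interests between china and america".split(" "));
    Set<String> countries = new HashSet<>(Arrays.asList("japan", "china", "america", "korea"));
    TeeSinkSketch source = new TeeSinkSketch(words.iterator());
    Sink countrySink = source.newSink(countries::contains);
    while (source.next() != null) {
      // Consuming the source is what fills the sink.
    }
    System.out.println(countrySink.tokens());
  }
}
```

As in the Lucene example, the sink is empty until the source has been consumed, so the order of consumption matters.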
6. Different Analyzers combine different Tokenizers and TokenFilters into the final TokenStream
6.1 ChineseAnalyzer
public final TokenStream tokenStream(String fieldName, Reader reader) {
  // Split character by character, then filter out stop words, punctuation and single English letters.
  TokenStream result = new ChineseTokenizer(reader);
  result = new ChineseFilter(result);
  return result;
}
Example: "This year, president Hu 科学发展观" is analyzed into "year", "president", "hu", "科", "学", "发", "展", "观".
6.2 CJKAnalyzer
public final TokenStream tokenStream(String fieldName, Reader reader) {
  // Every two adjacent characters form a term; stop words are removed.
  return new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
      new CJKTokenizer(reader), stopTable);
}
Example: "This year, president Hu 科学发展观" is analyzed into "year", "president", "hu", "科学", "学发", "发展", "展观".
6.3 PorterStemAnalyzer
public TokenStream tokenStream(String fieldName, Reader reader) {
  // Lowercase the tokens, then stem them with the Porter algorithm.
  return new PorterStemFilter(new LowerCaseTokenizer(reader));
}
6.4 SmartChineseAnalyzer
public TokenStream tokenStream(String fieldName, Reader reader) {
  // Split into sentences first.
  TokenStream result = new SentenceTokenizer(reader);
  // Split each sentence into words.
  result = new WordTokenFilter(result);
  // Stem with the Porter algorithm.
  result = new PorterStemFilter(result);
  // Remove stop words.
  if (!stopWords.isEmpty()) {
    result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
        result, stopWords, false);
  }
  return result;
}
6.5 SnowballAnalyzer
public TokenStream tokenStream(String fieldName, Reader reader) {
  // Use the standard tokenizer.
  TokenStream result = new StandardTokenizer(matchVersion, reader);
  // Apply the standard filter.
  result = new StandardFilter(result);
  // Convert to lowercase.
  result = new LowerCaseFilter(result);
  // Remove stop words.
  if (stopSet != null)
    result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
        result, stopSet);
  // Stem with the configured stemmer.
  result = new SnowballFilter(result, name);
  return result;
}
7. Lucene's standard analyzer
7.1 StandardTokenizerImpl.jflex
Like QueryParser, the standard tokenizer needs lexical analysis. Earlier versions also used javacc for this; the current version uses jflex.
jflex is a generator of lexical analyzers. A jflex specification has three parts, separated by %%:
- User code: mostly package and import statements
- Options and macro declarations
- Lexical rules
The flex file that generates the standard tokenizer is StandardTokenizerImpl.jflex:
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
%%
// Above: user code. Below: options and declarations.
%class StandardTokenizerImpl // class name
%unicode
%integer // return type of the function below
%function getNextToken // name of the scanning function
%pack
%char
%{
// Code between these markers is copied verbatim into the generated Java file.
public static final int ALPHANUM = StandardTokenizer.ALPHANUM;
public static final int APOSTROPHE = StandardTokenizer.APOSTROPHE;
public static final int ACRONYM = StandardTokenizer.ACRONYM;
public static final int COMPANY = StandardTokenizer.COMPANY;
public static final int EMAIL = StandardTokenizer.EMAIL;
public static final int HOST = StandardTokenizer.HOST;
public static final int NUM = StandardTokenizer.NUM;
public static final int CJ = StandardTokenizer.CJ;
public static final int ACRONYM_DEP = StandardTokenizer.ACRONYM_DEP;
public static final String [] TOKEN_TYPES = StandardTokenizer.TOKEN_TYPES;
public final int yychar() {
  return yychar;
}
final void getText(Token t) {
  t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
}
final void getText(TermAttribute t) {
  t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
}
%}
THAI = [\u0E00-\u0E59]
// A sequence of letters and digits
ALPHANUM = ({LETTER}|{THAI}|[:digit:])+
// Words containing an apostrophe, such as you're
APOSTROPHE = {ALPHA} ("'" {ALPHA})+
// Acronyms, such as U.S.A.
ACRONYM = {LETTER} "." ({LETTER} ".")+
ACRONYM_DEP = {ALPHANUM} "." ({ALPHANUM} ".")+
// Company names, such as AT&T or Excite@Home
COMPANY = {ALPHA} ("&"|"@") {ALPHA}
// Email addresses
EMAIL = {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+
// Host names
HOST = {ALPHANUM} ((".") {ALPHANUM})+
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
     | {HAS_DIGIT} {P} {ALPHANUM}
     | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
     | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
     | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
     | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
// Punctuation
P = ("_"|"-"|"/"|"."|",")
// A string containing at least one digit
HAS_DIGIT = ({LETTER}|[:digit:])* [:digit:] ({LETTER}|[:digit:])*
ALPHA = ({LETTER})+
// A letter: any letter character that is not Chinese or Japanese
LETTER = !(![:letter:]|{CJ})
// Chinese or Japanese characters
CJ = [\u3100-\u312f\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff65-\uff9f]
// Whitespace
WHITESPACE = \r\n | [ \r\n\t\f]
%%
// Below: the lexical rules. A tokenizer needs no syntactic analysis, so each rule simply returns the token type.
{ALPHANUM} { return ALPHANUM; }
{APOSTROPHE} { return APOSTROPHE; }
{ACRONYM} { return ACRONYM; }
{COMPANY} { return COMPANY; }
{EMAIL} { return EMAIL; }
{HOST} { return HOST; }
{NUM} { return NUM; }
{CJ} { return CJ; }
{ACRONYM_DEP} { return ACRONYM_DEP; }
The following example shows what StandardTokenizerImpl does:
String s = "I'm Juexian, my email is forfuture1978@gmail.com. My ip address is 192.168.0.1, AT&T and I.B.M are all great companies.";
StringReader reader = new StringReader(s);
StandardTokenizerImpl impl = new StandardTokenizerImpl(reader);
while (impl.getNextToken() != StandardTokenizerImpl.YYEOF) {
  TermAttributeImpl ta = new TermAttributeImpl();
  impl.getText(ta);
  System.out.println(ta.term());
}
The output begins:
I'm
7.2 StandardTokenizer
It has a member variable StandardTokenizerImpl scanner.
Its incrementToken is:
public final boolean incrementToken() throws IOException {
  clearAttributes();
  int posIncr = 1;
  while (true) {
    // Ask the scanner for the next Token and its type.
    int tokenType = scanner.getNextToken();
    if (tokenType == StandardTokenizerImpl.YYEOF) {
      return false;
    }
    if (scanner.yylength() <= maxTokenLength) {
      posIncrAtt.setPositionIncrement(posIncr);
      // Copy the Token text.
      scanner.getText(termAtt);
      final int start = scanner.yychar();
      offsetAtt.setOffset(correctOffset(start), correctOffset(start+termAtt.termLength()));
      // Set the type.
      typeAtt.setType(StandardTokenizerImpl.TOKEN_TYPES[tokenType]);
      return true;
    } else
      // The token is too long: skip it, but account for its position.
      posIncr++;
  }
}
7.3 StandardFilter
Its incrementToken is:
public final boolean incrementToken() throws java.io.IOException {
  if (!input.incrementToken()) {
    return false;
  }
  char[] buffer = termAtt.termBuffer();
  final int bufferLength = termAtt.termLength();
  final String type = typeAtt.type();
  // For a token of apostrophe type that ends in 's, such as He's, strip the 's.
  if (type == APOSTROPHE_TYPE && bufferLength >= 2 && buffer[bufferLength-2] == '\''
      && (buffer[bufferLength-1] == 's' || buffer[bufferLength-1] == 'S')) {
    termAtt.setTermLength(bufferLength - 2);
  } else if (type == ACRONYM_TYPE) {
    // For an acronym such as I.B.M., remove the dots.
    int upto = 0;
    for (int i=0;i<bufferLength;i++) {
      char c = buffer[i];
      if (c != '.')
        buffer[upto++] = c;
    }
    termAtt.setTermLength(upto);
  }
  return true;
}
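The two rewrites performed by StandardFilter are easy to try in isolation. In the sketch below they are written as string functions; StandardFilterSketch and both method names are hypothetical, and the sketch leaves the choice of rewrite to the caller, whereas the real filter decides from the token type assigned by the tokenizer.

```java
final class StandardFilterSketch {
  // Drops a trailing 's or 'S, as the filter does for apostrophe-type tokens.
  static String stripPossessive(String term) {
    int n = term.length();
    if (n >= 2 && term.charAt(n - 2) == '\''
        && (term.charAt(n - 1) == 's' || term.charAt(n - 1) == 'S')) {
      return term.substring(0, n - 2);
    }
    return term;
  }

  // Removes every dot, as the filter does for acronym-type tokens.
  static String stripDots(String acronym) {
    StringBuilder out = new StringBuilder(acronym.length());
    for (int i = 0; i < acronym.length(); i++) {
      char c = acronym.charAt(i);
      if (c != '.') {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(stripPossessive("He's"));
    System.out.println(stripPossessive("I'm"));
    System.out.println(stripDots("I.B.M."));
  }
}
```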
7.4 StandardAnalyzer
public TokenStream tokenStream(String fieldName, Reader reader) {
  // Tokenize with the generated scanner.
  StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
  tokenStream.setMaxTokenLength(maxTokenLength);
  // Apply the standard filter.
  TokenStream result = new StandardFilter(tokenStream);
  // Convert to lowercase.
  result = new LowerCaseFilter(result);
  // Remove stop words.
  result = new StopFilter(enableStopPositionIncrements, result, stopSet);
  return result;
}
Example:
String s = "He's Juexian, His email is forfuture1978@gmail.com. He's an ip address 192.168.0.1, AT&T and I.B.M. are all great companies.";
StringReader reader = new StringReader(s);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
TokenStream ts = analyzer.tokenStream("field", reader);
boolean hasnext = ts.incrementToken();
while (hasnext) {
  TermAttribute ta = ts.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = ts.incrementToken();
}
The output begins:
he
8. Using different analyzers for different fields
8.1 PerFieldAnalyzerWrapper
Sometimes different fields should be analyzed differently. PerFieldAnalyzerWrapper wraps several analyzers for this purpose.
It has two member variables:
- Analyzer defaultAnalyzer: the analyzer used when no analyzer has been specified for a field
- Map<String,Analyzer> analyzerMap = new HashMap<String,Analyzer>(): a map from field name to analyzer; the analyzer is chosen by field name.
Its tokenStream method is:
public TokenStream tokenStream(String fieldName, Reader reader) {
  Analyzer analyzer = analyzerMap.get(fieldName);
  if (analyzer == null) {
    analyzer = defaultAnalyzer;
  }
  return analyzer.tokenStream(fieldName, reader);
}
An example, in which analyzer is a PerFieldAnalyzerWrapper (its construction and the loops that print each stream are not part of this excerpt):
String s = "Hello World";
TokenStream ts = analyzer.reusableTokenStream("f1", new StringReader(s));
System.out.println("---------------------------------------------");
ts = analyzer.reusableTokenStream("f2", new StringReader(s));
System.out.println("---------------------------------------------");
ts = analyzer.reusableTokenStream("none", new StringReader(s));
The output begins:
Hello World
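The lookup-with-fallback in tokenStream can be sketched without Lucene by letting a function from text to tokens stand in for an Analyzer. All names below are hypothetical, and the three analyzers registered in main are illustrative choices, not the ones used in the original example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class PerFieldSketch {
  private final Function<String, List<String>> defaultAnalyzer;
  private final Map<String, Function<String, List<String>>> analyzerMap = new HashMap<>();

  PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
    this.defaultAnalyzer = defaultAnalyzer;
  }

  void addAnalyzer(String fieldName, Function<String, List<String>> analyzer) {
    analyzerMap.put(fieldName, analyzer);
  }

  // Same decision as the quoted tokenStream: use the field's analyzer, else the default.
  List<String> analyze(String fieldName, String text) {
    Function<String, List<String>> analyzer = analyzerMap.get(fieldName);
    if (analyzer == null) {
      analyzer = defaultAnalyzer;
    }
    return analyzer.apply(text);
  }

  public static void main(String[] args) {
    // Default: lowercase, then split on whitespace.
    PerFieldSketch wrapper = new PerFieldSketch(
        text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
    wrapper.addAnalyzer("f1", text -> Collections.singletonList(text));  // whole input as one token
    wrapper.addAnalyzer("f2", text -> Arrays.asList(text.split("\\s+")));  // whitespace split only

    for (String field : new String[] {"f1", "f2", "none"}) {
      System.out.println(field + " -> " + wrapper.analyze(field, "Hello World"));
    }
  }
}
```

Because the wrapper has the same shape as the analyzers it wraps, calling code needs no per-field logic of its own.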