Exploring the Version Field in Lucene 4

程序员文章站 2022-05-13 13:34:48


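When you create a Lucene IndexWriter, Analyzer, or QueryParser, you have to pass a Version argument. What does this version mean? Why does each new release mark all earlier Version constants as deprecated? And can a newer Lucene open an index written by an older release and keep indexing into it?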

Start with the Javadoc of the Version class itself:

Use by certain classes to match version compatibility across releases of Lucene.

WARNING: When changing the version parameter that you supply to components in Lucene, do not simply change the version at search-time, but instead also adjust your indexing code to match, and re-index.

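That is, across Lucene releases, certain classes use Version to stay compatible with earlier behavior. What they check is the Version constant the caller passes in, not the version of the Lucene jar on the classpath. For example, if a behavior was introduced in 3.1, the class tests whether the supplied version is on or after 3.1 and runs the new code only in that case; otherwise it falls back to the old behavior.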

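The Javadoc also warns that when you change the Lucene version constant in your code, changing it only on the search side is not enough: you must change it in the indexing code as well, and then rebuild the index. In other words, the version used at search time must be the same as the version used at index time.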
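To see why the search-time version has to match the index-time version, here is a minimal, self-contained sketch. It does not use Lucene: `ToyVersion`, `analyze`, and `ToyIndex` are hypothetical stand-ins, and the two tokenization rules are invented purely for illustration. The only point is that terms written under one rule may not be found by a query analyzed under another.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class VersionMismatchDemo {

    /** Hypothetical stand-in for Lucene's Version; not the real class. */
    public enum ToyVersion {
        V30, V31;

        public boolean onOrAfter(ToyVersion other) {
            return compareTo(other) >= 0;
        }
    }

    /** Toy analyzer: the tokenization rule depends on the version passed in. */
    public static List<String> analyze(ToyVersion matchVersion, String text) {
        // Invented rules: the "new" behavior also splits on hyphens,
        // the "old" behavior splits on whitespace only.
        String regex = matchVersion.onOrAfter(ToyVersion.V31) ? "[\\s-]+" : "\\s+";
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split(regex)) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Toy inverted index: term -> ids of the documents that contain it. */
    public static final class ToyIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        public void add(int docId, List<String> tokens) {
            for (String t : tokens) {
                postings.computeIfAbsent(t, k -> new HashSet<>()).add(docId);
            }
        }

        /** Returns the ids of documents that contain every query token. */
        public Set<Integer> search(List<String> queryTokens) {
            Set<Integer> result = null;
            for (String t : queryTokens) {
                Set<Integer> docs = postings.getOrDefault(t, new HashSet<>());
                if (result == null) {
                    result = new HashSet<>(docs);
                } else {
                    result.retainAll(docs);
                }
            }
            return result == null ? new HashSet<>() : result;
        }
    }

    /** Indexes one document with indexVersion, then searches it with searchVersion. */
    public static boolean found(ToyVersion indexVersion, ToyVersion searchVersion) {
        ToyIndex index = new ToyIndex();
        index.add(1, analyze(indexVersion, "full-text search"));
        return index.search(analyze(searchVersion, "full-text")).contains(1);
    }

    public static void main(String[] args) {
        System.out.println("same version:  " + found(ToyVersion.V30, ToyVersion.V30));
        System.out.println("mixed version: " + found(ToyVersion.V30, ToyVersion.V31));
    }
}
```

With the old rule on both sides, the indexed term `full-text` is found. With the old rule at index time and the new rule at search time, the query becomes `full` and `text`, neither of which was ever indexed, so the document is missed. Rebuilding the index with the new version fixes this, which is what the warning asks for.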

Next, the Javadoc of the IndexWriterConfig constructor:

Creates a new config that with defaults that match the specified Version as well as the default Analyzer. If matchVersion is >= Version.LUCENE_32, TieredMergePolicy is used for merging; else LogByteSizeMergePolicy. Note that TieredMergePolicy is free to select non-contiguous merges, which means docIDs may not remain monotonic over time. If this is a problem you should switch to LogByteSizeMergePolicy or LogDocMergePolicy.

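This comment says that if the supplied version is LUCENE_32 or later (note the >=), TieredMergePolicy is used; otherwise LogByteSizeMergePolicy is used. So the version you pass changes which merge policy Lucene picks by default.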
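The default-selection rule from that Javadoc can be sketched without Lucene. `ToyVersion` and `defaultMergePolicy` are hypothetical names; only the policy names and the `>= LUCENE_32` boundary come from the Javadoc. This illustrates the rule and is not the actual IndexWriterConfig source.

```java
public class MergePolicyDefaultDemo {

    /** Hypothetical stand-in for a few Lucene Version constants, in release order. */
    public enum ToyVersion {
        LUCENE_31, LUCENE_32, LUCENE_40;

        public boolean onOrAfter(ToyVersion other) {
            return compareTo(other) >= 0;
        }
    }

    /** Returns the name of the default merge policy the Javadoc describes for matchVersion. */
    public static String defaultMergePolicy(ToyVersion matchVersion) {
        // The Javadoc says ">=", so LUCENE_32 itself already gets TieredMergePolicy.
        return matchVersion.onOrAfter(ToyVersion.LUCENE_32)
                ? "TieredMergePolicy"
                : "LogByteSizeMergePolicy";
    }

    public static void main(String[] args) {
        for (ToyVersion v : ToyVersion.values()) {
            System.out.println(v + " -> " + defaultMergePolicy(v));
        }
    }
}
```

If your application relies on docIDs staying monotonic, the Javadoc's advice is to set LogByteSizeMergePolicy or LogDocMergePolicy explicitly rather than depend on the version-driven default.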

Now look at StandardAnalyzer. The Javadoc of the StandardTokenizer it is built on says:

You must specify the required Version compatibility when creating StandardTokenizer:

As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility.
As of 3.1, StandardTokenizer implements Unicode text segmentation. If you use a previous version number, you get the exact behavior of ClassicTokenizer for backwards compatibility.

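If I read this correctly, before 3.4 Hiragana and Han characters were wrongly split from their combining characters. If you pass a pre-3.4 version constant while running Lucene 4, you still get that broken behavior, kept on purpose for backwards compatibility.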

The Lucene source contains a lot of code along these lines.

From the StandardFilter class:

    if (matchVersion.onOrAfter(Version.LUCENE_31))
      return input.incrementToken(); // TODO: add some niceties for the new grammar
    else
      return incrementTokenClassic();

This is especially common in the analysis packages.

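So can index files written by an older version be used with a newer one? In my tests a newer version could read and search them, so newer releases are compatible with older data, even when the major version number differs. (Lucene's stated guarantee covers indexes written by the previous major version, for example 3.x indexes under 4.x; anything older has to be upgraded or rebuilt first.)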

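Conclusion:

To sum up, one rule is enough when using Lucene: always pass the newest Version constant, and rebuild the index whenever you change it. An old release cannot use a newer constant because it does not define one, and in a new release you should not use an old constant. The Version parameter mainly exists so that Lucene can stay backward compatible in how it analyzes and indexes data: passing an older constant keeps the old behavior available, but you give up the fixes and improved defaults that come with the newer version.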

Please support the original post:

http://donlianli.iteye.com/blog/1976920

Interested in topics like this? Feel free to email donlianli@126.com

About me: from Handan; I work mainly with Java, Javascript, Extjs, and Oracle SQL.
