
Lucene 4.6 indexing: IndexWriter and DocumentsWriter

程序员文章站 2022-05-17 09:31:15

Lucene's work falls into two main operations, indexing and searching, and together they close the loop. Let's start with indexing.

IndexWriter is the class Lucene exposes to the application layer for indexing. Applications enter Lucene's index world through this class.

A good way to see what this class actually does is to look at its member variables. A few of the more important ones:

  private final Directory directory;  // where this index resides
  private final Analyzer analyzer;    // how to analyze text
  private final DocumentsWriter docWriter;
  private final MergeScheduler mergeScheduler;
  private LinkedList<MergePolicy.OneMerge> pendingMerges = new LinkedList<MergePolicy.OneMerge>();
  private Set<MergePolicy.OneMerge> runningMerges = new HashSet<MergePolicy.OneMerge>();
  private List<MergePolicy.OneMerge> mergeExceptions = new ArrayList<MergePolicy.OneMerge>();
  private long mergeGen;
  private boolean stopMerges;

 

  

In other words: the directory, the segment information, the merging of segments, the analyzer, and the DocumentsWriter that does the actual writing.

 

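The constructor does roughly the following:

1.  Acquires the write lock

2.  Loads the configuration

3.  Initializes the flush policy (flushing from RAM to disk)

4.  Initializes the DocumentsWriter

5.  Initializes the IndexFileDeleter (which eventually deletes index files that are no longer needed, by keeping a reference count for every file)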
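The reference counting in step 5 can be pictured with a small model. The sketch below is not Lucene's deleter; RefCountSketch and all of its methods are hypothetical names, and it only records which files would be deleted instead of touching a Directory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-file reference counting; not Lucene's own class.
class RefCountSketch {
    private final Map<String, Integer> refCounts = new HashMap<String, Integer>();
    private final List<String> deleted = new ArrayList<String>();

    // Something (a commit, a reader) starts referencing the file.
    void incRef(String fileName) {
        Integer count = refCounts.get(fileName);
        refCounts.put(fileName, count == null ? 1 : count + 1);
    }

    // A reference goes away; the file is "deleted" once nothing references it.
    void decRef(String fileName) {
        Integer count = refCounts.get(fileName);
        if (count == null) {
            throw new IllegalStateException("no references to " + fileName);
        }
        if (count == 1) {
            refCounts.remove(fileName);
            deleted.add(fileName); // a real deleter would remove the file from the directory here
        } else {
            refCounts.put(fileName, count - 1);
        }
    }

    int refCount(String fileName) {
        Integer count = refCounts.get(fileName);
        return count == null ? 0 : count;
    }

    List<String> deletedFiles() {
        return deleted;
    }

    public static void main(String[] args) {
        RefCountSketch deleter = new RefCountSketch();
        deleter.incRef("_0.cfs"); // referenced by an older commit
        deleter.incRef("_0.cfs"); // and by a newer commit
        deleter.incRef("_1.cfs");
        deleter.decRef("_0.cfs"); // older commit dropped, one reference left
        deleter.decRef("_1.cfs"); // last reference gone
        System.out.println(deleter.deletedFiles()); // prints [_1.cfs]
    }
}
```

The point of the count is that a file shared by several commits survives until the last of them lets go of it.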

 

DocumentsWriter

IndexWriter operates on the index by calling methods on DocumentsWriter.

Every document is handed to the DocConsumer inside DocumentsWriter. DocConsumer is the core of the whole indexing process and the head of the indexing chain.

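DocumentsWriter has a synchronized method, getThreadState, which assigns a ThreadState to each thread. The thread then calls methods on that ThreadState, and most of the heavy lifting happens in those calls. Finally, the synchronized finishDocument method is called to flush the changes.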
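The pattern described above (a short synchronized step to get per-thread state, unsynchronized heavy work, then a short synchronized step to publish the result) can be sketched in plain Java. This is only an illustration under simplifying assumptions: ThreadingSketch, State and flushedCount are hypothetical names, and the "heavy lifting" is reduced to lowercasing a string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the locking pattern; none of this is Lucene code.
class ThreadingSketch {
    // Scratch state owned by exactly one thread.
    static class State {
        final StringBuilder buffer = new StringBuilder();
    }

    private final Map<Thread, State> states = new HashMap<Thread, State>();
    private final List<String> flushed = new ArrayList<String>();

    // Short synchronized step: look up or create the state for the calling thread.
    private synchronized State getThreadState() {
        Thread current = Thread.currentThread();
        State state = states.get(current);
        if (state == null) {
            state = new State();
            states.put(current, state);
        }
        return state;
    }

    // Short synchronized step: publish the finished document to shared storage.
    private synchronized void finishDocument(State state) {
        flushed.add(state.buffer.toString());
        state.buffer.setLength(0);
    }

    void addDocument(String text) {
        State state = getThreadState();
        // Heavy lifting without any lock: only this thread touches its own state.
        state.buffer.append(text.toLowerCase(Locale.ROOT));
        finishDocument(state);
    }

    synchronized int flushedCount() {
        return flushed.size();
    }

    public static void main(String[] args) throws InterruptedException {
        final ThreadingSketch writer = new ThreadingSketch();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            final int id = i;
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 100; j++) {
                        writer.addDocument("Doc " + id + "-" + j);
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(writer.flushedCount()); // prints 400
    }
}
```

Keeping the synchronized sections this small is what lets several threads analyze documents at the same time.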

  private final Directory directory;

  private volatile boolean closed;

  private final InfoStream infoStream;

  private final LiveIndexWriterConfig config;

  private final AtomicInteger numDocsInRAM = new AtomicInteger(0);

  // TODO: cut over to BytesRefHash in BufferedDeletes
  volatile DocumentsWriterDeleteQueue deleteQueue = new DocumentsWriterDeleteQueue();
  private final DocumentsWriterFlushQueue ticketQueue = new DocumentsWriterFlushQueue();
  /*
   * we preserve changes during a full flush since IW might not checkout before
   * we release all changes. NRT Readers otherwise suddenly return true from
   * isCurrent while there are actually changes currently committed. See also
   * #anyChanges() & #flushAllThreads
   */
  private volatile boolean pendingChangesInCurrentFullFlush;

  final DocumentsWriterPerThreadPool perThreadPool;
  final FlushPolicy flushPolicy;
  final DocumentsWriterFlushControl flushControl;
  private final IndexWriter writer;
  private final Queue<Event> events;

As the constructor shows, DocumentsWriter mostly manages policies and the DocumentsWriterPerThreadPool.

 

 

DocumentsWriterPerThread creates the DocConsumer, that is, the indexing chain (the core of the whole indexing process). The next chapter covers this in detail.

 

 

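ThreadState wraps a DocumentsWriterPerThread object and also holds the per-thread data needed for flushing. Its members and methods must be accessed by only one thread at a time, and callers are responsible for locking and unlocking it themselves.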
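The "caller locks and unlocks" rule can be sketched as follows. This is a minimal model, assuming the state is represented as a ReentrantLock subclass; LockedStateSketch, addBytes, getBytesUsed and recordDocument are hypothetical names used only for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a state object that callers must lock before use.
class LockedStateSketch extends ReentrantLock {
    private long bytesUsed; // per-thread data, guarded by this lock

    private void checkHeld() {
        if (!isHeldByCurrentThread()) {
            throw new IllegalStateException("caller must hold the lock");
        }
    }

    void addBytes(long delta) {
        checkHeld();
        bytesUsed += delta;
    }

    long getBytesUsed() {
        checkHeld();
        return bytesUsed;
    }

    // The calling convention: lock, work inside try, unlock in finally.
    static long recordDocument(LockedStateSketch state, long docBytes) {
        state.lock();
        try {
            state.addBytes(docBytes);
            return state.getBytesUsed();
        } finally {
            state.unlock();
        }
    }

    public static void main(String[] args) {
        LockedStateSketch state = new LockedStateSketch();
        recordDocument(state, 128);
        System.out.println(recordDocument(state, 64)); // prints 192
    }
}
```

Releasing in a finally block matters: if indexing a document throws, the state must still become available to other threads.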

 

 

DocumentsWriterPerThreadPool controls how ThreadState instances are handed out during indexing. Each ThreadState holds a reference to a DocumentsWriterPerThread, and every thread must obtain a ThreadState before it can index.
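One way to picture such a pool is shown below. The allocation strategy here (reuse a free state, otherwise create one up to a limit, otherwise wait for an existing one) is an assumption made for illustration, not a description of Lucene's implementation; PoolSketch, State, acquire and release are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical pool that hands out locked per-thread states.
class PoolSketch {
    static class State extends ReentrantLock {
        final int id;
        State(int id) {
            this.id = id;
        }
    }

    private final List<State> states = new ArrayList<State>();
    private final int maxStates;

    PoolSketch(int maxStates) {
        if (maxStates < 1) {
            throw new IllegalArgumentException("maxStates must be at least 1");
        }
        this.maxStates = maxStates;
    }

    // Returns a state that is locked by the calling thread.
    State acquire() {
        State fallback;
        synchronized (this) {
            // Prefer an existing state nobody is using right now.
            for (State s : states) {
                if (!s.isHeldByCurrentThread() && s.tryLock()) {
                    return s;
                }
            }
            // Otherwise grow the pool while the limit allows it.
            if (states.size() < maxStates) {
                State created = new State(states.size());
                created.lock();
                states.add(created);
                return created;
            }
            fallback = states.get(0);
        }
        // Pool is full and busy: wait for a state outside the monitor.
        fallback.lock();
        return fallback;
    }

    void release(State state) {
        state.unlock();
    }

    synchronized int size() {
        return states.size();
    }

    public static void main(String[] args) {
        PoolSketch pool = new PoolSketch(2);
        State first = pool.acquire();
        pool.release(first);
        State again = pool.acquire(); // the free state is reused, the pool does not grow
        pool.release(again);
        System.out.println(first == again); // prints true
    }
}
```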

 

 

DocumentsWriterFlushControl controls the flush policy and tracks how much memory each DocumentsWriterPerThread consumes.
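The bookkeeping can be sketched like this. It is a simplified model under stated assumptions: the rule "once the total exceeds the RAM buffer, pick the writer holding the most memory" is one plausible policy chosen for illustration, and FlushControlSketch, afterDocument and flushed are hypothetical names rather than Lucene methods.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of per-writer memory accounting with a flush trigger.
class FlushControlSketch {
    private final long ramBufferBytes;
    private final Map<String, Long> bytesPerWriter = new HashMap<String, Long>();
    private long totalBytes;

    FlushControlSketch(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    // Records the memory a document used; returns the writer to flush, or null.
    synchronized String afterDocument(String writer, long docBytes) {
        Long current = bytesPerWriter.get(writer);
        bytesPerWriter.put(writer, (current == null ? 0L : current) + docBytes);
        totalBytes += docBytes;
        if (totalBytes <= ramBufferBytes) {
            return null; // still within budget
        }
        // Over budget: choose the writer that currently holds the most memory.
        String largest = null;
        long largestBytes = -1;
        for (Map.Entry<String, Long> entry : bytesPerWriter.entrySet()) {
            if (entry.getValue() > largestBytes) {
                largest = entry.getKey();
                largestBytes = entry.getValue();
            }
        }
        return largest;
    }

    // Called after a writer has been flushed to disk; its memory is released.
    synchronized void flushed(String writer) {
        Long freed = bytesPerWriter.remove(writer);
        if (freed != null) {
            totalBytes -= freed;
        }
    }

    synchronized long totalBytes() {
        return totalBytes;
    }

    public static void main(String[] args) {
        FlushControlSketch control = new FlushControlSketch(100);
        control.afterDocument("dwpt-0", 60);
        control.afterDocument("dwpt-1", 30);
        String toFlush = control.afterDocument("dwpt-1", 20); // total is now 110
        System.out.println(toFlush); // prints dwpt-0
        control.flushed(toFlush);
        System.out.println(control.totalBytes()); // prints 50
    }
}
```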