Lucene Code Analysis
Lucene reading process
Analyzer analyzer = new StandardAnalyzer();
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead (FSDirectory.open takes a java.nio.file.Path):
//Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
IndexWriter creates and maintains an index.
Connecting to the Directory
The {@link OpenMode} option on {@link IndexWriterConfig#setOpenMode(OpenMode)} determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with {@link OpenMode#CREATE} even while readers are using the index. The old readers will continue to search the “point in time” snapshot they had opened, and won’t see the newly created index until they re-open. If {@link OpenMode#CREATE_OR_APPEND} is used, IndexWriter will create a new index if there is not already an index at the provided path, and otherwise open the existing index.
In either case, documents are added with {@link #addDocument(Iterable) addDocument} and removed with {@link #deleteDocuments(Term...)} or {@link #deleteDocuments(Query...)}. A document can be updated with {@link #updateDocument(Term, Iterable) updateDocument} (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, {@link #close() close} should be called.
Each method that changes the index returns a {@code long} sequence number, which expresses the effective order in which each change was applied. {@link #commit} also returns a sequence number, describing which changes are in the commit point and which are not. Sequence numbers are transient (not saved into the index in any way) and only valid within a single {@code IndexWriter} instance.
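The sequence-number contract can be pictured with a small stand-alone model. This is not Lucene code: `SeqNoModel` and its methods are hypothetical names, and the sketch assumes that a commit reports the number of the last change it contains, so a change is covered when its number is less than or equal to the commit's number.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of per-writer sequence numbers; not Lucene's implementation. */
final class SeqNoModel {
    // The counter lives only in memory, so numbers from two different
    // instances cannot be compared with each other.
    private final AtomicLong lastSeqNo = new AtomicLong(0);

    /** Every index-changing call returns the next sequence number. */
    long addDocument(String doc) {
        return lastSeqNo.incrementAndGet();
    }

    long deleteDocuments(String term) {
        return lastSeqNo.incrementAndGet();
    }

    /** Assumption of this model: commit returns the last change it contains. */
    long commit() {
        return lastSeqNo.get();
    }

    /** True if the change is part of the commit with the given number. */
    static boolean includedInCommit(long changeSeqNo, long commitSeqNo) {
        return changeSeqNo <= commitSeqNo;
    }
}
```

Comparing a change's number with a commit's number is only meaningful for values handed out by the same writer instance, which matches the "transient" wording above.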
These changes are buffered in memory and periodically flushed to the {@link Directory} (during the above method calls). A flush is triggered when there are enough added documents since the last flush. Flushing is triggered either by RAM usage of the documents (see {@link IndexWriterConfig#setRAMBufferSizeMB}) or the number of added documents (see {@link IndexWriterConfig#setMaxBufferedDocs(int)}). The default is to flush when RAM usage hits {@link IndexWriterConfig#DEFAULT_RAM_BUFFER_SIZE_MB} MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. Additionally, if IndexWriter reaches the configured number of buffered deletes (see {@link IndexWriterConfig#setMaxBufferedDeleteTerms}) the deleted terms and queries are flushed and applied to existing segments. In contrast to the other flush options {@link IndexWriterConfig#setRAMBufferSizeMB} and {@link IndexWriterConfig#setMaxBufferedDocs(int)}, deleted terms won’t trigger a segment flush. Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either {@link #commit()} or {@link #close} is called. A flush may also trigger one or more segment merges which by default run with a background thread so as not to block the addDocument calls (see below for changing the {@link MergeScheduler}).
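The two flush triggers described above (RAM usage and buffered document count) can be sketched as a tiny model. `FlushTriggerModel` and its members are hypothetical names, document sizes are passed in by hand, and buffered deletes and merges are left out, so treat it as an illustration of the rule rather than of Lucene's flush code.

```java
/** Toy model of the two flush triggers; not Lucene's implementation. */
final class FlushTriggerModel {
    /** Hypothetical sentinel meaning "this trigger is switched off". */
    static final int DISABLED = -1;

    private final long ramBufferBytes;
    private final int maxBufferedDocs;
    private long bufferedBytes = 0;
    private int bufferedDocs = 0;
    private int flushCount = 0;

    FlushTriggerModel(double ramBufferMB, int maxBufferedDocs) {
        this.ramBufferBytes = ramBufferMB < 0
                ? Long.MAX_VALUE
                : (long) (ramBufferMB * 1024 * 1024);
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers one document; returns true if this call triggered a flush. */
    boolean addDocument(long docBytes) {
        bufferedBytes += docBytes;
        bufferedDocs++;
        boolean byRam = bufferedBytes >= ramBufferBytes;
        boolean byCount = maxBufferedDocs != DISABLED && bufferedDocs >= maxBufferedDocs;
        if (byRam || byCount) {
            // A flush empties the in-memory buffer; in this model it does
            // not make anything visible to readers.
            bufferedBytes = 0;
            bufferedDocs = 0;
            flushCount++;
            return true;
        }
        return false;
    }

    int flushCount() {
        return flushCount;
    }
}
```

With a 1 MB buffer and the count trigger disabled, two 600,000-byte documents cause one flush on the second call. With the RAM trigger disabled and a limit of 3 documents, every third call flushes.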
Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a {@link LockObtainFailedException}.
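The effect of the lock file can be reproduced with plain `java.nio`, without Lucene. The sketch below assumes a lock file named `write.lock` inside the index directory and uses an OS-level file lock on it. `WriteLockSketch` and its methods are hypothetical names, and where Lucene would throw `LockObtainFailedException` this sketch simply returns `null`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of directory locking with a lock file; not Lucene's own code. */
final class WriteLockSketch {
    /** Tries to lock indexDir/write.lock; returns null if it is already held. */
    static FileLock tryObtain(Path indexDir) throws IOException {
        FileChannel channel = FileChannel.open(indexDir.resolve("write.lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            FileLock lock = channel.tryLock(); // null: another process holds it
            if (lock == null) {
                channel.close();
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            channel.close(); // this JVM already holds the lock
            return null;
        }
    }

    /** Releases the lock and closes its channel. */
    static void release(FileLock lock) throws IOException {
        lock.release();
        lock.channel().close();
    }

    /** Returns true if a second attempt on the same directory is rejected. */
    static boolean secondAttemptRejected() {
        try {
            Path dir = Files.createTempDirectory("lock-sketch");
            FileLock first = tryObtain(dir);
            boolean rejected = first != null && tryObtain(dir) == null;
            if (first != null) {
                release(first);
            }
            return rejected;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Once the first holder calls `release`, a later `tryObtain` on the same directory succeeds again, which is the counterpart of closing the first IndexWriter before opening a second one.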
Expert: IndexWriter allows an optional {@link IndexDeletionPolicy} implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes all prior commits as soon as a new commit is done. Creating your own policy can allow you to explicitly keep previous “point in time” commits alive in the index for some time, either because this is useful for your application, or to give readers enough time to refresh to the new commit without having the old commit deleted out from under them. The latter is necessary when multiple computers take turns opening their own {@code IndexWriter} and {@code IndexReader}s against a single shared index mounted via remote filesystems like NFS which do not support “delete on last close” semantics. A single computer accessing an index via NFS is fine with the default deletion policy since NFS clients emulate “delete on last close” locally. That said, accessing an index via NFS will likely result in poor performance compared to a local IO device.
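The difference between the default policy and a custom one that keeps older commits around can be shown with a small model. The `Commit` and `Policy` types below are hypothetical stand-ins, loosely modeled on the idea of a callback that receives all live commits ordered oldest first. They are not Lucene's `IndexCommit` or `IndexDeletionPolicy`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of commit deletion policies; not Lucene's implementation. */
final class DeletionPolicyModel {
    /** Hypothetical stand-in for a commit point. */
    static final class Commit {
        final long generation;
        boolean deleted = false;

        Commit(long generation) {
            this.generation = generation;
        }

        void delete() {
            deleted = true;
        }
    }

    /** Hypothetical callback: receives live commits, oldest first. */
    interface Policy {
        void onCommit(List<Commit> commits);
    }

    /** Mirrors the described default: drop everything but the newest commit. */
    static final Policy KEEP_ONLY_LAST = commits -> keepLast(commits, 1);

    /** A custom policy that keeps the newest n commits for slow readers. */
    static Policy keepLastN(int n) {
        return commits -> keepLast(commits, n);
    }

    private static void keepLast(List<Commit> commits, int n) {
        for (int i = 0; i < commits.size() - n; i++) {
            commits.get(i).delete();
        }
    }

    /** Simulates `count` commits and returns the generations still alive. */
    static List<Long> survivors(Policy policy, int count) {
        List<Commit> live = new ArrayList<>();
        for (long gen = 1; gen <= count; gen++) {
            live.add(new Commit(gen));
            policy.onCommit(live);
            live.removeIf(c -> c.deleted);
        }
        List<Long> result = new ArrayList<>();
        for (Commit c : live) {
            result.add(c.generation);
        }
        return result;
    }
}
```

After five simulated commits, `KEEP_ONLY_LAST` leaves only generation 5, while `keepLastN(3)` leaves generations 3, 4 and 5, giving a reader that still holds generation 3 time to refresh.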
Expert: IndexWriter allows you to separately change the {@link MergePolicy} and the {@link MergeScheduler}. The {@link MergePolicy} is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a {@link MergePolicy.MergeSpecification} describing the merges. The default is {@link TieredMergePolicy}. Then, the {@link MergeScheduler} is invoked with the requested merges and it decides when and how to run the merges. The default is {@link ConcurrentMergeScheduler}.
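The division of labour between the two components (one selects merges, the other decides how they run) can be sketched with two small interfaces. Everything here is hypothetical: `MergeModel`, `Policy`, `Scheduler`, the "merge the first N segments" rule and the serial scheduler are made up for illustration and do not reproduce any real Lucene merge policy or scheduler.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the policy/scheduler split; not Lucene's implementation. */
final class MergeModel {
    /** Hypothetical policy: selects groups of segment sizes to merge. */
    interface Policy {
        List<List<Long>> findMerges(List<Long> segmentSizes);
    }

    /** Hypothetical scheduler: decides when and how selected merges run. */
    interface Scheduler {
        void run(List<Runnable> merges);
    }

    /** Selects the first `factor` segments once at least that many exist. */
    static Policy everyN(int factor) {
        return sizes -> {
            List<List<Long>> spec = new ArrayList<>();
            if (sizes.size() >= factor) {
                spec.add(new ArrayList<>(sizes.subList(0, factor)));
            }
            return spec;
        };
    }

    /** Runs the merges one after another on the calling thread. */
    static final Scheduler SERIAL = merges -> merges.forEach(Runnable::run);

    /** Applies one round of policy plus scheduler to the segment list. */
    static List<Long> maybeMerge(List<Long> segments, Policy policy, Scheduler scheduler) {
        List<Long> result = new ArrayList<>(segments);
        List<Runnable> tasks = new ArrayList<>();
        for (List<Long> group : policy.findMerges(segments)) {
            tasks.add(() -> {
                long merged = 0;
                for (Long size : group) {
                    result.remove(size); // removes the first matching segment
                    merged += size;
                }
                result.add(merged);
            });
        }
        scheduler.run(tasks);
        return result;
    }
}
```

Swapping `SERIAL` for a scheduler that hands the tasks to background threads would change when the merges run without touching the policy, which is the point of keeping the two roles separate. The `result` list in this sketch is not thread-safe, so a concurrent scheduler would need its own synchronization.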
NOTE: if you hit a VirtualMachineError, or disaster strikes during a checkpoint, then IndexWriter will close itself. This is a defensive measure in case any internal state (buffered documents, deletions, reference counts) was corrupted. Any subsequent calls will throw an AlreadyClosedException.
NOTE: {@link IndexWriter} instances are completely thread safe, meaning multiple threads can call any of its methods concurrently. If your application requires external synchronization, you should not synchronize on the IndexWriter instance as this may cause deadlock; use your own (non-Lucene) objects instead.
NOTE: If you call Thread.interrupt() on a thread that’s within IndexWriter, IndexWriter will try to catch this (e.g., if it’s in a wait() or Thread.sleep()), and will then throw the unchecked exception {@link ThreadInterruptedException} and clear the interrupt status on the thread.
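The interrupt behaviour can be observed with plain threads. In the sketch below, `UncheckedInterrupt` is a hypothetical stand-in for `ThreadInterruptedException`, and `blockingWriterCall` stands in for any IndexWriter method that happens to be sleeping or waiting. The demo checks that the caller sees an unchecked exception and that the thread's interrupt flag is no longer set.

```java
/** Sketch of interrupt-to-unchecked-exception conversion; not Lucene code. */
final class InterruptSketch {
    /** Hypothetical unchecked wrapper around InterruptedException. */
    static final class UncheckedInterrupt extends RuntimeException {
        UncheckedInterrupt(InterruptedException cause) {
            super(cause);
        }
    }

    /** Blocks for a long time and converts an interrupt into an unchecked exception. */
    static void blockingWriterCall() {
        try {
            Thread.sleep(60_000);
        } catch (InterruptedException ie) {
            // Thread.sleep clears the interrupt flag when it throws.
            throw new UncheckedInterrupt(ie);
        }
    }

    /** Interrupts a worker inside the call; returns {sawException, flagStillSet}. */
    static boolean[] demo() {
        boolean[] observed = new boolean[2];
        Thread worker = new Thread(() -> {
            try {
                blockingWriterCall();
            } catch (UncheckedInterrupt e) {
                observed[0] = true;
                observed[1] = Thread.currentThread().isInterrupted();
            }
        });
        worker.start();
        worker.interrupt();
        try {
            worker.join(); // join makes the worker's writes visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return observed;
    }
}
```

Because the flag is cleared, code that catches the unchecked exception and still wants outer layers to see the interrupt has to call `Thread.currentThread().interrupt()` itself.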