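Lucene: Concepts and Definitions
- Original: http://wiki.apache.org/lucene-java/ConceptsAndDefinitions
- Navigation: Lucene-java Wiki > 1 Overview > 1.1 Informational > 1.1.2 ConceptsAndDefinitions
- Note: "red" text marks passages I did not know, or was unsure, how to translate. "Blue" text is my own commentary.
This page describes some Lucene concepts and definitions.
Definitions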
Analyzer - The Lucene class used to analyze text for indexing. Most applications can use StandardAnalyzer for English and other Latin-based languages.
Payloads - A payload is an array of bytes stored alongside a term position in the index.
Snowball Stemmers - Snowball stemmers are one of the families of stemmers included with Lucene. For more information, see the Snowball website.
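Stemmer - The following explanation is paraphrased from Wikipedia: this kind of algorithm is used to reduce the impact of noise words and synonyms ..., and thereby to reduce the size of the index ...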
Core classes
Document
A Lucene Document is a record in the index. A Document has a list of fields; each field has a name and a textual value.
Term
A Term is Lucene's unit of indexing. In western languages, a Term is often a word.
TermEnum
TermEnum is used to enumerate all terms in the index for a given field, regardless of which documents the terms occur in (or where they occur).
Some query subclasses are implemented by enumerating the terms that match a pattern and building a large OR query from the enumeration, e.g. WildcardQuery, PrefixQuery, RangeQuery.
See the LuceneFAQ entry "How do I retrieve all the values of a particular field that exists within an index, across all documents?", which also includes sample code.
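To make the "enumerate matching terms, then build an OR query" idea concrete, here is a minimal sketch that uses only the Java standard library. It does not call the Lucene API: the class `PrefixEnumerationSketch`, the method `expandPrefix`, and the sample terms are all hypothetical, and a sorted set stands in for the index's term dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy model, not Lucene code: a sorted set plays the role of the term dictionary.
public class PrefixEnumerationSketch {

    /** Returns every term in the sorted dictionary that starts with the given prefix. */
    public static List<String> expandPrefix(SortedSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet seeks to the first term >= prefix, much like positioning a term enumerator.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match either
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary =
                new TreeSet<>(Arrays.asList("apple", "index", "indexer", "indices", "query"));
        // Join the matches the way an OR query over the enumerated terms would read.
        System.out.println(String.join(" OR ", expandPrefix(dictionary, "inde")));
        // prints: index OR indexer
    }
}
```

The sketch also hints at why such queries can become expensive: a short prefix may expand into a very large number of OR clauses.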
TermDocs
Unlike TermEnum (see above), TermDocs is used to identify which documents contain a given Term. TermDocs also gives the frequency of the term within each document.
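The information TermDocs exposes (which documents contain a term, and how often) can be pictured as a postings map. The following is a rough standard-library sketch, not the Lucene implementation: `PostingsSketch`, `buildPostings`, the whitespace tokenization, and the sample documents are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model, not Lucene code: term -> (document number -> frequency).
public class PostingsSketch {

    /** Builds a map from each term to the documents containing it and its frequency there. */
    public static Map<String, Map<Integer, Integer>> buildPostings(String[] documents) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < documents.length; docId++) {
            // Lowercasing plus whitespace splitting stands in for a real Analyzer.
            for (String token : documents[docId].toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] documents = {"lucene indexes text", "text search with lucene", "lucene lucene"};
        // Each entry reads "document number = frequency of the term in that document".
        System.out.println(buildPostings(documents).get("lucene"));
        // prints: {0=1, 1=1, 2=2}
    }
}
```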
TermFreqVector
A TermFreqVector (also known as a Term Frequency Vector, or just Term Vector) is a data structure containing a given Document's term and frequency information. It can be retrieved from the IndexReader only if Term Vectors were stored during indexing.
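Where the postings view goes from a term to its documents, a term vector goes the other way: from one document to its terms and their counts. A minimal standard-library sketch of that per-document structure follows. `TermVectorSketch` and `termFrequencies` are hypothetical names, and the whitespace tokenization is an assumption standing in for an Analyzer.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model, not Lucene code: one document's terms and their frequencies.
public class TermVectorSketch {

    /** Counts how often each term occurs in the text of a single document field. */
    public static Map<String, Integer> termFrequencies(String fieldText) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String token : fieldText.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("to be or not to be"));
        // prints: {be=2, not=1, or=1, to=2}
    }
}
```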
Directory
IndexReader
IndexSearcher