Reading the docValue implementation source in Lucene (3): reading NumericDocValue

程序员文章站 2022-03-30 15:24:38


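Lucene reads docValues in Lucene49DocValuesProducer. Let's look at its constructor; I have also copied the useful parts of the class:

class Lucene49DocValuesProducer extends DocValuesProducer implements Closeable {
	/** For numeric docValues: the key is the field number and the value holds that field's attributes from the meta file, which is read when the producer is opened. */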
	private final Map<Integer, NumericEntry> numerics;
	private final AtomicLong ramBytesUsed;
	private final IndexInput data;
	private final int maxDoc;
	private final int version;

	// memory-resident structures
	private final Map<Integer, MonotonicBlockPackedReader> addressInstances = new HashMap<>();
	private final Map<Integer, MonotonicBlockPackedReader> ordIndexInstances = new HashMap<>();

	/** expert: instantiates a new reader */
	Lucene49DocValuesProducer(SegmentReadState state, String dataCodec, String dataExtension, String metaCodec,
			String metaExtension) throws IOException {
		String metaName = IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, metaExtension);
		// read in the entries from the metadata file.
		ChecksumIndexInput in = state.directory.openChecksumInput(metaName, state.context);
		this.maxDoc = state.segmentInfo.getDocCount();
		boolean success = false;
		try {
			version = CodecUtil.checkHeader(in, metaCodec, Lucene49DocValuesFormat.VERSION_START,
					Lucene49DocValuesFormat.VERSION_CURRENT);
			numerics = new HashMap<>();
			// ... some unrelated code omitted
			readFields(in, state.fieldInfos); // reads the entries for every docValues type; they come from the meta file (the .dvm index file)

			CodecUtil.checkFooter(in);
			success = true;
		} finally {
			if (success) {
				IOUtils.close(in);
			} else {
				IOUtils.closeWhileHandlingException(in);
			}
		}

		String dataName = IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, dataExtension);
		this.data = state.directory.openInput(dataName, state.context); // open the data file that actually stores the values
		success = false;
		// ... omitted

		ramBytesUsed = new AtomicLong(RamUsageEstimator.shallowSizeOfInstance(getClass()));
	}

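When this docValues reader is constructed, it reads the meta file, wraps what it finds in XxxEntry objects, and keeps them in memory. The field private final Map<Integer, NumericEntry> numerics; holds the meta description of each numeric field, keyed by field number. The meta file is read in readFields, which handles every docValues type; here we only look at the numeric one: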

static NumericEntry readNumericEntry(IndexInput meta) throws IOException {
		
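	// Information about the current docValues field, read from the meta file.
	NumericEntry entry = new NumericEntry();
	entry.format = meta.readVInt(); // storage format, e.g. GCD-compressed, delta-compressed or table-compressed
	entry.missingOffset = meta.readLong(); // file pointer (offset) in the data file of the missing-docs bitset
	entry.offset = meta.readLong(); // offset where the values actually start (i.e. past the missing-docs bitset above)
	entry.count = meta.readVLong(); // total number of docs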
	switch (entry.format) {
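	case GCD_COMPRESSED: // based on the greatest common divisor
		entry.minValue = meta.readLong(); // minimum value
		entry.gcd = meta.readLong(); // greatest common divisor
		entry.bitsPerValue = meta.readVInt(); // bits used per stored number, needed for decoding
		break;
	case TABLE_COMPRESSED: // lookup table, used when the field has few distinct values
		final int uniqueValues = meta.readVInt(); // number of distinct values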
		if (uniqueValues > 256) {
			throw new CorruptIndexException(
					"TABLE_COMPRESSED cannot have more than 256 distinct values, input=" + meta);
		}
		entry.table = new long[uniqueValues];
		for (int i = 0; i < uniqueValues; ++i) { // read every distinct value; they were all written to the meta file
			entry.table[i] = meta.readLong();
		}
		entry.bitsPerValue = meta.readVInt(); // needed for decoding
		break;
	case DELTA_COMPRESSED:
		entry.minValue = meta.readLong(); // minimum value
		entry.bitsPerValue = meta.readVInt(); // bits used per value in the data file, needed for decoding
		break;
	case MONOTONIC_COMPRESSED: // never occurs for numeric fields: Lucene49DocValuesConsumer only writes the three formats above for them
		entry.packedIntsVersion = meta.readVInt();
		entry.blockSize = meta.readVInt();
		break;
	default:
		throw new CorruptIndexException("Unknown format: " + entry.format + ", input=" + meta);
	}
	entry.endOffset = meta.readLong(); // end offset of the values in the data file
	return entry;
}
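
To make the field order above concrete, here is a minimal, self-contained sketch that writes and re-reads one DELTA_COMPRESSED entry in the same order as readNumericEntry. It is not Lucene code: MetaEntrySketch, putVLong and getVLong are names made up for this illustration, a plain ByteBuffer stands in for the meta IndexInput, and the constant value 0 for DELTA_COMPRESSED is an assumption.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class MetaEntrySketch {
    static final int DELTA_COMPRESSED = 0; // assumed id, only for this sketch

    static final class Entry {
        int format;
        long missingOffset, offset, count, minValue, endOffset;
        int bitsPerValue;
    }

    // Variable-length encoding: 7 payload bits per byte, high bit set means "more bytes follow".
    static void putVLong(ByteBuffer out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.put((byte) ((v & 0x7FL) | 0x80L));
            v >>>= 7;
        }
        out.put((byte) v);
    }

    static long getVLong(ByteBuffer in) {
        long v = 0;
        for (int shift = 0; ; shift += 7) {
            byte b = in.get();
            v |= (b & 0x7FL) << shift;
            if (b >= 0) {
                return v; // high bit clear: this was the last byte
            }
        }
    }

    // Writes the fields in the same order in which readNumericEntry reads them.
    static byte[] write(Entry e) {
        ByteBuffer out = ByteBuffer.allocate(64);
        putVLong(out, e.format);
        out.putLong(e.missingOffset);
        out.putLong(e.offset);
        putVLong(out, e.count);
        out.putLong(e.minValue);      // DELTA_COMPRESSED-specific part
        putVLong(out, e.bitsPerValue);
        out.putLong(e.endOffset);
        return Arrays.copyOf(out.array(), out.position());
    }

    static Entry read(byte[] meta) {
        ByteBuffer in = ByteBuffer.wrap(meta);
        Entry e = new Entry();
        e.format = (int) getVLong(in);
        e.missingOffset = in.getLong();
        e.offset = in.getLong();
        e.count = getVLong(in);
        if (e.format != DELTA_COMPRESSED) {
            throw new IllegalArgumentException("this sketch only handles DELTA_COMPRESSED");
        }
        e.minValue = in.getLong();
        e.bitsPerValue = (int) getVLong(in);
        e.endOffset = in.getLong();
        return e;
    }
}
```

The point is only that the entry is a small fixed sequence of fields: reading it costs a few dozen bytes per field and touches none of the actual values.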

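As you can see, this reads every attribute that is needed from the meta file into memory, but it has not yet read any numeric docValues. The actual reading happens in this method: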

@Override
public NumericDocValues getNumeric(FieldInfo field) throws IOException {
	NumericEntry entry = numerics.get(field.number);
	return getNumeric(entry);
}

Here numerics holds the meta entries read earlier for every field, and the lookup is by field number, so the more important method is getNumeric(NumericEntry):

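// Several numeric types can be indexed, but they are all stored as long, and they are read back as long too.
LongValues getNumeric(NumericEntry entry) throws IOException {
	
	// This is the key call: it exposes the given range of the data file. Whether that range is read into memory, I do not know; help is welcome!
	RandomAccessInput slice = this.data.randomAccessSlice(entry.offset, entry.endOffset - entry.offset);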
	switch (entry.format) {
	
	case DELTA_COMPRESSED: // delta-encoded
		final long delta = entry.minValue;
		final LongValues values = DirectReader.getInstance(slice, entry.bitsPerValue);
		return new LongValues() {
			@Override
			public long get(long id) {
				return delta + values.get(id); // the stored delta plus the minimum value
			}
			}
		};
		
	case GCD_COMPRESSED: // GCD-based, much like the delta case
		final long min = entry.minValue;
		final long mult = entry.gcd;
		final LongValues quotientReader = DirectReader.getInstance(slice, entry.bitsPerValue);
		return new LongValues() {
			@Override
			public long get(long id) {
				return min + mult * quotientReader.get(id);
			}
		};
		
	case TABLE_COMPRESSED: // lookup table: the stored ordinal indexes into the table
		final long table[] = entry.table;
		final LongValues ords = DirectReader.getInstance(slice, entry.bitsPerValue);
		return new LongValues() {
			@Override
			public long get(long id) {
				return table[(int) ords.get(id)];
			}
		};
	default:
		throw new AssertionError();
	}
}
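
The three decoding rules can be tried out in isolation. The following is a small sketch, not Lucene code: NumericDecodeSketch is a made-up name, and a plain long[] stands in for the packed values that DirectReader would return from the data file.

```java
import java.util.function.LongUnaryOperator;

final class NumericDecodeSketch {
    // DELTA_COMPRESSED: the data file stores (value - minValue).
    static LongUnaryOperator delta(long[] stored, long minValue) {
        return id -> minValue + stored[(int) id];
    }

    // GCD_COMPRESSED: the data file stores (value - minValue) / gcd.
    static LongUnaryOperator gcd(long[] stored, long minValue, long gcd) {
        return id -> minValue + gcd * stored[(int) id];
    }

    // TABLE_COMPRESSED: the data file stores an ordinal into the table of distinct values.
    static LongUnaryOperator table(long[] ords, long[] table) {
        return id -> table[(int) ords[(int) id]];
    }

    public static void main(String[] args) {
        // Original values 1000, 1003, 1001 with minValue 1000.
        System.out.println(delta(new long[] {0, 3, 1}, 1000L).applyAsLong(1));
        // Original values 100, 400, 250 with minValue 100 and gcd 150.
        System.out.println(gcd(new long[] {0, 2, 1}, 100L, 150L).applyAsLong(1));
        // Original values 7, 42, 7, 7 with table {7, 42}.
        System.out.println(table(new long[] {0, 1, 0, 0}, new long[] {7, 42}).applyAsLong(1));
    }
}
```

In each case the stored number is smaller than the original value, which is why bitsPerValue can be small.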

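Once you understand the write path, the read code is much simpler. One important point remains, though: the randomAccessSlice call above. If an expert reads this, I would appreciate an answer: does it read the data into memory, does it rely on the operating system to load it into memory, or does it not operate in memory at all?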
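
A tentative answer, as far as I understand it: randomAccessSlice does not copy the range into the heap. It returns a view over the underlying IndexInput, so how the bytes reach memory depends on the Directory implementation. With a memory-mapped directory the operating system pages the file in on demand; with a buffered directory each access goes through a small read buffer. The sketch below illustrates the "view, not copy" idea with plain java.nio. It is not Lucene code: SliceViewSketch, writeTempLongs and mapSlice are names made up for this illustration, and the temp file stands in for the data file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SliceViewSketch {
    // Writes the given longs (big-endian) to a temp file that stands in for the data file.
    static Path writeTempLongs(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        try {
            Path file = Files.createTempFile("dv-sketch", ".dat");
            file.toFile().deleteOnExit();
            Files.write(file, buf.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps only [offset, offset + length) of the file. Nothing is copied into the heap here;
    // the operating system loads pages lazily when the buffer is actually read.
    static ByteBuffer mapSlice(Path file, long offset, long length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTempLongs(10L, 20L, 30L, 40L);
        ByteBuffer slice = mapSlice(file, 8, 16); // covers the 2nd and 3rd long
        System.out.println(slice.getLong(0));     // absolute read, relative to the slice start
        System.out.println(slice.getLong(8));
    }
}
```

Positions passed to the slice are relative to its start, in the same way that entry.offset is subtracted out once the slice has been created.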

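Another thing to note is that the read path does not consult the bitset that records the ids of the docs that have a value. So if you read 0, you still need to check whether the value really exists, that is, whether the doc is set in that bitset.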
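
A minimal sketch of that check, assuming the bitset has already been loaded: MissingAwareSketch and valueOf are made-up names, java.util.BitSet stands in for the structure stored at missingOffset, and a long[] stands in for the decoded values. In Lucene 4.x I believe the producer exposes this information through getDocsWithField, but treat that as something to verify against the source.

```java
import java.util.BitSet;
import java.util.OptionalLong;

final class MissingAwareSketch {
    // Returns the value only if the doc is marked as having one; a stored 0 alone proves nothing.
    static OptionalLong valueOf(long[] values, BitSet docsWithValue, int docId) {
        return docsWithValue.get(docId)
                ? OptionalLong.of(values[docId])
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        long[] values = {5L, 0L, 0L};       // doc 1 really stores 0, doc 2 has no value
        BitSet docsWithValue = new BitSet();
        docsWithValue.set(0);
        docsWithValue.set(1);
        System.out.println(valueOf(values, docsWithValue, 1).isPresent());
        System.out.println(valueOf(values, docsWithValue, 2).isPresent());
    }
}
```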
