
HBase Filter Comparators: Principles and Source Code Study

程序员文章站 2022-06-24 22:46:21
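Preface: The previous article, "HBase Filter Overview", briefly introduced the building blocks of HBase filters and their class hierarchy. This article supplements it with a closer look at the comparators used by those filters, a basic but essential skill for anyone learning HBase Filter. The source code quoted here is from HBase 1.1.2.2.6.5.0-292 (the HDP distribution).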

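All comparator implementations in HBase extend the parent class ByteArrayComparable, which in turn implements the Comparable interface. Comparators with different purposes differ only in how they override the compareTo() method.

The seven comparators that HBase Filter provides by default are introduced one by one below.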

1. BinaryComparator

Introduction: A binary comparator that compares the given byte array lexicographically.

Let's start with a small example:

import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class BinaryComparatorDemo {

    public static void main(String[] args) {

        BinaryComparator bc = new BinaryComparator(Bytes.toBytes("bbb"));

        int code1 = bc.compareTo(Bytes.toBytes("bbb"), 0, 3);
        System.out.println(code1); // 0
        int code2 = bc.compareTo(Bytes.toBytes("aaa"), 0, 3);
        System.out.println(code2); // 1
        int code3 = bc.compareTo(Bytes.toBytes("ccc"), 0, 3);
        System.out.println(code3); // -1
        int code4 = bc.compareTo(Bytes.toBytes("bbf"), 0, 3);
        System.out.println(code4); // -4
        int code5 = bc.compareTo(Bytes.toBytes("bbbedf"), 0, 6);
        System.out.println(code5); // -3
    }
}

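From the output, the comparison rules are easy to see:

  • If the first characters of the two strings differ, the method returns the difference between their ASCII codes.
  • If the first characters are the same, the following characters are compared in turn until a differing pair is found, and the difference between the ASCII codes of that pair is returned.
  • If the strings have different lengths and every comparable character is identical, the method returns the difference between the two lengths.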
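The three rules can be reproduced without HBase on the classpath. The following is only a minimal sketch in plain Java: the class name LexCompareSketch and its helper methods are made up for illustration and are not part of HBase. It also shows that bytes are treated as unsigned values, which matters once non-ASCII data is involved.

```java
import java.nio.charset.StandardCharsets;

public class LexCompareSketch {

    // Illustrative helper: compares two byte arrays left to right as unsigned bytes.
    static int compare(byte[] left, byte[] right) {
        int n = Math.min(left.length, right.length);
        for (int i = 0; i < n; i++) {
            int a = left[i] & 0xFF;   // mask so that 0x80..0xFF are not negative
            int b = right[i] & 0xFF;
            if (a != b) {
                return a - b;         // rules 1 and 2: the first differing byte decides
            }
        }
        return left.length - right.length; // rule 3: equal prefix, lengths decide
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] own = utf8("bbb");
        System.out.println(compare(own, utf8("bbb")));    // 0
        System.out.println(compare(own, utf8("aaa")));    // 1
        System.out.println(compare(own, utf8("bbf")));    // -4
        System.out.println(compare(own, utf8("bbbedf"))); // -3
        // Unsigned view: 0x80 (128) sorts after 0x7F (127).
        System.out.println(compare(new byte[]{(byte) 0x80}, new byte[]{0x7F})); // 1
    }
}
```

The first four results match the demo output above; the last line is an extra case added here to illustrate the unsigned comparison.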

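Here is how the source of compareTo() implements these rules. Note that Implementation 1 compares several bytes at a time and returns just -1 or 1 when the first difference lies inside such a multi-byte word, so in general only the sign of the result should be relied on.
Implementation 1: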

static enum UnsafeComparer implements Bytes.Comparer<byte[]> {
	INSTANCE;
	....
	public int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
		if (buffer1 == buffer2 && offset1 == offset2 && length1 == length2) {
			return 0;
		} else {
			int minLength = Math.min(length1, length2);
			int minWords = minLength / 8;
			long offset1Adj = (long)(offset1 + BYTE_ARRAY_BASE_OFFSET);
			long offset2Adj = (long)(offset2 + BYTE_ARRAY_BASE_OFFSET);
			int j = minWords << 3;

			// Compare 8 bytes at a time.
			int offset;
			for(offset = 0; offset < j; offset += 8) {
				long lw = theUnsafe.getLong(buffer1, offset1Adj + (long)offset);
				long rw = theUnsafe.getLong(buffer2, offset2Adj + (long)offset);
				long diff = lw ^ rw;
				if (diff != 0L) {
					return lessThanUnsignedLong(lw, rw) ? -1 : 1;
				}
			}

			// Compare the next 4 bytes, if at least 4 remain.
			offset = j;
			int b;
			int a;
			if (minLength - j >= 4) {
				a = theUnsafe.getInt(buffer1, offset1Adj + (long)j);
				b = theUnsafe.getInt(buffer2, offset2Adj + (long)j);
				if (a != b) {
					return lessThanUnsignedInt(a, b) ? -1 : 1;
				}

				offset = j + 4;
			}

			// Compare the next 2 bytes, if at least 2 remain.
			if (minLength - offset >= 2) {
				short sl = theUnsafe.getShort(buffer1, offset1Adj + (long)offset);
				short sr = theUnsafe.getShort(buffer2, offset2Adj + (long)offset);
				if (sl != sr) {
					return lessThanUnsignedShort(sl, sr) ? -1 : 1;
				}

				offset += 2;
			}

			// Compare the last remaining byte as an unsigned value.
			if (minLength - offset == 1) {
				a = buffer1[offset1 + offset] & 255;
				b = buffer2[offset2 + offset] & 255;
				if (a != b) {
					return a - b;
				}
			}

			return length1 - length2;
		}
	}
}
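The idea behind Implementation 1 is to consume whole 8-byte words and compare them as unsigned numbers. The sketch below imitates that idea with java.nio.ByteBuffer instead of Unsafe; it is an assumption-laden illustration rather than HBase code. ByteBuffer reads big-endian by default, so an unsigned comparison of two words agrees with byte-wise lexicographic order. The class name WordCompareSketch is hypothetical.

```java
import java.nio.ByteBuffer;

public class WordCompareSketch {

    // Lexicographic comparison that consumes 8 bytes per step where possible.
    static int compare(byte[] left, byte[] right) {
        int minLength = Math.min(left.length, right.length);
        int wordBytes = (minLength / 8) * 8;  // bytes covered by whole 8-byte words
        ByteBuffer l = ByteBuffer.wrap(left); // big-endian by default
        ByteBuffer r = ByteBuffer.wrap(right);
        for (int i = 0; i < wordBytes; i += 8) {
            long lw = l.getLong(i);
            long rw = r.getLong(i);
            if (lw != rw) {
                // A signed comparison would misorder words whose first byte is >= 0x80.
                return Long.compareUnsigned(lw, rw) < 0 ? -1 : 1;
            }
        }
        for (int i = wordBytes; i < minLength; i++) { // remaining tail, byte by byte
            int a = left[i] & 0xFF;
            int b = right[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return left.length - right.length;
    }

    public static void main(String[] args) {
        byte[] small = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        byte[] large = {(byte) 0x80, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(compare(small, large)); // -1
        System.out.println(compare(large, small)); // 1
        System.out.println(compare(small, small)); // 0
    }
}
```

As in the quoted source, a difference found inside a word yields only -1 or 1, while a difference in the trailing bytes yields the byte difference.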

Implementation 2:

static enum PureJavaComparer implements Bytes.Comparer<byte[]> {
	INSTANCE;

	private PureJavaComparer() {
	}

	public int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
		if (buffer1 == buffer2 && offset1 == offset2 && length1 == length2) {
			return 0;
		} else {
			int end1 = offset1 + length1;
			int end2 = offset2 + length2;
			int i = offset1;

			for(int j = offset2; i < end1 && j < end2; ++j) {
				int a = buffer1[i] & 255;
				int b = buffer2[j] & 255;
				if (a != b) {
					return a - b;
				}

				++i;
			}

			return length1 - length2;
		}
	}
}

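Implementation 1 is an optimization of Implementation 2, and both come from the Bytes class. HBase tries Implementation 1 first and falls back to Implementation 2 if loading it throws, as the following code shows: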

public static int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
	return Bytes.LexicographicalComparerHolder.BEST_COMPARER.compareTo(buffer1, offset1, length1, buffer2, offset2, length2);
}
...
...

static final String UNSAFE_COMPARER_NAME = Bytes.LexicographicalComparerHolder.class.getName() + "$UnsafeComparer";
static final Bytes.Comparer<byte[]> BEST_COMPARER = getBestComparer();
static Bytes.Comparer<byte[]> getBestComparer() {
	try {
		Class<?> theClass = Class.forName(UNSAFE_COMPARER_NAME);
		Bytes.Comparer<byte[]> comparer = (Bytes.Comparer)theClass.getEnumConstants()[0];
		return comparer;
	} catch (Throwable var2) {
		return Bytes.lexicographicalComparerJavaImpl();
	}
}
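The selection logic above is a general pattern: load the preferred implementation by name through reflection, and return a portable one if anything goes wrong. The sketch below shows that pattern in isolation. All names in it (BestComparerSketch, FastComparer, PlainComparer, load) are hypothetical stand-ins, and java.util.Comparator is used in place of HBase's Bytes.Comparer.

```java
import java.util.Comparator;

public class BestComparerSketch {

    // Stand-in for the portable fallback implementation.
    enum PlainComparer implements Comparator<byte[]> {
        INSTANCE;

        @Override
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (diff != 0) {
                    return diff;
                }
            }
            return a.length - b.length;
        }
    }

    // Stand-in for the optimized implementation; here it simply delegates.
    enum FastComparer implements Comparator<byte[]> {
        INSTANCE;

        @Override
        public int compare(byte[] a, byte[] b) {
            return PlainComparer.INSTANCE.compare(a, b);
        }
    }

    // Loads the named single-constant enum, or falls back if that fails for any reason.
    @SuppressWarnings("unchecked")
    static Comparator<byte[]> load(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            return (Comparator<byte[]>) clazz.getEnumConstants()[0];
        } catch (Throwable t) {
            return PlainComparer.INSTANCE;
        }
    }

    public static void main(String[] args) {
        String fastName = BestComparerSketch.class.getName() + "$FastComparer";
        System.out.println(load(fastName).getClass().getSimpleName());           // FastComparer
        System.out.println(load("no.such.Comparer").getClass().getSimpleName()); // PlainComparer
    }
}
```

Catching Throwable rather than Exception means that errors raised while the class is being loaded or initialized also trigger the fallback, which is presumably why the quoted HBase code does the same.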

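2. BinaryPrefixComparator

Introduction: A binary comparator that only checks whether the prefix of a value matches the given byte array.

Let's start with a small example: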

import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class BinaryPrefixComparatorDemo {

    public static void main(String[] args) {

        BinaryPrefixComparator bc = new BinaryPrefixComparator(Bytes.toBytes("b"));

        int code1 = bc.compareTo(Bytes.toBytes("bbb"), 0, 3);
        System.out.println(code1); // 0
        int code2 = bc.compareTo(Bytes.toBytes("aaa"), 0, 3);
        System.out.println(code2); // 1
        int code3 = bc.compareTo(Bytes.toBytes("ccc"), 0, 3);
        System.out.println(code3); // -1
        int code4 = bc.compareTo(Bytes.toBytes("bbf"), 0, 3);
        System.out.println(code4); // 0
        int code5 = bc.compareTo(Bytes.toBytes("bbbedf"), 0, 6);
        System.out.println(code5); // 0
        int code6 = bc.compareTo(Bytes.toBytes("ebbedf"), 0, 6);
        System.out.println(code6); // -3
    }
}

This comparator is just a slight modification of BinaryComparator, as the following code makes clear:

public int compareTo(byte[] value, int offset, int length) {
	return Bytes.compareTo(this.value, 0, this.value.length, value, offset, this.value.length <= length ? this.value.length : length);
}

For comparison, here is the same method in BinaryComparator:

public int compareTo(byte[] value, int offset, int length) {
	return Bytes.compareTo(this.value, 0, this.value.length, value, offset, length);
}

The only difference is the last argument, which becomes min(this.value.length, length), the smaller of the two. The byte-by-byte comparison that follows therefore covers at most that many bytes.
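To make the effect of that clamped length concrete, here is a small plain-Java sketch of a prefix comparison. It is a re-implementation for illustration only; PrefixCompareSketch and comparePrefix are hypothetical names, and the method always starts at offset 0 of the value for brevity.

```java
import java.nio.charset.StandardCharsets;

public class PrefixCompareSketch {

    // Compares 'prefix' against at most prefix.length leading bytes of 'value'.
    static int comparePrefix(byte[] prefix, byte[] value) {
        int limit = Math.min(prefix.length, value.length);
        for (int i = 0; i < limit; i++) {
            int diff = (prefix[i] & 0xFF) - (value[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // Non-zero only when the value is shorter than the prefix.
        return prefix.length - limit;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] prefix = utf8("b");
        System.out.println(comparePrefix(prefix, utf8("bbbedf")));   // 0
        System.out.println(comparePrefix(prefix, utf8("aaa")));      // 1
        System.out.println(comparePrefix(prefix, utf8("ebbedf")));   // -3
        System.out.println(comparePrefix(utf8("bbb"), utf8("bb")));  // 1
    }
}
```

The last call shows the one case where length still matters: a value shorter than the prefix can never match it.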

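3. BitComparator

Introduction: A bit comparator that compares values using the AND, OR, and XOR operations provided by BitwiseOp. The result is always either 1 or 0, and only EQUAL and NOT_EQUAL are supported.

Let's start with a small example: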

import org.apache.hadoop.hbase.filter.BitComparator;

public class BitComparatorDemo {

    public static void main(String[] args) {

        // Same length, bitwise OR: bytes are compared from the last to the first; returns 1 if every OR result is 0, otherwise 0.
        BitComparator bc1 = new BitComparator(new byte[]{0,0,0,0}, BitComparator.BitwiseOp.OR);
        int i = bc1.compareTo(new byte[]{0,0,0,0}, 0, 4);
        System.out.println(i); // 1
        // Same length, bitwise AND: bytes are compared from the last to the first; returns 1 if every AND result is 0, otherwise 0.
        BitComparator bc2 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.AND);
        int j = bc2.compareTo(new byte[]{0,1,0,1}, 0, 4);
        System.out.println(j); // 1
        // Same length, bitwise XOR: bytes are compared from the last to the first; returns 1 if every XOR result is 0, otherwise 0.
        BitComparator bc3 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.XOR);
        int x = bc3.compareTo(new byte[]{1,0,1,0}, 0, 4);
        System.out.println(x); // 1
        // Different lengths: returns 1 immediately; otherwise the bytes are compared.
        BitComparator bc4 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.XOR);
        int y = bc4.compareTo(new byte[]{1,0,1}, 0, 3);
        System.out.println(y); // 1
    }
}

The rules described in the comments above correspond to the following code:

public int compareTo(byte[] value, int offset, int length) {
	if (length != this.value.length) {
		return 1;
	} else {
		int b = 0;

		for(int i = length - 1; i >= 0 && b == 0; --i) {
			switch(this.bitOperator) {
			case AND:
				b = this.value[i] & value[i + offset] & 255;
				break;
			case OR:
				b = (this.value[i] | value[i + offset]) & 255;
				break;
			case XOR:
				b = (this.value[i] ^ value[i + offset]) & 255;
			}
		}

		return b == 0 ? 1 : 0;
	}
}

The core idea: compare byte by byte, starting from the last byte, and leave the loop as soon as b != 0.
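Because a result of 0 means "some byte produced a non-zero result", the AND operator can be read as a bit-mask test. The following sketch re-implements the loop in plain Java to show that reading. BitCompareSketch and its Op enum are hypothetical names, and the offset parameter is dropped for brevity.

```java
public class BitCompareSketch {

    enum Op { AND, OR, XOR }

    // Returns 0 if any byte pair yields a non-zero result under 'op', otherwise 1.
    static int compare(byte[] own, byte[] value, Op op) {
        if (own.length != value.length) {
            return 1; // different lengths never match
        }
        int b = 0;
        for (int i = own.length - 1; i >= 0 && b == 0; i--) {
            switch (op) {
                case AND:
                    b = own[i] & value[i] & 0xFF;
                    break;
                case OR:
                    b = (own[i] | value[i]) & 0xFF;
                    break;
                case XOR:
                    b = (own[i] ^ value[i]) & 0xFF;
                    break;
            }
        }
        return b == 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        byte[] mask = {0b0000_0100}; // tests the third-lowest bit
        System.out.println(compare(mask, new byte[]{0b0000_0101}, Op.AND)); // 0: the bit is set
        System.out.println(compare(mask, new byte[]{0b0000_0001}, Op.AND)); // 1: the bit is clear
        System.out.println(compare(mask, new byte[]{0, 4}, Op.AND));        // 1: lengths differ
    }
}
```

Under this reading, pairing an AND comparator with EQUAL would select values in which at least one masked bit is set, and NOT_EQUAL would select the rest. Note also that with XOR a return value of 0 means the arrays differ in some byte, which is easy to misread.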

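4. LongComparator

Introduction: A comparator dedicated to long values; it returns 0, -1, or 1. The previous overview article did not mention it, so it is added here.

Let's start with a small example: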

import org.apache.hadoop.hbase.filter.LongComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class LongComparatorDemo {

    public static void main(String[] args) {
        LongComparator longComparator = new LongComparator(1000L);
        int i = longComparator.compareTo(Bytes.toBytes(1000L), 0, 8);
        System.out.println(i); // 0
        int i2 = longComparator.compareTo(Bytes.toBytes(1001L), 0, 8);
        System.out.println(i2); // -1
        int i3 = longComparator.compareTo(Bytes.toBytes(998L), 0, 8);
        System.out.println(i3); // 1
    }
}

The implementation is very simple: it decodes the bytes into a long and compares the two numbers.

public int compareTo(byte[] value, int offset, int length) {
	long that = Bytes.toLong(value, offset, length);
	return this.longValue.compareTo(that);
}
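Decoding before comparing matters for negative numbers. The sketch below assumes the usual 8-byte big-endian two's-complement encoding (which is what ByteBuffer produces, and what Bytes.toBytes(long) is understood to produce) and contrasts a numeric comparison with a raw byte-wise one. LongCompareSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

public class LongCompareSketch {

    // Encodes a long as 8 big-endian bytes.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Numeric comparison: decode 8 bytes at 'offset', then compare as longs.
    static int compareNumeric(long own, byte[] value, int offset) {
        long that = ByteBuffer.wrap(value, offset, Long.BYTES).getLong();
        return Long.compare(own, that);
    }

    // Byte-wise unsigned comparison of the encoded forms, reduced to its sign.
    static int compareBytes(byte[] left, byte[] right) {
        for (int i = 0; i < Math.min(left.length, right.length); i++) {
            int diff = (left[i] & 0xFF) - (right[i] & 0xFF);
            if (diff != 0) {
                return Integer.signum(diff);
            }
        }
        return Integer.signum(left.length - right.length);
    }

    public static void main(String[] args) {
        System.out.println(compareNumeric(1000L, encode(1001L), 0)); // -1
        System.out.println(compareNumeric(1000L, encode(-1L), 0));   // 1: 1000 > -1
        System.out.println(compareBytes(encode(1000L), encode(-1L))); // -1: 0x00... sorts before 0xFF...
    }
}
```

The last two lines disagree, which suggests why a dedicated long comparator is useful when stored values may be negative.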

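5. NullComparator

Introduction: A null comparator that checks whether the current value is null. It returns 0 if the value is null and 1 otherwise; only EQUAL and NOT_EQUAL are supported.

Let's start with a small example: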

import org.apache.hadoop.hbase.filter.NullComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class NullComparatorDemo {

    public static void main(String[] args) {
        NullComparator nc = new NullComparator();
        int i1 = nc.compareTo(Bytes.toBytes("abc"));
        int i2 = nc.compareTo(Bytes.toBytes(""));
        int i3 = nc.compareTo(null);
        System.out.println(i1); // 1
        System.out.println(i2); // 1
        System.out.println(i3); // 0
    }
}

The implementation is a single null check. Note that an empty byte array is not null, so it returns 1:

public int compareTo(byte[] value) {
	return value != null ? 1 : 0;
}

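6. RegexStringComparator

Introduction: A regular-expression comparator that matches values against a regular expression. Only EQUAL and NOT_EQUAL are supported. It returns 0 when the match succeeds and 1 when it fails.

Let's start with a small example: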

import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegexStringComparatorDemo {

    public static void main(String[] args) {
        RegexStringComparator rsc = new RegexStringComparator("abc");
        int abc = rsc.compareTo(Bytes.toBytes("abcd"), 0, 3);
        System.out.println(abc); // 0
        int bcd = rsc.compareTo(Bytes.toBytes("bcd"), 0, 3);
        System.out.println(bcd); // 1

        String check = "^([a-z0-9A-Z]+[-|\\.]?)+[a-z0-9A-Z]@([a-z0-9A-Z]+(-[a-z0-9A-Z]+)?\\.)+[a-zA-Z]{2,}$";
        RegexStringComparator rsc2 = new RegexStringComparator(check);
        int code = rsc2.compareTo(Bytes.toBytes("zpb@163.com"), 0, "zpb@163.com".length());
        System.out.println(code); // 0
        int code2 = rsc2.compareTo(Bytes.toBytes("zpb#163.com"), 0, "zpb#163.com".length());
        System.out.println(code2); // 1
    }
}

Its compareTo() method has two engine implementations, each with its own regex matching rules: a Java version and a Joni version (aimed at JRuby). The default is RegexStringComparator.EngineType.JAVA:

public int compareTo(byte[] value, int offset, int length) {
	return this.engine.compareTo(value, offset, length);
}

public static enum EngineType {
	JAVA,
	JONI;

	private EngineType() {
	}
}

Both implementations are simple and just delegate to the regex matcher. This is the JAVA EngineType implementation:

public int compareTo(byte[] value, int offset, int length) {
	String tmp;
	if (length < value.length / 2) {
		tmp = new String(Arrays.copyOfRange(value, offset, offset + length), this.charset);
	} else {
		tmp = new String(value, offset, length, this.charset);
	}

	return this.pattern.matcher(tmp).find() ? 0 : 1;
}

And this is the JONI EngineType implementation:

public int compareTo(byte[] value, int offset, int length) {
	Matcher m = this.pattern.matcher(value);
	return m.search(offset, length, this.pattern.getOptions()) < 0 ? 1 : 0;
}

Both are easy to follow, so no further explanation is needed.
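One detail of the Java engine is worth trying out: it calls find(), so a pattern without anchors matches anywhere inside the selected byte range, not only the whole value. The sketch below reproduces just that call with java.util.regex. RegexCompareSketch is a hypothetical name, and UTF-8 is assumed for the charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class RegexCompareSketch {

    // Decodes value[offset, offset + length) and returns 0 if the pattern is found anywhere in it.
    static int compare(Pattern pattern, byte[] value, int offset, int length) {
        String text = new String(value, offset, length, StandardCharsets.UTF_8);
        return pattern.matcher(text).find() ? 0 : 1;
    }

    public static void main(String[] args) {
        byte[] value = "xxabcxx".getBytes(StandardCharsets.UTF_8);

        // Unanchored: found in the middle of the value.
        System.out.println(compare(Pattern.compile("abc"), value, 0, value.length));   // 0
        // Anchored: the whole value must be exactly "abc".
        System.out.println(compare(Pattern.compile("^abc$"), value, 0, value.length)); // 1
        // Only the first four bytes ("xxab") are examined.
        System.out.println(compare(Pattern.compile("abc"), value, 0, 4));              // 1
    }
}
```

This is why the e-mail pattern in the demo is wrapped in ^ and $: without the anchors, any value that merely contains an address-like substring would match.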

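7. SubstringComparator

Introduction: Checks whether the given substring occurs in value, ignoring case. It returns 0 if the substring is present and 1 if it is not; only EQUAL and NOT_EQUAL are supported.

Let's start with a small example: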

import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class SubstringComparatorDemo {

    public static void main(String[] args) {
        String value = "aslfjllkabcxxljsl";
        SubstringComparator sc = new SubstringComparator("abc");
        int i = sc.compareTo(Bytes.toBytes(value), 0, value.length());
        System.out.println(i); // 0

        SubstringComparator sc2 = new SubstringComparator("abd");
        int i2 = sc2.compareTo(Bytes.toBytes(value), 0, value.length());
        System.out.println(i2); // 1

        SubstringComparator sc3 = new SubstringComparator("ABC");
        int i3 = sc3.compareTo(Bytes.toBytes(value), 0, value.length());
        System.out.println(i3); // 0
    }
}

This implementation is also very simple: the value is converted to lower case and searched for the substring.

public int compareTo(byte[] value, int offset, int length) {
	return Bytes.toString(value, offset, length).toLowerCase().contains(this.substr) ? 0 : 1;
}
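The method above lower-cases only the value, so the comparison is case-insensitive only if the stored substring is lower case as well. The sketch below assumes the constructor takes care of that (an assumption made here, since the constructor is not quoted in this article). SubstringCompareSketch is a hypothetical name, and Locale.ROOT is used to keep the result independent of the default locale.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class SubstringCompareSketch {

    private final String substr;

    // Assumption: the search string is normalized to lower case once, up front.
    SubstringCompareSketch(String substr) {
        this.substr = substr.toLowerCase(Locale.ROOT);
    }

    // Returns 0 if the lower-cased value range contains the substring, otherwise 1.
    int compareTo(byte[] value, int offset, int length) {
        String text = new String(value, offset, length, StandardCharsets.UTF_8);
        return text.toLowerCase(Locale.ROOT).contains(substr) ? 0 : 1;
    }

    public static void main(String[] args) {
        byte[] value = "aslfjllkABCxxljsl".getBytes(StandardCharsets.UTF_8);
        System.out.println(new SubstringCompareSketch("abc").compareTo(value, 0, value.length)); // 0
        System.out.println(new SubstringCompareSketch("AbC").compareTo(value, 0, value.length)); // 0
        System.out.println(new SubstringCompareSketch("abd").compareTo(value, 0, value.length)); // 1
    }
}
```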

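That covers all seven comparators. Even if you are not interested in the source code, do read the small examples to get familiar with each comparator's constructor and output, because you will use them often when working with HBase filters. Besides these seven comparators, you can of course also write your own.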

Please credit the source when reposting. You are welcome to follow the author's WeChat official account 【hbase工作笔记】.