How BitSet Is Implemented

程序员文章站 2024-03-15 20:33:30

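1. Introduction to BitSet

BitSet is a data structure in Java. It stores binary bits and operates on them with bitwise operations; each bit holds only 0 or 1 and is mainly used to mark data.

The basic idea is to use one bit to record whether a value has appeared: 0 means it has not, 1 means it has. To find out whether a number has appeared, check whether its bit is 0. The JDK's BitSet is a relatively simple implementation of the bitmap, a data structure often used in Bloom filters; BitSet adopts the bitmap approach.

Typical use case: non-negative integers without duplicates.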
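As a small illustration of the marking idea, the sketch below uses the public java.util.BitSet API to find the first repeated value in an array. The class name SeenNumbers and the method firstDuplicate are made up for this example, and the code assumes every value is a non-negative int.

```java
import java.util.BitSet;

class SeenNumbers {
    // Returns the first value that occurs a second time, or -1 if all values are distinct.
    // Assumes every value is non-negative, because BitSet indices cannot be negative.
    static int firstDuplicate(int[] values) {
        BitSet seen = new BitSet();
        for (int v : values) {
            if (seen.get(v)) {
                return v;      // the bit is already 1: v appeared earlier
            }
            seen.set(v);       // mark v as seen by setting its bit to 1
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstDuplicate(new int[] {5, 1, 9, 1, 5}));
        System.out.println(firstDuplicate(new int[] {3, 2, 1}));
    }
}
```

The memory used depends on the largest value, not on how many values there are, which is why the approach suits dense ranges of integers.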

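Let's walk through the BitSet source code to see how it is implemented in Java.

In Java, BitSet lives in the java.util package and has been available since JDK 1.0. It has evolved across JDK releases; the code quoted here is from the JDK 1.7 source.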

package java.util;  
  
import java.io.*;  
import java.nio.ByteBuffer;  
import java.nio.ByteOrder;  
import java.nio.LongBuffer;  
  
public class BitSet implements Cloneable, java.io.Serializable {
    /*
     * BitSets are packed into arrays of "words."  Currently a word is
     * a long, which consists of 64 bits, requiring 6 address bits.
     * The choice of word size is determined purely by performance concerns.
     */
    private final static int ADDRESS_BITS_PER_WORD = 6;
    private final static int BITS_PER_WORD = 1 << ADDRESS_BITS_PER_WORD;
    private final static int BIT_INDEX_MASK = BITS_PER_WORD - 1;

    /* Used to shift left or right for a partial word mask */
    private static final long WORD_MASK = 0xffffffffffffffffL;

    /**
     * @serialField bits long[]
     *
     * The bits in this BitSet.  The ith bit is stored in bits[i/64] at
     * bit position i % 64 (where bit position 0 refers to the least
     * significant bit and 63 refers to the most significant bit).
     */
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("bits", long[].class),
    };

    /**
     * The internal field corresponding to the serialField "bits".
     */
    private long[] words;

    .....
}

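As the code shows, BitSet uses a long array as its internal storage. Its size is therefore always a multiple of 64 bits (the size of a long): a BitSet created with the no-argument constructor occupies one long, and only new BitSet(0) starts with an empty array.

Each element of the long array can be treated as a 64-bit binary number and is a subset of the whole BitSet. BitSet calls each of these subsets a word.

2. BitSet Constructors

It has two public constructors:

1. BitSet():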

    /**
     * Creates a new bit set. All bits are initially {@code false}.
     */
    public BitSet() {
        initWords(BITS_PER_WORD);
        sizeIsSticky = false;
    }

2. BitSet(int nbits):

    /**
     * Creates a bit set whose initial size is large enough to explicitly
     * represent bits with indices in the range {@code 0} through
     * {@code nbits-1}. All bits are initially {@code false}.
     *
     * @param  nbits the initial size of the bit set
     * @throws NegativeArraySizeException if the specified initial size
     *         is negative
     */
    public BitSet(int nbits) {
        // nbits can't be negative; size 0 is OK
        if (nbits < 0)
            throw new NegativeArraySizeException("nbits < 0: " + nbits);

        initWords(nbits);
        sizeIsSticky = true;
    }

Constants and methods used during BitSet initialization:

    private final static int ADDRESS_BITS_PER_WORD = 6;
    private final static int BITS_PER_WORD = 1 << ADDRESS_BITS_PER_WORD;  
    private void initWords(int nbits) {
        words = new long[wordIndex(nbits-1) + 1];
    }
    private static int wordIndex(int bitIndex) {
        return bitIndex >> ADDRESS_BITS_PER_WORD;
    }

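Of the two constructors, one takes no arguments and specifies no initial size; the other takes an int that specifies the size. If no size is given when a BitSet is created, it gets a default initial size of 64 bits, that is, one long.

Let's derive this default allocation from the constructor code.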

(1) The no-argument constructor BitSet() calls initWords(BITS_PER_WORD), where BITS_PER_WORD = 1 << ADDRESS_BITS_PER_WORD = 1 << 6 = 64. So the next step is to evaluate initWords(64).

(2) initWords(int nbits) executes words = new long[wordIndex(nbits-1) + 1]. Substituting nbits = 64 gives wordIndex(64-1) = wordIndex(63), so the next step is to evaluate wordIndex(63).

(3) wordIndex(int bitIndex) computes bitIndex >> ADDRESS_BITS_PER_WORD. With bitIndex = 63, 63 >> 6 shifts 63 right by six bits, which yields 0.

(4) Substituting that result back into words = new long[wordIndex(nbits-1) + 1] gives words = new long[1]: initialization allocates a long array of length 1, which is 64 bits.

The allocation for the constructor with a parameter can be derived the same way.
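The derivation above can be replayed in code. The following sketch copies the arithmetic of the private helpers wordIndex and initWords into a standalone class (InitialSize and wordsFor are names invented here) and prints the result next to what size() reports. That size() equals 64 times the word count is how the OpenJDK implementation quoted above behaves, not something the public API promises.

```java
import java.util.BitSet;

class InitialSize {
    static final int ADDRESS_BITS_PER_WORD = 6;

    // Same arithmetic as the private wordIndex helper: which long holds bitIndex.
    static int wordIndex(int bitIndex) {
        return bitIndex >> ADDRESS_BITS_PER_WORD;
    }

    // Number of longs that initWords(nbits) allocates.
    static int wordsFor(int nbits) {
        return wordIndex(nbits - 1) + 1;
    }

    public static void main(String[] args) {
        for (int nbits : new int[] {0, 1, 64, 65, 100, 200}) {
            System.out.println(nbits + " bits -> " + wordsFor(nbits)
                    + " word(s), size() = " + new BitSet(nbits).size());
        }
    }
}
```

Note the case nbits = 0: wordIndex(-1) is -1 because >> is an arithmetic shift, so zero longs are allocated.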

A new BitSet defaults to 64 bits, and a requested initial size between 1 and 64 bits also allocates 64 bits:

import java.util.BitSet;
public class BitSetDemo {
    public static void main(String[] args) {
        BitSet bitSet = new BitSet();
        System.out.println("default size="+bitSet.size()+" "+bitSet.length());

        BitSet bitSet2 = new BitSet(1);
        System.out.println("        size="+bitSet2.size()+" "+bitSet2.length());
    }
}

Output:

default size=64 0
        size=64 0

When more than 64 bits are requested, the allocation is rounded up to the next multiple of 64, for example:

import java.util.BitSet;
public class BitSetDemo {
    public static void main(String[] args) {
        //At this point bitSet.size() = 128
        BitSet bitSet = new BitSet(100);
        System.out.println("allocate size="+bitSet.size()+" "+bitSet.length());

        //At this point bitSet2.size() = 256
        BitSet bitSet2 = new BitSet(200);
        System.out.println("allocate size="+bitSet2.size()+" "+bitSet2.length());
    }
}

Output:

allocate size=128 0
allocate size=256 0

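If an initial size is given, BitSet rounds it up to the nearest multiple of 64 that is greater than or equal to it. For example, 64 bits take one long, while 65 bits take two longs, i.e. 128 bits. The main reasons for this rule are memory alignment and avoiding special-case handling, which keeps the code simple.

3. Common BitSet Methods

With that background, let's look at some common basic operations on BitSet:

(1) Initializing a BitSet

A BitSet can be created with or without an initial size, using the two constructors described above.

(2) Setting a specific bit

This stores a value in the BitSet.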

    /**
     * Sets the bit at the specified index to {@code true}.
     *
     * @param  bitIndex a bit index
     * @throws IndexOutOfBoundsException if the specified index is negative
     * @since  JDK1.0
     */
    public void set(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        int wordIndex = wordIndex(bitIndex);
        expandTo(wordIndex);

        words[wordIndex] |= (1L << bitIndex); // Restores invariants

        checkInvariants();
    }

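Before reading the code, let's clarify how a number is stored in a BitSet and how its storage location is found quickly.

As noted above, BitSet works with bitwise operations: every operation on a BitSet is implemented through logical operations on bits, each bit holds only 0 or 1, and the bits serve to mark data. So how does BitSet mark data?

By default a BitSet starts with a long array of length 1, and one long is 64 bits. Each bit is a binary 0 or 1, and its value together with its position indicates whether a number is in the BitSet: 0 means the number is absent, 1 means it is present. That is how data is marked, as the following figure shows:

[Figure: bit layout of the long array in a BitSet]

Here 0, 3, 63 and other values have been stored in the long array.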

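As the layout above shows, locating a value requires two things:

(1) Which long the value falls in, i.e. the wordIndex in words[wordIndex].

(2) Which bit within that long represents the value, i.e. its bitIndex.

These two steps quickly yield the storage location, or index, of a value.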
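The two lookups can be tried out against a real BitSet. In the sketch below, Locate, wordIndex and bitPosition are illustrative names, not JDK API; the only JDK calls are set and toLongArray (available since Java 7), which exposes a copy of the packed words.

```java
import java.util.BitSet;

class Locate {
    static int wordIndex(int n)   { return n >> 6; }   // n / 64 for n >= 0
    static int bitPosition(int n) { return n & 63; }   // n % 64 for n >= 0

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        for (int n : new int[] {0, 3, 63, 70}) {
            bits.set(n);
        }
        long[] words = bits.toLongArray();   // copy of the words in use
        for (int n : new int[] {0, 3, 63, 70}) {
            long mask = 1L << bitPosition(n);
            boolean present = (words[wordIndex(n)] & mask) != 0;
            System.out.println(n + ": word " + wordIndex(n)
                    + ", bit " + bitPosition(n) + ", present=" + present);
        }
    }
}
```

For example, 70 lands in word 1 at bit 6, because 70 = 64 + 6.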

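To see how set() stores data, let's put a value into a BitSet. Suppose the value is 14.

(1) bitIndex = 14 is passed in. The method first checks that the index is not negative.

(2) int wordIndex = wordIndex(bitIndex) determines which long the bit belongs to. Here wordIndex is 0 (14 >> 6), so the bit goes into the first long.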

private static int wordIndex(int bitIndex) {
        return bitIndex >> ADDRESS_BITS_PER_WORD;
}

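Before the bit operation, the method calls expandTo(wordIndex), which makes sure the BitSet actually has that long. If wordIndex + 1 exceeds wordsInUse, expandTo calls ensureCapacity, and if the array is too short it is replaced by a copy whose length is the larger of twice the current length and the required length. The code is as follows: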

   /**
     * Ensures that the BitSet can hold enough words.
     * @param wordsRequired the minimum acceptable number of words.
     */
    private void ensureCapacity(int wordsRequired) {
        if (words.length < wordsRequired) {
            // Allocate larger of doubled size or required size
            int request = Math.max(2 * words.length, wordsRequired);
            words = Arrays.copyOf(words, request);
            sizeIsSticky = false;
        }
    }

    /**
     * Ensures that the BitSet can accommodate a given wordIndex,
     * temporarily violating the invariants.  The caller must
     * restore the invariants before returning to the user,
     * possibly using recalculateWordsInUse().
     * @param wordIndex the index to be accommodated.
     */
    private void expandTo(int wordIndex) {
        int wordsRequired = wordIndex+1;
        if (wordsInUse < wordsRequired) {
            ensureCapacity(wordsRequired);
            wordsInUse = wordsRequired;
        }
    }

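(3) words[wordIndex] |= (1L << bitIndex) is the bit operation: 1 is shifted left and then ORed with words[wordIndex]. For 14, 1 is shifted left by 14 bits and ORed into words[0], which sets the bit at position bitIndex to 1 and marks the number as present in the BitSet. Because Java uses only the low six bits of the shift count for a long, 1L << bitIndex also selects the correct position, bitIndex % 64, when bitIndex is 64 or greater.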
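The three steps can be put together in a stripped-down sketch. TinyBits is a hypothetical class written for this article, not JDK code: it keeps only the long array and omits wordsInUse, sizeIsSticky and the invariant checks, but it follows the same growth rule and the same OR operation.

```java
import java.util.Arrays;

class TinyBits {
    long[] words = new long[1];   // start with one 64-bit word

    // Simplified stand-in for expandTo/ensureCapacity: grow to the larger of
    // twice the current length and the length actually required.
    private void expandTo(int wordIndex) {
        int wordsRequired = wordIndex + 1;
        if (words.length < wordsRequired) {
            words = Arrays.copyOf(words, Math.max(2 * words.length, wordsRequired));
        }
    }

    void set(int bitIndex) {
        if (bitIndex < 0) {
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
        }
        int wordIndex = bitIndex >> 6;
        expandTo(wordIndex);
        // Only the low 6 bits of the shift count are used for a long,
        // so this is the same as 1L << (bitIndex % 64).
        words[wordIndex] |= (1L << bitIndex);
    }

    public static void main(String[] args) {
        TinyBits bits = new TinyBits();
        bits.set(14);
        bits.set(78);   // 78 = 64 + 14: lands in words[1] at bit 14
        System.out.println(Long.toBinaryString(bits.words[0]));
        System.out.println("words: " + bits.words.length);
        bits.set(1000); // needs 16 words, more than double the current 2
        System.out.println("words: " + bits.words.length);
    }
}
```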

The JDK also provides an operation that sets a bit to a given value (0 or 1), as well as range operations. The following sets a bit to a specified value.

    /**
     * Sets the bit at the specified index to the specified value.
     *
     * @param  bitIndex a bit index
     * @param  value a boolean value to set
     * @throws IndexOutOfBoundsException if the specified index is negative
     * @since  1.4
     */
    public void set(int bitIndex, boolean value) {
        if (value)
            set(bitIndex);
        else
            clear(bitIndex);
    }

(3) Clearing a BitSet

a. Clear all bits, i.e. set them all to 0.

    /**
     * Sets all of the bits in this BitSet to {@code false}.
     *
     * @since 1.4
     */
    public void clear() {
        while (wordsInUse > 0)
            words[--wordsInUse] = 0;
    }

b. Clear a single bit.

    /**
     * Sets the bit specified by the index to {@code false}.
     *
     * @param  bitIndex the index of the bit to be cleared
     * @throws IndexOutOfBoundsException if the specified index is negative
     * @since  JDK1.0
     */
    public void clear(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        int wordIndex = wordIndex(bitIndex);
        if (wordIndex >= wordsInUse)
            return;

        words[wordIndex] &= ~(1L << bitIndex);

        recalculateWordsInUse();
        checkInvariants();
    }

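The first statement validates the argument: if bitIndex is less than 0, an IndexOutOfBoundsException is thrown.

What follows is the classic two-step pattern of BitSet operations: a. find the corresponding long; b. operate on the corresponding bit.

a. Find the corresponding long. The statement is: int wordIndex = wordIndex(bitIndex);
b. Operate on the corresponding bit. The statement is: words[wordIndex] &= ~(1L << bitIndex);

1 is shifted left, the result is inverted, and the inverted mask is ANDed with words[wordIndex].

In ~(1L << bitIndex), 1L << bitIndex first produces a value whose only 1 is at position bitIndex; inverting it yields a mask with 0 at that position and 1 everywhere else. ANDing the mask with words[wordIndex] clears the bit at bitIndex, because anything ANDed with 0 is 0, and leaves all other bits unchanged, because anything ANDed with 1 keeps its value. Clearing a bit that was already 0 is harmless: the result is still correct.

Note the two checks: if (bitIndex < 0) throws an exception for a negative index, while if (wordIndex >= wordsInUse) simply returns for an index beyond the words in use. Such an index has no stored data, so its bit is already 0 and there is nothing to clear.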
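To make the mask arithmetic concrete, here is a minimal sketch on a single long. ClearBit and its clear method are names chosen for this example; the second half uses the real BitSet to show that clearing a far-away index leaves the set unchanged.

```java
import java.util.BitSet;

class ClearBit {
    // Clears bit position pos (0..63) in one 64-bit word.
    static long clear(long word, int pos) {
        return word & ~(1L << pos);
    }

    public static void main(String[] args) {
        long word = 0b1011L;                                      // bits 0, 1 and 3 are set
        System.out.println(Long.toBinaryString(clear(word, 1)));  // bit 1 is cleared
        System.out.println(Long.toBinaryString(clear(word, 2)));  // bit 2 was already 0

        BitSet bits = new BitSet();
        bits.set(5);
        bits.clear(1000);   // index beyond the words in use: nothing to do
        System.out.println(bits);
    }
}
```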

c. Clear a range of bits.

    /**
     * Sets the bits from the specified {@code fromIndex} (inclusive) to the
     * specified {@code toIndex} (exclusive) to {@code false}.
     *
     * @param  fromIndex index of the first bit to be cleared
     * @param  toIndex index after the last bit to be cleared
     * @throws IndexOutOfBoundsException if {@code fromIndex} is negative,
     *         or {@code toIndex} is negative, or {@code fromIndex} is
     *         larger than {@code toIndex}
     * @since  1.4
     */
    public void clear(int fromIndex, int toIndex) {
        checkRange(fromIndex, toIndex);

        if (fromIndex == toIndex)
            return;

        int startWordIndex = wordIndex(fromIndex);
        if (startWordIndex >= wordsInUse)
            return;

        int endWordIndex = wordIndex(toIndex - 1);
        if (endWordIndex >= wordsInUse) {
            toIndex = length();
            endWordIndex = wordsInUse - 1;
        }

        long firstWordMask = WORD_MASK << fromIndex;
        long lastWordMask  = WORD_MASK >>> -toIndex;
        if (startWordIndex == endWordIndex) {
            // Case 1: One word
            words[startWordIndex] &= ~(firstWordMask & lastWordMask);
        } else {
            // Case 2: Multiple words
            // Handle first word
            words[startWordIndex] &= ~firstWordMask;

            // Handle intermediate words, if any
            for (int i = startWordIndex+1; i < endWordIndex; i++)
                words[i] = 0;

            // Handle last word
            words[endWordIndex] &= ~lastWordMask;
        }

        recalculateWordsInUse();
        checkInvariants();
    }

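This method splits the affected longs into three parts: the start word, the intermediate words and the end word.

In the start word, words[startWordIndex], firstWordMask selects the bits from the start position to the end of that long, and they are all set to 0. All bits of the intermediate words are set to 0. In the end word, words[endWordIndex], lastWordMask selects the bits from position 0 up to (but not including) the end position, and they are all set to 0. The start word and the end word each need their own mask operation.

A few special cases also need handling, such as the start word lying beyond the words in use, or the start word and the end word being the same long.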
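The two masks are easier to follow with concrete numbers. The sketch below reuses the expressions from the method (the class name RangeMasks and the two helper methods are made up here) and prints the masks in hexadecimal for a range inside one word and for a range that crosses a word boundary.

```java
class RangeMasks {
    static final long WORD_MASK = 0xffffffffffffffffL;

    // 1s from bit (fromIndex % 64) up to bit 63.
    static long firstWordMask(int fromIndex) {
        return WORD_MASK << fromIndex;
    }

    // 1s from bit 0 up to, but not including, bit (toIndex % 64);
    // all 64 bits when toIndex is a multiple of 64.
    static long lastWordMask(int toIndex) {
        return WORD_MASK >>> -toIndex;
    }

    public static void main(String[] args) {
        // Range [4, 10) lies inside one word: AND the two masks together.
        System.out.println(Long.toHexString(firstWordMask(4) & lastWordMask(10)));

        // Range [60, 70) spans words[0] and words[1]: one mask per word.
        System.out.println(Long.toHexString(firstWordMask(60)));
        System.out.println(Long.toHexString(lastWordMask(70)));
    }
}
```

The negative shift count works because only the low six bits of the count are used: -toIndex behaves like 64 - (toIndex % 64), which drops exactly the high bits that lie outside the range.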

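(4) Two important internal check functions

In the code above, most methods that modify data end with one or both of these calls: recalculateWordsInUse() and checkInvariants().
These two functions maintain and check the internal state of the BitSet. recalculateWordsInUse() is implemented as follows: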

   /**
     * Sets the field wordsInUse to the logical size in words of the bit set.
     * WARNING:This method assumes that the number of words actually in use is
     * less than or equal to the current value of wordsInUse!
     */
    private void recalculateWordsInUse() {
        // Traverse the bitset until a used word is found
        int i;
        for (i = wordsInUse-1; i >= 0; i--)
            if (words[i] != 0)
                break;

        wordsInUse = i+1; // The new logical size
    }

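wordsInUse is the number of longs in the array that are actually in use. The method scans from the end of the in-use words toward the front until it finds a non-zero word, then updates wordsInUse. words[wordsInUse-1] is the last long that contains a set bit, so this value records the effective size of the BitSet.

The method returns nothing; it sets wordsInUse to the index of the highest non-zero word plus one.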

   /**
     * Every public method must preserve these invariants.
     */
    private void checkInvariants() {
        assert(wordsInUse == 0 || words[wordsInUse - 1] != 0);
        assert(wordsInUse >= 0 && wordsInUse <= words.length);
        assert(wordsInUse == words.length || words[wordsInUse] == 0);
    }

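checkInvariants checks the internal state, in particular whether wordsInUse is valid. It relies on assert statements, so the checks run only when assertions are enabled.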
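Both helpers operate on private fields, so the sketch below restates them as static functions over an explicit array and counter. WordsInUse, recalculate and invariantsHold are names invented for this illustration; the range check is evaluated first here so that a bad counter cannot cause an out-of-bounds access.

```java
class WordsInUse {
    // Scans down from wordsInUse and returns the new logical size in words:
    // the index of the highest non-zero word plus one, or 0 if all are zero.
    static int recalculate(long[] words, int wordsInUse) {
        int i;
        for (i = wordsInUse - 1; i >= 0; i--) {
            if (words[i] != 0) {
                break;
            }
        }
        return i + 1;
    }

    // The three conditions asserted by checkInvariants, returned as a boolean.
    static boolean invariantsHold(long[] words, int wordsInUse) {
        return (wordsInUse >= 0 && wordsInUse <= words.length)
            && (wordsInUse == 0 || words[wordsInUse - 1] != 0)
            && (wordsInUse == words.length || words[wordsInUse] == 0);
    }

    public static void main(String[] args) {
        // Suppose words[2] has just been cleared but the counter still says 3.
        long[] words = {1L, 0L, 0L, 0L};
        int stale = 3;
        System.out.println(invariantsHold(words, stale));
        int fresh = recalculate(words, stale);
        System.out.println(fresh);
        System.out.println(invariantsHold(words, fresh));
    }
}
```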

(5) Flipping a specific bit

Flipping turns a 1 into 0 and a 0 into 1; it is an exclusive-or (XOR) with 1.

   /**
     * Sets the bit at the specified index to the complement of its
     * current value.
     *
     * @param  bitIndex the index of the bit to flip
     * @throws IndexOutOfBoundsException if the specified index is negative
     * @since  1.4
     */
    public void flip(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        int wordIndex = wordIndex(bitIndex);
        expandTo(wordIndex);

        words[wordIndex] ^= (1L << bitIndex);

        recalculateWordsInUse();
        checkInvariants();
    }

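Flipping also takes two steps: find the long that contains bitIndex, build a mask with a 1 at the position of bitIndex, and XOR the long with that mask.

int wordIndex = wordIndex(bitIndex) finds the long that contains bitIndex, and
words[wordIndex] ^= (1L << bitIndex) builds the mask for bitIndex and XORs the long with it.

Before the bit operation, flip(int bitIndex) also calls expandTo(wordIndex) to make sure the BitSet has the corresponding long, extending the long array if it does not. This was covered in the discussion of set(int bitIndex).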
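A short sketch of the XOR step, first on a plain long and then through the public flip method. FlipDemo and its flip helper are illustrative names only.

```java
import java.util.BitSet;

class FlipDemo {
    // Toggles bit position pos (0..63) in one 64-bit word.
    static long flip(long word, int pos) {
        return word ^ (1L << pos);
    }

    public static void main(String[] args) {
        long word = 0b0101L;
        long once = flip(word, 0);    // bit 0 goes from 1 to 0
        long twice = flip(once, 0);   // flipping again restores the original
        System.out.println(Long.toBinaryString(once) + " " + (twice == word));

        BitSet bits = new BitSet();
        bits.flip(130);               // grows the array, then turns the bit on
        System.out.println(bits + " length=" + bits.length());
        bits.flip(130);               // turns it off again
        System.out.println(bits + " length=" + bits.length());
    }
}
```

Because a flip can turn the highest set bit off, flip has to call recalculateWordsInUse() afterwards, which set never needs to do.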

A range flip is provided as well. It is implemented much like clear(int fromIndex, int toIndex), differing only in the bit operation applied. The code is as follows:

    /**
     * Sets each bit from the specified {@code fromIndex} (inclusive) to the
     * specified {@code toIndex} (exclusive) to the complement of its current
     * value.
     *
     * @param  fromIndex index of the first bit to flip
     * @param  toIndex index after the last bit to flip
     * @throws IndexOutOfBoundsException if {@code fromIndex} is negative,
     *         or {@code toIndex} is negative, or {@code fromIndex} is
     *         larger than {@code toIndex}
     * @since  1.4
     */
    public void flip(int fromIndex, int toIndex) {
        checkRange(fromIndex, toIndex);

        if (fromIndex == toIndex)
            return;

        int startWordIndex = wordIndex(fromIndex);
        int endWordIndex   = wordIndex(toIndex - 1);
        expandTo(endWordIndex);

        long firstWordMask = WORD_MASK << fromIndex;
        long lastWordMask  = WORD_MASK >>> -toIndex;
        if (startWordIndex == endWordIndex) {
            // Case 1: One word
            words[startWordIndex] ^= (firstWordMask & lastWordMask);
        } else {
            // Case 2: Multiple words
            // Handle first word
            words[startWordIndex] ^= firstWordMask;

            // Handle intermediate words, if any
            for (int i = startWordIndex+1; i < endWordIndex; i++)
                words[i] ^= WORD_MASK;

            // Handle last word
            words[endWordIndex] ^= lastWordMask;
        }

        recalculateWordsInUse();
        checkInvariants();
    }

(6) Getting the state of a specific bit

    /**
     * Returns the value of the bit with the specified index. The value
     * is {@code true} if the bit with the index {@code bitIndex}
     * is currently set in this {@code BitSet}; otherwise, the result
     * is {@code false}.
     *
     * @param  bitIndex   the bit index
     * @return the value of the bit with the specified index
     * @throws IndexOutOfBoundsException if the specified index is negative
     */
    public boolean get(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        checkInvariants();

        int wordIndex = wordIndex(bitIndex);
        return (wordIndex < wordsInUse)
            && ((words[wordIndex] & (1L << bitIndex)) != 0);
    }

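Again the method first finds the long that contains bitIndex and then applies a bit operation, here an AND (&) with the mask for bitIndex. If the index lies beyond the words in use, or the AND yields 0, the result is false and the value is not in the BitSet; a non-zero result means it is present.

The JDK also provides a method that returns a range of the BitSet: get(int fromIndex, int toIndex).

It returns a new BitSet made up of the bits of the original from fromIndex (inclusive) to toIndex (exclusive).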
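One detail worth seeing in action is that the returned BitSet is re-based: bit i of the result corresponds to bit fromIndex + i of the original. A minimal usage sketch (SubRange and slice are names invented here):

```java
import java.util.BitSet;

class SubRange {
    // Builds a BitSet with bits 1, 5 and 9 set and returns the range [4, 10).
    static BitSet slice() {
        BitSet bits = new BitSet();
        bits.set(1);
        bits.set(5);
        bits.set(9);
        return bits.get(4, 10);
    }

    public static void main(String[] args) {
        // Bits 5 and 9 of the original appear as bits 1 and 5 of the result.
        System.out.println(slice());
    }
}
```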

(7) Getting the size of a BitSet

    /**
     * Returns the "logical size" of this {@code BitSet}: the index of
     * the highest set bit in the {@code BitSet} plus one. Returns zero
     * if the {@code BitSet} contains no set bits.
     *
     * @return the logical size of this {@code BitSet}
     * @since  1.2
     */
    public int length() {
        if (wordsInUse == 0)
            return 0;

        return BITS_PER_WORD * (wordsInUse - 1) +
            (BITS_PER_WORD - Long.numberOfLeadingZeros(words[wordsInUse - 1]));
    }

length() returns the logical size of the BitSet: the index of the highest bit set to 1, plus 1.

A similar method, size(), returns the allocated size of the BitSet:

   /**
     * Returns the number of bits of space actually in use by this
     * {@code BitSet} to represent bit values.
     * The maximum element in the set is the size - 1st element.
     *
     * @return the number of bits currently in this bit set
     */
    public int size() {
        return words.length * BITS_PER_WORD;
    }

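This method returns the space the BitSet actually occupies. The value is always a multiple of 64, because space is allocated in whole longs. Occupied space is not the same as used space: size() reflects how much was allocated, while length() reflects the highest bit that is set. For example: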

BitSet bitSet = new BitSet(65);
System.out.println(bitSet.length()+" "+bitSet.size());

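Here length() returns 0, because no bit has been set yet, and size() returns 128: a request for 65 bits needs two longs. After bitSet.set(65), length() becomes 66 while size() stays 128, because bit 65 lies in the second long. Had the initial size been 63, only one long would be allocated, and setting bit 65 would grow the array to two longs.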
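The difference between the two methods is easy to check with a short program. LengthVsSize and describe are names chosen for this sketch; the size() values in the comments reflect the OpenJDK allocation strategy described above and are not guaranteed by the API contract.

```java
import java.util.BitSet;

class LengthVsSize {
    static String describe(BitSet bits) {
        return "length=" + bits.length() + " size=" + bits.size();
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(65);
        System.out.println(describe(bits));   // nothing set yet: length is 0

        bits.set(65);
        System.out.println(describe(bits));   // highest set bit is 65: length is 66

        BitSet small = new BitSet(63);        // one long allocated on OpenJDK
        small.set(65);                        // forces growth to a second long
        System.out.println(describe(small));
    }
}
```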

(8) hashCode

The hash code is an important property that characterizes the contents of a data structure. BitSet computes it as follows:

   /**
     * Returns the hash code value for this bit set. The hash code depends
     * only on which bits are set within this {@code BitSet}.
     *
     * <p>The hash code is defined to be the result of the following
     * calculation:
     *  <pre> {@code
     * public int hashCode() {
     *     long h = 1234;
     *     long[] words = toLongArray();
     *     for (int i = words.length; --i >= 0; )
     *         h ^= words[i] * (i + 1);
     *     return (int)((h >> 32) ^ h);
     * }}</pre>
     * Note that the hash code changes if the set of bits is altered.
     *
     * @return the hash code value for this bit set
     */
    public int hashCode() {
        long h = 1234;
        for (int i = wordsInUse; --i >= 0; )
            h ^= words[i] * (i + 1);

        return (int)((h >> 32) ^ h);
    }

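The hash code takes into account every word in use and its position in the words array, so it depends only on which bits are set, and changing the state of a bit generally changes the hash code.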
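Since the Javadoc spells out the calculation in terms of toLongArray(), it can be reproduced outside the class and compared with the built-in result. HashSketch is a name made up for this check.

```java
import java.util.BitSet;

class HashSketch {
    // Re-implements the documented calculation using only public API.
    static int hash(BitSet bits) {
        long h = 1234;
        long[] words = bits.toLongArray();
        for (int i = words.length; --i >= 0; ) {
            h ^= words[i] * (i + 1);
        }
        return (int) ((h >> 32) ^ h);
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(3);
        a.set(70);
        System.out.println(hash(a) == a.hashCode());

        // Same bits, much larger allocation: the hash ignores unused words.
        BitSet b = new BitSet(1024);
        b.set(3);
        b.set(70);
        System.out.println(a.hashCode() == b.hashCode());
    }
}
```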

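(9) Applications of BitSet in Java

BitSet is simple to use: just call the method for the operation you need.

BitSet is commonly used for processing massive data sets: lookup, de-duplication and sorting of large numbers of values. Compared with many alternatives it uses far less space, which can improve efficiency considerably. It can also support statistics such as log analysis and user counting, and its methods cover set operations such as union, intersection and complement. For more on using BitSet, see: Applications of BitSet.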
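As a rough sketch of two of these uses, the example below de-duplicates and sorts an int array by walking the set bits, then computes a union, an intersection and a difference with or, and and andNot. BitSetApps and sortedDistinct are hypothetical names, and the code assumes non-negative input values.

```java
import java.util.Arrays;
import java.util.BitSet;

class BitSetApps {
    // Returns the distinct input values in ascending order.
    // Assumes non-negative values; memory grows with the largest value.
    static int[] sortedDistinct(int[] values) {
        BitSet bits = new BitSet();
        for (int v : values) {
            bits.set(v);
        }
        int[] out = new int[bits.cardinality()];
        int k = 0;
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out[k++] = i;
            if (i == Integer.MAX_VALUE) {
                break;   // avoid overflow of i + 1
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedDistinct(new int[] {9, 3, 7, 3, 1, 9})));

        BitSet a = new BitSet();
        a.set(1, 4);                         // {1, 2, 3}
        BitSet b = new BitSet();
        b.set(3, 5);                         // {3, 4}

        BitSet union = (BitSet) a.clone();
        union.or(b);
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet difference = (BitSet) a.clone();
        difference.andNot(b);
        System.out.println(union + " " + intersection + " " + difference);
    }
}
```

The set operations modify the receiver in place, which is why each one starts from a clone of a.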

That concludes this introduction to the basics and common methods of BitSet.

References: (1) https://blog.csdn.net/cpfeed/article/details/7342480

（2）https://blog.csdn.net/jiangnan2014/article/details/53735429

（3）https://www.cnblogs.com/xujian2014/p/5491286.html

（4）https://www.cnblogs.com/lqminn/archive/2012/08/30/2664122.html
