欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

仿redis的HyperLogLog数据结构

程序员文章站 2022-07-11 12:21:46
...

对于网站的UV计数来说,使用集合这种数据结构是可以满足的,但是使用集合会带来占用空间过大的问题。

一旦数据量增大,它的某些统计特性就会涌现。HyperLogLog就是应用了大数据量的统计特性(随机次数和随机数二进制最大尾部0个数之间的关系),实现的数据量估计。
GitHub源码

若想有更深入的理解,可以参考《Redis 深度历险:核心原理与应用实践》这本书,我想比我理解的更深入一些。

class Bucket
{
    private $maxTailZeroCount = 0;

    private function calculateTailZeroCount($number)
    {
        $tailZeroCount = 0;

        for ($i = 0; $i < 32; $i++) {
            if (($number >> $i << $i) !== $number) {
                $tailZeroCount = $i - 1;
                break;
            }
        }

        return $tailZeroCount;
    }

    public function random()
    {
        $max = pow(2, 32) - 1;

        $value = mt_rand(0, $max);

        $tailZeroCount = $this->calculateTailZeroCount($value);

        if ($tailZeroCount > $this->maxTailZeroCount) {
            $this->maxTailZeroCount = $tailZeroCount;
        }
    }

    public function getMaxTailZeroCount()
    {
        return $this->maxTailZeroCount;
    }
}

class HyperLogLog
{

    private $bucketMap = [];
    private $bucketCount = 16384;

    public function __construct($bucketCount = 16384)
    {
        $this->bucketCount = $bucketCount;

        for ($i = 0; $i < $bucketCount; $i++) {
            $this->bucketMap[$i] = new Bucket();
        }
     }

    public function add($value)
    {
        $number = crc32($value);

        $bucketOffset = (($number & 0xffff0000) >> 16) % $this->bucketCount;

        $this->getBucket($bucketOffset)->random();
    }

    public function count()
    {
        $number = 0.0;
        $notEmptyCount = 0;

        for ($i = 0; $i < $this->bucketCount; $i++) {
            $division = floatval($this->getBucket($i)->getMaxTailZeroCount());
            if ($division != 0) {
                $number += 1.0 / $division;
                $notEmptyCount++;
            }
        }

        $avg = floatval($notEmptyCount) / $number;
        return round(pow(2, $avg) * $notEmptyCount);
    }

    /**
     * @param $offset
     * @return Bucket
     */
    private function getBucket($offset)
    {
        return $this->bucketMap[$offset];
    }
}

class Test
{
    public function run()
    {
        $hyper = new HyperLogLog();

        for ($i = 0; $i < 10000000; $i++) {
            $hyper->add('Test01'.$i);
        }

        echo $hyper->count();
    }
}

$test = new Test();
$test->run();
相关标签: HyperLogLog redis