A Deep Dive into Redis's LRU Eviction Policy
Preface
When Redis is used as a cache, some scenarios require keeping memory consumption under control. Redis frees space by deleting expired keys, using two expiry strategies:
- Lazy deletion: every time a key is fetched from the keyspace, Redis checks whether it has expired; if so, the key is deleted, otherwise it is returned.
- Periodic deletion: at regular intervals, Redis scans the database and deletes the expired keys it finds.
In addition, Redis can enable LRU eviction to automatically discard key-value pairs.
The LRU algorithm
When data must be evicted from a cache, ideally we would evict entries that will never be used again and keep entries that will still be accessed frequently, but a cache cannot predict the future. One workaround is the LRU heuristic: data accessed recently is more likely to be accessed again. Cache workloads typically follow a skewed distribution in which a small subset of the data receives the vast majority of accesses. When the access pattern rarely changes, we can record each entry's last access time; the entry with the smallest idle time can be considered the most likely to be accessed again.
Consider the following access pattern: a is accessed every 5 s, b every 2 s, and c and d every 10 s; | marks the cutoff point at which idle times are computed:
~~~~~a~~~~~a~~~~~a~~~~a~~~~~a~~~~~a~~|
~~b~~b~~b~~b~~b~~b~~b~~b~~b~~b~~b~~b~|
~~~~~~~~~~c~~~~~~~~~c~~~~~~~~~c~~~~~~|
~~~~~d~~~~~~~~~~d~~~~~~~~~d~~~~~~~~~d|
LRU works well for a, b, and c, correctly predicting the future access probabilities b > a > c, but d happens to end up with the smallest idle time of all.
Still, on the whole, LRU performs well enough in practice.
LRU configuration parameters
Three Redis configuration options relate to LRU:
- maxmemory: the memory limit for stored data, e.g. 100mb. When the cache's memory use exceeds this limit, eviction is triggered. A value of 0 means the amount of cached data is unlimited, i.e. LRU is disabled. The default is 0 on 64-bit systems, while 32-bit systems have an implicit 3GB limit.
- maxmemory-policy: the eviction policy applied once the limit is reached.
- maxmemory-samples: the sampling precision, i.e. the number of keys picked at random per eviction check. The larger the value, the closer the behavior approaches true LRU, but also the higher the cost, with some impact on performance. The default sample size is 5.
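Putting the three options together, a minimal redis.conf fragment might look like this (the values are illustrative, not recommendations):

```conf
# Evict once memory use exceeds 100 MB (0 = unlimited, LRU disabled)
maxmemory 100mb
# Evict any key using approximate LRU
maxmemory-policy allkeys-lru
# Number of keys sampled per eviction check (5 is the default)
maxmemory-samples 5
```

The same settings can also be changed at runtime with CONFIG SET, e.g. `CONFIG SET maxmemory-policy allkeys-lru`.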
Eviction policies
The eviction policy, i.e. the value of maxmemory-policy, can be one of:
- noeviction: when the cached data exceeds the maxmemory limit, return an error to clients executing commands that would allocate memory (most write commands, with DEL and a few others as exceptions).
- allkeys-lru: apply LRU eviction across all keys.
- volatile-lru: apply LRU eviction only to keys with an expire set.
- allkeys-random: evict random keys among all keys.
- volatile-random: evict random keys among those with an expire set.
- volatile-ttl: evict only keys with an expire set, preferring those with a smaller remaining TTL (time to live).
The volatile-lru, volatile-random, and volatile-ttl policies do not consider the full keyspace, so they may fail to reclaim enough memory. When there are no expired keys and no keys with an expire set, these three policies behave much like noeviction.
General rules of thumb:
- Use allkeys-lru when you expect a power-law request distribution (the 80/20 rule and the like), i.e. when a subset of elements is accessed far more often than the rest.
- Use allkeys-random when all keys are scanned cyclically and continuously, or when the request distribution is expected to be uniform (every element is accessed with roughly the same probability).
- Use volatile-ttl when the cached objects have meaningfully different TTL values.
The volatile-lru and volatile-random policies are useful when you want a single Redis instance to serve both as an eviction-managed cache and as persistent storage for a set of frequently used keys: keys without an expire are kept persistently, while keys with an expire participate in eviction. That said, running two separate instances is usually the better way to solve this problem.
Setting an expire on a key also costs memory, so allkeys-lru is more memory-efficient: under this policy there is no need to set expires at all.
The approximate LRU algorithm
A true LRU algorithm needs a doubly linked list to track the order in which data was last accessed, but to save memory Redis does not implement LRU exactly. Rather than always selecting the single least recently used key, Redis runs an approximation of LRU: it samples a small number of keys and evicts the least recently used key among the sample. The precision of the algorithm can be tuned by adjusting the per-eviction sample size, maxmemory-samples.
According to the Redis author, 24 bits could be squeezed out of every Redis object. 24 bits is not enough to store two list pointers, but it is enough for a low-resolution timestamp, so each object stores the Unix time, in seconds, at which it was created or last updated: the LRU clock. A 24-bit seconds counter takes 194 days to overflow, and since cached data is updated far more often than that, this is good enough.
Redis keeps its keyspace in a hash table; selecting the globally least recently used key would require an extra data structure to hold this metadata, which is clearly not worth the cost. Initially, Redis simply picked 3 random keys and evicted the least recently used among them; the algorithm was later generalized to N keys, with a default of 5.
Redis 3.0 improved the algorithm further by introducing a pool of eviction candidates, holding 16 keys by default and kept sorted by idle time. On each update, N keys are sampled at random from the keyspace and their idle times computed; a key enters the pool only if the pool is not yet full or its idle time exceeds the smallest idle time in the pool. The key with the largest idle time is then evicted from the pool.
The true and approximate LRU algorithms can be compared visually:
In the comparison image, the light gray band shows objects that were evicted, the gray band objects that were not, and the green band newly added objects. With maxmemory-samples set to 5, Redis 3.0 performs better than Redis 2.8; with a sample size of 10, Redis 3.0's approximation gets very close to the theoretical performance of true LRU.
When the data access pattern closely follows a power-law distribution, i.e. most accesses concentrate on a subset of keys, the LRU approximation handles it very well.
In simulation experiments with a power-law access pattern, true LRU and approximate LRU turn out to be almost indistinguishable.
LRU source code analysis
Both keys and values in Redis are redisObject structs:
```c
typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    int refcount;
    void *ptr;
} robj;
```
The lru field, a 24-bit unsigned bitfield, records the object's LRU time.
Every Redis command that accesses cached data calls lookupKey:
```c
robj *lookupKey(redisDb *db, robj *key, int flags) {
    dictEntry *de = dictFind(db->dict,key->ptr);
    if (de) {
        robj *val = dictGetVal(de);

        /* Update the access time for the ageing algorithm.
         * Don't do it if we have a saving child, as this will trigger
         * a copy on write madness. */
        if (server.rdb_child_pid == -1 &&
            server.aof_child_pid == -1 &&
            !(flags & LOOKUP_NOTOUCH))
        {
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                updateLFU(val);
            } else {
                val->lru = LRU_CLOCK();
            }
        }
        return val;
    } else {
        return NULL;
    }
}
```
Under an LRU (non-LFU) policy, this function refreshes the object's lru field, setting it to the value of LRU_CLOCK():
```c
/* Return the LRU clock, based on the clock resolution. This is a time
 * in a reduced-bits format that can be used to set and check the
 * object->lru field of redisObject structures. */
unsigned int getLRUClock(void) {
    return (mstime()/LRU_CLOCK_RESOLUTION) & LRU_CLOCK_MAX;
}

/* This function is used to obtain the current LRU clock.
 * If the current resolution is lower than the frequency we refresh the
 * LRU clock (as it should be in production servers) we return the
 * precomputed value, otherwise we need to resort to a system call. */
unsigned int LRU_CLOCK(void) {
    unsigned int lruclock;
    if (1000/server.hz <= LRU_CLOCK_RESOLUTION) {
        atomicGet(server.lruclock,lruclock);
    } else {
        lruclock = getLRUClock();
    }
    return lruclock;
}
```
LRU_CLOCK() depends on LRU_CLOCK_RESOLUTION (default 1000 ms), which defines the precision of the LRU algorithm, i.e. how long one LRU tick lasts. server.hz is the frequency of the server's background cron; if the server refreshes its cached clock at least once per LRU tick, LRU_CLOCK() reads the precomputed server.lruclock instead of making a system call, reducing overhead.
The entry point for command processing in Redis is processCommand:
```c
int processCommand(client *c) {
    /* Handle the maxmemory directive.
     *
     * Note that we do not want to reclaim memory if we are here re-entering
     * the event loop since there is a busy Lua script running in timeout
     * condition, to avoid mixing the propagation of scripts with the
     * propagation of DELs due to eviction. */
    if (server.maxmemory && !server.lua_timedout) {
        int out_of_memory = freeMemoryIfNeededAndSafe() == C_ERR;
        /* freeMemoryIfNeeded may flush slave output buffers. This may result
         * into a slave, that may be the active client, to be freed. */
        if (server.current_client == NULL) return C_ERR;

        /* It was impossible to free enough memory, and the command the client
         * is trying to execute is denied during OOM conditions or the client
         * is in MULTI/EXEC context? Error. */
        if (out_of_memory &&
            (c->cmd->flags & CMD_DENYOOM ||
             (c->flags & CLIENT_MULTI && c->cmd->proc != execCommand))) {
            flagTransaction(c);
            addReply(c, shared.oomerr);
            return C_OK;
        }
    }
}
```
Only the memory-reclamation portion is shown above; freeMemoryIfNeededAndSafe is the function that frees memory:
```c
int freeMemoryIfNeeded(void) {
    /* By default replicas should ignore maxmemory
     * and just be masters exact copies. */
    if (server.masterhost && server.repl_slave_ignore_maxmemory) return C_OK;

    size_t mem_reported, mem_tofree, mem_freed;
    mstime_t latency, eviction_latency;
    long long delta;
    int slaves = listLength(server.slaves);

    /* When clients are paused the dataset should be static not just from the
     * POV of clients not being able to write, but also from the POV of
     * expires and evictions of keys not being performed. */
    if (clientsArePaused()) return C_OK;
    if (getMaxmemoryState(&mem_reported,NULL,&mem_tofree,NULL) == C_OK)
        return C_OK;

    mem_freed = 0;

    if (server.maxmemory_policy == MAXMEMORY_NO_EVICTION)
        goto cant_free; /* We need to free memory, but policy forbids. */

    latencyStartMonitor(latency);
    while (mem_freed < mem_tofree) {
        int j, k, i, keys_freed = 0;
        static unsigned int next_db = 0;
        sds bestkey = NULL;
        int bestdbid;
        redisDb *db;
        dict *dict;
        dictEntry *de;

        if (server.maxmemory_policy & (MAXMEMORY_FLAG_LRU|MAXMEMORY_FLAG_LFU) ||
            server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL)
        {
            struct evictionPoolEntry *pool = EvictionPoolLRU;

            while(bestkey == NULL) {
                unsigned long total_keys = 0, keys;

                /* We don't want to make local-db choices when expiring keys,
                 * so to start populate the eviction pool sampling keys from
                 * every DB. */
                for (i = 0; i < server.dbnum; i++) {
                    db = server.db+i;
                    dict = (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) ?
                            db->dict : db->expires;
                    if ((keys = dictSize(dict)) != 0) {
                        evictionPoolPopulate(i, dict, db->dict, pool);
                        total_keys += keys;
                    }
                }
                if (!total_keys) break; /* No keys to evict. */

                /* Go backward from best to worst element to evict. */
                for (k = EVPOOL_SIZE-1; k >= 0; k--) {
                    if (pool[k].key == NULL) continue;
                    bestdbid = pool[k].dbid;

                    if (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) {
                        de = dictFind(server.db[pool[k].dbid].dict,
                            pool[k].key);
                    } else {
                        de = dictFind(server.db[pool[k].dbid].expires,
                            pool[k].key);
                    }

                    /* Remove the entry from the pool. */
                    if (pool[k].key != pool[k].cached)
                        sdsfree(pool[k].key);
                    pool[k].key = NULL;
                    pool[k].idle = 0;

                    /* If the key exists, is our pick. Otherwise it is
                     * a ghost and we need to try the next element. */
                    if (de) {
                        bestkey = dictGetKey(de);
                        break;
                    } else {
                        /* Ghost... Iterate again. */
                    }
                }
            }
        }

        /* volatile-random and allkeys-random policy */
        else if (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM ||
                 server.maxmemory_policy == MAXMEMORY_VOLATILE_RANDOM)
        {
            /* When evicting a random key, we try to evict a key for
             * each DB, so we use the static 'next_db' variable to
             * incrementally visit all DBs. */
            for (i = 0; i < server.dbnum; i++) {
                j = (++next_db) % server.dbnum;
                db = server.db+j;
                dict = (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM) ?
                        db->dict : db->expires;
                if (dictSize(dict) != 0) {
                    de = dictGetRandomKey(dict);
                    bestkey = dictGetKey(de);
                    bestdbid = j;
                    break;
                }
            }
        }

        /* Finally remove the selected key. */
        if (bestkey) {
            db = server.db+bestdbid;
            robj *keyobj = createStringObject(bestkey,sdslen(bestkey));
            propagateExpire(db,keyobj,server.lazyfree_lazy_eviction);
            /* We compute the amount of memory freed by db*Delete() alone.
             * It is possible that actually the memory needed to propagate
             * the DEL in AOF and replication link is greater than the one
             * we are freeing removing the key, but we can't account for
             * that otherwise we would never exit the loop.
             *
             * AOF and Output buffer memory will be freed eventually so
             * we only care about memory used by the key space. */
            delta = (long long) zmalloc_used_memory();
            latencyStartMonitor(eviction_latency);
            if (server.lazyfree_lazy_eviction)
                dbAsyncDelete(db,keyobj);
            else
                dbSyncDelete(db,keyobj);
            latencyEndMonitor(eviction_latency);
            latencyAddSampleIfNeeded("eviction-del",eviction_latency);
            latencyRemoveNestedEvent(latency,eviction_latency);
            delta -= (long long) zmalloc_used_memory();
            mem_freed += delta;
            server.stat_evictedkeys++;
            notifyKeyspaceEvent(NOTIFY_EVICTED, "evicted", keyobj, db->id);
            decrRefCount(keyobj);
            keys_freed++;

            /* When the memory to free starts to be big enough, we may
             * start spending so much time here that is impossible to
             * deliver data to the slaves fast enough, so we force the
             * transmission here inside the loop. */
            if (slaves) flushSlavesOutputBuffers();

            /* Normally our stop condition is the ability to release
             * a fixed, pre-computed amount of memory. However when we
             * are deleting objects in another thread, it's better to
             * check, from time to time, if we already reached our target
             * memory, since the "mem_freed" amount is computed only
             * across the dbAsyncDelete() call, while the thread can
             * release the memory all the time. */
            if (server.lazyfree_lazy_eviction && !(keys_freed % 16)) {
                if (getMaxmemoryState(NULL,NULL,NULL,NULL) == C_OK) {
                    /* Let's satisfy our stop condition. */
                    mem_freed = mem_tofree;
                }
            }
        }

        if (!keys_freed) {
            latencyEndMonitor(latency);
            latencyAddSampleIfNeeded("eviction-cycle",latency);
            goto cant_free; /* nothing to free... */
        }
    }
    latencyEndMonitor(latency);
    latencyAddSampleIfNeeded("eviction-cycle",latency);
    return C_OK;

cant_free:
    /* We are here if we are not able to reclaim memory. There is only one
     * last thing we can try: check if the lazyfree thread has jobs in queue
     * and wait... */
    while(bioPendingJobsOfType(BIO_LAZY_FREE)) {
        if (((mem_reported - zmalloc_used_memory()) + mem_freed) >= mem_tofree)
            break;
        usleep(1000);
    }
    return C_ERR;
}

/* This is a wrapper for freeMemoryIfNeeded() that only really calls the
 * function if right now there are the conditions to do so safely:
 *
 * - There must be no script in timeout condition.
 * - Nor we are loading data right now. */
int freeMemoryIfNeededAndSafe(void) {
    if (server.lua_timedout || server.loading) return C_OK;
    return freeMemoryIfNeeded();
}
```
All of the maxmemory-policy eviction strategies are implemented inside this function.
Under an LRU policy, the code iterates over the databases starting from db 0 (16 by default) and, depending on the policy, samples from either the redisDb's dict (all keys) or its expires dict (keys with an expire set) to refresh the candidate pool. The pool is updated by evictionPoolPopulate:
```c
void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict,
                          struct evictionPoolEntry *pool)
{
    int j, k, count;
    dictEntry *samples[server.maxmemory_samples];

    count = dictGetSomeKeys(sampledict,samples,server.maxmemory_samples);
    for (j = 0; j < count; j++) {
        unsigned long long idle;
        sds key;
        robj *o;
        dictEntry *de;

        de = samples[j];
        key = dictGetKey(de);

        /* If the dictionary we are sampling from is not the main
         * dictionary (but the expires one) we need to lookup the key
         * again in the key dictionary to obtain the value object. */
        if (server.maxmemory_policy != MAXMEMORY_VOLATILE_TTL) {
            if (sampledict != keydict) de = dictFind(keydict, key);
            o = dictGetVal(de);
        }

        /* Calculate the idle time according to the policy. This is called
         * idle just because the code initially handled LRU, but is in fact
         * just a score where an higher score means better candidate. */
        if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
            idle = estimateObjectIdleTime(o);
        } else if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
            /* When we use an LRU policy, we sort the keys by idle time
             * so that we expire keys starting from greater idle time.
             * However when the policy is an LFU one, we have a frequency
             * estimation, and we want to evict keys with lower frequency
             * first. So inside the pool we put objects using the inverted
             * frequency subtracting the actual frequency to the maximum
             * frequency of 255. */
            idle = 255-LFUDecrAndReturn(o);
        } else if (server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL) {
            /* In this case the sooner the expire the better. */
            idle = ULLONG_MAX - (long)dictGetVal(de);
        } else {
            serverPanic("Unknown eviction policy in evictionPoolPopulate()");
        }

        /* Insert the element inside the pool.
         * First, find the first empty bucket or the first populated
         * bucket that has an idle time smaller than our idle time. */
        k = 0;
        while (k < EVPOOL_SIZE &&
               pool[k].key &&
               pool[k].idle < idle) k++;
        if (k == 0 && pool[EVPOOL_SIZE-1].key != NULL) {
            /* Can't insert if the element is < the worst element we have
             * and there are no empty buckets. */
            continue;
        } else if (k < EVPOOL_SIZE && pool[k].key == NULL) {
            /* Inserting into empty position. No setup needed before insert. */
        } else {
            /* Inserting in the middle. Now k points to the first element
             * greater than the element to insert. */
            if (pool[EVPOOL_SIZE-1].key == NULL) {
                /* Free space on the right? Insert at k shifting
                 * all the elements from k to end to the right. */

                /* Save SDS before overwriting. */
                sds cached = pool[EVPOOL_SIZE-1].cached;
                memmove(pool+k+1,pool+k,
                    sizeof(pool[0])*(EVPOOL_SIZE-k-1));
                pool[k].cached = cached;
            } else {
                /* No free space on right? Insert at k-1 */
                k--;
                /* Shift all elements on the left of k (included) to the
                 * left, so we discard the element with smaller idle time. */
                sds cached = pool[0].cached; /* Save SDS before overwriting. */
                if (pool[0].key != pool[0].cached) sdsfree(pool[0].key);
                memmove(pool,pool+1,sizeof(pool[0])*k);
                pool[k].cached = cached;
            }
        }

        /* Try to reuse the cached SDS string allocated in the pool entry,
         * because allocating and deallocating this object is costly
         * (according to the profiler, not my fantasy. Remember:
         * premature optimizbla bla bla bla. */
        int klen = sdslen(key);
        if (klen > EVPOOL_CACHED_SDS_SIZE) {
            pool[k].key = sdsdup(key);
        } else {
            memcpy(pool[k].cached,key,klen+1);
            sdssetlen(pool[k].cached,klen);
            pool[k].key = pool[k].cached;
        }
        pool[k].idle = idle;
        pool[k].dbid = dbid;
    }
}
```
Redis samples maxmemory-samples keys, computes their idle times, and inserts a key into the pool whenever it qualifies (the pool has free slots, or the key's idle time exceeds that of some key already in the pool). Once the pool has been updated, the key with the largest idle time is evicted from it.
estimateObjectIdleTime computes a Redis object's idle time:
```c
/* Given an object returns the min number of milliseconds the object was never
 * requested, using an approximated LRU algorithm. */
unsigned long long estimateObjectIdleTime(robj *o) {
    unsigned long long lruclock = LRU_CLOCK();
    if (lruclock >= o->lru) {
        return (lruclock - o->lru) * LRU_CLOCK_RESOLUTION;
    } else {
        return (lruclock + (LRU_CLOCK_MAX - o->lru)) *
                    LRU_CLOCK_RESOLUTION;
    }
}
```
The idle time is essentially the difference between the global LRU_CLOCK() and the object's lru field, multiplied by LRU_CLOCK_RESOLUTION to convert seconds into milliseconds; the else branch compensates for the 24-bit clock wrapping around.
Summary
That's all for this article. I hope the content serves as a useful reference for your study or work.