A Detailed Analysis of a Redis Cluster Failure
Symptoms:
The application layer reported failures when querying Redis.
Cluster topology:
3 masters and 3 slaves; each node holds about 8 GB of data.
Machine layout:
All machines sit in the same rack:
xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201
redis-server process status:
Checked with ps -eo pid,lstart | grep $pid,
which showed the process had already been running continuously for 3 months.
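For reference, the process start time and elapsed running time can be checked like this (a minimal shell sketch; the grep pattern is just one way to filter the output):
# Show pid, start timestamp, elapsed running time and command line of every redis-server process
ps -eo pid,lstart,etime,cmd | grep '[r]edis-server'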
Cluster node state before the failure:
xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave
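The node view above can be dumped from any node with redis-cli; a minimal sketch (8370 on xx.x.xxx.199 is just one of the nodes listed above):
# Print every node's id, address, master/slave role, link state and slot assignment
redis-cli -h xx.x.xxx.199 -p 8370 cluster nodes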
---------------------------------- Log analysis ----------------------------------
Step 1:
Master 8371 lost its connection to slave 8373:
46590:m 09 sep 18:57:51.379 # connection with slave xx.x.xxx.199:8373 lost.
Step 2:
Masters 8370/8375 marked 8371 as failing:
42645:m 09 sep 18:57:50.117 * marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).
Step 3:
Slaves 8372/8373/8374 received the FAIL message from master 8375 about 8371:
46986:s 09 sep 18:57:50.120 * fail message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65
Step 4:
Masters 8370/8375 granted failover authorization to 8373, allowing it to be promoted to master:
42645:m 09 sep 18:57:51.055 # failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16
Step 5:
The former master 8371 reconfigured itself as a slave of 8373:
46590:m 09 sep 18:57:51.488 # configuration change detected. reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6
Step 6:
Masters 8370/8375/8373 cleared 8371's fail state:
42645:m 09 sep 18:57:51.522 * clear fail state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.
Step 7:
The new slave 8371 started its first full resync from the new master 8373:
8373's log:
4255:m 09 sep 18:57:51.906 * full resync requested by slave xx.x.xxx.200:8371
4255:m 09 sep 18:57:51.906 * starting bgsave for sync with target: disk
4255:m 09 sep 18:57:51.941 * background saving started by pid 5230
8371's log:
46590:s 09 sep 18:57:51.948 * full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993
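While a full resync is in flight, its progress can be watched from the slave side; a minimal sketch using the addresses above:
# role, master_link_status, master_sync_in_progress and master_sync_left_bytes
# show whether the initial bulk transfer is still running and how much is left
redis-cli -h xx.x.xxx.200 -p 8371 info replication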
Step 8:
Masters 8370/8375 marked 8373 (the new master) as failing:
42645:m 09 sep 18:58:00.320 * marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).
Step 9:
Masters 8370/8375 determined that 8373 (the new master) was reachable again:
60295:m 09 sep 18:58:18.181 * clear fail state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.
Step 10:
Master 8373 completed the BGSAVE required for the full resync:
5230:c 09 sep 18:59:01.474 * db saved on disk
5230:c 09 sep 18:59:01.491 * rdb: 7112 mb of memory used by copy-on-write
4255:m 09 sep 18:59:01.877 * background saving terminated with success
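The state of the BGSAVE a master runs for a full resync can be checked with INFO; a minimal sketch:
# rdb_bgsave_in_progress shows whether a background save is still running;
# rdb_last_bgsave_status and rdb_last_bgsave_time_sec describe the last completed one
redis-cli -h xx.x.xxx.199 -p 8373 info persistence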
Step 11:
Slave 8371 started receiving data from master 8373:
46590:s 09 sep 18:59:02.263 * master <-> slave sync: receiving 2657606930 bytes from master
Step 12:
Master 8373 found that the output buffer for slave 8371 had exceeded the configured limit and scheduled the connection to be closed:
4255:m 09 sep 19:00:19.014 # client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=s db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed asap for overcoming of output buffer limits.
4255:m 09 sep 19:00:19.015 # connection with slave xx.x.xxx.200:8371 lost.
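The limit that closed this connection, and the buffer each client is currently holding, can be inspected on the master; a minimal sketch (omem in CLIENT LIST is the same field that appears in the log line above):
# Current hard/soft output-buffer limits per client class (normal / slave / pubsub)
redis-cli -h xx.x.xxx.199 -p 8373 config get client-output-buffer-limit
# Per-connection view; the omem column is the output-buffer size that overran the limit
redis-cli -h xx.x.xxx.199 -p 8373 client list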
Step 13:
Slave 8371's sync from master 8373 broke off when the connection was lost; the first full resync had failed:
46590:s 09 sep 19:00:19.018 # i/o error trying to sync with master: connection lost
46590:s 09 sep 19:00:20.102 * connecting to master xx.x.xxx.199:8373
46590:s 09 sep 19:00:20.102 * master <-> slave sync started
Step 14:
Slave 8371 tried to resync, but the connection failed: master 8373 had hit its maximum number of clients:
46590:s 09 sep 19:00:21.103 * connecting to master xx.x.xxx.199:8373
46590:s 09 sep 19:00:21.103 * master <-> slave sync started
46590:s 09 sep 19:00:21.104 * non blocking connect for sync fired the event.
46590:s 09 sep 19:00:21.104 # error reply to ping from master: '-err max number of clients reached'
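Whether a node has really run out of connection slots can be confirmed by comparing connected_clients with the configured ceiling; a minimal sketch:
# Number of currently connected clients
redis-cli -h xx.x.xxx.199 -p 8373 info clients
# The ceiling that produced the '-err max number of clients reached' reply
redis-cli -h xx.x.xxx.199 -p 8373 config get maxclients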
Step 15:
Slave 8371 reconnected to master 8373 and started a second full resync:
8371's log:
46590:s 09 sep 19:00:49.175 * connecting to master xx.x.xxx.199:8373
46590:s 09 sep 19:00:49.175 * master <-> slave sync started
46590:s 09 sep 19:00:49.175 * non blocking connect for sync fired the event.
46590:s 09 sep 19:00:49.176 * master replied to ping, replication can continue...
46590:s 09 sep 19:00:49.179 * partial resynchronization not possible (no cached master)
46590:s 09 sep 19:00:49.501 * full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373's log:
4255:m 09 sep 19:00:49.176 * slave xx.x.xxx.200:8371 asks for synchronization
4255:m 09 sep 19:00:49.176 * full resync requested by slave xx.x.xxx.200:8371
4255:m 09 sep 19:00:49.176 * starting bgsave for sync with target: disk
4255:m 09 sep 19:00:49.498 * background saving started by pid 18413
18413:c 09 sep 19:01:52.466 * db saved on disk
18413:c 09 sep 19:01:52.620 * rdb: 2124 mb of memory used by copy-on-write
4255:m 09 sep 19:01:53.186 * background saving terminated with success
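The "partial resynchronization not possible (no cached master)" line in 8371's log above means the slave had nothing to resume from, so a second full resync, and therefore a second BGSAVE, was unavoidable. The master-side replication backlog that partial resyncs depend on is governed by two settings; a minimal sketch for checking them:
# Size of the replication backlog kept for partial resync, and how long it is retained
redis-cli -h xx.x.xxx.199 -p 8373 config get repl-backlog-size
redis-cli -h xx.x.xxx.199 -p 8373 config get repl-backlog-ttl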
Step 16:
Slave 8371 received the data successfully and started loading it into memory:
46590:s 09 sep 19:01:53.190 * master <-> slave sync: receiving 2637183250 bytes from master
46590:s 09 sep 19:04:51.485 * master <-> slave sync: flushing old data
46590:s 09 sep 19:05:58.695 * master <-> slave sync: loading db in memory
Step 17:
The cluster returned to normal:
42645:m 09 sep 19:05:58.786 * clear fail state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.
Step 18:
Slave 8371 finished the sync successfully; the whole process took about 7 minutes:
46590:s 09 sep 19:08:19.303 * master <-> slave sync: finished with success
Why 8371 was marked as failing:
Since the machines sit in the same rack, a network interruption is unlikely. Checking the slow query log with SLOWLOG GET showed that a KEYS command had been executed and had taken 8.3 seconds, while the cluster node timeout was set to 5 s (cluster-node-timeout 5000). The exact commands are sketched after the timeline below.
How the node came to be marked as failing:
A client executed a single command that took 8.3 s; while KEYS was running, the single-threaded server could not answer cluster pings, so the 5 s timeout elapsed:
2016/9/9 18:57:43 the KEYS command started executing
2016/9/9 18:57:50 8371 was marked as failing (per the Redis logs)
2016/9/9 18:57:51 the KEYS command finished
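The two checks mentioned above boil down to two commands; a minimal sketch:
# The 10 most recent slow queries: id, unix timestamp, duration in microseconds, command
redis-cli -h xx.x.xxx.200 -p 8371 slowlog get 10
# The failure-detection timeout used by the cluster (5000 ms in this incident)
redis-cli -h xx.x.xxx.200 -p 8371 config get cluster-node-timeout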
To summarize, several problems combined here:
1. Because cluster-node-timeout was set rather short, the slow KEYS query caused the cluster to judge node 8371 as failing.
2. Because 8371 was considered down, 8373 was promoted to master and a full master-slave sync began.
3. Because of the client-output-buffer-limit configuration, the first full resync failed (judging by the omem value in the step 12 log, the slave connection stayed above the 64 MB soft limit for longer than 60 seconds).
4. A bug in the PHP client's connection pool then made it reconnect frantically, producing something like a SYN flood.
5. After the first full resync failed, it took the slave 30 seconds to reconnect to the master (the master had exceeded its maximum of 10,000 connections).
About the client-output-buffer-limit parameter:
# The syntax of every client-output-buffer-limit directive is the following:
#
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).
# So for instance if the hard limit is 32 megabytes and the soft limit is
# 16 megabytes / 10 seconds, the client will get disconnected immediately
# if the size of the output buffers reach 32 megabytes, but will also get
# disconnected if the client reaches 16 megabytes and continuously overcomes
# the limit for 10 seconds.
#
# By default normal clients are not limited because they don't receive data
# without asking (in a push way), but just after a request, so only
# asynchronous clients may create a scenario where data is requested faster
# than it can read.
#
# Instead there is a default limit for pubsub and slave clients, since
# subscribers and slaves receive data in a push fashion.
#
# Both the hard or the soft limit can be disabled by setting them to zero.
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
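If the slave-class limit needs to be raised without restarting the instance, it can be changed at runtime and then persisted; a minimal sketch with illustrative values (512 MB hard / 128 MB soft are not a recommendation from the original text):
# Raise the slave output-buffer limit at runtime (bytes: 512 MB hard, 128 MB soft, 60 s)
redis-cli -h xx.x.xxx.199 -p 8373 config set client-output-buffer-limit "slave 536870912 134217728 60"
# Persist the running configuration back to redis.conf (only works if the server was started with a config file)
redis-cli -h xx.x.xxx.199 -p 8373 config rewrite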
Measures taken (a configuration sketch follows the list):
1. Split single instances down to under 4 GB of data; otherwise a master-slave switchover takes a very long time.
2. Adjust the client-output-buffer-limit parameter so a resync cannot fail halfway through.
3. Adjust cluster-node-timeout; it must not be less than 15 s.
4. Forbid any slow query that takes longer than cluster-node-timeout, because such queries trigger failovers.
5. Fix the client's SYN-flood-like frantic reconnection behavior.
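A redis.conf sketch that reflects measures 1-4; the concrete numbers are illustrative, not values prescribed by the original text:
# Keep each instance small so a full resync stays short (measure 1)
maxmemory 4gb
# Give slave connections more headroom during a full resync (measure 2)
client-output-buffer-limit slave 512mb 128mb 60
# Failure detection no shorter than 15 seconds (measure 3)
cluster-node-timeout 15000
# One way to enforce measure 4: disable the KEYS command entirely
rename-command KEYS ""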
Summary
That concludes this detailed analysis of a Redis cluster failure; hopefully it is helpful.