fence_check script bug fix
程序员文章站
2024-03-21 11:57:16
The project runs a three-node PostgreSQL cluster built on corosync + pacemaker + PostgreSQL streaming replication, and uses a fence_check script for split-brain management.
How the fence_check script works
The script first checks whether the local machine is the master node. If it is not the master, it returns:
I'm slave ---> I dont need to check other node status
If it is the master, it goes on to check the other machines. If it does not find a second master node, it returns:
The other node postgres is disconnected or in recovery,dont need to fence
If it does find a second master node, it returns:
The other node postgres is also in master primary,now I have to kill and fence it
and triggers the fence operation: the fence script is invoked to kill the rival master and prevent split-brain.
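The decision flow above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script: only the `$OCF_RESKEY_cluster_ip` check is quoted from the article, and `peer_is_master` is a stub standing in for the real probe of the other nodes' postgres state.

```shell
# Hypothetical sketch of the fence_check decision flow; only the
# OCF_RESKEY_cluster_ip grep comes from the real script, the rest is illustrative.
OCF_RESKEY_cluster_ip="10.103.103.25"   # cluster master VIP (example value)

is_master() {
    # The real script detects the master by looking for the VIP locally
    ip addr show 2>/dev/null | grep "$OCF_RESKEY_cluster_ip" >/dev/null 2>&1
}

peer_is_master() {
    # Stub: the real script probes the other nodes' postgres state
    false
}

fence_decision() {
    if ! is_master; then
        echo "I'm slave ---> I dont need to check other node status"
    elif peer_is_master; then
        echo "The other node postgres is also in master primary,now I have to kill and fence it"
        # ...the real script triggers fencing to kill the rival master here...
    else
        echo "The other node postgres is disconnected or in recovery,dont need to fence"
    fi
}

fence_decision
```

On a machine that does not hold the VIP, `fence_decision` takes the first branch and prints the slave message.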
Reproducing the bug
In a test environment, all three nodes of one cluster believed they were the master node.
Reading the script shows that the check for whether the local machine is the master is:
ip addr show | grep $OCF_RESKEY_cluster_ip
where $OCF_RESKEY_cluster_ip is the cluster's master VIP.
So this command must have returned a match on all three machines.
Bug analysis
Check the cluster status:
[root@sh01-oscar-cmp-prod-pg07 ~]# crm status
Stack: corosync
Current DC: sh01-oscar-cmp-prod-pg09 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 25 02:00:26 2019
Last change: Wed Apr 24 23:14:16 2019 by root via cibadmin on sh01-oscar-cmp-prod-pg07
3 nodes configured
11 resources configured
Online: [ sh01-oscar-cmp-prod-pg07 sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]
Full list of resources:
fence-sh01-oscar-cmp-prod-pg07 (ocf::heartbeat:fence_check): Started sh01-oscar-cmp-prod-pg08
fence-sh01-oscar-cmp-prod-pg08 (ocf::heartbeat:fence_check): Started sh01-oscar-cmp-prod-pg07
fence-sh01-oscar-cmp-prod-pg09 (ocf::heartbeat:fence_check): Started sh01-oscar-cmp-prod-pg07
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started sh01-oscar-cmp-prod-pg07
Resource Group: slave-group
vip-slave (ocf::heartbeat:IPaddr2): Started sh01-oscar-cmp-prod-pg09
Master/Slave Set: msPostgresql [pgsql]
Masters: [ sh01-oscar-cmp-prod-pg07 ]
Slaves: [ sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]
Clone Set: clnPingCheck [pingCheck]
Started: [ sh01-oscar-cmp-prod-pg07 sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]
This confirms the master node is on machine 07. Run fence_check's detection command on machine 07:
[root@sh01-oscar-cmp-prod-pg07 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a8:4a:0d brd ff:ff:ff:ff:ff:ff
inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192
valid_lft forever preferred_lft forever
inet 10.103.103.25/22 brd 10.103.103.255 scope global secondary ens192
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea8:4a0d/64 scope link
valid_lft forever preferred_lft forever
The problem is now visible: the master VIP is 10.103.103.25, while the broadcast address happens to be 10.103.103.255, so the script's grep matches like this:
[root@sh01-oscar-cmp-prod-pg07 ~]# ip addr show | grep 10.103.103.25
inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192
inet 10.103.103.25/22 brd 10.103.103.255 scope global secondary ens192
Because 10.103.103.255 also satisfies the pattern (10.103.103.25 is a prefix of it, and every node's interface carries that broadcast address), all three machines concluded they were the master.
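The faulty match is easy to reproduce without a cluster; the sample line below is copied from the `ip addr` output above:

```shell
# The VIP string "10.103.103.25" is a substring (prefix) of the broadcast
# address "10.103.103.255", so grep matches this line even though the VIP
# itself is not configured here. The unescaped dots are also regex
# wildcards, which makes the pattern even looser.
line="inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192"
echo "$line" | grep 10.103.103.25
```

The grep succeeds and prints the line, which is exactly what happened on the two slave nodes.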
Fixing the bug
Option 1: don't use a master VIP ending in 2 or 25 (i.e. anything that is a prefix of the broadcast address)...
Option 2: modify the fence_check script, changing the check to:
ip addr show | grep 'inet' | awk '{print $2}' | cut -d"/" -f1 | grep $OCF_RESKEY_cluster_ip
The test result:
[root@sh01-oscar-cmp-prod-pg07 ~]# ip addr show | grep 'inet' | awk '{print $2}' | cut -d"/" -f1 |grep 10.103.103.25
10.103.103.25
This guarantees that only the inet address values themselves are matched.
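Note that the revised pipeline still ends in a plain `grep`, so a VIP that is a prefix of another configured address (say 10.103.103.2 versus 10.103.103.20) would still mis-match, and `grep 'inet'` also picks up inet6 lines. One further hardening, an assumption on my part rather than part of the original fix, is to match the extracted addresses exactly with `grep -x -F`:

```shell
# Stricter variant (not the original fix): -F treats the VIP as a fixed
# string (dots are not regex wildcards) and -x requires the whole line to
# match, so prefixes and substrings can no longer slip through. The sample
# data stands in for the live `ip addr show` output; 'inet ' with a
# trailing space also excludes inet6 lines.
OCF_RESKEY_cluster_ip="10.103.103.25"   # example value
addr_lines="inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192
inet 10.103.103.25/22 brd 10.103.103.255 scope global secondary ens192"

echo "$addr_lines" | grep 'inet ' | awk '{print $2}' | cut -d"/" -f1 \
    | grep -x -F "$OCF_RESKEY_cluster_ip"
```

On a live node, the `echo "$addr_lines"` would be replaced by `ip addr show`; the final grep then succeeds only on the node that actually holds the VIP.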