Fixing a bug in the fence_check script

程序员文章站 2024-03-21 11:57:16

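The project runs a three-node PostgreSQL cluster built on corosync + pacemaker + PostgreSQL streaming replication, and uses the fence_check script to guard against split-brain.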

How the fence_check script works

The script first checks whether the local node is the master. If it is not, it returns:

I'm slave ---> I dont need to check other node status

If the local node is the master, the script checks the other nodes. If it does not find a second master, it returns:

The other node postgres is disconnected or in recovery,dont need to fence

If it does find a second master, it returns:

The other node postgres is also in master primary,now I have to kill and fence it

It then triggers the fence action, which runs a script that kills the other master to prevent split-brain.
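The decision flow described above can be sketched roughly as follows. This is a minimal illustration, not the real fence_check script: the function name fence_decision and its two arguments are hypothetical, and the node roles are passed in as plain strings instead of being probed from a live cluster.

```shell
#!/bin/sh
# Hypothetical sketch of the fence_check decision flow (not the real script).
# $1 = role of the local node, $2 = role of the peer node.
# Any value other than "master" is treated as "not master".
fence_decision() {
    local_role=$1
    peer_role=$2

    # A non-master node never inspects its peers.
    if [ "$local_role" != "master" ]; then
        echo "I'm slave ---> I dont need to check other node status"
        return 0
    fi

    # Two masters at once: report it and signal that fencing is required.
    if [ "$peer_role" = "master" ]; then
        echo "The other node postgres is also in master primary,now I have to kill and fence it"
        return 1
    fi

    echo "The other node postgres is disconnected or in recovery,dont need to fence"
    return 0
}

fence_decision slave master
fence_decision master slave
fence_decision master master || echo "fence action would be triggered here"
```

The non-zero return code in the two-master case is only a convention chosen for this sketch, so that a caller can tell when fencing is needed.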

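Reproducing the bug

In a test environment, all three nodes of one cluster considered themselves the master.

In the script, the test for whether the local node is the master is the following command: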

ip addr show | grep $OCF_RESKEY_cluster_ip

Here $OCF_RESKEY_cluster_ip is the cluster's master VIP.

So on all three machines this command must have produced output.

Bug analysis

Check the cluster status:

# crm status
Stack: corosync
Current DC: sh01-oscar-cmp-prod-pg09 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 25 02:00:26 2019
Last change: Wed Apr 24 23:14:16 2019 by root via cibadmin on sh01-oscar-cmp-prod-pg07

3 nodes configured
11 resources configured

Online: [ sh01-oscar-cmp-prod-pg07 sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]

Full list of resources:

 fence-sh01-oscar-cmp-prod-pg07	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-prod-pg08
 fence-sh01-oscar-cmp-prod-pg08	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-prod-pg07
 fence-sh01-oscar-cmp-prod-pg09	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-prod-pg07
 Resource Group: master-group
     vip-master	(ocf::heartbeat:IPaddr2):	Started sh01-oscar-cmp-prod-pg07
 Resource Group: slave-group
     vip-slave	(ocf::heartbeat:IPaddr2):	Started sh01-oscar-cmp-prod-pg09
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ sh01-oscar-cmp-prod-pg07 ]
     Slaves: [ sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ sh01-oscar-cmp-prod-pg07 sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]

This confirms that the master is on node 07 (sh01-oscar-cmp-prod-pg07). On node 07, run the command that fence_check's master test is built on:

# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:a8:4a:0d brd ff:ff:ff:ff:ff:ff
    inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192
       valid_lft forever preferred_lft forever
    inet 10.103.103.25/22 brd 10.103.103.255 scope global secondary ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fea8:4a0d/64 scope link 
       valid_lft forever preferred_lft forever

The problem is now visible: the master VIP is 10.103.103.25, and the broadcast address (brd) happens to be 10.103.103.255, so the script's grep matches the following lines:

# ip addr show | grep 10.103.103.25
    inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192
    inet 10.103.103.25/22 brd 10.103.103.255 scope global secondary ens192

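Because 10.103.103.255 also contains the search string 10.103.103.25, the grep produces output on every node whose interface has this broadcast address, so all three machines consider themselves the master.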
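The false positive can be reproduced without a cluster. The sketch below feeds grep two lines taken from the ip addr show output of node 07 with the VIP line removed, to stand in for a node that does not hold the VIP. The variable names and the function is_master_naive are made up for this illustration.

```shell
#!/bin/sh
vip=10.103.103.25

# Lines from the output above, minus the VIP line:
# this node does NOT hold 10.103.103.25.
node_without_vip='    inet 127.0.0.1/8 scope host lo
    inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192'

# Same kind of test as the original script: an unanchored grep over whole lines.
is_master_naive() {
    printf '%s\n' "$1" | grep -q "$vip"
}

if is_master_naive "$node_without_vip"; then
    echo "naive check: node looks like the master (false positive)"
else
    echo "naive check: node is not the master"
fi
```

The check succeeds only because the pattern 10.103.103.25 is a substring of the broadcast address 10.103.103.255 on the second line.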

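Fixing the bug

Option 1: do not use a master VIP whose last octet is 2 or 25, since both are prefixes of the 255 in the broadcast address.

Option 2: modify the fence_check script and change the test to: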

ip addr show | grep 'inet' | awk '{print $2}' | cut -d"/" -f1 | grep $OCF_RESKEY_cluster_ip

The test result is:

# ip addr show | grep 'inet' | awk '{print $2}' | cut -d"/" -f1 |grep 10.103.103.25
10.103.103.25

This ensures that only the address in the inet field is compared, so the brd value can no longer match.
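One caveat: the final grep in that pipeline is still a substring match, so a VIP such as 10.103.103.2 would match the address 10.103.103.20. A possible hardening, offered only as a sketch and not as part of the original fix, is to make the last grep a fixed-string, whole-line match with -F and -x. The VIP value and the function names below are hypothetical, and the input is sample text instead of live ip addr show output.

```shell
#!/bin/sh
# Hypothetical VIP that is a prefix of another address on the node.
vip=10.103.103.2

addrs='    inet 127.0.0.1/8 scope host lo
    inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192'

# Pipeline from Option 2: the final grep still matches substrings.
has_vip_substring() {
    printf '%s\n' "$1" | grep 'inet' | awk '{print $2}' | cut -d"/" -f1 | grep -q "$vip"
}

# Hardened variant: keep only IPv4 "inet" lines, strip the prefix length,
# then require an exact match (-F fixed string, -x whole line).
has_vip_exact() {
    printf '%s\n' "$1" | awk '$1 == "inet" {print $2}' | cut -d/ -f1 | grep -qFx -- "$vip"
}

if has_vip_substring "$addrs"; then echo "substring check: match"; else echo "substring check: no match"; fi
if has_vip_exact "$addrs"; then echo "exact check: match"; else echo "exact check: no match"; fi
```

With -F the dots in the address are also treated literally instead of as regular-expression wildcards.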
