
Fixing a bug in the fence_check script

程序员文章站 (Programmer Article Station), 2024-03-21 11:57:16

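The project runs a three-node PostgreSQL cluster built on corosync + pacemaker + PostgreSQL streaming replication, and uses the fence_check script to guard against split-brain.

How the fence_check script works

The script first checks whether the local node is the master. If it is not, the script returns: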

I'm slave ---> I dont need to check other node status

If the local node is the master, the script checks the other machines. If no second master is found, it returns:

The other node postgres is disconnected or in recovery,dont need to fence

If a second master is found, it returns:

The other node postgres is also in master primary,now I have to kill and fence it

It then triggers the fence operation, which runs a script that kills the other master to prevent split-brain.
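The decision flow above can be sketched as a small shell function. This is only an illustration, not the actual fence_check source: the function name `fence_decision` and the role labels `master`, `slave` and `down` are hypothetical, and how the real script discovers the role of each node is not shown in this article.

```shell
#!/bin/sh
# Sketch of the fence_check decision flow (hypothetical helper, not the
# real script). Usage: fence_decision LOCAL_ROLE PEER_ROLE...
# Roles are illustrative labels: "master", "slave" or "down".
fence_decision() {
    local_role=$1
    shift

    # A non-master node never inspects its peers.
    if [ "$local_role" != "master" ]; then
        echo "I'm slave ---> I dont need to check other node status"
        return 0
    fi

    # The master looks for a second master among the peers.
    for peer_role in "$@"; do
        if [ "$peer_role" = "master" ]; then
            echo "The other node postgres is also in master primary,now I have to kill and fence it"
            return 1   # the caller would trigger the fence action here
        fi
    done

    echo "The other node postgres is disconnected or in recovery,dont need to fence"
    return 0
}
```

For example, `fence_decision master slave master` prints the "kill and fence" message and returns a non-zero status, while `fence_decision slave master master` stops at the first check.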

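Reproducing the bug

In a lab environment, all three nodes of one cluster believed they were the master.

In the script, the check for whether the local node is the master is: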

ip addr show | grep $OCF_RESKEY_cluster_ip

Here $OCF_RESKEY_cluster_ip is the cluster's master VIP.

So on all three machines this command produced output.

Bug analysis

Check the cluster status:

[[email protected] ~]# crm status
Stack: corosync
Current DC: sh01-oscar-cmp-prod-pg09 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 25 02:00:26 2019
Last change: Wed Apr 24 23:14:16 2019 by root via cibadmin on sh01-oscar-cmp-prod-pg07

3 nodes configured
11 resources configured

Online: [ sh01-oscar-cmp-prod-pg07 sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]

Full list of resources:

 fence-sh01-oscar-cmp-prod-pg07	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-prod-pg08
 fence-sh01-oscar-cmp-prod-pg08	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-prod-pg07
 fence-sh01-oscar-cmp-prod-pg09	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-prod-pg07
 Resource Group: master-group
     vip-master	(ocf::heartbeat:IPaddr2):	Started sh01-oscar-cmp-prod-pg07
 Resource Group: slave-group
     vip-slave	(ocf::heartbeat:IPaddr2):	Started sh01-oscar-cmp-prod-pg09
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ sh01-oscar-cmp-prod-pg07 ]
     Slaves: [ sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ sh01-oscar-cmp-prod-pg07 sh01-oscar-cmp-prod-pg08 sh01-oscar-cmp-prod-pg09 ]

The status confirms that the master is on node 07. Running the command used by fence_check's check on node 07:

[[email protected] ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:a8:4a:0d brd ff:ff:ff:ff:ff:ff
    inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192
       valid_lft forever preferred_lft forever
    inet 10.103.103.25/22 brd 10.103.103.255 scope global secondary ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fea8:4a0d/64 scope link 
       valid_lft forever preferred_lft forever

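This reveals the problem: the master VIP is 10.103.103.25, and the broadcast address (brd) happens to be 10.103.103.255, so the script's grep matches the following: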

[[email protected] ~]# ip addr show | grep 10.103.103.25
    inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192
    inet 10.103.103.25/22 brd 10.103.103.255 scope global secondary ens192

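grep matches substrings, and 10.103.103.25 is a prefix of 10.103.103.255, so the broadcast address also satisfies the search. Every node has that broadcast address on its own inet line, which is why all three machines believe they are the master.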
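The false match can be reproduced without a cluster. The sketch below feeds grep the primary address line from the output above, which is what a node without the VIP would still show; `naive_has_vip` is a hypothetical wrapper around the original check. Note also that the dots in the pattern are regular-expression wildcards, which loosens the match further.

```shell
#!/bin/sh
vip='10.103.103.25'
# Primary address line from the `ip addr show` output above: no VIP here,
# but the broadcast address is present.
line_without_vip='    inet 10.103.103.20/22 brd 10.103.103.255 scope global ens192'

# naive_has_vip PATTERN: same unanchored check as the original script,
# reading the address listing from stdin (hypothetical helper name).
naive_has_vip() {
    grep -q "$1"
}

if printf '%s\n' "$line_without_vip" | naive_has_vip "$vip"; then
    echo "matched, although this line does not carry the VIP"
fi

# -o prints only the matched text, showing that the hit comes from the
# first 13 characters of the broadcast address.
printf '%s\n' "$line_without_vip" | grep -o "$vip"
```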

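Fixing the bug

Option 1: do not use a master VIP whose last octet is 2 or 25 (both are prefixes of 255, the last octet of the broadcast address here)...

Option 2: modify the fence_check script and change the check to: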

ip addr show | grep 'inet' | awk '{print $2}' | cut -d"/" -f1 | grep $OCF_RESKEY_cluster_ip

The test result:

[[email protected] ~]# ip addr show | grep 'inet' | awk '{print $2}' | cut -d"/" -f1 |grep 10.103.103.25
10.103.103.25

This ensures that only the address field of each inet line is compared, so the brd value can no longer match.
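One caveat, offered as a sketch rather than as part of the original fix: the final grep is still an unanchored regular-expression match, so a node that holds a longer address such as 10.103.103.250 would still match a VIP of 10.103.103.25. Assuming a grep that supports the standard `-F` (fixed string) and `-x` (whole line) options, a stricter variant could look like this; `has_exact_ip` is a hypothetical helper name.

```shell
#!/bin/sh
# has_exact_ip VIP: reads `ip addr show` output on stdin and succeeds only
# if VIP is configured exactly as an IPv4 address (hypothetical helper).
has_exact_ip() {
    # Keep IPv4 "inet" lines only, take the addr/prefix field, drop the
    # prefix length, then require a literal whole-line match.
    awk '$1 == "inet" { print $2 }' | cut -d/ -f1 | grep -Fxq -- "$1"
}
```

Inside the resource agent it would be called as `ip addr show | has_exact_ip "$OCF_RESKEY_cluster_ip"`, using the exit status in place of the grep output.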
