
Notes on an ASMLib Failure

程序员文章站 2022-06-06 09:21:08


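I had just restored a customer's 20 TB database, and it had been running for only a few days when I got a call saying that one node had gone down. I connected over VPN and checked the CRS log: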

[oracle@rac2 crsd]$ tail -100f crsd.log 
2014-07-19 21:37:02.223: [ CSSCLNT][1073453488]clsssInitNative: connect failed, rc 9
 
2014-07-19 21:37:02.225: [  CRSRTI][1073453488]0CSS is not ready. Received status 3 from CSS. Waiting for good status .. 
 
2014-07-19 21:37:03.599: [ COMMCRS][1110501696]clsc_connect: (0xb438700) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac2_crs))
 
2014-07-19 21:37:03.599: [ CSSCLNT][1073453488]clsssInitNative: connect failed, rc 9
 
2014-07-19 21:37:03.600: [  CRSRTI][1073453488]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..

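The heartbeat (private interconnect) network turned out to be unreachable. Restarting the network interface (bond1) fixed that: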

[root@rac2 client]# ping rac1-priv
PING rac1-priv (192.168.2.81) 56(84) bytes of data.
From rac2-priv (192.168.2.83) icmp_seq=10 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=11 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=12 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=14 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=15 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=16 Destination Host Unreachable
 
--- rac1-priv ping statistics ---
19 packets transmitted, 0 received, +6 errors, 100% packet loss, time 18000ms
, pipe 3
[root@rac2 client]# ifconfig bond1
bond1     Link encap:Ethernet  HWaddr 78:2B:CB:0D:32:49
          inet addr:192.168.2.83  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::7a2b:cbff:fe0d:3249/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:105 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21457 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6720 (6.5 KiB)  TX bytes:1373758 (1.3 MiB)
 
[root@rac2 client]# ifdown bond1
[root@rac2 client]# ifup bond1
[root@rac2 client]# ping rac1-priv
PING rac1-priv (192.168.2.81) 56(84) bytes of data.
64 bytes from rac1-priv (192.168.2.81): icmp_seq=1 ttl=64 time=0.146 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=2 ttl=64 time=0.102 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=3 ttl=64 time=0.085 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=4 ttl=64 time=0.095 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=5 ttl=64 time=0.146 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=6 ttl=64 time=0.099 ms
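
The manual ping test above could be turned into a small repeatable check. The following is a minimal sketch, not something that was run during this incident; the function names `ping_loss_percent` and `interconnect_ok` are made up for illustration, and it assumes the usual Linux `ping` summary line as shown in the output above.

```shell
# Hypothetical helper: read `ping` output on stdin and print the
# packet-loss percentage taken from its summary line (for example "100").
ping_loss_percent() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical check: succeed only when the peer answered every probe.
# $1 is the private hostname of the other node (rac1-priv in this article).
interconnect_ok() {
    loss=$(ping -c 5 "$1" 2>/dev/null | ping_loss_percent)
    [ "$loss" = "0" ]
}
```

For example, `interconnect_ok rac1-priv || echo "interconnect down"` could be run from rac2 before trying to start CRS.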

After starting CRS, the listener and the database instance on rac2 would not start:

[oracle@rac2 ~]$ crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host        
----------------------------------------------------------------------
ora.master.db  application    0/0    0/1    ONLINE    ONLINE    rac1 
ora....rtdb.cs application    0/0    0/1    ONLINE    ONLINE    rac1 
ora....er1.srv application    0/0    0/0    ONLINE    ONLINE    rac1 
ora....r1.inst application    0/5    0/0    ONLINE    ONLINE    rac1 
ora....r2.inst application    0/5    0/0    ONLINE    OFFLINE               
ora....pcdb.cs application    0/0    0/1    ONLINE    ONLINE    rac1 
ora....er1.srv application    0/0    0/0    ONLINE    ONLINE    rac1 
ora....SM1.asm application    0/5    0/0    ONLINE    ONLINE    rac1 
ora....N1.lsnr application    0/5    0/0    ONLINE    ONLINE    rac1 
ora....bn1.gsd application    0/5    0/0    ONLINE    ONLINE    rac1 
ora....bn1.ons application    0/3    0/0    ONLINE    ONLINE    rac1 
ora....bn1.vip application    0/0    0/0    ONLINE    ONLINE    rac1 
ora....SM2.asm application    0/5    0/0    ONLINE    ONLINE    rac2 
ora....N2.lsnr application    0/5    0/0    ONLINE    OFFLINE               
ora....bn2.gsd application    0/5    0/0    ONLINE    ONLINE    rac2 
ora....bn2.ons application    0/3    0/0    ONLINE    ONLINE    rac2 
ora....bn2.vip application    0/0    0/0    ONLINE    ONLINE    rac2
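
With this many resources it is easy to overlook an OFFLINE entry. As a rough sketch, the table can be filtered with awk; the function name `crs_offline_resources` is hypothetical, and the column positions are assumed to match the `crs_stat -t -v` layout shown above.

```shell
# Read `crs_stat -t -v` output on stdin and print the name of every resource
# whose Target is ONLINE but whose State is OFFLINE.
# Assumed columns: Name Type R/RA F/FT Target State [Host]
crs_offline_resources() {
    awk '$5 == "ONLINE" && $6 == "OFFLINE" { print $1 }'
}
```

Usage would be `crs_stat -t -v | crs_offline_resources`, which for the output above should list the rac2 instance and listener resources.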

Starting the listener manually failed with an error:

[oracle@rac2 ~]$ srvctl start listener -n rac2
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:LSNRCTL for Linux: Version 10.2.0.3.0 - Production on 19-JUL-2014 22:05:06
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Copyright (c) 1991, 2006, Oracle.  All rights reserved.
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Starting /opt/oracle/app/database/bin/tnslsnr: please wait...
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:TNSLSNR for Linux: Version 10.2.0.3.0 - Production
rac2:ora.rac2.LISTENER_rac2.lsnr:System parameter file is /opt/oracle/app/database/network/admin/listener.ora
rac2:ora.rac2.LISTENER_rac2.lsnr:Log messages written to /opt/oracle/app/database/network/log/listener_rac2.log
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-01151: Missing listener name, LISTENER_rac2, in LISTENER.ORA
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Listener failed to start. See the error message(s) above...
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:LSNRCTL for Linux: Version 10.2.0.3.0 - Production on 19-JUL-2014 22:05:06
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Copyright (c) 1991, 2006, Oracle.  All rights reserved.
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac2-vip)(PORT=1521)(IP=FIRST)))
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-12541: TNS:no listener
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-12560: TNS:protocol adapter error
rac2:ora.rac2.LISTENER_rac2.lsnr:  TNS-00511: No listener
rac2:ora.rac2.LISTENER_rac2.lsnr:   Linux Error: 111: Connection refused
rac2:ora.rac2.LISTENER_rac2.lsnr:Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.83)(PORT=1521)(IP=FIRST)))
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-12541: TNS:no listener
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-12560: TNS:protocol adapter error
rac2:ora.rac2.LISTENER_rac2.lsnr:  TNS-00511: No listener
rac2:ora.rac2.LISTENER_rac2.lsnr:   Linux Error: 111: Connection refused
CRS-0215: Could not start resource 'ora.rac2.LISTENER_rac2.lsnr'.

I suspected that something was wrong with the listener configuration file, listener.ora, so I checked it:

[oracle@standbydbn2 ~]$ cd /opt/oracle/app/database/network/admin/
[oracle@standbydbn2 admin]$ ls -l
total 88
-rw-r--r-- 1 oracle oinstall  240 Jul 27  2011 1
-rw-r--r-- 1 oracle oinstall  378 Jul 27  2011 listener1107271PM5834.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 16:09 listener1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 19:00 listener1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 19:01 listener1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall    0 Jul 17 22:10 listener.ora
-rw-r--r-- 1 oracle oinstall  553 Jul 27  2011 listener.ora.20110727bak
-rw-r--r-- 1 oracle oinstall  402 Oct  8  2011 listener.ora.bak
drwxr-x--- 2 oracle oinstall 4096 Jul 26  2011 samples
-rw-r----- 1 oracle oinstall  172 Dec 26  2003 shrept.lst
-rw-r--r-- 1 oracle oinstall   35 Jul 17 16:09 sqlnet1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall   35 Jul 17 19:00 sqlnet1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall   35 Jul 17 19:01 sqlnet1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall 4130 May  7 15:17 sqlnet.log
-rw-r--r-- 1 oracle oinstall   35 Jul 17 22:10 sqlnet.ora
-rw-r--r-- 1 oracle oinstall  416 Jul 27  2011 tnsnames1107271PM5834.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 16:09 tnsnames1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 19:00 tnsnames1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 19:01 tnsnames1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 22:10 tnsnames.ora
-rw-r--r-- 1 oracle oinstall 1736 Jul 27  2011 tnsnames.ora.20110727bak
-rw-r--r-- 1 oracle oinstall 1977 Aug  2  2011 tnsnames.ora.20110802bak
[oracle@standbydbn2 admin]$ cat listener.ora
[oracle@standbydbn2 admin]$ cat listener.ora
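
The `ls -l` listing above already shows listener.ora at 0 bytes. A quick way to spot truncated configuration files like this is to search the directory for zero-length files. This is only a sketch; `empty_files` is a hypothetical name, and `-maxdepth` assumes GNU find (or a compatible implementation).

```shell
# List zero-length regular files directly inside directory $1.
empty_files() {
    find "$1" -maxdepth 1 -type f -size 0
}
```

Here that would be `empty_files /opt/oracle/app/database/network/admin`.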

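The file really was empty. I recreated it based on the copy from the healthy node, and the listener started normally. I then started the database manually and found errors in the alert log: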

Errors in file /opt/oracle/app/admin/master/bdump/master2_dbw0_25946.trc:
ORA-01157: cannot identify/lock data file 772 - see DBWR trace file
ORA-01110: data file 772: '+DATA5/xxx/datafile/xxx.dbf'
ORA-17503: ksfdopn:2 Failed to open file +DATA5/standby/datafile/xx.dbf
ORA-15001: diskgroup "DATA5" does not exist or is not mounted
ORA-15001: diskgroup "DATA5" does not exist or is not mounted

In asmcmd, the DATA5 disk group turned out to be missing:
[oracle@rac2 admin]$ export ORACLE_SID=+ASM2

[oracle@rac2 admin]$ asmcmd
ASMCMD> lsdg
State    TYPE    Rebal  Unbal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Name
MOUNTED  EXTERN  N      N         512   4096  1048576   6277049   494723                0          494723              0  DATA2/
MOUNTED  EXTERN  N      N         512   4096  1048576   6655909   388388                0          388388              0  DATA3/
MOUNTED  EXTERN  N      N         512   4096  1048576   8011682   546051                0          546051              0  DATA4/
MOUNTED  EXTERN  N      N         512   4096  1048576   6578238   339569                0          339569              0  NEW_DG/
MOUNTED  EXTERN  N      N         512   4096  1048576    307196   262681                0          262681              0  REDO01/
MOUNTED  EXTERN  N      N         512   4096  1048576    307196   262681                0          262681              0  REDO02/
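
To check for a specific disk group without reading the table by eye, the `lsdg` output can be filtered. The sketch below uses hypothetical function names and assumes the layout shown above, where the group name is the last column and ends with a slash.

```shell
# Read `lsdg` output on stdin and print the names of MOUNTED disk groups,
# one per line, with the trailing "/" removed.
mounted_diskgroups() {
    awk '$1 == "MOUNTED" { name = $NF; sub(/\/$/, "", name); print name }'
}

# Succeed if disk group $1 is among the mounted groups read from stdin.
diskgroup_mounted() {
    mounted_diskgroups | grep -qxF "$1"
}
```

For example, `asmcmd lsdg | diskgroup_mounted DATA5 || echo "DATA5 is not mounted"`.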

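On the healthy node I looked up the disks that belong to the DATA5 disk group.
Because the system uses ASMLib, I checked with oracleasm and found that none of DATA5's disks were actually usable: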

[root@rac1 admin]# oracleasm listdisks
NEW1
NEW2
NEW3
NEW4
NEW5
NEW6
VOL1
VOL10
VOL11
VOL12
VOL13
VOL14
VOL15
VOL16
VOL17
VOL18
VOL2
VOL3
VOL4
VOL5
VOL6
VOL7
VOL8
VOL9
VOLDATA1
VOLDATA10
VOLDATA11
VOLDATA12
VOLDATA13
VOLDATA14
VOLDATA15
VOLDATA16
VOLDATA17
VOLDATA18
VOLDATA19
VOLDATA2
VOLDATA20
VOLDATA21
VOLDATA22
VOLDATA23
VOLDATA24
VOLDATA25
VOLDATA26
VOLDATA27
VOLDATA28
VOLDATA29
VOLDATA3
VOLDATA30
VOLDATA31
VOLDATA32
VOLDATA33
VOLDATA34
VOLDATA4
VOLDATA5
VOLDATA6
VOLDATA7
VOLDATA8
VOLDATA9
VOLREDO1
VOLREDO2
[root@rac1 admin]# oracleasm querydisk -p VOL1
Disk "VOL1" defines a device with no label

The label of the VOL1 disk is missing.
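
Rather than querying disks one at a time, the same check could be looped over every disk that `oracleasm listdisks` reports. This is a sketch only: `disks_without_label` is a hypothetical name, it has to run as root, and it matches only on the "no label" message seen above, so any other failure message would go unnoticed.

```shell
# Print the name of every ASMLib disk whose on-disk label can no longer be read.
# Matches only the "defines a device with no label" message from querydisk.
disks_without_label() {
    oracleasm listdisks | while read -r disk; do
        # </dev/null keeps querydisk from consuming the disk list on stdin.
        if oracleasm querydisk -p "$disk" </dev/null 2>&1 | grep -q 'no label'; then
            printf '%s\n' "$disk"
        fi
    done
}
```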

[root@rac1 admin]# oracleasm  scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Cleaning disk "VOL1"
Cleaning disk "VOL16"
Cleaning disk "VOL17"
Cleaning disk "VOL2"
Cleaning disk "VOL3"
Cleaning disk "VOL4"
Cleaning disk "VOL5"
Cleaning disk "VOL7"
Cleaning disk "VOLDATA1"
Cleaning disk "VOLDATA30"
Cleaning disk "VOLDATA33"
Cleaning disk "VOLDATA4"
Cleaning disk "VOLDATA7"
Scanning system for ASM disks...
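
The "Cleaning disk" lines are effectively the list of disks that need repair, so it is worth saving the scandisks output and extracting them. A minimal sketch, with `cleaned_disks` as a hypothetical name and the message format assumed to be exactly as printed above:

```shell
# Read saved `oracleasm scandisks` output on stdin and print the name of
# each disk that was reported as cleaned.
cleaned_disks() {
    sed -n 's/^Cleaning disk "\(.*\)"$/\1/p'
}
```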

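I rescanned the disks, and the labels were all gone.
Using earlier records I identified the specific devices, then checked their state with powermt: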

[root@rac2 ~]# powermt display dev=emcpowerc 
Pseudo name=emcpowerc
CLARiiON ID=CKM00111201809 [R910]
Logical device ID=6006016025B12C00069BFB6DF996E011 [LUN 6]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0; 
Owner: default=SP B, current=SP B       Array failover mode: 4
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
   3 qla2xxx                  sde       SP A1     active  alive       0      0
   4 qla2xxx                  sdm       SP B1     active  alive       0      0

Use kfed to confirm the disk name:

[root@rac1 admin]# /opt/oracle/app/database/bin/kfed read /dev/emcpowerc1 
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
kfbh.check:                  1205065909 ; 0x00c: 0x47d3d8b5
kfbh.fcn.base:                      173 ; 0x010: 0x000000ad
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8
kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000
kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                168820736 ; 0x020: 0x0a100000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:                    VOL1 ; 0x028: length=4
kfdhdb.grpname:                   DATA5 ; 0x048: length=5
kfdhdb.fgname:                     VOL1 ; 0x068: length=4
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             33005137 ; 0x0a8: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.crestmp.lo:           3051740160 ; 0x0ac: USEC=0x0 MSEC=0x177 SECS=0x1e MINS=0x2d
kfdhdb.mntstmp.hi:             33005137 ; 0x0b0: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.mntstmp.lo:           3060555776 ; 0x0b4: USEC=0x0 MSEC=0x318 SECS=0x26 MINS=0x2d
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize:                  511993 ; 0x0c4: 0x0007cff9
kfdhdb.pmcnt:                         6 ; 0x0c8: 0x00000006
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002
kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]:            65535 ; 0x0da: 0xffff
kfdhdb.redomirrors[2]:            65535 ; 0x0dc: 0xffff
kfdhdb.redomirrors[3]:            65535 ; 0x0de: 0xffff
kfdhdb.dbcompat:              168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi:             33005137 ; 0x0e4: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.grpstmp.lo:           3051595776 ; 0x0e8: USEC=0x0 MSEC=0xea SECS=0x1e MINS=0x2d
kfdhdb.ub4spare[0]:                   0 ; 0x0ec: 0x00000000
kfdhdb.ub4spare[1]:                   0 ; 0x0f0: 0x00000000
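
Only two fields of this dump matter for the repair: kfdhdb.dskname (the name to restore) and kfdhdb.grpname (to confirm the disk belongs to DATA5). As a sketch, they can be pulled out with awk; `kfed_disk_identity` is a hypothetical name, and an empty name field is not handled.

```shell
# Read `kfed read` output on stdin and print "<dskname> <grpname>".
kfed_disk_identity() {
    awk '
        $1 == "kfdhdb.dskname:" { disk = $2 }
        $1 == "kfdhdb.grpname:" { group = $2 }
        END { print disk, group }
    '
}
```

For the device above, `kfed read /dev/emcpowerc1 | kfed_disk_identity` should print `VOL1 DATA5`.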

Back up the disk header:

[root@rac2 admin]# dd if=/dev/emcpowerc1 of=/tmp/VOL1.50m.dd bs=1M count=50
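
Since the next step rewrites the disk header, it seems prudent to confirm that the backup is readable and matches the source. The sketch below wraps the dd command in a hypothetical `backup_disk_header` function and compares checksums of the same byte range; on a device that is being written to, the two reads could legitimately differ.

```shell
# Copy the first $3 MiB of device (or file) $1 into $2, then confirm the
# copy matches by comparing checksums of the same byte range.
backup_disk_header() {
    src=$1 dest=$2 mib=$3
    dd if="$src" of="$dest" bs=1048576 count="$mib" 2>/dev/null || return 1
    want=$(dd if="$src" bs=1048576 count="$mib" 2>/dev/null | cksum)
    got=$(cksum < "$dest")
    [ "$want" = "$got" ]
}
```

The equivalent of the command above would be `backup_disk_header /dev/emcpowerc1 /tmp/VOL1.50m.dd 50`.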

Use oracleasm renamedisk to rewrite the label; the -f option forces the change:

[root@rac2 disks]# oracleasm renamedisk  -f /dev/emcpowerc1 VOL1
Writing disk header: done
Instantiating disk "VOL1": done

Rescan and list the disks on both nodes:

[root@rac2 disks]#  oracleasm listdisks
VOL1
.... (remaining disks omitted)
 
[root@rac1 admin]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "VOL1"

I repaired each of the affected disks with the same steps, mounted the disk group manually, and the database opened normally.

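It is a strange problem: this database had been restored only a few days earlier, and the ASMLib labels went missing while it was up and running...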