Notes on an ASMLib Failure
程序员文章站
2022-06-08 12:55:27
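I had just finished restoring a customer's 20 TB database. It ran for a few days, and then the customer called to say that one node had gone down. I connected over VPN and checked the CRS log: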
[oracle@rac2 crsd]$ tail -100f crsd.log
2014-07-19 21:37:02.223: [ CSSCLNT][1073453488]clsssInitNative: connect failed, rc 9
2014-07-19 21:37:02.225: [ CRSRTI][1073453488]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
2014-07-19 21:37:03.599: [ COMMCRS][1110501696]clsc_connect: (0xb438700) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac2_crs))
2014-07-19 21:37:03.599: [ CSSCLNT][1073453488]clsssInitNative: connect failed, rc 9
2014-07-19 21:37:03.600: [ CRSRTI][1073453488]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
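The heartbeat (private interconnect) network turned out to be unreachable. Restarting the network interface fixed that: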
[root@rac2 client]# ping rac1-priv
PING rac1-priv (192.168.2.81) 56(84) bytes of data.
From rac2-priv (192.168.2.83) icmp_seq=10 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=11 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=12 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=14 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=15 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=16 Destination Host Unreachable
--- rac1-priv ping statistics ---
19 packets transmitted, 0 received, +6 errors, 100% packet loss, time 18000ms
, pipe 3
[root@rac2 client]# ifconfig bond1
bond1     Link encap:Ethernet  HWaddr 78:2B:CB:0D:32:49
          inet addr:192.168.2.83  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::7a2b:cbff:fe0d:3249/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:105 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21457 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6720 (6.5 KiB)  TX bytes:1373758 (1.3 MiB)
[root@rac2 client]# ifdown bond1
[root@rac2 client]# ifup bond1
[root@rac2 client]# ping rac1-priv
PING rac1-priv (192.168.2.81) 56(84) bytes of data.
64 bytes from rac1-priv (192.168.2.81): icmp_seq=1 ttl=64 time=0.146 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=2 ttl=64 time=0.102 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=3 ttl=64 time=0.085 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=4 ttl=64 time=0.095 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=5 ttl=64 time=0.146 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=6 ttl=64 time=0.099 ms
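The interconnect check above can be turned into something scriptable. This is only a minimal sketch: `packet_loss` is a hypothetical helper name, and it assumes the Linux `ping` summary format shown in the output above.

```shell
#!/bin/sh
# packet_loss: read ping output on stdin and print the packet-loss
# percentage taken from the summary line (hypothetical helper).
packet_loss() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Demo on the summary line from the failed ping above.
printf '%s\n' '19 packets transmitted, 0 received, +6 errors, 100% packet loss, time 18000ms' \
    | packet_loss   # prints 100
```

On a live node it would be fed from the real command, for example `ping -c 5 rac1-priv | packet_loss`, and anything other than 0 on the private network deserves a look before CRS is restarted.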
After starting CRS, the listener and the database instance on rac2 would not come up:
[oracle@rac2 ~]$ crs_stat -t -v
Name            Type         R/RA  F/FT  Target  State    Host
----------------------------------------------------------------------
ora.master.db   application  0/0   0/1   ONLINE  ONLINE   rac1
ora....rtdb.cs  application  0/0   0/1   ONLINE  ONLINE   rac1
ora....er1.srv  application  0/0   0/0   ONLINE  ONLINE   rac1
ora....r1.inst  application  0/5   0/0   ONLINE  ONLINE   rac1
ora....r2.inst  application  0/5   0/0   ONLINE  OFFLINE
ora....pcdb.cs  application  0/0   0/1   ONLINE  ONLINE   rac1
ora....er1.srv  application  0/0   0/0   ONLINE  ONLINE   rac1
ora....SM1.asm  application  0/5   0/0   ONLINE  ONLINE   rac1
ora....N1.lsnr  application  0/5   0/0   ONLINE  ONLINE   rac1
ora....bn1.gsd  application  0/5   0/0   ONLINE  ONLINE   rac1
ora....bn1.ons  application  0/3   0/0   ONLINE  ONLINE   rac1
ora....bn1.vip  application  0/0   0/0   ONLINE  ONLINE   rac1
ora....SM2.asm  application  0/5   0/0   ONLINE  ONLINE   rac2
ora....N2.lsnr  application  0/5   0/0   ONLINE  OFFLINE
ora....bn2.gsd  application  0/5   0/0   ONLINE  ONLINE   rac2
ora....bn2.ons  application  0/3   0/0   ONLINE  ONLINE   rac2
ora....bn2.vip  application  0/0   0/0   ONLINE  ONLINE   rac2
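With this many resources it is easy to overlook the two OFFLINE rows. A small sketch for picking them out, assuming the `crs_stat -t -v` column order shown above (Target in column 5, State in column 6); `offline_resources` is a made-up name, not part of any Oracle tool.

```shell
#!/bin/sh
# offline_resources: read `crs_stat -t -v` output on stdin and print the
# names of resources whose Target is ONLINE but whose State is OFFLINE.
offline_resources() {
    awk '$5 == "ONLINE" && $6 == "OFFLINE" { print $1 }'
}

# Demo on three rows copied from the listing above.
offline_resources <<'EOF'
ora....r1.inst  application  0/5   0/0   ONLINE  ONLINE   rac1
ora....r2.inst  application  0/5   0/0   ONLINE  OFFLINE
ora....N2.lsnr  application  0/5   0/0   ONLINE  OFFLINE
EOF
```

For the rows above this prints `ora....r2.inst` and `ora....N2.lsnr`. On the cluster itself the input would come from `crs_stat -t -v | offline_resources`.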
Starting the listener manually reported an error:
[oracle@rac2 ~]$ srvctl start listener -n rac2
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:LSNRCTL for Linux: Version 10.2.0.3.0 - Production on 19-JUL-2014 22:05:06
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Copyright (c) 1991, 2006, Oracle. All rights reserved.
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Starting /opt/oracle/app/database/bin/tnslsnr: please wait...
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:TNSLSNR for Linux: Version 10.2.0.3.0 - Production
rac2:ora.rac2.LISTENER_rac2.lsnr:System parameter file is /opt/oracle/app/database/network/admin/listener.ora
rac2:ora.rac2.LISTENER_rac2.lsnr:Log messages written to /opt/oracle/app/database/network/log/listener_rac2.log
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-01151: Missing listener name, LISTENER_rac2, in LISTENER.ORA
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Listener failed to start. See the error message(s) above...
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:LSNRCTL for Linux: Version 10.2.0.3.0 - Production on 19-JUL-2014 22:05:06
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Copyright (c) 1991, 2006, Oracle. All rights reserved.
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac2-vip)(PORT=1521)(IP=FIRST)))
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-12541: TNS:no listener
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-12560: TNS:protocol adapter error
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-00511: No listener
rac2:ora.rac2.LISTENER_rac2.lsnr: Linux Error: 111: Connection refused
rac2:ora.rac2.LISTENER_rac2.lsnr:Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.83)(PORT=1521)(IP=FIRST)))
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-12541: TNS:no listener
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-12560: TNS:protocol adapter error
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-00511: No listener
rac2:ora.rac2.LISTENER_rac2.lsnr: Linux Error: 111: Connection refused
CRS-0215: Could not start resource 'ora.rac2.LISTENER_rac2.lsnr'.
I suspected something was wrong with the listener configuration file, listener.ora, so I took a look:
[oracle@standbydbn2 ~]$ cd /opt/oracle/app/database/network/admin/
[oracle@standbydbn2 admin]$ ls -l
total 88
-rw-r--r-- 1 oracle oinstall  240 Jul 27  2011 1
-rw-r--r-- 1 oracle oinstall  378 Jul 27  2011 listener1107271PM5834.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 16:09 listener1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 19:00 listener1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 19:01 listener1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall    0 Jul 17 22:10 listener.ora
-rw-r--r-- 1 oracle oinstall  553 Jul 27  2011 listener.ora.20110727bak
-rw-r--r-- 1 oracle oinstall  402 Oct  8  2011 listener.ora.bak
drwxr-x--- 2 oracle oinstall 4096 Jul 26  2011 samples
-rw-r----- 1 oracle oinstall  172 Dec 26  2003 shrept.lst
-rw-r--r-- 1 oracle oinstall   35 Jul 17 16:09 sqlnet1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall   35 Jul 17 19:00 sqlnet1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall   35 Jul 17 19:01 sqlnet1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall 4130 May  7 15:17 sqlnet.log
-rw-r--r-- 1 oracle oinstall   35 Jul 17 22:10 sqlnet.ora
-rw-r--r-- 1 oracle oinstall  416 Jul 27  2011 tnsnames1107271PM5834.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 16:09 tnsnames1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 19:00 tnsnames1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 19:01 tnsnames1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 22:10 tnsnames.ora
-rw-r--r-- 1 oracle oinstall 1736 Jul 27  2011 tnsnames.ora.20110727bak
-rw-r--r-- 1 oracle oinstall 1977 Aug  2  2011 tnsnames.ora.20110802bak
[oracle@standbydbn2 admin]$ cat listener.ora
[oracle@standbydbn2 admin]$ cat listener.ora
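The size column in the listing above already gives the answer away: listener.ora is 0 bytes. A rough sketch of a check that flags this kind of thing, using the shell's `-s` test; the function name `empty_net_files` and the list of three file names are my own choices for illustration.

```shell
#!/bin/sh
# empty_net_files DIR: print each of listener.ora, sqlnet.ora and
# tnsnames.ora that is missing or zero bytes long in DIR.
empty_net_files() {
    dir=$1
    for f in listener.ora sqlnet.ora tnsnames.ora; do
        # -s is true only for an existing file with size > 0
        [ -s "$dir/$f" ] || echo "$f"
    done
}

# Usage on a node (path taken from the session above):
#   empty_net_files /opt/oracle/app/database/network/admin
```

Run against the directory listed above, it would be expected to report only listener.ora, since sqlnet.ora and tnsnames.ora both have a non-zero size there.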
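The file really was empty. I rebuilt it from the copy on the healthy node, and the listener then started normally. Next I started the database manually, and the alert log showed errors: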
Errors in file /opt/oracle/app/admin/master/bdump/master2_dbw0_25946.trc:
ORA-01157: cannot identify/lock data file 772 - see DBWR trace file
ORA-01110: data file 772: '+DATA5/xxx/datafile/xxx.dbf'
ORA-17503: ksfdopn:2 Failed to open file +DATA5/standby/datafile/xx.dbf
ORA-15001: diskgroup "DATA5" does not exist or is not mounted
ORA-15001: diskgroup "DATA5" does not exist or is not mounted
Going into asmcmd confirmed that DATA5 was not there:
[oracle@rac2 admin]$ export ORACLE_SID=+ASM2
[oracle@rac2 admin]$ asmcmd
ASMCMD> lsdg
State    TYPE    Rebal  Unbal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Name
MOUNTED  EXTERN  N      N         512   4096  1048576   6277049   494723                0          494723              0  DATA2/
MOUNTED  EXTERN  N      N         512   4096  1048576   6655909   388388                0          388388              0  DATA3/
MOUNTED  EXTERN  N      N         512   4096  1048576   8011682   546051                0          546051              0  DATA4/
MOUNTED  EXTERN  N      N         512   4096  1048576   6578238   339569                0          339569              0  NEW_DG/
MOUNTED  EXTERN  N      N         512   4096  1048576    307196   262681                0          262681              0  REDO01/
MOUNTED  EXTERN  N      N         512   4096  1048576    307196   262681                0          262681              0  REDO02/
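Spotting a missing disk group by eye works for six rows, but the same check can be scripted. A minimal sketch, assuming the `lsdg` layout shown above where the last column is the group name followed by a slash; `dg_state` is a hypothetical helper.

```shell
#!/bin/sh
# dg_state NAME: read `asmcmd lsdg` output on stdin and print the State
# column for disk group NAME, or MISSING if there is no row for it.
dg_state() {
    awk -v dg="$1/" '
        $NF == dg { print $1; found = 1 }
        END       { if (!found) print "MISSING" }
    '
}

# Demo on two rows from the listing above.
dg_state DATA5 <<'EOF'
MOUNTED  EXTERN  N  N  512  4096  1048576  6277049  494723  0  494723  0  DATA2/
MOUNTED  EXTERN  N  N  512  4096  1048576  6655909  388388  0  388388  0  DATA3/
EOF
```

The demo prints `MISSING`, which matches what the ORA-15001 errors in the alert log were saying. On a node with asmcmd in the PATH the input could come from `asmcmd lsdg | dg_state DATA5`.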
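On the healthy node I looked up which disks the DATA5 disk group contains. Because the system uses ASMLib, I checked with oracleasm and found that the DATA5 disks were no longer there: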
[root@rac1 admin]# oracleasm listdisks
NEW1
NEW2
NEW3
NEW4
NEW5
NEW6
VOL1
VOL10
VOL11
VOL12
VOL13
VOL14
VOL15
VOL16
VOL17
VOL18
VOL2
VOL3
VOL4
VOL5
VOL6
VOL7
VOL8
VOL9
VOLDATA1
VOLDATA10
VOLDATA11
VOLDATA12
VOLDATA13
VOLDATA14
VOLDATA15
VOLDATA16
VOLDATA17
VOLDATA18
VOLDATA19
VOLDATA2
VOLDATA20
VOLDATA21
VOLDATA22
VOLDATA23
VOLDATA24
VOLDATA25
VOLDATA26
VOLDATA27
VOLDATA28
VOLDATA29
VOLDATA3
VOLDATA30
VOLDATA31
VOLDATA32
VOLDATA33
VOLDATA34
VOLDATA4
VOLDATA5
VOLDATA6
VOLDATA7
VOLDATA8
VOLDATA9
VOLREDO1
VOLREDO2
[root@rac1 admin]# oracleasm querydisk -p VOL1
Disk "VOL1" defines a device with no label
So the ASMLib label on disk VOL1 was gone.
[root@rac1 admin]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Cleaning disk "VOL1"
Cleaning disk "VOL16"
Cleaning disk "VOL17"
Cleaning disk "VOL2"
Cleaning disk "VOL3"
Cleaning disk "VOL4"
Cleaning disk "VOL5"
Cleaning disk "VOL7"
Cleaning disk "VOLDATA1"
Cleaning disk "VOLDATA30"
Cleaning disk "VOLDATA33"
Cleaning disk "VOLDATA4"
Cleaning disk "VOLDATA7"
Scanning system for ASM disks...
I rescanned the disks and, damn, the labels were all gone.
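The "Cleaning disk" lines in the scan output are the only record of which labels were dropped, so it is worth capturing them before doing anything else. A small sketch, assuming the `oracleasm scandisks` message format shown above; `cleaned_disks` is a name I made up.

```shell
#!/bin/sh
# cleaned_disks: read `oracleasm scandisks` output on stdin and print the
# label of every disk that was reported as cleaned.
cleaned_disks() {
    sed -n 's/^Cleaning disk "\([^"]*\)".*/\1/p'
}

# Demo on a few lines from the scan above.
cleaned_disks <<'EOF'
Cleaning any stale ASM disks...
Cleaning disk "VOL1"
Cleaning disk "VOL16"
Scanning system for ASM disks...
EOF
```

The demo prints `VOL1` and `VOL16`. Saving the full list to a file gives a checklist of the labels that need to be put back.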
From earlier records I worked out which physical devices were involved, then checked the device state with powermt:
[root@rac2 ~]# powermt display dev=emcpowerc
Pseudo name=emcpowerc
CLARiiON ID=CKM00111201809 [R910]
Logical device ID=6006016025B12C00069BFB6DF996E011 [LUN 6]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0;
Owner: default=SP B, current=SP B       Array failover mode: 4
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
   3 qla2xxx                  sde       SP A1     active  alive       0      0
   4 qla2xxx                  sdm       SP B1     active  alive       0      0
Use kfed to confirm the disk name:
[root@rac1 admin]# /opt/oracle/app/database/bin/kfed read /dev/emcpowerc1
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
kfbh.check:                  1205065909 ; 0x00c: 0x47d3d8b5
kfbh.fcn.base:                      173 ; 0x010: 0x000000ad
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8
kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000
kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                168820736 ; 0x020: 0x0a100000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:                    VOL1 ; 0x028: length=4
kfdhdb.grpname:                   DATA5 ; 0x048: length=5
kfdhdb.fgname:                     VOL1 ; 0x068: length=4
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             33005137 ; 0x0a8: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.crestmp.lo:           3051740160 ; 0x0ac: USEC=0x0 MSEC=0x177 SECS=0x1e MINS=0x2d
kfdhdb.mntstmp.hi:             33005137 ; 0x0b0: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.mntstmp.lo:           3060555776 ; 0x0b4: USEC=0x0 MSEC=0x318 SECS=0x26 MINS=0x2d
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize:                  511993 ; 0x0c4: 0x0007cff9
kfdhdb.pmcnt:                         6 ; 0x0c8: 0x00000006
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002
kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]:            65535 ; 0x0da: 0xffff
kfdhdb.redomirrors[2]:            65535 ; 0x0dc: 0xffff
kfdhdb.redomirrors[3]:            65535 ; 0x0de: 0xffff
kfdhdb.dbcompat:              168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi:             33005137 ; 0x0e4: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.grpstmp.lo:           3051595776 ; 0x0e8: USEC=0x0 MSEC=0xea SECS=0x1e MINS=0x2d
kfdhdb.ub4spare[0]:                   0 ; 0x0ec: 0x00000000
kfdhdb.ub4spare[1]:                   0 ; 0x0f0: 0x00000000
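Only a few fields in that dump matter for this repair: the disk name, the group name, and the provider string. On an ASMLib disk the provider string normally carries the label after `ORCLDISK` (for example `ORCLDISKVOL1`); here it is bare, which fits the missing label. The sketch below pulls single fields out of the dump and decodes a timestamp pair. Both function names are hypothetical, and the bit layout in `decode_stamp` is inferred from the YEAR/MNTH/DAYS/HOUR and MINS/SECS annotations in the output above, not taken from Oracle documentation.

```shell
#!/bin/sh
# kfed_field NAME: read `kfed read` output on stdin and print the value of
# kfdhdb.NAME (first token after the colon). An empty field such as
# capname would print ";" instead, so use it only on populated fields.
kfed_field() {
    awk -v name="kfdhdb.$1:" '$1 == name { print $2 }'
}

# decode_stamp HI LO: decode a kfed timestamp pair.
# Assumed layout: HI = year<<14 | month<<10 | day<<5 | hour
#                 LO = min<<26 | sec<<20 | msec<<10 | usec
decode_stamp() {
    hi=$1
    lo=$2
    printf '%04d-%02d-%02d %02d:%02d:%02d\n' \
        $(( $hi >> 14 )) \
        $(( ($hi >> 10) & 15 )) \
        $(( ($hi >> 5) & 31 )) \
        $(( $hi & 31 )) \
        $(( ($lo >> 26) & 63 )) \
        $(( ($lo >> 20) & 63 ))
}

# crestmp.hi / crestmp.lo from the dump above.
decode_stamp 33005137 3051740160   # prints 2014-07-18 17:45:30
```

The decoded creation time agrees with the hex annotations kfed prints (YEAR=0x7de, MNTH=0x7, DAYS=0x12, HOUR=0x11, MINS=0x2d, SECS=0x1e). To get the names for a device, something like `kfed read /dev/emcpowerc1 | kfed_field grpname` should return DATA5 for the header shown.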
Back up the disk header first:
[root@rac2 admin]# dd if=/dev/emcpowerc1 of=/tmp/VOL1.50m.dd bs=1M count=50
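Since the next step rewrites the header, it seems prudent to confirm the backup is readable and identical to what is on disk. A minimal sketch that wraps the same `dd` and then compares the copy with the source; `backup_header` is a hypothetical helper, and `bs=1048576` is just `bs=1M` spelled out in bytes.

```shell
#!/bin/sh
# backup_header SRC DST [MB]: copy the first MB mebibytes (default 50) of
# SRC to DST, then verify that DST matches the start of SRC byte for byte.
# Returns non-zero if the copy or the comparison fails.
backup_header() {
    src=$1
    dst=$2
    mb=${3:-50}
    dd if="$src" of="$dst" bs=1048576 count="$mb" 2>/dev/null || return 1
    size=$(( $(wc -c < "$dst") ))
    head -c "$size" "$src" | cmp -s - "$dst"
}

# Usage with the device and file name from the command above:
#   backup_header /dev/emcpowerc1 /tmp/VOL1.50m.dd 50 && echo "backup verified"
```

If the relabel goes wrong, the saved image can be written back with `dd` in the other direction, which is the whole point of taking it.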
Then relabel the disk with oracleasm renamedisk. The -f flag forces the change:
[root@rac2 disks]# oracleasm renamedisk -f /dev/emcpowerc1 VOL1
Writing disk header: done
Instantiating disk "VOL1": done
Scan and list the disks on both nodes:
[root@rac2 disks]# oracleasm listdisks
VOL1
... (rest omitted)
[root@rac1 admin]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "VOL1"
I repaired the rest of the affected disks the same way, mounted the disk group manually, and the database opened normally.
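With more than a dozen disks to fix, typing the backup and relabel commands by hand invites mistakes. One cautious option is to generate the commands from a device-to-label list and review them before running anything. This is only a sketch: `plan_relabel` is a made-up name, the mapping has to come from your own records and kfed checks, and only the `/dev/emcpowerc1 VOL1` pair below is taken from this incident.

```shell
#!/bin/sh
# plan_relabel: read "DEVICE LABEL" pairs on stdin and print, without
# running them, the backup and relabel commands for each pair.
plan_relabel() {
    while read -r dev label; do
        # skip blank lines and comment lines
        case $dev in ''|'#'*) continue ;; esac
        echo "dd if=$dev of=/tmp/$label.50m.dd bs=1M count=50"
        echo "oracleasm renamedisk -f $dev $label"
    done
}

# Demo with the one pair confirmed above.
plan_relabel <<'EOF'
# device           label
/dev/emcpowerc1    VOL1
EOF
```

The demo prints the same two commands that were run by hand for VOL1. Nothing is executed, so the output can be checked against kfed's dskname for each device first.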
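It is a strange problem: this database had been restored only a few days earlier, and the ASMLib labels disappeared while it was up and running...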
Original source: 记一次AsmLib故障 (Notes on an ASMLib Failure). Thanks to the original author for sharing.