关于Oracle Data Guard Failover 的实例说明
failover是失败切换。这种情况下切换对redo的处理,就显的很重要。如果处理好,就不会有数据丢失。否则就会有数据丢失。
在oracle 11g里,data guard切换多了一个新的功能:flush redo。
flush能把没有发送的redo从主库传送到standby库。只要主库能启动到mount状态,那么flush就可以把没有发送的归档和current online redo发送到备库。
flush语法:
sql> alter system flush redo to target_db_name;
这里的target_db_name是我们在主库的db_unique_name名称。也就是在tnsnames.ora文件配置的。flush会将未发送的redo从主库传到备库,并且等待redo在standby库上apply之后返回成功。所以只要flush成功,那么failover就没有主句丢失。
如果说我们的primary已经不能启动到mount状态,那么就只能按照之前的方法来。oracle 10g下就是这么操作的。
一.正常的failover
1.1检查gap
sql> select thread#, low_sequence#, high_sequence# from v$archive_gap;
如果有,将对应的归档文件copy到备库,在注册它
sql>alter database register physical logfile 'filespec1';
注意:如果有gap存在,并且没有解决。那么是不能正常的进行一个failover。只能进行一个强制的failover。这种情况下会有数据丢失。
sql> alter database activate physical standby database;
1.2解决gap问题后,进行切换
1.2.1取消apply
sql> recover managed standby database cancel;
1.2.2结束apply
(1)在oracle 10gr2或之后的版本:如果在备用库上有备用库日志文件
sql> alter database recover managed standby database finish; -- [force|wait|nowait]
在执行这个命令的时候,如果主库和备库之间的网络中断了。那么备库的rfs进程就会等待网络的连接,直到tcp超时。因此在这种情况下,我们就需要加上foce关键字。
(2)在oracle 10gr2之前的版本:没有备库日志文件
sql> alter database recover managed standby database finish skip standby logfile;
注意:如果执行了这条命令,就不能在进行recover standby database;
1.2.3将备库切换成主库
sql> alter database commit to switchover to primary;
sql> shutdown immediate;
sql> startup
二.强行切换(激活)
2.1使用条件
当我们正常切换的时候,提示我们需要介质恢复的时候,就需要使用强行激活standby库。如:
sql> alter database commit to switchover to primary;
alter database commit to switchover to primary
*
error at line 1:
ora-16139: media recovery required
2.2强行激活下的redo问题
在这里需要说明一点,就是我们在主库commit之后,然后shutdown abort,这时候,主库的online redo会自动的写入备库的最后一个归档文件里(大小会发生变化)。我们在恢复的时候需要对备库的最后一个归档文件进行重新的注册。
sql>alter database register physical logfile 'filespec1';
如果说,主库os是整个宕机了。这个时候,online redo是不会发送到备库。所以我们需要手工的将主库的所有online redo copy到备库。然后进行recover。
步骤如下:
sql>alter database recover managed standby database cancel;
database altered.
sql>recover standby database until cancel;
ora-00279: change 509016 generated at 11/05/2010 11:40:27 needed for thread 1
ora-00289: suggestion : /u01/archive/1_17_734225750.dbf
ora-00280: change 509016 for thread 1 is in sequence #17
--默认情况下会提示需要归档17,实际上这个序列为17的归档还没有生成,我们忽略它,使用我们刚才copy过来的redo日志来恢复。
specify log: {
/u01/app/oracle/oradata/orcl/redo01.log--注意,这个位置是我手动写的
log applied.
media recovery complete.
这里一次就搞定了。实际上有三个redo,如果不确定使用哪个redo的,只能一个一个试。
当我们使用了recover standby database until cancel之后,只能使用强制激活备库,如果使用正常模式,会提示我们需要:
ora-16139: media recovery required
2.3强制激活备库:
sql> alter database recover managed standby database cancel;
sql> recover standby database until cancel;
sql>alter database activate standby database;
sql>shutdown immediate;
sql>startup
三.switchover
3.1主库操作:
(1)查看状态:
sql>select switchover_status from v$database;
(2)切换
sql> alter database commit to switchover to physical standby with session shutdown;
sql> shutdown immediate;
sql> startup;
sql> alter database mount standby database;
sql> recover managed standby database disconnect;
3.2备库操作:
sql> alter database commit to switchover to primary with session shutdown;
sql> shutdown immediate
sql> startup
四.对failover过程的研究
4.1 failover日志
thu mar 17 15:01:47 2011
alter database activate standby database
thu mar 17 15:01:47 2011
alter database activate [physical] standby database (dave)
--我们切换的时候,命令写全命令,db自动补全了
resetlogs after complete recovery through change 1255060
resetting resetlogs activation id 808909668 (0x3036fb64)
-- resetlogs了.这就以为着产生一个新的incarnation。online redo会被清空
online log /u01/app/oracle/oradata/dave/redo01.log: thread 1 group 1 was previously cleared
online log /u01/app/oracle/oradata/dave/redo02.log: thread 1 group 2 was previously cleared
online log /u01/app/oracle/oradata/dave/redo03.log: thread 1 group 3 was previously cleared
standby became primary scn: 1255058
thu mar 17 15:01:48 2011
setting recovery target incarnation to 3
--修改incarnation版本
thu mar 17 15:01:48 2011
converting standby mount to primary mount.
--将standby转成primary
thu mar 17 15:01:48 2011
activate standby: complete - database mounted as primary (dave)
completed: alter database activate standby database
--完成active
thu mar 17 15:01:59 2011
shutting down instance: further logons disabled
--关闭实例
thu mar 17 15:01:59 2011
stopping background process cjq0
thu mar 17 15:01:59 2011
stopping background process mmnl
thu mar 17 15:01:59 2011
stopping background process mmon
thu mar 17 15:01:59 2011
shutting down instance (immediate)
license high water mark = 7
thu mar 17 15:01:59 2011
stopping job queue slave processes, flags = 7
thu mar 17 15:01:59 2011
job queue slave processes stopped
all dispatchers and shared servers shutdown
thu mar 17 15:02:35 2011
arc1: archival disabled due to shutdown: 1089
shutting down archive processes
archiving is disabled
thu mar 17 15:02:45 2011
arch shutting down
arc0: archival stopped
thu mar 17 15:02:50 2011
arch shutting down
arc1: archival stopped
thu mar 17 15:07:04 2011
shutdown: active processes prevent shutdown operation
thu mar 17 15:07:50 2011
alter database close normal
thu mar 17 15:07:50 2011
ora-1109 signalled during: alter database close normal...
thu mar 17 15:07:50 2011
alter database dismount
completed: alter database dismount
arch: archival disabled due to shutdown: 1089
shutting down archive processes
archiving is disabled
archive process shutdown avoided: 0 active
arch: archival disabled due to shutdown: 1089
shutting down archive processes
archiving is disabled
archive process shutdown avoided: 0 active
thu mar 17 15:08:13 2011
starting oracle instance (normal)
--开始重新启动实例
license_max_session = 0
license_sessions_warning = 0
picked latch-free scn scheme 2
autotune of undo retention is turned on.
imode=br
ilat =18
license_max_users = 0
sys auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
starting up oracle rdbms version: 10.2.0.4.0.
system parameters with non-default values:
processes= 150
__shared_pool_size= 113246208
__large_pool_size= 4194304
__java_pool_size= 25165824
__streams_pool_size= 0
nls_territory= america
sga_target= 247463936
control_files= /u01/app/oracle/oradata/dave/control01.ctl, /u01/app/oracle/oradata/dave/control02.ctl, /u01/app/oracle/oradata/dave/control03.ctl
db_block_size= 8192
__db_cache_size= 100663296
compatible= 10.2.0.1.0
log_archive_config= dg_config=(dave_pd,dave_st)
log_archive_dest_1= location=/u01/archivelog valid_for=(all_logfiles,all_roles) db_unique_name=dave_st
log_archive_dest_2= service=dave_pd reopen=120 lgwr async valid_for=(online_logfiles,primary_role) db_unique_name=dave_pd
log_archive_dest_state_1 = enable
log_archive_dest_state_2 = enable
standby_archive_dest= /u01/archivelog
fal_client= dave_st
fal_server= dave_pd
db_file_multiblock_read_count= 16
standby_file_management= auto
undo_management= auto
undo_tablespace= undotbs1
remote_login_passwordfile= exclusive
db_domain=
dispatchers= (protocol=tcp) (service=davexdb)
job_queue_processes= 10
background_dump_dest= /u01/app/oracle/admin/dave/bdump
user_dump_dest= /u01/app/oracle/admin/dave/udump
core_dump_dest= /u01/app/oracle/admin/dave/cdump
audit_file_dest= /u01/app/oracle/admin/dave/adump
db_name= dave
db_unique_name= dave_st
open_cursors= 300
pga_aggregate_target= 81788928
pmon started with pid=2, os id=5909
psp0 started with pid=3, os id=5911
mman started with pid=4, os id=5913
dbw0 started with pid=5, os id=5915
lgwr started with pid=6, os id=5917
ckpt started with pid=7, os id=5919
smon started with pid=8, os id=5921
reco started with pid=9, os id=5923
cjq0 started with pid=10, os id=5925
mmon started with pid=11, os id=5927
thu mar 17 15:08:14 2011
starting up 1 dispatcher(s) for network address '(address=(partial=yes)(protocol=tcp))'...
mmnl started with pid=12, os id=5929
thu mar 17 15:08:14 2011
starting up 1 shared server(s) ...
thu mar 17 15:08:15 2011
alter databasemount
thu mar 17 15:08:19 2011
setting recovery target incarnation to 3
thu mar 17 15:08:19 2011
successful mount of redo thread 1, with mount id 808884895
thu mar 17 15:08:19 2011
database mounted in exclusive mode
completed: alter databasemount
thu mar 17 15:08:19 2011
alter database open
thu mar 17 15:08:19 2011
assigning activation id 808884895 (0x30369a9f)
lgwr: starting arch processes
arc0 started with pid=16, os id=5937
thu mar 17 15:08:19 2011
arc0: archival started
arc1: archival started
lgwr: starting arch processes complete
arc1 started with pid=17, os id=5939
lns1 started with pid=18, os id=5941
thu mar 17 15:08:22 2011
thread 1 advanced to log sequence 2 (thread open)
thu mar 17 15:08:23 2011
arc0: starting arch processes
thu mar 17 15:08:23 2011
arc1: becoming the 'no fal' arch
arc1: becoming the 'no srl' arch
thu mar 17 15:08:23 2011
thread 1 opened at log sequence 2
current log# 2 seq# 2 mem# 0: /u01/app/oracle/oradata/dave/redo02.log
successful open of redo thread 1
thu mar 17 15:08:23 2011
******************************************************************
lgwr: setting 'active' archival for destination log_archive_dest_2
******************************************************************
thu mar 17 15:08:23 2011
arc1: lgwr is actively archiving destination log_archive_dest_2
thu mar 17 15:08:23 2011
arc2: archival started
arc0: starting arch processes complete
arc0: becoming the heartbeat arch
arc2 started with pid=19, os id=5943
thu mar 17 15:08:23 2011
mttr advisory is disabled because fast_start_mttr_target is not set
thu mar 17 15:08:23 2011
smon: enabling cache recovery
thu mar 17 15:08:24 2011
successfully onlined undo tablespace 1.
dictionary check beginning
dictionary check complete
thu mar 17 15:08:24 2011
smon: enabling tx recovery
thu mar 17 15:08:24 2011
database characterset is zhs16gbk
opening with internal resource manager plan
where numa pg = 1, cpus = 1
replication_dependency_tracking turned off (no async multimaster replication found)
starting background process qmnc
qmnc started with pid=20, os id=5945
thu mar 17 15:08:26 2011
logstdby: validating controlfile with logical metadata
thu mar 17 15:08:26 2011
logstdby: validation complete
completed: alter database open
4.2对failover的补充说明
在4.1中看了failover的整个过程,db会进行一次resetlogs。这个是个很有意思的过程。
(1)resetlogs会产生一个新的incarnation。这个会影响我们的rman恢复。我们查看一下:
rman> list incarnation;
list of database incarnations
db keyinc key db namedb idstatusreset scnreset time
------- ------- -------- ---------------- --- ---------- ----------
11dave808637274parent130-jun-05
22dave808637274parent44607514-mar-11
33dave808637274current 125506117-mar-11
这个时候,我们只能恢复incarnation为3之内的信息,如果要恢复到其他版本的信息,要保证对应备份集存在的同时,在使用reset database incarnation to 3或者其他的版本。之后在恢复。
(2)看下归档日志
先看备库:
sql> select max(sequence#) from v$archived_log;
max(sequence#)
--------------
6
sql> select sequence#,applied from v$archived_log;
sequence# app
---------- ---
3 yes
2 yes
5 yes
4 yes
6 yes
这个是重新开始的,没有什么问题。
我们看下主库:
sql> select max(sequence#) from v$archived_log;
max(sequence#)
--------------
88
sql> select sequence#,applied from v$archived_log;
sequence# app
---------- ---
4 yes
3 yes
5 yes
6 yes
7 yes
8 yes
......
82 yes
83 yes
84 yes
85 yes
86 yes
87 yes
88 yes
2 no
2 yes--注意这部分,有重新开始了。
3 no
3 yes
4 no
5 no
5 yes
4 yes
6 no
6 yes
因为resetlogs会重置sequence#。将其设置为1.所以这里又重新开始了。但是scn不会重置。我们查看一下:
sql> select sequence#,first_change#,next_change# from v$log_history;
sequence# first_change# next_change#
---------- ------------- ------------
1446075451208
2451208483347
3483347485272
4485272485277
5485277486119
.....
8312276671229252
8412292521252272
8512522721252277
8612522771252293
8712522931252294
8812522941253301
112550611255062
--sequence#重新开始了,但是scn还是继续增加的。
212550621257639
312576391257644
412576441265602
512656021265607
612656071265913
94 rows selected.
sql>
所以,这种情况下,查看同步情况还是有点不直观。但是v$log_history和v$archived_log显示的log历史信息是从控制文件中取得的,所以说,如果要删除以前的记录,只有重建控制文件了。
(3)重建控制文件
sql> shutdown immediate
sql> startup nomount;
sql>create controlfile reuse database davenoresetlogsarchivelog
logfile
group 1 '/u01/app/oracle/oradata/dave/redo01.log',
group 2 '/u01/app/oracle/oradata/dave/redo02.log',
group 3 '/u01/app/oracle/oradata/dave/redo03.log'
datafile
'/u01/app/oracle/oradata/dave/sysaux01.dbf',
'/u01/app/oracle/oradata/dave/system01.dbf',
'/u01/app/oracle/oradata/dave/undotbs01.dbf',
'/u01/app/oracle/oradata/dave/users01.dbf'
character set zhs16gbk;
--注意,使用的是noresetlogs,如果使用resetlogs,dg就需要重新搭建了。
sql> alter tablespace temp add tempfile '/u01/app/oracle/oradata/dave/temp01.dbf' size 100m;
tablespace altered.
--添加临时表空间,在重建控制文件的时候,不能添加temp表空间,只能在控制文件重建好之后,在添加temp表空间。
更多信息参考:
oracle控制文件
https://www.cndba.cn/dave/article/1216
(4)在次验证归档信息
主库:
sql> select max(sequence#) from v$archived_log;
max(sequence#)
--------------
9
sql> alter system switch logfile;
system altered.
sql> select max(sequence#) from v$archived_log;
max(sequence#)
--------------
10
sql> select sequence#,applied from v$archived_log;
sequence# app
---------- ---
8 no
7 no
9 no
9 yes
10 yes
10 no
6 rows selected.
sql> select sequence#,first_change#,next_change# from v$log_history;
sequence# first_change# next_change#
---------- ------------- ------------
912678281268212
1012682121274690
从这个结果来看,重建控制文件之后,之前的所有的有关归档的信息都会被删除。
备库:
sql> select max(sequence#) from v$archived_log;
max(sequence#)
--------------
10
sql> select sequence#,applied from v$archived_log;
sequence# app
---------- ---
3 yes
2 yes
5 yes
4 yes
6 yes
7 yes
8 yes
9 yes
10 yes
9 rows selected.
说明:
我这里是测试环境,所以重建控制文件测试一下,如果是生产环境,小心操作。