欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  数据库

记一次故障诊断:事物回滚

程序员文章站 2024-01-20 12:26:58
...

客户数据库运行缓慢,告诉都在等待物理读事件,远程连接查看statspack报告 DB Name DB Id Instance Inst Num Release Cluster Host------------ ----------- ------------ -------- ----------- ------- ------------xxxx 259264653 xxxx 1 9.2.0.1.0 NO M S

客户数据库运行缓慢,告诉都在等待物理读事件,远程连接查看statspack报告

DB Name         DB Id    Instance     Inst Num Release     Cluster Host
------------ ----------- ------------ -------- ----------- ------- ------------
xxxx           259264653 xxxx                1 9.2.0.1.0   NO      M
 
            Snap Id     Snap TIME      Sessions Curs/Sess Comment
            ------- ------------------ -------- --------- -------------------
BEGIN Snap:                 18 04-12月-14 00:00:03      56       5.2
 
  END Snap:                 26 04-12月-14 08:00:06      69       9.1
 
   Elapsed:                              480.05 (mins)
 
Cache Sizes (END)
~~~~~~~~~~~~~~~~~
               Buffer Cache:       256M      Std Block SIZE:        8K
           Shared Pool SIZE:       600M          Log Buffer:    2,048K
 
LOAD Profile
~~~~~~~~~~~~                            Per SECOND       Per TRANSACTION
                                   ---------------       ---------------
                  Redo SIZE:             33,077.28            157,241.29
              Logical reads:             46,175.56            219,507.30
              Block changes:             12,514.78             59,492.17
             Physical reads:              1,582.20              7,521.40
            Physical writes:                 79.65                378.62
                 USER calls:                 41.86                199.01
                     Parses:                  6.11                 29.04
                Hard parses:                  0.02                  0.07
                      Sorts:                  9.51                 45.23
                     Logons:                  0.19                  0.92
                   Executes:              6,285.99             29,882.05
               Transactions:                  0.21

这里发现Block changes 高达12000以上,也就是每秒修改的12000块以上。

Top SQL
逻辑读

  Buffer Gets    Executions  Gets per EXEC  %Total TIME (s)  TIME (s) Hash VALUE
--------------- ------------ -------------- ------ -------- --------- ----------
    802,758,097       15,772       50,897.7   60.4 ######## #########  827999392
UPDATE "xxx" "A1" SET "col1" = :B1 WHERE "A1"
."col2"=:B2 AND "A1"."col1" IS NULL OR "A1"."col1"  :B1

物理读

Physical Reads  Executions  Reads per EXEC %Total TIME (s)  TIME (s) Hash VALUE
--------------- ------------ -------------- ------ -------- --------- ----------
     21,011,968       15,772        1,332.2   46.1 ######## #########  827999392
UPDATE "xxx" "A1" SET "col1" = :B1 WHERE "A1"
."col2"=:B2 AND "A1"."col1" IS NULL OR "A1"."col1":B1

表空间IO问题
Tablespace

------------------------------
                 Av      Av     Av                    Av        Buffer Av Buf
         Reads Reads/s Rd(ms) Blks/Rd       Writes Writes/s      Waits Wt(ms)
-------------- ------- ------ ------- ------------ -------- ---------- ------
DATA
    11,721,822     407    1.6     3.2      196,380        7        281    8.7
UNDOTBS1
     7,477,389     260    9.2     1.0    2,088,770       73     37,178    8.0
 
          -------------------------------------------------------------

数据文件

Tablespace               Filename
------------------------ ----------------------------------------------------
                 Av      Av     Av                    Av        Buffer Av Buf
         Reads Reads/s Rd(ms) Blks/Rd       Writes Writes/s      Waits Wt(ms)
-------------- ------- ------ ------- ------------ -------- ---------- ------
DATA                     D:\ORACLE\ORA92\xxx\xxxxxx.ORA
     5,839,496     203    1.9     3.2      118,428        4         46    8.7
                         D:\ORACLE\ORA92\xxx\xxxxx.ORA
     5,882,326     204    1.4     3.3       77,952        3        235    8.7
 
UNDOTBS1                 D:\ORACLE\ORA92\xxxx\UNDOTBS01.DBF
     7,477,389     260    9.2     1.0    2,088,770       73     37,178    8.0
 
 
 
          -------------------------------------------------------------

从上面可以看出Tablespace data、undotbs1的IO使用比较严重,严重了上面的update语句导致。
查看下undo的使用

ROLLBACK Segment Storage FOR DB: FSJZ  Instance: fsjz  Snaps: 18 -26
->Optimal SIZE should be larger than Avg Active
 
RBS No    Segment SIZE      Avg Active    Optimal SIZE    Maximum SIZE
------ --------------- --------------- --------------- ---------------
     0         385,024           6,553                         385,024
     1      35,774,464       1,003,908                     129,097,728
     2      10,608,640         996,987                      75,620,352
     3      13,819,904         772,807                      18,997,248
     4       5,365,760       1,355,300                      14,802,944
     5       8,511,488       5,457,853                   3,885,096,960
     6   4,183,941,120   2,771,161,530                   4,252,098,560
     7       5,365,760         619,248                      10,608,640
     8      14,802,944       7,010,756                      21,094,400
     9       3,268,608       2,908,152                      22,142,976
    10       4,317,184         567,499                       4,317,184
          -------------------------------------------------------------

上面的信息可以得出6号undo段已经使用了4G。
查看事物开始时间

会话的信息

SELECT
      --t.xid   xid
       s.sid sid
     , t.inst_id
     , p.spid
     , s.STATUS
     , t.start_time
     , t.STATUS      transaction_status
     , t.SPACE
     , t.recursive
     , t.noundo
     , t.ptx
     , ROUND((t.used_ublk * p.VALUE)/1024/1024) used_undo
     , t.log_io
     , t.phy_io
     , t.cr_get
     , s.username
     , s.osuser
     --, s.sql_id
     --, s.sql_child_number
     , s.program
FROM gv$transaction t
     INNER JOIN gv$session s ON t.inst_id = s.inst_id   AND t.ses_addr = s.saddr
     INNER JOIN gv$process p ON p.inst_id = s.inst_id   AND p.addr = s.paddr
ORDER BY  t.start_time
;

记一次故障诊断:事物回滚

从上面来看这个事物的Undo使用达到了17G

查看Update 语句的执行计划

	SQL> EXPLAIN plan FOR
  2  UPDATE xxxxx."xxxxxxxxxxxxxxxxx" "A1"
  3  SET "col2" = '1'
  4  WHERE "A1" ."col2"=96
  5    AND "A1"."col1" IS NULL
  6    OR "A1"."col1"'0'
  7  ;
 
已解释。
 
 
SQL> SELECT * FROM TABLE(dbms_xplan.display());
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
 
 
---------------------------------------------------------------------------
| Id  | Operation            |  Name              | ROWS  | Bytes | Cost  |
---------------------------------------------------------------------------
|   0 | UPDATE STATEMENT     |                    |       |       |       |
|   1 |  UPDATE              | xxxxxxxxxxxxxxxxx  |       |       |       |
|*  2 |   TABLE ACCESS FULL  | xxxxxxxxxxxxxxxxx  |       |       |       |
---------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
 
 
   2 - FILTER("A1"."col2"=96 AND "A1"."col1" IS NULL OR "A
              1"."col1"'0')
 
Note: rule based optimization

已选择16行。
这里得知SQL走的全表扫,仔细查看SQL 发现

《AND "A1"."col1" IS NULL  OR "A1"."col1"'0'》

这两个条件应该增加括号,否则会oracle会找到col1不等于某个值的所有结果进行更新。

手工KILL掉这个进程后,数据库事物回滚异常缓慢

查看数据库参数

SQL> SHOW parameter cpu_count
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------
cpu_count                            INTEGER     16
SQL> SHOW parameter  fast_start
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------
fast_start_parallel_rollback         string      LOW
SQL>  SHOW parameter parallel_max
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------
parallel_max_servers                 INTEGER     5

fast_start_parallel_rollback参数是加快回滚速度,这个参数默认是low,即2倍cpu数的并发度,进行并发的回滚。这个参数可以设置成high,即为4倍cpu数的并发度进行回滚。
如果设置了此参数为high或low,也就是4倍或2倍的cpu数,也会受到另外一些参数的影响,如PARALLEL_MAX_SERVERS,这个才是真正最大的并发度设置。
PARALLEL_MAX_SERVERS这个参数的默认值为PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 5。PARALLEL_THREADS_PER_CPU 和CPU_COUNT都是初始化参数。

在一般情况下,并发的回滚总是比串行的快,但是当并发的子进程之间存在资源冲突的情况。
在并发子进程之间需要的资源冲突时,往往此时smon的等待事件是长期处于

Wait for stopper event to be increased

而子进程的等待事件是较多出现

Wait for a undo record。

此时,就是子进程冲突了。并发的回滚速度反而不如串行的回滚速度。

查询数据库得知现在最大并行进程设置5,远远不够并行恢复需要的进程。查看smon进程等待事件时,竟然没有记录。

通过关闭并行进程恢复,提高恢复速度

ALTER system SET fast_start_parallel_rollback=FALSE;

查看恢复进度

SQL> DECLARE
  2  l_start NUMBER;
  3  l_end    NUMBER;
  4  BEGIN
  5    SELECT ktuxesiz INTO l_start FROM x$ktuxe WHERE  KTUXEUSN=6 AND KTUXESLT=46;
  6    dbms_lock.sleep(60);
  7    SELECT ktuxesiz INTO l_end FROM x$ktuxe WHERE  KTUXEUSN=6 AND KTUXESLT=46;
  8    dbms_output.put_line('time est Day:'|| round(l_end/(l_start -l_end)/60/24,2));
  9  END;
 10  /
TIME est DAY:.14
 
SQL> SELECT 0.14*24 FROM dual
  2  ;
 
   0.14*24
----------
      3.36

大体评估事务回滚需要3.36个小时