博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
手工模拟Oracle数据块逻辑讹误引发,ORA-00600:[13013] [5001]一例
阅读量:4988 次
发布时间:2019-06-12

本文共 17319 字,大约阅读时间需要 57 分钟。

上周在客户那里遇到了一例由Oracle Bug引发的表数据块逻辑讹误触发ORA-00600:[13013], [5001]的问题,这里为了更好地说明该问题,于是萌发了手工模拟该数据块逻辑讹误的想法。
基础知识 Oracle中表的数据块由块头、事务槽、行字典、行数据等多种结构组成。 行数据(rowdata)实际是由许多row piece 行片组成的,每一条row piece的头部都有flag、locks、cols(cc)三个标志位。 其中flag标记了该row piece的类型,该flag位占用一个字节,其不同的bit位代表不同的含义,见下表:
ROW_CLUSTER_KEY = 0x80;              KDRHFKROW_CTABLE_NUMBER = 0x40;            KDRHFCROW_HEAD_PIECE = 0x20;               KDRHFHROW_DELETED_ROW = 0x10;              KDRHFDROW_FIRST_PIECE = 0x08;              KDRHFFROW_LAST_PIECE = 0x04;               KDRHFLROW_FROM_PREVIOUS = 0x02;            KDRHFPROW_CONTINUE_NEXT = 0x01;            KDRHFN
一般来说最普通的一条row piece是普通堆表(heap table)的未被删除的且无行迁移/链接的,其flag位应为
普通row的flag一般为Single Row =ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE= 0x20 + 0x08 + 0x04= 0x2c===================================================================================cluster key的flag一般为Cluster Key =ROW_CLUSTER_KEY + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE=KDRHFL, KDRHFF, KDRHFH, KDRHFK =0x80 + 0x2c =  0xacBBED> x /rnrowdata[68]                                 @8166-----------flag@8166: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK)lock@8167: 0x00cols@8168:    1kref@8169:    1mref@8171:    1hrid@8173:0x01800014.0nrid@8179:0x01800014.0col    0[2] @8185: 10 ===================================================================================Cluster Row =ROW_CTABLE_NUMBER + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE =(KDRHFL, KDRHFF, KDRHFH, KDRHFC) = 0x6c BBED> x /rnccrowdata[0]                                  @8098----------flag@8098: 0x6c (KDRHFL, KDRHFF, KDRHFH, KDRHFC)lock@8099: 0x00cols@8100:   10col    0[2] @8102: 200col    1[8] @8105: Jennifercol    2[6] @8114: Whalencol    3[7] @8121: JWHALENcol   4[12] @8129: 515.123.4444col    5[7] @8142: w....col    6[7] @8150: AD_ASSTcol    7[2] @8158:                     col    8[0] @8161: *NULL*col    9[3] @8162: .出现ORA-00600:[13013], [5001]且Arg [f] Code =3 代表这一row piece的flag >0xc0,也就是该行片同时被标记为key和clustered(row is marked as both a Key and Clustered), 其检验代码为check code 6251。当flag >= 0xc0 时 会出现kdrchk: row is marked as both a Key and Clustered Block 12 failed with check code 6251当 0xac >flag >= 0xa0 时 会 kdrchk: row is Key and Not only piece of key Block 12 failed with check code 6255当 flag = 0x43 是 会出现 kdrchk: C and neither of H or F Block 12 failed with check code 6263当 flag = 0x83 时 会出现 kdrchk: row is marked both as a Key and being continued Block 12 failed with check code 6254
  当Oracle进程访问数据块时首先会校验block的sum值并与block中的CHECKSUM值进行对比,若一致则说明该block没有物理讹误。但是光这一项检查是不够的,不足以保证block无误。所以Oracle引入了一些列的逻辑检验,每一种逻辑检验对应一个检测代码(check code),这些检测包括row piece的flag、cols(cc)状态是否正确等。 实际负责这类逻辑检验的函数包括:kdbchk、kddummy_blkchk、kco_blkchk、kdBlkCheckError、kdrchk等等。 这里当服务进程访问到问题数据块,检测代码发现其flag为0xff(KCHDFLPN),该flag从逻辑上讲是冲突的,所以检测代码认为该row piece存在异常,进而会引发update的ORA-00600:[13013], [5001]或查询的ORA-600 [qertbFetchByRowID]内部错误。 这里需要说明一下的是,很多人认为dbv工具时无法检测出逻辑讹误的,实际上dbv、rman、validate structure和bbed-verify均可以检测出一定程度的逻辑讹误,但是最可靠的还是db_block_checksum=true情况下的validate structure [online]验证命令。从另一个角度来说,普通的dbv只能做单一的检测,而无法做到交叉地检验,从而了解表和索引上的不一致问题,但是validate structure online却可以做到。
正式模拟 以上我们了解了ORA-00600:[13013], [5001]内部错误是如何被引发的,那么下面手工模拟该错误也就不困难了,当然这里需要用到bbed工具。 以下我们会创建实验用的tablespace,table,index:
SQL> select * from v$version;BANNER----------------------------------------------------------------Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64biPL/SQL Release 10.2.0.4.0 - ProductionCORE 10.2.0.4.0 ProductionTNS for Linux: Version 10.2.0.4.0 - ProductionNLSRTL Version 10.2.0.4.0 - ProductionSQL> select * from global_name;GLOBAL_NAME--------------------------------------------------------------------------------www.oracledatabase12g.com/* 创建实验用的表空间  */SQL> create tablespace maclean datafile '/home/oracle/maclean.dbf' size 20M;Tablespace created.SQL> create table tv tablespace maclean as select rownum t1,'find me' t2 fromdba_tables where rownumcreate index ind_tv on tv(t1) tablespace users;Index created.SQL> update tv set t2='corrption here' where t1=200;update tv set t2='corrption here' where t1=200*ERROR at line 1:ORA-12899: value too large for column "SYS"."TV"."T2" (actual: 14, maximum: 7)SQL> alter table tv modify t2 varchar2(200);Table altered.SQL> update tv set t2='corruption here' where t1=200;1 row updated.SQL> commit;Commit complete./* 以上创建了示例用表,其中t1=200的记录是之后将会   手动修改为存在讹误的行             */SQL> select dump(200,16) from dual;DUMP(200,16)-----------------Typ=2 Len=2: c2,3/* 通过16进制码可以方便找出该t1=200的记录行 */ SQL> alter system checkpoint;System altered.SQL> alter tablespace maclean read only;Tablespace altered.SQL> select dbms_rowid.rowid_block_number(rowid) bno ,dbms_rowid.rowid_relative_fno(rowid) fno from tv;BNO FNO---------- ----------12 6[oracle@rh2 ~]$ cp maclean.dbf maclean.dbf.bak
  接着使用BBED工具找到目标行并实施手工修改:  
[oracle@rh2 ~]$ bbed filename=maclean.dbf mode=editPassword:BBED: Release 2.0.0.0.0 - Limited Production on Sun Sep 18 22:14:59 2011Copyright (c) 1982, 2007, Oracle. All rights reserved.BBED> set blocksize 8192BLOCKSIZE 8192BBED> set block 13BLOCK# 13BBED> map /vFile: maclean.dbf (0)Block: 13 Dba:0x00000000------------------------------------------------------------KTB Data Block (Table/Cluster)struct kcbh, 20 bytes @0ub1 type_kcbh @0ub1 frmt_kcbh @1ub1 spare1_kcbh @2ub1 spare2_kcbh @3ub4 rdba_kcbh @4ub4 bas_kcbh @8ub2 wrp_kcbh @12ub1 seq_kcbh @14ub1 flg_kcbh @15ub2 chkval_kcbh @16ub2 spare3_kcbh @18struct ktbbh, 96 bytes @20ub1 ktbbhtyp @20union ktbbhsid, 4 bytes @24struct ktbbhcsc, 8 bytes @28b2 ktbbhict @36ub1 ktbbhflg @38ub1 ktbbhfsl @39ub4 ktbbhfnx @40struct ktbbhitl[3], 72 bytes @44struct kdbh, 14 bytes @124ub1 kdbhflag @124b1 kdbhntab @125b2 kdbhnrow @126sb2 kdbhfrre @128sb2 kdbhfsbo @130sb2 kdbhfseo @132b2 kdbhavsp @134b2 kdbhtosp @136struct kdbt[1], 4 bytes @138b2 kdbtoffs @138b2 kdbtnrow @140sb2 kdbr[200] @142ub1 freespace[4725] @542ub1 rowdata[2921] @5267ub4 tailchk @8188BBED> find /x c203File: maclean.dbf (0)Block: 13 Offsets: 5271 to 5782 Dba:0x00000000------------------------------------------------------------------------c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e 64206d652c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420 6d652c000203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65 2c000203c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00 0203c2025e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203 c2025c0766696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202 5a0766696e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807 66696e64206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669 6e64206d652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64 206d652c000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d 652c000203c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c 000203c2024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002 03c2024d0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2 024b0766696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249 0766696e64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766 696e64206d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e 64206d65找到了t1=200的偏移值为5271则其fb的偏移值为5271 -4 = 5267BBED> set offset 5267OFFSET 5267BBED> dFile: maclean.dbf (0)Block: 13 Offsets: 5267 to 5778 Dba:0x00000000------------------------------------------------------------------------2c020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e64206d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d652c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c000203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c2025a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c202580766696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 560766696e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c000203c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c202490766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e/* 找到指定行的地址为5267,其当前flag为正常的0x2c  */BBED> x /rncrowdata[0] @5267----------flag@5267: 0x2c (KDRHFL, KDRHFF, KDRHFH)lock@5268: 0x02cols@5269: 2col 0[2] @5270: 200col 1[15] @5273: corruption here修改该flag 为 0xff BBED> modify /x 0xffWarning: contents of previous BIFILE will be lost. Proceed? (Y/N) yFile: maclean.dbf (0)Block: 13 Offsets: 5267 to 5778 Dba:0x00000000------------------------------------------------------------------------ff020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e64206d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d652c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c000203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c2025a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c202580766696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 560766696e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c000203c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c202490766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696eBBED> x /rncrowdata[0] @5267----------flag@5267: 0xff (KDRHFN, KDRHFP, KDRHFL, KDRHFF, KDRHFD, KDRHFH, KDRHFC, KDRHFK)lock@5268: 0x02cols@5269: 0ckix@5270: 2BBED> sum applyCheck value for File 0, Block 13:current = 0x0000, required = 0x0000我们使用bbed的verify命令验证数据块会发现问题flagBBED> verifyDBVERIFY - Verification startingFILE = maclean.dbfBLOCK = 12kdrchk: row is marked as both a Key and Clusteredprow=0x7f5335f05693 flag=0xffBlock Checking: DBA = 25165836, Block Type = KTB-managed data blockdata header at 0x7f5335f0427ckdbchk: bad row tab 0, slot 199Block 12 failed with check code 6251DBVERIFY - Verification completeTotal Blocks Examined : 1Total Blocks Processed (Data) : 1Total Blocks Failing (Data) : 1Total Blocks Processed (Index): 0Total Blocks Failing (Index): 0Total Blocks Empty : 0Total Blocks Marked Corrupt : 0Total Blocks Influx : 0使用dbv工具是也可以验证这种逻辑讹误的[oracle@rh2 ~]$ dbv file=maclean.dbfDBVERIFY: Release 10.2.0.4.0 - Production on Sun Sep 18 22:27:49 2011Copyright (c) 1982, 2007, Oracle. All rights reserved.DBVERIFY - Verification starting : FILE = maclean.dbfkdrchk: row is marked as both a Key and Clusteredprow=0x7f9ef25f7693 flag=0xffBlock Checking: DBA = 25165836, Block Type = KTB-managed data blockdata header at 0x7f9ef25f627ckdbchk: bad row tab 0, slot 199Page 12 failed with check code 6251DBVERIFY - Verification completeTotal Pages Examined : 2560Total Pages Processed (Data) : 1Total Pages Failing (Data) : 1Total Pages Processed (Index): 0Total Pages Failing (Index): 0Total Pages Processed (Other): 11Total Pages Processed (Seg) : 0Total Pages Failing (Seg) : 0Total Pages Empty : 2548Total Pages Marked Corrupt : 0Total Pages Influx : 0Highest block SCN : 691691 (0.691691)
  回到sqlplus中访问之前修改的数据行,触发ORA-600[13013] [5001]错误:  
SQL> alter system flush buffer_cache;System altered.SQL> update tv set t2='correct here' where t1=200;update tv set t2='correct here' where t1=200*ERROR at line 1:ORA-00600: internal error code, arguments: [13013], [5001], [52937],[25165836], [199], [25165836], [3], []PLAN_TABLE_OUTPUT---------------------------------------------------------Plan hash value: 568795662----------------------------------------------------------------------------| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |----------------------------------------------------------------------------|   0 | UPDATE STATEMENT  |        |     1 |   115 |     2   (0)| 00:00:01 ||   1 |  UPDATE           | TV     |       |       |            |          ||*  2 |   INDEX RANGE SCAN| IND_TV |     1 |   115 |     1   (0)| 00:00:01 |----------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   2 - access("T1"=200)SQL> select * from tv where t1=200;select * from tv where t1=200*ERROR at line 1:ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [],[], [], []PLAN_TABLE_OUTPUT------------------------------------------------------------------------------------------------------------------------Plan hash value: 1015724781--------------------------------------------------------------------------------------| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------------------|   0 | SELECT STATEMENT            |        |     1 |   115 |     2   (0)| 00:00:01 ||   1 |  TABLE ACCESS BY INDEX ROWID| TV     |     1 |   115 |     2   (0)| 00:00:01 ||*  2 |   INDEX RANGE SCAN          | IND_TV |     1 |       |     1   (0)| 00:00:01 |--------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   2 - access("T1"=200)
可以看到当正好update到问题行记录时如预料出现了ORA-00600:[13013], [5001]错误,而ACCESS BY INDEX ROWID时出现了ORA-00600:[qertbFetchByRowID]。
解决方案 1.在有备份的情况下可以通过blockrecovery在线修复该问题数据块: RMAN> blockrecover datafile 6 block 12; Starting blockrecover at 18-SEP-11 using target database control file instead of recovery catalog allocated channel: ORA_DISK_1 channel ORA_DISK_1: sid=144 devtype=DISK starting media recovery media recovery complete, elapsed time: 00:00:01 Finished blockrecover at 18-SEP-11 但是请注意如果该逻辑讹误确实是由Oracle Bug引起的话,那么很有可能blockrecover也无能为力,那么可以借鉴第二种方法。   2. 第二种方法针对没有备份可用的数据库或者recover数据块不管用的场景,可以设置10231事件并ctas复制该表,但是这种方法可能会丢失有问题的行记录: SQL> alter session set events '10231 trace name context forever, level 10' SQL> Create table.TABLE_COPY as select * from TABLE;   了解更多关于kdrchk函数的信息:
Add check for continued row piece pointing to itself withcorruption description:"kdrchk: Row piece pointing to itself"DB_BLOCK_CHECKING = MEDIUM will check for row pieces where thenext rowid (nrid) points to itself (chained row points to itself).It produces error ORA-600 [kddummy_blkchk] or ORA-600 [kdBlkCheckError]with check code [6266] (3rd ORA-600 argument).DBVERIFY reports the same corruption description if the block is corrupt on disk.RMAN when run with the CHECK LOGICAL option reports it as     corruption_type=CORRUPT/LOGICAL in v$database_block_corruption."ANALYZE TABLE  VALIDATE STRUCTURE" produces error ORA-1498 and trace fileshows the same corruption description.With this fix in place DBMS_REPAIR can be used to identify and mark the affectedblock as Soft Corrupt producing error ORA-1578 and it can be skipped it for DML'susing DBMS_REPAIR.SKIP_CORRUPT_BLOCKS. [CM][SG][event 1][domain Q423][mem 0] Joining shared group  kdrchk: column length 0 but not null            prow=0x2a97f4d9d6 flag=0x2c column=57  Block Checking: DBA = 29635651, Block Type = KTB-managed data block  data header at 0x2a97f4be7c  kdbchk: bad row tab 0, slot 2  data_block_dump,data header at 0x2a97d113d8  data_block_dump,data header at 0x2a97d113d8 kdrchk: found invalid symbol reference 48  reference to delete symbol  valid symbol range [0,78)  Block Checking: DBA = 411055291, Block Type = KTB-managed data block  data header at 0x68a3f4  kdbchk: bad row tab 0, slot 4  Page 13499 failed with check code 6265kdrchk: C and neither of H or F          prow=0x4282803ae flag=0x41Block Checking: DBA = 322963095, Block Type = KTB-managed data blockdata header at 0x42828007ckdrchk: column length 0 but not null          prow=0x10021035e flag=0x2c column=40Block Checking: DBA = 25189259, Block Type = KTB-managed data blockdata header at 0x10020fe7ckdbchk: bad row tab 0, slot 0Page 23435 failed with check code 6264kdrchk: column length 0 but not null          prow=0x1002122e5 flag=0x2c column=40Block Checking: DBA = 25189260, Block Type = KTB-managed data blockkdrchk:  row is marked as both a Key and Clusteredprow=0xd2bfa981 flag=0xffFile#67, Block#74754kdbchk: bad row tab 0, slot 0kdrchk:  no columns, but has one of P or N          prow=0x934fbffa flag=0x31DIAGNOSTIC ANALYSIS:====================A look at the block dump in the analyze trace file revealed two verysuspicious looking rows:tab 0, row 0, @0x1edetl: 2 fb: --HD---N lb: 0x0tab 0, row 1, @0x1edctl: 2 fb: --HD---N lb: 0x0The flag bytes in these rows look incorrect.
待修订!

转载于:https://www.cnblogs.com/macleanoracle/archive/2013/03/19/2968002.html

你可能感兴趣的文章
http协议和web本质
查看>>
黑客教父郭盛华:大数据时代,黑客渗透测试工程师有多重要?
查看>>
Extjs中GridPanel的各个属性与方法
查看>>
2019春第六周作业
查看>>
ServiceFabric极简文档-1.3删除群集
查看>>
EXT2 文件系统(转)
查看>>
.NET - 代码重构技巧
查看>>
EasyUI - DataGrid 组建 - [ 搜索功能 ]
查看>>
Linux菜鸟起飞之路【七】文件合并、归档和压缩
查看>>
Redis设计与实现 -- 动态字符串对象(SDS)
查看>>
信息体系结构原则之二——有用性目标
查看>>
SQL 测试
查看>>
你了解栈溢出StackOverFloweExeption的原理吗?
查看>>
力扣(LeetCode)125. 验证回文串
查看>>
【转载】Eclipse快捷键
查看>>
RabbitMQ 集群
查看>>
关于docker
查看>>
Git的使用
查看>>
table表格设置边框线为单实线
查看>>
poj 2992 Divisors
查看>>