Wednesday, October 22, 2008

Yet another RMAN bug

One of our RMAN backups failed with:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of delete command at 10/22/2008 00:30:31
RMAN-03014: implicit resync of recovery catalog failed
RMAN-03009: failure of full resync command on default channel at 10/22/2008 00:30:31
ORA-00001: unique constraint (RMANCAT.TF_U2) violated

which is a perfect hit for the bug:

Rman Resync Fails After Adding Temp Data File ORA-00001 on TF_U2 (Doc ID: NOTE:402352.1)


According to the doc, this bug was supposed to be fixed in 10.2.0.2, but we are running 10.2.0.3. The 10.2.0.4 patchset notes specifically list this bug as fixed, so your safest bet is to upgrade to 10.2.0.4.

The dangerous part is the cause of this problem: after you drop a tempfile and recreate a new one with a bigger size, expect your RMAN backup to fail that night with this error if you are on 10.2.0.3 or earlier.

The RMAN catalog table TF has a unique key on ("DBINC_KEY", "TS#", "TS_CREATE_SCN", "FILE#"). It turns out Oracle reused the same FILE# but somehow forgot to use the new TS_CREATE_SCN:

ALTER TABLE "RCAT"."TF" ADD CONSTRAINT "TF_U2" UNIQUE ("DBINC_KEY", "TS#", "TS_CREATE_SCN", "FILE#")

Anyway, it is just something you need to remember after you change your TEMP tablespace. Or yet another reason to stay fully patched on the terminal release.

Update:

It looks like a few other people on the OTN forum have hit the same problem, so let me include some steps to tackle it. Since you need to remove the duplicate record that is causing the error, first you have to identify the problem record.

1. Find out the DBINC_KEY. If your RMAN catalog serves only one database, this is easy, but in most cases you have multiple instances registered, so you need to look up the DBINC_KEY of your instance by DBID. Your DBID is shown when you connect with RMAN:

connected to target database: ENGDB (DBID=620206583)
Or, select dbid from v$database;

select DBID,NAME,RESETLOGS_TIME, DBINC_KEY
from rc_database_incarnation where dbid=620206583

      DBID NAME     RESETLOGS  DBINC_KEY
---------- -------- --------- ----------
 620206583 EDB      06-DEC-06      21822
 620206583 EDB      22-OCT-05      21828

2. Find out the problem file#

select "DBINC_KEY", "TS#", "TS_CREATE_SCN", "FILE#"
from tf where DBINC_KEY=21822;

3. Take note of the values and remove the duplicate record from TF (the table behind the TF_U2 constraint)

After the delete, do a resync catalog from RMAN. With the duplicate record removed, the resync should finish.
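Putting it together, the delete looks something like the following. The values here are just placeholders based on the example above; substitute the stale row you identified in step 2 (the one that no longer matches your current tempfile):

delete from tf
where dbinc_key = 21822
and ts# = 3                    -- placeholder: your TS# from step 2
and file# = 1                  -- placeholder: your FILE# from step 2
and ts_create_scn = 1234567;   -- placeholder: the stale creation SCN
commit;

As always, take an export of the catalog schema before deleting from it directly.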

Monday, October 06, 2008

RMAN backup failed with ORA-01400

The backup of one of our production servers suddenly had the following errors in the log:

RMAN-03014: implicit resync of recovery catalog failed
RMAN-03009: failure of partial resync command on default channel at 10/06/2008 10:54:54
ORA-01400: cannot insert NULL into ("RMANCAT"."ROUT"."ROUT_SKEY")

It looks like this is a hit of Oracle bug 5528078:

ORA-01400: CANNOT INSERT NULL INTO ("RMAN"."ROUT"."ROUT_SKEY")

The strange thing is that I hadn't made any changes in production lately, so I am not quite sure what event triggered this bug.

The workaround involves changing the script $ORACLE_HOME/rdbms/admin/recover.bsq and then running UPGRADE CATALOG.

This error usually happens after database migration.
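For reference, the UPGRADE CATALOG part of the workaround is run from the RMAN client after recover.bsq has been patched per the bug note. The connection details below are made up; RMANCAT is the catalog owner from the error above:

rman catalog rmancat/password@catdb
RMAN> upgrade catalog;
RMAN> upgrade catalog;

RMAN prompts you to enter the command a second time to confirm the upgrade.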

Friday, October 03, 2008

Strange Temporary Tablespace problem

Yesterday morning, a user from the Application group sent me an email about a failed production loading procedure.

The error was
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

From the error itself, it's easy to believe this is a TEMP tablespace space issue, or that the procedure is doing a large sort operation. However, this production database has a 32G temporary tablespace and the combined production data is only around 10G.

So it was more likely that something had gone wrong than a real space problem, and this procedure had been running fine before.

With the help of OEM Grid Control and an AWR report snapshot, I quickly found the culprit query:

SELECT A.PRODUCT_ID, A.VENDOR_ID, A.PROD_AREA,
       B.ALT_GROUP, B.GRADE_SET, B.GRADE,
       C.BOG_ID, C.AT_STEP, C.STEP_PRIORITY, C.POWER_SPEED, C.ROUTE, C.TOPMARK,
       CFI, MIN_LOT_SIZE MINLOT, MAX_LOT_SIZE MAXLOT,
       STD_LOT_SIZE STDLOT, INCR_LOT_SIZE INCRLOT
  FROM MDMSCP.SAT_PRODUCT_VENDOR A, MDMSCP.SAT_BOM B, TMP_SAT_BOG C
 WHERE A.PRODUCT_ID = B.PARENT_PART_ID
   AND B.CHILD_PART_ID = C.BOG_ID
   AND B.ALT_GROUP = C.NAME
 ORDER BY A.PRODUCT_ID, B.GRADE DESC, STEP_PRIORITY

However, it was hard for me to make sense of the problem at first. The execution plan revealed that the optimizer had chosen a very bad path for this particular query: instead of joining A and B on the correct condition, it used a MERGE JOIN CARTESIAN to join A and C first. That went terribly wrong. With 150K records in each table, Oracle was merging a whopping 22,500,000,000 (150K x 150K) rows, which easily defeated our temporary tablespace.
From the wrong plan I noticed that the optimizer somehow thought table A had only 1 row. A check of the statistics revealed that both A and B had stale statistics reporting them as empty tables, so the optimizer just did whatever it liked.
After collecting fresh statistics, the execution plan made a lot more sense: it joined A and B first and referred to C later.
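The check and the fix were nothing fancy; a sketch along these lines (table names come from the query above, and the DBMS_STATS parameters shown are typical defaults, not necessarily exactly what I ran):

select table_name, num_rows, last_analyzed
from dba_tables
where owner = 'MDMSCP'
and table_name in ('SAT_PRODUCT_VENDOR', 'SAT_BOM');

begin
  dbms_stats.gather_table_stats(ownname => 'MDMSCP', tabname => 'SAT_PRODUCT_VENDOR', cascade => TRUE);
  dbms_stats.gather_table_stats(ownname => 'MDMSCP', tabname => 'SAT_BOM', cascade => TRUE);
end;
/

The num_rows column showing 0 for a table you know holds 150K rows is the giveaway that statistics are stale.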

It again proved how important it is to have correct statistics collected for your schema. Otherwise even a small query can screw up your database big time.

P.S. While investigating this issue, I came across Jonathan Lewis's good write-up about MERGE JOIN CARTESIAN:

http://jonathanlewis.wordpress.com/2006/12/13/cartesian-merge-join/