Wednesday, October 22, 2008

Yet another RMAN bug

One of our RMAN backups failed with

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of delete command at 10/22/2008 00:30:31
RMAN-03014: implicit resync of recovery catalog failed
RMAN-03009: failure of full resync command on default channel at 10/22/2008 00:30:31
ORA-00001: unique constraint (RMANCAT.TF_U2) violated

This is a perfect hit for the following bug:

Rman Resync Fails After Adding Temp Data File ORA-00001 on TF_U2
Doc ID:
NOTE:402352.1


According to the doc, this bug was supposed to be fixed in 10.2.0.2, but we are running 10.2.0.3.
The 10.2.0.4 patchset note specifically mentions this bug in its fixed list, so your safest bet is to upgrade to 10.2.0.4.

The dangerous part is what triggers the problem: after you drop a tempfile and recreate a new, bigger one, expect your RMAN backup to fail that night with this error if you are on 10.2.0.3 or earlier.

The RMAN catalog table TF has a unique key on ("DBINC_KEY", "TS#", "TS_CREATE_SCN", "FILE#"). It turns out Oracle reuses the same FILE# but somehow forgets to use the new TS_CREATE_SCN.

ALTER TABLE "RCAT"."TF" ADD CONSTRAINT "TF_U2" UNIQUE ("DBINC_KEY", "TS#", "TS_CREATE_SCN", "FILE#")

Anyway, this is just something to remember after you change your TEMP tablespace. Or yet another reason to stay fully patched to the terminal release.

Update:

It looks like a few other people on the OTN forum have hit the same problem, so let me include some steps to tackle it. Since you need to remove the duplicate record that is causing the error, you first have to identify the problem record.

1. Find out the DBINC_KEY. If your RMAN catalog serves only one database this is easy, but in most cases you have multiple instances registered, so you need to find the DBINC_KEY of your instance by DBID. Your DBID is shown when you connect with RMAN:

connected to target database: ENGDB (DBID=620206583)
Or, select dbid from v$database;

select DBID,NAME,RESETLOGS_TIME, DBINC_KEY
from rc_database_incarnation where dbid=620206583

DBID       NAME     RESETLOGS DBINC_KEY
---------- -------- --------- ----------
 620206583 EDB      06-DEC-06     21822
 620206583 EDB      22-OCT-05     21828

2. Find out the problem file#

select "DBINC_KEY", "TS#", "TS_CREATE_SCN", "FILE#"
from tf where DBINC_KEY=21822;

3. Take note of the values and remove the duplicate record from the TF table (the table the TF_U2 constraint is defined on).

After the delete, do a resync catalog from RMAN. With the duplicate record removed, the resync should finish; a sketch of the delete follows.
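
For illustration, here is a minimal sketch of step 3, assuming the catalog owner is RMANCAT (as in the error above) and that the values noted in step 2 were TS#=3 and FILE#=201 (both placeholders):

-- connected to the catalog database as the catalog owner
-- TS# and FILE# are placeholders; use the values you noted in step 2
-- this removes the stale TF row so the resync can insert the new one
DELETE FROM rmancat.tf
 WHERE dbinc_key = 21822
   AND ts#       = 3
   AND file#     = 201;
COMMIT;

Then run RESYNC CATALOG from RMAN as described above.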

Monday, October 06, 2008

RMAN backup failed with ORA-01400

The backup of one of our production servers suddenly started showing the following errors in the log:

RMAN-03014: implicit resync of recovery catalog failed
RMAN-03009: failure of partial resync command on default channel at 10/06/2008 10:54:54
ORA-01400: cannot insert NULL into ("RMANCAT"."ROUT"."ROUT_SKEY")

This looks like a hit of the following Oracle bug:

Bug No:5528078

ORA-01400: CANNOT INSERT NULL INTO ("RMAN"."ROUT"."ROUT_SKEY")

The strange thing is that I hadn't made any changes in production lately, so I am not quite sure what triggered this bug.

The workaround involves editing the script $ORACLE_HOME/rdbms/admin/recover.bsq and then running UPGRADE CATALOG.

This error usually happens after database migration.
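
For reference, the catalog-upgrade half of the workaround looks roughly like this once recover.bsq has been edited per the note; RMAN asks you to type the command a second time to confirm, which is why it appears twice:

rman catalog rmancat/rman@rman

RMAN> UPGRADE CATALOG;
RMAN> UPGRADE CATALOG;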

Friday, October 03, 2008

Strange Temporary Tablespace problem

Yesterday morning, a user from the application group sent me an email about a failed production loading procedure.

The error was
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

From the error itself, it's easy to believe this is a TEMP tablespace space issue, or that the procedure is doing a large sort. However, this production database has a 32G temporary tablespace while the combined production data is only around 10G.

So it was more likely that something had gone wrong than a real space problem, and the procedure had been running fine before.

With the help of OEM Grid Control and an AWR report snapshot, I quickly found the culprit query:

SELECT A.PRODUCT_ID, A.VENDOR_ID, A.PROD_AREA,
       B.ALT_GROUP, B.GRADE_SET, B.GRADE,
       C.BOG_ID, C.AT_STEP, C.STEP_PRIORITY, C.POWER_SPEED, C.ROUTE, C.TOPMARK,
       CFI, MIN_LOT_SIZE MINLOT, MAX_LOT_SIZE MAXLOT, STD_LOT_SIZE STDLOT, INCR_LOT_SIZE INCRLOT
FROM   MDMSCP.SAT_PRODUCT_VENDOR A, MDMSCP.SAT_BOM B, TMP_SAT_BOG C
WHERE  A.PRODUCT_ID = B.PARENT_PART_ID
AND    B.CHILD_PART_ID = C.BOG_ID
AND    B.ALT_GROUP = C.NAME
ORDER  BY A.PRODUCT_ID, B.GRADE DESC, STEP_PRIORITY

However, it was hard to make sense of the problem at first. The execution plan revealed that the optimizer had chosen a very bad path for this particular query. Instead of joining A and B on the correct condition, the optimizer used a MERGE JOIN CARTESIAN to join A and C first. That went terribly wrong: with about 150K records in each table, Oracle was merging a whopping 22,500,000,000 rows, which easily defeated our temporary tablespace.
From the wrong plan I noticed that the optimizer somehow thought table A had only 1 row. A check of the statistics revealed that both A and B had wrong statistics reporting them as empty tables, so the optimizer simply did what those numbers told it.
After collecting fresh statistics, the execution plan made a lot more sense: it joins A and B first and refers to C later.
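
For completeness, a minimal sketch of the kind of statistics collection that fixed it, assuming the two tables live in the MDMSCP schema as in the query above (the exact options may differ from what I actually ran):

BEGIN
  -- refresh optimizer statistics on the two mis-reported tables
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'MDMSCP', tabname => 'SAT_PRODUCT_VENDOR', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'MDMSCP', tabname => 'SAT_BOM', cascade => TRUE);
END;
/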

This proves once again how important it is to have correct statistics collected for your schema. Otherwise even a small query can screw up your database big time.

P.S. While investigating this issue, I came across Jonathan Lewis's good write-up about MERGE JOIN CARTESIAN:

http://jonathanlewis.wordpress.com/2006/12/13/cartesian-merge-join/

Wednesday, July 02, 2008

Large TCP Socket (KGAS) event wait

One of our dev databases shows a large number of TCP Socket (KGAS) event waits when a certain piece of PL/SQL code runs.

SYS@dev>
select active_session_history.event,
sum(active_session_history.wait_time +
active_session_history.time_waited) ttl_wait_time
from v$active_session_history active_session_history
where active_session_history.sample_time between
sysdate - 120/2880 and sysdate
group by active_session_history.event
order by 2 desc;

EVENT TTL_WAIT_TIME
------------------- -------------
TCP Socket (KGAS) 843316255
log file sync 1912981
.....
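
To see which statements are behind the waits, a similar ASH query grouped by SQL_ID helps (a sketch along the same lines as the query above):

select ash.sql_id,
sum(ash.wait_time + ash.time_waited) ttl_wait_time
from v$active_session_history ash
where ash.event = 'TCP Socket (KGAS)'
and ash.sample_time between sysdate - 120/2880 and sysdate
group by ash.sql_id
order by 2 desc;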

I checked the Oracle reference for the TCP Socket (KGAS) wait event; it says:

KGAS is a component in the server which handles TCP/IP sockets which is typically used in dedicated connections in 10.2+ by some
PLSQL built in packages such as UTL_HTTP and UTL_TCP.

However, this particular piece of code does not call any such package. Momen blogged about the same event when he was using the SMTP package, but that doesn't seem to apply to us either.

http://momendba.blogspot.com/2007/03/tcp-socket-kgas-wait-event.html

I then looked into Metalink and found this doc:

''TCP Socket (Kgas)'' Waits Present in 10.2
Doc ID:
Note:416451.1


It basically says this event merely reports some network-related activity and does not threaten performance. The conclusion is that the event can be safely ignored :D

Well, I hope Oracle fixes this in 10.2.0.4 and 11g so that the event is reported in a more DBA-comforting way.

Monday, June 30, 2008

Oracle RMAN bug

I just hit an Oracle RMAN bug while revising one of my RMAN backup scripts. I had a typo in my ORACLE_SID setting, so RMAN started against a target database that was not started, and the script subsequently issued

sql "alter system switch logfile";

which crashed RMAN with an RMAN-00600 internal error:

corpdb 15 oracle %setenv ORACLE_SID ctest
corpdb 16 oracle %rman catalog rmancat/rman@rman target /

Recovery Manager: Release 10.2.0.3.0 - Production on Mon Jun 30 18:10:10 2008
Copyright (c) 1982, 2005, Oracle. All rights reserved.

connected to target database (not started)
connected to recovery catalog database

RMAN> sql "alter system switch logfile";

sql statement: alter system switch logfile

DBGANY: CMD type=sql id=1 status=NOT STARTED
DBGANY: 1 STEP id=1 status=NOT STARTED chid=default
DBGANY: 1 TEXTNOD = -- sql
DBGANY: 2 TEXTNOD = begin
DBGANY: 3 TEXTNOD = krmicd.execSql(
DBGANY: 4 PRMVAL = stmt=>'alter system switch logfile'
DBGANY: 5 TEXTNOD = );
DBGANY: 6 TEXTNOD = end;
RMAN-00571:===========================================================
RMAN-00569:=============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571:===========================================================
RMAN-00601: fatal error in recovery manager
RMAN-03004: fatal error during execution of command
RMAN-00600: internal error, arguments [6000] [] [] [] []
corpdb 17 oracle %


Running other RMAN commands in this state results in the following, more reasonable errors instead:

RMAN>
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of delete command at 06/30/2008 18:04:48
RMAN-06403: could not obtain a fully authorized session
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
HPUX-ia64 Error: 2: No such file or directory

Thursday, June 12, 2008

ORA-01882: timezone region %s not found

ORA-01882: timezone region %s not found

I got this error while running

select * from dba_scheduler_jobs;

The error message itself turns out to be not very informative:

01882, 00000, "timezone region %s not found"
// *Cause: The specified region name was not found.
// *Action: Please contact Oracle Customer Support.

A little research on Metalink solved the problem. Metalink has a doc that specifically explains how to fix this error. In short, the error occurs because 7 timezone region IDs changed in timezone file version 3 and above; if you have old timezone data from version 2 that uses one of these IDs, the error is raised.
The doc provides a convenient script to fix the problem. After running the script the problem was gone. For more information check:

Time Zone IDs for 7 Time Zones Changed in Time Zone Files Version 3 and Higher, Possible ORA-1882 After Upgrade
Doc ID: Note:414590.1
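
As a quick first check before diving into the note, you can confirm which timezone file version the database is actually using (a minimal sketch):

-- shows the timezone file name and version loaded by the instance
SELECT * FROM v$timezone_file;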

Wednesday, May 21, 2008

Oracle licensing on multicore processor

With the increasing popularity of multi-core processors nowadays, it's unclear to a lot of people, including DBAs, how Oracle calculates the required licenses for a given server configuration.
A quick guideline: for a single server, one core carries a processor licensing factor of 0.50, so a quad-core CPU requires 2 processor licenses.
However, there is a catch: this equation does not apply to more than one multicore server, for example a RAC setup. Two multicore servers installed and/or running the program on 8 cores in total need 8 multiplied by a core processor licensing factor of 0.75, which equals 6. So instead of 4 processor licenses, this configuration requires 6.
Also note that when licensing Oracle programs with Standard Edition One or Standard Edition in the product name, a processor is counted as equivalent to an occupied socket; however, in the case of multi-chip modules, each chip in the multi-chip module is counted as one occupied socket.

Please refer to the following licensing document from Oracle for more details.

Oracle Licensing

Tuesday, May 06, 2008

ORA-01466 and flashback query

Someone brought this error up in an OTN forum post.

I did a quick search on google and found this,
http://www.adp-gmbh.ch/ora/err/ora_01466.html

It seems a viable explanation. However, I really needed some sort of reference to verify the 5-minute time frame. Why 5 minutes, and not 1 minute, for example?

Then I found metalink note,

Error ORA-01466 while executing a flashback query.
Doc ID:
Note:281510.1

It explains everything.
The reason is that smon_scn_time is updated only every 5 minutes in 9i. Note that in 10g, smon_scn_time is updated every 6 seconds, which is much more reasonable and makes it very unlikely to hit the error.
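
For context, this is the kind of flashback query that can raise ORA-01466 on 9i when the SCN-to-time mapping is still too coarse (the table name is just a placeholder):

-- flashback query a couple of minutes into the past; on 9i this can raise
-- ORA-01466 if the table was created or altered within that coarse window
SELECT *
FROM my_table AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '2' MINUTE);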

Friday, May 02, 2008

ORA-03115 unsupported network datatype

ORA-03115: unsupported network datatype or representation

This is an annoying error that came and went during one of our implementation projects. The application uses a utility that connects to the database from Excel; the client is version 9.2.0.1 and the database is 10.2.0.3.

After a quick search on the internet, the Metalink note
OCI Application Errors with ORA-3115
Doc ID: Note:460498.1

seemed to be a hit. After I applied the workaround, the problem seems to have gone away for now. The things I don't like about this are:
  1. It applies not only between 9i and 10g; according to the note, the problem exists between 10.2.0.3 and 11g as well.
  2. The workaround is to set CURSOR_SHARING to EXACT (sketched below), which is not the preferred setting if you have a lot of customized SQL running on your database.
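
A minimal sketch of the workaround from item 2, if you want it instance-wide rather than per session:

-- the note's workaround: turn off literal replacement
ALTER SYSTEM SET cursor_sharing = EXACT;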

It looks like Oracle does not have a patch specifically for this bug yet. If anyone knows otherwise, please drop a line.

Monday, March 31, 2008

ORA-39082 error while import using data pump

When you use Data Pump import (impdp) for a database migration, you will sometimes get ORA-39082 errors, for example:

ORA-39082: Object type ALTER_PROCEDURE:"DX"."UPDATE_MKT"
created with compilation warnings
ORA-39082: Object type ALTER_PROCEDURE:"DX"."UPDATE_SALES"
created with compilation warnings
ORA-39082: Object type ALTER_PROCEDURE:"DX"."CHECK_CURRENT"
created with compilation warnings
ORA-39082: Object type ALTER_PROCEDURE:"DX"."DELETE_PAST"
created with compilation warnings

After the import, when you issue

alter procedure UPDATE_MKT compile;
alter procedure DELETE_PAST compile;
etc.

the procedures compile successfully without any errors or warnings.

The reason is that Data Pump import creates procedures before views; if your procedures depend on views, you will get these ORA-39082 compilation warnings during the import.

Even though the problem can easily be fixed by recompiling all procedures after the import, it is a small inconvenience for DBAs who need to do DB refreshes regularly.
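
A minimal sketch of the post-import recompile, assuming the affected schema is DX as in the example above (run it as a suitably privileged user, e.g. SYS):

-- recompile all invalid objects in the DX schema after the import
EXEC UTL_RECOMP.RECOMP_SERIAL('DX');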

Or perhaps this is Oracle's way of telling us not to reference views in procedures :)

There are also bugs that return a similar error:

Impdp Returns ORA-39082 When Importing Wrapped Procedures

Doc ID:
Note:460267.1

Tuesday, March 04, 2008

update gone wild

Our company is currently implementing a supply chain application to replace our outdated I2 system. Needless to say, I am the person who gets to set up and maintain the Oracle database for the project; the Oracle version is 10.2.0.3. The consultants kept having performance problems with one of the procedures they built. I spent some time looking into it and found out how easy it is to screw up an Oracle database.

The first problem was a runaway session: the procedure had some sort of endless loop and was doing massive updates without committing, and at one point the undo tablespace grew all the way up to 11G. After the session was killed, Oracle needed to recover all of those 1.4 million undo blocks. As you would expect, that is not a fast process and you literally cannot skip it in any way (unless you drop and recreate the database, of course). My instance was recovering at a rate of roughly 150 blocks/sec. You can check the recovery progress by querying the view v$fast_start_transactions. It felt kind of slow to me; I am not sure whether other people who have had a similar experience can share theirs.
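
A minimal sketch of the progress check I mean:

-- percentage of dead-transaction undo already applied during recovery
SELECT state,
       undoblocksdone,
       undoblockstotal,
       ROUND(undoblocksdone / undoblockstotal * 100, 1) AS pct_done
FROM v$fast_start_transactions;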

The interesting thing was that while SMON was doing the recovery, the consultant kept trying to access the database from the application. Whenever the application touched the affected tables, SMON fired up a whole bunch of parallel processes in an effort to speed up the rollback; this is called parallel rollback. In this case, however, it did not help the situation at all. On the contrary, with so many processes trying to recover the same transaction, they created quite some contention on undo blocks. Each of these processes, plus SMON, spent a lot of time waiting for the others, and buffer busy waits events on undo blocks skyrocketed. Parallel rollback mostly improves performance for parallel transaction recovery; in a case like this, it is better to disable it with

alter system set fast_start_parallel_rollback = false;

The second problem was an update statement that kept hanging the session and doing all sorts of crazy things in the database. This simple update generated millions of buffer gets and a huge number of db file scattered read and latch: cache buffers chains events. With the help of some queries,

select CHILD# "cCHILD",
ADDR "sADDR",
GETS "sGETS",
MISSES "sMISSES",
SLEEPS "sSLEEPS"
from v$latch_children where name = 'cache buffers chains'
order by 5, 1, 2, 3 desc

select /*+ RULE */
       e.owner || '.' || e.segment_name segment_name,
       e.extent_id extent#,
       x.dbablk - e.block_id + 1 block#,
       x.tch,
       l.child#
from sys.v$latch_children l,
     sys.x$bh x,
     sys.dba_extents e
where x.hladdr = 'ADDR' and    -- 'ADDR' = latch address picked from the previous query
      e.file_id = x.file# and
      x.hladdr = l.addr and
      x.dbablk between e.block_id and
      e.block_id + e.blocks - 1
order by x.tch desc;

The culprit object was quickly discovered, but why?
After a quick analyze on the table, the problem went away.
I suspect this is some sort of Oracle bug. The table causing the problem has only about 20k rows. It was truncated and reloaded earlier in the procedure; however, after the truncate Oracle reset this table's statistics and did not update them after the insert. When the update kicked in, the optimizer was working from the wrong information and went into some dead-loop situation. For now I have suggested that the consultants gather statistics after each insert as a workaround, but I would like to find a more organic and permanent solution from Oracle.
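
As a sketch of that workaround, the call the procedure would make right after reloading its work table might look like this (WORK_TABLE is a placeholder, not the real table name):

BEGIN
  -- refresh statistics immediately so the following UPDATE gets a sane plan
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'WORK_TABLE');
END;
/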

Reference,

How To Identify a Hot Block Within The Database Buffer Cache.
Doc ID:
Note:163424.1


LATCH: CACHE BUFFERS CHAINS
Doc ID:
Note:42152.1


Parallel Rollback may hang database, Parallel query servers get 100% cpu
Doc ID:
Note:144332.1

Tuesday, February 26, 2008

Bypass buffer cache for Full Table Scans

PayPal DBA Saibabu mentioned an interesting undocumented parameter that bypasses the buffer cache for full table scans in his blog http://sai-oracle.blogspot.com/

alter session set "_serial_direct_read" = true;

I found it particularly useful in an OLTP environment where you occasionally need to run a full-table-scan query against a large table.
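
A minimal usage sketch (the report query is hypothetical; since this is an undocumented parameter, test it carefully and reset it when you are done):

alter session set "_serial_direct_read" = true;

-- the occasional big full-table-scan report
select count(*) from big_history_table;

alter session set "_serial_direct_read" = false;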

Thursday, February 21, 2008

Install Oracle 10gR2 on RHEL5 or OEL5

It has come to my attention that a lot of people are still having problems installing Oracle 10gR2 on Red Hat Enterprise Linux / Oracle Enterprise Linux 5.

Actually, Oracle has already posted a series of Metalink notes covering all aspects of the issue. By following the procedures listed in these notes you should get a successful installation.

The notes link to each other, so you shouldn't have any trouble finding all of them once you have found one.

Requirements For Installing Oracle10gR2 On RHEL/OEL 5 (x86_64)
Doc ID: Note:421308.1

Note 376183.1 - Defining a "default RPMs" installation of the RHEL OS
Note 419646.1 - Requirements For Installing Oracle 10gR2 On RHEL5 (x86)
Note 456634.1 - Installer Is Failing on Prereqs for Redhat-5 - RHEL5