Thursday, October 25, 2012

AIX ASM Diskgroup reporting wrong Lun size


  We discovered this strange problem while trying to convert one of our big databases from file system to ASM. After we assigned a 1 TB LUN to AIX and created a disk group on top of it, ASM reported the wrong disk size: it detected only about 100 GB of the 1 TB.
  After poking around to make sure we hadn't made a mistake provisioning the storage to the OS, I did some research on Metalink. Sure enough, this is a known bug in ASM on AIX:

Bug 9495887 AIX: ASM does not recognize correct diskgroup size for large disks

-------------------------------------------------------------------------------
Device         Size (GB)  Paths  Vol Name       Vol Id   XIV Id   XIV Host     
-------------------------------------------------------------------------------
/dev/hdisk4    1135.7     5/5    hedata16     113      7825812  heproddb102


NAME            PATH              GROUP_NUMBER   TOTAL_MB    FREE_MB READS WRITES 
--------------- ----------------- ------------ ---------- ---------- ----- ------ 
TEST_0000       /dev/rhdisk4                 5     101920     101800    60     10 

The suggested workaround is to create the disk group with an explicit size, like the following:

SQL> create diskgroup DATA external redundancy disk '/dev/rhdisk4' size 1135G;
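
After recreating the disk group, it is worth checking that ASM now sees the full LUN. A minimal sketch of such a check from the ASM instance, assuming 11g or later (where V$ASM_DISK exposes OS_MB, the disk size ASM obtained from the operating system); the path filter is simply the device from the example above:

```sql
-- Compare the size ASM read from the OS with the size it actually uses.
-- OS_MB assumes an 11g+ ASM instance; drop that column on older releases.
select name, path, group_number, os_mb, total_mb, free_mb
from   v$asm_disk
where  path = '/dev/rhdisk4';
```

With the workaround in place, TOTAL_MB should be close to the value given in the SIZE clause rather than 101920.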

Frankly, it's quite surprising that we still hit such basic bugs in ASM on AIX three years after its release. A 1 TB disk is hardly a large disk nowadays. I guess AIX is just an unpopular OS for Oracle installations, with a very small customer base. And I can totally understand why.






Sunday, November 20, 2011

GATHER_TABLE_STATS and ORA-01652

We recently hit ORA-01652 on one of our production databases. The culprit was the GATHER_TABLE_STATS job. By the way, this DB is 10.2.0.4.



ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
*** 2011-11-20 20:28:51.105
GATHER_STATS_JOB: GATHER_TABLE_STATS('"L53"','"L_CARD"','""', ...)
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

Originally the job errored out and paged at 4 AM, which is really not a preferred time for the on-call DBA.
Since this DB tends to have a higher load during the early morning anyway, I moved the maintenance window to the late afternoon.
To change the run time of the automatic statistics collection job, use this command:


BEGIN
  -- Reschedule the automatic statistics job to run every day at 17:00
  DBMS_SCHEDULER.SET_ATTRIBUTE (
    name      => 'GATHER_STATS_JOB',
    attribute => 'repeat_interval',
    value     => 'freq=daily;byday=SUN,MON,TUE,WED,THU,FRI,SAT;byhour=17;byminute=0;bysecond=0');
END;
/
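
To confirm the change was picked up, the job's schedule can be read back from the scheduler dictionary view. A small sketch, assuming the job is owned by SYS as it is on a default 10.2 install:

```sql
-- Show how and when the automatic stats job is scheduled to run next
select job_name, schedule_name, repeat_interval, next_run_date, state
from   dba_scheduler_jobs
where  owner = 'SYS'
and    job_name = 'GATHER_STATS_JOB';
```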




However, this apparently didn't address the root cause of the issue. A couple of days later the job failed again with the same error.

Increasing the TEMP tablespace was not an option. The TEMP tablespace on this DB is 95 GB; the job ran for 3 hours and used all of it. Adding more TEMP would only delay the inevitable.
I decided to change the estimate percent from auto sampling to 1%, and this fixed the issue. I searched Google for this, but there wasn't much useful past discussion.
The one thing I found really helpful was this Ask Tom thread, which points to the right calls for changing the defaults used by the GATHER_TABLE_STATS job:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:652425700346984666
ops$tkyte%ORA10GR2> select dbms_stats.get_param( 'estimate_percent' ) from dual;

DBMS_STATS.GET_PARAM('ESTIMATE_PERCENT')
-------------------------------------------------------------------------------
DBMS_STATS.AUTO_SAMPLE_SIZE

reset or set them with these:

http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/d_stats.htm#i1047505

http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/d_stats.htm#i1048566
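
For reference, the change to 1% can be sketched with the 10.2 DBMS_STATS default-parameter procedures. This is a minimal sketch, assuming the automatic job picks up the DBMS_STATS defaults and that 1% sampling is acceptable for your data; note that the value is passed as a string.

```sql
-- Check the current default (AUTO_SAMPLE_SIZE out of the box on 10.2)
select dbms_stats.get_param('ESTIMATE_PERCENT') from dual;

-- Make 1% the default sample size used when no estimate_percent is passed
exec dbms_stats.set_param('ESTIMATE_PERCENT', '1');

-- To go back to Oracle's defaults for all parameters later on
exec dbms_stats.reset_param_defaults;
```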


Wednesday, September 07, 2011

How to connect wireless router to another wireless router

This post is not related to Oracle. It's just a hint for a problem I ran into while setting up my home network by connecting a new wireless router to an existing one.

In the beginning, the task seemed super easy, a real no-brainer: just connect the 'Internet' port of the new router to any local Ethernet port on the existing router, set up the new router, and voilà!

Oh well, it didn't work. The new router kept complaining that it was not connected to the internet. Its Internet setup page showed that it got 127.0.0.1 (localhost) as its DHCP address from the old router, which obviously doesn't work. Some network gurus have probably already figured out the problem at this point.

So why did it get 127.0.0.1 instead of a valid DHCP lease? Well, the catch is that most routers use the 192.168.1.1 address and subnet by default. If both routers share the same address, the new one ends up with localhost as its address, thinking it is 192.168.1.1 itself.

The solution is easy: change the new router's default subnet to 192.168.2.1 or similar, or change the new router's IP to something like 192.168.1.10.

Monday, August 29, 2011

ORA-01555 with Query Duration=0 sec

Most DBAs know that ORA-01555 is caused by long-running queries, and the alert.log tells you which SQL hit ORA-01555 and how long it had been running.
However, from time to time you will see errors like the one below, which basically says the query failed right away. So why is that the case?


Mon Aug 29 06:39:09 2011
ORA-01555 caused by SQL statement below (SQL ID: 0jc2g6km899ps, Query Duration=0 sec, SCN: 0x00ae.75483a06):
Mon Aug 29 06:39:09 2011
SELECT.xxxx

I ran a query to find the timestamp of this query's SCN and found that it maps to 6 AM, yet the query failed 40 minutes later. That could only mean one thing: the query was part of a transaction that started at 6 AM, and Oracle had already overwritten the data it needed in UNDO.

SYS@VAULTPROD>select scn_to_timestamp(749291977222) from dual;
SCN_TO_TIMESTAMP(749291977222)
---------------------------------------------------------------------------
29-AUG-11 06.00.01.000000000 AM
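
The decimal SCN passed to SCN_TO_TIMESTAMP comes from the hexadecimal pair in the alert log: the part before the dot is the SCN wrap and the part after it is the SCN base. A small sketch of the conversion, using the values from the error above:

```sql
-- SCN = wrap * 2^32 + base, both taken from "SCN: 0x00ae.75483a06"
select to_number('00ae', 'xxxx') * power(2, 32)
     + to_number('75483a06', 'xxxxxxxx') as scn
from   dual;
-- returns 749291977222
```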


There are a couple of ways to improve the situation.
  • Do all the statements in this job need to be in a single transaction? If not, don’t put them into one.
  • Increase the undo retention of the DB. Oracle will try to honor this retention, subject to available UNDO space.
  • Increase the UNDO tablespace to mitigate a potential space squeeze, but remember that the reason we got this error was not an UNDO space limitation.
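
For the undo retention option, here is a rough sketch of how one might size and apply it. The 3600-second value is only an illustration, not the setting used on this database; it would need to exceed the length of the longest transaction (about 40 minutes in the case above). V$UNDOSTAT reports, per 10-minute interval, the longest query seen and the retention Oracle actually tuned to.

```sql
-- Longest-running query and tuned retention across the recorded intervals
select max(maxquerylen)         as longest_query_sec,
       max(tuned_undoretention) as tuned_retention_sec
from   v$undostat;

-- Raise the retention target (in seconds); example value only
alter system set undo_retention = 3600;
```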

Wednesday, March 09, 2011

Data pump expdp failed with DMSYS related errors ORA-39126 ORA-06512 etc

Today one of our Data Pump export/import jobs failed with the errors attached at the bottom. The process had been working fine before our 11g upgrade.
I did a little research and found that Metalink doc 304449.1 has the perfect solution.


The problem is that we removed some unused database options before upgrading from 10g to 11g.
We did that because, with all those unnecessary options installed, the upgrade scripts run for almost two hours. With them removed, the upgrade finishes in 15 minutes.


It turns out the DMSYS Data Mining option was among them, but somehow Oracle didn't remove it cleanly and left some records behind in the Data Pump export table.


The solution in this case is to delete these records:


SQL> DELETE FROM exppkgact$ WHERE SCHEMA='DMSYS';
SQL> commit;
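
Since this deletes from a SYS-owned dictionary table, it seems prudent to look at the rows and keep a copy before running the DELETE. A cautious sketch, run as SYSDBA; the backup table name is just a placeholder:

```sql
-- See what is about to be removed
select * from sys.exppkgact$ where schema = 'DMSYS';

-- Keep a copy of the table before touching it (table name is arbitrary)
create table sys.exppkgact$_backup as select * from sys.exppkgact$;
```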



There are other potential causes for the same error; check the Metalink doc for more info.


Database Data Pump Export fails with PLS-00201 identifier DMSYS.DBMS_MODEL_EXP must be declared [ID 304449.1]


Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "SYSTEM"."SYS_IMPORT_SCHEMA_11":  userid=system/********@TEST parfile=/home/oracle/dba/sql/DWS.par 
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
ORA-39126: Worker unexpected fatal error in KUPW$WORKER.GET_TABLE_DATA_OBJECTS [] 
ORA-31642: the following SQL statement fails: 
BEGIN "DMSYS"."DBMS_DM_MODEL_EXP".SCHEMA_CALLOUT(:1,0,1,'11.02.00.00.00'); END;
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 86
ORA-06512: at "SYS.DBMS_METADATA", line 1245
ORA-04063: package body "DMSYS.DBMS_DM_MODEL_EXP" has errors
ORA-06508: PL/SQL: could not find program unit being called: "DMSYS.DBMS_DM_MODEL_EXP"
ORA-06512: at "SYS.DBMS_METADATA", line 5300
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 86
ORA-06512: at "SYS.KUPW$WORKER", line 8159
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
70000007ddbc258     19028  package body SYS.KUPW$WORKER
70000007ddbc258      8191  package body SYS.KUPW$WORKER
70000007ddbc258     12728  package body SYS.KUPW$WORKER
70000007ddbc258      4618  package body SYS.KUPW$WORKER
70000007ddbc258      8902  package body SYS.KUPW$WORKER
70000007ddbc258      1651  package body SYS.KUPW$WORKER
70000007eaf9060         2  anonymous block

Monday, January 31, 2011

Oracle won't do partition pruning on MAX/MIN query of partition key.


  Oracle doesn’t do partition pruning for a MAX/MIN query on the partition key, even though it would make perfect sense for Oracle to scan only the partition that holds the MAX/MIN value. This is nothing new; the user community has certainly noticed it.

http://www.oramoss.com/blog/2009/06/no-pruning-for-minmax-of-partition-key.html

  Right now, all we can do is work around it. For example, one of our databases uses the query below to find the MAX AGG_DATE as part of the daily ETL process. AGG_DATE is the partition key of the table and is not indexed.
The old execution plan looks like this. Ouch: Pstart is 1 and Pstop is 1149, so Oracle scanned all 1149 partitions of the table, which took a very long time, as expected.


SQL>  explain plan for SELECT max(AGG_DATE) from (SELECT "A1"."AGG_DATE" FROM "WEB_APPS"."COUNTER_DAY_AGG" "A1" order by AGG_DATE desc );
Explained.
SQL>  select * from table(dbms_xplan.display());


PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------
Plan hash value: 4125776214

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name                       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                            |     1 |     8 |    10M  (2)| 34:45:45 |       |       |
|   1 |  SORT AGGREGATE      |                            |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE ALL|                            |  5196M|    38G|    10M  (2)| 34:45:45 |     1 |  1149 |
|   3 |    TABLE ACCESS FULL | COUNTER_DAY_AGG            |  5196M|    38G|    10M  (2)| 34:45:45 |     1 |  1149 |
-------------------------------------------------------------------------------------------------------------------


Since this is our daily job, the workaround I put in is a WHERE clause.
The plan looks better after that: Pstart is now KEY instead of 1. In our case it will scan 7 daily partitions.
The stats give a bogus run time estimate. The actual run time dropped from 20 minutes to 1 minute.

SQL>  explain plan for SELECT MAX("A1"."AGG_DATE") FROM "ODS_WEB_APPS"."COUNTER_DAY_AGG" "A1" where AGG_DATE > sysdate-7;
Explained.
SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1669369268

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                            |     1 |     8 |    10M  (4)| 36:09:51 |       |       |
|   1 |  SORT AGGREGATE           |                            |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR|                            |  6919K|    52M|    10M  (4)| 36:09:51 |   KEY |  1149 |
|*  3 |    TABLE ACCESS FULL      | COUNTER_DAY_AGG            |  6919K|    52M|    10M  (4)| 36:09:51 |   KEY |  1149 |
------------------------------------------------------------------------------------------------------------------------

Of course there is a trade-off with this workaround: it limits the script's ability to catch up on failed or missed loads. The script uses this query to find the max loading date and catches up from that date, so if our load hasn't run for more than 7 days, the script won't be able to catch up. I guess that's something we can live with; it's unlikely that we wouldn't notice our daily ETL job had not been running for the past 7 days :) Even in the worst case, if that really happens, we can still deal with it by hand.

Friday, October 29, 2010

ORA-01591 and quick solution

One of our users reported getting this error from the application.

ORA-01591: lock held by in-doubt distributed transaction 4.7.533420

We don't see this error often, so I did a little research.
The error message doc from Oracle has a pretty good explanation but doesn't give concrete steps to resolve it.

ORA-01591:

lock held by in-doubt distributed transaction string
Cause: Trying to access resource that is locked by a dead two-phase commit transaction that is in prepared state.
Action: DBA should query the pending_trans$ and related tables, and attempt to repair network connection(s) to coordinator and commit point. If timely repair is not possible, DBA should contact DBA at commit point if known or end user for correct outcome, or use heuristic default if given to issue a heuristic commit or abort command to finalize the local portion of the distributed transaction.


What I ended up doing is pretty easy. ROLLBACK FORCE alone didn't do the trick; DBMS_TRANSACTION helped.

SQL> select local_tran_id from dba_2pc_pending;

LOCAL_TRAN_ID
----------------------
4.7.533420

SQL> rollback force '4.7.533420';

Rollback complete.

SQL> select local_tran_id from dba_2pc_pending;

LOCAL_TRAN_ID
----------------------
4.7.533420

SQL> exec dbms_transaction.purge_lost_db_entry('4.7.533420');

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

SQL> select local_tran_id from dba_2pc_pending;

no rows selected
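
When deciding how to clear an in-doubt transaction, the other columns of DBA_2PC_PENDING are useful, in particular STATE, which shows whether the entry is still prepared or has already been forced. A small sketch using the transaction ID from this case:

```sql
-- STATE is e.g. 'prepared' before the force and 'forced rollback' after it
select local_tran_id, global_tran_id, state, mixed, advice, fail_time
from   dba_2pc_pending
where  local_tran_id = '4.7.533420';
```

If the state already reads 'forced rollback', the remaining row is most likely just bookkeeping waiting for the coordinator, which is the kind of entry purge_lost_db_entry is meant to clean up.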