Google Search

自訂搜尋

2009年2月20日 星期五

[PL/SQL] 讓人吐血的ORA-02046(Distributed Transaction)

今天中午,客訴事件又添一件,人客又在抱怨資料沒產生,給客人看的網站上查不到想看的資料!天啊,這已是這個月來的第N起啦! 工程師查半天,程式就是跑不過,ERROR LOG中滿
滿的 ORA-02046. 這時看起來大勢已去,怒火中燒的客戶又在樓下等資料,腦中滿是問號,突
然靈機一動,將程式移到測試環境再去存取前端ERP的資料(這隻程式是一支ETL,從後端
Data Warehouse資料庫去存取前端ERP的資料庫),莫名其妙的程式就突然跑過了,這時趕緊
將產生好的資料搬到正是環境中,好跟客戶交差(為甚麼測試環境可以過,正式環境不行,至
今未解,但公司的環境因為DBA要求開發人員先行測試即將Upgrade的Oracle版本,所以兩個
環境的版本不一樣,測試環境是9i,正式環境是8i).

但是到底出了甚麼是啊?下面是官方說法:
=====================
ORA-02046:distributed transaction already begun
Cause:internal error or error in external transaction manager. A server session
received a begin_tran RPC before finishing with a previous distributed tran.
Action:none
=====================
看到了嗎? Action是'None',在Google一下,網路上幾乎沒有針對這個問題有提供相關解答.
唯一有正面回應的是,這個ERROR都發生在跨DB存取的動作,可以將遠端要存取的TABLE先
複製回LOCAL端,免除掉DBLINK的動作. 但這個建議相信大多人不會採用,因為就以本公司
為例,相關ETL程式有數百隻,每支都這樣改還得了,況且複製資料還要整體考量Timing的
問題,牽扯實在太大.

那Metalink呢?有沒有解答啊?引用一下http://www.orafaq.com/forum/t/72797/0/的問答
,看看內文提到,metalink也沒有太多解答.

========================================
Actually, quite a few people have sent this error into Oracle, and there are a number of cases
on Metalink about this error, none of them provide much elucidation. So actually, insisting he contact Oracle might not actually do too much good. If so, being able to provide them with a
trace file may help. This is an error that often doesn't throw any alerts.

Basically, what has happened is that the specific server session has received a begin_tran
rpc BEFORE it has finished with a previous distributed transaction.

There is no real clear, set pattern to the conditions that create this error message, and in
many cases, there doesn't seem to be a workaround either.

In one case, the person's application was in autocommit mode and turning that off resolved
the issue. In one case, the error occured in SQLPlus, but not when running the same code in OWB (where the code had been created). In another case, a person said they had increased DISTRIBUTED_TRANSACTION parameter and had it stop. In another case, the problem occured in a jdbc application, but not when the same code was run from SQLPlus.

It has been associated with select statements going from a 9.2.x database to an 8.1.x database.
In most cases, there is a dblink involved, one way or another (buried in a view or somehow). Some reports asscoiate it with insert statements.

In my case, I have been told to write a pl/sql procedure to replace Oracle's multimaster replication. As expected it does an update on a table on a remote database. This update statement causes the error to occur. The earlier version of this procedure, which is as similar
as possible except that it does not use bulk collect to gather the data, does not appear to cause the error to occur.

This is a difficult error and not a lot of information available from metalink. It seems to occur right in the middle of a distributed transaction.
================
=========================

那怎麼辦呢? 工程師嘛,任務就是解決問題,誓死達成任務. 好吧!那就回過頭把問題在定
義得更清楚一點! 底下就是更清楚的問題定義!

問題定義:
一隻每四小時執行一次的ETL程式,每次執行處理的資料量大概是三百筆上下. 主Crusor的
SQL和處理過程中都會不斷透過Database A存取Database B和Database C,最後將結果存於
Database A. 但自從三天前,每次執行此ETL都會撞到ORA-02046的問題,撞到的點都在執
行某副程式時發生,該副程式也是透過Database LINK存取Database B.

問題模擬:
將ETL由背景執行轉成前景執行,執行完後,一樣的ORA-02046出現.
先用以下SQL看看執行前後的Cursor狀態,沒有異常,遠小於Database的設定: Open cursor(Session中可以使用的Cursor, 目前設'10000').
=======================================
select a.name, b.value
from v$statname a, v$mystat b
where a.statistic# = b.statistic#
and lower(a.name) like '%' || 'cursor' ||'%'

NAME                                VALUE
------------------------------ ----------
opened cursors cumulative 26
opened cursors current 9
session cursor cache hits 0
session cursor cache count 13
cursor authentications 1
=======================================

另外一個常見的 Distributed database的問題是參數 'Open Link', session裡所使用的DB
Link總數,看一下目前設定,這個就更不可能超過啦. 重點是若是前述兩個問題,錯誤訊息
也不應該是Ora-02046.

但是同時間也發現,ORA-02046一旦出現,後續所有跨該 DB Link的SQL都會碰到一樣的錯誤訊
息.這時問題既然存在於DB Link,那就看一下 V$DBlink,果然,此View底下的'欄位
'Open Cursors'的值一旦是慢慢變大成9時,討人厭的Ora-02046開始層出不窮. 這個欄位
Oracle的定義是
==========================================================
Whether there are open cursors for the database link
===========================================================

有看沒有懂吧!為什麼這個值一變成9,系統就會拋出Ora-02046,不解!遍尋網路,也沒有更
進一步的解釋.

這時下個指令將DB Link關閉,Ora-02046不再出現.相關跨DB LINK的SQL也可以繼續執行.
====================
ALTER SESSION CLOSE database link A;
====================

但是問題到底怎麼解決啊?看看最近一次程式的Change Log,是在主Cursor的SQL中加了一個
Function,此Function 是存在於Database B,再從Database B中去存取Database A. 先試著
將此Function搬到Database A後,SQL再直接執行呼叫此Function,問題居然解決了. 天啊!
到底是撞倒哪一個參數或限制,至今無解,但和之前談的作法一樣的是:
既然問題發生在 Distributed Transactions,就試著將Distributed Transaction的需求降低
,將跨DB的Function移至Local端.問題暫時解決,未來有機會再將此問題量化來看,看看有沒
有機會看得更清楚問題所在.

也希望ORACLE大人行行好,將此一問題講得更清楚一點,不要只放一個Action='None',這可
讓很多人少掉很多白頭髮耶!

1 則留言:

匿名 提到...

我也遇到了这个问题,我在进程中的sql快运行到使用连接的部分就报这个错误.看了一下网上的问题和回答,我基本感觉是因为link没有在进程中释放的原因.现行释放一下,下次的时候就不会报错了吧.
alter session close database link zw03_zw01

--by luke.ning