你好,游客 登录 注册 搜索
背景:
阅读新闻

由 BUG 引发 ORA-15064 进而导致数据库实例意外中止

[日期:2015-01-11] 来源:Linux社区  作者:aaron8219 [字体: ]

今天早上刚到公司一会,就接到南京客户打来的电话,说他们的核心会员数据库宕机了,让我远程帮忙查看一下。还没来得急看报纸的我赶紧打开电脑,远程连接到客户的服务器进行诊断。

客户的生产环境是AIX 6.1上的Oracle 11.2.0.3.0,在凌晨1:40分左右发生的故障。为了不涉及泄露客户隐私,把数据库实例名进行了替换。下面看具体的分析:
 
1. 数据库alert.log分析
 
Mon Jan 05 01:40:50 2015
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_ora_16581034.trc:
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 288 Serial number: 5649
NOTE: deferred map free for map id 4422
Mon Jan 05 01:40:55 2015
NOTE: ASMB terminating
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
 
Fatal NI connect error 12170.
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
   
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64460))
WARNING: inbound connection timed out (ORA-3136)
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
 
Fatal NI connect error 12170.
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
   
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64530))
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
 
Fatal NI connect error 12170.
WARNING: inbound connection timed out (ORA-3136)
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
  Tns error struct:
ASMB (ospid: 5898342): terminating the instance due to error 15064
    ns main err code: 12535
   
TNS-12535: TNS:operation timed out
Mon Jan 05 01:40:55 2015
 
 
***********************************************************************
    ns secondary err code: 12606
 
Fatal NI connect error 12170.
    nt main err code: 0
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
    nt secondary err code: 0
  Time: 05-JAN-2015 01:40:55
  Tracing not turned on.
    nt OS err code: 0
Mon Jan 05 01:40:56 2015
System state dump requested by (instance=1, osid=5898342 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_diag_20578640.trc
Mon Jan 05 01:40:59 2015
ORA-1092 : opitsk aborting process
Mon Jan 05 01:41:00 2015
ORA-1092 : opitsk aborting process
Instance terminated by ASMB, pid = 5898342
Mon Jan 05 08:46:45 2015
Starting ORACLE instance (normal)
 
由于在故障点之前也会报一些ORA-3136的错误,故这里不做说明,与本次错误干系不大。从alert日志中发现,数据库实例在1:40:55时,被ASMB(进程号:5898342)进程终止,直到8点46分左右才开始启动(客户自己手动重启)。
 
2.相关trace文件分析
 
test5_ora_16581034.trc:
 
*** 2015-01-05 01:40:50.900
*** SESSION ID:(424.19913) 2015-01-05 01:40:50.900
*** CLIENT ID:() 2015-01-05 01:40:50.900
*** SERVICE NAME:(SYS$USERS) 2015-01-05 01:40:50.900
*** MODULE NAME:(backup incr datafile) 2015-01-05 01:40:50.900
*** ACTION NAME:(0000097 STARTED16) 2015-01-05 01:40:50.900
 
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 288 Serial number: 5649
 
基本与alert日志中报的错误相同,就是获知在1:40分这个故障点,数据库与ASM实例的连接中断了
 
test5_asmb_5898342.trc:
 
*** 2015-01-05 01:40:55.870
NOTE: ASMB terminating
 
*** 2015-01-05 01:40:55.922
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID: 
Session ID: 216 Serial number: 7
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+360<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+100<-ksuitm()+5124<-ksbrdp()+4508<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112 
----- End of Abridged Call Stack Trace -----
 
*** 2015-01-05 01:40:55.954
ASMB (ospid: 5898342): terminating the instance due to error 15064
ksuitm: waiting up to [5] seconds before killing DIAG(20578640)
 
从这个trace文件可以获知,由于遭遇ora-15064错误,ASMB进程终止了数据库实例。
 
test5_diag_20578640.trc:
 
Session Wait History:
        elapsed time of 1.543354 sec since last wait
     0: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881966 seq_num=30034 snap_id=1
        wait times: snap=0.000422 sec, exc=0.000422 sec, total=0.000422 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000016 sec of elapsed time
     1: waited for 'ASM file metadata operation'
        msgop=0x2, locn=0x3, =0x0
        wait_id=881964 seq_num=30033 snap_id=2
        wait times: snap=0.000082 sec, exc=0.000127 sec, total=4 min 4 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     2: waited for 'latch: shared pool'
        address=0x700000000107048, number=0x133, tries=0x0
        wait_id=881965 seq_num=30032 snap_id=1
        wait times: snap=4 min 4 sec, exc=4 min 4 sec, total=4 min 4 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     3: waited for 'ASM file metadata operation'
        msgop=0x2, locn=0x3, =0x0
        wait_id=881964 seq_num=30031 snap_id=1
        wait times: snap=0.000045 sec, exc=0.000045 sec, total=0.000045 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000019 sec of elapsed time
     4: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881963 seq_num=30030 snap_id=1
        wait times: snap=3.285876 sec, exc=3.285876 sec, total=3.285876 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000011 sec of elapsed time
     5: waited for 'ASM file metadata operation'
        msgop=0xc, locn=0x3, =0x0
        wait_id=881962 seq_num=30029 snap_id=1
        wait times: snap=0.000034 sec, exc=0.000034 sec, total=0.000034 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000019 sec of elapsed time
     6: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881961 seq_num=30028 snap_id=1
        wait times: snap=5.000528 sec, exc=5.000528 sec, total=5.000528 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000009 sec of elapsed time
     7: waited for 'ASM file metadata operation'
        msgop=0x0, locn=0x3, =0x0
        wait_id=881960 seq_num=30027 snap_id=1
        wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000016 sec of elapsed time
     8: waited for 'ASM background timer'
        =0x0, =0x0, =0x0
        wait_id=881959 seq_num=30026 snap_id=1
        wait times: snap=5.000493 sec, exc=5.000493 sec, total=5.000493 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000011 sec of elapsed time
     9: waited for 'ASM file metadata operation'
        msgop=0x0, locn=0x3, =0x0
        wait_id=881958 seq_num=30025 snap_id=1
        wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000017 sec of elapsed time
 
sample interval: 1 sec, max history 120 sec
    ---------------------------------------------------
      [120 samples,                                        01:38:56 - 01:40:55]
        waited for 'latch: shared pool', seq_num: 30032
          p1: 'address'=0x700000000107048
          p2: 'number'=0x133
          p3: 'tries'=0x0
          time_waited: >= 119 sec (still in wait)
    ---------------------------------------------------
    Sampled Session History Summary:
      longest_non_idle_wait: 'latch: shared pool'
      [120 samples, 01:38:56 - 01:40:55]
          time_waited: >= 119 sec (still in wait)
 
在ASMB进程的的dump文件中,发现有2个4分4秒的等待,waited for 'ASM file metadata operation'和waited for 'latch: shared pool',而其他的相关操作都是只有几微秒,甚至0秒的
 
3.ASM实例的alert日志:
 
Mon Jan 05 01:40:56 2015
 
 
***********************************************************************
TNS-12537: TNS:connection closed
 
Fatal NI connect error 12537, connecting to:
 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
    ns secondary err code: 12560
 
  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
    nt main err code: 0
  Time: 05-JAN-2015 01:40:56
    nt secondary err code: 0
  Tracing not turned on.
    nt OS err code: 0
  Tns error struct:
    ns main err code: 12537
   
TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
Mon Jan 05 04:42:23 2015
NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds
Mon Jan 05 08:46:53 2015
NOTE: client test5:test5 registered, osid 23593052, mbr 0x0
Mon Jan 05 08:47:24 2015
Starting background process ASMB
Mon Jan 05 08:47:24 2015
ASMB started with pid=26, OS id=20185292 
Mon Jan 05 08:47:25 2015
NOTE: client +ASM:+ASM registered, osid 16384406, mbr 0x1
 
在故障点之前,ASM的alert日志中未报过TNS错误,但故障点的时候,ASM中也出现了TNS连接关闭的错误,因此,也不排除网络因素引起的ASM实例故障。4点多的时候有一个NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds,根据文档NOTE ASMB process exiting due to lack of ASM file activity (文档 ID 754110.1)”,这个NOTE可以忽略。到了8:46的时候,ASM实例随着数据库的重启也正常启动了
 
最终,找到了一篇MOS文档,是和这次遇到的故障相关的,应该就是Bug 13914613导致的这个问题:
 

Bug 13914613  Excessive time holding shared pool latch in kghfrunp with auto memory management

 This note gives a brief overview of bug 13914613. 
 The content was last updated on: 17-DEC-2014
 Click here for details of each of the sections below.

Affects:

Product (Component) Oracle Server (Rdbms)
Range of versions believed to be affected (Not specified)
Versions confirmed as being affected
Platforms affected Generic (all / most platforms affected)

Fixed:

The fix for 13914613 is first included in
 

Description

A session may spend excessive time holding the shared pool latch under
kghfrunp when auto memory management is used. This can ultimately 
result in an instance crash due to other sessions holding critical 
resources too long. 
eg: database crash due to ORA-240 and ORA-15064
 
Rediscovery Notes
 Session wait chains show "latch: shared pool" waits
 Wait times for shared pool latch over 10 seconds for the same holder
 Call stacks of shared pool latch holder show kghfrunp as a currently 
  executing function.  
 Auto memory management is in use such that "duration" subheaps are used
 eg: ASMM (sga_target) or AMM (memory_target) being used.
 
Workaround
 Set init.ora parameter _enable_shared_pool_durations=false
 
Getting a Fix
 Use one of the "Fixed" versions listed above
 (for Patch Sets / bundles use the latest version available as
  contents are cumulative - the "Fixed" version listed above is
  the first version where the fix is included)
 
分析了产生故障的原因,我给客户的解决方案就是以下两条:
 
1.设置隐含参数"_enable_shared_pool_durations"=false
2.升级到11.2.0.4的patchset 

CentOS 5.5 i386 上安装 Oracle 10G XE  http://www.linuxidc.com/Linux/2011-02/31928p2.htm

Linux下Oracle 11G XE 安装笔记  http://www.linuxidc.com/Linux/2012-08/66974.htm

在CentOS 6.4下安装Oracle 11gR2(x64) http://www.linuxidc.com/Linux/2014-02/97374.htm

Oracle 11gR2 在VMWare虚拟机中安装步骤 http://www.linuxidc.com/Linux/2013-09/89579p2.htm

Debian 下 安装 Oracle 11g XE R2 http://www.linuxidc.com/Linux/2014-03/98881.htm

更多Oracle相关信息见Oracle 专题页面 http://www.linuxidc.com/topicnews.aspx?tid=12

本文永久更新链接地址http://www.linuxidc.com/Linux/2015-01/111581.htm

linux
相关资讯       ORA-15064 
本文评论   查看全部评论 (0)
表情: 表情 姓名: 字数

       

评论声明
  • 尊重网上道德,遵守中华人民共和国的各项有关法律法规
  • 承担一切因您的行为而直接或间接导致的民事或刑事法律责任
  • 本站管理人员有权保留或删除其管辖留言中的任意内容
  • 本站有权在网站内转载或引用您的评论
  • 参与本评论即表明您已经阅读并接受上述条款