今天早上刚到公司一会,就接到南京客户打来的电话,说他们的核心会员数据库宕机了,让我远程帮忙查看一下。还没来得急看报纸的我赶紧打开电脑,远程连接到客户的服务器进行诊断。
1. 数据库alert.log分析
Mon Jan 05 01:40:50 2015
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_ora_16581034.trc:
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 288 Serial number: 5649
NOTE: deferred map free for map id 4422
Mon Jan 05 01:40:55 2015
NOTE: ASMB terminating
Mon Jan 05 01:40:55 2015
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64460))
WARNING: inbound connection timed out (ORA-3136)
Mon Jan 05 01:40:55 2015
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Mon Jan 05 01:40:55 2015
***********************************************************************
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64530))
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
Fatal NI connect error 12170.
WARNING: inbound connection timed out (ORA-3136)
Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
Tns error struct:
ASMB (ospid: 5898342): terminating the instance due to error 15064
ns main err code: 12535
TNS-12535: TNS:operation timed out
Mon Jan 05 01:40:55 2015
***********************************************************************
ns secondary err code: 12606
Fatal NI connect error 12170.
nt main err code: 0
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
nt secondary err code: 0
Time: 05-JAN-2015 01:40:55
Tracing not turned on.
nt OS err code: 0
Mon Jan 05 01:40:56 2015
System state dump requested by (instance=1, osid=5898342 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_diag_20578640.trc
Mon Jan 05 01:40:59 2015
ORA-1092 : opitsk aborting process
Mon Jan 05 01:41:00 2015
ORA-1092 : opitsk aborting process
Instance terminated by ASMB, pid = 5898342
Mon Jan 05 08:46:45 2015
Starting ORACLE instance (normal)
由于在故障点之前也会报一些ORA-3136的错误,故这里不做说明,与本次错误干系不大。从alert日志中发现,数据库实例在1:40:55时,被ASMB(进程号:5898342)进程终止,直到8点46分左右才开始启动(客户自己手动重启)。
2.相关trace文件分析
test5_ora_16581034.trc:
*** 2015-01-05 01:40:50.900
*** SESSION ID:(424.19913) 2015-01-05 01:40:50.900
*** CLIENT ID:() 2015-01-05 01:40:50.900
*** SERVICE NAME:(SYS$USERS) 2015-01-05 01:40:50.900
*** MODULE NAME:(backup incr datafile) 2015-01-05 01:40:50.900
*** ACTION NAME:(0000097 STARTED16) 2015-01-05 01:40:50.900
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 288 Serial number: 5649
基本与alert日志中报的错误相同,就是获知在1:40分这个故障点,数据库与ASM实例的连接中断了
test5_asmb_5898342.trc:
*** 2015-01-05 01:40:55.870
NOTE: ASMB terminating
*** 2015-01-05 01:40:55.922
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 216 Serial number: 7
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+360<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+100<-ksuitm()+5124<-ksbrdp()+4508<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112
----- End of Abridged Call Stack Trace -----
*** 2015-01-05 01:40:55.954
ASMB (ospid: 5898342): terminating the instance due to error 15064
ksuitm: waiting up to [5] seconds before killing DIAG(20578640)
从这个trace文件可以获知,由于遭遇ora-15064错误,ASMB进程终止了数据库实例。
test5_diag_20578640.trc:
Session Wait History:
elapsed time of 1.543354 sec since last wait
0: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881966 seq_num=30034 snap_id=1
wait times: snap=0.000422 sec, exc=0.000422 sec, total=0.000422 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000016 sec of elapsed time
1: waited for 'ASM file metadata operation'
msgop=0x2, locn=0x3, =0x0
wait_id=881964 seq_num=30033 snap_id=2
wait times: snap=0.000082 sec, exc=0.000127 sec, total=4 min 4 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000000 sec of elapsed time
2: waited for 'latch: shared pool'
address=0x700000000107048, number=0x133, tries=0x0
wait_id=881965 seq_num=30032 snap_id=1
wait times: snap=4 min 4 sec, exc=4 min 4 sec, total=4 min 4 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000000 sec of elapsed time
3: waited for 'ASM file metadata operation'
msgop=0x2, locn=0x3, =0x0
wait_id=881964 seq_num=30031 snap_id=1
wait times: snap=0.000045 sec, exc=0.000045 sec, total=0.000045 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000019 sec of elapsed time
4: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881963 seq_num=30030 snap_id=1
wait times: snap=3.285876 sec, exc=3.285876 sec, total=3.285876 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000011 sec of elapsed time
5: waited for 'ASM file metadata operation'
msgop=0xc, locn=0x3, =0x0
wait_id=881962 seq_num=30029 snap_id=1
wait times: snap=0.000034 sec, exc=0.000034 sec, total=0.000034 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000019 sec of elapsed time
6: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881961 seq_num=30028 snap_id=1
wait times: snap=5.000528 sec, exc=5.000528 sec, total=5.000528 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000009 sec of elapsed time
7: waited for 'ASM file metadata operation'
msgop=0x0, locn=0x3, =0x0
wait_id=881960 seq_num=30027 snap_id=1
wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000016 sec of elapsed time
8: waited for 'ASM background timer'
=0x0, =0x0, =0x0
wait_id=881959 seq_num=30026 snap_id=1
wait times: snap=5.000493 sec, exc=5.000493 sec, total=5.000493 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000011 sec of elapsed time
9: waited for 'ASM file metadata operation'
msgop=0x0, locn=0x3, =0x0
wait_id=881958 seq_num=30025 snap_id=1
wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000017 sec of elapsed time
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[120 samples, 01:38:56 - 01:40:55]
waited for 'latch: shared pool', seq_num: 30032
p1: 'address'=0x700000000107048
p2: 'number'=0x133
p3: 'tries'=0x0
time_waited: >= 119 sec (still in wait)
---------------------------------------------------
Sampled Session History Summary:
longest_non_idle_wait: 'latch: shared pool'
[120 samples, 01:38:56 - 01:40:55]
time_waited: >= 119 sec (still in wait)
在ASMB进程的的dump文件中,发现有2个4分4秒的等待,waited for 'ASM file metadata operation'和waited for 'latch: shared pool',而其他的相关操作都是只有几微秒,甚至0秒的
3.ASM实例的alert日志:
Mon Jan 05 01:40:56 2015
***********************************************************************
TNS-12537: TNS:connection closed
Fatal NI connect error 12537, connecting to:
(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
ns secondary err code: 12560
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
nt main err code: 0
Time: 05-JAN-2015 01:40:56
nt secondary err code: 0
Tracing not turned on.
nt OS err code: 0
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Mon Jan 05 04:42:23 2015
NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds
Mon Jan 05 08:46:53 2015
NOTE: client test5:test5 registered, osid 23593052, mbr 0x0
Mon Jan 05 08:47:24 2015
Starting background process ASMB
Mon Jan 05 08:47:24 2015
ASMB started with pid=26, OS id=20185292
Mon Jan 05 08:47:25 2015
NOTE: client +ASM:+ASM registered, osid 16384406, mbr 0x1
在故障点之前,ASM的alert日志中未报过TNS错误,但故障点的时候,ASM中也出现了TNS连接关闭的错误,因此,也不排除网络因素引起的ASM实例故障。4点多的时候有一个NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds,根据文档“NOTE ASMB process exiting due to lack of ASM file activity (文档 ID 754110.1)”,这个NOTE可以忽略。到了8:46的时候,ASM实例随着数据库的重启也正常启动了
最终,找到了一篇MOS文档,是和这次遇到的故障相关的,应该就是Bug 13914613导致的这个问题:
Bug 13914613 Excessive time holding shared pool latch in kghfrunp with auto memory management
This note gives a brief overview of bug 13914613.The content was last updated on: 17-DEC-2014
Click here for details of each of the sections below.
Affects:
Product (Component) Oracle Server (Rdbms) Range of versions believed to be affected (Not specified) Versions confirmed as being affected Platforms affected Generic (all / most platforms affected)
Fixed:
The fix for 13914613 is first included in
Description
A session may spend excessive time holding the shared pool latch under kghfrunp when auto memory management is used. This can ultimately result in an instance crash due to other sessions holding critical resources too long. eg: database crash due to ORA-240 and ORA-15064 Rediscovery Notes Session wait chains show "latch: shared pool" waits Wait times for shared pool latch over 10 seconds for the same holder Call stacks of shared pool latch holder show kghfrunp as a currently executing function. Auto memory management is in use such that "duration" subheaps are used eg: ASMM (sga_target) or AMM (memory_target) being used. Workaround Set init.ora parameter _enable_shared_pool_durations=falseGetting a Fix Use one of the "Fixed" versions listed above (for Patch Sets / bundles use the latest version available as contents are cumulative - the "Fixed" version listed above is the first version where the fix is included)
分析了产生故障的原因,我给客户的解决方案就是以下两条:
1.设置隐含参数"_enable_shared_pool_durations"=false
2.升级到11.2.0.4的patchset
在CentOS 5.5 i386 上安装 Oracle 10G XE http://www.linuxidc.com/Linux/2011-02/31928p2.htm
Linux下Oracle 11G XE 安装笔记 http://www.linuxidc.com/Linux/2012-08/66974.htm
在CentOS 6.4下安装Oracle 11gR2(x64) http://www.linuxidc.com/Linux/2014-02/97374.htm
Oracle 11gR2 在VMWare虚拟机中安装步骤 http://www.linuxidc.com/Linux/2013-09/89579p2.htm
Debian 下 安装 Oracle 11g XE R2 http://www.linuxidc.com/Linux/2014-03/98881.htm
更多Oracle相关信息见Oracle 专题页面 http://www.linuxidc.com/topicnews.aspx?tid=12
本文永久更新链接地址:http://www.linuxidc.com/Linux/2015-01/111581.htm