转载

使用oradebug dump hanganalyze分析oracle hang系列一

背景

  oracle db有可能因为诸多因系而hang,可以借助某些工具进行诊断分析,本文主要学习如何用不用oradebug dump hanganalyze 3进行分析产生的TRACE FILE.
 以及所包含的内容结构。 




结论

1,oradebug hanganalyze 3产生的TRACE FILE包括3部分内容
2,各部分内容如下:


第一部分
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: <='latch: cache buffers chains'
     Chain 1 Signature Hash: 0xccebf225
 [b] Chain 2 Signature: <='buffer busy waits'
     Chain 2 Signature Hash: 0x9a13abed
 [c] Chain 3 Signature: <='latch: cache buffers chains'
     Chain 3 Signature Hash: 0xccebf225


第二部分(当然这里面又分为intersecting chains或者non-intersecting chains)
===============================================================================
Non-intersecting chains:
 
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 10574
              process id: 10, oracle@seconary (DBW0)
              session id: 10
        session serial #: 1
    }
    is waiting for 'latch: cache buffers chains' with wait info:
    {
                      p1: 'address'=0xda4f8850
                      p2: 'number'=0x96
                      p3: 'tries'=0x0
            time in wait: 0.012628 sec
           timeout after: never
                 wait id: 50243
                blocking: 0 sessions
             current sql:
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semop()+7<-skgpwwait()+156<-kslgess()+1799<-ksl_get_shared_latch()+620<-kcbbic1()+247<-kcbbiop()+868<-kcbbdrv()+566<-ksbabs()+465<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36
            wait history:
              * time between current wait and wait #1: 0.000058 sec
              1.       event: 'db file async I/O submit'
                 time waited: 0.328988 sec
                     wait id: 50242           p1: 'requests'=0x1
                                              p2: 'interrupt'=0x0
                                              p3: 'timeout'=0x0
              * time between wait #1 and #2: 0.000076 sec
              2.       event: 'db file async I/O submit'
                 time waited: 0.000748 sec
                     wait id: 50241           p1: 'requests'=0x1
                                              p2: 'interrupt'=0x0
                                              p3: 'timeout'=0x0
              * time between wait #2 and #3: 0.000086 sec
              3.       event: 'db file async I/O submit'
                 time waited: 0.000177 sec
                     wait id: 50240           p1: 'requests'=0x1
                                              p2: 'interrupt'=0x0
                                              p3: 'timeout'=0x0
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 6664
              process id: 91, oracle@seconary (J063)
              session id: 32
        session serial #: 9465
    }
    which is not in a wait:
    {
               last wait: 0.000000 sec ago
                blocking: 6 sessions
             current sql:
             short stack:
    }
 
Chain 1 Signature: <='latch: cache buffers chains'
Chain 1 Signature Hash: 0xccebf225




第三部分
State of LOCAL nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):
[9]/1/10/1/0xdcba9740/10574/NLEAF/[31]
[11]/1/12/1/0xdcba3a60/10578/SINGLE_NODE/
中间类似内容略
[346]/1/347/9522/0xdc7d75d0/6413/NLEAF/[118]
[348]/1/349/7696/0xdc7d18f0/6433/NLEAF/[204]


关于各列含义见下:
根据其每列含义
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):






           数据库实例编号         会话sid    会话serial#   会话的地址saddr   会话对应的操作系统ID          表明会话是否等待           如有值,表明是持锁会话的CHAIN编号,否则为空    
[163]   /    1              /   164    /   10419    /   0xdc9ea7e0   /    4046              /           NLEAF         /          [301]


3,non-intersecting chains的特色就是所包含的会话的阻塞会话,不隶属于任何chain
   而intersecting chains包含的会话的阻塞会话隶属于另一个chains中的持锁会话,也就是说chains包含在另一个chains中


4,chains中包括每个会话以及阻塞会话的详细信息,包括SID,PID,SPID以及当前运行的SQL
  所调用堆栈;以及近期的历史等待会话列表
5,关于 ([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]) 中的state列的各值含义如下,暂不全面(仍需要进一步测试)


NLEAF表明是等待会话
     SINGLE_NODE表明它是等待任何会话或资源
     LEAF表明它是持锁会话,即它不等待任何会话或资源
     LEAF_NEW它也是持锁会话


测试

1,数据库版本
SQL> select * from v$version where rownum=1;


BANNER
----------------------------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production


2,数据库未有任何HANG时,产生hanganalyze
SQL> oradebug setmypid
Statement processed.
SQL> oradebug hanganalyze 3
Hang Analysis in /oracle/diag/rdbms/guowang/guowang/trace/guowang_ora_13018.trc


*** 2015-11-02 07:02:02.135
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): guowang.guowang
  oradebug_node_dump_level: 3 --trace level级别
  analysis initiated by oradebug --标明hanganalyze由oradebug生成
===============================================================================
 
Chains most likely to have caused the hang:--导致HANG的原因
 
 
===============================================================================
No chains found.
===============================================================================
Extra information that will be dumped at higher levels:
 
State of ALL nodes  --全部节点的状态(我估计适用于RAC)
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]): --每个列的含义


*** 2015-11-02 07:02:02.163
===============================================================================
END OF HANG ANALYSIS
===============================================================================


*** 2015-11-02 07:02:02.164
===============================================================================
HANG ANALYSIS DUMPS:
  oradebug_node_dump_level: 3
===============================================================================
 
State of LOCAL nodes  --本地节点的状态
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):--同state of all nodes
 
 
No processes qualify for dumping.
 
===============================================================================
HANG ANALYSIS DUMPS: END
===============================================================================






4,手工模拟产生锁等待,产生hanganalyze


测试会话1
SQL> select sid,serial# from v$session where sid=(select sid from v$mystat where rownum=1);


       SID    SERIAL#
---------- ----------
       302       9226


SQL> select pid,spid from v$process where addr=(select paddr from v$session where sid=(select sid from v$mystat where rownum=1));


       PID SPID
---------- ------------------------------------------------
       165 3812


SQL> create table t_hang(a int,b int);


Table created.


SQL> insert into t_hang values(1,1);


1 row created.


SQL> commit;


Commit complete.




测试会话2
SQL> select sid,serial# from v$session where sid=(select sid from v$mystat where rownum=1);


       SID    SERIAL#
---------- ----------
       164      10419


SQL> select pid,spid from v$process where addr=(select paddr from v$session where sid=(select sid from v$mystat where rownum=1));


       PID SPID
---------- ------------------------------------------------
       184 4046


hang住
SQL> update t_hang set a=2 where a=1;     


SQL> oradebug setmypid
Statement processed.
SQL> oradebug hanganalyze 3
Hang Analysis in /oracle/diag/rdbms/guowang/guowang/trace/guowang_ora_6813.trc


*** 2015-11-02 07:16:19.252
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): guowang.guowang
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
===============================================================================
 
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: <='latch: cache buffers chains'
     Chain 1 Signature Hash: 0xccebf225
 [b] Chain 2 Signature: <='buffer busy waits'
     Chain 2 Signature Hash: 0x9a13abed
 [c] Chain 3 Signature: <='latch: cache buffers chains'
     Chain 3 Signature Hash: 0xccebf225
 


--经在TRACE FILE过滤等待会话的sid,即164,以后可以直接查session id:164或者对应的进程号process id:184,或者对应的操作系统进程os id:4096
发现它是在一个chain中,这个chain中清晰标明了,等待会话到底在等待什么资源,以及是哪个会话阻塞它,并且会列出阻塞会话的详细 信息
 Chain 10:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 4046
              process id: 184, oracle@seconary (TNS V1-V3)
              session id: 164
        session serial #: 10419
    }
    is waiting for 'enq: TX - row lock contention' with wait info:
    {
                      p1: 'name|mode'=0x54580006
                      p2: 'usn<<16 | slot'=0xca001d
                      p3: 'sequence'=0x2842
            time in wait: 1 min 23 sec
           timeout after: never
                 wait id: 33
                blocking: 0 sessions
             current sql: update t_hang set a=2 where a=1  -会列出当前等待会话正在执行的SQL以及调用堆栈
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semtimedop()+10<-skgpwwait()+156<-ksliwat()+1821<-kslwaitctx()+162<-ksqcmi()+5641<-ksqgtlctx()+3401<-ksqgelctx()+552<-ktcwit1()+330<-kdddgb()+7099<-kdusru()+490<-updrowFastPath()+1075<-qerupFetch()+2171<-updaul()+1179<-updThreePhaseExe()+328<-updexe()+663<-opiexe()+14377<-kpoal8()+2288<-opiodr()+1149<-ttcpip()+1251<-opitsk()+1633<-opiino()+958<-opiodr()+1149<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+214<-main()+201<-__lib
            wait history:  --等待会话近期的等待事件历史列表,且基于时间反向排序
              * time between current wait and wait #1: 0.043281 sec
              1.       event: 'Disk file operations I/O'
                 time waited: 0.000090 sec
                     wait id: 32              p1: 'FileOperation'=0x2
                                              p2: 'fileno'=0x4
                                              p3: 'filetype'=0x2
              * time between wait #1 and #2: 0.000764 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 1 min 0 sec
                     wait id: 31              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000007 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000003 sec
                     wait id: 30              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
    }
    and is blocked by  --标明被哪个会话所阻塞
 => Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 3812
              process id: 165, oracle@seconary (TNS V1-V3)
              session id: 302
        session serial #: 9226
    }
    which is waiting for 'SQL*Net message from client' with wait info:  -同上理,也会列出持锁会话近期的历史等待事件列
    {
                      p1: 'driver id'=0x62657100
                      p2: '#bytes'=0x1
            time in wait: 2 min 49 sec
           timeout after: never
                 wait id: 42
                blocking: 1 session
             current sql:
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-read()+14<-ntpfprd()+115<-nsbasic_brc()+338<-nsbrecv()+69<-nioqrc()+485<-__PGOSF24_opikndf2()+978<-opitsk()+824<-opiino()+958<-opiodr()+1149<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36
            wait history:
              * time between current wait and wait #1: 0.000015 sec
              1.       event: 'SQL*Net message to client'
                 time waited: 0.000005 sec
                     wait id: 41              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.338132 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 24.359100 sec
                     wait id: 40              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000013 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000005 sec
                     wait id: 39              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
    }




经后查,hanganalyze会列出所有ORACLE进程的信息




并且chain又分为intersecting chains和non-intersecting chains


我们来分析下intersecting chains和non-intersecting chains的区别是什么,先看
non-intersecting chains


经过对比分析,non-intersecting chains的特色就是所包含的会话的阻塞会话,不隶属于任何chain
Chain 2:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 6728
              process id: 146, oracle@seconary (J117)
              session id: 29
        session serial #: 10069
    }
    is waiting for 'buffer busy waits' with wait info:
    {
                      p1: 'file#'=0x1
                      p2: 'block#'=0x15ab9
                      p3: 'class#'=0x1
            time in wait: 0.012849 sec
           timeout after: never
                 wait id: 31
                blocking: 0 sessions
             current sql:
             short stack:
            wait history:
              * time between current wait and wait #1: 0.000028 sec
              1.       event: 'enq: TX - index contention'
                 time waited: 0.016712 sec
                     wait id: 30              p1: 'name|mode'=0x54580004
                                              p2: 'usn<<16 | slot'=0x12a0009
                                              p3: 'sequence'=0x1953
              * time between wait #1 and #2: 0.000066 sec
              2.       event: 'buffer busy waits'
                 time waited: 0.552796 sec
                     wait id: 29              p1: 'file#'=0x1
                                              p2: 'block#'=0x15ab9
                                              p3: 'class#'=0x1
              * time between wait #2 and #3: 0.000340 sec
              3.       event: 'buffer busy waits'
                 time waited: 0.497199 sec
                     wait id: 28              p1: 'file#'=0x1
                                              p2: 'block#'=0x225eb
                                              p3: 'class#'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 6439  --也就是说持锁会话不隶属于chain
              process id: 111, oracle@seconary (J083)
              session id: 205
        session serial #: 9696




再来看看intersecting chains


发现intersecting chains包含的会话的阻塞会话隶属于另一个chains中的持锁会话,也就是说chains包含在另一个chains中
-------------------------------------------------------------------------------
Chain 12:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 6612
              process id: 59, oracle@seconary (J031)
              session id: 33
        session serial #: 11792
    }
    is waiting for 'buffer busy waits' with wait info:
    {
                      p1: 'file#'=0x1
                      p2: 'block#'=0x15ab9
                      p3: 'class#'=0x1
            time in wait: 0.007726 sec
           timeout after: never
                 wait id: 0
                blocking: 0 sessions
             current sql:
             short stack:
    }
    and is blocked by 'instance: 1, os id: 6439, session id: 205',
    which is a member of 'Chain 2'.
 
Chain 12 Signature: <='buffer busy waits'
Chain 12 Signature Hash: 0x9a13abed




最后学习下state of all nodes(注:如果是单实例数据库,其state of all nodes和state of local nodes内容相同)


先摘录一节内容
State of ALL nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):
[9]/1/10/1/0xdcba9740/10574/NLEAF/[31]
[11]/1/12/1/0xdcba3a60/10578/SINGLE_NODE/
[28]/1/29/10069/0xdcb724f0/6728/NLEAF/[204]
[31]/1/32/9465/0xdcb699a0/6664/LEAF_NW/
[32]/1/33/11792/0xdcb66b30/6612/NLEAF/[204]
[33]/1/34/3/0xdcb63cc0/10707/SINGLE_NODE/
[38]/1/39/9453/0xdcb55490/6754/NLEAF/[204]


其实每个列的含义都有注明,我们依次来分析,我采用的方法仍是基于我上述构建的测试2个会话即持锁会话以及等待锁会话,在TRACE FILE查等待会话164,匹配结果如下:


[163]/1/164/10419/0xdc9ea7e0/4046/NLEAF/[301]


根据其每列含义
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):


我们来分解下:


SQL> select saddr from v$session where sid=164;


SADDR
----------------
00000000DC9EA7E0


           数据库实例编号         会话sid    会话serial#   会话的地址saddr   会话对应的操作系统ID          表明会话是否等待           如有值,表明是持锁会话的CHAIN编号,否则为空    
[163]   /    1              /   164    /   10419    /   0xdc9ea7e0   /    4046              /           NLEAF         /          [301]




但是163是什么含义呢,以及301是何义呢,会不会是chains的编号呢,还有NLEAF是何义,另急,我们试着分析下,我试着查301,定位到如下内容


[301]/1/302/9226/0xdc859f80/3812/LEAF/


经过对比分析,301就是持锁会话所在的chains的编号,所以可知NLEAF表明是等待锁的会话,而LEAF就是不等待锁的会话


还不不够,因为state列发现有几个不同的值:NLEAF,SINGLE_NODE,LEAF_NEW,继续分析


以chain 9为例
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):
[9]/1/10/1/0xdcba9740/10574/NLEAF/[31]


从adjlist 31定位到chain 31
[31]/1/32/9465/0xdcb699a0/6664/LEAF_NW/






先看chain 9
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 10574
              process id: 10, oracle@seconary (DBW0)
              session id: 10
        session serial #: 1
    }
    is waiting for 'latch: cache buffers chains' with wait info:
    {
                      p1: 'address'=0xda4f8850
                      p2: 'number'=0x96
                      p3: 'tries'=0x0
            time in wait: 0.012628 sec
           timeout after: never
                 wait id: 50243
                blocking: 0 sessions
             current sql:
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semop()+7<-skgpwwait()+156<-kslgess()+1799<-ksl_get_shared_latch()+620<-kcbbic1()+247<-kcbbiop()+868<-kcbbdrv()+566<-ksbabs()+465<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36
            wait history:
              * time between current wait and wait #1: 0.000058 sec
              1.       event: 'db file async I/O submit'
                 time waited: 0.328988 sec
                     wait id: 50242           p1: 'requests'=0x1
                                              p2: 'interrupt'=0x0
                                              p3: 'timeout'=0x0
              * time between wait #1 and #2: 0.000076 sec
              2.       event: 'db file async I/O submit'
                 time waited: 0.000748 sec
                     wait id: 50241           p1: 'requests'=0x1
                                              p2: 'interrupt'=0x0
                                              p3: 'timeout'=0x0
              * time between wait #2 and #3: 0.000086 sec
              3.       event: 'db file async I/O submit'
                 time waited: 0.000177 sec
                     wait id: 50240           p1: 'requests'=0x1
                                              p2: 'interrupt'=0x0
                                              p3: 'timeout'=0x0
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 6664
              process id: 91, oracle@seconary (J063)
              session id: 32
        session serial #: 9465




再看下SINGLE_NODE的含义是什么
 [33]/1/34/3/0xdcb63cc0/10707/SINGLE_NODE/       


可见SINGLE_NODE表明会话不被任何会话或进程所阻塞
 Chain 146:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (guowang.guowang)
                   os id: 10707
              process id: 29, oracle@seconary (CJQ0)
              session id: 34
        session serial #: 3
    }
    is waiting for 'job scheduler coordinator slave wait' with wait info:
    {
            time in wait: 0.449655 sec
           timeout after: 15.550345 sec
                 wait id: 81976
                blocking: 0 sessions
             current sql:
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semtimedop()+10<-skgpwwait()+156<-ksliwat()+1821<-kslwaitctx()+162<-kslwait()+141<-jscrs_select0()+6801<-jscrs_select()+595<-rpiswu2()+1541<-kkjcjexe()+713<-kkjssrh()+559<-ksbcti()+212<-ksbabs()+1732<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36
            wait history:
              * time between current wait and wait #1: 0.000108 sec
              1.       event: 'rdbms ipc message'
                 time waited: 2.064533 sec
                     wait id: 81975           p1: 'timeout'=0xc8
              * time between wait #1 and #2: 0.006081 sec
              2.       event: 'rdbms ipc message'
                 time waited: 0.501429 sec
                     wait id: 81974           p1: 'timeout'=0x32
              * time between wait #2 and #3: 0.000706 sec
              3.       event: 'job scheduler coordinator slave wait'
                 time waited: 16.003771 sec
                     wait id: 81973           
    }
 
Chain 146 Signature: 'job scheduler coordinator slave wait'
Chain 146 Signature Hash: 0x8c8a9c97


总结:NLEAF表明是等待会话
     SINGLE_NODE表明它是等待任何会话或资源
     LEAF表明它是持锁会话,即它不等待任何会话或资源
     LEAF_NEW它也是持锁会话


 
 LEAF和LEAF_NEW的区别是什么?我们将在下文进行分析与测试。


正文到此结束
Loading...