转载

基于oracle 10.2.0.1 rac学习lms进程系列四

背景

   之前,我们了解了lmd进程一些概念,这儿我们学习下另一个重要RAC后台进程,LMS进程。


结论

1,如果lms出现故障,会导致dml事务无法提交,即提交或回退hang住;并且远端节点的DML操作会HANG住
2,lms进程的等待事件和lmd进程的相似,是gcs remote message
  引申,即和远端节点的LMS进程进行通讯
3,lms进程出现故障,会引发ipc timeout,如果ipc timeout达到一定期限,会导致RAC节点重启
4,lms进程恢复正常后,由于远端的变更数据积压,需要在对应节点进行实例恢复,以达到数据一致性的要求
5,性能指标gcs messages sent 也是反映数据库DML或活动是否频繁的一个维度,如果其值很高,说明数据库非常繁忙
  当然我们说这只是一种维度,不是绝值,毕竟数据库是非常复杂的,要由多个进程协作交互方可正常运行
  但它为我们分析问题提供一种思路


6,暂未发现LMS相关的隐含参数,当然也可能是我查找方法不对或理解不深,还要继续学习
7, 10706 event可以分析跟瞎RAC全局队列锁的获取过程
    其操作过程:


SQL> select program,spid,pid from v$process where addr='0000000083A62BD0';


PROGRAM                                          SPID                PID
------------------------------------------------ ------------ ----------
oracle@jingfa1 (TNS V1-V3)                       6496                 27




SQL> oradebug setospid 6496
Oracle pid: 27, Unix process pid: 6496, image: oracle@jingfa1 (TNS V1-V3)
SQL> oradebug event 10706 trace name context forever,level 12
Statement processed.




SQL> insert into t_lock values(18,18);


1 row created.


SQL> oradebug event 10706 trace name context off
Statement processed.
SQL> oradebug tracefile_name
/u01/app/oracle/admin/jingfa/udump/jingfa1_ora_6496.trc




测试                                                                                             



----oracle version
SQL> select * from v$version where rownum=1;


BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi


---lms含义
The LMS process maintains records of the data file statuses and each cached block by recording information in a Global Resource Directory (GRD). The LMS process also controls the flow of messages to remote instances and manages global data block access and transmits block images between the buffer caches of different instances. This processing is part of the Cache Fusion feature.


1,在RAC的GRD即全局资源目录中,维护数据文件的状况,并记录全局缓存中数据块的信息
2,控制发送到远端节点的消息
3,管理在节点间传输数据块以及缓存数据块
4,它也是RAC机制CACHE FUSION功能一部分,也就是没有LMS,CACHE FUSION无从谈起,可见其重要性




----了解lms的作用
SQL> select addr,program,username,pid,spid from v$process where username='oracle' and lower(program) like '%lms%';


ADDR             PROGRAM                                          USERNAME               PID SPID
---------------- ------------------------------------------------ --------------- ---------- ------------
0000000083A58DB0 oracle@jingfa1 (LMS0)                            oracle                   7 30504


--hang lms
SQL> oradebug setospid 30504
Oracle pid: 7, Unix process pid: 30504, image: oracle@jingfa1 (LMS0)
SQL> oradebug suspend
Statement processed


--new session
SQL> select * from t_lock;


         A          B
---------- ----------
        11          1
         2          2


--重复多次测试
SQL> insert into t_lock values(3,3);


1 row created.


SQL> commit;
提交hang住


--resume lms
SQL> oradebug resume
Statement processed.


提交完成
SQL> commit;


Commit complete.




---再看看lms的等待事件
SQL> select sid,serial#,program,event,p1,p1text,p2,p2text,p3,p3text from v$session where paddr='0000000083A58DB0';


       SID    SERIAL# PROGRAM                   EVENT                                  P1 P1TEXT                  P2 P2TEXT                  P3 P3TEXT
---------- ---------- ------------------------- ------------------------------ ---------- --------------- ---------- --------------- ---------- ---------------
       165          1 oracle@jingfa1 (LMS0)     gcs remote message                     24 waittime                 0 poll                     0 event




--ipc timeout过久,会引发节点重启
Wed Nov 11 04:38:23 2015
IPC Send timeout detected. Receiver ospid 30504
Wed Nov 11 04:38:24 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lms0_30504.trc:
Wed Nov 11 04:38:26 2015
Trace dumping is performing id=[cdmp_20151111043845]
Wed Nov 11 04:39:27 2015
Unix process pid: 30504, image: oracle@jingfa1 (LMS0) resumed
Wed Nov 11 04:40:19 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lmon_30500.trc:
ORA-29740: evicted by member 1, group incarnation 16
Wed Nov 11 04:40:19 2015
LMON: terminating instance due to error 29740
Wed Nov 11 04:40:19 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lmd0_30502.trc:
ORA-29740: evicted by member , group incarnation 
Wed Nov 11 04:40:19 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lms0_30504.trc:
ORA-29740: evicted by member , group incarnation 
Wed Nov 11 04:40:19 2015
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/jingfa/bdump/jingfa1_diag_30496.trc
Wed Nov 11 04:40:19 2015
Trace dumping is performing id=[cdmp_20151111044019]
Wed Nov 11 04:40:23 2015
Instance terminated by LMON, pid = 30500




---查看下lms进程的TRC文件
[oracle@jingfa1 bdump]$ pwd
/u01/app/oracle/admin/jingfa/bdump
[oracle@jingfa1 bdump]$ ll -lht *lms*
-rw-rw---- 1 oracle oinstall 1.3K Nov 11 04:41 jingfa1_lms0_21223.trc
-rw-r----- 1 oracle oinstall  80K Nov 11 04:40 jingfa1_lms0_30504.trc
-rw-r----- 1 oracle oinstall 369K Nov 11 03:30 jingfa1_lms0_15273.trc
-rw-rw---- 1 oracle oinstall 356K Nov 10 07:28 jingfa1_lms0_21824.trc
-rw-rw---- 1 oracle oinstall 1.3K Nov 10 05:21 jingfa1_lms0_32705.trc
-rw-r----- 1 oracle oinstall 6.1K Nov 10 05:20 jingfa1_lms0_11222.trc
-rw-r----- 1 oracle oinstall 2.0K Nov 10 01:51 jingfa1_lms0_10351.trc


可见lms发送gcs message以及数据块(PI)
 0 GCS shadows cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 0 GCS shadows traversed, 0 replayed, 0 duplicates
 lms 0 finished replaying gcs resources
 0 write requests issued in 1724 GCS resources
 0 PIs marked suspect, 0 flush PI msgs
 0 write requests issued in 1719 GCS resources
 0 PIs marked suspect, 0 flush PI msgs
 0 write requests issued in 1676 GCS resources
 0 PIs marked suspect, 0 flush PI msgs
 0 write requests issued in 1689 GCS resources
 0 PIs marked suspect, 0 flush PI msgs
 0 write requests issued in 143 GCS resources
 0 PIs marked suspect, 0 flush PI msgs
 lms 0 finished fixing gcs write protocol


--再从性能指标方面看看
SQL> select statistic#,name,class from v$statname where lower(name) like '%gcs%' and STATISTIC#=44;


STATISTIC# NAME                                                                  CLASS
---------- ---------------------------------------------------------------- ----------
        44 gcs messages sent                                                        32




SQL> select v$statname.name,v$sysstat.value from v$sysstat,v$statname where v$sysstat.statistic#=v$statname.statistic# and v$statname.statistic# in (44);


NAME                                                                  VALUE
---------------------------------------------------------------- ----------
gcs messages sent                                                      2840




可见发生重级别DML时,性能指标暴增
SQL> select * from t_lock;


         A          B
---------- ----------
        11          1
         2          2
         3          3
         3          3
         4          4
         1          1
         2          2
         3          3
         4          4
         5          5
         6          6


         A          B
---------- ----------
         7          7
         8          8
         9          9
        10         10


15 rows selected.


SQL> insert into t_lock select level,level from dual connect by level<=1000000;
1000000 rows created.






SQL> select v$statname.name,v$sysstat.value from v$sysstat,v$statname where v$sysstat.statistic#=v$statname.statistic# and v$statname.statistic# in (44);


NAME                                                                  VALUE
---------------------------------------------------------------- ----------
gcs messages sent                                                      4039




--hang lms
SQL> oradebug setospid 21223
Oracle pid: 7, Unix process pid: 21223, image: oracle@jingfa1 (LMS0)
SQL> oradebug suspend
Statement processed.


SQL> select v$statname.name,v$sysstat.value from v$sysstat,v$statname where v$sysstat.statistic#=v$statname.statistic# and v$statname.statistic# in (44);


NAME                                                                  VALUE
---------------------------------------------------------------- ----------
gcs messages sent                                                      4041


可见发生重量级DML,性能指标只会微量增高
SQL> insert into t_lock select level,level from dual connect by level<=1000000;
--dml hang




SQL> select v$statname.name,v$sysstat.value from v$sysstat,v$statname where v$sysstat.statistic#=v$statname.statistic# and v$statname.statistic# in (44);


NAME                                                                  VALUE
---------------------------------------------------------------- ----------
gcs messages sent                                                      4060


恢复lms,性能指标才会恢复正常,暴增起来
SQL> oradebug resume
Statement processed.


SQL> insert into t_lock select level,level from dual connect by level<=1000000;


1000000 rows created.




SQL> select v$statname.name,v$sysstat.value from v$sysstat,v$statname where v$sysstat.statistic#=v$statname.statistic# and v$statname.statistic# in (44);


NAME                                                                  VALUE
---------------------------------------------------------------- ----------
gcs messages sent                                                      5166


--再通过10706 EVENT分析下全局队列锁的操作


SQL> select program,spid,pid from v$process where addr='0000000083A62BD0';


PROGRAM                                          SPID                PID
------------------------------------------------ ------------ ----------
oracle@jingfa1 (TNS V1-V3)                       6496                 27




SQL> oradebug setospid 6496
Oracle pid: 27, Unix process pid: 6496, image: oracle@jingfa1 (TNS V1-V3)
SQL> oradebug event 10706 trace name context forever,level 12
Statement processed.




SQL> insert into t_lock values(18,18);


1 row created.


SQL> oradebug event 10706 trace name context off
Statement processed.
SQL> oradebug tracefile_name
/u01/app/oracle/admin/jingfa/udump/jingfa1_ora_6496.trc


ksipget: LB,cd381e0a,b3a45175 opt=0x10020 mode=3 timeout=0   ---LB,cd381e0a,b3a45175是什么,opt=0x10020又是什么,说白了表t_lock加3级锁(分析见下)
ksipget: lid=0x82f51e90 res=0 time=62
*** 2015-11-11 05:21:19.228
ksiprls: lid=0x82f51e90 opt=0x0
*** 2015-11-11 05:21:30.797
ksipget: LB,cd381e0a,b3a45175 opt=0x10020 mode=3 timeout=0
ksipget: lid=0x82f51e90 res=0 time=80
*** 2015-11-11 05:21:30.799
ksiprls: lid=0x82f51e90 opt=0x0
*** 2015-11-11 05:21:30.799
ksipget: TM,2860,0 opt=0x10021 mode=2 timeout=0
ksipget: lid=0x82f51e90 res=0 time=29




SQL> oradebug setmypid
Statement processed.
SQL> oradebug dump library_cache 12
Statement processed.
SQL> oradebug tracefile_name
/u01/app/oracle/admin/jingfa/udump/jingfa1_ora_26571.trc


[oracle@jingfa1 bdump]$ more /u01/app/oracle/admin/jingfa/udump/jingfa1_ora_26571.trc|grep -i --color cd381e0a
  hash=cd381e0ab3a451752d5c0ad44df74f5e timestamp=11-09-2015 21:49:32
    LOCK INSTANCE LOCK: id=LBcd381e0ab3a45175
    PIN INSTANCE LOCK: id=NBcd381e0ab3a45175 mode=S release=F flags=[00]
[oracle@jingfa1 bdump]$ more /u01/app/oracle/admin/jingfa/udump/jingfa1_ora_26571.trc|grep -i --color b3a45175
  hash=cd381e0ab3a451752d5c0ad44df74f5e timestamp=11-09-2015 21:49:32
    LOCK INSTANCE LOCK: id=LBcd381e0ab3a45175
    PIN INSTANCE LOCK: id=NBcd381e0ab3a45175 mode=S release=F flags=[00]
[oracle@jingfa1 bdump]$ 


由上述的内容再上向推导,可知是library object handle,而它又属于bucket 85854,这个对应表t_lock
BUCKET 85854:
  LIBRARY OBJECT HANDLE: handle=6fe77340 mutex=0x6fe77470(0)
  name=TBS_ZXY.T_LOCK 
  hash=cd381e0ab3a451752d5c0ad44df74f5e timestamp=11-09-2015 21:49:32
  namespace=TABL flags=KGHP/TIM/XLR/[00000020]
  kkkk-dddd-llll=0000-0709-0709 lock=N pin=0 latch#=1 hpc=0002 hlc=0002
  lwt=0x6fe773e8[0x6fe773e8,0x6fe773e8] ltm=0x6fe773f8[0x6fe773f8,0x6fe773f8]
  pwt=0x6fe773b0[0x6fe773b0,0x6fe773b0] ptm=0x6fe773c0[0x6fe773c0,0x6fe773c0]
  ref=0x6fe77418[0x6fe77418,0x6fe77418] lnd=0x6fe77430[0x6fe736f8,0x6fe77c60]
    LOCK INSTANCE LOCK: id=LBcd381e0ab3a45175 ---1个锁
    PIN INSTANCE LOCK: id=NBcd381e0ab3a45175 mode=S release=F flags=[00]  --可见加了2个锁
    INVALIDATION INSTANCE LOCK: id=IV0000286009163221 mode=S
    LIBRARY OBJECT: object=7e36b610
    type=TABL flags=EXS/LOC[0005] pflags=[0000] status=VALD load=0
    DATA BLOCKS:
    data#     heap  pointer    status pins change whr
    ----- -------- -------- --------- ---- ------ ---
        0 6fe77280 7e36b728 I/-/A/-/-    0 NONE   00 
        3 7e36aed8 7e367940 I/-/A/-/-    0 NONE   00 
        8 7e36b910 7e36b3a0 I/-/A/-/-    0 NONE   00 
        9 7e36ad88 7e36abd0 I/-/A/-/-    0 NONE   00 
       10 7e36ae10 7e36a7e8 I/-/A/-/-    0 NONE   00 
  BUCKET 85854 total object count=1
正文到此结束
Loading...