The customer's environment is two IBM X3850 servers running Oracle Linux 6.x x86_64, hosting an Oracle 11.2.0.4.0 RAC database. The shared storage is EMC, mirrored through the EMC VPLEX storage-virtualization layer, with EMC's native multipathing software installed in the OS. The symptom: when a switchover occurred inside VPLEX, the disk group holding the OCR and voting disks became inaccessible on one RAC node, the ora.crsd resource went offline, and the Grid Infrastructure cluster stack on that node went down. The database instance on that node was not affected, but it stopped accepting new external connections; throughout the episode the other node was completely unaffected. The relevant logs follow:
1. OS log:
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 4 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 2 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 3 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 1 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 0 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 11 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 12 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 10 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 9 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 8 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 7 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 5 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 6 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM00142000957 port CL2-00 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 1 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 12 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 11 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 10 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 7 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 4 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 8 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 9 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 5 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 3 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 6 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 2 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 0 to CKM00142000957 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM00142000957 port CL2-04 is dead.
The OS log shows that at Mar 18 08:25:48 both links, port CL2-00 and port CL2-04, went dead.
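When a burst of path-dead messages like this appears, the state of the aggregated device can be checked at the multipathing layer first. A minimal sketch using EMC PowerPath's powermt utility (assuming PowerPath is the EMC-native multipathing software mentioned above; output format varies by version):

# Show every managed pseudo device with the state of each native path
powermt display dev=all

# Per-HBA summary: total vs. dead paths on each bus
powermt display

# After the links recover, probe dead paths and mark them alive again
powermt restore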
2. ASM alert log:
Fri Mar 18 08:25:59 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3. <<<< At almost the same time as the OS errors, ASM starts checking the PST (Partnership Status Table) of all disks; ASM's wait timeout is 15 seconds.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
Fri Mar 18 08:25:59 2016
NOTE: process _b000_+asm1 (66994) initiating offline of disk 0.3190888900 (OCRVDISK_0000) with mask 0x7e in group 3 <<<< group 3 is the disk group holding the OCR and voting disks.
NOTE: process _b000_+asm1 (66994) initiating offline of disk 1.3190888899 (OCRVDISK_0001) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (66994) initiating offline of disk 2.3190888898 (OCRVDISK_0002) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
GMON checking disk modes for group 3 at 10 for pid 48, osid 66994
ERROR: no read quorum in group: required 2, found 0 disks <<<< The OCR/voting-disk group uses Normal redundancy across 3 ASM disks, so at least 2 must be accessible; in fact 0 are.
NOTE: checking PST for grp 3 done.
NOTE: initiating PST update: grp = 3, dsk = 0/0xbe3119c4, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 3, dsk = 1/0xbe3119c3, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 3, dsk = 2/0xbe3119c2, mask = 0x6a, op = clear
GMON updating disk modes for group 3 at 11 for pid 48, osid 66994
ERROR: no read quorum in group: required 2, found 0 disks <<<< Still 0 disks accessible.
Fri Mar 18 08:25:59 2016
NOTE: cache dismounting (not clean) group 3/0x3D81E95D (OCRVDISK)
WARNING: Offline for disk OCRVDISK_0000 in mode 0x7f failed. <<<< Every disk backing the OCR/voting-disk group is offline.
WARNING: Offline for disk OCRVDISK_0001 in mode 0x7f failed.
WARNING: Offline for disk OCRVDISK_0002 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 66996, image: oracle@dzqddb01 (B001)
Fri Mar 18 08:25:59 2016
NOTE: halting all I/Os to diskgroup 3 (OCRVDISK) <<<< All I/O against the OCRVDISK disk group is halted.
Fri Mar 18 08:25:59 2016
NOTE: LGWR doing non-clean dismount of group 3 (OCRVDISK)
NOTE: LGWR sync ABA=11.69 last written ABA 11.69
Fri Mar 18 08:25:59 2016
kjbdomdet send to inst 2
detach from dom 3, sending detach message to inst 2
Fri Mar 18 08:25:59 2016
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 96)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 3 invalid = TRUE
Fri Mar 18 08:25:59 2016
NOTE: No asm libraries found in the system
2 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Fri Mar 18 08:25:59 2016
WARNING: dirty detached from domain 3
NOTE: cache dismounted group 3/0x3D81E95D (OCRVDISK)
SQL> alter diskgroup OCRVDISK dismount force /* ASM SERVER:1031924061 */ <<<< Force dismount of the OCRVDISK disk group.
Fri Mar 18 08:25:59 2016
NOTE: cache deleting context for group OCRVDISK 3/0x3d81e95d
GMON dismounting group 3 at 12 for pid 51, osid 66996
NOTE: Disk OCRVDISK_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVDISK_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVDISK_0002 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 3
ASM Health Checker found 1 new failures
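While the group is dismounted, the resulting state is visible from the surviving ASM instance; a minimal sketch using the standard v$ views (disk group names as in this environment):

-- OCRVDISK should show state DISMOUNTED while the other groups stay MOUNTED
SQL> select group_number, name, state, type from v$asm_diskgroup;

-- Per-disk status as seen by this ASM instance
SQL> select group_number, disk_number, name, path, mount_status, mode_status
       from v$asm_disk;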
3. Clusterware alert log:
2016-03-18 11:53:19.394: [crsd(47973)]CRS-1006:The OCR location +OCRVDISK is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log. <<<< Much later than the time at which OCRVDISK was dismounted.
2016-03-18 11:53:38.437: [/u01/app/11.2.0/grid/bin/oraagent.bin(48283)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:121} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2016-03-18 11:53:38.437: [/u01/app/11.2.0/grid/bin/scriptagent.bin(80385)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/scriptagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:9:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.
2016-03-18 11:53:38.437: [/u01/app/11.2.0/grid/bin/orarootagent.bin(48177)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:3303} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/orarootagent_root/orarootagent_root.log.
2016-03-18 11:53:38.437: [/u01/app/11.2.0/grid/bin/oraagent.bin(48168)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_grid/oraagent_grid.log.
2016-03-18 11:53:38.442: [ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'. <<<< ora.crsd is now OFFLINE.
2016-03-18 11:53:39.773: [crsd(45323)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.
2016-03-18 11:53:39.779: [crsd(45323)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log. <<<< The physical storage is inaccessible.
2016-03-18 11:53:40.470: [ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'.
This raises a first question: why did ora.crsd fail while ora.cssd never went OFFLINE (crsctl stat res -t -init confirms that ora.cssd stayed up, the database instance kept running, and the node was not evicted)? The reason is that the disks behind OCRVDISK were only briefly inaccessible: the cssd process reads the three ASM disks backing OCRVDISK directly and does not depend on the OCRVDISK disk group being MOUNTED, and the Clusterware default disk-heartbeat timeout is 200 seconds, so cssd was never in trouble. A quick way to confirm this from the lower stack is sketched below.
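A minimal check of the ohasd-managed ("init") stack while crsd is down, using the same command the diagnosis above relies on:

# Core stack resources managed by ohasd: ora.cssd should be ONLINE
# even though ora.crsd shows OFFLINE
crsctl stat res -t -init

# The voting files are read as raw ASM disks, not through the mounted
# disk group, so they should still be listed as in use
crsctl query css votedisk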
This in turn raises further questions: why did the other RAC node see no failure at all, and why was only the OCRVDISK disk group dismounted while every other disk group stayed mounted? After the incident, simply restarting the HAS stack brought the node back to normal; given that the other disk groups and the other node were fine, we can conclude that the shared storage itself was basically healthy and had merely become briefly inaccessible when the links dropped. The key clue is this line in the ASM instance's alert log: WARNING: Waited 15 secs for write IO to PST disk. Is a 15-second wait too short, causing OCRVDISK to be taken offline? Here is the explanation from MOS:
Generally this kind messages comes in ASM alertlog file on below situations,
Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup, <<<< ASM PST heartbeats to disks in a normal- or high-redundancy disk group are delayed.
thus the ASM instance dismount the diskgroup. By default, it is 15 seconds. <<<< When the heartbeat check fails, the ASM instance dismounts the disk group; the default timeout is 15 seconds.
By the way the heart beat delays are sort of ignored for external redundancy diskgroup. <<<< Heartbeat delays are essentially ignored for external-redundancy disk groups.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly. <<<< Even when the PST heartbeat exceeds 15 seconds, an external-redundancy disk group is not dismounted.
The ASM disk could go into unresponsiveness, normally in the following scenarios: <<<< An ASM disk typically becomes unresponsive in the following scenarios:
+ Some of the paths of the physical paths of the multipath device are offline or lost <<<< 1. Some physical paths under the multipath device are offline or lost.
+ During path 'failover' in a multipath set up <<<< 2. Paths under the multipath device are failing over.
+ Server load, or any sort of storage/multipath/OS maintenance <<<< 3. Heavy server load, or storage/multipath/OS maintenance.
The passage above largely explains the root cause. Two storage links went down (probably during a failover), leaving the aggregated shared-storage devices briefly inaccessible. OCRVDISK is a Normal-redundancy disk group, so ASM runs the PST heartbeat check; because the disks behind OCRVDISK stayed inaccessible for more than 15 seconds, ASM dismounted OCRVDISK outright, which made the OCR file inaccessible and took the crs resource OFFLINE. Since cssd's disk-heartbeat timeout is 200 seconds and it accesses the ASM disks directly rather than through the ASM disk group, the css resource was unaffected, the ohasd high-availability stack kept working, the node was not evicted, and the database instance continued to run. Oracle's suggested fix at the database level: If you can not keep the disk unresponsiveness to below 15 seconds, then the below parameter can be set in the ASM instance ( on all the Nodes of RAC ):
_asm_hbeatiowait <<<< This hidden parameter sets the PST heartbeat I/O wait timeout.
As per internal bug 17274537 , based on internal testing the value should be increased to 120 secs, which is fixed in 12.1.0.2 <<<< Starting with 12.1.0.2, the default value of this parameter is raised to 120 seconds.
Run below in asm instance to set desired value for _asm_hbeatiowait
alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*'; <<<< Run this command to change the parameter on the ASM instances, then restart ASM and CRS.
And then restart asm instance / crs, to take new parameter value in effect.
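As a concrete illustration, a sketch of setting the value MOS recommends (120 seconds, per bug 17274537) and reading it back afterwards; since hidden parameters do not appear in v$parameter, the usual x$ksppi/x$ksppcv join is used (connect to the ASM instance as SYSASM):

-- On one ASM instance; sid='*' covers all RAC nodes via the shared spfile
SQL> alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';

-- After restarting ASM/CRS, verify the value actually in effect
SQL> select a.ksppinm name, b.ksppstvl value
       from x$ksppi a, x$ksppcv b
      where a.indx = b.indx
        and a.ksppinm = '_asm_hbeatiowait';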
To further reduce exposure to this kind of failure, the OCR can also be mirrored into a second ASM disk group, which improves the availability of the ora.crsd resource still more; a sketch follows.
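A minimal sketch of adding an OCR mirror with ocrconfig (run as root; +DATA is a placeholder for an existing, mounted disk group in this environment):

# Register a second OCR location in another disk group
/u01/app/11.2.0/grid/bin/ocrconfig -add +DATA

# Verify that both OCR locations are registered and intact
/u01/app/11.2.0/grid/bin/ocrcheck

For the full write-up, see the MOS note "ASM diskgroup dismount with 'Waited 15 secs for write IO to PST'" (Doc ID 1581684.1). --end--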