Reposted

HBase RegionServer fails to start normally and the META metadata cannot be assigned, leaving the whole cluster unusable [after a power-off and restart]

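Incident report: On 2013-11-21, OPS relocated servers, which required powering off every server in the HBase cluster. After the move the HBase cluster was started, but the META metadata could not be assigned, leaving the whole cluster unusable.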

[Incident time]: 15:00:00 to 23:30:00 on November 21

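[Symptoms]: After the machines in the HBase cluster were powered off and restarted, the -ROOT- and .META. metadata tables were retried for assignment over and over but never assigned successfully. As a result the HBase user data regions could not be assigned either, and the whole cluster could not serve external requests (reads or writes).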

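[Cause]: After the servers in the HBase cluster were powered off and restarted, their system time had been altered, so the nodes of the cluster could no longer synchronize data with one another. During the HBase restart, entries were written into the HBase metadata tables under the wrong system time, and that wrong time was 10 hours later than the time of the incident. The metadata tables therefore could not be updated, so they could not be assigned normally, which in turn left the user data unassigned and the cluster unable to start.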
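HBase keeps several versions of a cell, keyed by timestamp, and a read returns the version with the largest timestamp by default. The toy model below is not HBase code; `VersionedCell` and all values in it are hypothetical. It only sketches one plausible reading of the cause described above: an entry stamped with a clock that ran ahead keeps shadowing later writes made with the correct time.

```python
class VersionedCell:
    """Toy model of a timestamp-versioned cell: the newest timestamp wins on read."""

    def __init__(self):
        self._versions = {}  # timestamp in ms -> value

    def put(self, timestamp_ms, value):
        self._versions[timestamp_ms] = value

    def get(self):
        """Return the value with the largest timestamp, or None if the cell is empty."""
        if not self._versions:
            return None
        return self._versions[max(self._versions)]


HOUR_MS = 3600 * 1000
real_now_ms = 1385017200000  # hypothetical "true" time of the restart

cell = VersionedCell()
# Written while the server clock ran 10 hours ahead.
cell.put(real_now_ms + 10 * HOUR_MS, "server-A (stale location)")
# Written a few minutes later, after the clock was corrected.
cell.put(real_now_ms + 5 * 60 * 1000, "server-B (current location)")

print(cell.get())  # the future-stamped entry still wins
```

Until real time passes the future timestamp, every correctly stamped update loses to the stale entry, which matches the observation that the metadata tables could not be updated.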

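[Resolution]:

(1) Traced the exception messages into the corresponding HBase source code and found that the data in -ROOT- had never been initialized, which pointed to a problem in the metadata tables.

(2) Inspected the data in -ROOT-, .META., and the other metadata (it is compressed and stored in a specific encoding, so it has to be decoded before it can be read) and found that the newest entries in the -ROOT- table all carried bad timestamps (between 4 and 10 hours later than the time of the incident).

(3) A write's timestamp corresponds to the system time at the moment of the write, so the problem was finally traced to the system time.

(4) Resynchronized the system time across the whole cluster, waited until real time had reached the timestamp of the newest version in the metadata tables, and then restarted the HBase cluster.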
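Steps (2) to (4) come down to simple arithmetic on millisecond timestamps: find the entries stamped later than the current time, then wait until the latest of them has passed. A minimal sketch follows, assuming the cell timestamps have already been decoded into epoch milliseconds; the function names and sample values are hypothetical and are not taken from the original investigation.

```python
from datetime import datetime, timezone


def future_timestamps(cell_timestamps_ms, now_ms):
    """Return, in ascending order, the timestamps that lie after now_ms."""
    return sorted(ts for ts in cell_timestamps_ms if ts > now_ms)


def seconds_until_safe_restart(cell_timestamps_ms, now_ms):
    """Seconds to wait until real time passes the newest future-stamped entry."""
    ahead = future_timestamps(cell_timestamps_ms, now_ms)
    if not ahead:
        return 0.0
    return (ahead[-1] - now_ms) / 1000.0


def to_utc(timestamp_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000.0, tz=timezone.utc)


now_ms = 1385040000000  # hypothetical current time, after the clocks were resynchronized
timestamps_ms = [
    now_ms - 86_400_000,      # written a day earlier, unaffected
    now_ms + 4 * 3_600_000,   # written under a clock 4 hours ahead
    now_ms + 10 * 3_600_000,  # written under a clock 10 hours ahead
]

wait_s = seconds_until_safe_restart(timestamps_ms, now_ms)
print(f"{len(future_timestamps(timestamps_ms, now_ms))} future-stamped entries; "
      f"wait about {wait_s / 3600:.1f} hours, until {to_utc(now_ms + int(wait_s * 1000))}")
```

Waiting is the conservative choice here: it leaves the metadata untouched, at the cost of extending the outage by however far ahead the bad clock ran.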

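[Impact]:

The HBase cluster could not serve external requests for the entire duration of the incident. Affected services included recommendations, the category tags on the set-top box front end, and writes from 银河 (Galaxy).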

[Responsible party]: OPS

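[Lessons learned and follow-up actions]:

OPS needs to add a script to every server, including but not limited to the HBase cluster, that synchronizes the system time automatically after boot, or else replace the motherboard BIOS batteries outright.

Add a monitoring script for server system time that raises an alert as soon as something is wrong.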
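The monitoring idea can be sketched as a comparison between each host's clock reading and a trusted reference. The example below is illustrative only: `find_skewed_hosts`, the 30-second threshold, and the sample readings are assumptions, and how the readings are collected (for instance over SSH or from an agent) is left out.

```python
def find_skewed_hosts(reference_epoch_s, host_epoch_s, max_skew_s=30.0):
    """Return {host: skew_seconds} for hosts whose clock differs from the
    reference by more than max_skew_s. Positive skew means the host is ahead."""
    skewed = {}
    for host, epoch_s in host_epoch_s.items():
        skew = epoch_s - reference_epoch_s
        if abs(skew) > max_skew_s:
            skewed[host] = skew
    return skewed


reference = 1385017621.0  # hypothetical reading from a trusted time source
readings = {
    "JN-PEGASUS-113": reference + 2.0,          # within tolerance
    "JN-PEGASUS-114": reference + 10 * 3600.0,  # 10 hours ahead
}

for host, skew in find_skewed_hosts(reference, readings).items():
    print(f"ALERT {host}: clock differs from reference by {skew:+.0f} s")
```

Running a check like this before HBase is started would catch a bad clock before it can stamp any metadata.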

Related exception output (hbase-master log):
2013-11-21 15:07:01,710 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=JN-PEGASUS-114,60020,1385045053456; java.net.ConnectException: Connection refused
2013-11-21 15:07:01,710 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@375b4ad2; serverName=JN-PEGASUS-113,60020,1385045125259
2013-11-21 15:07:01,711 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@375b4ad2; serverName=JN-PEGASUS-113,60020,1385045125259
2013-11-21 15:07:01,714 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=JN-PEGASUS-114,60020,1385045053456; java.net.ConnectException: Connection refused
2013-11-21 15:07:01,766 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@375b4ad2; serverName=JN-PEGASUS-113,60020,1385045125259
2013-11-21 15:07:01,767 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@375b4ad2; serverName=JN-PEGASUS-113,60020,1385045125259
2013-11-21 15:07:01,770 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=JN-PEGASUS-114,60020,1385045053456; java.net.ConnectException: Connection refused
2013-11-21 15:07:01,823 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@375b4ad2; serverName=JN-PEGASUS-113,60020,1385045125259
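The server names in this log carry a useful clue. An HBase server name has the form `host,port,startcode`, where the start code is the server's start time in epoch milliseconds. The sketch below compares that start code with the timestamp of the log line itself. It assumes the log timestamps are local time in UTC+8, which the original does not state; `startcode_skew` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumption: log timestamps are UTC+8

LOG_LINE = (
    "2013-11-21 15:07:01,710 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: "
    "Failed verification of .META.,,1 at address=JN-PEGASUS-114,60020,1385045053456; "
    "java.net.ConnectException: Connection refused"
)

PATTERN = re.compile(
    r"^(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"(?:address|serverName)=(?P<host>[^,]+),(?P<port>\d+),(?P<startcode>\d+)"
)


def startcode_skew(line, tz=CST):
    """Return (host, start_time, start_time - log_time), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    logged = datetime.strptime(match.group("logged"), "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=tz)
    started = datetime.fromtimestamp(int(match.group("startcode")) / 1000.0, tz=tz)
    return match.group("host"), started, started - logged


host, started, skew = startcode_skew(LOG_LINE)
print(f"{host} reports start time {started}, which is {skew} after the log line was written")
```

Under the UTC+8 assumption, the start code of JN-PEGASUS-114 falls roughly seven and a half hours after the moment the master logged the line, which is consistent with the 4 to 10 hour offsets found in the -ROOT- timestamps.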

Inspecting the .META. data

[root@hbase-regionserver-63 hbase-0.94.0]# hadoop fs -ls /hbase/.META./1028785192/info/
Found 3 items
-rw-r--r-- 3 hadoop supergroup 123594 2013-11-21 15:18 /hbase/.META./1028785192/info/0ba39c9e5c5243cc9ddc858b1404f1a5
-rw-r--r-- 3 hadoop supergroup 878632 2013-11-21 14:48 /hbase/.META./1028785192/info/71f8fd115df84b508f901089c8fed485
-rw-r--r-- 3 hadoop supergroup 108793 2013-11-21 15:07 /hbase/.META./1028785192/info/f5783219194b43e3b10f3987f90e87bd

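After the HDFS path information for HBase .META. was changed, a large number of underlying blocks were deleted. The corresponding Hadoop namenode log looks like this: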

2013-11-21 15:28:27,304 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.1.173:50010 to delete blk_-7487020361358762966_215408 blk_-8122039544872838071_573783 blk_-9132401723635506730_240836 blk_-8817011066820721755_222709 blk_-7482838182036631217_580489 blk_-8692623902011880720_271585 blk_-9194213764489003189_566324 blk_-7818575025633702859_567413 blk_-7782092996271748724_567886 blk_-7537221059427585801_248381 blk_-7771186196426641298_566676 blk_-8655366304697579801_242585 blk_-7743335007402662114_577511 blk_-9190733966930396917_243073 blk_-7937301671034920772_223154 blk_-7073893609675722132_582664 blk_-8350559128659457949_230876 blk_-8553614884598493217_566197 blk_-8426073302466263409_582529 blk_-7307226984810852741_224596 blk_-8276944556575532635_204718 blk_-7094785672879703044_311193 blk_-8703802982016668747_337641 blk_-8957425091611830600_325176 blk_-7101261243070711149_581128 blk_-9114505605008147640_567956 blk_-7624362142277101805_574272 blk_-7444056656823134825_258581 blk_-7627508745426111550_274398 blk_-8788415480607622228_580487 blk_-8879320210914569367_576120 blk_-7980065865047754611_581199 blk_-8201911779370100799_227623 blk_-8221512299380293656_228309 blk_-8898550848935267825_228220 blk_-8560662783047417856_221085 blk_-8245115673920669219_228052 blk_-8026254562528372449_568315 blk_-8409330970038244921_290651 blk_-8502878669933890068_238440 blk_-9040317541050851963_576441 blk_-8606620627256410646_198367 blk_-8706413775495311664_221476 blk_-8781366489754699730_570888 blk_-8191269834838165637_302634 blk_-8247039854664297302_568805 blk_-9186944313302370549_322456 blk_-8173682028915241318_572036 blk_-8215283262245060161_238145 blk_-8149486253055794552_205286 blk_-7889941863090052733_568954 blk_-7438551359495144991_576304 blk_-7424948931261643978_274803 blk_-7143392720225392279_205389 blk_-9019474090336869472_575799 blk_-7110760074426158306_581124 blk_-7974155553549995297_580156 blk_-7802755268552362846_572298 blk_-7420764168713689257_221881 blk_-7911852451373201679_246489 blk_-8678689179677029277_241032 blk_-8056817889703560039_237664 blk_-7527834895910541546_567919 blk_-8097419931370685547_577068 blk_-9059993578730187465_240467 blk_-9154536910228387731_568741 blk_-8480532808429517979_580176 blk_-8387071889833388762_272474 blk_-8911047820229195712_576556 blk_-7205488071883889303_247984 blk_-7393970579119441293_242732 blk_-8969406286682796283_571306 blk_-7853143534075685742_299359 blk_-8600672482111359836_572929 blk_-7148477550526966769_573478 blk_-8593210505834013445_232951 blk_-7869022322271931544_247585 blk_-8957077099829349126_238431 blk_-8264472700111405412_251618 blk_-8137831625403494930_236682 blk_-8116076500375044514_220246 blk_-7493124999822841166_570972 blk_-8427625322060820453_576032 blk_-7312111569799052353_317283 blk_-7472523439944790774_575667 blk_-8712787557845009667_222695 blk_-8649997418881103862_243528 blk_-8690262803917715412_577045 blk_-8675098457193867694_241669 blk_-8134884608826753660_579784 blk_-8607813306634309118_228439 blk_-7940276436763175414_237717 blk_-8764578934631030635_569916 blk_-7404152548434567948_576265 blk_-7907711717632351655_232119 blk_-8514141389241039309_568166 blk_-8645939788227975687_239504 blk_-8649945796518195260_244074 blk_-8327493436867398936_338018 blk_-8695120613245157009_216737
Original article: http://www.adintellig.com/hbase-regionserver-will-not-start-normally/