Going straight to the error log: it reports an exception while allocating resources to container_e11_1531648435560_0733_01_000003.
Searching further up in the log, I found that the corresponding application was already in the "killed by user" state.
Container exited with a non-zero exit code 143, ExitStatus: 143, Priority: 0], [container_e11_1531648435560_0733_01_000003, CreateTime: 1534861451578, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0], [container_e11_1531648435560_0735_01_000001, CreateTime: 1534861268376, State: COMPLETE, Capability: <memory:1024, vCores:1>, Diagnostics: Container [pid=9636,containerID=container_e11_1531648435560_0735_01_000001] is running beyond virtual memory limits. Current usage: 400.4 MB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e11_1531648435560_0735_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 9649 9636 9636 9636 (java) 4552 1390 2524225536 102195 /etc/alternatives/java_sdk/bin/java -Xmx424m -Dbackend.checkpoint.dir=hdfs://thflink/flink/store/checkpoints/rapido-gateway -Dlog.file=/data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner
|- 9636 9634 9636 9636 (bash) 0 0 116060160 300 /bin/bash -c /etc/alternatives/java_sdk/bin/java -Xmx424m -Dbackend.checkpoint.dir=hdfs://thflink/flink/store/checkpoints/rapido-gateway -Dlog.file=/data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner 1> /data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.out 2> /data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.err
At this point I recalled a YARN fault-tolerance parameter, yarn.nodemanager.recovery.enabled. It defaults to true in our environment; setting it to false to turn off NodeManager recovery resolved the recovery exception above.
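For reference, a minimal sketch of the corresponding yarn-site.xml entry (the property name comes from the standard Hadoop YARN configuration; the placement and rollout details are assumptions, not taken from the original post):

<!-- Turn off NodeManager recovery so containers of already finished/killed
     applications are not restored after a NodeManager restart -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>false</value>
</property>

This needs to be set on every NodeManager and takes effect after the NodeManagers are restarted. Note the trade-off: with recovery disabled, running containers are not preserved across a NodeManager restart, which is the price for avoiding the stale-recovery problem described above.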
PS: Based on the ERROR messages in the log, my first reaction was that this was a resource-shortage problem, so I kept adjusting the containers' resource allocation, but no matter what values I tried, the problem remained.
Having run out of ideas, I went back and analyzed the log more carefully, and it turned out the NodeManager was trying to recover a container that no longer existed.
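As a hedged illustration of the kind of resource tuning mentioned above (these are standard Hadoop YARN properties, not settings confirmed from the original cluster, and the value 4 is purely illustrative): the "running beyond virtual memory limits" kill seen in the log is enforced by the NodeManager's virtual-memory check, which can be relaxed or disabled in yarn-site.xml:

<!-- Allow more virtual memory per MB of physical memory (default ratio is 2.1,
     which matches the "2.5 GB of 2.1 GB virtual memory used" message in the log) -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<!-- Or disable the virtual-memory check entirely -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

In this case, though, no amount of such tuning could help, because the real cause was the NodeManager recovering a container that no longer existed.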