MySQL案例-mysqld got signal 11
-------------------------------------------------------------------------------------------------正文---------------------------------------------------------------------------------------------------------------
背景:
MySQL-5.7.12, debian 8核16G虚拟机, 业务方反馈在某一个时间点, 出现了大量的数据库报错, 之后恢复正常;
场景:
开发查看日志后, 发现在某个时间点, 应用断开了所有与数据库的连接, 几秒钟以后就恢复了;
同时监控系统的内存使用率出现了异常的骤降;
3min之后收到了报警系统的信息, 内存使用率82%;
分析:
第一时间的判断是网络的问题造成了应用层的连接断开了, 但是这种内存使用率骤降的现象不会是网络造成的;
查看MySQL的日志, 发现MySQL实例发生了crash, 相关的报错信息如下:
-
07:42:44 UTC - mysqld got signal 11 ;
-
This could be because you hit a bug. It is also possible that this binary
-
or one of the libraries it was linked against is corrupt, improperly built,
-
or misconfigured. This error can also be caused by malfunctioning hardware.
-
Attempting to collect some information that could help diagnose the problem.
-
As this is a crash and something is definitely wrong, the information
-
collection process might fail.
-
-
key_buffer_size=8388608
-
read_buffer_size=16777216
-
max_used_connections=29
-
max_threads=5000
-
thread_count=32
-
connection_count=22
-
It is possible that mysqld could use up to
-
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 245834871 K bytes of memory
-
Hope that is ok; if not, decrease some variables in the equation.
-
-
Thread pointer: 0x7f607c0072c0
-
Attempting backtrace. You can use the following information to find out
-
where mysqld died. If you see no messages after this, something went
-
terribly wrong...
-
stack_bottom = 7f6141b36e80 thread_stack 0x40000
-
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xe77fec]
-
/usr/sbin/mysqld(handle_fatal_signal+0x459)[0x7a7019]
-
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f643257a8d0]
-
/usr/sbin/mysqld(_ZN16Partition_helper25handle_ordered_index_scanEPh+0x5c)[0xbbabec]
-
/usr/sbin/mysqld(_ZN7handler13ha_index_lastEPh+0x1b0)[0x7f4410]
-
/usr/sbin/mysqld(_Z14join_read_lastP7QEP_TAB+0x65)[0xc1f605]
-
/usr/sbin/mysqld(_Z10sub_selectP4JOINP7QEP_TABb+0x11b)[0xc25e4b]
-
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]
-
/usr/sbin/mysqld(_Z12handle_queryP3THDP3LEXP12Query_resultyy+0x238)[0xc8e408]
-
/usr/sbin/mysqld[0x770ccf]
-
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THDb+0x3403)[0xc51103]
-
/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x3ad)[0xc531bd]
-
/usr/sbin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x817)[0xc53a47]
-
/usr/sbin/mysqld(_Z10do_commandP3THD+0x18f)[0xc54faf]
-
/usr/sbin/mysqld(handle_connection+0x278)[0xd108d8]
-
/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xe90784]
-
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f64325730a4]
-
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6430e1b87d]
-
-
Trying to get some variables.
-
Some pointers may be invalid and cause the dump to abort.
-
Query (7f607c015ad0): select * from test where time>='2016-07-29 00:00:00' and time<='2016-07-29 23:59:59' and tag in (2,3,6) order by id desc limit 2000
-
Connection ID (thread ID): 138760
-
Status: NOT_KILLED
-
-
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
-
information that should help you find out what is causing the crash.
-
2016-07-29T07:42:45.661724Z mysqld_safe Number of processes running now: 0
-
2016-07-29T07:42:45.664516Z mysqld_safe mysqld restarted
-
2016-07-29T15:42:45.991109+08:00 0 [Note] /usr/sbin/mysqld (mysqld 5.7.12-log) starting as process 8367 ...
首先是第一部分的信息:
通过perror命令(感谢@杨奇龙的场外援助..._(:з」∠)_...)看到ErrorCode的信息:
-
Resource temporarily unavailable
代表MySQL发现某一项资源临时不可用, 应该是资源耗尽 or 申请失败等情况;
然后是第二部分信息:
-
It is possible that mysqld could use up to
-
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 245834871 K bytes of memory
这一段计算了当前配置下, 需要的最大内存数, 大概换算了一下, 是234G;
这么高, 明显是有问题的, 联想到82%内存使用率的报警信息, 推测是内存耗尽造成的;
用max_used_connections来算一下使用的内存的话,有约1.5G;
加上BP的9.6G, 有11G了, 算上MySQL本身占用的一部分内存, 确实达到了比较高的程度;
但是看了一下kernel和message, 都没有发现系统出现OOM的日志, 应该不是系统kill的;
再看看堆栈相关的信息, 在里面记录了MySQL crash时的状态;
-
stack_bottom = 7f6141b36e80 thread_stack 0x40000
-
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xe77fec]
-
/usr/sbin/mysqld(handle_fatal_signal+0x459)[0x7a7019]
-
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f643257a8d0]
-
/usr/sbin/mysqld(_ZN16Partition_helper25handle_ordered_index_scanEPh+0x5c)[0xbbabec]
-
/usr/sbin/mysqld(_ZN7handler13ha_index_lastEPh+0x1b0)[0x7f4410]
-
/usr/sbin/mysqld(_Z14join_read_lastP7QEP_TAB+0x65)[0xc1f605]
-
/usr/sbin/mysqld(_Z10sub_selectP4JOINP7QEP_TABb+0x11b)[0xc25e4b]
-
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]
-
/usr/sbin/mysqld(_Z12handle_queryP3THDP3LEXP12Query_resultyy+0x238)[0xc8e408]
-
/usr/sbin/mysqld[0x770ccf]
-
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THDb+0x3403)[0xc51103]
-
/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x3ad)[0xc531bd]
-
/usr/sbin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x817)[0xc53a47]
-
/usr/sbin/mysqld(_Z10do_commandP3THD+0x18f)[0xc54faf]
-
/usr/sbin/mysqld(handle_connection+0x278)[0xd108d8]
-
/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xe90784]
-
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f64325730a4]
-
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6430e1b87d]
从红字等地方的信息, 可以推断出当时MySQL是正在执行查询, 这些查询中有join, 也有subquery, 且查询的表包含了分区表;
可以预料到在crash的时候, MySQL执行这些语句时肯定需要申请一部分join用的buffer, 同时子查询也会建立临时表, 都需要占用内存空间;
同时还有分区表的使用, 看了一下当时候分区表的大小:
发现当天的数据超过了BP的大小, 且用到分区表的查询走的全表扫描, 并且还有order by, 会用到sort的buffer, 且由于全表扫描的数据很多, 这个buffer有可能是需要申请满的;
综合这些信息, 基本确认是内存耗尽造成了MySQL crash;
那么根据堆栈的信息尝试还原crash时的场景:
在内存占用率很高的情况下, MySQL thread在执行较大表的查询时, 无法再申请到足够的内存(sort的buffer, join的buffer等), 因此发生了crash;
处理方式:
最终把BP从9.6G调整为9G, 并把sort, read等buffer的数值调整到了4M, 其他的buffer也调低了;
PS: 算是疏忽吧, 因为说在生产环境已经用这么一套配置很久了, 没出过问题, 所以也没有仔细的排查配置文件中的设置;
PSS: sort的buffer原来是多少? 32M...sort的buffer还是per thread的...失职了..._(:з」∠)_
正文到此结束