最近线上业务报了几次死锁问题,决定跟进。
有一个业务数据表business,维护了一个名为contract id的外键,一个contract id对应多个business数据。
在业务数据新增或者修改的时候,需要同步的维护 business 的数据,这时候正确的做法是diff新旧数据,得到需要删除的一部分数据,需要新增的一部分数据以及需要更新的一部分数据,这种实现有点麻烦(其实也不麻烦,使用Guava的集合操作),因此工程师们的通常做法是先根据contract _id删除现有数据,再插入新数据。这个时候很容易出现死锁。
这里也解释一下外键,在业务DB中,出于性能考虑,通常禁止使用外键,通常的做法是,外键这种关系的维护都体现在表中手动维护一个外键。
在交代一下数据库相关的背景:
DB:Mysql 5.6 tx_isolation:REPEATABLE-READ
我们在本地重现死锁信息。
建表语句已经初始化的数据如下:
CREATE TABLE `ct_contract_business` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT '自增主键', `contract_id` int(11) NOT NULL DEFAULT '0' COMMENT '合同ID', `business_id` tinyint(3) unsigned NOT NULL DEFAULT '0' COMMENT '业务ID', PRIMARY KEY (`id`), UNIQUE KEY `uniq_idx_contract_id_business_id` (`contract_id`,`business_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT=''; mysql> select * from ct_contract_business; +----+-------------+-------------+ | id | contract_id | business_id | +----+-------------+-------------+ | 20 | 1 | 2 | | 21 | 1 | 3 | | 23 | 1 | 4 | | 22 | 1 | 5 | | 10 | 2 | 1 | | 11 | 2 | 2 | | 5 | 3 | 1 | | 6 | 4 | 1 | | 7 | 5 | 1 | +----+-------------+-------------+
我们模拟同时两个新数据的插入过程:
步骤 | 事务1 | 事务2 | 备注 |
---|---|---|---|
1 | begin | ||
2 | begin | ||
3 | mysql> delete from ct_contract_business where contract_id = 6; Query OK, 0 rows affected (0.00 sec) | 事务1:Gap锁,锁住区域(5, +∞) | |
4 | mysql> delete from ct_contract_business where contract_id = 7; Query OK, 0 rows affected (0.00 sec) | 事务2:Gap锁,锁住区域(5, +∞) | |
5 | mysql> insert into ct_contract_business (contract_id, business_id) values (6, 1); 等待… | 事务1:插入意向锁(Insert Intention Lock),期望获取(5, 6)这个Gap锁和一个contract_id=6的Recored锁。 但是因为 事务2 已经锁住了区域(5, +∞)因此这时候,事务1只能等待 事务2 释放锁. | |
6 | mysql> insert into ct_contract_business (contract_id, business_id) values (7, 1); ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction | 和上面第5步类似, 事务2:等待获取事务1的锁 出现循环等待,死锁(roll back这个事务,事务2的锁释放) | |
7 | Query OK, 2 rows affected (2.89 sec) Records: 2 Duplicates: 0 Warnings: 0 事务2 rollback了,事务1的insert成功 | 事务1等待的锁得到,事务1成功。 |
第 5 步的锁信息如下:
// 锁信息 // 事务1( 即这里的事务id:203797) 持有一个Gap锁,事务2( 即这里的事务id:203798) 持有一个Gap锁 mysql> select * from INNODB_LOCKS; +---------------+-------------+-----------+-----------+-------------------------------+----------------------------------+------------+-----------+----------+------------------------+ | lock_id | lock_trx_id | lock_mode | lock_type | lock_table | lock_index | lock_space | lock_page | lock_rec | lock_data | +---------------+-------------+-----------+-----------+-------------------------------+----------------------------------+------------+-----------+----------+------------------------+ | 203797:20:4:1 | 203797 | X | RECORD | `test`.`ct_contract_business` | uniq_idx_contract_id_business_id | 20 | 4 | 1 | supremum pseudo-record | | 203798:20:4:1 | 203798 | X | RECORD | `test`.`ct_contract_business` | uniq_idx_contract_id_business_id | 20 | 4 | 1 | supremum pseudo-record | +---------------+-------------+-----------+-----------+-------------------------------+----------------------------------+------------+-----------+----------+------------------------+ // 锁等待信息 // 事务1( 即这里的事务id:203797) 等待事务2(即这里的事务id:203798 )的锁 mysql> select * from INNODB_LOCK_WAITS; +-------------------+-------------------+-----------------+------------------+ | requesting_trx_id | requested_lock_id | blocking_trx_id | blocking_lock_id | +-------------------+-------------------+-----------------+------------------+ | 203797 | 203797:20:4:1 | 203798 | 203798:20:4:1 | +-------------------+-------------------+-----------------+------------------+
有些人可能迷惑 lock_data 的 supremum pseudo-record 是什么东西,我们先看看 lock_data 的解释,这里面解释了 supremum pseudo-record,简单说就是正无穷。
Primary key value(s) of the locked record if LOCK_TYPE='RECORD', otherwise NULL. This column contains the value(s) of the primary key column(s) in the locked row, formatted as a valid SQL string (ready to be copied to SQL commands). If there is no primary key then the InnoDB internal unique row ID number is used. If a gap lock is taken for key values or ranges above the largest value in the index, LOCK_DATA reports “supremum pseudo-record”. When the page containing the locked record is not in the buffer pool (in the case that it was paged out to disk while the lock was held), InnoDB does not fetch the page from disk, to avoid unnecessary disk operations. Instead, LOCK_DATA is set to NULL.
死锁信息:
mysql> show engine innodb status /G ... ------------------------ LATEST DETECTED DEADLOCK ------------------------ 2016-07-21 19:11:05 7f6b90de8700 *** (1) TRANSACTION: TRANSACTION 203797, ACTIVE 42 sec inserting mysql tables in use 1, locked 1 LOCK WAIT 3 lock struct(s), heap size 360, 2 row lock(s), undo log entries 1 MySQL thread id 9, OS thread handle 0x7f6b90db7700, query id 144 localhost root update insert into ct_contract_business (contract_id, business_id) values (6, 1) *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 20 page no 4 n bits 80 index `uniq_idx_contract_id_business_id` of table `test`.`ct_contract_business` trx id 203797 lock_mode X insert intention waiting Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0 0: len 8; hex 73757072656d756d; asc supremum;; *** (2) TRANSACTION: TRANSACTION 203798, ACTIVE 38 sec inserting mysql tables in use 1, locked 1 3 lock struct(s), heap size 360, 2 row lock(s), undo log entries 1 MySQL thread id 10, OS thread handle 0x7f6b90de8700, query id 147 localhost root update insert into ct_contract_business (contract_id, business_id) values (7, 1) *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 20 page no 4 n bits 80 index `uniq_idx_contract_id_business_id` of table `test`.`ct_contract_business` trx id 203798 lock_mode X Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0 0: len 8; hex 73757072656d756d; asc supremum;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 20 page no 4 n bits 80 index `uniq_idx_contract_id_business_id` of table `test`.`ct_contract_business` trx id 203798 lock_mode X insert intention waiting Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0 0: len 8; hex 73757072656d756d; asc supremum;; *** WE ROLL BACK TRANSACTION (2) ------------ TRANSACTIONS ------------
从死锁信息中,我们也可以看到事务1(事务id:203797) 和 事务2(事务id:203798) 持有的锁是锁住相同的一块区域:
0: len 8; hex 73757072656d756d; asc supremum;;
参考: mysql InnoDB锁等待的查看及分析
有两个点需要我们知道的信息:各种SQL语句都加什么锁,为什么这里的两个delete的锁没有冲突。
关于各种SQL语句加什么锁,参见Mysql官方文档: Locks Set by Different SQL Statements in InnoDB
我们这里来说涉及的删除和插入,先说删除:
//删除,会在满足条件的记录上加一个next-key锁,也就是锁住之前的Gap和待删除的记录。 DELETE FROM ... WHERE ... sets an exclusive next-key lock on every record the search encounters. //显然,如果删除的数据比现有最大数据max还大,就会锁(max, +∞)这个Gap //同理,如果删除的数据比现有最小数据min还小,就会锁(-∞, min)这个Gap
再说插入,插入比较麻烦,因为涉及到插入意向锁(Insert Intention Lock),还是参考Mysql官方文档: InnoDB Locking
An insert intention lock is a type of gap lock set by INSERT operations prior to row insertion. This lock signals the intent to insert in such a way that multiple transactions inserting into the same index gap need not wait for each other if they are not inserting at the same position within the gap. Suppose that there are index records with values of 4 and 7. Separate transactions that attempt to insert values of 5 and 6, respectively, each lock the gap between 4 and 7 with insert intention locks prior to obtaining the exclusive lock on the inserted row, but do not block each other because the rows are nonconflicting.
简单的是说,插入意向锁可以归结为如下:
锁的类型:Gap Lock 加锁时间:插入之前 锁的区域:待插入的区域,比如已有数据4,7,想要插入5,就会锁住(4, 7)这个区域 锁的冲突:只要两个插入的数据不在同一个位置(其实可以理解为同一个数据),插入意向锁之间就不会冲突
插入的锁如下:
INSERT sets an exclusive lock on the inserted row. This lock is an index-record lock, not a next-key lock (that is, there is no gap lock) and does not prevent other sessions from inserting into the gap before the inserted row. Prior to inserting the row, a type of gap lock called an insert intention gap lock is set. This lock signals the intent to insert in such a way that multiple transactions inserting into the same index gap need not wait for each other if they are not inserting at the same position within the gap.
就是说插入之前会加一把插入意向锁,除此之外,会在插入的记录上加一把锁。
关于锁冲突,我们熟知的肯定是S和S兼容,X和其他所有都不兼容。事实上并没有这么简单。比如我们这里前面的例子,两个delete都加了X型的Gap锁,应该排斥才对,但事实上是兼容的。这里参考了 从一个死锁看mysql innodb的锁机制 这篇文章的结论(准备读源码验证):
下面这个是 precise mode 的兼容矩阵:(这个兼容矩阵发生在X与X,S与X不兼容的情况下再进行比对的) G I R N (已经存在的锁,包括等待的锁) G + + + + I - + + - R + + - - N + + - - + 代表兼容, -代表不兼容. I代表插入意图锁, G代表Gap锁,I代表插入意图锁,R代表记录锁,N代表Next-Key锁.
其实仔细读 Mysql官方文档 ,我们也能发现上面的两个delete的Gap锁是兼容的:
Gap locks in InnoDB are “purely inhibitive”, which means they only stop other transactions from inserting to the gap. They do not prevent different transactions from taking gap locks on the same gap. Thus, a gap X-lock has the same effect as a gap S-lock.
意思就是说Gap锁的作用是只防止其他事务在这个Gap内的插入,而不排斥其他事务在同一个Gap加上Gap锁。因此Gap X锁和Gap S锁效果相同。 (真心文档每句话都需要仔细理解哈。)
DBA的建议:先根据 contract_id 查询id,根据 id 删除;
其实只要保证数据存在再区删除就没问题,我们假设我们执行两个以存在数据的先删除再插入。
一个辅助的示意图如下:
其实上面的例子中会出现一个因为 UNIQUE KEY 导致的锁等待问题,我们可以重现,现有数据如下:
mysql> select * from ct_contract_business; +----+-------------+-------------+ | id | contract_id | business_id | +----+-------------+-------------+ | 20 | 1 | 2 | | 21 | 1 | 3 | | 23 | 1 | 4 | | 22 | 1 | 5 | | 10 | 2 | 1 | | 11 | 2 | 2 | | 5 | 3 | 1 | | 30 | 4 | 1 | | 7 | 5 | 1 | +----+-------------+-------------+
过程如下:
步骤 | 事务1 | 事务2 |
---|---|---|
1 | begin | | |
2 | begin | | |
3 | mysql> delete from ct_contract_business where contract_id = 3; Query OK, 0 rows affected (0.00 sec) | |
4 | mysql> delete from ct_contract_business where contract_id = 4; Query OK, 0 rows affected (0.00 sec) | |
5 | mysql> insert into ct_contract_business (contract_id, business_id) values (3, 1); 等待… |
INNODB_LOCKS信息及INNODB_LOCK_WAITS信息如下:
mysql> select * from INNODB_LOCK_WAITS; +-------------------+-------------------+-----------------+------------------+ | requesting_trx_id | requested_lock_id | blocking_trx_id | blocking_lock_id | +-------------------+-------------------+-----------------+------------------+ | 204349 | 204349:20:4:12 | 204350 | 204350:20:4:12 | +-------------------+-------------------+-----------------+------------------+ 1 row in set (0.00 sec) mysql> select * from INNODB_LOCKS; +----------------+-------------+-----------+-----------+-------------------------------+----------------------------------+------------+-----------+----------+-----------+ | lock_id | lock_trx_id | lock_mode | lock_type | lock_table | lock_index | lock_space | lock_page | lock_rec | lock_data | +----------------+-------------+-----------+-----------+-------------------------------+----------------------------------+------------+-----------+----------+-----------+ | 204349:20:4:12 | 204349 | S | RECORD | `test`.`ct_contract_business` | uniq_idx_contract_id_business_id | 20 | 4 | 12 | 4, 1 | | 204350:20:4:12 | 204350 | X | RECORD | `test`.`ct_contract_business` | uniq_idx_contract_id_business_id | 20 | 4 | 12 | 4, 1 | +----------------+-------------+-----------+-----------+-------------------------------+----------------------------------+------------+-----------+----------+-----------+
show engine innodb status信息如下:
---TRANSACTION 204350, ACTIVE 24 sec 4 lock struct(s), heap size 1184, 3 row lock(s), undo log entries 1 MySQL thread id 22, OS thread handle 0x7fd7ee4f0700, query id 245 localhost root cleaning up ---TRANSACTION 204349, ACTIVE 31 sec inserting mysql tables in use 1, locked 1 LOCK WAIT 5 lock struct(s), heap size 1184, 4 row lock(s), undo log entries 2 MySQL thread id 20, OS thread handle 0x7fd7ee4bf700, query id 250 localhost root update insert into ct_contract_business (contract_id, business_id) values (3, 1) ------- TRX HAS BEEN WAITING 6 SEC FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 20 page no 4 n bits 80 index `uniq_idx_contract_id_business_id` of table `test`.`ct_contract_business` trx id 204349 lock mode S waiting Record lock, heap no 12 PHYSICAL RECORD: n_fields 3; compact format; info bits 32 0: len 4; hex 80000004; asc ;; 1: len 1; hex 01; asc ;; 2: len 4; hex 0000001e; asc ;;
这里没明白的一点好事务1(事务id:204349)的insert一个(3, 1)的数据为什么会在(4, 1)上加一个S锁?
不过实验了一下,去掉UNIQUE KEY,使用普通的key,就没有这个锁等到问题。这个问题有待进一步深入查资料。