MySQL DeadLock故障排查全过程记录
【作者】
刘博:携程技术保障中心数据库高级经理,主要关注sql server和mysql的运维和故障处理。
【环境】
版本号:5.6.21
隔离级别:repeatable read
【问题描述】
接到监控报警,有一个线上的应用deadlock报错,每15分钟会准时出现,报错统计如下图:
登录mysql服务器查看日志:
mysql> show engine innodb status\g *** (1) transaction: transaction 102973, active 11 sec starting index read mysql tables in use 3, locked 3 lock wait 4 lock struct(s), heap size 1136, 3 row lock(s) mysql thread id 6, os thread handle 140024996574976, query id 83 localhost us updating update testtable set column1 = 1, column2 = sysdate(), column3= '026' column4 = 0 and column5 = 485 and column6 = 'sek' *** (1) waiting for this lock to be granted: record locks space id 417 page no 1493 n bits 1000 index idx_column6 of table test.testtable trx id 102973 lock_mode x waiting record lock, heap no 859 physical record: n_fields 2; compact format; info bits 0 0: len 3; hex 53454b; asc sek;; 1: len 8; hex 80000000007e1452; asc ~ r;; *** (2) transaction: transaction 102972, active 26 sec starting index read mysql tables in use 3, locked 3 219 lock struct(s), heap size 24784, 2906 row lock(s), undo log entries 7 mysql thread id 5, os thread handle 140024996841216, query id 84 localhost us updating update testtable set column1 = 1, column2 = sysdate(), column3 = '026' column4 = 0 and column5 = 485 and column6 = 'sek' *** (2) holds the lock(s): record locks space id 417 page no 1493 n bits 1000 index idx_column6 of table test.testtable trx id 102972 lock_mode x record lock, heap no 1 physical record: n_fields 1; compact format; info bits 0 0: len 8; hex 73757072656d756d; asc supremum;; record lock, heap no 859 physical record: n_fields 2; compact format; info bits 0 0: len 3; hex 53454b; asc sek;; 1: len 8; hex 80000000007e1452; asc ~ r;;
*** (2) waiting for this lock to be granted: record locks space id 601 page no 89642 n bits 1000 index idx_column6 of table test.testtable trx id 32231892482 lock_mode x locks rec but not gap waiting record lock, heap no 38 physical record: n_fields 2; compact format; info bits 0 0: len 3; hex 53454b; asc sek;; 1: len 8; hex 80000000007eea14; asc ~ ;;
大致一看,更新同一索引的同一行,应该是一个block,报timeout的错才对,怎么会报deadlock?
【初步分析】
先分析下(2) transaction,transaction 32231892482。
等待的锁信息为:
0: len 3; hex 53454b; asc sek;; 1: len 8; hex 80000000007eea14; asc
持有的锁信息为:
0: len 3; hex 53454b; asc sek;; 1: len 8; hex 80000000007eeac4; asc
再先分析下(1) transaction,transaction 32231892617。
等待的锁信息为:
0: len 3; hex 53454b; asc sek;; 1: len 8; hex 80000000007eeac4; asc
于是可以画出的死锁表,两个资源相互依赖,造成死锁:
transaction | hold | wait |
---|---|---|
32231892617 | 53454b\80000000007eea14 | 53454b\80000000007eeac4 |
32231892482 | 53454b\80000000007eeac4 | 53454b\80000000007eea14 |
让我们再看一下explain结果:
mysql>desc update testtable set column1=1, column2 = sysdate(),column3 = '025' column4 = 0 and column5 = 477 and column6 = 'sek' \g;
*************************** 1. row ***************************
id: 1
select_type: update
table: testtable
partitions: null
type: index_merge
possible_keys: column5_index,idx_column5_column6_column1,idxcolumn6
key: column5_index,idxcolumn6
key_len: 8,9
ref: null
rows: 7
filtered: 100.00
extra: using intersect(column5_index,idxcolumn6); using where
可以看到 extra 列:
using intersect(column5_index,idxcolumn6)
从5.1开始,引入了 index merge 优化技术,对同一个表可以使用多个索引分别进行条件扫描。
相关文档:http://dev.mysql.com/doc/refman/5.7/en/index-merge-optimization.html
the index merge method is used to retrieve rows with several range scans and to merge their results into one. the merge can produce unions, intersections, or unions-of-intersections of its underlying scans. this access method merges index scans from a single table; it does not merge scans across multiple tables.
【模拟与验证】
根据以上初步分析,猜测应该就是intersect造成的,于是在测试环境模拟验证,开启2个session模拟死锁:
时间序列 | session1 | session2 |
---|---|---|
1 | begin; | |
2 | update testtable set column2 = sysdate() column4 = 0 and column5 = 47 and column6 = 'sek 执行成功,影响7行 |
|
3 | begin; | |
4 | update testtable set column2 = sysdate(),column4 = 0 and column5 = 485 and column6 = 'sek'; 被blocking |
|
5 | update testtable set column2 = sysdate(),column4 = 0 and column5 = 485 and column6 = 'sek'; 执行成功 |
error 1213 (40001): deadlock found when trying to get lock; try restarting transaction |
依据以上信息可以发现session2虽然被block了,但也获取了一些session1在时间序列5时所需资源的x锁,可以再开启一个查询select count(column5) from testtable where column5 = 485,设置set transaction isolation level serializable,去查询column5 = 485的行,观察锁等待的信息:
mysql> select r.trx_id waiting_trx_id, r.trx_mysql_thread_id waiting_thread, r.trx_query waiting_query, b.trx_id
mysql> select r.trx_id waiting_trx_id, r.trx_mysql_thread_id waiting_thread, r.trx_query waiting_query, b.trx_id blocking_trx_id, b.trx_mysql_thread_id blocking_thread, b.trx_query blocking_query from information_schema.innodb_lock_waits w inner join information_schema.innodb_trx b on b.trx_id = w.blocking_trx_id inner join information_schema.innodb_trx r on r.trx_id = w.requesting_trx_id \g; *************************** 1. row *************************** waiting_trx_id: 103006 waiting_thread: 36 waiting_query: update testtable set column1 = 1, column2 = sysdate(), column3 = '026' column4 = 0 and column5 = 485 and column6 = 'sek' blocking_trx_id: 103003 blocking_thread: 37 blocking_query: null *************************** 2. row *************************** waiting_trx_id: 421500433538672 waiting_thread: 39 waiting_query: select count(column5) from testtable where column5 = 485 blocking_trx_id: 103006 blocking_thread: 36 blocking_query: update testtable set column1 = 1, column2 = sysdate(), column3 = '026' column4 = 0 and column5 = 485 and column6 = 'sek' 2 rows in set, 1 warning (0.00 sec) mysql> select * from information_schema.innodb_lock_waits \g; *************************** 1. row *************************** requesting_trx_id: 103006 requested_lock_id: 103006:417:1493:859 blocking_trx_id: 103003 blocking_lock_id: 103003:417:1493:859 *************************** 2. row *************************** requesting_trx_id: 421500433538672 requested_lock_id: 421500433538672:417:749:2 blocking_trx_id: 103006 blocking_lock_id: 103006:417:749:2 2 rows in set, 1 warning (0.00 sec)
mysql> select * from innodb_locks \g; *************************** 1. row *************************** lock_id: 103006:417:1493:859 lock_trx_id: 103006 lock_mode: x lock_type: record lock_table: test.testtable lock_index: idxcolumn6 lock_space: 417 lock_page: 1493 lock_rec: 859 lock_data: 'sek', 8262738 *************************** 2. row *************************** lock_id: 103003:417:1493:859 lock_trx_id: 103003 lock_mode: x lock_type: record lock_table:test.testtable lock_index: idxcolumn6 lock_space: 417 lock_page: 1493 lock_rec: 859 lock_data: 'sek', 8262738 *************************** 3. row *************************** lock_id: 421500433538672:417:749:2 lock_trx_id: 421500433538672 lock_mode: s lock_type: record lock_table: test.testtable lock_index: column5_index lock_space: 417 lock_page: 749 lock_rec: 2 lock_data: 485, 8317620 *************************** 4. row *************************** lock_id: 103006:417:749:2 lock_trx_id: 103006 lock_mode: x lock_type: record lock_table: test.testtable lock_index: column5_index lock_space: 417 lock_page: 749 lock_rec: 2 lock_data: 485, 8317620 4 rows in set, 1 warning (0.00 sec)
可以看到session2,trx_id 103006阻塞了trx_id 421500433538672,而trx_id 421500433538672 requested_lock也正好是lock_data: 485, 8317620。由此可见session2虽然别block了,但是还是获取到了index column5_index相关的锁。被block是因为intersect的原因,还需要idxcolumn6的锁,至此思路已经清晰,对整个分配锁的信息简化一下,如下表格(请求到的锁用青色表示,需获取但未获取到的锁用红色表示):
时间点 | session1 | session2 |
---|---|---|
1 | 477 sek | |
2 | 485 sek | |
3 | 485 sek | 死锁发生 |
可以看到485 sek这两个资源形成了一个环状,最终发生死锁。
【解决方法】
- 最佳的方法是添加column5和column6的联合索引。
- 我们环境当时的情况发现column6的筛选度非常低,就删除了column6的索引。
10:55左右删除索引后,报错没有再发生:
总结
以上就是这篇文章的全部内容了,希望本文的内容对大家的学习或者工作具有一定的参考学习价值,如果有疑问大家可以留言交流,谢谢大家对的支持。
上一篇: ASP的Global.asa文件技巧用法
下一篇: JAVA基础之Set接口