MariaDB删除重复记录
不管是程序BUG,还是业务变更,重复数据这个老生常谈的问题,总是会出现。以下是我在MariaDB或是MySQL下处理的一些经验。在SQL Server中,使用窗口函数是很容易实现的。不过听说MySQL 8.0和MariaDB 10.2以上均支持窗口函数了。等有机会再来测试使用窗口函数来删除重复记录。
背景
表t_record中的数据fromUserId, toUserId两个字段组合作为唯一的标识,删除重复记录,只留下最大id(或最新时间)的记录。id为自增无重复的主键。
表t_record的id作为自增的主键。
表t_record大概有6万多的数据。以下测试均在资源很差的主机上,t_record没有在使用的情况下的结果。
方法1
查询重复的记录
SELECT fromUserId, toUserId, count(*) FROM t_record as tr GROUP BY fromUserId, toUserId HAVING count(*) > 1;
把重复记录的两个字段放到临时表_tmp1中
CREATE TABLE _tmp1 SELECT fromUserId, toUserId FROM t_record as tr GROUP BY fromUserId, toUserId HAVING count(*) > 1;
把应该删除的id查询出来,放到临时表_tmp2中
CREATE TABLE _tmp2 SELECT id FROM t_record as a WHERE (a.fromUserId, a.toUserId) in ( SELECT fromUserId, toUserId from _tmp1 ) and a.id not in ( SELECT MAX(id) FROM t_record as tr GROUP BY fromUserId, toUserId HAVING count(*) > 1 );
删除原表的记录
DELETE from t_record where id in ( SELECT id from _tmp2 );
通过以前3个步骤,没有删除数据大概需要23秒左右。
方法2
如果表中没有主键,也没有可以标识唯一记录的字段。只能是把原表的数据分表后,插入到另一张临时表,删除原表数据,把临时表的数据导回来。
这种方法也适用合于有主键或有唯一标识的表,但操作过程中会影响在线的业务,需要中断业务。否则可能会造成数据丢失或数据不一致。
数据量大的表,导两次数据,过程会很慢,同时也需要注意硬盘空间是否足够。
方法3
测试mysql不支持以下这种delete语法来删除数据。改为select id 存到临时表,查询非常慢。
DELETE a FROM table_nam a WHERE EXISTS (SELECT 1 FROM table_nam b WHERE b.userid = a.userid AND b.CreateDate > a.CreateDate);
方法4
在mariadb 10.1.19下测试,60多秒。
这个方法简单,只需要一条语句,速度还行。
DELETE from t_record where id not in ( select maxid from (select max(id) as maxid from t_record group by fromUserId,toUserId ) b );
现对方法4的进行改造,再测试下。11.5秒。快好几倍了。
CREATE OR REPLACE TABLE _tmp3 SELECT id FROM t_record WHERE id NOT IN ( SELECT maxid FROM (SELECT max(id) AS maxid FROM t_record GROUP BY fromUserId,toUserId ) b ); DELETE FROM t_record WHERE id IN ( SELECT id from _tmp3 );
对临时表创建主键,再测试下。1.2秒!WOW!!!
CREATE OR REPLACE TABLE _tmp3 (id INT NOT NULL PRIMARY KEY); INSERT INTO _tmp3 (id) SELECT id FROM t_record WHERE id NOT IN ( SELECT maxid FROM (SELECT max(id) AS maxid FROM t_record GROUP BY fromUserId,toUserId ) b ); DELETE FROM t_record WHERE id IN ( SELECT id from _tmp3 );
还以不能再快呢?改IN子句为JOIN,再测试下。1秒!
CREATE OR REPLACE TABLE _tmp3 (id INT NOT NULL PRIMARY KEY); INSERT INTO _tmp3 (id) SELECT id FROM t_record WHERE id NOT IN ( SELECT maxid FROM (SELECT max(id) AS maxid FROM t_record GROUP BY fromUserId,toUserId ) b ); DELETE a FROM t_record as a INNER JOIN _tmp3 as b on b.id = a.id;
难道删除的那个语句的执行计划是不同的吗?由于mysql只支持select的执行查询,所以把删除的语句修改为查询语句。
root@localhost [db1]EXPLAIN SELECT id FROM t_record WHERE id IN ( SELECT id from _tmp3 ); +------+-------------+-------------------+--------+---------------+---------+---------+--------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------------------+--------+---------------+---------+---------+--------------------+------+--------------------------+ | 1 | PRIMARY | _tmp3 | index | PRIMARY | PRIMARY | 4 | NULL | 452 | Using index | | 1 | PRIMARY | t_record | eq_ref | PRIMARY | PRIMARY | 8 | testdb._tmp3.id | 1 | Using where; Using index | +------+-------------+-------------------+--------+---------------+---------+---------+--------------------+------+--------------------------+ 2 rows in set (0.00 sec)
root@localhost [db1]EXPLAIN SELECT a.id FROM t_record as a INNER JOIN _tmp3 as b on b.id = a.id; +------+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+ | 1 | SIMPLE | b | index | PRIMARY | PRIMARY | 4 | NULL | 452 | Using index | | 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 8 | testdb.b.id | 1 | Using where; Using index | +------+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+ 2 rows in set (0.01 sec)
对比执行计划,是一样的。但测试多次,用JOIN方式速度还是快些。为什么呢?
上一篇: 他是隋朝第一功臣,为何却受两任皇帝厌恶?
下一篇: 董明珠:格力机器人,要做成*