欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  IT编程

【MySQL】你以为设置了并行复制就降低延迟了?这个你绝对想不到!

程序员文章站 2022-05-23 09:26:24

在mysql官方版本中,为了保证其的高可用性,一般情况我们会采用主从复制的方式来解决。当然,方法很多。而我们今天所要处理的是采用gtid方式并且开了多线程复制后,仍然延迟的情况,糟糕的是,延迟还在不断扩大!

  • 环境概要
  • 序号 清单 说明
    1 系统 redhat 6.x(4c,32g)
    2 数据库 mysql-5.7.25
    3 同步方式 基于gtid主从同步

  • 环境排查
  • 1)已经配置的重要参数:

    relay for slave

    slave_parallel_type = logical_clock
    slave_parallel_workers = 6
    master_info_repository = table
    relay_log_info_repository = table
    relay_log_recovery = on
    sync_relay_log = 10000


    注:此时没有设置slave_preserve_commit_order参数。

    2)从库延迟状态查询

    mysql> show slave status\g
    *************************** 1. row ***************************
    slave_io_state: queueing master event to the relay log
    master_host: xxx.xxx.xxx.xxx
    master_user: repl
    master_port: 3306
    connect_retry: 60
    master_log_file: mysql-bin.008978
    read_master_log_pos: 696914605
    relay_log_file: db41-relay-bin.001259
    relay_log_pos: 207377582
    relay_master_log_file: mysql-bin.008970
    slave_io_running: yes
    slave_sql_running: yes
    replicate_do_db: neteagle3
    replicate_ignore_db: mysql
    replicate_do_table:
    replicate_ignore_table:
    replicate_wild_do_table:
    replicate_wild_ignore_table:
    last_errno: 0
    last_error:
    skip_counter: 0
    exec_master_log_pos: 1068770059
    relay_log_space: 8425484286
    until_condition: none
    until_log_file:
    until_log_pos: 0
    master_ssl_allowed: no
    master_ssl_ca_file:
    master_ssl_ca_path:
    master_ssl_cert:
    master_ssl_cipher:
    master_ssl_key:
    seconds_behind_master: 187358
    master_ssl_verify_server_cert: no
    last_io_errno: 0
    last_io_error:
    last_sql_errno: 0
    last_sql_error:
    replicate_ignore_server_ids:
    master_server_id: 42
    master_uuid: eab7fcac-3cda-11e6-ada8-fa163e648db2
    master_info_file: mysql.slave_master_info
    sql_delay: 0
    sql_remaining_delay: null
    slave_sql_running_state: waiting for dependent transaction to commit
    master_retry_count: 86400
    master_bind:
    last_io_error_timestamp:
    last_sql_error_timestamp:
    master_ssl_crl:
    master_ssl_crlpath:
    retrieved_gtid_set: eab7fcac-3cda-11e6-ada8-fa163e648db2:58031191-59927276
    executed_gtid_set: eab7fcac-3cda-11e6-ada8-fa163e648db2:1-58080239:58080241
    auto_position: 1
    replicate_rewrite_db:
    channel_name:
    master_tls_version:
    1 row in set (0.00 sec)


    简单介绍一下几个指标信息:

    1. master_log_file
    2. read_master_log_pos
    3. seconds_behind_master
    4. relay_log_file
    5. relay_log_pos
    6. relay_master_log_file
    7. exec_master_log_pos

    master_log_file
    ,
    read_master_log_pos
    :这两个参数是成对的,表示的是从库io thread传输主库的binlog日志号及具体位置。

    relay_log_file
    ,
    relay_log_pos
    :这两个参数也是成对的,表示的是从库sql thread应用中继日志(relay log)号及具体位置。

    relay_master_log_file
    ,
    relay_log_pos
    :这两个参数也是成对的,表示的是上一项中的中继日志对应的主库binlog日志及具体位置(有点绕)。

    seconds_behind_master
    :此参数可简单理解为主从延迟时间,单位为秒。

    从上面这段mysql从状态信息中,我们可以看到,seconds_behind_master: 187358这是从库sql应用延迟主库的时间为187358秒,转换成天,大概两天多。这说明,我们从库复制的数据是两天前的。

    3)验证并行复制

    mysql> show full processlist;
    +----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+
    | id | user | host | db | command | time | state | info |
    +----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+
    | 1 | system user | | null | connect | 18204 | waiting for master to send event | null |
    | 2 | system user | | null | connect | 0 | waiting for dependent transaction to commit | null |
    | 3 | system user | | null | connect | 154914 | system lock | null |
    | 4 | system user | | null | connect | 154914 | waiting for an event from coordinator | null |
    | 5 | system user | | null | connect | 154918 | waiting for an event from coordinator | null |
    | 6 | system user | | null | connect | 155525 | waiting for an event from coordinator | null |
    | 7 | system user | | null | connect | 180427 | waiting for an event from coordinator | null |
    | 8 | system user | | null | connect | 18204 | waiting for an event from coordinator | null |
    | 10 | root | localhost | neteagle3 | query | 0 | starting | show full processlist |
    | 11 | repl | dbslave:9683 | null | binlog dump gtid | 18156 | master has sent all binlog to slave; waiting for more updates | null |
    | 13 | root | localhost | neteagle3 | sleep | 4962 | | null |
    +----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+


    mysql> select * from performance_schema.replication_applier_status_by_worker ;
    +--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+
    | channel_name | worker_id | thread_id | service_state | last_seen_transaction | last_error_number | last_error_message | last_error_timestamp |
    +--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+
    | | 1 | 51 | on | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240805 | 0 | | 0000-00-00 00:00:00 |
    | | 2 | 52 | on | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240210 | 0 | | 0000-00-00 00:00:00 |
    | | 3 | 53 | on | eab7fcac-3cda-11e6-ada8-fa163e648db2:80235089 | 0 | | 0000-00-00 00:00:00 |
    | | 4 | 54 | on | eab7fcac-3cda-11e6-ada8-fa163e648db2:80191268 | 0 | | 0000-00-00 00:00:00 |
    | | 5 | 55 | on | eab7fcac-3cda-11e6-ada8-fa163e648db2:75296683 | 0 | | 0000-00-00 00:00:00 |
    | | 6 | 56 | on | | 0 | | 0000-00-00 00:00:00 |
    +--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+
    6 rows in set (0.00 sec)
    通过本条查询,可以看到开启了6个并行进行复制。

  • 细找瓶颈?

  • 通过以上来看,所有一切似乎都正常,并行复制开了,并且查看到cpu,io,内存均没有达到瓶颈地步。主库写binlog日志大概为2mb/s,这样的日质量并不是非常高。

    从库也同样查看了是否存在锁的情况,也没有发现。

    因此反复的在查看slave状态,看能否发现一些细节,结果还真看到了一些异常现象。relay_log_pos这个参数在频繁的刷slave状态时,发现时常会卡着不动(此时已确认没有看到锁)。或许问题真正的原因正在这里!

  • 顺藤摸瓜!

  • 分析binlog或者relay log日志,看有啥线索:

    [mysql@xxx data]$ mysqlbinlog --no-defaults -v -v --base64-output=decode-rows relay-bin.001384 --start-position=420090430|more

    /!50530 set @@session.pseudo_slave_mode=1/;
    /!50003 set @old_completion_type=@@completion_type,completion_type=0/;
    delimiter /!/;
    # at 420090430

    190923 9:24:28 server id 42 end_log_pos 420090282 crc32 0xd9097eaf gtid last_committed=57148 sequence_number=57149 rbr_only=yes

    /!50718 set transaction isolation level read committed//!/;
    set @@session.gtid_next= 'eab7fcac-3cda-11e6-ada8-fa163e648db2:69415610'/!/;

    at 420090495

    190923 9:24:28 server id 42 end_log_pos 420090364 crc32 0x82b57dfd query thread_id=95 exec_time=0 error_code=0

    set timestamp=1569201868/!/;
    set @@session.pseudo_thread_id=95/!/;
    set @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/!/;
    set @@session.sql_mode=1075838976/!/;
    set @@session.auto_increment_increment=2, @@session.auto_increment_offset=1/!/;
    /!\c gbk //!/;
    set @@session.character_set_client=28,@@session.collation_connection=28,@@session.collation_server=8/!/;
    set @@session.lc_time_names=0/!/;
    set @@session.collation_database=default/!/;
    begin
    /!/;

    at 420090577

    190923 9:24:28 server id 42 end_log_pos 420090585 crc32 0x752e27cf table_map: net.f_event mapped to number 108

    at 420090798

    190923 9:24:28 server id 42 end_log_pos 420090812 crc32 0x72b8e10d table_map: net.f_eventstorage mapped to number 245

    at 420091025

    190923 9:24:28 server id 42 end_log_pos 420091039 crc32 0x1797f9d8 table_map: net.f_eventstorage mapped to number 245

    at 420091252

    190923 9:24:28 server id 42 end_log_pos 420091106 crc32 0x8af14ad2 table_map: net.f_eventdetail mapped to number 243

    at 420091319

    190923 9:24:28 server id 42 end_log_pos 420091177 crc32 0xf1ce87c8 table_map: net.f_eventoperation mapped to number 244

    at 420091390

    190923 9:24:28 server id 42 end_log_pos 420091244 crc32 0x586c0b9d table_map: net.f_eventaudit mapped to number 242

    at 420091457

    190923 9:24:28 server id 42 end_log_pos 420093382 crc32 0x505e5408 update_rows: table id 108

    at 420093595

    190923 9:24:28 server id 42 end_log_pos 420098858 crc32 0x0f404509 update_rows: table id 245

    at 420099071

    190923 9:24:28 server id 42 end_log_pos 420098910 crc32 0xb8d9ed15 write_rows: table id 243

    at 420099123

    190923 9:24:28 server id 42 end_log_pos 420098966 crc32 0x3c489a7f write_rows: table id 244 flags: stmt_end_f


    我们查看了中继日志relay-bin.001384卡住的位置号420090430,为设置gtid_next,信息没什么用。

    继续看在卡住时刻,数据库open的是什么表?

    mysql> show open tables where in_use=1;
    +-----------+---------------------+--------+-------------+
    | database | table | in_use | name_locked |
    +-----------+---------------------+--------+-------------+
    | net | f_currentxxx | 1 | 0 |
    +-----------+---------------------+--------+-------------+

    这张表有什么特殊的么?查看其表结构

    mysql> show create table net.f_currentxxx\g
    *************************** 1. row ***************************
    table: f_currentxxx
    create table: create table f_currentxxx (
    serial int(20) not null comment 'xxx',
    audittime bigint(20) not null comment 'xxx',
    type int(11) default null comment 'xxx',
    severity int(11) default null comment 'xxx,
    key audittime (audittime)
    ) engine=memory default charset=gbk comment='xxx'
    1 row in set (0.00 sec)

    有没有看到什么与众不同?

    没错,就是表的存储引擎engine=memory。memory的表进行主从复制,首先来说如果从库不做查询,一点意义没有,另外对memory表做复制,性能是非常堪忧的。如果必须进行同步,考虑将表的存储引擎改为innodb


    mysql> select table_name from information_schema.tables where table_schema='net' and engine='memory';
    +----------------------+
    | table_name |
    +----------------------+
    | f_currentxxx |
    +----------------------+
    1 row in set, 6 warnings (0.01 sec)
    出于严谨,我们将要复制的数据库进行彻底排查,确实只有这一张表是memory存储引擎。

  • 措施

  • 停止复制进程,将选项中添加replicate-ignore-table=net.f_currentxxx,重新启动复制进程,观察slave状态。


    mysql> show slave status\g
    *************************** 1. row ***************************
    slave_io_state: waiting for master to send event
    master_host: xxx.xxx.xxx.xxx
    master_user: repl
    master_port: 3306
    connect_retry: 60
    master_log_file: mysql-bin.009194
    read_master_log_pos: 939698255
    relay_log_file: relay-bin.001964
    relay_log_pos: 444060572
    relay_master_log_file: mysql-bin.009027
    slave_io_running: yes
    slave_sql_running: yes
    replicate_do_db: net
    replicate_ignore_db: mysql
    replicate_do_table:
    replicate_ignore_table: net.f_currentxxx
    replicate_wild_do_table:
    replicate_wild_ignore_table:
    last_errno: 0
    last_error:
    skip_counter: 0
    exec_master_log_pos: 444060359
    relay_log_space: 180287882098
    until_condition: none
    until_log_file:
    until_log_pos: 0
    master_ssl_allowed: no
    master_ssl_ca_file:
    master_ssl_ca_path:
    master_ssl_cert:
    master_ssl_cipher:
    master_ssl_key:
    seconds_behind_master: 179221
    master_ssl_verify_server_cert: no
    last_io_errno: 0
    last_io_error:
    last_sql_errno: 0
    last_sql_error:
    replicate_ignore_server_ids:
    master_server_id: 42
    master_uuid: eab7fcac-3cda-11e6-ada8-fa163e648db2
    master_info_file: mysql.slave_master_info
    sql_delay: 0
    sql_remaining_delay: null
    slave_sql_running_state: waiting for dependent transaction to commit
    master_retry_count: 86400
    master_bind:
    last_io_error_timestamp:
    last_sql_error_timestamp:
    master_ssl_crl:
    master_ssl_crlpath:
    retrieved_gtid_set: eab7fcac-3cda-11e6-ada8-fa163e648db2:69497322-107886661
    executed_gtid_set: 1264a536-da12-11e9-81ea-005056856ba5:1,
    eab7fcac-3cda-11e6-ada8-fa163e648db2:1-71980857
    auto_position: 1
    replicate_rewrite_db:
    channel_name:
    master_tls_version:
    1 row in set (0.00 sec)

    我们可以看到net.f_currentxxx表已经被忽略复制。持续观察一段时间后,seconds_behind_master在逐渐缩小。
    中继日志应用速度大约5分钟一个(每个中继日志为1gb大小),而主库binlog日志大约为10分钟一个(每个binlog日志为1gb大小)。

  • 总结

  • 在梳理了整个处理过程后,其实难度不高,主要是要细心,细心去排查每一个想到的点。 在非轻量级的数据库中问题发生的概率也会随着量级的增加而增多。而这恰恰是能够磨练个人的成长。

    同时,知识的储备也要充足,这是进阶高手的必要前提!