[翻译]——MySQL 8.0 Histograms

程序员文章站 2022-04-21 18:18:10

前言：本文是对这篇博客MySQL 8.0 Histograms的翻译，翻译如有不当的地方，敬请谅解，请尊重原创和翻译劳动成果，转载的时候请注明出处。谢谢！英文原文地址：https://lefred.be/content/mysql-8-0-histograms/ 翻译原文地址：https://w... ......

前言：本文是对这篇博客mysql 8.0 histograms的翻译，翻译如有不当的地方，敬请谅解，请尊重原创和翻译劳动成果，转载的时候请注明出处。谢谢！

英文原文地址：https://lefred.be/content/mysql-8-0-histograms/

翻译原文地址：

在mysql 8.0之前，mysql缺失了其它关系数据库中一个众所周知的功能：优化器的直方图

优化器团队（optimizer team）在越来越多的mysql dba的呼声中实现了这个功能。

直方图定义

但什么是直方图呢？我们来看*的定义吧，直方图是数值数据分布的准确表示。对于rdbms来说，直方图是特定列内数据分布的近似值。因此在mysql中，直方图能够帮助优化器找到最有效的执行计划。

直方图例子

为了说明直方图是如何影响优化器工作的，我会用dbt3生成的数据来演示。

我们准备了一个简单查询：

select * from orders  
  join customer on o_custkey = c_custkey 
where o_orderdate < '1993-01-01' 
  and c_mktsegment = "automobile"\g

让我们看一下传统的执行计划的explain输出，以及可视化方式（visual one)：

mysql> explain select * from orders  
       join customer on o_custkey = c_custkey 
       where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g
*************************** 1. row ***************************
           id: 1
  select_type: simple
        table: customer
   partitions: null
         type: all
possible_keys: primary
          key: null
      key_len: null
          ref: null
         rows: 149050
     filtered: 10.00
        extra: using where
*************************** 2. row ***************************
           id: 1
  select_type: simple
        table: orders
   partitions: null
         type: ref
possible_keys: i_o_custkey,i_o_orderdate
          key: i_o_custkey
      key_len: 5
          ref: dbt3.customer.c_custkey
         rows: 14
     filtered: 30.62
        extra: using where
2 rows in set, 1 warning (0.28 sec)

我们看到mysql首先对customer表做了一个全表扫描，并且它的选择估计记录（过滤）是10%；

接下来让我们运行这个查询（我使用了count(*)）,然后我们来看看有多少行记录

mysql> select count(*) from orders  
       join customer on o_custkey = c_custkey 
       where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g
*************************** 1. row ***************************
count(*): 45127
1 row in set (49.98 sec)

创建直方图

现在，我将在表customer上的字段c_mktsegment上创建一个直方图

mysql> analyze table customer update histogram on c_mktsegment with 1024 buckets;
+---------------+-----------+----------+---------------------------------------------------------+
| table         | op        | msg_type | msg_text                                                |
+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status   | histogram statistics created for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+

接下来，我们来验证查询的执行计划：

mysql> explain select * from orders  
               join customer on o_custkey = c_custkey 
               where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g
*************************** 1. row ***************************
           id: 1
  select_type: simple
        table: orders
   partitions: null
         type: all
possible_keys: i_o_custkey,i_o_orderdate
          key: null
      key_len: null
          ref: null
         rows: 1494230
     filtered: 30.62
        extra: using where
*************************** 2. row ***************************
           id: 1
  select_type: simple
        table: customer
   partitions: null
         type: eq_ref
possible_keys: primary
          key: primary
      key_len: 4
          ref: dbt3.orders.o_custkey
         rows: 1
     filtered: 19.84
        extra: using where
2 rows in set, 1 warning (1.06 sec)

现在，使用直方图后，我们可以看到customer表的“吸引力”降低了，因为order表按条件过滤的行的百分比（30.62）几乎是customer表按条件过滤行的百分比的两倍（19.84%），这将导致低order表进行查找。

注意：这段感觉没有翻译恰当，英文原文如下，如果感觉翻译比较生硬，参考原文

now with the histogram we can see that it becomes less attractive to start with customer table since almost twice as many rows (19.84%) will cause look-ups into the order table.

优化器选择对order表进行全表扫描（full sacn），此时执行计划的代价看起来似乎还高一些，，让我们看一下sql的执行时间：

mysql> select count(*) from orders  
       join customer on o_custkey = c_custkey 
       where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g
*************************** 1. row ***************************
count(*): 45127
1 row in set (6.35 sec)

sql语句的执行时间更短，明显比之前要快了

查看数据的分布

直方图数据存贮在information_schema.column_statistics表中，这个表的定义如下

+-------------+-------------+------+-----+---------+-------+
| field       | type        | null | key | default | extra |
+-------------+-------------+------+-----+---------+-------+
| schema_name | varchar(64) | no   |     | null    |       |
| table_name  | varchar(64) | no   |     | null    |       |
| column_name | varchar(64) | no   |     | null    |       |
| histogram   | json        | no   |     | null    |       |
+-------------+-------------+------+-----+---------+-------+

它的一条记录类似下面这样：

select schema_name, table_name, column_name, json_pretty(histogram) 
from information_schema.column_statistics 
where column_name = 'c_mktsegment'\g
*************************** 1. row ***************************
           schema_name: dbt3
            table_name: customer
           column_name: c_mktsegment
json_pretty(histogram): {
  "buckets": [
    [
      "base64:type254:qvvut01pqklmrq==",
      0.19837010534684954
    ],
    [
      "base64:type254:qlvjterjtkc=",
      0.3983104750546611
    ],
    [
      "base64:type254:rlvstkluvvjf",
      0.5978433710991851
    ],
    [
      "base64:type254:se9vu0vit0xe",
      0.799801232359372
    ],
    [
      "base64:type254:tufdselorvjz",
      1.0
    ]
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 255,
  "last-updated": "2018-03-02 20:21:48.271523",
  "sampling-rate": 0.6709158000670916,
  "histogram-type": "singleton",
  "number-of-buckets-specified": 1024
}

而且可以查看分布

select from_base64(substring_index(v, ':', -1)) value, concat(round(c*100,1),'%') cumulfreq, 
       concat(round((c - lag(c, 1, 0) over()) * 100,1), '%') freq  
from information_schema.column_statistics, json_table(histogram->'$.buckets', 
     '$[*]' columns(v varchar(60) path '$[0]', c double path '$[1]')) hist  
where schema_name  = 'dbt3' and table_name = 'customer' and column_name = 'c_mktsegment';
+------------+-----------+-------+
| value      | cumulfreq | freq  |
+------------+-----------+-------+
| automobile | 19.8%     | 19.8% |
| building   | 39.9%     | 20.1% |
| furniture  | 59.9%     | 19.9% |
| household  | 79.9%     | 20.1% |
| machinery  | 100.0%    | 20.1% |
+------------+-----------+-------+

你也可以用下面语法删除直方图信息。

mysql> analyze table customer drop histogram on c_mktsegment;
+---------------+-----------+----------+---------------------------------------------------------+
| table         | op        | msg_type | msg_text                                                |
+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status   | histogram statistics removed for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+
1 row in set (0.00 sec)

buckets

你会注意到，当我们创建一个直方图时，我们需要指定buckets的数量，事实上，数据被分成包含特定值以及他们基数（cardinality）的一组buckets，如果在上一个例子中检查直方图的类型，你会发现它是等宽直方图（singleton）

"histogram-type": "singleton",

这种类型的直方图最好的，因为基数是针对单个特定值。如果这次我仅使用2个存储桶（buckets）来重新创建直方图（请记住，在c_mktsegment列中有4个不同的值）：

mysql> analyze table customer update histogram on c_mktsegment with 2 buckets;
+---------------+-----------+----------+---------------------------------------------------------+
| table         | op        | msg_type | msg_text                                                |
+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status   | histogram statistics created for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+

如果我检查直方图的类型：

mysql> select schema_name, table_name, column_name, 
              json_pretty(histogram) 
       from information_schema.column_statistics 
      where column_name = 'c_mktsegment'\g
*************************** 1. row ***************************
           schema_name: dbt3
            table_name: customer
           column_name: c_mktsegment
json_pretty(histogram): {
  "buckets": [
    [
      "base64:type254:qvvut01pqklmrq==",
      "base64:type254:rlvstkluvvjf",
      0.5996992690844636,
      3
    ],
    [
      "base64:type254:se9vu0vit0xe",
      "base64:type254:tufdselorvjz",
      1.0,
      2
    ]
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 255,
  "last-updated": "2018-03-02 20:42:26.165898",
  "sampling-rate": 0.6709158000670916,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 2
}

现在的直方图类型是等高直方图，这意味着将连续范围的值分组到存储桶中，以使落入每个存储桶的数据项的数量相同。

结论：

直方图对那些不是索引中第一列的列非常有用，这些列用于join、in子查询（in-subqueries）或order by…limit的查询的where条件下使用。

另外，可以考虑尝试使用足够的存储通来获取等宽直方图。

上一篇： iP拍照之巅！iPhone 11系列拍照最大亮点：深度融合

下一篇：我TM是她母亲

[翻译]——MySQL 8.0 Histograms

centos7上mysql8.0rpm方式安装教程图解

Ubuntu Server 16.04下mysql8.0安装配置图文教程

阿里云ESC 安装 MYSQL8.0的教程

mysql8.0入坑体验

CentOS7.6安装MYSQL8.0的步骤详解

Mysql 8.0版本导出的sql文件在Mysql 5.5中运行出错

VS2017调用MySQL 8.0的方法

CentOS安装配置MySQL8.0的步骤详解

Ubuntu 18.04下mysql 8.0 安装配置方法图文教程

关于MySql 8.0下密码策略问题