PostgreSQL利用递归优化求稀疏列唯一值的方法

程序员文章站 2022-04-01 22:45:54

在数据库中经常会碰到一些表的列是稀疏列，只有很少的值，例如性别字段，一般就只有2种不同的值。但是当我们求这些稀疏列的唯一值时，如果表的数据量很大，速度还是会很慢。例如：创建测试表bill=# crea...

在数据库中经常会碰到一些表的列是稀疏列，只有很少的值，例如性别字段，一般就只有2种不同的值。
但是当我们求这些稀疏列的唯一值时，如果表的数据量很大，速度还是会很慢。

例如：
创建测试表

bill=# create table t_sex (sex char(1), otherinfo text);
create table
bill=# insert into t_sex select 'm', generate_series(1,10000000)||'this is test';
insert 0 10000000
bill=# insert into t_sex select 'w', generate_series(1,10000000)||'this is test';
insert 0 10000000

查询：
可以看到下面的查询速度很慢。

bill=# select count(distinct sex) from t_sex;
 count
-------
   2
(1 row)

time: 8803.505 ms (00:08.804)
bill=# select sex from t_sex t group by sex;
 sex
-----
 m
 w
(2 rows)

time: 1026.464 ms (00:01.026)

那么我们对该字段加上索引又是什么情况呢？

速度依然没有明显

bill=# create index idx_sex_1 on t_sex(sex);
create index
bill=# select count(distinct sex) from t_sex;
 count
-------
   2
(1 row)

time: 8502.460 ms (00:08.502)
bill=# select sex from t_sex t group by sex;
 sex
-----
 m
 w
(2 rows)

time: 572.353 ms

的变化，可以看到执行计划已经使用index only scan了。

bill=# explain select count(distinct sex) from t_sex;
                     query plan
----------------------------------------------------------------------------------------------
 aggregate (cost=371996.44..371996.45 rows=1 width=8)
  -> index only scan using idx_sex_1 on t_sex (cost=0.44..321996.44 rows=20000000 width=2)
(2 rows)

同样的sql我们看看在oracle中性能如何？

创建测试表：

sql> create table t_sex (sex char(1), otherinfo varchar2(100));

table created.

sql> insert into t_sex select 'm', rownum||'this is test' from dual connect by level <=10000000;

10000000 rows created.

sql> commit;

commit complete.

sql> insert into t_sex select 'w', rownum||'this is test' from dual connect by level <=10000000;

10000000 rows created.

sql> commit;

commit complete.

性能测试：

sql> set lines 1000 pages 2000
sql> set autotrace on
sql> set timing on

sql> select count(distinct sex) from t_sex;

count(distinctsex)
------------------
         2

elapsed: 00:00:01.58

execution plan
----------------------------------------------------------
plan hash value: 3915432945

----------------------------------------------------------------------------
| id | operation     | name | rows | bytes | cost (%cpu)| time   |
----------------------------------------------------------------------------
|  0 | select statement  |    |   1 |   3 | 20132  (1)| 00:00:01 |
|  1 | sort group by   |    |   1 |   3 |      |     |
|  2 |  table access full| t_sex |  14m|  42m| 20132  (1)| 00:00:01 |
----------------------------------------------------------------------------

note
-----
  - dynamic statistics used: dynamic sampling (level=2)


statistics
----------------------------------------------------------
     0 recursive calls
     0 db block gets
   74074 consistent gets
     0 physical reads
     0 redo size
    552 bytes sent via sql*net to client
    608 bytes received via sql*net from client
     2 sql*net roundtrips to/from client
     1 sorts (memory)
     0 sorts (disk)
     1 rows processed

sql> select sex from t_sex t group by sex;

se
--
m
w

elapsed: 00:00:01.08

execution plan
----------------------------------------------------------
plan hash value: 3915432945

----------------------------------------------------------------------------
| id | operation     | name | rows | bytes | cost (%cpu)| time   |
----------------------------------------------------------------------------
|  0 | select statement  |    |  14m|  42m| 20558  (3)| 00:00:01 |
|  1 | sort group by   |    |  14m|  42m| 20558  (3)| 00:00:01 |
|  2 |  table access full| t_sex |  14m|  42m| 20132  (1)| 00:00:01 |
----------------------------------------------------------------------------

note
-----
  - dynamic statistics used: dynamic sampling (level=2)


statistics
----------------------------------------------------------
     0 recursive calls
     0 db block gets
   74074 consistent gets
     0 physical reads
     0 redo size
    589 bytes sent via sql*net to client
    608 bytes received via sql*net from client
     2 sql*net roundtrips to/from client
     1 sorts (memory)
     0 sorts (disk)
     2 rows processed

可以看到oracle的性能即使不加索引也明显比postgresql中要好。
那么我们在postgresql中是不是没办法继续优化了呢？这种情况我们利用pg中的递归语句结合索引可以大幅提升性能。

sql改写：

bill=# with recursive tmp as (
bill(#  (
bill(#   select min(t.sex) as sex from t_sex t where t.sex is not null
bill(#  )
bill(#  union all
bill(#  (
bill(#   select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null)
bill(#    from tmp s where s.sex is not null
bill(#  )
bill(# )
bill-# select count(distinct sex) from tmp;
 count
-------
   2
(1 row)

time: 2.711 ms

查看执行计划：

bill=# explain with recursive tmp as (
bill(#  (
bill(#   select min(t.sex) as sex from t_sex t where t.sex is not null
bill(#  )
bill(#  union all
bill(#  (
bill(#   select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null)
bill(#    from tmp s where s.sex is not null
bill(#  )
bill(# )
bill-# select count(distinct sex) from tmp;
                           query plan
----------------------------------------------------------------------------------------------------------------------
 aggregate (cost=53.62..53.63 rows=1 width=8)
  cte tmp
   -> recursive union (cost=0.46..51.35 rows=101 width=32)
      -> result (cost=0.46..0.47 rows=1 width=32)
         initplan 3 (returns $1)
          -> limit (cost=0.44..0.46 rows=1 width=2)
             -> index only scan using idx_sex_1 on t_sex t (cost=0.44..371996.44 rows=20000000 width=2)
                index cond: (sex is not null)
      -> worktable scan on tmp s (cost=0.00..4.89 rows=10 width=32)
         filter: (sex is not null)
  -> cte scan on tmp (cost=0.00..2.02 rows=101 width=32)
(11 rows)

time: 1.371 ms

可以看到执行时间从原先的8000ms降低到了2ms，提升了几千倍！

甚至对比oracle，性能也是提升了很多。

但是需要注意的是：这种写法仅仅是针对稀疏列，换成数据分布广泛的字段，显然性能是下降的, 所以使用递归sql不适合数据分布广泛的字段的group by或者count(distinct)操作。

到此这篇关于postgresql利用递归优化求稀疏列唯一值的文章就介绍到这了,更多相关postgresql递归优化内容请搜索以前的文章或继续浏览下面的相关文章希望大家以后多多支持！