How to Reduce Syntactic Overhead Using the SQL WINDOW Clause

程序员文章站 2022-06-30 14:00:16
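SQL is a verbose language, and window functions are among its most verbose features.

In a Stack Overflow question I came across recently, someone asked how to calculate the difference between the first and the last value of a time series for each day:

Input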

 
volume  tstamp
---------------------------
29011   2012-12-28 09:00:00
28701   2012-12-28 10:00:00
28830   2012-12-28 11:00:00
28353   2012-12-28 12:00:00
28642   2012-12-28 13:00:00
28583   2012-12-28 14:00:00
28800   2012-12-29 09:00:00
28751   2012-12-29 10:00:00
28670   2012-12-29 11:00:00
28621   2012-12-29 12:00:00
28599   2012-12-29 13:00:00
28278   2012-12-29 14:00:00
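
To try the queries in this article, the input can be loaded with a small setup script. This is only a sketch: the table name t and the column names volume and tstamp are taken from the queries in this article, while the column types are an assumption (PostgreSQL is assumed here).

```sql
-- Assumed schema: the article only shows the column names, not their types.
create table t (
  volume integer   not null,
  tstamp timestamp not null
);

-- The twelve sample rows from the input table.
insert into t (volume, tstamp) values
  (29011, '2012-12-28 09:00:00'),
  (28701, '2012-12-28 10:00:00'),
  (28830, '2012-12-28 11:00:00'),
  (28353, '2012-12-28 12:00:00'),
  (28642, '2012-12-28 13:00:00'),
  (28583, '2012-12-28 14:00:00'),
  (28800, '2012-12-29 09:00:00'),
  (28751, '2012-12-29 10:00:00'),
  (28670, '2012-12-29 11:00:00'),
  (28621, '2012-12-29 12:00:00'),
  (28599, '2012-12-29 13:00:00'),
  (28278, '2012-12-29 14:00:00');
```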
 

Expected output

 
first  last   difference  date
------------------------------------
29011  28583  428         2012-12-28
28800  28278  522         2012-12-29
 

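How to write the query

Note that the values and the timestamps do not necessarily correlate, even though the sample data may suggest it. There is no rule saying that if timestamp2 > timestamp1, then value2 < value1. In the sample data, for instance, the lowest volume on 2012-12-28 is 28353 at 12:00, not the last value 28583. Otherwise, this simple query would work (using PostgreSQL syntax):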

 
select
  max(volume)               as first,
  min(volume)               as last,
  max(volume) - min(volume) as difference,
  cast(tstamp as date)      as date
from t
group by cast(tstamp as date);
 

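There are several ways to find the first and last values within a group that do not involve window functions. For example:

  • In Oracle, you can use the FIRST and LAST functions. For some arcane reason, these are not written FIRST(...) WITHIN GROUP (ORDER BY ...) or LAST(...) WITHIN GROUP (ORDER BY ...) like other ordered-set aggregate functions, but some_aggregate_function(...) KEEP (DENSE_RANK FIRST ORDER BY ...). Go figure.
  • In PostgreSQL, you can use the DISTINCT ON syntax together with ORDER BY and LIMIT.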
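
The two alternatives above might look roughly as follows. Both are sketches rather than tested solutions. The aliases first_volume, last_volume and trading_day are illustrative names chosen to stay clear of keywords. The Oracle variant uses trunc(tstamp) because an Oracle DATE still carries a time portion.

```sql
-- Oracle: FIRST / LAST used as aggregate functions via KEEP.
-- max() only breaks ties among rows sharing the first (or last) tstamp.
select
  max(volume) keep (dense_rank first order by tstamp) as first_volume,
  max(volume) keep (dense_rank last  order by tstamp) as last_volume,
  max(volume) keep (dense_rank first order by tstamp)
    - max(volume) keep (dense_rank last order by tstamp) as diff,
  trunc(tstamp) as trading_day
from t
group by trunc(tstamp)
order by trunc(tstamp);
```

```sql
-- PostgreSQL: DISTINCT ON keeps the first row per day in ORDER BY order.
-- Sorting by tstamp desc instead would yield the last value per day.
select distinct on (cast(tstamp as date))
  cast(tstamp as date) as trading_day,
  volume               as first_volume
from t
order by cast(tstamp as date), tstamp;
```

Note that the DISTINCT ON variant returns only one of the two values per query, so getting first and last side by side would need two such queries joined on the day.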

More details about the various approaches can be found here:
https://blog.jooq.org/2017/09/22/how-to-write-efficient-top-n-queries-in-sql

The best approach would be an aggregate function like Oracle's, but few databases offer one. So we will resort to the FIRST_VALUE and LAST_VALUE window functions:

 
select distinct
  first_value(volume) over (
    partition by cast(tstamp as date)
    order by tstamp
    rows between unbounded preceding and unbounded following
  ) as first,
  last_value(volume) over (
    partition by cast(tstamp as date)
    order by tstamp
    rows between unbounded preceding and unbounded following
  ) as last,
  first_value(volume) over (
    partition by cast(tstamp as date)
    order by tstamp
    rows between unbounded preceding and unbounded following
  )
  - last_value(volume) over (
    partition by cast(tstamp as date)
    order by tstamp
    rows between unbounded preceding and unbounded following
  ) as diff,
  cast(tstamp as date) as date
from t
order by cast(tstamp as date)
 

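Oops.

That doesn't look very readable, but it yields the correct result. Granted, we could wrap the definitions of the first and last columns in a derived table, but that would still leave us with two repetitions of the window definition: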

 
partition by cast(tstamp as date)
order by tstamp
rows between unbounded preceding and unbounded following
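
For reference, the derived table variant mentioned above could be sketched like this (PostgreSQL assumed; the aliases first_volume, last_volume, trading_day and the derived table name d are illustrative). The subtraction is now written only once, but the window definition still appears twice:

```sql
select distinct
  first_volume,
  last_volume,
  first_volume - last_volume as diff,
  trading_day
from (
  select
    first_value(volume) over (
      partition by cast(tstamp as date)
      order by tstamp
      rows between unbounded preceding and unbounded following
    ) as first_volume,
    last_value(volume) over (
      partition by cast(tstamp as date)
      order by tstamp
      rows between unbounded preceding and unbounded following
    ) as last_volume,
    cast(tstamp as date) as trading_day
  from t
) as d
order by trading_day;
```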
 

The WINDOW clause to the rescue

Luckily, at least 3 databases have implemented the SQL standard WINDOW clause:

  • MySQL
  • PostgreSQL
  • Sybase SQL Anywhere

The query above can be refactored into this one:

 
select distinct
  first_value(volume) over w as first,
  last_value(volume) over w as last,
  first_value(volume) over w
    - last_value(volume) over w as diff,
  cast(tstamp as date) as date
from t
window w as (
  partition by cast(tstamp as date)
  order by tstamp
  rows between unbounded preceding and unbounded following
)
order by cast(tstamp as date)
 

Notice how a window name is bound to a window specification in much the same way as a common table expression is defined (WITH clause):

 
window
    <window-name> as (<window-specification>)
{  ,<window-name> as (<window-specification>)... }
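
As the grammar shows, several named windows can be declared in one comma-separated list. A minimal sketch against the same table t, assuming PostgreSQL or MySQL 8 (the window names per_day and running and the column aliases are made up for illustration):

```sql
select
  tstamp,
  volume,
  -- first volume of the row's own day
  first_value(volume) over per_day as first_of_day,
  -- running total over the whole series, not reset per day
  sum(volume) over running as running_total
from t
window
  per_day as (partition by cast(tstamp as date) order by tstamp),
  running as (order by tstamp
              rows between unbounded preceding and current row)
order by tstamp;
```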
 

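Not only can I reuse entire specifications, I can also build a specification on top of a partial one and reuse only parts of it. My previous query could be rewritten like this: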

 
select distinct
  first_value(volume) over w3 as first,
  last_value(volume) over w3 as last,
  first_value(volume) over w3
    - last_value(volume) over w3 as diff,
  cast(tstamp as date) as date
from t
window
  w1 as (partition by cast(tstamp as date)),
  w2 as (w1 order by tstamp),
  w3 as (w2 rows between unbounded preceding
                     and unbounded following)
order by cast(tstamp as date)
 

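Each window specification can be created from scratch or be based on a previously defined window specification. Note that this also holds when referencing a window definition. If I wanted to reuse the PARTITION BY and ORDER BY clauses but change the frame clause (ROWS ...), I could write this: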

 
select distinct
  first_value(volume) over (
    w2 rows between unbounded preceding and current row
  ) as first,
  last_value(volume) over (
    w2 rows between current row and unbounded following
  ) as last,
  first_value(volume) over (
    w2 rows unbounded preceding
  ) - last_value(volume) over (
    w2 rows between 1 preceding and unbounded following
  ) as diff,
  cast(tstamp as date) as date
from t
window
  w1 as (partition by cast(tstamp as date)),
  w2 as (w1 order by tstamp)
order by cast(tstamp as date)
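
The same technique is handy when different columns need different frames over the same partitioning and ordering. The following sketch (PostgreSQL assumed; the aliases change_since_open and moving_avg_3 are illustrative) computes, for each row, the change since the day's first value and a moving average over up to three rows. In PostgreSQL, a window that extends a named one cannot specify its own PARTITION BY, can add ORDER BY only if the named window has none, and the named window must not carry a frame clause itself, which is why w2 stops at ORDER BY.

```sql
select
  tstamp,
  volume,
  -- frame: from the start of the day up to the current row
  volume - first_value(volume) over (
    w2 rows unbounded preceding
  ) as change_since_open,
  -- frame: the current row and up to two rows before it
  avg(volume) over (
    w2 rows between 2 preceding and current row
  ) as moving_avg_3
from t
window
  w1 as (partition by cast(tstamp as date)),
  w2 as (w1 order by tstamp)
order by tstamp;
```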