欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

impala-count()报错换用NDV()函数

程序员文章站 2022-03-05 14:29:06
...

使用impala查询引擎的时候,如果count(disticnt column)数据量过大会报错,报错内容

正在提取遇到以下错误的结果:

 (370) Query analysis error occurred during query execution: [HY000] : AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT a.udid_); deviating function: count(DISTINCT a.ip_)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.
. Error 370. SQLSTATE HY000

简单来说就是count(distinct column)让你换成NDV(column),换了之后查询会快很多,
如果想正常使用count()需要设置参数,

官方文档:

set APPX_COUNT_DISTINCT=true;

NDV功能
一个聚合函数,返回一个类似于COUNT(DISTINCT col )结果的近似值,即“不同值的数量” 。 它比COUNT和DISTINCT的组合快得多,并且使用恒定的内存量,因此对于具有高基数的列而言,内存密集度较低。
 
句法:
 
 NDV([DISTINCT | ALL] expression ) 
使用说明:
这是COMPUTE STATS语句在内部使用的机制,用于计算列中不同值的数量。
由于此数字是估计值,因此可能无法反映列中不同值的精确数量,尤其是如果基数非常低或非常高。 如果估计的数量高于表中的行数,Impala会在查询计划期间在内部调整值。
返回类型: Impala 2.0及更高版本中的DOUBLE ; 早期版本中的STRING
复杂类型考虑:
要在聚合函数中访问具有复杂类型( ARRAY , STRUCT或MAP )的列,请使用查询中的连接表示法解压缩单个元素,然后将该函数应用于最终标量项,字段,键或值位于列中任何嵌套类型层次结构的底部。 有关在Impala中使用复杂类型的详细信息,请参阅复杂类型(仅限Impala 2.3或更高版本) 。
下面的示例演示如何使用包含嵌套复杂类型( STRUCT项的ARRAY )的列中的值调用多个聚合函数。 使用连接表示法在查询内解压缩数组。 使用ITEM伪列引用数组元素,并使用点表示法引用数组元素内的结构字段。 使用数字R_NATIONKEY字段计算诸如SUM()和AVG()的数值,并且从字符串N_NAME字段计算通用MAX()和MIN()值。
 describe region; +-------------+-------------------------+---------+ | name | type | comment | +-------------+-------------------------+---------+ | r_regionkey | smallint | | | r_name | string | | | r_comment | string | | | r_nations | array<struct< | | | | n_nationkey:smallint, | | | | n_name:string, | | | | n_comment:string | | | | >> | | +-------------+-------------------------+---------+ select r_name, r_nations.item.n_nationkey from region, region.r_nations as r_nations order by r_name, r_nations.item.n_nationkey; +-------------+------------------+ | r_name | item.n_nationkey | +-------------+------------------+ | AFRICA | 0 | | AFRICA | 5 | | AFRICA | 14 | | AFRICA | 15 | | AFRICA | 16 | | AMERICA | 1 | | AMERICA | 2 | | AMERICA | 3 | | AMERICA | 17 | | AMERICA | 24 | | ASIA | 8 | | ASIA | 9 | | ASIA | 12 | | ASIA | 18 | | ASIA | 21 | | EUROPE | 6 | | EUROPE | 7 | | EUROPE | 19 | | EUROPE | 22 | | EUROPE | 23 | | MIDDLE EAST | 4 | | MIDDLE EAST | 10 | | MIDDLE EAST | 11 | | MIDDLE EAST | 13 | | MIDDLE EAST | 20 | +-------------+------------------+ select r_name, count(r_nations.item.n_nationkey) as count, sum(r_nations.item.n_nationkey) as sum, avg(r_nations.item.n_nationkey) as avg, min(r_nations.item.n_name) as minimum, max(r_nations.item.n_name) as maximum, ndv(r_nations.item.n_nationkey) as distinct_vals from region, region.r_nations as r_nations group by r_name order by r_name; +-------------+-------+-----+------+-----------+----------------+---------------+ | r_name | count | sum | avg | minimum | maximum | distinct_vals | +-------------+-------+-----+------+-----------+----------------+---------------+ | AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 | | AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 | | ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 | | EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 | | MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 | +-------------+-------+-----+------+-----------+----------------+---------------+ 
限制:
此功能无法在分析上下文中使用。 也就是说,此函数根本不允许使用OVER()子句。
例子:
以下示例查询十亿行表以说明COUNT(DISTINCT)和NDV()的相对性能。 它显示了COUNT(DISTINCT)如何给出精确答案,但对于近似结果足够的大规模数据来说效率很低。 NDV()函数给出了近似结果,但速度更快。
 select count(distinct col1) from sample_data; +---------------------+ | count(distinct col1)| +---------------------+ | 100000 | +---------------------+ Fetched 1 row(s) in 20.13s select cast(ndv(col1) as bigint) as col1 from sample_data; +----------+ | col1 | +----------+ | 139017 | +----------+ Fetched 1 row(s) in 8.91s 
以下示例显示了如何在单个查询中编写多个NDV()调用,以便轻松了解哪些列具有更多或更少的不同值。 此技术比使用COUNT(DISTINCT)调用运行一系列查询更快。
 select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2, cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4 from sample_data; +----------+-----------+------------+-----------+ | col1 | col2 | col3 | col4 | +----------+-----------+------------+-----------+ | 139017 | 282 | 46 | 145636240 | +----------+-----------+------------+-----------+ Fetched 1 row(s) in 34.97s select count(distinct col1) from sample_data; +---------------------+ | count(distinct col1)| +---------------------+ | 100000 | +---------------------+ Fetched 1 row(s) in 20.13s select count(distinct col2) from sample_data; +----------------------+ | count(distinct col2) | +----------------------+ | 278 | +----------------------+ Fetched 1 row(s) in 20.09s select count(distinct col3) from sample_data; +-----------------------+ | count(distinct col3) | +-----------------------+ | 46 | +-----------------------+ Fetched 1 row(s) in 19.12s select count(distinct col4) from sample_data; +----------------------+ | count(distinct col4) | +----------------------+ | 147135880 | +----------------------+ Fetched 1 row(s) in 266.95s 
父主题: Impala聚合函数