欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  数据库

Hive 令人头痛的multi-distinct

程序员文章站 2022-05-29 20:38:31
...

线上一个查询简化如下:Selectdt,count(distinctc1),count(distinctcasewhenc20andc1=0thenc1end),count(distinctcasewhenc20andc10thenc1end)fromtwheredtbetwe

ABSTRACTSYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAMEt))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT(TOK_SELEXPR (TOK_TABLE_OR_COL dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count(TOK_TABLE_OR_COL c1))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when(and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0))(TOK_TABLE_OR_COL c1)))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when(and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0))(TOK_TABLE_OR_COL c1))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE(TOK_TABLE_OR_COL dt) '20131108' '20131110')) (TOK_GROUPBY (TOK_TABLE_OR_COLdt)))) STAGEDEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGEPLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: t TableScan alias: t Filter Operator predicate: expr: dt BETWEEN '20131108'AND '20131110' type: Boolean //通过select operator做投影 Select Operator expressions: expr: dt type: string expr: c1 type: int expr: c2 type: int outputColumnNames: dt, c1, c2 //在MAP端进行简单的聚合,雷区1:假设有N个distinct,MAP处理数据有M条,,那么这部处理后的输出是N*M条数据,因为MAP会对dt,keys[i]做聚合操作,所以尽量在MAP端过滤尽可能多的数据 Group By Operator aggregations: expr: count(DISTINCTc1) expr: count(DISTINCTCASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END) expr: count(DISTINCTCASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END) bucketGroup: false keys: expr: dt type: string expr: c1 type: int expr: CASE WHEN (((c2> 0) and (c1 = 0))) THEN (c1) END type: int expr: CASE WHEN (((c2> 0) and (c1 > 0))) THEN (c1) END type: int mode: hash outputColumnNames: _col0,_col1, _col2, _col3, _col4, _col5, _col6 //雷区2:在做Reduce Sink时是根据partition cplumns进行HASH的方式,那么对于按date分区的表来说一天的所有数据被放大N倍传输到Reducer进行运算,导致性能长尾或者OOME. Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: int expr: _col2 type: int expr: _col3 type: int sort order: ++++ Map-reduce partitioncolumns: expr: _col0 type: string tag: -1 value expressions: expr: _col4 type: bigint expr: _col5 type: bigint expr: _col6 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCTKEY._col1:0._col0) expr: count(DISTINCTKEY._col1:1._col0) expr: count(DISTINCTKEY._col1:2._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1,_col2, _col3 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint expr: _col2 type: bigint expr: _col3 type: bigint outputColumnNames: _col0, _col1,_col2, _col3 File Output Operator compressed: true GlobalTableId: 0 table: input format:org.apache.hadoop.mapred.TextInputFormat output format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1


查看执行计划(省去非关键部分):

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages:Stage-1, Stage-3, Stage-4 Stage-3 is a root stage Stage-4 is a root stage Stage-0 is a root stage