Hive 令人头痛的multi-distinct

程序员文章站 2022-05-29 20:38:31

...

线上一个查询简化如下：Selectdt,count(distinctc1),count(distinctcasewhenc20andc1=0thenc1end),count(distinctcasewhenc20andc10thenc1end)fromtwheredtbetwe

ABSTRACTSYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAMEt))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT(TOK_SELEXPR (TOK_TABLE_OR_COL dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count(TOK_TABLE_OR_COL c1))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when(and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0))(TOK_TABLE_OR_COL c1)))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when(and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0))(TOK_TABLE_OR_COL c1))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE(TOK_TABLE_OR_COL dt) '20131108' '20131110')) (TOK_GROUPBY (TOK_TABLE_OR_COLdt)))) STAGEDEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGEPLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: t TableScan alias: t Filter Operator predicate: expr: dt BETWEEN '20131108'AND '20131110' type: Boolean //通过select operator做投影 Select Operator expressions: expr: dt type: string expr: c1 type: int expr: c2 type: int outputColumnNames: dt, c1, c2 //在MAP端进行简单的聚合，雷区1：假设有N个distinct，MAP处理数据有M条，，那么这部处理后的输出是N*M条数据，因为MAP会对dt,keys[i]做聚合操作，所以尽量在MAP端过滤尽可能多的数据 Group By Operator aggregations: expr: count(DISTINCTc1) expr: count(DISTINCTCASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END) expr: count(DISTINCTCASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END) bucketGroup: false keys: expr: dt type: string expr: c1 type: int expr: CASE WHEN (((c2> 0) and (c1 = 0))) THEN (c1) END type: int expr: CASE WHEN (((c2> 0) and (c1 > 0))) THEN (c1) END type: int mode: hash outputColumnNames: _col0,_col1, _col2, _col3, _col4, _col5, _col6 //雷区2：在做Reduce Sink时是根据partition cplumns进行HASH的方式，那么对于按date分区的表来说一天的所有数据被放大N倍传输到Reducer进行运算，导致性能长尾或者OOME. Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: int expr: _col2 type: int expr: _col3 type: int sort order: ++++ Map-reduce partitioncolumns: expr: _col0 type: string tag: -1 value expressions: expr: _col4 type: bigint expr: _col5 type: bigint expr: _col6 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCTKEY._col1:0._col0) expr: count(DISTINCTKEY._col1:1._col0) expr: count(DISTINCTKEY._col1:2._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1,_col2, _col3 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint expr: _col2 type: bigint expr: _col3 type: bigint outputColumnNames: _col0, _col1,_col2, _col3 File Output Operator compressed: true GlobalTableId: 0 table: input format:org.apache.hadoop.mapred.TextInputFormat output format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1

查看执行计划（省去非关键部分）：

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages:Stage-1, Stage-3, Stage-4 Stage-3 is a root stage Stage-4 is a root stage Stage-0 is a root stage