hive 执行计划
程序员文章站
2022-06-01 22:19:11
...
hive执行计划语法
EXPLAIN [EXTENDED] query
EXTENDED参数:输出执行计划中操作符的额外信息;通常,展示物理信息,如文件名等
hive查询转换为一个 有向无环图 的阶段序列;这些阶段可能是 Map/Reduce阶段 或者是执行元数据与文件操作(例如:重命名,移动); explain 输出包括三部分:
- 查询语句的抽象语法树
- 执行计划不同阶段间的依赖关系
- 每个阶段的描述
阶段描述信息以操作符和与其相关元数据来显示 操作序列;操作符元数据有以下东西组成,像 FilterOperator 的过滤表达式;SelectOperator 的 选择表达式;FileSinkOperator 的输出文件名
执行计划语法介绍到此结束,下面给出一个例子。
执行计划示例
EXPLAIN
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;
执行计划输出如下:
抽象语法树:
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))
阶段依赖关系图:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
stage-1 是 root 阶段
stage-2在stage-1执行完后执行
stage-0在stage-2执行结束后执行
各阶段执行计划
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
src
Reduce Output Operator
key expressions:
expr: key
type: string
sort order: +
Map-reduce partition columns:
expr: rand()
type: double
tag: -1
value expressions:
expr: substr(value, 4)
type: string
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(UDFToDouble(VALUE.0))
keys:
expr: KEY.0
type: string
mode: partial1
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.mapred.SequenceFileOutputFormat
name: binary_table
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
/tmp/hive-zshao/67494501/106593589.10001
Reduce Output Operator
key expressions:
expr: 0
type: string
sort order: +
Map-reduce partition columns:
expr: 0
type: string
tag: -1
value expressions:
expr: 1
type: double
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE.0)
keys:
expr: KEY.0
type: string
mode: final
Select Operator
expressions:
expr: 0
type: string
expr: 1
type: double
Select Operator
expressions:
expr: UDFToInteger(0)
type: int
expr: 1
type: double
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
name: dest_g1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
name: dest_g1
上例中,包含两个map/reduce 阶段(stage-1,stage-2),一个文件系统相关阶段(stage-0).stage-0将结果从临时目录移动到dest_g1表相应的目录下;
一个Map/Reduce阶段有两部分组成:
- 从表别名到Map Operator Tree映射:该映射通知Mapper,其操作树被调用来处理特定表的行或者前一个Map/Reduce阶段的输出数据;上例stage-1中,src表的行被以Reduce Output Operator为根的操作符树处理;在stage-2中,stage-1输出行被stage2中以Reduce Output Operator为根的操作符树处理;这两个Reduce Output Operator 根据元数据中展示条件 分区数据到Reducers
- Reduce Operator Tree:处理Map/Reduce Job 中Reducers所有数据行;在stage-1中, Reducer Operator Tree执行部分聚合;在stage-2中, Reducer Operator Tree从stage-1的部分聚合结果计算最终聚合结果
hive执行计划作用
分析作业执行过程,优化作业执行流程,提升作业执行效率;例如,数据过滤条件从reduce端提前到map端,有效减少map/reduce间shuffle数据量,提升作业执行效率;
提前过滤数据数据集,减少不必要的读取操作;例如: hive join 操作先于 where 条件顾虑,将 分区条件放入 on语句中,能够有效减少 输入数据集;