Hive中的排序及优化 ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY
Hive 提供了多重排序语句,我们这里对这几种排序方法做一解析,并进行测试
编排如下:
试验场景
Hive 中的 ORDER BY 与其不足
Hive 中的 SORT BY,DISTRIBUTE BY
Hive 中的CLUSTER BY 与其不足
完成相同功能的Job 执行时长对比
实验场景:
我们创建了一个按天分区的分区表,将数据按天分区。
CREATE EXTERNAL TABLE `clickcube_mid`(
`logtype` bigint,
`date` string,
`hour` bigint,
`projectid` bigint,
`campaignid` bigint,
`templateid` bigint,
`mediaid` bigint,
`slotid` bigint,
`channeltype` bigint,
`regioncode` string,
`campclick` bigint,
`campimp` bigint,
`mediaclick` bigint,
`mediaimp` bigint,
`templateimp` bigint,
`templatecampimp` bigint,
`mediaclickcost` double,
`campclickcost` double)
PARTITIONED BY (
`day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
TBLPROPERTIES (
'last_modified_by'='cloudera-scm',
'last_modified_time'='1530676367',
'transient_lastDdlTime'='1530676367')
Hive 中的 ORDER BY 与其不足:
我想大家对ORDER BY 应该都不陌生,常见的MySQL 就支持 ORDER BY,
ORDER BY 可以指定多个字段,可以按照某个字段进行 升序ASC , 或者 降序DESC.
Hive 中 ORDER BY 和其他SQL 方言并没区别,会对查询结果进行一个全局排序。
其缺点:
1) 由于是全局排序,所以所有的数据会通过一个Reducer 进行处理,当数据结果较大的时候,
一个Reducer 进行处理十分影响性能。
注意事项:
当开启MR 严格模式的时候ORDER BY 必须要设置 LIMIT 子句 ,否则会报错
开启严格模式:
set hive.mapred.mode=strict;
执行排序
SELECT * FROM clickcube_mid WHERE day='2018-07-03' ORDER BY mediaid;
Error: Error while compiling statement: FAILED: SemanticException 1:60 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.large.query to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'mediaid' (state=42000,code=40000)
此外
开启严格模式 hive.mapred.mode=strict ,对于分区表, 必须要对分区字段加限制条件
select * from clickcube_mid limit 10;
Error: Error while compiling statement: FAILED: SemanticException Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.large.query to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "clickcube_mid" Table "clickcube_mid" (state=42000,code=40000)
Hive 中的 SORT BY,DISTRIBUTE BY:
由于Hive 中的ORDER BY 对于大数据集 存在性能问题,
延伸出了部分排序,以及将按相同KEY 控制到同一划分集合的需求。
即以下两个方案 SORT BY , DISTRIBUTE BY, 我们分别对这两个方案进行介绍。
SORT BY:
SORT BY 是一个部分排序方案, 其只会在每个reducer 中对数据进行排序,
也就是执行一个局部排序过程。
注意:
使用sort by 你可以指定执行的reduce 个数 (set mapred.reduce.tasks=<number>),
对输出的数据再执行归并排序,即可以得到全部结果。
DISTRIBUTE BY:
DISTRIBUTE BY 控制map 中的输出在 reducer 中是如何进行划分的。
使用DISTRIBUTE BY 可以保证相同KEY的记录被划分到一个Reduce 中。
Hive 中的CLUSTER BY 与其不足:
如果对某一列既想采用SORT BY 也想采用 DISTRIBUTE BY ,
那么可以使用CLUSTER BY 进行排序。
注意:
排序只能是升序排序(默认排序规则),不能指定排序规则为asc 或者desc。
完成相同功能的Job 执行时长对比:
我们先构建一个需求:
按日期的降序排列 ,媒体编号 升序,广告位升序 ,媒体点位的总花费。
数据表与最开始描述的相同,我们导入多个分区,如图所示:
select count(1) from clickcube_mid;
+--------+--+
| _c0 |
+--------+--+
| 51204 |
+--------+--+
Hive 中的查询语句说明如下:
[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
ORDER BY:
SELECT `date`, mediaid, slotid, SUM(mediaclickcost)
FROM clickcube_mid
GROUP BY `date`, mediaid, slotid
ORDER BY `date` DESC, mediaid, slotid;
为了进行测试,我们修改下MR 中 task 中的数量:
0: jdbc:hive2://master:10000> set mapred.reduce.tasks;
+-------------------------+--+
| set |
+-------------------------+--+
| mapred.reduce.tasks=-1 |
+-------------------------+--+
1 row selected (0.011 seconds)
重新设置 tasks 为6个task.
set mapred.reduce.tasks=6;
SORT BY,DISTRIBUTE BY:
SELECT `date`, mediaid, slotid, SUM(mediaclickcost)
FROM clickcube_mid
GROUP BY `date`, mediaid, slotid
DISTRIBUTE BY `date`
SORT BY `date` DESC, mediaid, slotid;
可以看到设置生效了,确实分配了6个task
| 20180525 | 11 | 20 | 0.0 |
| 20180525 | 11 | 23 | 0.25 |
| 20180525 | 11 | 25 | 38.25 |
| 20180516 | 11 | 15 | 0.0 |
| 20180516 | 11 | 16 | 37.5 |
| 20180510 | 11 | 15 | 2.6 |
| 20180507 | -1 | -1 | 0.0 |
| 20180507 | 2 | -1 | -1.0 |
| 20180507 | 11 | 15 | 1.3 |
| 20180408 | -1 | -1 | 0.0 |
| 20180330 | -1 | -1 | 0.0 |
| 20180625 | 2 | 5 | 0.0 |
| 20180625 | 14 | 37 | 0.0 |
| 20180625 | 15 | 38 | 0.5 |
| 20180625 | 16 | 39 | 0.6000000000000001 |
+-----------+----------+---------+----------------------+--+
| date | mediaid | slotid | _c3 |
+-----------+----------+---------+----------------------+--+
| 20180625 | 16 | 41 | 115.0 |
| 20180625 | 16 | 59 | 16.5 |
| 20180625 | 18 | 42 | 0.25 |
| 20180625 | 19 | 46 | 0.0 |
| 20180625 | 22 | 47 | 0.0 |
| 20180625 | 24 | 52 | 1237.9499999999982 |
| 20180625 | 26 | 54 | 1379.4000000000044 |
| 20180616 | 11 | 25 | 0.0 |
| 20180616 | 13 | 7 | 0.0 |
| 20180616 | 14 | 37 | 0.0 |
| 20180616 | 16 | 39 | 0.4 |
| 20180616 | 16 | 41 | 1.5 |
| 20180616 | 18 | 42 | 0.75 |
| 20180616 | 20 | 44 | 0.0 |
| 20180616 | 22 | 47 | 0.4 |
| 20180616 | 22 | 48 | 2.4 |
| 20180610 | 11 | 25 | 0.0 |
| 20180610 | 14 | 37 | 0.0 |
| 20180610 | 16 | 39 | 1.0 |
可以看到数据也不是完全有序的
CLUSTER BY:
首先验证下,CLUSTER 不能指定排序方向
SELECT `date`, mediaid, slotid, SUM(mediaclickcost)
FROM clickcube_mid GROUP BY `date`, mediaid, slotid
CLUSTER BY `date` DESC, mediaid, slotid;
再尝试不指定方向
SELECT `date`, mediaid, slotid, SUM(mediaclickcost)
FROM clickcube_mid
GROUP BY `date`, mediaid, slotid
CLUSTER BY `date` , mediaid , slotid ;
查询的部分结果:
| 20180626 | 14 | 37 | 0.0 |
| 20180626 | 16 | 41 | 259.25 |
| 20180626 | 16 | 59 | 308.0 |
| 20180626 | 22 | 47 | 0.0 |
| 20180627 | 26 | 54 | 1206.6000000000108 |
| 20180628 | 16 | 39 | 0.2 |
| 20180629 | 18 | 42 | 2.0 |
| 20180630 | 26 | 54 | 0.2 |
| 20180703 | 16 | 93 | 78.80000000000018 |
| 20180330 | -1 | -1 | 0.0 |
| 20180408 | -1 | -1 | 0.0 |
| 20180505 | 11 | 15 | 0.0 |
| 20180507 | -1 | -1 | 0.0 |
| 20180514 | 11 | 15 | 0.0 |
| 20180519 | 11 | 16 | 0.25 |
| 20180523 | 11 | 21 | 31.0 |
| 20180524 | 2 | 5 | 0.0 |
| 20180524 | 11 | 20 | 80.0 |
| 20180525 | 11 | 25 | 38.25 |
| 20180528 | 11 | 28 | 1.0 |
| 20180529 | 11 | 27 | 0.5 |
| 20180529 | 13 | 7 | 212.25 |
| 20180531 | 14 | 37 | 2.3499999999999996 |
| 20180531 | 17 | 40 | 0.5 |
| 20180601 | 15 | 38 | 0.0 |
| 20180605 | 16 | 39 | 213.09999999999994 |
| 20180606 | -1 | -1 | -1.0 |
| 20180609 | 14 | 37 | 0.0 |
| 20180609 | 16 | 41 | 5.5 |
| 20180612 | 14 | 37 | 4.25 |
| 20180612 | 16 | 41 | 0.5 |
| 20180613 | 13 | 7 | 256.25 |
| 20180613 | 19 | 43 | 0.0 |
| 20180614 | 16 | 39 | 0.8 |
可以看到 只能保证CLUSTER BY 字段被划分到同一分区,并不能保证划分key在最终结果的有序性。
====================================================
能否利用SORT BY 与 CLUSTER BY 达到 ORDER BY 一样的功能? (不能的,因为只是局部有序,必须外面嵌套一层):
SELECT `date`, mediaid, slotid, SUM(mediaclickcost)
FROM clickcube_mid
GROUP BY `date`, mediaid, slotid
DISTRIBUTE BY concat(`date`, mediaid, slotid)
SORT BY `date` DESC, mediaid, slotid;
从执行时间来看与ORDER BY 基本一致
+-----------+----------+---------+----------------------+--+
| date | mediaid | slotid | _c3 |
+-----------+----------+---------+----------------------+--+
| 20180703 | 14 | 56 | 0.0 |
| 20180703 | 15 | 38 | 0.0 |
| 20180703 | 16 | 41 | 888.6500000000044 |
| 20180703 | 16 | 59 | 257.5999999999984 |
| 20180703 | 16 | 92 | 186.99999999999966 |
| 20180703 | 16 | 93 | 78.80000000000018 |
| 20180703 | 19 | 46 | 0.0 |
| 20180703 | 24 | 52 | 35.699999999999996 |
| 20180630 | 14 | 37 | 0.0 |
| 20180630 | 16 | 41 | 0.0 |
| 20180630 | 16 | 59 | 0.0 |
| 20180630 | 24 | 52 | 59.24999999999999 |
| 20180630 | 26 | 54 | 0.2 |
| 20180629 | 11 | 16 | 0.0 |
| 20180629 | 14 | 37 | 0.0 |
| 20180629 | 15 | 38 | 0.0 |
| 20180629 | 16 | 39 | 0.0 |
| 20180629 | 16 | 59 | 0.5 |
| 20180629 | 18 | 42 | 2.0 |
| 20180629 | 19 | 46 | 0.0 |
| 20180629 | 24 | 52 | 68.85 |
| 20180628 | 15 | 38 | 0.0 |
| 20180628 | 16 | 39 | 0.2 |
| 20180628 | 16 | 41 | 0.5 |
| 20180628 | 16 | 59 | 2.75 |
| 20180628 | 19 | 46 | 0.0 |
| 20180628 | 24 | 52 | 94.8 |
| 20180628 | 26 | 54 | 0.8 |
| 20180627 | 14 | 37 | 0.0 |
| 20180627 | 16 | 41 | 1.25 |
| 20180627 | 16 | 59 | 7.75 |
| 20180627 | 22 | 47 | 0.0 |
| 20180627 | 24 | 52 | 124.64999999999996 |
| 20180627 | 26 | 54 | 1206.6000000000108 |
| 20180626 | 14 | 37 | 0.0 |
| 20180626 | 15 | 38 | 0.5 |
| 20180626 | 16 | 39 | 0.0 |
| 20180626 | 16 | 41 | 259.25 |
| 20180626 | 16 | 59 | 308.0 |
| 20180626 | 19 | 46 | 0.0 |
| 20180626 | 22 | 47 | 0.0 |
| 20180626 | 24 | 52 | 227.25000000000009 |
| 20180626 | 26 | 54 | 1218.4000000000108 |
| 20180625 | 2 | 5 | 0.0 |
| 20180625 | 14 | 37 | 0.0 |
| 20180625 | 15 | 38 | 0.5 |
| 20180625 | 16 | 39 | 0.6000000000000001 |
| 20180625 | 16 | 41 | 115.0 |
| 20180625 | 16 | 59 | 16.5 |
| 20180625 | 18 | 42 | 0.25 |
| 20180625 | 19 | 46 | 0.0 |
| 20180625 | 22 | 47 | 0.0 |
| 20180625 | 24 | 52 | 1237.9499999999982 |
| 20180625 | 26 | 54 | 1379.4000000000044 |
| 20180624 | 14 | 37 | 0.0 |
| 20180624 | 14 | 56 | 0.0 |
| 20180624 | 16 | 39 | 0.2 |
| 20180624 | 16 | 41 | 0.25 |
| 20180624 | 22 | 47 | 0.0 |
| 20180624 | 24 | 52 | 5463.600000000036 |
| 20180623 | 14 | 37 | 0.0 |
| 20180623 | 14 | 56 | 0.0 |
| 20180623 | 15 | 38 | 0.0 |
| 20180623 | 16 | 39 | 0.6000000000000001 |
| 20180623 | 16 | 41 | 1.5 |
| 20180623 | 16 | 59 | 0.0 |
| 20180623 | 21 | 45 | 0.15 |
| 20180623 | 22 | 47 | 0.0 |
| 20180623 | 24 | 52 | 6800.850000000074 |
| 20180622 | 14 | 37 | 0.0 |
| 20180622 | 14 | 56 | 0.0 |
| 20180622 | 15 | 38 | 0.5 |
| 20180622 | 16 | 39 | 0.0 |
| 20180622 | 16 | 41 | 59.5 |
| 20180622 | 16 | 59 | 0.0 |
| 20180622 | 19 | 46 | 0.0 |
| 20180622 | 19 | 60 | 0.0 |
| 20180622 | 20 | 44 | 0.0 |
| 20180622 | 22 | 47 | 0.6000000000000001 |
| 20180622 | 22 | 48 | 0.0 |
| 20180622 | 24 | 52 | 5370.150000000017 |
| 20180622 | 26 | 54 | 0.6000000000000001 |
| 20180621 | 13 | 7 | 0.0 |
| 20180621 | 14 | 37 | 0.0 |
| 20180621 | 16 | 39 | 0.2 |
| 20180621 | 16 | 41 | 1.25 |
| 20180621 | 19 | 46 | 0.0 |
| 20180621 | 22 | 47 | 2.8000000000000007 |
| 20180621 | 22 | 48 | 0.2 |
| 20180621 | 24 | 52 | 72.90000000000003 |
| 20180621 | 26 | 54 | 6.600000000000001 |
| 20180620 | 14 | 37 | 0.0 |
| 20180620 | 15 | 38 | 0.0 |
| 20180620 | 16 | 39 | 0.6000000000000001 |
| 20180620 | 16 | 41 | 0.25 |
| 20180620 | 19 | 46 | 0.0 |
| 20180620 | 19 | 49 | 0.0 |
| 20180620 | 20 | 44 | 0.0 |
| 20180620 | 21 | 45 | 0.44999999999999996 |
| 20180620 | 22 | 47 | 99.39999999999988 |
+-----------+----------+---------+----------------------+--+
| date | mediaid | slotid | _c3 |
+-----------+----------+---------+----------------------+--+
| 20180620 | 23 | 50 | 2.25 |
| 20180620 | 123 | 103 | 0.0 |
| 20180619 | 11 | 25 | 0.0 |
| 20180619 | 14 | 37 | 0.75 |
| 20180619 | 15 | 38 | 0.0 |
| 20180619 | 16 | 39 | 0.6000000000000001 |
| 20180619 | 16 | 41 | 0.5 |
| 20180619 | 19 | 43 | 2.0 |
| 20180619 | 19 | 46 | 0.0 |
| 20180619 | 20 | 44 | 0.0 |
| 20180619 | 21 | 45 | 1181.0999999999333 |
| 20180619 | 22 | 48 | 0.0 |
| 20180618 | 14 | 37 | 0.0 |
| 20180618 | 16 | 39 | 0.0 |
| 20180618 | 16 | 41 | 0.5 |
| 20180618 | 18 | 42 | 0.25 |
| 20180618 | 20 | 44 | 0.0 |
| 20180618 | 22 | 48 | 0.2 |
| 20180617 | 13 | 7 | 0.0 |
| 20180617 | 14 | 37 | 0.0 |
| 20180617 | 16 | 39 | 0.2 |
| 20180617 | 16 | 41 | 1.5 |
| 20180617 | 20 | 44 | 0.0 |
| 20180617 | 22 | 47 | 0.0 |
| 20180617 | 22 | 48 | 0.4 |
| 20180616 | 11 | 25 | 0.0 |
| 20180616 | 13 | 7 | 0.0 |
| 20180616 | 14 | 37 | 0.0 |
| 20180616 | 16 | 39 | 0.4 |
| 20180616 | 16 | 41 | 1.5 |
| 20180616 | 18 | 42 | 0.75 |
| 20180616 | 20 | 44 | 0.0 |
| 20180616 | 22 | 47 | 0.4 |
| 20180616 | 22 | 48 | 2.4 |
| 20180615 | 14 | 37 | 0.0 |
| 20180615 | 15 | 38 | 0.0 |
| 20180615 | 16 | 39 | 1.6 |
| 20180615 | 16 | 41 | 1.5 |
| 20180615 | 18 | 42 | 0.75 |
| 20180615 | 19 | 46 | 0.0 |
| 20180615 | 20 | 44 | 0.0 |
| 20180615 | 22 | 47 | 66.7999999999999 |
| 20180615 | 22 | 48 | 320.00000000000244 |
| 20180614 | 13 | 7 | 0.0 |
| 20180614 | 14 | 37 | 0.0 |
| 20180614 | 16 | 39 | 0.8 |
| 20180614 | 16 | 41 | 0.5 |
| 20180614 | 18 | 42 | 9.5 |
| 20180614 | 19 | 43 | 0.0 |
| 20180614 | 20 | 44 | 1.2 |
| 20180614 | 21 | 45 | 1.05 |
| 20180613 | 11 | 25 | 0.0 |
| 20180613 | 13 | 7 | 256.25 |
| 20180613 | 14 | 37 | 0.0 |
| 20180613 | 15 | 38 | 0.0 |
| 20180613 | 16 | 39 | 1.2 |
| 20180613 | 16 | 41 | 1.25 |
| 20180613 | 18 | 42 | 1998.5 |
| 20180613 | 19 | 43 | 0.0 |
| 20180613 | 20 | 44 | 0.0 |
| 20180612 | -1 | -1 | 0.0 |
| 20180612 | 13 | 7 | 307.0 |
| 20180612 | 14 | 37 | 4.25 |
| 20180612 | 15 | 38 | 0.0 |
| 20180612 | 16 | 39 | 3.2000000000000006 |
| 20180612 | 16 | 41 | 0.5 |
| 20180612 | 18 | 42 | 8.0 |
| 20180612 | 19 | 43 | 1.75 |
| 20180611 | 11 | 25 | 0.0 |
| 20180611 | 13 | 7 | 1086.25 |
| 20180611 | 14 | 37 | 0.0 |
| 20180611 | 16 | 39 | 388.20000000000186 |
| 20180611 | 16 | 41 | 2.75 |
| 20180610 | 11 | 25 | 0.0 |
| 20180610 | 14 | 37 | 0.0 |
| 20180610 | 16 | 39 | 1.0 |
| 20180610 | 16 | 41 | 2.0 |
| 20180609 | 13 | 7 | 3.0 |
| 20180609 | 14 | 37 | 0.0 |
| 20180609 | 16 | 39 | 2.2 |
| 20180609 | 16 | 41 | 5.5 |
| 20180608 | 13 | 7 | 2671.5 |
| 20180608 | 14 | 37 | 0.0 |
| 20180608 | 15 | 38 | 0.0 |
| 20180608 | 16 | 39 | 228.39999999999966 |
| 20180608 | 16 | 41 | 17.0 |
| 20180607 | 11 | 25 | 0.0 |
| 20180607 | 13 | 7 | 283.0 |
| 20180607 | 14 | 37 | 0.5 |
| 20180607 | 16 | 39 | 2.4 |
| 20180607 | 16 | 41 | 890.75 |
| 20180606 | -1 | -1 | -1.0 |
| 20180606 | 14 | 37 | 0.0 |
| 20180606 | 15 | 38 | 0.0 |
| 20180606 | 16 | 39 | 262.00000000000006 |
| 20180606 | 16 | 41 | 1.25 |
| 20180605 | 11 | 25 | 0.0 |
| 20180605 | 14 | 37 | 0.0 |
| 20180605 | 15 | 38 | 0.0 |
| 20180605 | 16 | 39 | 213.09999999999994 |
+-----------+----------+---------+----------------------+--+
| date | mediaid | slotid | _c3 |
+-----------+----------+---------+----------------------+--+
| 20180604 | 11 | 25 | 0.0 |
| 20180604 | 11 | 35 | 0.0 |
| 20180604 | 14 | 37 | 0.0 |
| 20180604 | 15 | 38 | 0.0 |
| 20180604 | 16 | 39 | 8.200000000000003 |
| 20180603 | 11 | 25 | 0.0 |
| 20180603 | 16 | 39 | 4.2 |
| 20180602 | 11 | 25 | 0.0 |
| 20180602 | 16 | 39 | 2.6000000000000005 |
| 20180601 | 11 | 16 | 2.0 |
| 20180601 | 13 | 7 | 0.0 |
| 20180601 | 14 | 37 | 1.5 |
| 20180601 | 15 | 38 | 0.0 |
| 20180601 | 16 | 39 | 553.1500000000028 |
| 20180601 | 90 | 90 | 4.0 |
| 20180601 | 180 | 127 | 0.0 |
| 20180531 | 13 | 7 | 1991.5 |
| 20180531 | 14 | 37 | 2.3499999999999996 |
| 20180531 | 16 | 39 | 302.39999999999986 |
| 20180531 | 17 | 40 | 0.5 |
| 20180530 | -1 | -1 | -1.0 |
| 20180530 | 11 | 25 | 0.0 |
| 20180530 | 13 | 7 | 1741.0 |
| 20180530 | 14 | 37 | 0.0 |
| 20180530 | 16 | 39 | 289.60000000000036 |
| 20180530 | 17 | 40 | 0.5 |
| 20180529 | 11 | 23 | 0.0 |
| 20180529 | 11 | 25 | 0.0 |
| 20180529 | 11 | 26 | 8.25 |
| 20180529 | 11 | 27 | 0.5 |
| 20180529 | 11 | 28 | 0.0 |
| 20180529 | 11 | 29 | 8.0 |
| 20180529 | 11 | 30 | 2.0 |
| 20180529 | 11 | 31 | 6.0 |
| 20180529 | 11 | 32 | 0.0 |
| 20180529 | 11 | 35 | 14.25 |
| 20180529 | 11 | 36 | 97.25 |
| 20180529 | 13 | 7 | 212.25 |
| 20180529 | 14 | 37 | 0.75 |
| 20180529 | 15 | 38 | 0.0 |
| 20180528 | 2 | 5 | 0.0 |
| 20180528 | 11 | 23 | 15.5 |
| 20180528 | 11 | 25 | 119.75 |
| 20180528 | 11 | 26 | 18.0 |
| 20180528 | 11 | 27 | 1.25 |
| 20180528 | 11 | 28 | 1.0 |
| 20180528 | 11 | 29 | 27.0 |
| 20180528 | 11 | 30 | 3.25 |
| 20180528 | 11 | 31 | 17.75 |
| 20180528 | 11 | 32 | 0.25 |
| 20180528 | 11 | 35 | 2.25 |
| 20180528 | 11 | 36 | 108.25 |
| 20180528 | 13 | 7 | 448.75 |
| 20180527 | 11 | 16 | 0.0 |
| 20180527 | 11 | 21 | 0.0 |
| 20180527 | 11 | 25 | 0.0 |
| 20180526 | 11 | 16 | 0.0 |
| 20180526 | 11 | 20 | 0.25 |
| 20180526 | 11 | 21 | 0.5 |
| 20180526 | 11 | 25 | 0.0 |
| 20180525 | 2 | 5 | 0.0 |
| 20180525 | 11 | 16 | 0.0 |
| 20180525 | 11 | 20 | 0.0 |
| 20180525 | 11 | 23 | 0.25 |
| 20180525 | 11 | 25 | 38.25 |
| 20180524 | 2 | 5 | 0.0 |
| 20180524 | 11 | 20 | 80.0 |
| 20180524 | 11 | 21 | 27.0 |
| 20180523 | 11 | 16 | 0.5 |
| 20180523 | 11 | 20 | 64.75 |
| 20180523 | 11 | 21 | 31.0 |
| 20180521 | 11 | 16 | 0.5 |
| 20180520 | 11 | 16 | 0.0 |
| 20180519 | 2 | -1 | 0.0 |
| 20180519 | 11 | 16 | 0.25 |
| 20180518 | 11 | 15 | 14.81999999999999 |
| 20180518 | 11 | 16 | 44.5 |
| 20180517 | 1 | 7 | 0.02 |
| 20180517 | 11 | 15 | 0.78 |
| 20180517 | 11 | 16 | 12.5 |
| 20180516 | 11 | 15 | 0.0 |
| 20180516 | 11 | 16 | 37.5 |
| 20180514 | 11 | 15 | 0.0 |
| 20180512 | 11 | 15 | 0.52 |
| 20180511 | 11 | 15 | 3.8999999999999995 |
| 20180510 | 11 | 15 | 2.6 |
| 20180509 | 1 | 7 | 0.0 |
| 20180509 | 11 | 15 | 0.0 |
| 20180508 | -1 | -1 | 0.0 |
| 20180508 | 1 | 7 | 0.0 |
| 20180508 | 2 | 5 | 0.0 |
| 20180507 | -1 | -1 | 0.0 |
| 20180507 | 2 | -1 | -1.0 |
| 20180507 | 11 | 15 | 1.3 |
| 20180505 | 11 | 15 | 0.0 |
| 20180504 | 11 | 15 | 0.26 |
| 20180503 | 2 | 5 | 0.0 |
| 20180503 | 11 | 15 | 0.0 |
| 20180502 | 11 | 15 | 0.77 |
| 20180408 | -1 | -1 | 0.0 |
+-----------+----------+---------+----------------------+--+
| date | mediaid | slotid | _c3 |
+-----------+----------+---------+----------------------+--+
| 20180404 | -1 | -1 | 0.0 |
| 20180331 | -1 | -1 | 0.0 |
| 20180330 | -1 | -1 | 0.0 |
--------------------------------
前言:hive中order by和关系型数据库中的order by 类似,都是针对一个全局的结果进行排序。So,问题来了!!!但是hive中的order by排序操作会在一个reduce里面!!!那就很费时间了。(咱能不用order by 就不用order by。但是如何实现全局排序呢!)
在hive中如何实现全局排序呢?需要嵌套一层,咱们一步步来解说。
#注意:在Hive中当设置了hive.mapred.mode=strict,在order by以后必须加limit 哈,要不会报错滴!
set hive.mapred.mode=strict
#加了Limit才不会报错
select * from haha order by xixi limit 10
一、sort by不是一个全局排序
sort by不是一个全局排序,是一个进入reduce前就排序了,每一个reduce是排序了的,但是全局是没有排序!
set mapred.reduce.tasks = 3;
select name,ID from haha sort by ID;
Map:2 Reduce:3
上图我们可以观察的到,3个reduce虽然是排序的,但是并不是全局排序的!
Distribute by 能够控制map的输出在reduce中如何划分,其可以按照指定的字段对数据进行划分到不同的reduce中(注:distribute by 和group by 有些许相似,distribute by控制reduce如何处理数据,sort by 控制reduce的数据如何排序)
二:问题来了:我们如何进行全局排序呢?
来,首先我们来看不是全局排序的情况!
select name,Company from haha distribute by Company sort by Company;
Map:2 Reduce:3
呵呵呵呵呵,ruduce虽排序,但全局不排序。
全局排序走起来!
select * from(select name,Company from haha distribute by Company sort by Company)t order by Company
Map:2 Reduce:3
Job 1:Map:2 Reduce:3
Job 2:Map:1 Reduce:1
我们用嵌套查询的方式,实现了全局排序。其中Job 2中的这个Reduce在发挥全局排序的力量!
同志们,请记住!!!
Order by大哥力大无穷,可发挥全局排序大法,但是这逼在数据量很大的情况下,常常会跑不出来。
sort by二弟其力量是对reduce的单个输出进行排序的,木有全局排序的魔法。
distribute by小哥 可以按指定字段将数据划分到不同的reduce中。
上一篇: Java API练习1
推荐阅读
-
hive四种排序Order By , Sort By ,Distribute By ,Cluster By
-
Hive中的排序及优化 ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY
-
Hive中order by、sort by、distribute by和cluster by
-
Hive中的四种排序方式(order by,sort by,distribute by,cluster by)使用与区别详解
-
Hive_Hive 排序及优化 ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY
-
hive排序(order by,sort by,distribute by,cluster by)
-
Hive中 Order by,Sort by, Distribute by, Cluster by详解
-
hive入门之排序查询(order by,sort by,distribute by,cluster by...)
-
hive中order by、distribute by、sort by和cluster by的区别和联系