hive开发教程之静态分区和动态分区
CREATE TABLE emp_dept_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" ;
备注:静态分区表中不能含有分区字段,否则会报错:FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
导入数据
INSERT OVERWRITE TABLE emp_dept_partition PARTITION (deptno=10) IF NOT EXISTS select empno, ename, job, mgr, hiredate, sal, comm from emp where deptno=20;
hive> select * from emp_dept_partition; OK 7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 10 7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 10 7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 10 7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 10 7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 10 Time taken: 0.083 seconds, Fetched: 5 row(s)
为什么emp_dept_partition中没有deptno这个字段,但查询的时候会显示出来的?会不会是当作元数据储存了起来?
insert 语句中是从emp中找出deptno为20的数据导入至emp_dept_10并设置分区字段的值为10,这显然不合理,也缺乏了一个校验机制。
现在从mysql中修改元数据,把数据从10改为20。
mysql> select a.TBL_ID, b.PART_ID from partition_keys a, partitions b where a.TBL_ID=(select TBL_ID from tbls where TBL_NAME='emp_dept_partition') and a.TBL_ID = b.TBL_ID; +--------+---------+ | TBL_ID | PART_ID | +--------+---------+ | 15 | 6 | +--------+---------+ 1 row in set (0.00 sec) mysql> update partitions set PART_NAME='deptno=20' where PART_ID=6; Query OK, 0 rows affected (0.00 sec) Rows matched: 1 Changed: 0 Warnings: 0 mysql> update partition_key_vals set PART_KEY_VAL='20' where PART_ID=6; Query OK, 0 rows affected (0.00 sec) Rows matched: 1 Changed: 0 Warnings: 0 mysql> commit; Query OK, 0 rows affected (0.00 sec) hive> select * from emp_dept_partition; OK 7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20 7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20 7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20 7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20 7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 20 Time taken: 0.093 seconds, Fetched: 5 row(s)
证明了分区字段数据的确是当作元数据被保留起来了。
2、动态分区
INSERT OVERWRITE TABLE emp_dept_partition PARTITION (deptno) select empno, ename, job, mgr, hiredate, sal, comm,deptno from emp;
分区字段必须为最后一个字段
第一次执行会报错:
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
根据提示输入命令即可解决:
set hive.exec.dynamic.partition.mode=nonstrict
检查元数据
mysql> select a.TBL_ID, b.PART_ID ,b.PART_NAME from partition_keys a, partitions b where a.TBL_ID=(select TBL_ID from tbls where TBL_NAME='emp_dept_partition') and a.TBL_ID = b.TBL_ID; +--------+---------+-----------+ | TBL_ID | PART_ID | PART_NAME | +--------+---------+-----------+ | 15 | 6 | deptno=20 | | 15 | 7 | deptno=30 | | 15 | 8 | deptno=10 | +--------+---------+-----------+ 3 rows in set (0.00 sec)
自动增加了3个分区,而且deptno都是能对应上的。
4、总结
使用动态分区可以非常智能的加载表,而在动静结合使用时需要注意静态分区值必须在动态分区值的前面。所以建议使用动态分区,且insert进行分区的使用建议使用OVERWRITE,避免数据重复。