Hive Tutorial (Part 3)

程序员文章站 2022-04-29 08:52:09

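I. Using Collection Data Types

Hive has three complex data types: ARRAY, MAP, and STRUCT.

Example: suppose a table contains the following row. We use JSON to represent its data structure; the fields are accessed in Hive in the same shape: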

{
"name": "songsong",
"friends": ["bingbing", "lili"],   // list: ARRAY
"children": {                      // key-value pairs: MAP
"xiao song": 18,
"xiaoxiao song": 19
},
"address": {                       // structure: STRUCT
"street": "hui long guan",
"city": "beijing"
}
}

Create a local test file, test.txt:

Note: the elements inside a MAP, STRUCT, or ARRAY can all be separated by the same character; "_" is used here.
songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

Create the table in Hive:

create table test(
name string,
friends array<string>,
children map<string, int>,
address struct<street:string, city:string>
)
row format delimited
-- column (field) delimiter
fields terminated by ','
-- delimiter between the elements of a MAP, STRUCT, or ARRAY
collection items terminated by '_'
-- delimiter between a key and its value inside a MAP
map keys terminated by ':'
-- row delimiter
lines terminated by '\n';

Load the text data to test:

load data local inpath "/root/data/hive/test.txt" into table test;

-- Query
select friends[0],children["xiao song"],address.street from test where name="songsong";
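
To make the role of the three delimiters concrete, here is a minimal Python sketch (not Hive code) that splits one line of test.txt the way the table definition describes and then reads the same three values as the query. The function name parse_row and the dictionary layout are illustrative assumptions; Hive does this parsing internally.

```python
def parse_row(line):
    """Split one line of test.txt using the delimiters from the table definition."""
    # fields terminated by ','
    name, friends_raw, children_raw, address_raw = line.rstrip("\n").split(",")

    # collection items terminated by '_' : the ARRAY elements
    friends = friends_raw.split("_")

    # MAP entries are separated by '_', and each key/value pair by ':'
    children = {}
    for entry in children_raw.split("_"):
        key, value = entry.split(":")
        children[key] = int(value)

    # STRUCT members are also separated by '_' and matched by position
    street, city = address_raw.split("_")
    address = {"street": street, "city": city}

    return {"name": name, "friends": friends, "children": children, "address": address}


row = parse_row("songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing")

# Counterpart of: select friends[0], children["xiao song"], address.street
print(row["friends"][0], row["children"]["xiao song"], row["address"]["street"])
# prints: bingbing 18 hui long guan
```

Because one character separates the elements of all three collection types, field values in the data file must not contain "_" themselves.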

II. Basic Partitioned-Table Operations

1. Introducing partitioned tables (logs need to be managed by date)

/root/data/hive/partitioned/20200405.log

/root/data/hive/partitioned/20200406.log

2. Create the table

create table ip_count(ip string,username string,count int)
-- a partition column cannot be named after a keyword such as date
partitioned by (producedate string)
row format delimited
fields terminated by '\t';

Load the data:

load data local inpath '/root/data/hive/partitioned/20200405.log' into table ip_count partition (producedate='20200405');
load data local inpath '/root/data/hive/partitioned/20200406.log' into table ip_count partition (producedate='20200406');

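In HDFS, the data is stored in one directory per partition, for example producedate=20200405 and producedate=20200406 under the table's directory.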
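
As a rough illustration of that layout, the following Python sketch builds the directory a partition would map to: one key=value level per partition column under the table's directory. The helper name partition_dir is hypothetical, and the table location is an assumption; the real location depends on the warehouse configuration and the database in use.

```python
import posixpath


def partition_dir(table_dir, partition_spec):
    """Build the HDFS directory for a partition: one key=value level per partition column."""
    levels = [f"{key}={value}" for key, value in partition_spec]
    return posixpath.join(table_dir, *levels)


# Assumed table location, for illustration only.
table_dir = "/user/hive/warehouse/school.db/ip_count"

for day in ("20200405", "20200406"):
    print(partition_dir(table_dir, [("producedate", day)]))
```

A where clause on producedate lets Hive read only the matching directory instead of scanning the whole table.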

Query the partitioned table:

-- Single partition
select * from ip_count where producedate='20200406';

-- Multiple partitions (this runs a MapReduce job)
select * from ip_count where producedate='20200406' 
union 
select * from ip_count where producedate='20200405';

-- Add a single partition
alter table ip_count add partition(producedate='20200408') ;

-- Add multiple partitions
alter table ip_count 
add partition(producedate='20200408') partition(producedate='20200407');

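-- Drop partitions. Note: when adding multiple partitions, separate them with spaces; when dropping multiple partitions, separate them with commas.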
alter table ip_count drop partition(producedate='20200408'),partition(producedate='20200407');

-- Show all partitions
show partitions ip_count;

-- View the structure of the partitioned table
desc formatted ip_count;
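
The add and drop statements above differ only in how the partition clauses are joined. A small Python sketch that generates both forms makes the difference easy to see; the helper names are hypothetical and the code only builds statement text, it does not talk to Hive.

```python
def partition_clause(spec):
    """Render one partition(...) clause from a {column: value} mapping."""
    columns = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return f"partition({columns})"


def add_partitions_sql(table, specs):
    # Adding several partitions: clauses are separated by spaces.
    clauses = " ".join(partition_clause(spec) for spec in specs)
    return f"alter table {table} add {clauses};"


def drop_partitions_sql(table, specs):
    # Dropping several partitions: clauses are separated by commas.
    clauses = ", ".join(partition_clause(spec) for spec in specs)
    return f"alter table {table} drop {clauses};"


specs = [{"producedate": "20200408"}, {"producedate": "20200407"}]
print(add_partitions_sql("ip_count", specs))
print(drop_partitions_sql("ip_count", specs))
```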

3. Two-level partitioned tables

-- Create the table
create table ipc2(ip string,name string,count int) 
partitioned by (month string,day string) 
row format delimited fields terminated by '\t';

-- Method 1: load the data (the usual way)
load data local inpath '/root/data/hive/partitioned/20200405.log' into table ipc2 partition(month='202004',day='05');

-- Query
select * from ipc2 where month='202004' and day='05';

-- Method 2: upload the data, then repair the table
-- Create the directory and upload the data from inside Hive
hive (school)> dfs -mkdir -p /user/hive/warehouse/school.db/ipc2/month=202004/day=06;
-- Do not wrap the paths in '' or ""
dfs -put /root/data/hive/partitioned/20200406.log /user/hive/warehouse/school.db/ipc2/month=202004/day=06;

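-- Repair the table; otherwise the new partition is not visible to queries, because uploading files straight to HDFS does not register the partition in the metastore (non-partitioned tables need no repair)
-- Alternatively, adding the partition with alter table ... add partition also makes the data queryable without a repair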
msck repair table ipc2;

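-- Method 3: after creating the directory, load the data into the partition with load data; the data can then be queried without a repair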
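
Conceptually, a repair has to map each key=value directory under the table location back to a partition specification and register it. The Python sketch below illustrates only that path-to-partition mapping and prints the equivalent add partition statement; it is not how Hive implements the repair, and the function name partition_spec_from_path is hypothetical.

```python
def partition_spec_from_path(table_dir, partition_dir):
    """Turn '<table_dir>/month=202004/day=06' into {'month': '202004', 'day': '06'}."""
    prefix = table_dir.rstrip("/") + "/"
    if not partition_dir.startswith(prefix):
        raise ValueError("partition directory is not under the table directory")

    spec = {}
    for level in partition_dir[len(prefix):].strip("/").split("/"):
        key, sep, value = level.partition("=")
        if not sep:
            raise ValueError(f"not a key=value directory: {level!r}")
        spec[key] = value
    return spec


# Table location taken from the dfs commands in Method 2.
table_dir = "/user/hive/warehouse/school.db/ipc2"
spec = partition_spec_from_path(table_dir, table_dir + "/month=202004/day=06")

columns = ", ".join(f"{key}='{value}'" for key, value in spec.items())
print(f"alter table ipc2 add partition({columns});")
# prints: alter table ipc2 add partition(month='202004', day='06');
```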
