Hive
程序员文章站
2022-07-14 10:19:23
1. Databases: OLTP (Online Transaction Processing)
- Handles transactional operations on data
- Follows the ACID properties (atomicity, consistency, isolation, durability)
- Describes the current state of the database
- Database isolation levels
  - Read uncommitted
  - Read committed
  - Repeatable read
  - Serializable
OLAP (Online Analytical Processing)
- Analyzing historical data can yield business value
- So historical data is gathered together for analysis
- This is why we build data warehouses
2. Data Warehouse
- Mainly used to store data; once stored, the data is not modified
- Data comes from a wide range of sources
  - Product order records
  - Website access logs
  - Website questionnaires
  - …
3. Introduction to Hive
- Lets non-Java programmers run MapReduce-style processing over data in HDFS
  - Data cleaning and analysis used to be done directly with MapReduce
  - But MapReduce requires some Java development background, while the people managing the data may program in other languages
  - That made day-to-day development inconvenient
  - Observation: the vast majority of programmers have used SQL
  - So: write SQL directly, and have the SQL translated into MapReduce jobs
- How Hive treats data
  - Example raw data: admin,123456,18,180,shandong; zhangsan,123,16,160,shanxi;
    - First split the data on `;` --> rows
    - Then split each row on `,` --> columns
    - Map each value onto a column of the table
    - Hive does not store any of the actual data; HDFS stores it, and Hive only stores the mapping (metadata)
- A data warehouse is a database used for query and analysis; it is essentially never used for update or delete operations
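The row/column mapping described above can be mimicked in a few lines of Python (a toy illustration of "schema on read", not Hive's actual implementation; the delimiters `;` and `,` come from the example records):

```python
# Toy illustration of Hive's "schema on read": the raw text is split into
# rows and columns only when it is read, using configured delimiters.
def map_to_table(raw: str, columns: list[str]) -> list[dict]:
    rows = [r.strip() for r in raw.split(";") if r.strip()]  # ';' --> rows
    return [dict(zip(columns, row.split(","))) for row in rows]  # ',' --> columns

records = map_to_table(
    "admin,123456,18,180,shandong; zhangsan,123,16,160,shanxi;",
    ["name", "passwd", "age", "height", "addr"],
)
print(records[0]["name"])  # admin
```

The raw file never changes; only the mapping from text to columns lives in Hive.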
4. Hive Architecture
Client
- Sends requests to Hive
- DDL, DQL, DML: Hive mimics a full set of SQL statements
- Hive supports several ways for a client to connect
- Common connection methods
  - Hive CLI -- requires installing the Hive package
  - JDBC/ODBC can also connect to Hive
  - The current mainstream recommendation is HiveServer2/Beeline
    - Performs username/password-based authentication
  - Web GUI
    - Hive ships with a simple set of web pages
    - You can access Hive through these pages
    - The pages are quite rudimentary
MetaStore
- Usually backed by an external data store (a relational database)
- Mainly stores metadata such as table definitions
- MySQL is the recommended database for the metastore
- Connecting to the database requires: URL, username, password, driver
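Those four connection parameters are normally set in `hive-site.xml`; a sketch of a MySQL-backed metastore configuration (the host `node1` and the password are placeholder assumptions):

```xml
<!-- Metastore backed by MySQL: URL, driver, username, password -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://node1:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>
```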
Driver
- Accepts requests from clients
- Consists mainly of the interpreter, compiler, optimizer, and executor
- Performs the following steps on a SQL statement:
  - Checks the client's SQL for syntax and semantic errors
  - Translates the SQL into a logical execution plan
  - Optimizes the logical execution plan
  - Converts the logical plan into a physical execution plan
5. Basic Hive Commands
Databases
- Show databases
  - show databases;
- Create a database
  - create database if not exists sxt;
  - Creating a database also creates a folder of the same name under the configured HDFS warehouse directory
  - create database if not exists ls location '/shsxt/ls';
  - Normally all databases are kept together under the warehouse directory, unless the data for a database already exists at some other location
- Use and drop a database
  - use shsxt;
  - drop database ls;
Data types
- primitive types
  - Integer types
    - TINYINT (byte), SMALLINT (short), INT (int), BIGINT (long)
  - Floating-point types
    - float, double
  - String types
    - string, varchar, char
  - Boolean
    - boolean
  - Date/time
    - date, timestamp
- Complex types
  - array_type
    - array<data_type>
  - map_type
    - map<primitive_type,data_type>
  - struct_type
    - struct<col_name:data_type, ...>
Tables
- Usually the data exists first, and the table is then created to match how the data is laid out

-- Create the table
create table person(
id int,
name string,
age int,
likes array<string>,
addrs map<string,string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
-- Load data into the table (watch out for wrong punctuation characters:
-- code copied out of Word should first be pasted through a plain .txt file to normalize the quotes)
LOAD DATA LOCAL INPATH '/root/person.txt' INTO TABLE person;
-- Load and overwrite the existing data
LOAD DATA LOCAL INPATH '/root/person.txt' OVERWRITE INTO TABLE person;
select * from person;
select name,age,likes[0] from person;
select count(id) from person;
Example: person
1,zshang,18,game-girl-book,stu_addr:beijing-work_addr:shanghai
2,lishi,16,shop-boy-book,stu_addr:hunan-work_addr:shanghai
3,wang2mazi,20,fangniu-eat,stu_addr:shanghai-work_addr:tianjing
4,wang4mazi,21,fangniu-eat,s_addr:shanghai-w_addr:tianjing
Each line is one record.
Fields are separated by `,`; each line has 5 fields:
1 --> int
zshang --> string
18 --> int
game-girl-book --> array<string>, split on `-`
stu_addr:beijing-work_addr:shanghai --> either map or struct works
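Hive's three-level delimiting (fields on `,`, collection items on `-`, map keys on `:`) can be mimicked in Python to show how one line becomes typed columns (a toy sketch, not Hive's parser):

```python
# Parse one 'person' line using the table's three delimiters:
# fields ',', collection items '-', map keys ':'.
def parse_person(line: str) -> dict:
    pid, name, age, likes, addrs = line.split(",")
    return {
        "id": int(pid),
        "name": name,
        "age": int(age),
        "likes": likes.split("-"),  # array<string>
        "addrs": dict(kv.split(":") for kv in addrs.split("-")),  # map<string,string>
    }

row = parse_person("1,zshang,18,game-girl-book,stu_addr:beijing-work_addr:shanghai")
print(row["likes"][0], row["addrs"]["stu_addr"])  # game beijing
```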
Example:
{
  "name": "songsong",
  "friends": ["bingbing", "lili"],   // list --> array
  "children": {                      // key/value --> map
    "xiao song": 18,
    "xiaoxiao song": 19
  },
  "address": {                       // --> struct
    "street": "hui long guan",
    "city": "beijing"
  }
}
The same record in delimited form:
songsong,bingbing-lili,xiaosong:18-xiaoxiaosong:19,huilongguan-beijing
create table t_user_001(
name string,
friends array<string>,
children map<string,int>,
address struct<street:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
load data local inpath '/root/t_user_001' into table t_user_001;
6. Table Operations
Creating a table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
CREATE: the create keyword
[EXTERNAL]: if present, an external table is created
TABLE: keyword
[IF NOT EXISTS]: create the table only if it does not already exist; without this clause, re-creating it raises an error
table_name: the name of the table to create
[(col_name data_type [COMMENT col_comment], ...)]: the table's columns
  column name, data type, optional comment
[COMMENT table_comment]: a comment describing the table
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]: partitioning
  pairs of column name and data type
[CLUSTERED BY (col_name, col_name, ...): bucketing
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]: bucketing (with optional sort order)
[ROW FORMAT row_format]: how rows are split into fields
[STORED AS file_format]: the file format used to store the data
[LOCATION hdfs_path]: where the data is stored (used with external tables)
Tables, folders, and files
- When a table is created, a folder of the same name is created under the folder of the database it belongs to
- When a table's location is set explicitly, it can point to an existing path, so the data already in that folder is used directly
- When data is loaded, the data files are moved into the table's folder
- Once loaded, the data is not modified
- Queries read the files in that folder; ultimately all of this data lives on HDFS
- When an (internal) table is dropped, its folder is deleted along with the data
7. Showing and Modifying Table Structure
- Show tables
  - show tables;
  - show tables like 'u';
- Show table details
  - desc person;
- Drop a table
  - drop table if exists t_user;
- Rename a table
  - Internal table: the folder on HDFS is renamed as well
  - External table: the folder is shared, so its name is left unchanged
  - alter table person rename to t_person;
- Rename a column
  - alter table t_person change column name pname string;
- Add columns
  - alter table t_person add columns (passwd string);
8. Internal vs. External Tables
- Adding external when creating a table makes it an external table
- Dropping an external table does not cascade-delete the underlying folder
- So an external table's location is usually pointed at a resource shared by the whole team
- External table: a table "outside" Hive, meaning Hive does not fully own the data
- Dropping the table does not delete the data; only the metadata describing the table is removed
- External tables reduce data copying and disk usage
create external table t_user_002(
name string,
friends array<string>,
children map<string,int>,
address struct<street:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
location '/shsxt/t_user/';
load data local inpath '/root/t_user_001' into table t_user_002;
9. Loading Data
load
load data [local] inpath '/opt/module/datas/student.txt' overwrite | into table student [partition (partcol1=val1,…)];
- load data: load data into a table
- [local]: read from the local filesystem; without local the path is on HDFS, and the source file on HDFS is moved (it disappears from its original location)
- inpath: the path of the data
- '/opt/module/datas/student.txt': the concrete path, interpreted as local or HDFS accordingly
- overwrite: replace the table's existing data
- into table: load into the table
- student: the table name
- [partition (partcol1=val1,…)]: the target partition

-- From the local (Linux) filesystem
load data local inpath '/root/t_user_001' into table t_user;
-- From HDFS
load data inpath '/shsxt/t_user/t_user_001' into table t_user;
-- From HDFS, overwriting
load data inpath '/shsxt/t_user_001' overwrite into table t_user;
load data local inpath '/root/t_user_001' overwrite into table t_user;
Insert from query results
-- Create the target tables
create table t_user1(
name string
)
row format delimited fields terminated by ',' lines terminated by '\n';
create table t_user2(
friends array<string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
lines terminated by '\n';
-- Insert the results of a query into another table
insert overwrite table t_user1 select name from t_user;
insert overwrite table t_user2 select friends from t_user;
-- One scan of the source table can insert into multiple tables (multi-insert mode cannot target the same table twice)
from t_user
insert overwrite table t_user1 select name
insert overwrite table t_user2 select friends;
Insert with literal values
-- Inserting complex data types
insert into t_user values('songxiaobao',{'zs','ls'},'xb:13-xxb:11','tl-hlj');
-- Not recommended: only one record per statement, and each statement launches a MapReduce job
insert into t_user1 values('songxiaobao');
10. Backing Up Data
- Backing up a table's data

-- Write query results to the local filesystem
insert overwrite local directory '/root/user' select * from t_user;
-- Write to the local filesystem with an explicit output format
insert overwrite local directory '/root/user'
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
select * from t_user;
-- Write query results to HDFS
insert overwrite directory '/shsxt/user'
row format delimited fields terminated by ','
select * from t_user;
-- Or simply copy the table's folder with an HDFS command:
-- hdfs dfs -get <table folder on HDFS> <local path>

- Backing up table structure and data together
-- Export to HDFS
export table t_user to '/shsxt/t_user_backup';
-- Drop the table
drop table t_user;
-- Restore structure and data
import from '/shsxt/t_user_backup';
11. Partitioned Tables
- Motivation
  - When data is loaded into a table, it is stored in the table's folder
  - Over time, the table accumulates more and more data
  - If every query has to scan all of it, queries keep getting slower
  - We can create subfolders keyed on commonly queried attributes and store the matching data inside them
  - A query then only needs to read the relevant subfolder
- Kinds of partitioned tables
  - Partition depth
    - Single-level partitions
    - Multi-level partitions
- Sample data (name, grade, clazz):
xiaoming1,g1,bj1
xiaoming2,g1,bj1
xiaoming3,g1,bj2
xiaoming4,g1,bj2
xiaohong1,g2,bj1
xiaohong2,g2,bj1
xiaohong3,g2,bj2
xiaohong4,g2,bj2
-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Create the student (partitioned) table
create table t_student(
name string
)
partitioned by(
grade string,
clazz string
)
row format delimited
fields terminated by ','
lines terminated by '\n';
-- Staging table for dynamic partitioning
create table t_student_tmp(
name string,
grade string,
clazz string
)
row format delimited
fields terminated by ','
lines terminated by '\n';
-- Load data into the staging table
load data local inpath '/root/student' into table t_student_tmp;
-- Insert into the partitioned table (partition values taken from the data)
insert overwrite table t_student partition (grade,clazz)
select name,grade,clazz from t_student_tmp;
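The effect of the insert above is one subfolder per partition value. A small Python sketch of the resulting layout (the warehouse root path is the common default and is assumed here, not taken from this setup):

```python
# Sketch: dynamic partitioning routes each row into a subfolder per
# partition value, e.g. .../t_student/grade=g1/clazz=bj1/
from collections import defaultdict

WAREHOUSE = "/user/hive/warehouse/sxt.db/t_student"  # assumed default location

def partition_path(grade: str, clazz: str) -> str:
    return f"{WAREHOUSE}/grade={grade}/clazz={clazz}"

rows = [("xiaoming1", "g1", "bj1"), ("xiaoming3", "g1", "bj2"),
        ("xiaohong1", "g2", "bj1")]
layout = defaultdict(list)
for name, grade, clazz in rows:
    # only the non-partition column 'name' is stored inside the file;
    # grade and clazz are encoded in the directory path
    layout[partition_path(grade, clazz)].append(name)

print(sorted(layout))
```

Because the partition values live in the path, a query filtered on grade/clazz can skip every other subfolder entirely.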
- Ways to load partitioned data
- Static partitioning
  - The target partition is specified explicitly; the partition column takes the given value
  - Note: be careful not to load data into the wrong partition
  - Example

-- Create the role table
create table t_role(
name string
)
partitioned by(
author string
)
row format delimited
fields terminated by ','
lines terminated by '\n';
-- Load data into a specific partition
load data local inpath '/root/novel_gl' into table t_role partition(author='gl');
load data local inpath '/root/novel_jy' into table t_role partition(author='jy');
-- Loading a file that actually mixes two authors' data: it all goes into the named partition anyway
load data local inpath '/root/novel_gj' into table t_role partition(author='gl');
- Dynamic partitioning
  - Stage the data in a temporary table first
  - When inserting into the partitioned table, select from the temporary table
  - Example

-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Create the role table
create table t_role(
name string
)
partitioned by(
author string
)
row format delimited
fields terminated by ','
lines terminated by '\n';
-- Create the staging table for dynamic partitioning
create table t_role_tmp(
name string,
author string
)
row format delimited
fields terminated by ','
lines terminated by '\n';
-- Load data into the staging table
load data local inpath '/root/novel' into table t_role_tmp;
-- Insert with dynamic partitions (the partition column must appear in the select)
insert overwrite table t_role partition (author)
select name,author from t_role_tmp;
12.SQL查询
-
和SQL语句一样去使用
- select from where group by having order by desc/asc
-
规则也和以前一样
-
实例
-
--emp 1122,scott,3100,10 3344,smith,3200,20 5566,james,3300,10 7788,cat,3400,30 9900,tom,3500,10 --dept 10,java 20,bd 30,python --创建表 create table t_emp( empno int, name string, sal int, deptno int ) row format delimited fields terminated by ',' lines terminated by '\n'; create table t_dept( deptno int, dname string ) row format delimited fields terminated by ',' lines terminated by '\n'; --载入数据 load data local inpath '/root/emp' into table t_emp; load data local inpath '/root/dept' into table t_dept; --查询每个部门的人数 select deptno,count(empno) from t_emp group by deptno; select e.name,d.dname from t_emp e,t_dept d where e.deptno = d.deptno; select d.dname,count(e.empno) ce from t_emp e,t_dept d where e.deptno = d.deptno group by d.dname order by ce; --查询谁的薪资大于公司的平均薪资 select e.name from t_emp e,(select avg(sal) avgsal from t_emp) tt where e.sal >tt.avgsal ; --查询平均薪资大于等于3200的部门编号 select deptno,avg(sal) from t_emp group by deptno having avg(sal)>=3200;
-
- Hive also offers functions beyond standard relational SQL
- split
  - splits a string into an array
- explode
  - expands an array so that each element becomes its own row
- Lateral View
  - a UDTF (such as explode) first splits one row into many; Lateral View then joins those rows back into a virtual table that supports aliases
- Example

--IMDB250
nezha dongman,shenhua,3d
shanghaibaolei 3d,kehuan,aiqing
zhumengyanyiquan lizhi,dushi
fengshen shenhua,3d,lishi,dongman
-- Create the table
create table t_movie(
name string,
type string
)
row format delimited
fields terminated by ' '
lines terminated by '\n';
-- Load data
load data local inpath '/root/movie' into table t_movie;
-- Queries
select * from t_movie;
select name,split(type,',') from t_movie;
select explode(split(type,',')) from t_movie;
-- Invalid SQL: mixes a one-row column (name) with a multi-row UDTF (explode)
select name,explode(split(type,',')) from t_movie;
-- Combine name and type through the virtual table
select name,tcol from t_movie LATERAL VIEW explode(split(type,',')) typetable AS tcol;
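The lateral-view pattern is essentially a flatMap: each row is paired with every element of its exploded array. A toy Python equivalent (illustration only, using two of the movie rows above):

```python
# Lateral view as a flatMap: pair each movie name with each of its types.
movies = [("nezha", "dongman,shenhua,3d"), ("shanghaibaolei", "3d,kehuan,aiqing")]

# select name, tcol from t_movie lateral view explode(split(type, ',')) t as tcol
pairs = [(name, tcol) for name, types in movies for tcol in types.split(",")]
print(pairs[:3])  # [('nezha', 'dongman'), ('nezha', 'shenhua'), ('nezha', '3d')]
```

This also shows why `select name, explode(...)` is rejected: without the lateral view there is no rule for pairing the single name with the many exploded rows.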
wordCount
- Example

The escape of the Brazilian boa constrictor earned Harry his longest-ever punishment. By the time he was allowed out of his cupboard again, the summer holidays had started and Dudley had already broken his new video camera, crashed his remote control airplane, and, first time out on his racing bike, knocked down old Mrs. Figg as she crossed Privet Drive on her crutches. Harry was glad school was over, but there was no escaping Dudley's gang, who visited the house every single day. Piers, Dennis, Malcolm, and Gordon were all big and stupid, but as Dudley was the biggest and stupidest of the lot, he was the leader. The rest of them were all quite happy to join in Dudley's favorite sport: Harry Hunting. This was why Harry spent as much time as possible out of the house, wandering around and thinking about the end of the holidays, where he could see a tiny ray of hope. When September came he would be going off to secondary school and, for the first time in his life, he wouldn't be with Dudley. Dudley had been accepted at Uncle Vernon's old private school, Smeltings. Piers Polkiss was going there too. Harry, on the other hand, was going to Stonewall High, the local public school. Dudley thought this was very funny. “They stuff people's heads down the toilet the first day at Stonewall,” he told Harry. “Want to come upstairs and practice?”

-- Create the table
create table t_wordcount(
wordline string
)
row format delimited fields terminated by '\n';
-- Load data
load data local inpath '/root/harry' into table t_wordcount;
-- Word count
select * from t_wordcount;
select split(wordline,' ') from t_wordcount;
select explode(split(wordline,' ')) from t_wordcount;
-- Approach 1: lateral view
select word,count(word) from t_wordcount lateral view explode(split(wordline,' ')) eswtable as word group by word;
-- Approach 2: subquery over explode
select word,count(word) from (select explode(split(wordline,' ')) word from t_wordcount) esw group by word;
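The same explode-then-group logic in plain Python, for comparison (a toy sketch on made-up lines; like the Hive query, it splits only on spaces, so punctuation stays attached to words):

```python
# Word count the way approach 2 works: flatten lines into words, then group.
from collections import Counter

lines = ["to be or not to be", "that is the question"]
words = [w for line in lines for w in line.split(" ")]  # explode(split(wordline, ' '))
counts = Counter(words)                                 # group by word, count(word)
print(counts["to"], counts["be"])  # 2 2
```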