Hive, Part 2

程序员文章站 2024-03-16 11:06:40

Using Hive through command-line options

hive -e 'command'

hive -e 'show databases;'

hive -f file (a file containing the statements to run)

hive -f test.sql

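Database operations (create, drop, alter, query)

Create: create database [if not exists] db_name;

Drop: drop database db_name; (a database that still contains tables cannot be dropped directly: drop its tables first, or append cascade to the statement)

Alter: Hive does not support renaming a database; only its properties can be changed, with alter database db_name set dbproperties (...);

Query: show databases;

Show database details:

desc database myhive2;

desc database extended myhive2;

Switch database: use db_name;

Hive stores its data in HDFS; databases, tables, and partitions all exist as directories.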

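Table operations

Create-table statements

Managed (internal) table

create table [IF NOT EXISTS] table_name (...);

External table

create EXTERNAL table [IF NOT EXISTS] table_name (...);

Difference between managed and external tables

Dropping a managed table deletes both the table's metadata and its data.

Dropping an external table deletes only the metadata; the data files are left in place.

Specifying the column delimiter of the data loaded into the table

ROW FORMAT DELIMITED FIELDS TERMINATED BY char (char is the delimiter character)

STORED AS: the file format in which the table's data is stored on HDFS (SEQUENCEFILE | TEXTFILE | RCFILE)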

Data types supported by Hive

Primitive types

BOOLEAN TINYINT SMALLINT INT BIGINT FLOAT DOUBLE DECIMAL STRING VARCHAR
CHAR BINARY TIMESTAMP DATE INTERVAL

Complex types

ARRAY MAP STRUCT UNIONTYPE

A first try with Hive

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t'
stored as textfile location '/user/stu2';

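External tables

Create the external tables

create external table techer (t_id string, t_name string) row format delimited fields terminated by '\t';

create external table student (s_id string, s_name string, s_birth string, s_sex string) row format delimited fields terminated by '\t';

Load statements (with local the path is on the local file system; without it the path is on HDFS and the file is moved into the table's directory)

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

load data inpath '/hivedatas/techer.csv' into table techer;

Check whether the data survives dropping the tables

drop table techer;

drop table student;

After the drop, the table definitions are gone but the data files still exist.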

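Managed (internal) tables

Create the managed table

create table student1 (s_id string, s_name string, s_birth string, s_sex string) row format delimited fields terminated by ',';

Load statement

load data local inpath '/export/servers/hivedatas/student1.csv' into table student1;

Check whether the data survives dropping the table

drop table student1;

After the drop, the table definition is gone and the data is deleted as well.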

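Partitioned tables

Create a partitioned table

create table score2 (s_id string, c_id string, s_score int) partitioned by (year string, month string, day string) row format delimited fields terminated by '\t';

Load data

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='02');

Key points:

A partition column must never duplicate a column already defined in the table body.

Purpose: faster queries, because a query that filters on the partition columns reads only the matching partition directories.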
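Because each partition is stored as its own directory, a filter on a partition column can be answered by choosing directories rather than reading rows. The sketch below models that idea in Python. It is an illustration, not Hive code: the table location is an assumed path under the default warehouse directory, and partition_path and prune are hypothetical helper names.

```python
def partition_path(table_dir, spec):
    """Return the directory of one partition.

    spec is a list of (column, value) pairs in PARTITIONED BY order;
    each pair becomes one column=value directory level.
    """
    return table_dir.rstrip("/") + "/" + "/".join(f"{col}={val}" for col, val in spec)


def prune(partitions, **wanted):
    """Keep only the partitions whose values match every requested column."""
    return [p for p in partitions
            if all(dict(p).get(col) == val for col, val in wanted.items())]


table_dir = "/user/hive/warehouse/myhive.db/score2"  # assumed table location
parts = [
    [("year", "2018"), ("month", "06"), ("day", "01")],
    [("year", "2018"), ("month", "06"), ("day", "02")],
]

# A query with "where day = '02'" only needs the second directory.
for spec in prune(parts, day="02"):
    print(partition_path(table_dir, spec))
```

With the two partitions loaded above, only the day=02 directory is selected, so the day=01 files would never be opened.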

Bucketed tables

1. Enable bucketing (needed on older Hive releases)

set hive.enforce.bucketing=true;

2. Set the number of buckets (here by setting the number of reducers to match)

set mapreduce.job.reduces=3;

3. Create the bucketed table

create table course (c_id string, c_name string, t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

4. Load the data

4.1 Create a plain table

create table course_common (c_id string, c_name string, t_id string) row format delimited fields terminated by '\t';

4.2 Load data into the plain table

load data local inpath '/opt/hive/course.csv' into table course_common;

4.3 Insert into the bucketed table by selecting from the plain table

insert overwrite table course select * from course_common cluster by(c_id);

5. Verify the bucketed data

[aaa@qq.com hive]# hadoop fs -ls /user/hive/warehouse/hivedatabase.db/course/
Found 3 items
-rwxr-xr-x 3 root supergroup 13 2019-11-21 01:56 /user/hive/warehouse/hivedatabase.db/course/000000_0
-rwxr-xr-x 3 root supergroup 13 2019-11-21 01:56 /user/hive/warehouse/hivedatabase.db/course/000001_0
-rwxr-xr-x 3 root supergroup 13 2019-11-21 01:56 /user/hive/warehouse/hivedatabase.db/course/000002_0
[aaa@qq.com hive]#
[aaa@qq.com hive]# hadoop fs -cat /user/hive/warehouse/hivedatabase.db/course/000000_0
03 英语 03
[aaa@qq.com hive]# hadoop fs -cat /user/hive/warehouse/hivedatabase.db/course/000001_0
01 语文 02
[aaa@qq.com hive]# hadoop fs -cat /user/hive/warehouse/hivedatabase.db/course/000002_0
02 数学 01

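Key points:

The bucketing column must be one of the table's existing columns.

Bucketing logic:

Hash the value of the bucketing column and take the remainder of that hash divided by the number of buckets; the remainder is the index of the bucket the row goes into.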
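The hash-then-modulo rule can be sketched in a few lines of Python. This is a model under a stated assumption, not Hive's source: it assumes that for plain ASCII string keys the hash behaves like Java's 31-based String.hashCode. The names java_string_hash and bucket_for are made up for the illustration.

```python
def java_string_hash(s):
    """31-based rolling hash, wrapped to a signed 32-bit value like Java's String.hashCode."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_for(key, num_buckets):
    """Bucket index = non-negative hash modulo the number of buckets."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_buckets


for c_id in ("01", "02", "03"):
    print(c_id, "-> bucket", bucket_for(c_id, 3))
```

Under that assumption the keys 01, 02 and 03 map to buckets 1, 2 and 0, which agrees with the verification listing above, where 01 is in 000001_0, 02 in 000002_0 and 03 in 000000_0.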

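Purpose: faster joins and data sampling.

Faster joins: use the join key as the bucketing column. Rows with the same key land in the same bucket, so a join can read just the matching buckets instead of scanning the whole table.

Data sampling: use a record number as the bucketing column and take it modulo the number of buckets. This scatters the rows across the buckets, so each bucket holds data from every "stage" of the data set.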

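Hive user-defined functions

List functions: show functions;

Show a function's usage: desc function upper;

Detailed usage: desc function extended upper;

Kinds of user-defined function

UDF: one row in, one value out
UDAF: many rows in, one value out
UDTF: one row in, many rows out

UDF

1. Create a class that extends UDF and implement an evaluate method containing your own business logic

2. Package it as a jar and upload it to the cluster (Linux)

3. Make the jar visible to Hive

In the Hive shell: add jar <path to the jar file>;

4. Create a temporary function

create temporary function function_name as 'package.ClassName';

5. Call the new function

hive (default)> select tolower('CCCCCC');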
OK
_c0
cccccc

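Calling plain Java code from Hive with reflect

1. Write the business code in Java, package it, and upload it (Linux)

2. Make the jar visible to Hive

In the Hive shell: add jar <path to the jar file>;

3. Call it

select reflect('arg1','arg2','arg3');

arg1: fully qualified class name (package + class)

arg2: method name

arg3: the data passed to the method

For example:

hive (default)> select reflect('demo03','text','qq');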
OK
_c0
qq -----

Altering tables in Hive

alter table old_table_name rename to new_table_name;

Adding and changing columns

Add columns

alter table score5 add columns (mycol string, mysco string);

Change a column (rename it and set its type)

alter table score5 change column mysco mysconew int;

Basic Hive operations

1. Importing data into tables

There are 5 ways

1. Insert rows directly into a partitioned table

insert into table score3 partition(month='201807') values ('001','002','100');

2. Insert the result of a query

insert overwrite table score4 partition(month='201806') select s_id,c_id,s_score from score;

3. Multi-insert (one scan of the source table feeds several inserts)

from score
insert overwrite table score_first partition(month='201806') select s_id,c_id
insert overwrite table score_second partition(month='201806') select c_id,s_score;

4. Create a table and load it from a query

create table score5 as select * from score;

5. Point the table at existing data with location when creating it

create external table score6 (s_id string, c_id string, s_score int) row format delimited fields terminated by '\t' location '/myscore6';

2. Exporting table data

There are 7 ways

1. Export a query result to the local file system

insert overwrite local directory '/export/servers/exporthive/a' select * from score;

2. Export a query result to the local file system with explicit formatting

insert overwrite local directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;

3. Export a query result to HDFS (no local keyword)

insert overwrite directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;

4. Copy to the local file system with a Hadoop command (run from the Hive shell)

dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;

5. Export with the hive command from the operating-system shell

bin/hive -e "select * from myhive.score;" > /export/servers/exporthive/score.txt

6. Export to HDFS with export

export table score to '/export/exporthive/score';

7. Export with sqoop (covered separately later)

3. Truncating a table (works only on managed tables)

truncate table score5;

Common Hive query syntax

SELECT column_a, column_b FROM table_name;

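Hive parameter precedence

There are three ways to set a parameter

1. Configuration file (configuration-file parameters)

2. hive -hiveconf (command-line parameters)

3. set in the Hive shell (session-level declarations)

set mapred.reduce.tasks=100;

Precedence: session-level set > command-line parameters > configuration-file parameters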
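The precedence order behaves like a layered lookup: the first layer that defines a parameter wins. A minimal Python model of that rule follows. The parameter values are invented for the illustration, and the three dictionaries merely stand in for the configuration file, the -hiveconf options and the session's set statements.

```python
from collections import ChainMap

# Values below are illustrative only.
config_file = {"mapred.reduce.tasks": "-1", "hive.exec.parallel": "false"}  # configuration file
command_line = {"mapred.reduce.tasks": "10"}   # hive -hiveconf mapred.reduce.tasks=10
session_set = {"mapred.reduce.tasks": "100"}   # set mapred.reduce.tasks=100;

# ChainMap searches its maps from left to right, so the leftmost layer has the highest priority.
effective = ChainMap(session_set, command_line, config_file)

print(effective["mapred.reduce.tasks"])  # defined in all three layers, the session value wins
print(effective["hive.exec.parallel"])   # defined only in the configuration file
```

A parameter that is never overridden falls through to the configuration file, which is why the second lookup still returns a value.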

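Hive compression: Snappy is one of the fastest codecs to compress and decompress, though it compresses less tightly than slower codecs such as gzip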

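Data formats supported by Hive

Text, SequenceFile, Parquet, ORC, RCFile, and others are supported

Combining storage format and compression

In real projects, Hive tables are usually stored as ORC or Parquet, and Snappy is the usual choice of compression.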

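Hive optimization

1. Fetch optimization

Simple queries are not converted to MapReduce: set hive.fetch.task.conversion=more;

Every query, even a simple one, is converted to MapReduce: set hive.fetch.task.conversion=none;

2. Local mode

Data-local computation: once the data is stored and the program that processes it is written, the program is shipped, where possible, to the nodes that hold the data it reads.

Local query: the query runs on the node that submitted it instead of being submitted to the cluster.

Benefit: faster queries on small inputs, since no cluster job has to be scheduled.

Data skew: the rows are spread very unevenly across the values of some key, so a few values account for far more data than the rest.

3. Partial aggregation for data skew

When data is skewed, partial aggregation on the map side (hive.map.aggr=true) can improve performance.

When skew handling is enabled (hive.groupby.skewindata=true), the query is converted into at least two MapReduce jobs: the first performs partial aggregation and the second performs the final aggregation.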
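The two-job idea can be modelled in plain Python. The sketch below is a simplification rather than Hive's implementation: stage 1 sends rows to a randomly chosen reducer so that a hot key is split up and partially counted, and stage 2 regroups the partial counts by key. The function name two_stage_count and the sample data are made up for the illustration.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers, seed=0):
    """Count keys in two stages, the way a skew-tolerant GROUP BY would."""
    rng = random.Random(seed)

    # Stage 1: each row goes to a random reducer, so no single reducer
    # receives every row of a hot key; each reducer keeps partial counts.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1

    # Stage 2: partial counts are regrouped by key and summed.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return partials, final


keys = ["hot"] * 9000 + ["a", "b", "c"] * 100
partials, final = two_stage_count(keys, num_reducers=3)
print(final["hot"], [p["hot"] for p in partials])
```

The final counts are the same as a single-pass count, but the 9000 rows of the hot key are shared among the stage-1 reducers, and stage 2 only has to add up one small partial count per reducer.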

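4. Counting distinct values with count(distinct)

SELECT count(DISTINCT id) FROM bigtable;

Optimization

Use a nested query (group by id first, then count the groups). On a large table, count(distinct) funnels the de-duplication into a single reducer, whereas the group by can be spread over many.

SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a;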
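That the two queries return the same number can be checked on a small table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption of the example: it shows only that the results agree, not the performance difference, and the 1000 test rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (id INTEGER)")
# 1000 rows whose ids cycle through 0..6, so there are 7 distinct values.
conn.executemany("INSERT INTO bigtable VALUES (?)", [(i % 7,) for i in range(1000)])

direct = conn.execute("SELECT count(DISTINCT id) FROM bigtable").fetchone()[0]
nested = conn.execute(
    "SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a"
).fetchone()[0]

print(direct, nested)  # 7 7
```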

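5. Cartesian products

Avoid joins without an on condition

6. Partition pruning and column pruning

Partition pruning: read only the partitions the query needs, and no others.

Column pruning: read only the columns the query needs, and no others.

Filter before joining

A join normally joins first and filters afterwards; the optimization is to filter first and then join.

For example

SELECT a.id FROM bigtable a LEFT JOIN ori b ON a.id = b.id WHERE b.id <= 10;

Optimization

1. SELECT a.id FROM ori a LEFT JOIN bigtable b ON (b.id <= 10 AND a.id = b.id);

2. SELECT a.id FROM bigtable a RIGHT JOIN (
SELECT id FROM ori WHERE id <= 10
) b ON a.id = b.id;

With outer joins, moving a filter out of WHERE can change which rows come back, so check that the rewritten query still returns the result you need.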

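7. Dynamic partitioning

The partitioning of the first (source) table determines the partitioning of the second (target) table: every partition of the source is copied over, so the load into the target does not name a partition explicitly and simply reuses the source's partition values.

8. Data skew

How to deal with skewed data.

1. Set the number of reducers to 60 and distribute the rows by ID with distribute by

2. Set the number of reducers to 60 and distribute by a random number: select * from a distribute by rand();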
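The difference between the two options shows up in the load of the busiest reducer. The Python sketch below compares them on made-up data with one hot key. It is a model only: zlib.crc32 stands in for whatever hash Hive applies to the distribute by column, and the function names are hypothetical.

```python
import random
import zlib
from collections import Counter


def loads_by_key(keys, num_reducers):
    """distribute by key: a row's reducer is fixed by the hash of its key."""
    return Counter(zlib.crc32(k.encode()) % num_reducers for k in keys)


def loads_by_rand(keys, num_reducers, seed=0):
    """distribute by rand(): a row's reducer is chosen at random."""
    rng = random.Random(seed)
    return Counter(rng.randrange(num_reducers) for _ in keys)


# One hot key with 60000 rows plus 6000 keys with one row each.
keys = ["hot"] * 60000 + [f"id{i}" for i in range(6000)]

print("busiest reducer, by key: ", max(loads_by_key(keys, 60).values()))
print("busiest reducer, by rand:", max(loads_by_rand(keys, 60).values()))
```

Distributing by key leaves one reducer with all 60000 hot rows, while random distribution gives every reducer roughly the same share. The price is that rows with the same key no longer arrive at the same reducer, so the random variant suits only work that does not need per-key grouping.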

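9. Number of reducers

Factors that determine the number of reducers:

Parameter 1: the amount of data each reducer processes (hive.exec.reducers.bytes.per.reducer)

Parameter 2: the maximum number of reducers per job (hive.exec.reducers.max)

Formula for the number of reducers

N = min(parameter 2, total input size / parameter 1)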
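As a worked example of the formula, the following Python function computes N for a given input size. Rounding the division up and never returning fewer than one reducer are assumptions added here so that the result is a usable whole number; the 256 MB and 1009 figures are illustrative values, not settings taken from this article.

```python
import math


def estimate_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
    """N = min(max_reducers, total_input_bytes / bytes_per_reducer), rounded up, at least 1."""
    wanted = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


MB = 1024 * 1024
GB = 1024 * MB

print(estimate_reducers(10 * GB, 256 * MB, 1009))    # 10 GB of input  -> 40
print(estimate_reducers(1024 * GB, 256 * MB, 1009))  # 1 TB of input   -> 1009 (capped)
```

For small inputs parameter 1 decides the result; once the input is large enough, parameter 2 caps it.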

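10. Parallel execution

A query is converted into several stages; stages that do not depend on one another can be run in parallel (several tasks and stages at the same time), which shortens the overall run time.

11. Strict mode

1. Queries on a partitioned table may not scan all partitions (a partition filter is required).

2. A query that uses order by must also use limit.

3. Cartesian-product queries are restricted.

12. JVM reuse

Without JVM reuse, every task starts and shuts down its own JVM (start-up plus shutdown takes about 1 s), so the per-task overhead is large.

With JVM reuse, a set of JVMs is started up front and kept alive; later tasks run in these existing JVMs, so the per-task overhead is small.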
