Hive知识点总结(四)
Hive知识点总结(一)(基本概念、运行方式、创建库表、数据读取、文件格式):点击查看
Hive知识点总结(二)(数据类型及修改表定义):点击查看
Hive知识点总结(三)(Hive查询语法):点击查看
6、hive函数
hive的所有函数手册:点击查看
下面我们总结一些常用内置函数及如何自定义函数
6.1、常用内置函数
6.1.1、类型转换函数(cast)
select cast("5" as int) from dual;
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);
6.1.2、数学运算函数
select round(5.4) from dual; ## 5
select round(5.1345,3) from dual; ##5.135
select ceil(5.4) from dual; // select ceiling(5.4) from dual; ## 6
select floor(5.4) from dual; ## 5
select abs(-5.4) from dual; ## 5.4
select greatest(3,5,6) from dual; ## 6
select least(3,5,6) from dual; ## 3
select max(age) from t_person; 聚合函数
select min(age) from t_person; 聚合函数
举例:
select greatest(cast(s1 as double),cast(s2 as double),cast(s3 as double)) from t_fun2;
结果:+---------+--+
| _c0 |
+---------+--+
| 2000.0 |
| 9800.0 |
+---------+--+
6.1.3、字符串函数
substr(string, int start) ## 截取子串
substring(string, int start)
示例:select substr("abcdefg",2) from dual;
substr(string, int start, int len)
substring(string, int start, int len)
示例:select substr("abcdefg",2,3) from dual;
concat(string A, string B...) ## 拼接字符串
concat_ws(string SEP, string A, string B...)
示例:select concat("ab","xy") from dual;
select concat_ws(".","192","168","33","44") from dual;
length(string A)
示例:select length("192.168.33.44") from dual;
split(string str, string pat)
示例:select split("192.168.33.44",".") from dual; 错误的,因为.号是正则语法中的特定字符
select split("192.168.33.44","\\.") from dual;
upper(string str) ##转大写
6.1.4、时间函数
select current_timestamp;
select current_date;
## 取当前时间的毫秒数时间戳
select unix_timestamp();
## unix时间戳转字符串
from_unixtime(bigint unixtime[, string format])
示例:select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");
## 字符串转unix时间戳
unix_timestamp(string date, string pattern)
示例: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017/08/10 17:50:30","yyyy/MM/dd HH:mm:ss");
## 将字符串转成日期date
select to_date("2017-09-17 16:58:32");
6.1.5、表生成函数
6.1.5.1、行转列函数:explode()
假如有以下数据:
1,zhangsan,化学:物理:数学:语文 2,lisi,化学:数学:生物:生理:卫生 3,wangwu,化学:语文:英语:体育:生物 |
映射成一张表:
create table t_stu_subject(id int,name string,subjects array<string>)
row format delimited fields terminated by ','
collection items terminated by ':';
使用explode()对数组字段“炸裂”
然后,我们利用这个explode的结果,来求去重的课程:
select distinct tmp.sub
from
(select explode(subjects) as sub from t_stu_subject) tmp;
6.1.5.2、表生成函数 :lateral view
select id,name,tmp.sub
from t_stu_subject lateral view explode(subjects) tmp as sub;
理解: lateral view 相当于两个表在join
左表:是原表
右表:是explode(某个集合字段)之后产生的表
而且:这个join只在同一行的数据间进行
那样,可以方便做更多的查询:
比如,查询选修了生物课的同学
select a.id,a.name,a.sub from
(select id,name,tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) a
where sub='生物';
6.1.6、集合函数
array_contains(Array<T>, value) 返回boolean值
示例:
select moive_name,array_contains(actors,'吴刚') from t_movie;
select array_contains(array('a','b','c'),'c') from dual;
sort_array(Array<T>) 返回排序后的数组
示例:
select sort_array(array('c','b','a')) from dual;
select 'haha',sort_array(array('c','b','a')) as xx from (select 0) tmp;
size(Array<T>) 返回一个int值
示例:
select moive_name,size(actors) as actor_number from t_movie;
size(Map<K.V>) 返回一个int值map_keys(Map<K.V>) 返回一个数组
map_values(Map<K.V>) 返回一个数组
6.1.7、控制函数
6.1.7.1、case when
语法:
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
WHEN conditionn THEN resultn
ELSE result
END
示例:
select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;
6.1.7.2、if
select id,if(age>25,'working','worked') from t_user;
select moive_name,if(array_contains(actors,'吴刚'),'好电影','rom t_movie;
6.1.8、json解析函数
get_json_object('{\"key1\":3333,\"key2\":4444}' , '$.key1') --> 3333
json_tuple('{\"key1\":3333,\"key2\":4444}','key1','key2') as(key1,key2) --> 3333, 4444
json_tuple函数
示例:
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_rating_json;
产生结果:
利用json_tuple从原始json数据表中,etl出一个详细信息表:
create table t_rate
as
select
uid,
movie,
rate,
year(from_unixtime(cast(ts as bigint))) as year,
month(from_unixtime(cast(ts as bigint))) as month,
day(from_unixtime(cast(ts as bigint))) as day,
hour(from_unixtime(cast(ts as bigint))) as hour,
minute(from_unixtime(cast(ts as bigint))) as minute,
from_unixtime(cast(ts as bigint)) as ts
from
(select
json_tuple(rateinfo,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid)
from t_json) tmp;
6.1.9、URL解析函数
网页URL数据解析函数:parse_url_tuple
select parse_url_tuple("http://www.edu360.cn/baoming/youhui?cookieid=20937219375",'HOST','PATH','QUERY','QUERY:cookieid')
from dual;
+----------------+------------------+-----------------------+--------------+--+
| c0 | c1 | c2 | c3 |
+----------------+------------------+-----------------------+--------------+--+
| www.edu360.cn | /baoming/youhui | cookieid=20937219375 | 20937219375 |
+----------------+------------------+-----------------------+--------------+--+
6.1.10、分析函数
(1)使用row_number() over()函数解决TOPN问题:
-- 造数据:
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
create table t_rn(id int,age int,name string,sex string)
row format delimited fields terminated by ',';
load data local inpath '/root/hivetest/rn.dat' into table t_rn;
-- 分组标记序号
select *
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rn
from t_rn) tmp
where rn<3
;
(2)使用窗口分析函数 sum() over():来实现窗口中的逐行累加
0: jdbc:hive2://localhost:10000> select * from t_access_amount;
+----------------------+------------------------+-------------------------+--+
| t_access_amount.uid | t_access_amount.month | t_access_amount.amount |
+----------------------+------------------------+-------------------------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| A | 2015-03 | 20 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
| B | 2015-03 | 45 |
| C | 2015-01 | 30 |
| C | 2015-02 | 40 |
| C | 2015-03 | 30 |
+----------------------+------------------------+-------------------------+--+
-- 需求:求出每个人截止到每个月的总额
select uid,month,amount,
sum(amount) over(partition by uid order by month rows between unbounded preceding and current row) as accumulate
from t_access_amount;
6.2、自定义函数
在实际的开发过程中,很多时候需要我们处理复杂的数据逻辑,需要我们能够自定义函数来实现数据操作,下面举例说明:
/*
有如下json数据:rating.json
*/
首先建表映射上述数据
create table t_ratingjson(json string);
导入数据
load data local inpath '/root/hivetest/rating.json' into table t_ratingjson;
我们的需求是想把上面的原始数据变成如下形式:
1193,5,978300760,1
661,3,978302109,1
914,3,978301968,1
3408,4,978300275,1
思路:如果能够自定义一个json解析函数,传入数,读出对应的值,就很方便了
create table t_rate
as
select myjson(json,1) as movie,cast(myjson(json,2) as int) as rate,myjson(json,3) as ts,myjson(json,4) as uid
from t_ratingjson;
解决:
hive中如何定义自己的函数:
1、先写一个java类(extends UDF,重载方法public C evaluate(A a,B b)),实现你所想要的函数的功能(传入一个json字符串和一个脚标,返回一个值)
public class ParseJson extends UDF{
// 重载 :返回值类型 和参数类型及个数,完全由用户自己决定
// 本处需求是:给一个字符串,返回一个数组
public String[] evaluate(String json) {
String[] split = json.split("\"");
String[] res = new String[]{split[3],split[7],split[11],split[15]};
return res;
}
}
2、将java程序打成jar包,上传到hive所在的机器
3、在hive命令行中将jar包添加到classpath :
hive>add jar /root/hivetest/myjson.jar;
4、在hive命令中用命令创建一个临时函数叫做myjson,关联你所写的这个java类
hive> create temporary function myjson as 'cn.rople.hive.udf.MyJsonParser';
5、开发hql语句,利用自定义函数,从原始表中抽取数据插入新表
insert into table t_rate
select
split(jsonp(json),',')[0],
cast(split(jsonp(json),',')[1] as int),
cast(split(jsonp(json),',')[2] as bigint),
cast(split(jsonp(json),',')[3] as int)
from
t_rating_json;
注:临时函数只在一次hive会话中有效,重启会话后就无效
如果需要经常使用该自定义函数,可以考虑创建永久函数:
拷贝jar包到hive的类路径中:
cp wc.jar apps/hive-1.2.1/lib/
创建了:
create function pfuncx as 'com.doit.hive.udf.UserInfoParser';
删除函数:
DROP TEMPORARY FUNCTION [IF EXISTS] function_name
DROP FUNCTION[IF EXISTS] function_name