HIVE常用函数总结

程序员文章站 2022-07-14 14:31:33

...

HIVE常用函数总结

一：explode（列转行）

lateral view explode()a

explode只能炸array和ma，不能炸struct，UDTF explode 只支持一个字段**
hive explode函数可以将一个array或者map展开，其中explode(array)使得结果中将array列表里的每个元素生成一行；explode(map)使得结果中将map里的每一对元素作为一行，key为一列，value为一列，一般情况下，直接使用即可，但是遇到以下情况时需要结合lateral view 使用。

**LATERAL VIEW的使用：**
侧视图的意义是配合explode（或者其他的UDTF），一个语句生成把单行数据拆解成多行后的数据结果集。

二：case when（行转列）


//两种方法结果等效
方法一：
case 
when tb1.os = 'android' then 'android'
when tb1.os = 'ios' then 'iPhone'
else 'PC'
end as os,

方法二：
case tb1.os
when 'android' then 'android'
when 'ios' then 'iPhone'
else 'PC'
end as os,

三：rows beetwen

current row:当前行
unbounded preceding:首行
unbounded following：尾行
n preceding：往前n行数据   					//preceding
n following：往后n行数据

四： count(),sum(),avg(),max(),min()

1. count()返回匹配制定条件的行数。

   count(*)返回在给定的选择中被选的行数。

2.sum()返回组中所有值的和。sum只能用于数字列,空值会被忽略！

3.avg()返回组中值的平均值，空值回避忽略（默认值为16位）！

4.round (avg(列名)，2)返回组中的平均值，可以选择保留一位或者两位。

5.max()返回组中值的最大值。

6.min()返回组中值的最小值。

7.过滤掉最低分小于60的学生，并平均分保留两位数值
select snum,sum(score)  ,round (avg(score)，2),max(score) from Tbl_Grade
Group by snum having min(score)>=60;

8.round  四舍五入

五：排名函数/排序函数

1.排名函数

//排名函数是窗口函数，后边必须开窗over（）

1.1 row_number() over()

//不考虑相等 {1.2.3.4.5}

1.2 dense_rank() over()

//不留空位  {1.2.3.3.4}

1.3 rank() over()

//留空位  {1.2.3.3.5}

2.排序函数

2.1 order by

只有一个reduce  实现全局排序

2.2 sort by

只有一个reduce时功能跟order by一样实现全局排序，reduce有多个时实现每个reduce局部排序

2.3 distribute by

distribute by和sort by结合使用实现分组局部排序

2.4 cluster by

cluster by的功能就是distribute by和sort by相结合

六：get_json_object

get_json_object(param1,"$.param2")

param1：需要解析的json字段

param2：遇到数组就用 [0,1,2...] 0,1,2是数组对应的元素,遇到jsonObject直接用 ".key"取出想要获取的value

比如：
对于jsonArray（json数组），如person表的xjson字段有数据：
[{"name":"王二狗","sex":"男","age":"25"},{"name":"李狗嗨","sex":"男","age":"47"}]
取出第一个json对象，那么hive sql为：
SELECT get_json_object(xjson,"$.[0]") FROM person;
结果是：
{"name":"王二狗","sex":"男","age":"25"}

取出第一个json的age字段的值：
SELECT get_json_object(xjson,"$.[0].age") FROM person;
结果
25

七：split ， substring，substr，space，rtrim，reverse，length

1.split

split(string str, string pat)

1 返回值：string
将字符串str按照指定分隔符转换成Map，第一个参数是需要转换字符串，第二个参数是键值对之间的分隔符，默认为逗号;第三个参数是键值之间的分隔符，默认为"="

2.substring / substr

对于字符串A,从start位置开始截取字符串并返回
substr(string|binary A, int start) substring(string|binary A, int start)
返回值：string
 
对于二进制/字符串A,从start位置开始截取长度为length的字符串并返回
substr(string|binary A, int start, int len) substring(string|binary A, int start, int len)
返回值：string

截取第count分隔符之前的字符串，如count为正则从左边开始截取，如果为负则从右边开始截取
substring_index(string A, string delim, int count)
返回值：string

3.space

返回n个空格

space(int n)

返回值：string

4.rtrim

去掉字符串后面出现的空格

rtrim(string A)

1 返回值：string

左边去空格函数：ltrim

右边去空格函数：rtrim

5.reverse

反转字符串

reverse(string A)

1 返回值：string

6.length

 语法: length(string A)  
 返回值: int  
 说明：返回字符串A的长度  
 举例：hive> select length('abcedfg');
 OK
 7
 Time taken: 0.065 seconds, Fetched: 1 row(s)

八：时间函数

1：获取当前日期: current_date

select current_date;
2019-07-16

2：日期时间转日期函数：to_date(string timestamp)

select to_date('2017-09-15 11:12:00')
2017-09-15

3：计算两个日期之间的天数: datediff

select datediff('2017-09-15','2017-09-01')
14

4：日期增加和减少: date_add/date_sub(string startdate,int days)

select date_add('2017-09-15',1)
2017-09-16

select date_sub('2017-09-15',1)
2017-09-14

5：计算一年中的第几天

select date_format('2011-12-08 10:03:01', 'D');
342

6：计算一年中的第几周

select weekofyear('2011-12-08 10:03:01');
49

7：计算一月中的第几天

select dayofmonth('2011-12-08 10:03:01');
8

8：计算一周中的第几天

select dayofweek('2011-12-08 10:03:01');
5

9：计算当前日期是周几

select date_format('2011-12-08 10:03:01', 'EEEE');
Thursday

SELECT IF(pmod(datediff('2018-05-20', '1920-01-01') - 3, 7)='0', 7, pmod(datediff('2018-05-20', '1920-01-01') - 3, 7)) 

SELECT IF(current_date - 3, 7)='0', 7, pmod(current_date - 3, 7))

10，from_unixtime：转化unix时间戳到当前时区的时间格式

select from_unixtime(1323308943,’yyyyMMdd’);
20111208

11，unix_timestamp：获取当前unix时间戳

select unix_timestamp();
1430816254
select unix_timestamp('2015-04-30 13:51:20');
1430373080

12，year：返回日期中的年

select year('2015-04-02 11:32:12');
输出：2015

13，month：返回日期中的月份

select month('2015-12-02 11:32:12');
输出：12

14，day：返回日期中的天

select day('2015-04-13 11:32:12');
输出：13

15，hour：返回日期中的小时

select hour('2015-04-13 11:32:12');
输出：11

16，minute：返回日期中的分钟

select minute('2015-04-13 11:32:12');
输出：32

17，second：返回日期中的秒

select second('2015-04-13 11:32:56');
输出：56

18，from_unixtime+ unix_timestamp Hive中yyyymmdd和yyyy-mm-dd日期之间的切换

思想：先转换成时间戳，再由时间戳转换为对应格式。
--20171205转成2017-12-05 

select from_unixtime(unix_timestamp('20171205','yyyymmdd'),'yyyy-mm-dd') from dual;

--2017-12-05转成20171205

select from_unixtime(unix_timestamp('2017-12-05','yyyy-mm-dd'),'yyyymmdd') from dual;

19，hive返回上个月第一天和最后一天

select trunc(add_months(CURRENT_TIMESTAMP,-1),'MM')


select concat(substr(add_months(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),-1),1,7),'-01'); 


--上个月最后一天
select date_sub(trunc(CURRENT_TIMESTAMP,'MM'),1);

20，两个日期相差多少小时

select (unix_timestamp('2018-05-25 12:03:55') - unix_timestamp('2018-05-25 11:03:55'))/3600
输出：1

21，两个日期相差多少分钟

select (unix_timestamp('2018-05-25 12:03:55') - unix_timestamp('2018-05-25 11:03:55'))/60
输出：60

22，求一个月的最后一天

select last_day(current_date);
2019-09-30

九：表库操作

1 强制删库

drop database 库名 cascade;

2 建表语句

建表要插入相应类型的字段时
create table test_set(
id INT,
name STRING,
hobby ARRAY<STRING>,    //array中元素为String类型
friend MAP<STRING,STRING>,    //map中键和值均为String类型
mark struct<math:int,english:int>   //Struct中元素为Int类型
)
row format delimited fields terminated by ','   //字段之间用','分隔
collection items terminated by '_'     //集合中的元素用'_'分隔
map keys terminated by ':'      //map中键值对之间用':'分隔
lines terminated by '\n       //行之间用'\n'分隔

十：NVL,Coalesce,NVL2 ,regexp_replace,regexp_extract

1 NVL

NVL函数的格式如下：NVL(expr1,expr2)
含义是：如果oracle第一个参数为空那么显示第二个参数的值，如果第一个参数的值不为空，则显示第一个参数本来的值。

2 NVL2

NVL2函数的格式如下：NVL2(expr1,expr2, expr3)
含义是：如果该函数的第一个参数为空那么显示第二个参数的值，如果第一个参数的值不为空，则显示第三个参数的值。

3 Coalesce

格式如下：
Coalesce(expr1, expr2, expr3….. exprn)
表示可以指定多个表达式的占位符。所有表达式必须是相同类型，或者可以隐性转换为相同的类型。
返回表达式中第一个非空表达式
如果所有自变量均为 NULL，则 COALESCE 返回 NULL 值。

4 regexp_replace()

select regexp_replace(A,'\\|','')，意思就是将字段A下边的所有的 | 替换为空

5 regexp_extract

 1 语法: regexp_extract(string subject, string pattern, int index)  
 2 返回值: string  
 3 说明：将字符串subject按照pattern正则表达式的规则拆分，返回index指定的字符。注意，在有些情况下要使用转义字符  
 4 举例： 
 5  hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1);
 6 OK
 7 the

十一：LAG()，LEAD()，FIRST_VALUE，LAST_VALUE

//后边都可以跟开窗函数

1，lag()

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名，第二个参数为往上第n行（可选，默认为1），第三个参数为默认值（当往上第n行为NULL时候，取默认值，如不指定，则为NULL）

2，lead()

与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名，第二个参数为往下第n行（可选，默认为1），第三个参数为默认值（当往下第n行为NULL时候，取默认值，如不指定，则为NULL）

3，first_value

取分组内排序后，截止到当前行，第一个值

4，last_value

取分组内排序后，截止到当前行，最后一个值

十二：CONCAT（）和 CONCAT_WS（）

1，CONCAT（）

CONCAT（）函数用于将多个字符串连接成一个字符串。
语法及使用特点：
CONCAT(str1,str2,…)                       
返回结果为连接参数产生的字符串。如有任何一个参数为NULL ，则返回值为 NULL。可以有一个或多个参数。

2，CONCAT_WS（）

如何指定参数之间的分隔符
使用函数CONCAT_WS（）。使用语法为：CONCAT_WS(separator,str1,str2,…)
CONCAT_WS() 代表 CONCAT With Separator ，是CONCAT()的特殊形式。第一个参数是其它参数的分隔符。分隔符的位置放在要连接的两个字符串之间。分隔符可以是一个字符串，也可以是其它参数。如果分隔符为 NULL，则结果为 NULL。函数会忽略任何分隔符参数后的 NULL 值。但是CONCAT_WS()不会忽略任何空字符串。 (然而会忽略所有的 NULL）。

十三：collect_list和collect_set

// collect_list接收返回的数组函数，collect_set接收的函数必须为string类型

它们都是将分组中的某列转为一个数组返回，不同的是collect_list不去重而collect_set去重。
select username, collect_list(video_name) from t_visit_video group by username ;
select username, collect_set(video_name) from t_visit_video group by username;
select username, collect_list(video_name)[0] from t_visit_video group by username;

十四：类型转换

1，cast

Cast(字段名 as 转换的类型 )，其中类型可以为：
timestamp
date
string

十五：大小写转换

1，upper,ucase

//字符串转大写函数：
 1 语法: upper(string A) ucase(string A)  
 2 返回值: string  
 3 说明：返回字符串A的大写格式  
 4 举例：hive> select upper('abSEd');
 5 OK
 6 ABSED
 7 Time taken: 0.059 seconds, Fetched: 1 row(s)
 8 hive> select ucase('abSEd');
 9 OK
10 ABSED
11 Time taken: 0.058 seconds, Fetched: 1 row(s)

2，lower,lcase

 //字符串转小写函数
 1 语法: lower(string A) lcase(string A)  
 2 返回值: string  
 3 说明：返回字符串A的小写格式  
 4 举例： 
 5 hive> select lower('abSEd');
 6 OK
 7 absed
 8 Time taken: 0.068 seconds, Fetched: 1 row(s)
 9 hive> select lcase('abSEd');
10 OK
11 absed
12 Time taken: 0.057 seconds, Fetched: 1 row(s)

十六：parse_url（解析URL字符串）

parse_url(url, partToExtract[, key]) - extracts a part from a URL  
解析URL字符串，partToExtract的选项包含[HOST,PATH,QUERY,REF,PROTOCOL,FILE,AUTHORITY,USERINFO]。 
举例：
hive> select parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') ;
OK
facebook.com
Time taken: 0.286 seconds, Fetched: 1 row(s)
hive> select parse_url('http://facebook.com/path/p1.php?query=1', 'PATH');
OK
/path/p1.php
Time taken: 0.069 seconds, Fetched: 1 row(s)
hive> select parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY');
OK
query=1
可以指定key来返回特定参数，例如 
Time taken: 0.21 seconds, Fetched: 1 row(s)
hive> select parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY','query');
OK
1
Time taken: 0.057 seconds, Fetched: 1 row(s)
hive> select parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'REF');
OK
Ref
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'PROTOCOL');
OK
http
Time taken: 0.06 seconds, Fetched: 1 row(s)

上一篇： Hadoop运行模式之本地运行模式

下一篇： hive窗口函数总结

HIVE常用函数总结

HIVE常用函数总结

一：explode（列转行）

二：case when（行转列）

三：rows beetwen

四： count(),sum(),avg(),max(),min()

五：排名函数/排序函数

1.排名函数

1.1 row_number() over()

1.2 dense_rank() over()

1.3 rank() over()

2.排序函数

2.1 order by

2.2 sort by

2.3 distribute by

2.4 cluster by

六：get_json_object

七：split ， substring，substr，space，rtrim，reverse，length

1.split

2.substring / substr

3.space

4.rtrim

5.reverse

6.length

八：时间函数

1：获取当前日期: current_date

2：日期时间转日期函数：to_date(string timestamp)

3：计算两个日期之间的天数: datediff

4：日期增加和减少: date_add/date_sub(string startdate,int days)

5：计算一年中的第几天

6：计算一年中的第几周

7：计算一月中的第几天

8：计算一周中的第几天

9：计算当前日期是周几

10，from_unixtime：转化unix时间戳到当前时区的时间格式

11，unix_timestamp：获取当前unix时间戳

12，year：返回日期中的年

13，month：返回日期中的月份

14，day：返回日期中的天

15，hour：返回日期中的小时

16，minute：返回日期中的分钟

17，second：返回日期中的秒

18，from_unixtime+ unix_timestamp Hive中yyyymmdd和yyyy-mm-dd日期之间的切换

19，hive返回上个月第一天和最后一天

20， 两个日期相差多少小时

21， 两个日期相差多少分钟

22，求一个月的最后一天

九：表库操作

1 强制删库

2 建表语句

十：NVL,Coalesce,NVL2 ,regexp_replace,regexp_extract

1 NVL

2 NVL2

3 Coalesce

4 regexp_replace()

5 regexp_extract

十一：LAG()，LEAD()，FIRST_VALUE，LAST_VALUE

1，lag()

2，lead()

3，first_value

4，last_value

十二：CONCAT（）和 CONCAT_WS（）

1，CONCAT（）

2，CONCAT_WS（）

十三：collect_list和collect_set

十四：类型转换

1，cast

十五：大小写转换

1，upper,ucase

2，lower,lcase

十六：parse_url（解析URL字符串）

List常用方法汇总工具类总结(持续更新)

jsp+mysql数据库操作常用方法实例总结

mysql常用函数汇总

SQL中常用字符串函数

PHP实例：常用的数值判断函数_PHP教程

linux总结及常用命令

javascript常用函数归纳整理_javascript技巧

PHP数组和explode函数示例总结_PHP

常用PHP系统常量总结

20，两个日期相差多少小时

21，两个日期相差多少分钟