大数据私房菜--impala实战
程序员文章站
2024-03-22 12:17:22
...
1 业务背景
现有收集到用户的页面点击行为日志数据,数据格式如下:
用户id, 点击时间
user_id click_time
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
B,2020-05-15 02:03:00
B,2020-05-15 02:29:40
B,2020-05-15 04:00:00
业务:
会话概念:用户的一次会话含义是指用户进入系统开始到用户离开算作一次会话,离开或者重新开始一次会话的概念是指用户的两次行为事件差值大于30分钟,
比如以A用户为例:
第一次会话
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
第二次会话
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
判断条件是只要两次时间差值大于30分钟就属于两次会话。
2 业务需求
对用户的日志数据打上会话内序号,如下
A,2020-05-15 01:30:00,1
A,2020-05-15 01:35:00,2
A,2020-05-15 02:00:00,3
A,2020-05-15 03:00:10,1
A,2020-05-15 03:05:00,2
B,2020-05-15 02:03:00,1
B,2020-05-15 02:29:40,2
B,2020-05-15 04:00:00,1
3 业务实现
在Hive中完成数据加载
--创建表
--在hive的homework数据库里面创建表
use use homework;
--如果表存在删除
drop table if exists user_clicklog;
-- 创建目标表
-- create table homework.t1_ori stored as TEXTFILE
-- 创建homework.t1_ori存储为TEXTFILE
create table homework.user_clicklog(
user_id string,
click_time string
)row format delimited fields terminated by ',';
show tables查看是否创建成功
--加载数据
load data local inpath '/root/clicklog.dat' into table user_clicklog;
需求:如果两次浏览之间的间隔超过30分钟,认为是两个不同的会话;再求对会话进行排序
需要根据数据本身进行分组,思路大致如下:
一、获取数据基础表
需要根据相邻两次浏览记录判断分组点,我们通过序列函数对数据进行错位,SQL如下:
select
t.user_id,
t.click_time,
lag(t.click_time) over(partition by t.user_id order by t.click_time) dt2
from homework.user_clicklog t
数据如下:
二、获取切分点
根据基础表,将dt时间与上一个记录的时间dt2相减计算出时间间隔tl,通过case…when…语句判断是否为切分点,将切分点置一个mark为1
SQL如下:
with ta as (
select
t.user_id,
t.click_time,
lag(t.click_time) over(partition by t.user_id order by t.click_time) dt2
from homework.user_clicklog t
)
select t.user_id,t.click_time,t.dt2,
(unix_timestamp(t.click_time, "yyyy-MM-dd HH:mm:ss") - unix_timestamp(t.dt2, "yyyy-MM-dd HH:mm:ss"))/60 tl,
case when (unix_timestamp(t.click_time, "yyyy-MM-dd HH:mm:ss") - unix_timestamp(t.dt2, "yyyy-MM-dd HH:mm:ss"))/60 >= 30 then 1 else 0 end mark
from ta t
三、实现切分点分组
利用over函数并且在分区内排序之后使用窗口进行累加,累加窗口为第一行到当前行,这样就按照切分点进行分区,第一个分区的mark为0,第二分区的mark为1…如此类推
SQL实现如下:
with ta as (
select
t.user_id,
t.click_time,
lag(t.click_time) over(partition by t.user_id order by t.click_time) dt2
from homework.user_clicklog t
),
tb as (select t.user_id,t.click_time,t.dt2,
(unix_timestamp(t.click_time, "yyyy-MM-dd HH:mm:ss") - unix_timestamp(t.dt2, "yyyy-MM-dd HH:mm:ss"))/60 tl,
case when (unix_timestamp(t.click_time, "yyyy-MM-dd HH:mm:ss") - unix_timestamp(t.dt2, "yyyy-MM-dd HH:mm:ss"))/60 >= 30 then 1 else 0 end mark
from ta t)
select
t.user_id,
t.click_time,
t.dt2,
t.tl,
sum(t.mark) over(partition by t.user_id order by t.click_time rows BETWEEN unbounded preceding and current row) as mark
from tb t
四、汇总统计
在获取了每个分区的标识之后,我们对数据进行分组排名获取序号
SQL实现如下:
with ta as (
select
t.user_id,
t.click_time,
lag(t.click_time) over(partition by t.user_id order by t.click_time) dt2
from homework.user_clicklog t
),
tb as (select t.user_id,t.click_time,t.dt2,
(unix_timestamp(t.click_time, "yyyy-MM-dd HH:mm:ss") - unix_timestamp(t.dt2, "yyyy-MM-dd HH:mm:ss"))/60 tl,
case when (unix_timestamp(t.click_time, "yyyy-MM-dd HH:mm:ss") - unix_timestamp(t.dt2, "yyyy-MM-dd HH:mm:ss"))/60 >= 30 then 1 else 0 end mark
from ta t),
tc as (select
t.user_id,
t.click_time,
t.dt2,
t.tl,
sum(t.mark) over(partition by t.user_id order by t.click_time rows BETWEEN unbounded preceding and current row) as mark
from tb t)
select
t.user_id,
t.click_time,
ROW_NUMBER() over(partition by t.mark,t.user_id order by t.click_time ) rank
from tc t
下一篇: 重新思考软件开发中的单元测试