SQL | explode+lateral view和其余sql常用函数

程序员文章站 2024-03-12 17:25:56

...

背景

实习时遇到的一个场景。

用sql从数据库中提取数据，并且将数据转成所需要的格式
其次对转换后数据放到本地的Python环境中进行预处理建模。
解释模型与上线

下面主要总结一下sql提数的过程，后面两个步骤前面的推文涉及到一部分，后面还会更新~敬请期待。

数据长什么样？需求是什么呢？

原始数据

import pandas as pd

df1 = pd.read_excel('./sql数据处理与提取-0326.xlsx', sheet_name='原始')
df1

	deviceid	categoryinterest
0	456344159	军事:4.412,要闻:0.856,社会:0.543,娱乐/综艺:0.472,娱乐/电视:0...
1	537762486	军事:2.266,社会:1.115,体育/羽毛球:1.038,情感:0.321,娱乐:0.2...
2	842923625	动物/宠物:0.389,历史/近现代史:0.299,未知:0.296,时尚/时装:0.266...
3	000000C7KP39VGIV	动漫/日韩动漫:1.0,动漫:0.549,音乐:0.526,游戏/联机游戏:0.478,游戏...
4	000000C7KP3RJC7W	奇闻趣事:0.881,娱乐/明星八卦:0.831,娱乐:0.823,社会/事故:0.0,社会...

最终想要的数据

形成3列：

一列是deviceid
一列是categoryinterest中的 category，也就是兴趣点
第三列是兴趣点对应的得分

其中一个id对应的多个类别的时候，打散开来，就是一个id对应有n个类别，那么就拆分成n行

df3 = pd.read_excel('./sql数据处理与提取-0326.xlsx', sheet_name='完全分开')
df3[['deviceid', 'interests_news', 'interests_score']]

	deviceid	interests_news	interests_score
0	65762973	体育/冰雪运动	0.641
1	774830731	历史/古代史	1.000
2	65762973	人文/人文科普	0.584
3	124901984	体育/NBA	0.566
4	874657455	体育/乒乓球	0.207

处理的思路

先根据逗号将 categoryinterest 一行一行的分出来，可以使用explode函数
其次对分出来的部分（比如：娱乐/综艺:0.472）根据冒号进行切分，形成两列
最后选择所需要的列

tips：

最后都加一下groupby 防止重复的数据也就相当于起到了去重的作用。测试一下Python相同两列的groupby 是否只有一行了？

测试Python中groupby

df_test = pd.DataFrame({'a': [1,1,3,4], 'a': [1,1,3,4], 
                        'b':[1,1,24,2], 'd': [2,2,8,1]})
df_test

	a	b	d
0	1	1	2
1	1	1	2
2	3	24	8
3	4	2	1

dfg = df_test.groupby('a')
for d in dfg:
    print(d[0])
    print(d[1])

1
   a  b  d
0  1  1  2
1  1  1  2
3
   a   b  d
2  3  24  8
4
   a  b  d
3  4  2  1

结论：

SQL的groupby之后即使两列相同也都还会保留
那问题来了？如果有缺失值会保留吗？测试一下

测试含有缺失值的groupby

import numpy as np
df_test = pd.DataFrame({'a': [1,1,3,4,np.nan], 'a': [1,1,3,4,np.nan], 
                        'b':[1,1,24,2,np.nan], 'd': [2,2,8,1,np.nan],
                        })
df_test

	a	b	d
0	1.0	1.0	2.0
1	1.0	1.0	2.0
2	3.0	24.0	8.0
3	4.0	2.0	1.0
4	NaN	NaN	NaN

dfg = df_test.groupby('a')
for d in dfg:
    print(d[0])
    print(d[1])

1.0
     a    b    d
0  1.0  1.0  2.0
1  1.0  1.0  2.0
3.0
     a     b    d
2  3.0  24.0  8.0
4.0
     a    b    d
3  4.0  2.0  1.0

结论：

groupby会自动过滤掉含有缺失值的行！就不在一起去考虑了！

正题-如何上面说的思路？

首先explode一行每个逗号之间的内容

select
deviceid,categoryinterest,interests1
from
tb 
lateral view explode(split(categoryinterest,',')) tb1 as interests1 
# explode就是将hive一行中复杂的array或者map结构拆分成多行
# lateral view与explode等udtf就是天生好搭档，explode将复杂结构一行拆成多行，然后再用lateral view做各种聚合。
where
day=20190101
group by
deviceid,categoryinterest,interests1
limit 5

核心点：

explode函数以及 lateral view 联合使用
最后得加上group by 起到去重的作用

实现效果如下

df2 = pd.read_excel('./sql数据处理与提取-0326.xlsx', sheet_name='explode后')
df2

	deviceid	categoryinterest	interests1
0	65762973	军事:1.0,娱乐:0.917,军事/装备:0.916,娱乐/明星八卦:0.874,历史/古...	体育/冰雪运动:0.641
1	774830731	历史/古代史:1.0,历史:0.5	历史/古代史:1.0
2	65762973	军事:1.0,娱乐:0.917,军事/装备:0.916,娱乐/明星八卦:0.874,历史/古...	人文/人文科普:0.584
3	124901984	情感/两性:1.0,历史/古代史:0.933,娱乐:0.901,情感:0.742,体育/乒乓...	体育/NBA:0.566
4	874657455	体育/国际足球:9.063,体育/中国足球:3.019,娱乐/综艺:1.947,体育/NBA...	体育/乒乓球:0.207

总结：

explode就是将sql或者hive一行中复杂的array或者map结构拆分成多行。
lateral view（侧视图）的意义是配合explode（或者其他的UDTF），一个语句生成把单行数据拆解成多行后的数据结果集。形成的虚拟表tb1 相当于再和原表联结形成了新表interests1
直接使用explode 一次只能生成一个字段要生成多列就是使用lateral view（侧视图）

参考资料1：https://blog.csdn.net/guodong2k/article/details/79459282

参考资料2：https://blog.csdn.net/bitcarmanlee/article/details/51926530

其次对于explode之后的再进行interests1的切割

select
deviceid,categoryinterest,interests1,split(interests1, ':')[0] as interests_news,split(interests1, ':')[1] as interests_score
from
    (select
    deviceid,categoryinterest,interests1
    from
    table
    lateral view explode(split(categoryinterest,',')) tb1 as interests1 
    # explode就是将hive一行中复杂的array或者map结构拆分成多行
    # lateral view与explode等udtf就是天生好搭档，explode将复杂结构一行拆成多行，然后再用lateral view做各种聚合。
    # tb1相当于虚拟表
    where
    day=20190313
    group by
    deviceid,categoryinterest,interests1
    )t1
limit 5

实现效果如下

df3

	deviceid	categoryinterest	interests1	interests_news	interests_score
0	65762973	军事:1.0,娱乐:0.917,军事/装备:0.916,娱乐/明星八卦:0.874,历史/古...	体育/冰雪运动:0.641	体育/冰雪运动	0.641
1	774830731	历史/古代史:1.0,历史:0.5	历史/古代史:1.0	历史/古代史	1.000
2	65762973	军事:1.0,娱乐:0.917,军事/装备:0.916,娱乐/明星八卦:0.874,历史/古...	人文/人文科普:0.584	人文/人文科普	0.584
3	124901984	情感/两性:1.0,历史/古代史:0.933,娱乐:0.901,情感:0.742,体育/乒乓...	体育/NBA:0.566	体育/NBA	0.566
4	874657455	体育/国际足球:9.063,体育/中国足球:3.019,娱乐/综艺:1.947,体育/NBA...	体育/乒乓球:0.207	体育/乒乓球	0.207

总结：

使用的是split函数，根据冒号进行切分
其余情况类似

最后输出结果进行后续分析

封装成一个整体的sql语句

select
	actusers.device_uuid,interests.interests_news,interests.interests_score
from	

# 表1	
	(
	select
	day,app_id,device_uuid
	from
	table1 
	# 下面限定条件是为了取出一部分数据看看
	where
	day=20190101
	and app_id in ('2x1kfBk63z','2S5Wcx') 
	and current_channel_group='XXXX' 
	group by
	day,app_id,device_uuid 
	)actusers	

# 表2	
left join
	
	(select
	deviceid,interests_news,interests_score
	from
	(select
		deviceid,categoryinterest,interests1,split(interests1, ':')[0] as interests_news,split(interests1, ':')[1] as interests_score
	from
		(select
		deviceid,categoryinterest,interests1
		from
		table2
		lateral view explode(split(categoryinterest,',')) tb1 as interests1 
		# explode就是将hive一行中复杂的array或者map结构拆分成多行
		# lateral view与explode等udtf就是天生好搭档，explode将复杂结构一行拆成多行，然后再用lateral view做各种聚合。
		where
		day=20190101
		group by
		deviceid,categoryinterest,interests1
		)t1
	)t2
	where
	interests_score>=0.5 and interests_score<=1
	group by
	deviceid,interests_news,interests_score
	)interests
on	
	actusers.device_uuid=interests.deviceid
group by
	actusers.device_uuid,interests.interests_news,interests.interests_score

整体总结

多重SQL嵌套从最里面的select往外看
explode用于将一个id对应多个内容想把其分开成每一行的时候，但是explode只能生成一列，如果想多列，使用lateral view 侧视图
split函数是一个很好用的函数一次是用在explode的时候，即根据逗号进行explode，另一次是用在select语句中，是直接生成新的变量，需要根据select的结果来进行变量的赋值！
sql中的去重一个是可以用distinct 一个是用groupby

上一篇：一個有意思的rake脚本

下一篇： Vuex常用模块化使用方案

SQL | explode+lateral view和其余sql常用函数

目录

背景

数据长什么样？需求是什么呢？

原始数据

最终想要的数据

处理的思路

测试Python中groupby

测试含有缺失值的groupby

正题-如何上面说的思路？

首先explode一行每个逗号之间的内容

实现效果如下

其次对于explode之后的再进行interests1的切割

实现效果如下

最后输出结果进行后续分析

封装成一个整体的sql语句

整体总结

SQL分组排序语句

docker镜像+容器，常用命令

Java中使用WebUploader插件上传大文件单文件和多文件的方法小结

go语言常用api使用

26.Vuex在Composition API和非Composition API中结合Typescript的使用

Docker----制作一个带有JDK和tomcat的docker镜像

Java代码为例讲解堆的性质和基本操作以及排序方法

Vuex常用模块化使用方案

java编程常用技术(推荐)

String类及其常用方法