SQL | 解决一个实际问题
程序员文章站
2022-07-12 17:08:36
...
1 背景
最近实习工作遇到了一个数据提取中的问题,特此记录一下,其实当时完全可以用Python处理,不过想锻炼一下自己SQL能力,所以直接提数过程中一步到位!
2 需求
原始部分数据长这样:
app_id | device_uuid | first_start_bg | |
---|---|---|---|
0 | 2S5Wcx | 000101E6-463B-4912-8FC0-575C430A514D | app |
1 | 2S5Wcx | 00018582-CA28-43AC-A241-53461789673E | app |
2 | 2S5Wcx | 0001EC8F-FBF3-4E68-B6A1-27BE87984077 | app |
3 | 2S5Wcx | 0001FEF0-B860-4A29-96B8-7CF20F508BBB | push |
4 | 2S5Wcx | 000277A9-2276-4D6C-B03F-1C9A815A07AE | app |
5 | 2S5Wcx | 00028895-C316-4A8A-AE15-FC3F48E410B3 | push |
6 | 2S5Wcx | 0002A870-AB1D-4FE0-BB34-BF158F901465 | app |
7 | 2S5Wcx | 0005542B-EEC7-462A-912E-253B74643687 | app |
8 | 2S5Wcx | D00FD1E1-B885-4EA3-9049-0748841EE17D | app |
9 | 2S5Wcx | D02CDB1C-748D-4973-B269-740D27713B64 | app |
10 | 2S5Wcx | 00008F69-3419-4677-8AEC-B9AAC66C16FE | app |
- 其中first_start_bg这个字段表示:用户首次启动app是从哪一个来源来的,有app,push等等
- mentor想要实现的是:**统计一个用户首次启动不同来源的次数!**作为新的变量!
3 思路
- 试想,如果在已有的表格加上一列count计数,即一个用户首次启动的来源,以及对应一个时间段内对应的次数!这时候问题是不是就能解决了?
- 一旦有了count列了,这个问题就变成了行转列的问题!参考之前博客文章:SQL | 行转列和列转行
那如何统计次数呢?
- 可以考虑根据对分析的目标字段first_start_bg进行groupby! 【不同场景这个变量会有差异】
- 分组之后统计每一个组内day的不同个数,即加个distinct,因为我们统计的单位是以天为单位!【不同场景这个变量会有差异】
- 不同first_start_bg对应的次数就等于第二步统计的day的个数!于是大功告成!
4 SQL实现
4.1 第一步
select
device_uuid,
app_id,
first_start_bg,
count(distinct day) as first_start_day_count
from
(
select
day,
app_id,
device_uuid,
first_start_bg
from
t1
where
day>=20190304 and day<=20190310
and app_id in ('2x1kfBk63z','2S5Wcx')
)a
group by
device_uuid,
app_id,
first_start_bg
结果为:
device_uuid | first_start_bg | first_start_day_count | |
---|---|---|---|
0 | 641004239 | app | 1 |
1 | 1145635593 | push | 2 |
2 | 000001F6-1F32-4E4A-B74C-8CD740572643 | push | 5 |
3 | 0000EFAB-A42B-4D6B-A4F0-8F02D84B3C89 | app | 7 |
4 | 00019AA3-617F-48DC-AC52-33436C6A13EF | app | 3 |
5 | 0002476A-F198-4A15-9A4E-E3D0638DA51C | app | 4 |
6 | 000262ED-2E7C-4D97-BA1E-07EBE7FB1DFC | app | 7 |
7 | 0002F2E3-4D36-4D71-AAD3-438A6860CBF2 | app | 4 |
8 | 00041656-4C0B-449F-8318-B4D99E16A259 | app | 1 |
9 | 0005F1A4-9CA5-4599-843C-AEF4C955C43F | push | 2 |
4.2 第二步
select
device_uuid,
app_id,
sum(case when first_start_bg = 'app' then first_start_day_count else 0 end) as app_first_count,
sum(case when first_start_bg = 'push' then first_start_day_count else 0 end) as push_first_count,
sum(case when first_start_bg = 'ug' then first_start_day_count else 0 end) as ug_first_count,
sum(case when first_start_bg = 'back' then first_start_day_count else 0 end) as back_first_count
from
(
select
device_uuid,
app_id,
first_start_bg,
count(distinct day) as first_start_day_count
from
(
select
day,
app_id,
device_uuid,
first_start_bg
from
edm_user_device_activity_day
where
day>=20190304 and day<=20190310
and app_id in ('2x1kfBk63z','2S5Wcx')
)a
group by
device_uuid,
app_id,
first_start_bg
)b
group by
device_uuid,
app_id
结果为:
device_uuid | app_first_count | push_first_count | ug_first_count | back_first_count | |
---|---|---|---|---|---|
0 | 00.0c.e7.41.0a.01 | 1 | 0 | 0 | 0 |
1 | 6446738 | 0 | 0 | 5 | 0 |
2 | 00067E04-C950-4E27-BFC2-BAF3C0D9A31C | 3 | 0 | 0 | 0 |
3 | 0007602B-F75D-4F85-B3A4-B985FF8B76C4 | 2 | 1 | 0 | 0 |
4 | 0007C915-BF4C-4136-911A-6D328F2EED65 | 7 | 0 | 0 | 0 |
5 | 0007E5CA-E956-433C-997E-78544138318E | 1 | 0 | 0 | 0 |
6 | 00082AA8-C9F6-4D33-A8A1-6E8C9B6B7DCB | 4 | 0 | 0 | 0 |
7 | 00082E68-0D8E-464B-BF43-2DACA0151C75 | 1 | 0 | 0 | 0 |
大功告成!所以就两步解决问题!思路理清了实现就不是问题!
5 参考资料
上一篇: Struts 2 环境搭建
下一篇: 分布式系统的基石——ZooKeeper