20200915电商数据分析 - 偏逻辑思维(一)
程序员文章站
2022-04-06 23:37:24
...
这是一个电商数据分析题目,可以用excel做,可以用sql做,可以用python做
题目如下:
数据源如下:
链接:https://pan.baidu.com/s/1gM2NoS_1SxUCgPbf-9mgYg
提取码:9zrs
Excel篇:其他人用数据透析表做的
SQL篇:我遇到一些问题,重新用pyhton做
Python篇:如下
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('数据源.csv')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3057: DtypeWarning: Columns (1,2,3,4,7,8) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
df = data.copy()
data.dropna(how="all",inplace=True) #删除全部是Null值的行
data.drop_duplicates(inplace=True) #删除重复行
data.head()
id | week | date | order_d | order_e | quantity | amount | phone | account | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | NaN | 2017/6/8 | GU20170608923396-1 | GU20170608923396 | 1.0 | 2750.0 | 1500026566 | aaa@qq.com |
1 | 2.0 | NaN | 2017/6/8 | GU20170608923428-1 | GU20170608923428 | 1.0 | 2250.0 | 16201251750 | aaa@qq.com |
2 | 3.0 | NaN | 2017/6/8 | GU20170608923430-1 | GU20170608923430 | 1.0 | 3900.0 | 15010529091 | aaa@qq.com |
3 | 4.0 | NaN | 2017/6/8 | GU20170608923431-1 | GU20170608923431 | 1.0 | 3300.0 | 12621601625 | aaa@qq.com |
4 | 5.0 | NaN | 2017/6/8 | GU20170608923435-1 | GU20170608923435 | 1.0 | 2250.0 | 12917062526 | aaa@qq.com |
data.date = pd.to_datetime(df.date,format='%Y-%m-%d')
data.phone = df.phone.astype('str')
data.describe()
id | quantity | amount | |
---|---|---|---|
count | 293657.000000 | 293657.000000 | 293657.000000 |
mean | 146829.000000 | 1.008050 | 4533.377958 |
std | 84771.618337 | 0.259776 | 3904.373436 |
min | 1.000000 | 1.000000 | 350.000000 |
25% | 73415.000000 | 1.000000 | 1750.000000 |
50% | 146829.000000 | 1.000000 | 3650.000000 |
75% | 220243.000000 | 1.000000 | 6450.000000 |
max | 293657.000000 | 62.000000 | 71500.000000 |
len(data)
293657
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 293657 entries, 0 to 293656
Data columns (total 9 columns):
id 293657 non-null float64
week 5491 non-null object
date 293657 non-null datetime64[ns]
order_d 293657 non-null object
order_e 293657 non-null object
quantity 293657 non-null float64
amount 293657 non-null float64
phone 293657 non-null object
account 254716 non-null object
dtypes: datetime64[ns](1), float64(3), object(5)
memory usage: 22.4+ MB
data.isnull().sum() #查看列缺失值
id 0
week 288166
date 0
order_d 0
order_e 0
quantity 0
amount 0
phone 0
account 38941
dtype: int64
- 列account有缺失值,Week只有本周有数值,之前没有数值
len(data.account.value_counts()) #唯一account一共有199179
199179
len(data.phone.value_counts()) #唯一phone一共有234490,这里phone表示下单,account表示注册,下单用户大于注册用户正常
234490
df_week = data[data.week == 'Sep W36'] #Sep W36一共有5491条数据
len(df_week.phone.value_counts()) #Sep W36品牌周,一共有5054个用户购买,这里以一个手机号是一个用户来统计用户个数
5054
len(df_week)
5491
len(df_history)
288166
df_history = data[data.week != 'Sep W36']#品牌周之前的数据,一共288166条消费记录
len(df_history.phone.value_counts())
230260
第一部分:
- 本周新客人数(开店至今本周第一次购买):4230人
- 本周一共有5491条消费记录,用户5054人,新客4230,老客824人,新客消费4483条,老客消费1008条
- 本周新客消费记录保存在df_week,本周老客消费记录保存在df_history
df2 = pd.merge(df_week,df_history,how='left',on='phone') #_x为week的数据,_y为history数据
df2.head()
id_x | week_x | date_x | order_d_x | order_e_x | quantity_x | amount_x | phone | account_x | id_y | week_y | date_y | order_d_y | order_e_y | quantity_y | amount_y | account_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 288167.0 | Sep W36 | 2020-08-30 | GU202008291446552-1 | GU202008291446552 | 1.0 | 6900.0 | 15120002542 | 15120002542 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN |
1 | 288168.0 | Sep W36 | 2020-08-30 | GU202008291446404-1 | GU202008291446404 | 1.0 | 10200.0 | 15941716566 | 15714177539 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN |
2 | 288169.0 | Sep W36 | 2020-08-30 | GU202008291447049-1 | GU202008291447049 | 1.0 | 6900.0 | 12721266779 | 12721236779 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN |
3 | 288170.0 | Sep W36 | 2020-08-30 | GU202008291446576-1 | GU202008291446576 | 1.0 | 2900.0 | 16091666427 | 13091333427 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN |
4 | 288171.0 | Sep W36 | 2020-08-30 | GU202008291446393-1 | GU202008291446393 | 1.0 | 14300.0 | 17627600559 | 17627600559 | 95877.0 | NaN | 2019-08-26 | GU201908261110787-1 | GU201908261110787 | 1.0 | 10500.0 | 17627600559 |
len(df2)
10519
df2.id_y.isnull().sum()
4483
- 把于本周第一次消费的数据提取出来,保存在df_sep_first
- 本周合计销售金额:25653800
- 本周新客销售金额:20929940
- 本周新客销售占比 81.59%
df_week.amount.sum()
25653800.0
df_sep_first = df2[df2.id_y.isnull()]
len(df_sep_first)
#df_sep_first = df_sep_first[['id_x','week_x','date_x','order_d_x','order_e_x','amount_x','phone','account_x']]
#df_sep_first.head()
4483
df_sep_first.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4483 entries, 0 to 10518
Data columns (total 17 columns):
id_x 4483 non-null float64
week_x 4483 non-null object
date_x 4483 non-null datetime64[ns]
order_d_x 4483 non-null object
order_e_x 4483 non-null object
quantity_x 4483 non-null float64
amount_x 4483 non-null float64
phone 4483 non-null object
account_x 3961 non-null object
id_y 0 non-null float64
week_y 0 non-null object
date_y 0 non-null datetime64[ns]
order_d_y 0 non-null object
order_e_y 0 non-null object
quantity_y 0 non-null float64
amount_y 0 non-null float64
account_y 0 non-null object
dtypes: datetime64[ns](2), float64(6), object(9)
memory usage: 630.4+ KB
df_sep_first.amount_x.sum()
20929940.0
- 本周新客会员人数,即在df_sep_first中,account有多少个,3961
df_sep_first.account_x.count()
3961
第二部分:
- 近一年累计消费金额≥¥68,000的总人数:202 #创建满足1的数据框df_oneyear2
- 满足1条件的本周产生购买的新产生客人数:
- 满足1条件的本周产生购买的原达标客人数:
- 满足上述2个条件的人数消费总金额:
data.date.max()
Timestamp('2020-09-05 00:00:00')
df_oneyear1 = data[(data.date < '2019-09-05')]
(df_oneyear1.pivot_table(index='phone',values='amount',aggfunc='sum').amount > 68000).sum()
202
- 下面把202条数据提取出来到df_oneyear2
df_oneyear2 = df_oneyear1.pivot_table(index='phone',values='amount',aggfunc='sum')
df_oneyear2 = df_oneyear2[df_oneyear2.amount > 68000]
len(df_oneyear2)
202
思路:
- 难点在于,这里是聚合求和,满足近1年消费大于68000的聚合求和,已经存到df_oneyear2
- 本周数据df_week,其中新客df_sep_first,减去之后即为本周消费的老客,这里我们先看本周之内全部满足1条件的人数
- 先对df_week进行聚合求和,然后再和df_oneyear2相交,查到本周内全部累计消费》68000的用户
- 这里易错点,拿df_week直接和df_oneyear2相交,结果不对,因为没有聚合求和
# df_week.date 结果是2020-08-30至2020-09-05,不需要考虑1年以内这个条件
df_week.pivot_table(index='phone',values='amount',aggfunc='sum').sort_values(by='amount',ascending=False).head()
amount | |
---|---|
phone | |
12721265027 | 67600.0 |
12611996246 | 49000.0 |
15626452464 | 42500.0 |
16771162166 | 37000.0 |
16261501212 | 34500.0 |
- 结果发现本周之内累计消费最高67600,都不满足68000,所以第2和第3都是0人
第三部分
- 近一年有多次消费,且至少有一笔消费金额≥¥18,000
- 近一年仅一次消费,且该笔消费金额≥¥18,000
- 近一年累计消费金额<¥68,000
- 满足上述条件的总人数:
- 满足1条件的本周产生购买的新产生客人数:
- 满足1条件的本周产生购买的原达标客人数:
- 满足上述2个条件的人数消费总金额:
思路:第一题的总人数,先求出3个条件每一个满足条件的人数,再求和
df_oneyear1.head()
id | week | date | order_d | order_e | quantity | amount | phone | account | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | NaN | 2017-06-08 | GU20170608923396-1 | GU20170608923396 | 1.0 | 2750.0 | 1500026566 | aaa@qq.com |
1 | 2.0 | NaN | 2017-06-08 | GU20170608923428-1 | GU20170608923428 | 1.0 | 2250.0 | 16201251750 | aaa@qq.com |
2 | 3.0 | NaN | 2017-06-08 | GU20170608923430-1 | GU20170608923430 | 1.0 | 3900.0 | 15010529091 | aaa@qq.com |
3 | 4.0 | NaN | 2017-06-08 | GU20170608923431-1 | GU20170608923431 | 1.0 | 3300.0 | 12621601625 | aaa@qq.com |
4 | 5.0 | NaN | 2017-06-08 | GU20170608923435-1 | GU20170608923435 | 1.0 | 2250.0 | 12917062526 | aaa@qq.com |
a = df_oneyear1[df_oneyear1.amount > 18000]
a.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1438 entries, 51 to 98105
Data columns (total 9 columns):
id 1438 non-null float64
week 0 non-null object
date 1438 non-null datetime64[ns]
order_d 1438 non-null object
order_e 1438 non-null object
quantity 1438 non-null float64
amount 1438 non-null float64
phone 1438 non-null object
account 1096 non-null object
dtypes: datetime64[ns](1), float64(3), object(5)
memory usage: 112.3+ KB
a.phone.value_counts() > 1
#遇到问题,不知道怎么计算大于1的数量,待查
16262121224 True
17621541994 True
12647119726 True
16129000006 True
16652119121 True
12064295506 True
12520221500 True
16964522200 True
16521512766 True
16612052679 True
16996176222 True
12500226762 True
15222227475 True
16201967491 True
12405291559 True
16570712222 True
15267120909 True
15210924674 True
16606609226 True
16552005664 True
16552401027 True
16752720064 True
15625676110 True
16916976602 True
15210951271 True
12676106654 True
16916055229 True
16264169746 True
12610649665 True
12662560267 True
...
16466210740 False
12600679612 False
12001729520 False
12420052226 False
15907952196 False
15625229997 False
16921950066 False
12696912090 False
16979201079 False
12100221766 False
16201609226 False
16261646692 False
16611225626 False
15927546424 False
15912621004 False
17600446922 False
12260064657 False
16727962992 False
16962049492 False
19921426110 False
16217566946 False
12210562700 False
16296267629 False
15144694469 False
16166279955 False
12970221916 False
16477045216 False
15292277991 False
15207910610 False
15157020292 False
Name: phone, Length: 1329, dtype: bool
下一篇: mysql(1)-逻辑架构