电商数据分析案例
程序员文章站
2024-03-07 17:55:27
...
题目:电商数据分析案例
数据集来源
本数据集来自kaggle,数据集链接如下:
https://www.kaggle.com/carrie1/ecommerce-data
分析目的:了解该电商平台的销售情况(该网站是一个以批发为主的电商平台)
分析思路框架:
1、产品维度
2、顾客维度
3、时间维度
查看数据集信息
import pandas as pd
import os
import matplotlib.pyplot as plt
os.chdir("F:\jupyter\E-commerce-data")
data = pd.read_csv("data.csv",encoding="ISO-8859-1")
data.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL * | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
数据集共有8个变量,分别为订单编号、产品编号、产品描述、购买数量、购买时间、产品单价、顾客ID、顾客所在地区
数据预处理
数据缺失值处理
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null object
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
更改ID的数据类型
data["CustomerID"] = data["CustomerID"].astype("str")
产品描述和顾客ID有缺失变量,由于本次分析不涉及产品描述,故将其删掉。缺失的顾客ID用unknown表示
data = data.drop(["Description"],axis = 1)
data["CustomerID"] = data["CustomerID"].fillna("unknown")
处理重复值
data.duplicated().sum()
5270
data = data.drop_duplicates()
处理异常值
data.describe()
Quantity | UnitPrice | |
---|---|---|
count | 536639.000000 | 536639.000000 |
mean | 9.619500 | 4.632660 |
std | 219.130206 | 97.233299 |
min | -80995.000000 | -11062.060000 |
25% | 1.000000 | 1.250000 |
50% | 3.000000 | 2.080000 |
75% | 10.000000 | 4.130000 |
max | 80995.000000 | 38970.000000 |
产品数量为负表示退货,产品单价为负没有意义,为异常值
data[data["UnitPrice"]<0].count()
InvoiceNo 0
StockCode 0
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 0
Country 0
dtype: int64
data = data[data["UnitPrice"]>=0]
data[data["UnitPrice"]==0].count()
InvoiceNo 2515
StockCode 2515
Quantity 2515
InvoiceDate 2515
UnitPrice 2515
CustomerID 2515
Country 2515
dtype: int64
单价为0的产品可能是赠品
变量处理
data["Total"]=data["Quantity"]*data["UnitPrice"]
data[["Date","Time"]]=data["InvoiceDate"].str.split(" ",expand=True)
data[["Month","Day","Year"]] = data["Date"].str.split("/",expand=True)
data["Hour"]=data["Time"].str.split(":",expand=True)[0].astype("int")
将数据集分为两部分,一部分是产品购买信息,一部分是退货信息
data_buy = data[data["Quantity"] >= 0]
data_return = data[data["Quantity"] < 0]
数据分析
PART ONE 产品维度
Question 1: 最热卖的产品是哪些?退货最多的产品是哪些?
product_quantity=data["Quantity"].groupby(data["StockCode"]).sum().sort_values(ascending=False)
product_quantity[:20]
StockCode
22197 56427
84077 53751
85099B 47260
85123A 38811
84879 36122
21212 36016
23084 30597
22492 26437
22616 26299
21977 24719
22178 23825
17003 23017
15036 22552
21915 22036
22386 20992
23203 19695
22086 18876
84946 18874
20725 18858
84991 18015
Name: Quantity, dtype: int64
product_quantity[-20:]
StockCode
D -1194
23113 -1232
35610A -1267
23115 -1288
23114 -1294
84598 -1313
23117 -1348
23116 -1364
22351 -1387
23270 -1616
22618 -1632
79323B -1671
79323P -2007
23059 -2376
72732 -2472
79323LP -2618
79323W -4838
72140F -5368
23003 -8516
23005 -14468
Name: Quantity, dtype: int64
Question 2: 商品的价格分布如何?价格与销量有何关系?
data_unique_stock=data.drop_duplicates(["StockCode"])
data_unique_stock["UnitPrice"].describe()
count 4070.000000
mean 6.905278
std 173.775142
min 0.000000
25% 1.250000
50% 2.510000
75% 4.250000
max 11062.060000
Name: UnitPrice, dtype: float64
商品价格平均数是6.9,中位数是2.51,中位数>平均数,是一个右偏分布,也就是说该网站售卖的绝大部分都是低价商品,极少部分商品价格高昂,导致商品价格的标准差较大,总之该电商网站以批发低价商品为主
price_bin=pd.cut(data_unique_stock["UnitPrice"],bins=[0,1,2,3,4,5,6,7,10,15,20,25,30,100,10000,20000,30000,40000]).value_counts().sort_index()
price_per=price_bin/price_bin.sum()
price_cumper=price_bin.cumsum()/price_bin.sum()
pd.concat([price_per,price_cumper],axis=1)
UnitPrice | UnitPrice | |
---|---|---|
(0, 1] | 0.181323 | 0.181323 |
(1, 2] | 0.231907 | 0.413230 |
(2, 3] | 0.190143 | 0.603372 |
(3, 4] | 0.078859 | 0.682231 |
(4, 5] | 0.089235 | 0.771466 |
(5, 6] | 0.050324 | 0.821790 |
(6, 7] | 0.023606 | 0.845396 |
(7, 10] | 0.087419 | 0.932815 |
(10, 15] | 0.037095 | 0.969909 |
(15, 20] | 0.012970 | 0.982879 |
(20, 25] | 0.004150 | 0.987030 |
(25, 30] | 0.004929 | 0.991958 |
(30, 100] | 0.005966 | 0.997925 |
(100, 10000] | 0.001816 | 0.999741 |
(10000, 20000] | 0.000259 | 1.000000 |
(20000, 30000] | 0.000000 | 1.000000 |
(30000, 40000] | 0.000000 | 1.000000 |
具体来看,商品价格在(1,2]的价格区间里是最多的,占23.19%;商品价格在10元以内的占了93.28%,在100元以内的占了99.79%。
cut=pd.cut(data["UnitPrice"],bins=[0,1,2,3,4,5,6,7,10,15,20,25,30,100,10000,20000,30000,40000])
quantity_price=data["Quantity"].groupby(cut).sum()
quantity_price.cumsum()/quantity_price.sum()
UnitPrice
(0, 1] 0.391199
(1, 2] 0.716792
(2, 3] 0.855267
(3, 4] 0.900399
(4, 5] 0.950368
(5, 6] 0.962797
(6, 7] 0.970164
(7, 10] 0.988817
(10, 15] 0.997159
(15, 20] 0.998879
(20, 25] 0.999407
(25, 30] 0.999606
(30, 100] 0.999869
(100, 10000] 1.000001
(10000, 20000] 1.000000
(20000, 30000] 1.000000
(30000, 40000] 1.000000
Name: Quantity, dtype: float64
(quantity_price.cumsum()/quantity_price.sum()).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x26a786bd278>
71.68%的商品销量集中在价格在2元以内的商品;99.7%的商品销量集中的价格在15元以内的商品
Question 3:退货商品的价格分布与退货量情况是怎样的?
data_return_unique_stock=data_return.drop_duplicates("StockCode")
return_cut=pd.cut(data_return_unique_stock["UnitPrice"],bins=[0,1,2,3,4,5,10,15,20,50,100,200,500,1000]).value_counts().sort_index()
pd.concat([return_cut/return_cut.sum(),return_cut.cumsum()/return_cut.sum()],axis=1)
UnitPrice | UnitPrice | |
---|---|---|
(0, 1] | 0.227027 | 0.227027 |
(1, 2] | 0.283243 | 0.510270 |
(2, 3] | 0.172432 | 0.682703 |
(3, 4] | 0.070270 | 0.752973 |
(4, 5] | 0.079459 | 0.832432 |
(5, 10] | 0.122162 | 0.954595 |
(10, 15] | 0.027027 | 0.981622 |
(15, 20] | 0.006486 | 0.988108 |
(20, 50] | 0.007027 | 0.995135 |
(50, 100] | 0.002703 | 0.997838 |
(100, 200] | 0.001081 | 0.998919 |
(200, 500] | 0.000541 | 0.999459 |
(500, 1000] | 0.000541 | 1.000000 |
退货商品最多的价格区间是(1,2]元,退货商品价格在10元以内的占了95.46%。
(return_cut/return_cut.sum()).plot()
(price_bin/price_bin.sum()).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x26a007ae898>
红色是退货商品的价格分布,蓝色是全部商品的价格分布。从退货商品种类的角度,退货商品与总体商品相比,更集中在0-3元与5-10元这两个区间
return_quantity_price=data_return["Quantity"].groupby(pd.cut(data_return["UnitPrice"],bins=[0,1,2,3,4,5,10,15,20,50,100,200,500,1000])).sum()
return_quantity_price/return_quantity_price.sum()
UnitPrice
(0, 1] 0.161663
(1, 2] 0.405996
(2, 3] 0.371273
(3, 4] 0.014244
(4, 5] 0.016029
(5, 10] 0.022745
(10, 15] 0.005848
(15, 20] 0.000672
(20, 50] 0.000715
(50, 100] 0.000232
(100, 200] 0.000203
(200, 500] 0.000243
(500, 1000] 0.000138
Name: Quantity, dtype: float64
(return_quantity_price/return_quantity_price.sum()).plot()
(quantity_price/quantity_price.sum()).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x26a03035cf8>
1-10元的商品退货量会偏高一些
PART TWO 顾客维度
Question 4 :哪些顾客购买金额最多?哪些顾客购买次数最频繁?如何根据这两个变量对顾客分类?
customer_total=data_buy["Total"].groupby(data_buy["CustomerID"]).sum().sort_values(ascending=False)
customer_total[:20]
CustomerID
nan 1.754902e+06
14646.0 2.802060e+05
18102.0 2.596573e+05
17450.0 1.943908e+05
16446.0 1.684725e+05
14911.0 1.437112e+05
12415.0 1.249145e+05
14156.0 1.172101e+05
17511.0 9.106238e+04
16029.0 8.085084e+04
12346.0 7.718360e+04
16684.0 6.665356e+04
14096.0 6.516479e+04
13694.0 6.503962e+04
15311.0 6.063275e+04
13089.0 5.876208e+04
17949.0 5.851048e+04
15769.0 5.625272e+04
15061.0 5.453414e+04
14298.0 5.152730e+04
Name: Total, dtype: float64
customer_buy_fre=data_buy.drop_duplicates(["InvoiceNo"])["InvoiceNo"].groupby(data_buy["CustomerID"]).count().sort_values(ascending=False)
customer_buy_fre[:20]
CustomerID
nan 2190
12748.0 210
14911.0 201
17841.0 124
13089.0 97
14606.0 93
15311.0 91
12971.0 86
14646.0 74
16029.0 63
13408.0 62
18102.0 60
13798.0 57
14527.0 55
14156.0 55
16422.0 51
13694.0 50
15061.0 48
15039.0 47
16013.0 47
Name: InvoiceNo, dtype: int64
customer_total.drop("nan",inplace=True)
customer_buy_fre.drop("nan",inplace=True)
plt.style.use("ggplot")
plt.scatter(customer_buy_fre.sort_index(),customer_total.sort_index())
plt.axhline(y=50000,c="green")
plt.axvline(x=50,c="blue")
<matplotlib.lines.Line2D at 0x26a6c5050b8>
根据顾客购买次数和购买金额,对顾客进行分组,采取不同的营销措施。
Question 5:订单的商品种类分布是怎样的?
order_type=data_buy["StockCode"].groupby(data_buy["InvoiceNo"]).count().sort_values(ascending=False)
pd.cut(order_type,bins=[0,50,100,150,200,250,300,500,1000,1200]).value_counts()
(0, 50] 18552
(50, 100] 1522
(100, 150] 250
(150, 200] 165
(200, 250] 79
(300, 500] 66
(500, 1000] 48
(250, 300] 43
(1000, 1200] 1
Name: StockCode, dtype: int64
未退货的顾客中,大部分的订单商品种类在100以内。
order_type_r=data_return["StockCode"].groupby(data_return["InvoiceNo"]).count().sort_values(ascending=False)
pd.cut(order_type_r,bins=[0,10,20,30,40,50,60,150]).value_counts()
(0, 10] 5068
(10, 20] 73
(20, 30] 19
(30, 40] 9
(60, 150] 1
(50, 60] 1
(40, 50] 1
Name: StockCode, dtype: int64
退货顾客的商品种类大部分都在10以内
Question 6:哪些国家的顾客消费占比最大?哪些国家的顾客数量最多?哪些国家的顾客平均消费最高?
data
InvoiceNo | StockCode | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Total | Date | Time | Month | Day | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom | 15.30 | 12/1/2010 | 8:26 | 12 | 1 | 2010 |
1 | 536365 | 71053 | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 | 12/1/2010 | 8:26 | 12 | 1 | 2010 |
2 | 536365 | 84406B | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom | 22.00 | 12/1/2010 | 8:26 | 12 | 1 | 2010 |
3 | 536365 | 84029G | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 | 12/1/2010 | 8:26 | 12 | 1 | 2010 |
4 | 536365 | 84029E | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 | 12/1/2010 | 8:26 | 12 | 1 | 2010 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
541904 | 581587 | 22613 | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France | 10.20 | 12/9/2011 | 12:50 | 12 | 9 | 2011 |
541905 | 581587 | 22899 | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France | 12.60 | 12/9/2011 | 12:50 | 12 | 9 | 2011 |
541906 | 581587 | 23254 | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France | 16.60 | 12/9/2011 | 12:50 | 12 | 9 | 2011 |
541907 | 581587 | 23255 | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France | 16.60 | 12/9/2011 | 12:50 | 12 | 9 | 2011 |
541908 | 581587 | 22138 | 3 | 12/9/2011 12:50 | 4.95 | 12680.0 | France | 14.85 | 12/9/2011 | 12:50 | 12 | 9 | 2011 |
536637 rows × 13 columns
country_total = data["Total"].groupby(data["Country"]).sum().sort_values(ascending=False)
country_customer=data["CustomerID"].groupby(data["Country"]).count().sort_values(ascending=False)
country_total_percent=(country_total/country_total.sum())
country_total_percent
Country
United Kingdom 0.840075
Netherlands 0.029203
EIRE 0.026980
Germany 0.022725
France 0.020243
Australia 0.014056
Switzerland 0.005782
Spain 0.005617
Belgium 0.004197
Sweden 0.003753
Japan 0.003626
Norway 0.003607
Portugal 0.003006
Finland 0.002290
Channel Islands 0.002060
Denmark 0.001925
Italy 0.001733
Cyprus 0.001319
Austria 0.001042
* 0.001016
Singapore 0.000936
Israel 0.000811
Poland 0.000740
Unspecified 0.000486
Greece 0.000483
Iceland 0.000442
Canada 0.000376
Malta 0.000257
United Arab Emirates 0.000195
USA 0.000178
Lebanon 0.000174
Lithuania 0.000170
European Community 0.000133
Brazil 0.000117
RSA 0.000103
Czech Republic 0.000073
Bahrain 0.000056
Saudi Arabia 0.000013
Name: Total, dtype: float64
country_customer_percent=country_customer/country_customer.sum()
country_customer_percent
Country
United Kingdom 0.913646
Germany 0.017666
France 0.015916
EIRE 0.015251
Spain 0.004711
Netherlands 0.004418
Belgium 0.003855
Switzerland 0.003716
Portugal 0.002814
Australia 0.002344
Norway 0.002024
Italy 0.001496
Channel Islands 0.001411
Finland 0.001295
Cyprus 0.001139
Sweden 0.000859
Unspecified 0.000824
Austria 0.000747
Denmark 0.000725
Japan 0.000667
Poland 0.000635
Israel 0.000548
USA 0.000542
* 0.000529
Singapore 0.000427
Iceland 0.000339
Canada 0.000281
Greece 0.000272
Malta 0.000237
United Arab Emirates 0.000127
European Community 0.000114
RSA 0.000108
Lebanon 0.000084
Lithuania 0.000065
Brazil 0.000060
Czech Republic 0.000056
Bahrain 0.000035
Saudi Arabia 0.000019
Name: CustomerID, dtype: float64
country_total_percent.drop("United Kingdom",inplace=True)
country_customer_percent.drop("United Kingdom",inplace=True)
plt.scatter(country_total_percent.sort_index(),country_customer_percent.sort_index())
<matplotlib.collections.PathCollection at 0x26a7e033c18>
data["Total"].groupby(data["Country"]).mean().sort_values(ascending=False)
Country
Netherlands 120.059696
Australia 108.910787
Japan 98.716816
Sweden 79.360976
Denmark 48.247147
Lithuania 47.458857
Singapore 39.827031
Lebanon 37.641778
Brazil 35.737500
* 34.888169
Norway 32.378877
Greece 32.263836
EIRE 32.135066
Finland 32.124806
Bahrain 28.863158
Switzerland 28.266324
United Arab Emirates 27.974706
Israel 26.877449
Channel Islands 26.520991
Austria 25.322494
Canada 24.280662
Iceland 23.681319
Czech Republic 23.590667
Germany 23.365978
France 23.102343
Spain 21.659822
European Community 21.176230
Poland 21.152903
Cyprus 21.045434
Italy 21.034259
Belgium 19.773301
Malta 19.728110
Portugal 19.405940
RSA 17.281207
United Kingdom 16.701544
Saudi Arabia 13.117000
Unspecified 10.726109
USA 5.948179
Name: Total, dtype: float64
这家电商网站有91.36%的顾客来自本国UK,84%的销售额来自本国。销售额占比超过1%的国家有荷兰、爱尔兰、法国、澳大利亚,顾客数占比超过1%的国家有德国、法国、爱尔兰.平均顾客消费最高的国家是爱尔兰、澳大利亚、日本、瑞典.这些国家可以作为开拓海外市场的重点对象。
PART THREE 时间维度
Question 7: 不同月份的销售金额情况?订单情况?每订单平均金额情况?
data["Month"]=data["Month"].astype("int")
data["Total"].groupby(data["Month"]).sum().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x26a01d48eb8>
data["InvoiceNo"].groupby(data["Month"]).count().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x26a01a31cf8>
data["Total"].groupby(data["Month"]).mean().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x26a01c265f8>
从1月到12月份的销售总额呈波动上升状态,并且在8月-11月一直以较大的幅度增长,到11月份达到顶峰。订单数量也与销售总金额呈现一样的分布。一年中,平均订单金额波动较大,分别在3月、5月、9月达到了极大值,并且9月份是全年平均订单金额最大的月份。
Question 8: 不同时间段的销售金额情况?订单情况?每订单平均金额情况?
data["Total"].groupby(data["Hour"]).sum().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x257298bb9e8>
data["InvoiceNo"].groupby(data["Hour"]).count().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x257386b6208>
data["Total"].groupby(data["Hour"]).mean().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x25738195a58>
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-MgORUVVE-1589979700246)(output_98_1.png)]