
Taobao User Behavior Analysis - 1 - Data Processing (Python)


Dataset source: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1
Project goal: analyze user behavior data, get familiar with the relevant metrics, and see what patterns and information can be extracted from behavioral data.
This post records how I processed the data and did the feature engineering.

1. Read the data and take a first look

# data-processing libraries
import pandas as pd
import numpy as np
df = pd.read_csv("UserBehavior.csv", sep=",", names=["User_ID","Item_ID","Category_ID","Behavior_type","Timestamp"])
# the nrows parameter limits how many rows are read
# df = pd.read_csv("UserBehavior.csv", sep=",", names=["User_ID","Item_ID","Category_ID","Behavior_type","Timestamp"], nrows=500)
# sample randomly draws n rows without replacement
# df = df.sample(n=1000000, replace=False, random_state=1, axis=0)
df.head()
User_ID Item_ID Category_ID Behavior_type Timestamp
0 1 2268318 2520377 pv 1511544070
1 1 2333346 2520771 pv 1511561733
2 1 2576651 149192 pv 1511572885
3 1 3830808 4181361 pv 1511593493
4 1 4365585 2520377 pv 1511596146
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835330 entries, 0 to 3835329
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   User_ID        int64 
 1   Item_ID        int64 
 2   Category_ID    int64 
 3   Behavior_type  object
 4   Timestamp      int64 
dtypes: int64(4), object(1)
memory usage: 146.3+ MB
df.describe()
User_ID Item_ID Category_ID Timestamp
count 3.835330e+06 3.835330e+06 3.835330e+06 3.835330e+06
mean 2.411328e+05 2.578345e+06 2.711667e+06 1.511963e+09
std 2.548230e+05 1.487859e+06 1.464903e+06 8.665418e+05
min 1.000000e+00 3.000000e+00 2.171000e+03 4.401600e+04
25% 1.299830e+05 1.294202e+06 1.349561e+06 1.511763e+09
50% 1.786180e+05 2.576338e+06 2.725426e+06 1.511966e+09
75% 2.255740e+05 3.860809e+06 4.145813e+06 1.512181e+09
max 1.018011e+06 5.163067e+06 5.161669e+06 2.122867e+09
# check whether df has null values, and how many per column
df.isnull().sum()
User_ID          0
Item_ID          0
Category_ID      0
Behavior_type    0
Timestamp        0
dtype: int64

A first look at the data tells us: there are 3,835,330 rows and 5 columns: User_ID, Item_ID, Category_ID, Behavior_type, and Timestamp. Four of these (User_ID, Item_ID, Category_ID, Timestamp) are integer columns, while Behavior_type is a string; Timestamp will be converted to a proper date/time later. The data is fairly clean, with no null values.
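Since the whole Timestamp column gets converted later anyway, it is worth noting that a vectorized pd.to_datetime call is much faster than a per-row apply over 3.8M rows. A minimal sketch, assuming the timestamps are Unix seconds (values around 1.51e9 do correspond to late November 2017):

# vectorized conversion: interpret the integer column as Unix seconds (UTC)
dt = pd.to_datetime(df["Timestamp"], unit="s")
# datetime.fromtimestamp(), used further down, works in the local timezone;
# to match it explicitly, localize to UTC and convert, e.g. to Beijing time
dt_local = dt.dt.tz_localize("UTC").dt.tz_convert("Asia/Shanghai")
print(dt_local.dt.date.head())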

# read the merged CSV file with an iterator (an option when the file is too large for memory)
# merge_dt = pd.read_csv( cwd  + '\\' + "UserBehavior.csv", encoding="utf_8", engine='python' ,iterator = True ) 
# # merge_dt  
# # output: <pandas.io.parsers.TextFileReader at 0x1cc1f0a7470>
# # pull a chunk of rows
# df = merge_dt.get_chunk(10000)

2. Clean the data

# drop duplicate rows
print(df.shape)
list_ = df.columns.to_list()
df.drop_duplicates(list_, keep='first', inplace=True)
print(df.shape)
(3835330, 5)
(3835328, 5)
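For what it's worth, the two dropped rows can be inspected before removing them; a quick sketch (this has to run before the drop_duplicates call above, since the duplicates are gone afterwards):

# show every row that appears more than once, both copies included
dupes = df[df.duplicated(keep=False)]
print(dupes.sort_values(df.columns.to_list()))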
import datetime
import time
# convert the Unix timestamp to a datetime, then format the date and hour as strings
# df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%Y-%m-%d'))
df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).date())
df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%H'))
# df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).time())
del df['Timestamp']
df.head()
User_ID Item_ID Category_ID Behavior_type date time
0 1 2268318 2520377 pv 2017-11-25 01
1 1 2333346 2520771 pv 2017-11-25 06
2 1 2576651 149192 pv 2017-11-25 09
3 1 3830808 4181361 pv 2017-11-25 15
4 1 4365585 2520377 pv 2017-11-25 15
# a lazy shortcut: split a datetime string directly into 2 columns
# df[["date","time"]] = df["datetime"].str.split(' ', 1, expand=True)
# df.head()
df['date'].value_counts()
2017-12-02    532774
2017-12-03    528928
2017-12-01    417976
2017-11-26    406792
2017-11-30    400366
2017-11-25    395034
2017-11-29    390884
2017-11-27    382304
2017-11-28    378326
2017-11-24      1507
2017-11-23       181
2017-11-22        53
2017-11-19        33
2017-11-21        25
2017-11-20        20
2017-11-18        19
2017-11-17        18
2018-08-28        16
2017-11-03        14
2017-11-16        10
2017-11-14         6
2017-11-13         6
2017-12-04         6
2017-11-11         5
2017-11-15         4
2017-12-06         3
2017-11-10         2
2017-09-16         2
2017-07-03         2
2017-11-12         2
1970-01-01         1
2015-02-06         1
2017-10-31         1
2017-09-11         1
2017-11-02         1
2017-11-05         1
2017-10-10         1
2017-11-06         1
2037-04-09         1
2017-11-04         1
Name: date, dtype: int64
# outside 2017-11-25 through 2017-12-03 the volume is tiny; drop those rows from the analysis
print(df.shape)
# df['date'] = df['date'][df['date'].between('2017-11-25','2017-12-03', inclusive=True)]
df['date'] = df['date'][(df['date']>=datetime.date(2017,11,25)) & (df['date']<=datetime.date(2017,12,3))]
# drop rows containing NaN
df = df.dropna(axis=0, how='any')
print(df.shape)
(3835328, 6)
(3833384, 6)
# With more data, the weekday dimension might reveal patterns, but this project covers only 9 days
# (one week plus 2 days), so a weekday analysis adds little over a daily one; this feature is optional.
df['week'] = df['date'].apply(lambda x: x.weekday()+1)
df.head()
User_ID Item_ID Category_ID Behavior_type date time week
0 1 2268318 2520377 pv 2017-11-25 01 6
1 1 2333346 2520771 pv 2017-11-25 06 6
2 1 2576651 149192 pv 2017-11-25 09 6
3 1 3830808 4181361 pv 2017-11-25 15 6
4 1 4365585 2520377 pv 2017-11-25 15 6
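As an aside, the three apply calls above each run a Python-level loop over ~3.8M rows. A vectorized sketch of the same date/hour/weekday features, assuming the Asia/Shanghai timezone to match what datetime.fromtimestamp() produced locally, and starting from the raw Timestamp column (i.e., before the del above):

# one vectorized pass over the raw Timestamp column
ts = pd.to_datetime(df["Timestamp"], unit="s").dt.tz_localize("UTC").dt.tz_convert("Asia/Shanghai")
df["date"] = ts.dt.date             # datetime.date objects, as above
df["time"] = ts.dt.strftime("%H")   # hour of day as a zero-padded string
df["week"] = ts.dt.weekday + 1      # Monday=1 ... Sunday=7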

3. RFM

The dataset has no order amounts, so only R (recency) and F (frequency) can be scored; the monetary dimension M is skipped.

df_rfm = df[df['Behavior_type']=='buy'].copy()  # .copy() so the columns added below don't raise SettingWithCopyWarning
df_rfm.shape
(76705, 7)
# how many distinct users
# len(df_rfm['User_ID'].unique())
# how many orders (treating each buy record as one order; in reality one order can hold several products)
# len(df_rfm['Behavior_type'])
# days from each purchase to the reference date 2017-12-04 (the day after the window ends), used for recency
df_rfm['datediff'] = (datetime.date(2017,12,4) - df_rfm['date']).apply(lambda x: x.days)
df_rfm.head()
User_ID Item_ID Category_ID Behavior_type date time week datediff
71 100 1603476 2951233 buy 2017-11-25 11 6 9
73 100 2971043 4869428 buy 2017-11-25 21 6 9
100 100 598929 2429887 buy 2017-11-27 13 1 7
119 100 1046201 3002561 buy 2017-11-27 15 1 7
125 100 1606258 4098232 buy 2017-11-27 21 1 7
# recency: days since each user's most recent purchase
df_r = df_rfm.groupby(by=['User_ID'])['datediff'].agg([('r', 'min')])
# frequency: number of purchases per user
df_f = df_rfm.groupby(by=['User_ID'])['Behavior_type'].agg([('f', 'count')])
# join the two frames on User_ID
df_rfm_2 = df_r.join(df_f)
df_rfm_2.head()
r f
User_ID
100 6 8
117 6 10
119 5 3
121 9 1
122 2 3
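Side note: the two groupbys plus the join can be collapsed into a single call with named aggregation (available since pandas 0.25); a minimal equivalent sketch:

# one groupby producing both RFM columns at once
df_rfm_2 = df_rfm.groupby('User_ID').agg(
    r=('datediff', 'min'),          # recency: smallest day gap
    f=('Behavior_type', 'count'),   # frequency: number of buy records
)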
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   r       25400 non-null  int64
 1   f       25400 non-null  int64
dtypes: int64(2)
memory usage: 1.8 MB
df_rfm_2.describe()
r f
count 25400.000000 25400.000000
mean 3.535906 3.019882
std 2.405743 3.039492
min 1.000000 1.000000
25% 1.000000 1.000000
50% 3.000000 2.000000
75% 5.000000 4.000000
max 9.000000 84.000000
# Recency: the window is 9 days, so r runs from 1 to 9; split it into 4 bins with edges 1, 3, 5, 7, 9.
# The more recent the last purchase (smaller r), the higher the score.
# bins_r = df_rfm_2['r'].quantile(q=np.linspace(0, 1, num=5), interpolation='nearest')
bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)
df_rfm_2['r_score'] = pd.cut(df_rfm_2['r'], bins=bins_r, labels=labels_r, include_lowest=True)
df_rfm_2['r_score'].unique()
[2, 3, 1, 4]
Categories (4, int64): [4 < 3 < 2 < 1]
df_rfm_2['f'].value_counts()
1     8688
2     5899
3     3849
4     2321
5     1545
6      955
7      639
8      421
9      275
10     188
11     148
12     102
13      63
14      59
15      58
16      34
17      24
18      22
20      16
19      14
21      12
23      12
22       9
29       6
25       5
31       4
30       4
26       3
36       3
28       3
32       3
27       2
51       1
33       1
65       1
47       1
84       1
69       1
24       1
72       1
57       1
42       1
43       1
60       1
61       1
39       1
Name: f, dtype: int64
# purchase frequency goes up to 84; split it into 4 equal-width bins: 1-21, 22-43, 44-65, 66-84
bins_f = [1, 21, 43, 65, 84]
labels_f = np.arange(1, 5)
df_rfm_2['f_score'] = pd.cut(df_rfm_2['f'], bins=bins_f, labels=labels_f, include_lowest=True)
df_rfm_2['f_score'].unique()
[1, 2, 4, 3]
Categories (4, int64): [1 < 2 < 3 < 4]
df_rfm_2.head()
r f r_score f_score
User_ID
100 6 8 2 1
117 6 10 2 1
119 5 3 3 1
121 9 1 1 1
122 2 3 4 1
# df_rfm_2['total_score'] = df_rfm_2.apply(lambda x: x['r_score'] + x['f_score'], axis=1)
# df_rfm_2['total_score'].value_counts()
df_rfm_2.describe()
r f
count 25400.000000 25400.000000
mean 3.535906 3.019882
std 2.405743 3.039492
min 1.000000 1.000000
25% 1.000000 1.000000
50% 3.000000 2.000000
75% 5.000000 4.000000
max 9.000000 84.000000
# df_rfm_2['r_score'] = df_rfm_2['r_score'].apply(lambda x: int(str(x)))
# df_rfm_2['f_score'] = df_rfm_2['f_score'].apply(lambda x: int(str(x)))
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   r        25400 non-null  int64   
 1   f        25400 non-null  int64   
 2   r_score  25400 non-null  category
 3   f_score  25400 non-null  category
dtypes: category(2), int64(2)
memory usage: 1.9 MB
# convert the category dtype to int
df_rfm_2[['r_score', 'f_score']] = df_rfm_2[['r_score', 'f_score']].astype(int)
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   r        25400 non-null  int64
 1   f        25400 non-null  int64
 2   r_score  25400 non-null  int32
 3   f_score  25400 non-null  int32
dtypes: int32(2), int64(2)
memory usage: 2.0 MB
df_rfm_2.describe()
r f r_score f_score
count 25400.000000 25400.000000 25400.000000 25400.000000
mean 3.535906 3.019882 3.257441 1.003150
std 2.405743 3.039492 1.000799 0.065734
min 1.000000 1.000000 1.000000 1.000000
25% 1.000000 1.000000 3.000000 1.000000
50% 3.000000 2.000000 4.000000 1.000000
75% 5.000000 4.000000 4.000000 1.000000
max 9.000000 84.000000 4.000000 4.000000
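The describe output also exposes the weakness of equal-width f bins: f is so right-skewed that f_score is almost always 1 (mean 1.003). A hedged alternative sketch using quantile-based scores instead (f_score_q is a hypothetical column name; ranking first sidesteps pd.qcut's duplicate-edge error on the heavy low end):

# rank the frequencies, then cut the ranks into 4 roughly equal-sized groups
f_rank = df_rfm_2['f'].rank(method='first')
df_rfm_2['f_score_q'] = pd.qcut(f_rank, q=4, labels=[1, 2, 3, 4])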
# from the describe output above, the means of r_score and f_score are 3.257441 and 1.003150; round them down
r_avg = 3
f_avg = 1
# bins = df_rfm_2['total_score'].quantile(q=np.linspace(0,1,num=5), interpolation='nearest')
# labels = ['一般维持客户','潜力客户','重要深耕客户','重要价值客户']
# df_rfm_2['label_customer'] = pd.cut(df_rfm_2['total_score'], bins=bins, labels=labels, include_lowest=True)
# split users into 4 groups around the r_score/f_score means: 重要价值客户 (important value), 重要维护客户 (important maintain), 重要发展客户 (important develop), 一般客户 (general)
df_rfm_2['label_customer'] = np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']>f_avg), '重要价值客户',
                                     np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']<=f_avg), '重要维护客户',
                                             np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']>f_avg), '重要发展客户',
                                                     np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']<=f_avg), '一般客户', 
                                                              np.nan))))
df_rfm_2.head()
r f r_score f_score label_customer
User_ID
100 6 8 2 1 一般客户
117 6 10 2 1 一般客户
119 5 3 3 1 一般客户
121 9 1 1 1 一般客户
122 2 3 4 1 重要维护客户
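Nested np.where calls get hard to read past two levels; numpy's np.select expresses the same four-way split more directly. A minimal equivalent sketch:

# conditions and labels are checked in order; anything left over is 一般客户
conditions = [
    (df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] > f_avg),   # recent and frequent
    (df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] <= f_avg),  # recent, infrequent
    (df_rfm_2['r_score'] <= r_avg) & (df_rfm_2['f_score'] > f_avg),  # frequent, not recent
]
labels = ['重要价值客户', '重要维护客户', '重要发展客户']
df_rfm_2['label_customer'] = np.select(conditions, labels, default='一般客户')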
# turn the index back into an ordinary column
df_rfm_2 = df_rfm_2.reset_index()
df_rfm_2.head()
User_ID r f r_score f_score label_customer
0 100 6 8 2 1 一般客户
1 117 6 10 2 1 一般客户
2 119 5 3 3 1 一般客户
3 121 9 1 1 1 一般客户
4 122 2 3 4 1 重要维护客户
df.shape
(3833384, 7)
df_rfm_2 = df_rfm_2[['User_ID','label_customer']]
df_rfm_2.shape
(25400, 2)
# join the customer labels back onto the full behavior table
df = pd.merge(df, df_rfm_2, how='left', left_on='User_ID', right_on='User_ID')
df.shape
(3833384, 8)
df.head()
User_ID Item_ID Category_ID Behavior_type date time week label_customer
0 1 2268318 2520377 pv 2017-11-25 01 6 NaN
1 1 2333346 2520771 pv 2017-11-25 06 6 NaN
2 1 2576651 149192 pv 2017-11-25 09 6 NaN
3 1 3830808 4181361 pv 2017-11-25 15 6 NaN
4 1 4365585 2520377 pv 2017-11-25 15 6 NaN

4. Product classification

In real e-commerce, products are classified by sales volume, profit, and other features, typically into states such as hot seller (爆款), strong seller (旺款), regular (在售), and clearance (清仓). This dataset has no sales figures, so a product's buy count stands in for sales volume here (an alternative pd.cut formulation is sketched at the end of this section).

df_pro = df[df['Behavior_type']=='buy']
print(df_pro.shape)
# per-Item_ID sales (buy counts)
print(df_pro['Item_ID'].value_counts().describe())
df_4 = df_pro['Item_ID'].value_counts().reset_index()
# note: in this pandas version reset_index() yields columns ['index', 'Item_ID'],
# where 'Item_ID' holds the counts; both are renamed a few lines down
# there are too many Item_IDs, so look only at those bought more than once,
# which makes it easier to pick the classification cut points later
df_5 = df_4[df_4['Item_ID']>1]
print(df_5['Item_ID'].value_counts())
print(df_5['Item_ID'].describe())
df_4.head()
df_4.columns=['Item_ID','buy_num']
df_pro = pd.merge(df_pro, df_4, how='left', left_on='Item_ID', right_on='Item_ID')
df_pro['buy_num'].value_counts()
df_pro.head()
(76705, 8)
count    56726.000000
mean         1.352202
std          1.109932
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         58.000000
Name: Item_ID, dtype: float64
2     6990
3     1940
4      819
5      411
6      197
7      138
8       95
9       68
10      38
11      31
13      18
12      17
15      13
14      12
18       6
17       5
16       3
23       3
21       2
22       2
19       2
26       1
58       1
27       1
25       1
20       1
32       1
Name: Item_ID, dtype: int64
count    10816.000000
mean         2.847171
std          1.923511
min          2.000000
25%          2.000000
50%          2.000000
75%          3.000000
max         58.000000
Name: Item_ID, dtype: float64
User_ID Item_ID Category_ID Behavior_type date time week label_customer buy_num
0 100 1603476 2951233 buy 2017-11-25 11 6 一般客户 3
1 100 2971043 4869428 buy 2017-11-25 21 6 一般客户 1
2 100 598929 2429887 buy 2017-11-27 13 1 一般客户 1
3 100 1046201 3002561 buy 2017-11-27 15 1 一般客户 2
4 100 1606258 4098232 buy 2017-11-27 21 1 一般客户 1
df.shape
(3833384, 8)
df_pro = df_pro[['Item_ID','buy_num']]
df_pro.shape
(76705, 2)
# drop duplicates first, otherwise the join with df would be one-to-many and inflate the row count
df_pro = df_pro.drop_duplicates(['Item_ID'], keep='first')
df_pro.shape
(56726, 2)
df = pd.merge(df, df_pro, how='left', on='Item_ID')
df.shape
(3833384, 9)
# set the cut points and classify products; items never bought have NaN buy_num,
# fail every condition, and therefore fall into 清仓 (clearance)
df['pro_type'] = np.where(df['buy_num']>=9,'爆款',np.where(df['buy_num']>=3,'旺款',np.where(df['buy_num']>0,'在售','清仓')))
del df['buy_num']
df.head()
User_ID Item_ID Category_ID Behavior_type date time week label_customer pro_type
0 1 2268318 2520377 pv 2017-11-25 01 6 NaN 在售
1 1 2333346 2520771 pv 2017-11-25 06 6 NaN 在售
2 1 2576651 149192 pv 2017-11-25 09 6 NaN 清仓
3 1 3830808 4181361 pv 2017-11-25 15 6 NaN 清仓
4 1 4365585 2520377 pv 2017-11-25 15 6 NaN 清仓
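As promised above, the same four-way product split can also be written with pd.cut instead of nested np.where; a hedged sketch that has to run before the del df['buy_num'] above, with NaN counts filled with 0 so unsold items land in 清仓:

# edges chosen for integer counts: <=0 -> 清仓, 1-2 -> 在售, 3-8 -> 旺款, >=9 -> 爆款
buy = df['buy_num'].fillna(0)
df['pro_type'] = pd.cut(
    buy,
    bins=[-np.inf, 0, 2, 8, np.inf],
    labels=['清仓', '在售', '旺款', '爆款'],
)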
df.to_csv("bi_tmall_users.csv", sep=",", index=False)