Python数据分析 CDnow网站的用户购买行为分析案例(RFM分析模型)

程序员文章站 2024-02-03 14:49:22

Python数据分析1 案例 CDnow网站的用户购买行为分析1.1 数据介绍数据来源于CDnow网站的用户购买记录，字段包括用户ID，订单日期，订单数量和订单金额。字段名说明user_id用户IDorder_dt订单日期order_product订单数量order_amount订单金额1.2 数据导入import numpy as npimport pandas as pdfrom pandas import DataFrame, Seri...

Python数据分析

1 案例 CDnow网站的用户购买行为分析

1.1 数据介绍

数据来源于CDnow网站的用户购买记录，字段包括用户ID，订单日期，订单数量和订单金额。

字段名	说明
user_id	用户ID
order_dt	订单日期
order_product	订单数量
order_amount	订单金额

1.2 数据导入

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

df = pd.read_csv('./CDNOW_master.txt', header=None, sep='\s+', names=['user_id', 'order_date', 'order_quantity', 'order_amount'])

df.head()
'''
	user_id 	order_date 	order_quantity 	order_amount
0 	1 			19970101 	1 				11.77
1 	2 			19970112 	1 				12.00
2 	2 			19970112 	5 				77.00
3 	3 			19970102 	2 				20.76
4 	3 			19970330 	2 				20.76
'''

df.shape  # (69659, 4)
df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_id         69659 non-null  int64  
 1   order_date      69659 non-null  int64  
 2   order_quantity  69659 non-null  int64  
 3   order_amount    69659 non-null  float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB
'''

1.3 数据处理

1.3.1 时间格式转换

df['order_date'] = pd.to_datetime(df['order_date'], format="%Y%m%d")

增加一列，显示月份。

df['order_month'] = df['order_date'].astype('datetime64[M]')
df.head()
'''
 	user_id 	order_date 	order_quantity 	order_amount 	order_month
0 	1 			1997-01-01 	1 				11.77 			1997-01-01
1 	2 			1997-01-12 	1 				12.00 			1997-01-01
2 	2 			1997-01-12 	5 				77.00 			1997-01-01
3 	3 			1997-01-02 	2 				20.76 			1997-01-01
4 	3 			1997-03-30 	2 				20.76 			1997-03-01
'''

1.3.2 describe

describe函数用于观察这一系列数据的范围、大小、波动趋势等，便于判断对数据使用哪类模型更合适。

df.describe()
'''
		user_id 		order_quantity 		order_amount
count 	69659.000000 	69659.000000 		69659.000000
mean 	11470.854592 	2.410040 			35.893648
std 	6819.904848 	2.333924 			36.281942
min 	1.000000 		1.000000 			0.000000
25% 	5506.000000 	1.000000 			14.490000
50% 	11410.000000 	2.000000 			25.980000
75% 	17273.000000 	3.000000 			43.700000
max 	23570.000000 	99.000000 			1286.010000
'''

1.4 数据分析

1.4.1 按月数据分析

1.4.1.1 用户每月的消费总金额

df.groupby(by='order_month')['order_amount'].sum()
'''
order_month
1997-01-01    299060.17
1997-02-01    379590.03
1997-03-01    393155.27
1997-04-01    142824.49
1997-05-01    107933.30
1997-06-01    108395.87
1997-07-01    122078.88
1997-08-01     88367.69
1997-09-01     81948.80
1997-10-01     89780.77
1997-11-01    115448.64
1997-12-01     95577.35
1998-01-01     76756.78
1998-02-01     77096.96
1998-03-01    108970.15
1998-04-01     66231.52
1998-05-01     70989.66
1998-06-01     76109.30
'''

df.groupby(by='order_month')['order_amount'].sum().plot()

Python数据分析 CDnow网站的用户购买行为分析案例(RFM分析模型)

1.4.1.2 用户每月的产品购买量

df.groupby(by='order_month')['order_quantity'].sum()
'''
order_month
1997-01-01    19416
1997-02-01    24921
1997-03-01    26159
1997-04-01     9729
1997-05-01     7275
1997-06-01     7301
1997-07-01     8131
1997-08-01     5851
1997-09-01     5729
1997-10-01     6203
1997-11-01     7812
1997-12-01     6418
1998-01-01     5278
1998-02-01     5340
1998-03-01     7431
1998-04-01     4697
1998-05-01     4903
1998-06-01     5287
'''

1.4.1.2 用户每月的消费总次数

df.groupby(by='order_month')['user_id'].count()
'''
order_month
1997-01-01     8928
1997-02-01    11272
1997-03-01    11598
1997-04-01     3781
1997-05-01     2895
1997-06-01     3054
1997-07-01     2942
1997-08-01     2320
1997-09-01     2296
1997-10-01     2562
1997-11-01     2750
1997-12-01     2504
1998-01-01     2032
1998-02-01     2026
1998-03-01     2793
1998-04-01     1878
1998-05-01     1985
1998-06-01     2043
'''

1.4.1.2 每月的消费用户数

df.groupby(by='order_month')['user_id'].nunique()
'''
order_month
1997-01-01    7846
1997-02-01    9633
1997-03-01    9524
1997-04-01    2822
1997-05-01    2214
1997-06-01    2339
1997-07-01    2180
1997-08-01    1772
1997-09-01    1739
1997-10-01    1839
1997-11-01    2028
1997-12-01    1864
1998-01-01    1537
1998-02-01    1551
1998-03-01    2060
1998-04-01    1437
1998-05-01    1488
1998-06-01    1506
'''

1.4.2 用户个体消费数据分析

用户消费总金额和总购买量的统计描述

df['order_quantity'].sum(), df['order_amount'].sum()  # (167881, 2500315.6300000004)

绘制每位用户消费金额和购买数量的散点图

users_amount_series = df.groupby(by='user_id')['order_amount'].sum()
users_quantity_series = df.groupby(by='user_id')['order_quantity'].sum()

plt.scatter(users_amount_series, users_quantity_series)

Python数据分析 CDnow网站的用户购买行为分析案例(RFM分析模型)

绘制每位用户消费总金额的直方分布图
条件：消费金额在1000之内

order_amount_1000_series = df.query('order_amount <= 1000').groupby(by='user_id')['order_amount'].sum()

绘制每位用户购买商品总数的直方分布图
条件：购买商品的数量在100以内。

order_quantity_1000_series = df.query('order_quantity <= 100').groupby(by='user_id')['order_quantity'].sum()

1.4.3 用户消费行为分析

用户首次消费的月份分布和人数统计

df.groupby(by='user_id')['order_month'].min().value_counts()
'''
1997-02-01    8476
1997-01-01    7846
1997-03-01    7248
'''

用户最后一次消费的时间分布和人数统计

df.groupby(by='user_id')['order_month'].max().value_counts()
'''
1997-02-01    4912
1997-03-01    4478
1997-01-01    4192
1998-06-01    1506
1998-05-01    1042
1998-03-01     993
1998-04-01     769
1997-04-01     677
1997-12-01     620
1997-11-01     609
1998-02-01     550
1998-01-01     514
1997-06-01     499
1997-07-01     493
1997-05-01     480
1997-10-01     455
1997-09-01     397
1997-08-01     384
'''

新老客户的占比
消费一次为新用户，消费多次为老用户。

获取每位用户的首次消费和最后一次消费的消费时间。

new_old_df = df.groupby(by='user_id')['order_date'].agg(['min', 'max'])
(new_old_df['min'] == new_old_df['max']).value_counts()
'''
True     12054  # 新用户
False    11516  # 老用户
'''

获取新老客户的消费比例。

new_num = new_old_df.loc[new_old_df['min'] == new_old_df['max']].shape[0]
old_num = new_old_df.loc[new_old_df['min'] != new_old_df['max']].shape[0]
new_num / old_num  # 1.0467176102813478

用户分层
获取由用户总购买量和消费总金额，以及最近一次消费时间组成的RFM表。

rfm = df.pivot_table(
	index='user_id', 
	aggfunc={
		'order_quantity': 'sum', 
		'order_amount': 'sum', 
		'order_date': 'max'
	}
)
rfm.head()
'''
 		order_amount 	order_date 		order_quantity
user_id 			
1 		11.77 			1997-01-01 		1
2 		89.00 			1997-01-12 		6
3 		156.46 			1998-05-28 		16
4 		100.50 			1997-12-12 		7
5 		385.61 			1998-01-03 		29
'''

RFM分析模型
R表示客户最近一次交易到现在时间间隔；
F表示客户购买商品的总数量。F值越大，表示客户交易越频繁，反之则表示客户交易不够活跃；
M表示客户交易金额。M值越大，表示客户价值越高，反之则表示客户价值较低。

# 计算R
rfm['R'] = (df['order_date'].max() - rfm['order_date']) / np.timedelta64(1, 'D')

rfm = rfm[['order_amount', 'order_quantity', 'R']]
rfm.columns = ['M', 'F', 'R']
rfm.head()
'''
		M 		F 		R
user_id 			
1 		11.77 	1 		545.0
2 		89.00 	6 		534.0
3 		156.46 	16 		33.0
4 		100.50 	7 		200.0
5 		385.61 	29 		178.0
'''

根据价值分层，将用户分为：
重要价值客户，重要保持客户，重要挽留客户，重要发展客户，一般价值客户，一般保持客户，一般挽留客户，一般发展客户

# RFM分层算法
def rfm_func(x):
    # 存储的数据是三个字符串形式的0或1
    level = x.map(lambda x: '1' if x >= 0 else '0')
    label = level['R'] + level.F + level.M
    d = {
        '111': '重要价值客户',
        '011': '重要保持客户',
        '101': '重要挽留客户',
        '001': '重要发展客户',
        '110': '一般价值客户',
        '010': '一般保持客户',
        '100': '一般挽留客户',
        '000': '一般发展客户',
    }
    result = d[label]
    return result

# df.apply(func)：用于对df中的行或列进行指定形式(func)的运算
rfm['label'] = rfm.apply(lambda x: x - x.mean(), axis=0).apply(rfm_func, axis = 1)
rfm.head()
'''
		M 		F 	R 		label
user_id 				
1 		11.77 	1 	545.0 	一般挽留客户
2 		89.00 	6 	534.0 	一般挽留客户
3 		156.46 	16 	33.0 	重要保持客户
4 		100.50 	7 	200.0 	一般发展客户
5 		385.61 	29 	178.0 	重要保持客户
'''

1.4.4 用户生命周期分析

将用户划分为活跃用户和其他用户

统计每位用户每月的消费次数

consumption_per_month_df = df.pivot_table(index='user_id', values='order_date', aggfunc='count', columns='order_month', fill_value=0)
consumption_per_month_df.head()
'''
order_month 	1997-01-01 	1997-02-01 	1997-03-01 	1997-04-01 	1997-05-01 	1997-06-01 	1997-07-01 	1997-08-01 	1997-09-01 	1997-10-01 	1997-11-01 	1997-12-01 	1998-01-01 	1998-02-01 	1998-03-01 	1998-04-01 	1998-05-01 	1998-06-01
user_id
1 				1 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
2 				2 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
3 				1 			0 			1 			1 			0 			0 			0 			0 			0 			0 			2 			0 			0 			0 			0 			0 			1 			0
4 				2 			0 			0 			0 			0 			0 			0 			1 			0 			0 			0 			1 			0 			0 			0 			0 			0 			0
5 				2 			1 			0 			1 			1 			1 			1 			0 			1 			0 			0 			2 			1 			0 			0 			0 			0 			0
'''

统计每位用户每月是否消费，有则为1，否则为0。

is_consumption_per_month_df = consumption_per_month_df.applymap(lambda x: 1 if x > 0 else 0)
is_consumption_per_month_df.head()
'''
order_month 	1997-01-01 	1997-02-01 	1997-03-01 	1997-04-01 	1997-05-01 	1997-06-01 	1997-07-01 	1997-08-01 	1997-09-01 	1997-10-01 	1997-11-01 	1997-12-01 	1998-01-01 	1998-02-01 	1998-03-01 	1998-04-01 	1998-05-01 	1998-06-01
user_id
1 				1 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
2 				1 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
3 				1 			0 			1 			1 			0 			0 			0 			0 			0 			0 			1 			0 			0 			0 			0 			0 			1 			0
4 				1 			0 			0 			0 			0 			0 			0 			1 			0 			0 			0 			1 			0 			0 			0 			0 			0 			0
5 				1 			1 			0 			1 			1 			1 			1 			0 			1 			0 			0 			1 			1 			0 			0 			0 			0 			0
'''

将用户按照每月购买情况分类

unreg：观望用户，即前两月没买，第三个月才首次购买，用户在前两个月为观望用户；
unactive：用户首月购买后，后面的月份中没有购买，用户在没有购买的月份中为非活跃用户；
new：在当前月进行首次购买的用户，在当前月中为新用户；
active：连续月份购买的用户，在这些月中为活跃用户；
return：购买后间隔数个月才再次购买的用户，在再次开始购买的首月用户为回头客。

consumption_per_month_df.shape  # (23570, 18)

def user_classify(data):
    status = []  # 用于存储某位用户在每个月的活跃度
    # 18 - consumption_per_month_df.shape[1]
    for i in range(18):
        # 若本月没有消费
        if data[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                status.append('unreg')         
        # 若本月有消费
        else:
            if len(status) == 0:
                status.append('new')
            else:
                if status[i-1] == 'unactive':
                    status.append('return')
                elif status[i-1] == 'unreg':
                    status.append('new')
                else:
                    status.append('active')
                    
    return status

user_status = consumption_per_month_df.apply(user_classify, axis = 1) 
user_status.head()
'''
user_id
1    [new, unactive, unactive, unactive, unactive, ...
2    [new, unactive, unactive, unactive, unactive, ...
3    [new, unactive, return, active, unactive, unac...
4    [new, unactive, unactive, unactive, unactive, ...
5    [new, active, unactive, return, active, active...
'''

user_status_per_month_df = DataFrame(data=pivoted_status.tolist(), index=consumption_per_month_df.index, columns=consumption_per_month_df.columns)
user_status_per_month_df.head()
'''
order_month 	1997-01-01 	1997-02-01 	1997-03-01 	1997-04-01 	1997-05-01 	1997-06-01 	1997-07-01 	1997-08-01 	1997-09-01 	1997-10-01 	1997-11-01 	1997-12-01 	1998-01-01 	1998-02-01 	1998-03-01 	1998-04-01 	1998-05-01 	1998-06-01
user_id
1 				new 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive
2 				new 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive
3 				new 		unactive 	return 		active 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	return 		unactive 	unactive 	unactive 	unactive 	unactive 	return 		unactive
4 				new 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	return 		unactive 	unactive 	unactive 	return 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive
5 				new 		active 		unactive 	return 		active 		active 		active 		unactive 	return 		unactive 	unactive 	return 		active 		unactive 	unactive 	unactive 	unactive 	unactive
'''

对每月的不同状态的用户进行计数

user_status_per_month_df.apply(lambda x: pd.value_counts(x), axis=0).fillna(0).T
'''
 			active 	new 	return 	unactive 	unreg
order_month
1997-01-01 	0.0 	7846.0 	0.0 	0.0 		15724.0
1997-02-01 	1157.0 	8476.0 	0.0 	6689.0 		7248.0
1997-03-01 	1681.0 	7248.0 	595.0 	14046.0 	0.0
1997-04-01 	1773.0 	0.0 	1049.0 	20748.0 	0.0
1997-05-01 	852.0 	0.0 	1362.0 	21356.0 	0.0
1997-06-01 	747.0 	0.0 	1592.0 	21231.0 	0.0
1997-07-01 	746.0 	0.0 	1434.0 	21390.0 	0.0
1997-08-01 	604.0 	0.0 	1168.0 	21798.0 	0.0
1997-09-01 	528.0 	0.0 	1211.0 	21831.0 	0.0
1997-10-01 	532.0 	0.0 	1307.0 	21731.0 	0.0
1997-11-01 	624.0 	0.0 	1404.0 	21542.0 	0.0
1997-12-01 	632.0 	0.0 	1232.0 	21706.0 	0.0
1998-01-01 	512.0 	0.0 	1025.0 	22033.0 	0.0
1998-02-01 	472.0 	0.0 	1079.0 	22019.0 	0.0
1998-03-01 	571.0 	0.0 	1489.0 	21510.0 	0.0
1998-04-01 	518.0 	0.0 	919.0 	22133.0 	0.0
1998-05-01 	459.0 	0.0 	1029.0 	22082.0 	0.0
1998-06-01 	446.0 	0.0 	1060.0 	22064.0 	0.0
'''

本文地址：https://blog.csdn.net/qq_36565509/article/details/107435828

上一篇： C#中实现线程同步lock关键字的用法详解

下一篇： ActiveMQ在C#中的应用示例分析

Python数据分析 CDnow网站的用户购买行为分析案例(RFM分析模型)