欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  IT编程

Python数据分析 CDnow网站的用户购买行为分析案例(RFM分析模型)

程序员文章站 2024-02-03 14:49:22
Python数据分析1 案例 CDnow网站的用户购买行为分析1.1 数据介绍数据来源于CDnow网站的用户购买记录,字段包括用户ID,订单日期,订单数量和订单金额。字段名说明user_id用户IDorder_dt订单日期order_product订单数量order_amount订单金额1.2 数据导入import numpy as npimport pandas as pdfrom pandas import DataFrame, Seri...

Python数据分析

1 案例 CDnow网站的用户购买行为分析

1.1 数据介绍

数据来源于CDnow网站的用户购买记录,字段包括用户ID,订单日期,订单数量和订单金额。

字段名 说明
user_id 用户ID
order_dt 订单日期
order_product 订单数量
order_amount 订单金额

1.2 数据导入

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

df = pd.read_csv('./CDNOW_master.txt', header=None, sep='\s+', names=['user_id', 'order_date', 'order_quantity', 'order_amount'])

df.head()
'''
	user_id 	order_date 	order_quantity 	order_amount
0 	1 			19970101 	1 				11.77
1 	2 			19970112 	1 				12.00
2 	2 			19970112 	5 				77.00
3 	3 			19970102 	2 				20.76
4 	3 			19970330 	2 				20.76
'''

df.shape  # (69659, 4)
df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_id         69659 non-null  int64  
 1   order_date      69659 non-null  int64  
 2   order_quantity  69659 non-null  int64  
 3   order_amount    69659 non-null  float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB
'''

1.3 数据处理

1.3.1 时间格式转换
df['order_date'] = pd.to_datetime(df['order_date'], format="%Y%m%d")

增加一列,显示月份。

df['order_month'] = df['order_date'].astype('datetime64[M]')
df.head()
'''
 	user_id 	order_date 	order_quantity 	order_amount 	order_month
0 	1 			1997-01-01 	1 				11.77 			1997-01-01
1 	2 			1997-01-12 	1 				12.00 			1997-01-01
2 	2 			1997-01-12 	5 				77.00 			1997-01-01
3 	3 			1997-01-02 	2 				20.76 			1997-01-01
4 	3 			1997-03-30 	2 				20.76 			1997-03-01
'''
1.3.2 describe

describe函数用于观察这一系列数据的范围、大小、波动趋势等,便于判断对数据使用哪类模型更合适。

df.describe()
'''
		user_id 		order_quantity 		order_amount
count 	69659.000000 	69659.000000 		69659.000000
mean 	11470.854592 	2.410040 			35.893648
std 	6819.904848 	2.333924 			36.281942
min 	1.000000 		1.000000 			0.000000
25% 	5506.000000 	1.000000 			14.490000
50% 	11410.000000 	2.000000 			25.980000
75% 	17273.000000 	3.000000 			43.700000
max 	23570.000000 	99.000000 			1286.010000
'''

1.4 数据分析

1.4.1 按月数据分析
1.4.1.1 用户每月的消费总金额
df.groupby(by='order_month')['order_amount'].sum()
'''
order_month
1997-01-01    299060.17
1997-02-01    379590.03
1997-03-01    393155.27
1997-04-01    142824.49
1997-05-01    107933.30
1997-06-01    108395.87
1997-07-01    122078.88
1997-08-01     88367.69
1997-09-01     81948.80
1997-10-01     89780.77
1997-11-01    115448.64
1997-12-01     95577.35
1998-01-01     76756.78
1998-02-01     77096.96
1998-03-01    108970.15
1998-04-01     66231.52
1998-05-01     70989.66
1998-06-01     76109.30
'''
df.groupby(by='order_month')['order_amount'].sum().plot()

Python数据分析 CDnow网站的用户购买行为分析案例(RFM分析模型)

1.4.1.2 用户每月的产品购买量
df.groupby(by='order_month')['order_quantity'].sum()
'''
order_month
1997-01-01    19416
1997-02-01    24921
1997-03-01    26159
1997-04-01     9729
1997-05-01     7275
1997-06-01     7301
1997-07-01     8131
1997-08-01     5851
1997-09-01     5729
1997-10-01     6203
1997-11-01     7812
1997-12-01     6418
1998-01-01     5278
1998-02-01     5340
1998-03-01     7431
1998-04-01     4697
1998-05-01     4903
1998-06-01     5287
'''
1.4.1.2 用户每月的消费总次数
df.groupby(by='order_month')['user_id'].count()
'''
order_month
1997-01-01     8928
1997-02-01    11272
1997-03-01    11598
1997-04-01     3781
1997-05-01     2895
1997-06-01     3054
1997-07-01     2942
1997-08-01     2320
1997-09-01     2296
1997-10-01     2562
1997-11-01     2750
1997-12-01     2504
1998-01-01     2032
1998-02-01     2026
1998-03-01     2793
1998-04-01     1878
1998-05-01     1985
1998-06-01     2043
'''
1.4.1.2 每月的消费用户数
df.groupby(by='order_month')['user_id'].nunique()
'''
order_month
1997-01-01    7846
1997-02-01    9633
1997-03-01    9524
1997-04-01    2822
1997-05-01    2214
1997-06-01    2339
1997-07-01    2180
1997-08-01    1772
1997-09-01    1739
1997-10-01    1839
1997-11-01    2028
1997-12-01    1864
1998-01-01    1537
1998-02-01    1551
1998-03-01    2060
1998-04-01    1437
1998-05-01    1488
1998-06-01    1506
'''
1.4.2 用户个体消费数据分析
  1. 用户消费总金额和总购买量的统计描述
df['order_quantity'].sum(), df['order_amount'].sum()  # (167881, 2500315.6300000004)
  1. 绘制每位用户消费金额和购买数量的散点图
users_amount_series = df.groupby(by='user_id')['order_amount'].sum()
users_quantity_series = df.groupby(by='user_id')['order_quantity'].sum()

plt.scatter(users_amount_series, users_quantity_series)

Python数据分析 CDnow网站的用户购买行为分析案例(RFM分析模型)

  1. 绘制每位用户消费总金额的直方分布图
    条件:消费金额在1000之内
order_amount_1000_series = df.query('order_amount <= 1000').groupby(by='user_id')['order_amount'].sum()
  1. 绘制每位用户购买商品总数的直方分布图
    条件:购买商品的数量在100以内。
order_quantity_1000_series = df.query('order_quantity <= 100').groupby(by='user_id')['order_quantity'].sum()
1.4.3 用户消费行为分析
  1. 用户首次消费的月份分布和人数统计
df.groupby(by='user_id')['order_month'].min().value_counts()
'''
1997-02-01    8476
1997-01-01    7846
1997-03-01    7248
'''
  1. 用户最后一次消费的时间分布和人数统计
df.groupby(by='user_id')['order_month'].max().value_counts()
'''
1997-02-01    4912
1997-03-01    4478
1997-01-01    4192
1998-06-01    1506
1998-05-01    1042
1998-03-01     993
1998-04-01     769
1997-04-01     677
1997-12-01     620
1997-11-01     609
1998-02-01     550
1998-01-01     514
1997-06-01     499
1997-07-01     493
1997-05-01     480
1997-10-01     455
1997-09-01     397
1997-08-01     384
'''
  1. 新老客户的占比
    消费一次为新用户,消费多次为老用户。

获取每位用户的首次消费和最后一次消费的消费时间。

new_old_df = df.groupby(by='user_id')['order_date'].agg(['min', 'max'])
(new_old_df['min'] == new_old_df['max']).value_counts()
'''
True     12054  # 新用户
False    11516  # 老用户
'''

获取新老客户的消费比例。

new_num = new_old_df.loc[new_old_df['min'] == new_old_df['max']].shape[0]
old_num = new_old_df.loc[new_old_df['min'] != new_old_df['max']].shape[0]
new_num / old_num  # 1.0467176102813478
  1. 用户分层
    获取由用户总购买量和消费总金额,以及最近一次消费时间组成的RFM表。
rfm = df.pivot_table(
	index='user_id', 
	aggfunc={
		'order_quantity': 'sum', 
		'order_amount': 'sum', 
		'order_date': 'max'
	}
)
rfm.head()
'''
 		order_amount 	order_date 		order_quantity
user_id 			
1 		11.77 			1997-01-01 		1
2 		89.00 			1997-01-12 		6
3 		156.46 			1998-05-28 		16
4 		100.50 			1997-12-12 		7
5 		385.61 			1998-01-03 		29
'''

RFM分析模型
R表示客户最近一次交易到现在时间间隔;
F表示客户购买商品的总数量。F值越大,表示客户交易越频繁,反之则表示客户交易不够活跃;
M表示客户交易金额。M值越大,表示客户价值越高,反之则表示客户价值较低。

# 计算R
rfm['R'] = (df['order_date'].max() - rfm['order_date']) / np.timedelta64(1, 'D')

rfm = rfm[['order_amount', 'order_quantity', 'R']]
rfm.columns = ['M', 'F', 'R']
rfm.head()
'''
		M 		F 		R
user_id 			
1 		11.77 	1 		545.0
2 		89.00 	6 		534.0
3 		156.46 	16 		33.0
4 		100.50 	7 		200.0
5 		385.61 	29 		178.0
'''
  1. 根据价值分层,将用户分为:
    重要价值客户,重要保持客户,重要挽留客户,重要发展客户,一般价值客户,一般保持客户,一般挽留客户,一般发展客户
# RFM分层算法
def rfm_func(x):
    # 存储的数据是三个字符串形式的0或1
    level = x.map(lambda x: '1' if x >= 0 else '0')
    label = level['R'] + level.F + level.M
    d = {
        '111': '重要价值客户',
        '011': '重要保持客户',
        '101': '重要挽留客户',
        '001': '重要发展客户',
        '110': '一般价值客户',
        '010': '一般保持客户',
        '100': '一般挽留客户',
        '000': '一般发展客户',
    }
    result = d[label]
    return result

# df.apply(func):用于对df中的行或列进行指定形式(func)的运算
rfm['label'] = rfm.apply(lambda x: x - x.mean(), axis=0).apply(rfm_func, axis = 1)
rfm.head()
'''
		M 		F 	R 		label
user_id 				
1 		11.77 	1 	545.0 	一般挽留客户
2 		89.00 	6 	534.0 	一般挽留客户
3 		156.46 	16 	33.0 	重要保持客户
4 		100.50 	7 	200.0 	一般发展客户
5 		385.61 	29 	178.0 	重要保持客户
'''
1.4.4 用户生命周期分析
  1. 将用户划分为活跃用户和其他用户

统计每位用户每月的消费次数

consumption_per_month_df = df.pivot_table(index='user_id', values='order_date', aggfunc='count', columns='order_month', fill_value=0)
consumption_per_month_df.head()
'''
order_month 	1997-01-01 	1997-02-01 	1997-03-01 	1997-04-01 	1997-05-01 	1997-06-01 	1997-07-01 	1997-08-01 	1997-09-01 	1997-10-01 	1997-11-01 	1997-12-01 	1998-01-01 	1998-02-01 	1998-03-01 	1998-04-01 	1998-05-01 	1998-06-01
user_id
1 				1 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
2 				2 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
3 				1 			0 			1 			1 			0 			0 			0 			0 			0 			0 			2 			0 			0 			0 			0 			0 			1 			0
4 				2 			0 			0 			0 			0 			0 			0 			1 			0 			0 			0 			1 			0 			0 			0 			0 			0 			0
5 				2 			1 			0 			1 			1 			1 			1 			0 			1 			0 			0 			2 			1 			0 			0 			0 			0 			0
'''

统计每位用户每月是否消费,有则为1,否则为0。

is_consumption_per_month_df = consumption_per_month_df.applymap(lambda x: 1 if x > 0 else 0)
is_consumption_per_month_df.head()
'''
order_month 	1997-01-01 	1997-02-01 	1997-03-01 	1997-04-01 	1997-05-01 	1997-06-01 	1997-07-01 	1997-08-01 	1997-09-01 	1997-10-01 	1997-11-01 	1997-12-01 	1998-01-01 	1998-02-01 	1998-03-01 	1998-04-01 	1998-05-01 	1998-06-01
user_id
1 				1 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
2 				1 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0 			0
3 				1 			0 			1 			1 			0 			0 			0 			0 			0 			0 			1 			0 			0 			0 			0 			0 			1 			0
4 				1 			0 			0 			0 			0 			0 			0 			1 			0 			0 			0 			1 			0 			0 			0 			0 			0 			0
5 				1 			1 			0 			1 			1 			1 			1 			0 			1 			0 			0 			1 			1 			0 			0 			0 			0 			0
'''
  1. 将用户按照每月购买情况分类
unreg:观望用户,即前两月没买,第三个月才首次购买,用户在前两个月为观望用户;
unactive:用户首月购买后,后面的月份中没有购买,用户在没有购买的月份中为非活跃用户;
new:在当前月进行首次购买的用户,在当前月中为新用户;
active:连续月份购买的用户,在这些月中为活跃用户;
return:购买后间隔数个月才再次购买的用户,在再次开始购买的首月用户为回头客。
consumption_per_month_df.shape  # (23570, 18)

def user_classify(data):
    status = []  # 用于存储某位用户在每个月的活跃度
    # 18 - consumption_per_month_df.shape[1]
    for i in range(18):
        # 若本月没有消费
        if data[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                status.append('unreg')         
        # 若本月有消费
        else:
            if len(status) == 0:
                status.append('new')
            else:
                if status[i-1] == 'unactive':
                    status.append('return')
                elif status[i-1] == 'unreg':
                    status.append('new')
                else:
                    status.append('active')
                    
    return status

user_status = consumption_per_month_df.apply(user_classify, axis = 1) 
user_status.head()
'''
user_id
1    [new, unactive, unactive, unactive, unactive, ...
2    [new, unactive, unactive, unactive, unactive, ...
3    [new, unactive, return, active, unactive, unac...
4    [new, unactive, unactive, unactive, unactive, ...
5    [new, active, unactive, return, active, active...
'''
user_status_per_month_df = DataFrame(data=pivoted_status.tolist(), index=consumption_per_month_df.index, columns=consumption_per_month_df.columns)
user_status_per_month_df.head()
'''
order_month 	1997-01-01 	1997-02-01 	1997-03-01 	1997-04-01 	1997-05-01 	1997-06-01 	1997-07-01 	1997-08-01 	1997-09-01 	1997-10-01 	1997-11-01 	1997-12-01 	1998-01-01 	1998-02-01 	1998-03-01 	1998-04-01 	1998-05-01 	1998-06-01
user_id
1 				new 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive
2 				new 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	unactive
3 				new 		unactive 	return 		active 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	return 		unactive 	unactive 	unactive 	unactive 	unactive 	return 		unactive
4 				new 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive 	return 		unactive 	unactive 	unactive 	return 		unactive 	unactive 	unactive 	unactive 	unactive 	unactive
5 				new 		active 		unactive 	return 		active 		active 		active 		unactive 	return 		unactive 	unactive 	return 		active 		unactive 	unactive 	unactive 	unactive 	unactive
'''
  1. 对每月的不同状态的用户进行计数
user_status_per_month_df.apply(lambda x: pd.value_counts(x), axis=0).fillna(0).T
'''
 			active 	new 	return 	unactive 	unreg
order_month
1997-01-01 	0.0 	7846.0 	0.0 	0.0 		15724.0
1997-02-01 	1157.0 	8476.0 	0.0 	6689.0 		7248.0
1997-03-01 	1681.0 	7248.0 	595.0 	14046.0 	0.0
1997-04-01 	1773.0 	0.0 	1049.0 	20748.0 	0.0
1997-05-01 	852.0 	0.0 	1362.0 	21356.0 	0.0
1997-06-01 	747.0 	0.0 	1592.0 	21231.0 	0.0
1997-07-01 	746.0 	0.0 	1434.0 	21390.0 	0.0
1997-08-01 	604.0 	0.0 	1168.0 	21798.0 	0.0
1997-09-01 	528.0 	0.0 	1211.0 	21831.0 	0.0
1997-10-01 	532.0 	0.0 	1307.0 	21731.0 	0.0
1997-11-01 	624.0 	0.0 	1404.0 	21542.0 	0.0
1997-12-01 	632.0 	0.0 	1232.0 	21706.0 	0.0
1998-01-01 	512.0 	0.0 	1025.0 	22033.0 	0.0
1998-02-01 	472.0 	0.0 	1079.0 	22019.0 	0.0
1998-03-01 	571.0 	0.0 	1489.0 	21510.0 	0.0
1998-04-01 	518.0 	0.0 	919.0 	22133.0 	0.0
1998-05-01 	459.0 	0.0 	1029.0 	22082.0 	0.0
1998-06-01 	446.0 	0.0 	1060.0 	22064.0 	0.0
'''

本文地址:https://blog.csdn.net/qq_36565509/article/details/107435828