数据分析: kaggle比赛 - 销量预测
目录
Part III: 商品id纬度 - 随机抽取多个商品查看销量 - 识别潜在特征
Part IV: 商品类别维度 - 对比分析不同类别商品销量 - 总数/随时间分布
Part V: 生命周期维度 - 分析商品销售-新品与停售(sku数量)
继之前的Kaggle比赛实战记录,参考其他kaggler的经验,总结一篇数据分析,以作日后参考。
比赛链接(已结束,可下载数据):https://www.kaggle.com/c/m5-forecasting-accuracy
Part 0: 数据读取
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
from itertools import cycle
pd.set_option('max_columns', 50)
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])
0.1 数据内容简述-Data Files:
原始数据文件:
-
calendar.csv
- Contains information about the dates on which the products are sold. -
sales_train_validation.csv
- Contains the historical daily unit sales data per product and store [d_1 - d_1913] -
sample_submission.csv
- The correct format for submissions. Reference the Evaluation tab for more info. -
sell_prices.csv
- Contains information about the price of the products sold per store and date.
预测目标
We are trying for forecast sales for 28 forecast days. The sample submission has the following format:
- The columns represent 28 forecast days. We will fill these forecast days with our predictions.
- The rows each represent a specific item. This id tells us the item type, state, and store. We don't know what these items are exactly.
0.2 数据读入
# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')
We are given historic sales data in the `sales_train_validation` dataset.
- rows exist in this dataset for days d_1 to d_1913. We are given the department, category, state, and store id of the item.
- d_1914 - d_1941 represents the `validation` rows which we will predict in stage 1
- d_1942 - d_1969 represents the `evaluation` rows which we will predict for the final competition standings.
Part I: 抽检单个商品的销量分布
1. 便于形成初步印象
Visualizing the data for a single item
- Lets take a random item that sell a lot and see how it's sales look across the training data.
-
FOODS_3_090_CA_3_validation
sells a lot - Note there are days where it appears the item is unavailable and sales flatline
d_cols = [c for c in stv.columns if 'd_' in c] # sales data columns
# Below we are chaining the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index, Keep only sales data columns
# 3. Transform so it's a column
# 4. Plot the data
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
.set_index('id')[d_cols] \
.T \
.plot(figsize=(15, 5),
title='FOODS_3_090_CA_3 sales by "d" number',
color=next(color_cycle))
plt.legend('Sales')
plt.show()
Merging the data with real dates
- We are given a calendar with additional information about past and future dates.
- The calendar data can be merged with our days data
- From this we can find weekly and annual trends
cal[['d', 'date', 'event_name_1', 'event_name_2', 'event_type_1', 'event_type_2', 'snap_CA']].head()
# Example 1: FOODS_3_090_CA_3_validation
# Merge calendar on our items' data
example = stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'][d_cols].T # The index and col_names become index
example = example.rename(columns={8412:'FOODS_3_090_CA_3'}) # Name it correctly
example = example.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example = example.merge(cal, how='left', validate='1:1')
example.set_index('date')['FOODS_3_090_CA_3'] \
.plot(figsize=(15, 5),
color=next(color_cycle),
title='FOODS_3_090_CA_3 sales by actual sale dates')
plt.show()
# Example 2: HOBBIES_1_234_CA_3_validation
example2 = stv.loc[stv['id'] == 'HOBBIES_1_234_CA_3_validation'][d_cols].T
example2 = example2.rename(columns={6324:'HOBBIES_1_234_CA_3'}) # Name it correctly
example2 = example2.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example2 = example2.merge(cal, how='left', validate='1:1')
# Example 3: HOUSEHOLD_1_118_CA_3_validation
example3 = stv.loc[stv['id'] == 'HOUSEHOLD_1_118_CA_3_validation'][d_cols].T
example3 = example3.rename(columns={6776:'HOUSEHOLD_1_118_CA_3'}) # Name it correctly
example3 = example3.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example3 = example3.merge(cal, how='left', validate='1:1')
example.head()
Part II:时间纬度 - 查看不同时间颗粒度下的分布
Sales broken down by time variables
- Now that we have our example item lets see how it sells by:
- Day of the week
- Month
- Year
examples = ['FOODS_3_090_CA_3','HOBBIES_1_234_CA_3','HOUSEHOLD_1_118_CA_3']
example_df = [example, example2, example3]
for i in [0, 1, 2]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 3))
example_df[i].groupby('wday').mean()[examples[i]] \
.plot(kind='line',
title='average sale: day of week',
lw=5,
color=color_pal[0],
ax=ax1)
example_df[i].groupby('month').mean()[examples[i]] \
.plot(kind='line',
title='average sale: day of month',
lw=5,
color=color_pal[4],
ax=ax2)
example_df[i].groupby('year').mean()[examples[i]] \
.plot(kind='line',
title='average sale: day of year',
lw=5,
color=color_pal[2],
ax=ax3)
plt.tight_layout()
plt.show()
Part III: 商品id纬度 - 随机抽取多个商品查看销量 - 识别潜在特征
抽取多个数据,便于初步了解数据范围,和异常数据,业务特点等
Lets look at a lot of different items!
- Lets put it all together to plot 20 different items and their sales
- Some observations from these plots:
- It is common to see an item unavailable for a period of time.
- Some items only sell 1 or less in a day, making it very hard to predict.
- Other items show spikes in their demand (super bowl sunday?) possibly the "events" provided to us could help with these.
# stv_new = stv.sample(20, random_state=529).set_index('id')[d_cols].T
# print('stv = \n', stv_new.head())
twenty_examples = stv.sample(20, random_state=529) \
.set_index('id')[d_cols] \
.T \
.merge(cal.set_index('d')['date'],
left_index=True,
right_index=True,
validate='1:1') \
.set_index('date')
twenty_examples.head()
fig, axs = plt.subplots(10, 2, figsize=(15, 20))
axs = axs.flatten()
ax_idx = 0
for item in twenty_examples.columns:
twenty_examples[item].plot(title=item,
color=next(color_cycle),
ax=axs[ax_idx])
ax_idx += 1
plt.tight_layout()
plt.show()
Part IV: 商品类别维度 - 对比分析不同类别商品销量 - 总数/随时间分布
不同商品类别销量对比
Combined Sales over Time by Type
- We have several item types:
- Hobbies
- Household
- Foods
- Lets plot the total demand over time for each type
stv['cat_id'].unique()
stv.groupby('cat_id').count()['id'] \
.sort_values() \
.plot(kind='barh', figsize=(15, 5), title='Count of Items by Category')
plt.show()
past_sales = stv.set_index('id')[d_cols] \
.T \
.merge(cal.set_index('d')['date'],
left_index=True,
right_index=True,
validate='1:1') \
.set_index('date')
for i in stv['cat_id'].unique():
items_col = [c for c in past_sales.columns if i in c]
past_sales[items_col] \
.sum(axis=1) \
.plot(figsize=(15, 5),
alpha=0.8,
title='Total Sales by Item Type')
plt.legend(stv['cat_id'].unique())
plt.show()
Part V: 生命周期维度 - 分析商品销售-新品与停售(sku数量)
Rollout of items being sold.
- We can see the some items come into supply that previously didn't exist. Similarly some items stop being sold completely.
- Lets plot the sales, but only count if item is selling or not selling (0 -> not selling, >0 -> selling)
- This plot shows us that many items are being slowly introduced into inventory, so many of them will not register a sale at the beginning of the provided data.
past_sales_clipped = past_sales.clip(0, 1)
for i in stv['cat_id'].unique():
items_col = [c for c in past_sales.columns if i in c]
(past_sales_clipped[items_col] \
.mean(axis=1) * 100) \
.plot(figsize=(15, 5),
alpha=0.8,
title='Inventory Sale Percentage by Date',
style='.')
plt.ylabel('% of Inventory with at least 1 sale')
plt.legend(stv['cat_id'].unique())
plt.show()
Part VI: 门店维度 - 门店销量by Date
Sales by Store
We are provided data for 10 unique stores. What are the total sales by stores?
- Note that some stores are more steady than others.
- CA_2 seems to have a big change occur in 2015
store_list = sellp['store_id'].unique()
for s in store_list:
store_items = [c for c in past_sales.columns if s in c]
(past_sales[store_items] \
.sum(axis=1)) \
.rolling(90).mean() \
.plot(figsize=(15, 5),
alpha=0.8,
title='Rolling 90 Day Average Total Sales (10 stores)')
plt.legend(store_list)
plt.show()
Looking at the same data a different way, we can plot a rolling 7 day total demand count by store. Note clearly that some stores have abrupt changes in their demand, it could be that the store expanded or a new competitor was built near by. Either way this is imporant to note when creating predictive models about demand pattern.
fig, axes = plt.subplots(5, 2, figsize=(15, 10), sharex=True)
axes = axes.flatten()
ax_idx = 0
for s in store_list:
store_items = [c for c in past_sales.columns if s in c]
past_sales[store_items] \
.sum(axis=1) \
.rolling(7).mean() \
.plot(alpha=1,
ax=axes[ax_idx],
title=s,
lw=3,
color=next(color_cycle))
ax_idx += 1
# plt.legend(store_list)
plt.suptitle('Weekly Sale Trends by Store ID')
plt.tight_layout()
plt.show()
Part VII: 时间组合纬度-星期按日期的展开
Sales Heatmap Calendar
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from datetime import datetime
from dateutil.relativedelta import relativedelta
def calmap(ax, year, data):
ax.tick_params('x', length=0, labelsize="medium", which='major')
ax.tick_params('y', length=0, labelsize="x-small", which='major')
# Month borders
xticks, labels = [], []
start = datetime(year,1,1).weekday()
for month in range(1,13):
first = datetime(year, month, 1)
last = first + relativedelta(months=1, days=-1)
y0 = first.weekday()
y1 = last.weekday()
x0 = (int(first.strftime("%j"))+start-1)//7
x1 = (int(last.strftime("%j"))+start-1)//7
P = [ (x0, y0), (x0, 7), (x1, 7),
(x1, y1+1), (x1+1, y1+1), (x1+1, 0),
(x0+1, 0), (x0+1, y0) ]
xticks.append(x0 +(x1-x0+1)/2)
labels.append(first.strftime("%b"))
poly = Polygon(P, edgecolor="black", facecolor="None",
linewidth=1, zorder=20, clip_on=False)
ax.add_artist(poly)
ax.set_xticks(xticks)
ax.set_xticklabels(labels)
ax.set_yticks(0.5 + np.arange(7))
ax.set_yticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
ax.set_title("{}".format(year), weight="semibold")
# Clearing first and last day from the data
valid = datetime(year, 1, 1).weekday()
data[:valid,0] = np.nan
valid = datetime(year, 12, 31).weekday()
# data[:,x1+1:] = np.nan
data[valid+1:,x1] = np.nan
# Showing data
ax.imshow(data, extent=[0,53,0,7], zorder=10, vmin=-1, vmax=1,
cmap="RdYlBu_r", origin="lower", alpha=.75)
It appears that walmarts are closed on Chirstmas day. The highest demand day of all the data was on Sunday March 6th, 2016. What happened on this day you may ask... well the Seventh Democratic presidential candidates debate hosted by CNN and held in Flint, Michigan...
print('The lowest sale date was:', past_sales.sum(axis=1).sort_values().index[0],
'with', past_sales.sum(axis=1).sort_values().values[0], 'sales')
print('The lowest sale date was:', past_sales.sum(axis=1).sort_values(ascending=False).index[0],
'with', past_sales.sum(axis=1).sort_values(ascending=False).values[0], 'sales')
from sklearn.preprocessing import StandardScaler
sscale = StandardScaler()
past_sales.index = pd.to_datetime(past_sales.index)
for i in stv['cat_id'].unique():
fig, axes = plt.subplots(3, 1, figsize=(20, 8))
items_col = [c for c in past_sales.columns if i in c]
sales2013 = past_sales.loc[past_sales.index.isin(pd.date_range('31-Dec-2012',
periods=371))][items_col].mean(axis=1)
vals = np.hstack(sscale.fit_transform(sales2013.values.reshape(-1, 1)))
calmap(axes[0], 2013, vals.reshape(53,7).T)
sales2014 = past_sales.loc[past_sales.index.isin(pd.date_range('30-Dec-2013',
periods=371))][items_col].mean(axis=1)
vals = np.hstack(sscale.fit_transform(sales2014.values.reshape(-1, 1)))
calmap(axes[1], 2014, vals.reshape(53,7).T)
sales2015 = past_sales.loc[past_sales.index.isin(pd.date_range('29-Dec-2014',
periods=371))][items_col].mean(axis=1)
vals = np.hstack(sscale.fit_transform(sales2015.values.reshape(-1, 1)))
calmap(axes[2], 2015, vals.reshape(53,7).T)
plt.suptitle(i, fontsize=30, x=0.4, y=1.01)
plt.tight_layout()
plt.show()
Some interesting things to note from these heatmaps:
- Food tends to have lower number of purchases as the month goes on. Could this be because people get their paychecks early in the month?
- Household and Hobby items sell much less in January - after the Holiday season is over.
- Cleary weekends are more popular shopping days regardless of the item category.
Part VIII:价格维度
Sale Prices
We are given historical sale prices of each item. Lets take a look at our example item from before.
- It looks to me like the price of this item is growing.
- Different stores have different selling prices.
fig, ax = plt.subplots(figsize=(15, 5))
stores = []
for store, d in sellp.query('item_id == "FOODS_3_090"').groupby('store_id'):
d.plot(x='wm_yr_wk',
y='sell_price',
style='.',
color=next(color_cycle),
figsize=(15, 5),
title='FOODS_3_090 sale price over time',
ax=ax,
legend=store)
stores.append(store)
plt.legend()
plt.legend(stores)
plt.show()
print('sellp = ', sellp.head())
sellp['Category'] = sellp['item_id'].str.split('_', expand=True)[0]
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
i = 0
for cat, d in sellp.groupby('Category'):
ax = d['sell_price'].apply(np.log1p) \
.plot(kind='hist',
bins=20,
title=f'Distribution of {cat} prices',
ax=axs[i],
color=next(color_cycle))
ax.set_xlabel('Log(price)')
i += 1
plt.tight_layout()
参考文献:
1. https://www.kaggle.com/robikscube/m5-forecasting-starter-data-exploration