数据分析： kaggle比赛 - 销量预测

程序员文章站 2022-07-14 14:17:38

...

Part 0: 数据读取

0.1 数据内容简述-Data Files：

0.2 数据读入

Part I: 抽检单个商品的销量分布

Part II:时间纬度 - 查看不同时间颗粒度下的分布

Part III: 商品id纬度 - 随机抽取多个商品查看销量 - 识别潜在特征

Part IV: 商品类别维度 - 对比分析不同类别商品销量 - 总数/随时间分布

Part V: 生命周期维度 - 分析商品销售-新品与停售（sku数量）

Part VI: 门店维度 - 门店销量by Date

Part VII: 时间组合纬度-星期按日期的展开

Part VIII:价格维度

继之前的Kaggle比赛实战记录，参考其他kaggler的经验，总结一篇数据分析，以作日后参考。

比赛链接（已结束，可下载数据）：https://www.kaggle.com/c/m5-forecasting-accuracy

Part 0: 数据读取

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
from itertools import cycle
pd.set_option('max_columns', 50)
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

0.1 数据内容简述-Data Files：

原始数据文件：

calendar.csv - Contains information about the dates on which the products are sold.
sales_train_validation.csv - Contains the historical daily unit sales data per product and store [d_1 - d_1913]
sample_submission.csv - The correct format for submissions. Reference the Evaluation tab for more info.
sell_prices.csv - Contains information about the price of the products sold per store and date.

预测目标

We are trying for forecast sales for 28 forecast days. The sample submission has the following format:

The columns represent 28 forecast days. We will fill these forecast days with our predictions.
The rows each represent a specific item. This id tells us the item type, state, and store. We don't know what these items are exactly.

0.2 数据读入

# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')

We are given historic sales data in the `sales_train_validation` dataset.
- rows exist in this dataset for days d_1 to d_1913. We are given the department, category, state, and store id of the item.
- d_1914 - d_1941 represents the `validation` rows which we will predict in stage 1
- d_1942 - d_1969 represents the `evaluation` rows which we will predict for the final competition standings.

Part I: 抽检单个商品的销量分布

1. 便于形成初步印象

Visualizing the data for a single item

Lets take a random item that sell a lot and see how it's sales look across the training data.
FOODS_3_090_CA_3_validation sells a lot
Note there are days where it appears the item is unavailable and sales flatline

d_cols = [c for c in stv.columns if 'd_' in c] # sales data columns

# Below we are chaining the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index, Keep only sales data columns
# 3. Transform so it's a column
# 4. Plot the data
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
    .set_index('id')[d_cols] \
    .T \
    .plot(figsize=(15, 5),
          title='FOODS_3_090_CA_3 sales by "d" number',
          color=next(color_cycle))
plt.legend('Sales')
plt.show()

Merging the data with real dates

We are given a calendar with additional information about past and future dates.
The calendar data can be merged with our days data
From this we can find weekly and annual trends

cal[['d', 'date', 'event_name_1', 'event_name_2', 'event_type_1', 'event_type_2', 'snap_CA']].head()

# Example 1: FOODS_3_090_CA_3_validation
# Merge calendar on our items' data
example = stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'][d_cols].T # The index and col_names become index
example = example.rename(columns={8412:'FOODS_3_090_CA_3'}) # Name it correctly
example = example.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example = example.merge(cal, how='left', validate='1:1')
example.set_index('date')['FOODS_3_090_CA_3'] \
    .plot(figsize=(15, 5),
          color=next(color_cycle),
          title='FOODS_3_090_CA_3 sales by actual sale dates')
plt.show()

# Example 2: HOBBIES_1_234_CA_3_validation
example2 = stv.loc[stv['id'] == 'HOBBIES_1_234_CA_3_validation'][d_cols].T
example2 = example2.rename(columns={6324:'HOBBIES_1_234_CA_3'}) # Name it correctly
example2 = example2.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example2 = example2.merge(cal, how='left', validate='1:1')

# Example 3: HOUSEHOLD_1_118_CA_3_validation
example3 = stv.loc[stv['id'] == 'HOUSEHOLD_1_118_CA_3_validation'][d_cols].T
example3 = example3.rename(columns={6776:'HOUSEHOLD_1_118_CA_3'}) # Name it correctly
example3 = example3.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example3 = example3.merge(cal, how='left', validate='1:1')

example.head()

数据分析： kaggle比赛 - 销量预测

Part II:时间纬度 - 查看不同时间颗粒度下的分布

Sales broken down by time variables

Now that we have our example item lets see how it sells by:
- Day of the week
- Month
- Year

examples = ['FOODS_3_090_CA_3','HOBBIES_1_234_CA_3','HOUSEHOLD_1_118_CA_3']
example_df = [example, example2, example3]

for i in [0, 1, 2]:
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 3))
    example_df[i].groupby('wday').mean()[examples[i]] \
    .plot(kind='line', 
          title='average sale: day of week', 
          lw=5, 
          color=color_pal[0],
          ax=ax1)
    example_df[i].groupby('month').mean()[examples[i]] \
    .plot(kind='line', 
          title='average sale: day of month', 
          lw=5, 
          color=color_pal[4],
          ax=ax2)
    example_df[i].groupby('year').mean()[examples[i]] \
    .plot(kind='line', 
          title='average sale: day of year', 
          lw=5, 
          color=color_pal[2],
          ax=ax3)
    plt.tight_layout()
    plt.show()

数据分析： kaggle比赛 - 销量预测

Part III: 商品id纬度 - 随机抽取多个商品查看销量 - 识别潜在特征

抽取多个数据，便于初步了解数据范围，和异常数据，业务特点等

Lets look at a lot of different items!

Lets put it all together to plot 20 different items and their sales
Some observations from these plots:
- It is common to see an item unavailable for a period of time.
- Some items only sell 1 or less in a day, making it very hard to predict.
- Other items show spikes in their demand (super bowl sunday?) possibly the "events" provided to us could help with these.

# stv_new = stv.sample(20, random_state=529).set_index('id')[d_cols].T
# print('stv = \n', stv_new.head())

twenty_examples = stv.sample(20, random_state=529) \
        .set_index('id')[d_cols] \
    .T \
    .merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date')

twenty_examples.head()

fig, axs = plt.subplots(10, 2, figsize=(15, 20))
axs = axs.flatten()
ax_idx = 0
for item in twenty_examples.columns:
    twenty_examples[item].plot(title=item,
                              color=next(color_cycle),
                              ax=axs[ax_idx])
    ax_idx += 1
plt.tight_layout()
plt.show()

数据分析： kaggle比赛 - 销量预测

Part IV: 商品类别维度 - 对比分析不同类别商品销量 - 总数/随时间分布

不同商品类别销量对比

Combined Sales over Time by Type

We have several item types:
- Hobbies
- Household
- Foods
Lets plot the total demand over time for each type

stv['cat_id'].unique()

stv.groupby('cat_id').count()['id'] \
    .sort_values() \
    .plot(kind='barh', figsize=(15, 5), title='Count of Items by Category')
plt.show()

数据分析： kaggle比赛 - 销量预测

past_sales = stv.set_index('id')[d_cols] \
    .T \
    .merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date')

for i in stv['cat_id'].unique():
    items_col = [c for c in past_sales.columns if i in c]
    past_sales[items_col] \
        .sum(axis=1) \
        .plot(figsize=(15, 5),
              alpha=0.8,
              title='Total Sales by Item Type')
plt.legend(stv['cat_id'].unique())
plt.show()

Part V: 生命周期维度 - 分析商品销售-新品与停售（sku数量）

Rollout of items being sold.

We can see the some items come into supply that previously didn't exist. Similarly some items stop being sold completely.
Lets plot the sales, but only count if item is selling or not selling (0 -> not selling, >0 -> selling)
This plot shows us that many items are being slowly introduced into inventory, so many of them will not register a sale at the beginning of the provided data.

past_sales_clipped = past_sales.clip(0, 1)
for i in stv['cat_id'].unique():
    items_col = [c for c in past_sales.columns if i in c]
    (past_sales_clipped[items_col] \
        .mean(axis=1) * 100) \
        .plot(figsize=(15, 5),
              alpha=0.8,
              title='Inventory Sale Percentage by Date',
              style='.')
plt.ylabel('% of Inventory with at least 1 sale')
plt.legend(stv['cat_id'].unique())
plt.show()

数据分析： kaggle比赛 - 销量预测

Part VI: 门店维度 - 门店销量by Date

Sales by Store

We are provided data for 10 unique stores. What are the total sales by stores?

Note that some stores are more steady than others.
CA_2 seems to have a big change occur in 2015

store_list = sellp['store_id'].unique()
for s in store_list:
    store_items = [c for c in past_sales.columns if s in c]
    (past_sales[store_items] \
        .sum(axis=1)) \
        .rolling(90).mean() \
        .plot(figsize=(15, 5),
              alpha=0.8,
              title='Rolling 90 Day Average Total Sales (10 stores)')
plt.legend(store_list)
plt.show()

数据分析： kaggle比赛 - 销量预测

Looking at the same data a different way, we can plot a rolling 7 day total demand count by store. Note clearly that some stores have abrupt changes in their demand, it could be that the store expanded or a new competitor was built near by. Either way this is imporant to note when creating predictive models about demand pattern.

fig, axes = plt.subplots(5, 2, figsize=(15, 10), sharex=True)
axes = axes.flatten()
ax_idx = 0
for s in store_list:
    store_items = [c for c in past_sales.columns if s in c]
    past_sales[store_items] \
        .sum(axis=1) \
        .rolling(7).mean() \
        .plot(alpha=1,
              ax=axes[ax_idx],
              title=s,
              lw=3,
              color=next(color_cycle))
    ax_idx += 1
# plt.legend(store_list)
plt.suptitle('Weekly Sale Trends by Store ID')
plt.tight_layout()
plt.show()

数据分析： kaggle比赛 - 销量预测

Part VII: 时间组合纬度-星期按日期的展开

Sales Heatmap Calendar

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from datetime import datetime
from dateutil.relativedelta import relativedelta


def calmap(ax, year, data):
    ax.tick_params('x', length=0, labelsize="medium", which='major')
    ax.tick_params('y', length=0, labelsize="x-small", which='major')

    # Month borders
    xticks, labels = [], []
    start = datetime(year,1,1).weekday()
    for month in range(1,13):
        first = datetime(year, month, 1)
        last = first + relativedelta(months=1, days=-1)

        y0 = first.weekday()
        y1 = last.weekday()
        x0 = (int(first.strftime("%j"))+start-1)//7
        x1 = (int(last.strftime("%j"))+start-1)//7

        P = [ (x0,   y0), (x0,    7),  (x1,   7),
              (x1,   y1+1), (x1+1,  y1+1), (x1+1, 0),
              (x0+1,  0), (x0+1,  y0) ]
        xticks.append(x0 +(x1-x0+1)/2)
        labels.append(first.strftime("%b"))
        poly = Polygon(P, edgecolor="black", facecolor="None",
                       linewidth=1, zorder=20, clip_on=False)
        ax.add_artist(poly)
    
    ax.set_xticks(xticks)
    ax.set_xticklabels(labels)
    ax.set_yticks(0.5 + np.arange(7))
    ax.set_yticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
    ax.set_title("{}".format(year), weight="semibold")
    
    # Clearing first and last day from the data
    valid = datetime(year, 1, 1).weekday()
    data[:valid,0] = np.nan
    valid = datetime(year, 12, 31).weekday()
    # data[:,x1+1:] = np.nan
    data[valid+1:,x1] = np.nan

    # Showing data
    ax.imshow(data, extent=[0,53,0,7], zorder=10, vmin=-1, vmax=1,
              cmap="RdYlBu_r", origin="lower", alpha=.75)

It appears that walmarts are closed on Chirstmas day. The highest demand day of all the data was on Sunday March 6th, 2016. What happened on this day you may ask... well the Seventh Democratic presidential candidates debate hosted by CNN and held in Flint, Michigan...

print('The lowest sale date was:', past_sales.sum(axis=1).sort_values().index[0],
     'with', past_sales.sum(axis=1).sort_values().values[0], 'sales')
print('The lowest sale date was:', past_sales.sum(axis=1).sort_values(ascending=False).index[0],
     'with', past_sales.sum(axis=1).sort_values(ascending=False).values[0], 'sales')

from sklearn.preprocessing import StandardScaler
sscale = StandardScaler()
past_sales.index = pd.to_datetime(past_sales.index)
for i in stv['cat_id'].unique():
    fig, axes = plt.subplots(3, 1, figsize=(20, 8))
    items_col = [c for c in past_sales.columns if i in c]
    sales2013 = past_sales.loc[past_sales.index.isin(pd.date_range('31-Dec-2012',
                                                                   periods=371))][items_col].mean(axis=1)
    vals = np.hstack(sscale.fit_transform(sales2013.values.reshape(-1, 1)))
    calmap(axes[0], 2013, vals.reshape(53,7).T)
    sales2014 = past_sales.loc[past_sales.index.isin(pd.date_range('30-Dec-2013',
                                                                   periods=371))][items_col].mean(axis=1)
    vals = np.hstack(sscale.fit_transform(sales2014.values.reshape(-1, 1)))
    calmap(axes[1], 2014, vals.reshape(53,7).T)
    sales2015 = past_sales.loc[past_sales.index.isin(pd.date_range('29-Dec-2014',
                                                                   periods=371))][items_col].mean(axis=1)
    vals = np.hstack(sscale.fit_transform(sales2015.values.reshape(-1, 1)))
    calmap(axes[2], 2015, vals.reshape(53,7).T)
    plt.suptitle(i, fontsize=30, x=0.4, y=1.01)
    plt.tight_layout()
    plt.show()

数据分析： kaggle比赛 - 销量预测

Some interesting things to note from these heatmaps:

Food tends to have lower number of purchases as the month goes on. Could this be because people get their paychecks early in the month?
Household and Hobby items sell much less in January - after the Holiday season is over.
Cleary weekends are more popular shopping days regardless of the item category.

Part VIII:价格维度

Sale Prices

We are given historical sale prices of each item. Lets take a look at our example item from before.

It looks to me like the price of this item is growing.
Different stores have different selling prices.

fig, ax = plt.subplots(figsize=(15, 5))
stores = []
for store, d in sellp.query('item_id == "FOODS_3_090"').groupby('store_id'):
    d.plot(x='wm_yr_wk',
          y='sell_price',
          style='.',
          color=next(color_cycle),
          figsize=(15, 5),
          title='FOODS_3_090 sale price over time',
         ax=ax,
          legend=store)
    stores.append(store)
    plt.legend()
plt.legend(stores)
plt.show()

数据分析： kaggle比赛 - 销量预测

print('sellp = ', sellp.head())

sellp['Category'] = sellp['item_id'].str.split('_', expand=True)[0]
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
i = 0
for cat, d in sellp.groupby('Category'):
    ax = d['sell_price'].apply(np.log1p) \
        .plot(kind='hist',
                         bins=20,
                         title=f'Distribution of {cat} prices',
                         ax=axs[i],
                                         color=next(color_cycle))
    ax.set_xlabel('Log(price)')
    i += 1
plt.tight_layout()

数据分析： kaggle比赛 - 销量预测

参考文献：

1. https://www.kaggle.com/robikscube/m5-forecasting-starter-data-exploration

数据分析： kaggle比赛 - 销量预测

Part 0: 数据读取

0.1 数据内容简述-Data Files：

0.2 数据读入

Part I: 抽检单个商品的销量分布

Visualizing the data for a single item

Merging the data with real dates

Part II:时间纬度 - 查看不同时间颗粒度下的分布

Sales broken down by time variables

Part III: 商品id纬度 - 随机抽取多个商品查看销量 - 识别潜在特征

Lets look at a lot of different items!

Part IV: 商品类别维度 - 对比分析不同类别商品销量 - 总数/随时间分布

Combined Sales over Time by Type

Part V: 生命周期维度 - 分析商品销售-新品与停售（sku数量）

Rollout of items being sold.

Part VI: 门店维度 - 门店销量by Date

Sales by Store

Part VII: 时间组合纬度-星期按日期的展开

Sales Heatmap Calendar

Part VIII:价格维度

Sale Prices

kaggle销量预测的baseline（M5 Forecasting - Accuracy）

Gartner预测大数据分析人才将供不应求

预测python数据分析师的工资

IDC分析师关于中国大数据市场的十大预测

零基础入门数据挖掘-二手车交易价格预测之数据分析

Pytorch实战Kaggle房价预测比赛

实战KAGGLE比赛：房价预测

机器学习Kaggle入门：房价预测比赛

kaggle比赛：房价预测（排名前4%）

实战Kaggle比赛(二)——房价预测