
ASHRAE Kaggle Great Energy Predictor: Reviewing the Top Solutions After the Competition


1 Overview

First, a diagram from the 1st place team's analysis:

[Figure: 1st place solution overview]

2 Learning from Their Processing Ideas

2.1 Removing Outliers

Two kinds of anomalies were removed:

  1. Long streaks of constant values
  2. Large positive/negative spikes

They validated potential anomalies against all buildings in the dataset: if an anomaly appeared in multiple buildings at the same time, they could be reasonably confident it was a genuine anomaly.

Summary: verify a suspected outlier from multiple angles before deciding it is a real anomaly.
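A minimal sketch (my own illustration, not the team's code) of flagging the first kind of anomaly, long runs of identical readings, assuming train is sorted by building_id, meter, timestamp:

import pandas as pd

def flag_constant_streaks(df: pd.DataFrame, min_len: int = 48) -> pd.Series:
    """Boolean mask marking rows inside runs of >= min_len identical
    consecutive meter_reading values within each building_id+meter series."""
    s = df["meter_reading"]
    # A new run starts when the value changes or the building/meter changes.
    new_run = (
        s.ne(s.shift())
        | df["building_id"].ne(df["building_id"].shift())
        | df["meter"].ne(df["meter"].shift())
    )
    run_id = new_run.cumsum()
    run_len = run_id.map(run_id.value_counts())  # length of the run each row belongs to
    return run_len >= min_len

# usage: train = train[~flag_constant_streaks(train)]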

2.2 Missing Values

The temperature metadata has many missing values. They found that imputing the missing data with linear interpolation helped their models.
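A minimal sketch of that imputation, assuming the competition's weather_train.csv layout (site_id, timestamp, air_temperature):

import pandas as pd

# Interpolate missing air temperatures linearly within each site's time series.
weather = pd.read_csv("weather_train.csv", parse_dates=["timestamp"])
weather = weather.sort_values(["site_id", "timestamp"])
weather["air_temperature"] = weather.groupby("site_id")["air_temperature"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)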

2.3 Target Function

Most competitors used log1p(meter_reading) as the target; this team, unusually, predicted log1p(meter_reading)/square_feet instead.
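In code, taking the formula exactly as the post states it (a toy frame stands in for train merged with the building metadata's square_feet):

import numpy as np
import pandas as pd

# Toy rows standing in for train joined with building metadata.
train = pd.DataFrame({"meter_reading": [0.0, 120.5, 3000.0],
                      "square_feet":   [8000, 8000, 50000]})
train["target_common"]   = np.log1p(train["meter_reading"])                          # the usual choice
train["target_per_sqft"] = np.log1p(train["meter_reading"]) / train["square_feet"]   # the team's choice
# At prediction time, invert with meter_reading = expm1(pred * square_feet).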

2.4 Feature Engineering

  1. Categorical interactions, such as concatenating building_id and meter into a new categorical feature building_id_meter.
  2. Count frequency of features, i.e. encoding each value by how often it occurs.
  3. Smoothed and 1st-, 2nd-order differentiated temperature features using a Savitzky-Golay filter (see 2.4.1).
  4. Cyclic encoding of periodic features; e.g., hour gets mapped to hour_x = cos(2π·hour/24) and hour_y = sin(2π·hour/24). A slick trick: the sine/cosine pair keeps hour 23 adjacent to hour 0 (see the sketch after this list).
  5. Bayesian target encoding, a target-encoding scheme the authors wrote themselves, covered in detail in 2.4.2.
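A minimal sketch of the cyclic encoding from item 4:

import numpy as np
import pandas as pd

# Map hour onto the unit circle so that hours 23 and 0 end up adjacent.
df = pd.DataFrame({"hour": np.arange(24)})
df["hour_x"] = np.cos(2 * np.pi * df["hour"] / 24)
df["hour_y"] = np.sin(2 * np.pi * df["hour"] / 24)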

2.4.1 Savitzky-Golay filter

  • The Savitzky-Golay convolution smoother is a refinement of plain moving-average smoothing.
  • The key to Savitzky-Golay smoothing is solving for the matrix operator.

[Figure: derivation of the Savitzky-Golay matrix operator]

Summary: first compute the operator B, then compute the fitted Y from it; it is all matrix arithmetic and should not be hard. Code to follow when I get around to reproducing it.
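For reference, the standard least-squares derivation behind that operator (my summary of the textbook result, not taken from the post): let $X$ be the $(2m+1)\times(k+1)$ design matrix of powers of the window offsets $-m,\dots,m$, and $Y$ the window values. Fitting the polynomial coefficients $A$ by least squares gives

$$\hat{A} = (X^\top X)^{-1} X^\top Y, \qquad \hat{Y} = X\hat{A} = \underbrace{X (X^\top X)^{-1} X^\top}_{B}\, Y,$$

so $B$ depends only on the window length and polynomial order and can be precomputed once.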
Back to the competition — here is what their processing produced:

[Figure: raw vs. Savitzky-Golay-smoothed temperature, and its 1st/2nd derivatives]
  • In the first plot, the blue line is the raw data.
  • In the first plot, the yellow line is the S-G-smoothed data.
  • In the second plot, the blue line is the first derivative of the S-G-smoothed data.
  • In the second plot, the yellow line is the second derivative of the S-G-smoothed data.
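A minimal sketch of those three features using SciPy's implementation (the window length and polynomial order here are illustrative choices, not the team's):

import numpy as np
from scipy.signal import savgol_filter

# Noisy stand-in for an air-temperature series.
rng = np.random.default_rng(0)
temps = np.sin(np.linspace(0, 8 * np.pi, 500)) + rng.normal(0, 0.3, 500)

smoothed = savgol_filter(temps, window_length=25, polyorder=3)      # smoothed series
d1 = savgol_filter(temps, window_length=25, polyorder=3, deriv=1)   # 1st derivative
d2 = savgol_filter(temps, window_length=25, polyorder=3, deriv=2)   # 2nd derivative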

2.4.2 Bayesian target encoding (Python implementation)

The idea, as the code below implements it: model each group's target distribution as Gaussian and shrink the group's MLE mean toward a prior (the global target mean, or lower-order encodings passed as prior_cols), with prior_precision controlling the strength of the shrinkage.

import gc
import numpy as np
import pandas as pd 
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

# Evaluation metric; its definition was missing from the original listing and is
# reconstructed here from the otherwise-unused mean_squared_error import.
def rmsle(y_pred, y_true):
    return np.sqrt(mean_squared_error(np.log1p(y_pred), np.log1p(y_true)))

PRIOR_PRECISION = 10
class GaussianTargetEncoder():
        
    def __init__(self, group_cols, target_col="target", prior_cols=None):
        self.group_cols = group_cols
        self.target_col = target_col
        self.prior_cols = prior_cols

    def _get_prior(self, df):
        if self.prior_cols is None:
            prior = np.full(len(df), df[self.target_col].mean())
        else:
            prior = df[self.prior_cols].mean(1)
        return prior
                    
    def fit(self, df):
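        # Per-group sufficient statistics: count n, the MLE mean and variance
        # of the target, and the average prior mean within the group.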
        self.stats = df.assign(mu_prior=self._get_prior(df), y=df[self.target_col])
        self.stats = self.stats.groupby(self.group_cols).agg(
            n        = ("y", "count"),
            mu_mle   = ("y", np.mean),
            sig2_mle = ("y", np.var),
            mu_prior = ("mu_prior", np.mean),
        )        
    
    def transform(self, df, prior_precision=1000, stat_type="mean"):
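        # Gaussian conjugate update: blend the prior mean with each group's
        # MLE mean, weighting each by its precision (inverse variance).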
        
        precision = prior_precision + self.stats.n/self.stats.sig2_mle
        
        if stat_type == "mean":
            numer = prior_precision*self.stats.mu_prior\
                    + self.stats.n/self.stats.sig2_mle*self.stats.mu_mle
            denom = precision
        elif stat_type == "var":
            numer = 1.0
            denom = precision
        elif stat_type == "precision":
            numer = precision
            denom = 1.0
        else: 
            raise ValueError(f"stat_type={stat_type} not recognized.")
        
        mapper = dict(zip(self.stats.index, numer / denom))
        if isinstance(self.group_cols, str):
            keys = df[self.group_cols].values.tolist()
        elif len(self.group_cols) == 1:
            keys = df[self.group_cols[0]].values.tolist()
        else:
            keys = zip(*[df[x] for x in self.group_cols])
        
        values = np.array([mapper.get(k) for k in keys]).astype(float)
        
        prior = self._get_prior(df)
        values[~np.isfinite(values)] = prior[~np.isfinite(values)]
        
        return values
    
    def fit_transform(self, df, *args, **kwargs):
        self.fit(df)
        return self.transform(df, *args, **kwargs)
# load data
train = pd.read_csv("/kaggle/input/ashrae-energy-prediction/train.csv")
test  = pd.read_csv("/kaggle/input/ashrae-energy-prediction/test.csv")
# create target
train["target"] = np.log1p(train.meter_reading)
test["target"] = train.target.mean()
# create time features
def add_time_features(df):
    df.timestamp = pd.to_datetime(df.timestamp)    
    df["hour"]    = df.timestamp.dt.hour
    df["weekday"] = df.timestamp.dt.weekday
    df["month"]   = df.timestamp.dt.month

add_time_features(train)
add_time_features(test)
# define groupings and corresponding priors
groups_and_priors = {
    
    # single encodings
    ("hour",):        None,
    ("weekday",):     None,
    ("month",):       None,
    ("building_id",): None,
    ("meter",):       None,
    
    # second-order interactions
    ("meter", "hour"):        ["gte_meter", "gte_hour"],
    ("meter", "weekday"):     ["gte_meter", "gte_weekday"],
    ("meter", "month"):       ["gte_meter", "gte_month"],
    ("meter", "building_id"): ["gte_meter", "gte_building_id"],
        
    # higher-order interactions
    ("meter", "building_id", "hour"):    ["gte_meter_building_id", "gte_meter_hour"],
    ("meter", "building_id", "weekday"): ["gte_meter_building_id", "gte_meter_weekday"],
    ("meter", "building_id", "month"):   ["gte_meter_building_id", "gte_meter_month"],
}
features = []
for group_cols, prior_cols in groups_and_priors.items():
    features.append(f"gte_{'_'.join(group_cols)}")
    gte = GaussianTargetEncoder(list(group_cols), "target", prior_cols)    
    train[features[-1]] = gte.fit_transform(train, PRIOR_PRECISION)
    test[features[-1]]  = gte.transform(test,  PRIOR_PRECISION)
train_preds = np.zeros(len(train))
test_preds = np.zeros(len(test))

for m in range(4):
    
    print(f"Meter {m}", end="") 
    
    # instantiate model
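    # (note: RidgeCV's `normalize=True` was deprecated in scikit-learn 1.0 and
    # removed in 1.2; on newer versions, standardize the features beforehand)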
    model = RidgeCV(
        alphas=np.logspace(-10, 1, 25), 
        normalize=True,
    )    
    
    # fit model
    model.fit(
        X=train.loc[train.meter==m, features].values, 
        y=train.loc[train.meter==m, "target"].values
    )

    # make predictions 
    train_preds[train.meter==m] = model.predict(train.loc[train.meter==m, features].values)
    test_preds[test.meter==m]   = model.predict(test.loc[test.meter==m, features].values)
    
    # transform predictions
    train_preds[train_preds < 0] = 0
    train_preds[train.meter==m] = np.expm1(train_preds[train.meter==m])
    
    test_preds[test_preds < 0] = 0 
    test_preds[test.meter==m] = np.expm1(test_preds[test.meter==m])
    
    # evaluate model
    meter_rmsle = rmsle(
        train_preds[train.meter==m],
        train.loc[train.meter==m, "meter_reading"].values
    )
    
    print(f", rmsle={meter_rmsle:0.5f}")

print(f"Overall rmsle={rmsle(train_preds, train.meter_reading.values):0.5f}")
del train, train_preds, test
gc.collect()

2.5 Model Ensembling

  • The 2nd place team's reasoning: "Due to the size of the dataset and difficulty in setting up a robust validation framework, we did not focus much on feature engineering, fearing it might not extrapolate cleanly to the test data. Instead we chose to ensemble as many different models as possible to capture more information and help the predictions to be stable across years."
    In short: with the dataset this large and a robust validation framework hard to build, they worried engineered features would not extrapolate cleanly to the test data, so they focused on ensembling as many diverse models as possible instead. Their past experience taught them that building good features without a reliable validation framework is very tricky.
  • The 2nd place team's approach: "We bagged a bunch of boosting models XGB, LGBM, CB at various levels of data: Models for every site+meter, models for every building+meter, models for every building-type+meter and models using entire train data. It was very useful to build a separate model for each site so that the model could capture site-specific patterns and each site could be fitted with a different parameter set suitable for it. It also automatically solved for issues like timestamp alignment and feature measurement scale being different across sites so we didn't have to solve for them separately."
    They built a separate model for each data slice; in total the authors built over 5000 models for this competition and ensembled them (the per-site idea is sketched below).
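A hypothetical sketch of just the "one model per site+meter" slice of that scheme. It assumes site_id has been merged onto the train frame from the building metadata and reuses train and features from the snippet in 2.4.2; the hyperparameters are placeholders, and the real solution bagged XGB/LGBM/CB across many such slicings:

import lightgbm as lgb

site_meter_models = {}
for (site, meter), part in train.groupby(["site_id", "meter"]):
    # One independently fitted model per site+meter slice, so each slice can
    # capture site-specific patterns with its own parameter set.
    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(part[features], part["target"])
    site_meter_models[(site, meter)] = model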

2.6 Why does postprocessing work? 2nd place magic

See the 2nd place team's discussion post of the same name for the details of their postprocessing trick.

References:
1st Place Solution Team Isamu & Matt
2nd Place Solution
Savitzky-Golay filter
