
Kaggle: House Price Prediction

程序员文章站 2024-03-22 08:10:22

1. Data Exploration

Another classic Kaggle competition. Let's start by looking at the data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, tree, linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score  # needed for the CV loops below
%matplotlib inline

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv') 
train_data.shape,test_data.shape

The training set has 1460 samples and 81 features, while the test set has 1459 samples and 80 features (SalePrice is missing there because it is what we need to predict).

First we pull SalePrice out of the training set to serve as the dependent variable during model training:

train_y_skewed = train_data.pop('SalePrice')  # keep the raw prices for the boxplots below
train_y = train_y_skewed.copy()
y_plot = sns.distplot(train_y)

While we're at it, a look at y's distribution curve:

[Figure: distribution of SalePrice]

The distribution is clearly right-skewed. To bring the data closer to the normality that many statistical methods assume, and so make inference more reliable, we take the log:

train_y = np.log(train_y)
y_plot = sns.distplot(train_y)

[Figure: distribution of log(SalePrice)]

The target is now approximately normally distributed.
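The effect of the log transform can be checked numerically on synthetic right-skewed data; the lognormal sample below is just a stand-in for real prices:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)  # right-skewed, like raw house prices

print(skew(prices))          # strongly positive: long right tail
print(skew(np.log(prices)))  # near zero: roughly symmetric after the log
```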


Since there are far too many features to inspect them all, we pick a few and see how they relate to SalePrice.

1. YearBuilt vs. SalePrice

var = 'YearBuilt'
data = pd.concat([train_y_skewed, train_data[var]], axis=1)  # raw (pre-log) prices
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90);

[Figure: boxplot of SalePrice by YearBuilt]


For houses built in the last thirty years, price is clearly positively correlated with build year.

2. OverallQual vs. SalePrice

var = 'OverallQual'
data = pd.concat([train_y_skewed, train_data[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90);

[Figure: boxplot of SalePrice by OverallQual]

House price is clearly positively correlated with the overall quality rating.

Many other features also correlate strongly with price; we won't go through them one by one here.
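One quick way to find those other highly correlated features is to rank the numeric columns by their absolute correlation with SalePrice. The toy frame below is illustrative, not the real data:

```python
import pandas as pd

# toy stand-in: GrLivArea tracks price closely, YrSold barely at all
df = pd.DataFrame({'SalePrice': [100, 200, 300, 150],
                   'GrLivArea': [900, 1800, 2600, 1400],
                   'YrSold':    [2006, 2008, 2007, 2009]})
corr = df.corr()['SalePrice'].drop('SalePrice').abs().sort_values(ascending=False)
print(corr.index[0])  # the feature most correlated with price
```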

2. Data Cleaning

First we combine the training and test sets into one frame, then drop the less important features:

features = pd.concat([train_data, test_data], keys=['train', 'test'])

features.drop(['Utilities', 'RoofMatl', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'LowQualFinSF',
               'BsmtFullBath', 'BsmtHalfBath', 'Functional', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'WoodDeckSF',
               'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal'],
              axis=1, inplace=True)

A quick look at the missing values:

NAs = pd.concat([features.isnull().sum()], keys=['Features'], axis=1)
NAs

[Table: missing-value counts per feature]

Then we fill in (or drop) the missing data:

# numeric codes that are really categories -> strings
features['MSSubClass'] = features['MSSubClass'].astype(str)
# few missing values -> fill with the mode
features['MSZoning'] = features['MSZoning'].fillna(features['MSZoning'].mode()[0])
# numeric -> fill with the mean
features['LotFrontage'] = features['LotFrontage'].fillna(features['LotFrontage'].mean())
# NA means "no alley access" -> make it its own category
features['Alley'] = features['Alley'].fillna('NOACCESS')
features['MasVnrType'] = features['MasVnrType'].fillna(features['MasVnrType'].mode()[0])
# NA means "no basement"
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    features[col] = features[col].fillna('NoBSMT')
features['TotalBsmtSF'] = features['TotalBsmtSF'].fillna(0)
features['Electrical'] = features['Electrical'].fillna(features['Electrical'].mode()[0])
features['KitchenAbvGr'] = features['KitchenAbvGr'].astype(str)
features['KitchenQual'] = features['KitchenQual'].fillna(features['KitchenQual'].mode()[0])
# NA means "no fireplace" / "no garage"
features['FireplaceQu'] = features['FireplaceQu'].fillna('NoFP')
for col in ('GarageType', 'GarageFinish', 'GarageQual'):
    features[col] = features[col].fillna('NoGRG')
features['GarageCars'] = features['GarageCars'].fillna(0.0)
features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
features['YrSold'] = features['YrSold'].astype(str)
features['MoSold'] = features['MoSold'].astype(str)
# merge the three area features into a single total square footage
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
features.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1, inplace=True)

MSSubClass, KitchenAbvGr, YrSold and MoSold are numeric codes rather than quantities, so we convert them to strings.

MSZoning, MasVnrType, Electrical, KitchenQual and SaleType have only a handful of missing values, so we fill them with the mode.

LotFrontage is filled with the mean.

TotalBsmtSF, 1stFlrSF and 2ndFlrSF are merged into a single TotalSF.

For features with many missing values, or where "none" is itself informative, fillna turns the missing value (or 0) into its own category.
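The three fill strategies above (mode for sparse categorical NAs, mean for numeric NAs, a sentinel label where NA is meaningful) can be seen on a toy frame; the column names match the real dataset but the values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    'MSZoning':    ['RL', 'RM', None, 'RL'],      # few NAs -> mode
    'LotFrontage': [65.0, None, 80.0, 75.0],      # numeric -> mean
    'Alley':       [None, 'Grvl', None, 'Pave'],  # NA means "no alley access"
})
df['MSZoning']    = df['MSZoning'].fillna(df['MSZoning'].mode()[0])
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['Alley']       = df['Alley'].fillna('NOACCESS')
print(df)
```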

3. Feature Engineering

1. One-hot encode the categorical variables

all_dummy = pd.get_dummies(features)

2. Standardize the numerical variables

features['Id'] = features['Id'].astype(str)  # exclude Id from the numeric columns
numerical_col = features.columns[features.dtypes != 'object']
means = all_dummy.loc[:, numerical_col].mean()
std = all_dummy.loc[:, numerical_col].std()
all_dummy.loc[:, numerical_col] = (all_dummy.loc[:, numerical_col] - means) / std
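On a toy frame, the encode-then-standardize pipeline looks like this (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'TotalSF':  [1000.0, 2000.0, 3000.0],
                    'MSZoning': ['RL', 'RM', 'RL']})
dummies = pd.get_dummies(toy)  # MSZoning -> MSZoning_RL / MSZoning_RM indicator columns
num = toy.columns[toy.dtypes != 'object']  # only the original numeric columns
dummies[num] = (dummies[num] - dummies[num].mean()) / dummies[num].std()
print(dummies['TotalSF'].tolist())  # standardized: mean 0, std 1
```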

We can also look at the correlations among the numerical variables:

ax = sns.pairplot(all_dummy.loc[:,numerical_col])

[Figure: pairplot of the numerical features]

At this point the table looks like this:

[Table: the processed feature matrix]

4. Modeling

[Figure: scikit-learn algorithm cheat-sheet]

This is the algorithm cheat-sheet recommended by scikit-learn. Following it, Lasso or ElasticNet is a good fit for our case.
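Note: the modeling code below uses `train_X` and `test_X`, which are never constructed explicitly in this article. Given the `keys=['train', 'test']` MultiIndex created during the cleaning step, the combined frame can be split back like this (toy frames stand in for the real features):

```python
import pandas as pd

features = pd.concat([pd.DataFrame({'x': [1, 2]}),   # stands in for the train rows
                      pd.DataFrame({'x': [3]})],     # stands in for the test rows
                     keys=['train', 'test'])
train_X = features.loc['train']  # rows that came from train.csv
test_X  = features.loc['test']   # rows that came from test.csv
print(train_X.shape, test_X.shape)
```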

Here we model with ElasticNet:

alphas = [0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10]
test_scores = []
for alpha in alphas:
    ElasticN = linear_model.ElasticNetCV(alphas=[alpha],
                                         l1_ratio=[.01, .1, .5, .9, .99],
                                         max_iter=5000)
    test_score = cross_val_score(ElasticN, train_X, train_y, cv=5)
    test_scores.append(test_score.mean())
test_scores

Trying alpha over [0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10] gives these mean cross-validation scores:

[Table: mean CV score for each alpha]

Alpha = 0.001 gives the highest score.

sqrt_score1 = np.sqrt(test_scores)
plt.scatter(alphas[0:5],sqrt_score1[0:5])

[Figure: sqrt of CV score vs. alpha]
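Rather than reading the best alpha off the plot, it can be picked programmatically; the scores below are hypothetical placeholders, not the real CV results:

```python
import numpy as np

alphas = [0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10]
test_scores = [0.88, 0.89, 0.91, 0.90, 0.85, 0.60, 0.30]  # hypothetical CV means
best_alpha = alphas[int(np.argmax(test_scores))]
print(best_alpha)
```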

Now predict y:

ElasticN = linear_model.ElasticNetCV(alphas=[0.001],
                                     l1_ratio=[.01, .1, .5, .9, .99],
                                     max_iter=5000)
ent = ElasticN.fit(train_X, train_y)
test_y = ent.predict(test_X)

5. Model Ensembling

To get a more accurate prediction, we build a second model with a random forest and blend its results with ElasticNet's:

Max_features = [.1, .3, .5, .7, .9, .99]
test_scores2 = []
for feature in Max_features:
    RFR = RandomForestRegressor(n_estimators=200, max_features=feature)
    test_score = cross_val_score(RFR, train_X, train_y, cv=5)
    test_scores2.append(test_score.mean())
test_scores2
sqrt_score2 = np.sqrt(test_scores2)
plt.scatter(Max_features, sqrt_score2)

[Figure: sqrt of CV score vs. max_features]

The score peaks at max_features = 0.3, so we build the model with it and predict y:

RFR = RandomForestRegressor(n_estimators=200, max_features=0.3)
clf = RFR.fit(train_X, train_y)
test_y2 = clf.predict(test_X)

Both models' predictions are now in hand, so we can blend them:

Final_SalePrice = (np.exp(test_y) + np.exp(test_y2)) / 2
pd.DataFrame({'Id': test_data.Id, 'SalePrice': Final_SalePrice}).to_csv('House_Price_Prediction2.csv', index=False)

The blend is simply the average of the two predictions. Since we log-transformed the prices earlier, remember to exponentiate before saving.
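The blend-and-exponentiate step can be checked on two made-up log-price predictions:

```python
import numpy as np

# hypothetical log-scale predictions from the two models
log_pred_enet = np.array([12.0, 11.5])
log_pred_rf   = np.array([12.2, 11.3])

# exponentiate first (back to price scale), then average
final = (np.exp(log_pred_enet) + np.exp(log_pred_rf)) / 2
print(final)
```

Averaging happens on the price scale, not the log scale, so this is not the same as exponentiating the average of the logs.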

That completes a first pass at the house price prediction. The result:

[Figure: leaderboard result]

1403/5032, roughly the top 28%. Not bad, but there is clearly more to dig out of the features, and other ensembling methods are worth trying.