Kaggle: House Price Prediction
I. Data Exploration
This is another classic Kaggle competition. First, let's take a look at the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, tree, linear_model
from sklearn.ensemble import RandomForestRegressor
%matplotlib inline
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.shape,test_data.shape
The training set has 1460 samples and 81 features, while the test set has 1459 samples and 80 features (SalePrice is the target we need to predict).
First, extract the house prices from the training set to serve as the dependent variable for model training:
train_y = train_data.pop('SalePrice')
Let's also look at the distribution of y:
y_plot = sns.distplot(train_y)
The data is clearly right-skewed. To bring it closer to the assumptions behind the usual statistical inference (and the linear models below), apply a log transform:
train_y = np.log(train_y)
y_plot = sns.distplot(train_y)
The dependent variable is now roughly normally distributed.
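The effect of the transform can also be checked numerically. A minimal sketch, using synthetic lognormal data as a stand-in for SalePrice (the `sample_skew` helper is defined here for illustration, not part of the notebook):

```python
import numpy as np

# Synthetic right-skewed "prices" (lognormal), standing in for SalePrice
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.4, size=1460)

def sample_skew(x):
    # Simple moment-based sample skewness
    x = np.asarray(x, dtype=float)
    m = x.mean()
    return ((x - m) ** 3).mean() / (x.std() ** 3)

skew_before = sample_skew(prices)           # clearly positive (right-skewed)
skew_after = sample_skew(np.log(prices))    # near zero after the log
print(skew_before, skew_after)
```

The raw values show strong positive skew, while the logged values are close to symmetric, which is exactly what the distribution plots above show.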
Since there are far too many features, we pick a few of them to examine against the house price.
1. YearBuilt vs. SalePrice
var = 'YearBuilt'
data = pd.concat([np.exp(train_y), train_data[var]], axis=1)  # exponentiate back to raw prices for plotting
f, ax = plt.subplots(figsize=(16, 8))
ax = sns.boxplot(x=var, y="SalePrice", data=data)
ax.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90);
For houses built in the last thirty years or so, price is positively correlated with the year built.
2. OverallQual vs. SalePrice
var = 'OverallQual'
data = pd.concat([np.exp(train_y), train_data[var]], axis=1)  # exponentiate back to raw prices for plotting
f, ax = plt.subplots(figsize=(16, 8))
ax = sns.boxplot(x=var, y="SalePrice", data=data)
ax.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90);
House price is clearly positively correlated with the overall quality rating.
Many other features are also highly correlated with the price; we won't go through them all here.
II. Data Cleaning
First, combine the training and test sets into one DataFrame, then drop the less important features:
features = pd.concat([train_data, test_data], keys=['train', 'test'])
features.drop(['Utilities', 'RoofMatl', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'LowQualFinSF',
'BsmtFullBath', 'BsmtHalfBath', 'Functional', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal'],
axis=1, inplace=True)
Take a look at the missing values:
NAs = features.isnull().sum()
NAs[NAs > 0]
Then fill in (or drop) the missing data:
features['MSSubClass'] = features['MSSubClass'].astype(str)
features['MSZoning'] = features['MSZoning'].fillna(features['MSZoning'].mode()[0])
features['LotFrontage'] = features['LotFrontage'].fillna(features['LotFrontage'].mean())
features['Alley'] = features['Alley'].fillna('NOACCESS')
features['MasVnrType'] = features['MasVnrType'].fillna(features['MasVnrType'].mode()[0])
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
features[col] = features[col].fillna('NoBSMT')
features['TotalBsmtSF'] = features['TotalBsmtSF'].fillna(0)
features['Electrical'] = features['Electrical'].fillna(features['Electrical'].mode()[0])
features['KitchenAbvGr'] = features['KitchenAbvGr'].astype(str)
features['KitchenQual'] = features['KitchenQual'].fillna(features['KitchenQual'].mode()[0])
features['FireplaceQu'] = features['FireplaceQu'].fillna('NoFP')
for col in ('GarageType', 'GarageFinish', 'GarageQual'):
features[col] = features[col].fillna('NoGRG')
features['GarageCars'] = features['GarageCars'].fillna(0.0)
features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
features['YrSold'] = features['YrSold'].astype(str)
features['MoSold'] = features['MoSold'].astype(str)
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
features.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1, inplace=True)
MSSubClass, KitchenAbvGr, YrSold and MoSold are converted to strings, since they are really categorical codes.
MSZoning, MasVnrType, Electrical, KitchenQual and SaleType have only a few missing values, so they are filled with the mode.
LotFrontage is filled with the mean.
TotalBsmtSF, 1stFlrSF and 2ndFlrSF are merged into a single TotalSF feature.
For features with many missing values, or where "none" is itself a meaningful category, fillna turns the missing value (or 0) into its own feature value.
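The three fill strategies above (mode, mean, sentinel category) can be sketched on a tiny hypothetical DataFrame:

```python
import pandas as pd

# Toy frame with the three kinds of missing values handled above
df = pd.DataFrame({
    'MSZoning':    ['RL', 'RL', None, 'RM'],      # categorical -> fill with mode
    'LotFrontage': [60.0, None, 80.0, 70.0],      # numeric -> fill with mean
    'Alley':       [None, 'Grvl', None, 'Pave'],  # missing means "no alley" -> sentinel
})

df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['Alley'] = df['Alley'].fillna('NOACCESS')

print(df.isnull().sum().sum())  # 0 -- no missing values remain
```

The choice matters: the mode keeps a rare category from being invented, the mean keeps the numeric distribution centered, and the sentinel preserves the information that the house genuinely lacks the feature.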
III. Feature Engineering
1. One-hot encode the categorical variables
Convert Id to string first, so this identifier column is kept out of the numeric standardization below:
features['Id'] = features['Id'].astype(str)
all_dummy = pd.get_dummies(features)
2. Standardize the numeric variables
numerical_col = features.columns[features.dtypes != 'object']
means = all_dummy.loc[:, numerical_col].mean()
std = all_dummy.loc[:, numerical_col].std()
all_dummy.loc[:, numerical_col] = (all_dummy.loc[:, numerical_col] - means) / std
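A quick sanity check of the standardization step, on a toy numeric frame standing in for the numeric columns of all_dummy (column names are made up for illustration):

```python
import pandas as pd

# Hypothetical numeric features
num = pd.DataFrame({'TotalSF': [1500.0, 2200.0, 900.0, 3100.0],
                    'LotArea': [8000.0, 9600.0, 7200.0, 11000.0]})

means = num.mean()
std = num.std()          # pandas uses the sample std (ddof=1) by default
scaled = (num - means) / std

# After scaling, each column has mean ~0 and std ~1
print(scaled.mean().tolist(), scaled.std().tolist())
```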
We can look at the correlations among the numeric variables:
ax = sns.pairplot(all_dummy.loc[:,numerical_col])
The table now looks like this:
IV. Model Building
This is scikit-learn's algorithm cheat-sheet; for our case, Lasso or ElasticNet looks like a good fit.
Here we build the model with ElasticNet:
from sklearn.model_selection import cross_val_score
# Split the combined frame back into train and test via the MultiIndex keys
train_X = all_dummy.loc['train']
test_X = all_dummy.loc['test']
alphas = [0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10]
test_scores = []
for alpha in alphas:
    ElasticN = linear_model.ElasticNetCV(alphas=[alpha],
                                         l1_ratio=[.01, .1, .5, .9, .99],
                                         max_iter=5000)
    test_score = cross_val_score(ElasticN, train_X, train_y, cv=5)
    test_scores.append(test_score.mean())
test_scores
Trying alpha over [0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10], the mean cross-validation scores are:
The score peaks at alpha = 0.001.
sqrt_score1 = np.sqrt(test_scores)
plt.scatter(alphas[0:5], sqrt_score1[0:5])  # the largest alphas score poorly, so only the first five are plotted
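Reading the best alpha off the plot can also be done programmatically; a small sketch (the scores below are illustrative placeholders, not the actual CV output):

```python
import numpy as np

# The alphas tried above, paired with hypothetical mean CV scores
alphas = [0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10]
scores = [0.8801, 0.8853, 0.8890, 0.8712, 0.8010, 0.6500, 0.4000]  # illustrative only

# Pick the alpha with the highest mean cross-validation score
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha)  # 0.001
```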
Predict y:
ElasticN = linear_model.ElasticNetCV(alphas=[0.001],
                                     l1_ratio=[.01, .1, .5, .9, .99],
                                     max_iter=5000)
ent = ElasticN.fit(train_X, train_y)
test_y = ent.predict(test_X)
V. Model Ensembling
To get a more accurate prediction, we build a second model with a random forest and blend its result with ElasticNet's:
Max_features = [.1, .3, .5, .7, .9, .99]
test_scores2 = []
for feature in Max_features:
    RFR = RandomForestRegressor(n_estimators=200, max_features=feature)
    test_score = cross_val_score(RFR, train_X, train_y, cv=5)
    test_scores2.append(test_score.mean())
test_scores2
sqrt_score2 = np.sqrt(test_scores2)
plt.scatter(Max_features,sqrt_score2)
The score peaks at max_features = 0.3, so we build the model with that value and predict y:
RFR = RandomForestRegressor(n_estimators=200, max_features=0.3)
clf = RFR.fit(train_X,train_y)
test_y2 = clf.predict(test_X)
Now that we have predictions from both models, we can blend them:
Final_SalePrice = (np.exp(test_y)+np.exp(test_y2))/2
pd.DataFrame({'Id': test_data.Id, 'SalePrice': Final_SalePrice}).to_csv('House_Price_Prediction2.csv', index =False)
The blend here simply averages the two predictions. Since we took the log of the price earlier, don't forget to exponentiate before saving.
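The exp-then-average step can be sketched on a few hypothetical predictions (the dollar amounts are made up for illustration):

```python
import numpy as np

# Two hypothetical log-price predictions for three houses
log_pred_enet = np.log(np.array([200000.0, 150000.0, 320000.0]))
log_pred_rf   = np.log(np.array([210000.0, 140000.0, 300000.0]))

# Exponentiate back to dollars first, then average the two models
blended = (np.exp(log_pred_enet) + np.exp(log_pred_rf)) / 2
print(blended)
```

Note the order matters: averaging in log space and then exponentiating would give the geometric rather than arithmetic mean of the two price predictions.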
That completes a first pass at house-price prediction. The result:
Rank 1403 out of 5032, roughly the top 28%. Not bad, but there is clearly more to mine in the features, and more ensembling methods worth trying.