MXNet Framework Study Notes (II): Hands-On Kaggle House Price Prediction
Kaggle is an excellent platform for data analysis and mining competitions, and house price prediction is one of its contests. I have already done plenty of analysis on this problem with Keras and with classical machine learning methods; since I have recently been working with the MXNet framework, this post walks through the house price prediction task with MXNet.
After registering an account, you can download the required datasets from the Kaggle platform. Opening the training set, the data columns are as follows:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
After dropping the leading Id column (and the SalePrice label), 79 feature columns remain, a mix of numeric and categorical fields, and all of them can be used as model inputs; the categorical ones just need to be encoded first, as sketched below.
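As a quick illustration of the preprocessing the script below applies (a minimal, made-up three-row example, not part of the original code), this is how standardization plus pd.get_dummies turns mixed-type columns into a purely numeric matrix:

import pandas as pd

# A tiny made-up frame mimicking the mix of numeric and categorical columns
toy = pd.DataFrame({'LotArea': [8450.0, 9600.0, None],
                    'MSZoning': ['RL', 'RM', None]})

# Standardize the numeric column, then fill NaN with the (now zero) mean
num_cols = toy.dtypes[toy.dtypes != 'object'].index
toy[num_cols] = toy[num_cols].apply(lambda x: (x - x.mean()) / x.std())
toy[num_cols] = toy[num_cols].fillna(0)

# One-hot encode the categorical column; dummy_na=True gives NaN its own indicator
toy = pd.get_dummies(toy, dummy_na=True)
print(toy.columns.tolist())
# ['LotArea', 'MSZoning_RL', 'MSZoning_RM', 'MSZoning_nan']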
The complete implementation is as follows:
#!/usr/bin/env python
# encoding:utf-8
"""
__Author__:沂水寒城
功能: MXNET 模块学习实践(二): 房价预测
"""
import numpy as np
import pandas as pd
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
train_data = pd.read_csv('house_data/kaggle_house_pred_train.csv')
test_data = pd.read_csv('house_data/kaggle_house_pred_test.csv')
print('train_shape: ', train_data.shape)  # (1460, 81)
print('test_shape: ', test_data.shape)  # (1459, 80)
print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])  # quick look at the data
# Concatenate the 79 feature columns of all training and test samples,
# dropping the leading Id column (and the SalePrice label from the training set)
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
# Standardize the continuous numeric features
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))
# After standardization every numeric feature has mean 0, so missing values can simply be replaced with 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
# Turn discrete values into indicator (one-hot) features;
# dummy_na=True also treats missing values as a valid category and creates an indicator for them
all_features = pd.get_dummies(all_features, dummy_na=True)
print('feature_shape: ', all_features.shape)
# Convert the NumPy-format data into NDArray format
n_train = train_data.shape[0]
train_features = nd.array(all_features[:n_train].values)
test_features = nd.array(all_features[n_train:].values)
train_labels = nd.array(train_data.SalePrice.values).reshape((-1, 1))
# Train the model: a basic linear regression model with the squared loss function
loss = gloss.L2Loss()
def get_net():
    '''
    Build the network: a single Dense layer, i.e. plain linear regression.
    '''
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize()
    return net
def log_rmse(net, features, labels):
    '''
    Compute the root-mean-square error between the logs of the predictions
    and the logs of the labels (the metric used by this Kaggle competition).
    '''
    # Clip predictions smaller than 1 to 1 so that taking the log is numerically stable
    clipped_preds = nd.clip(net(features), 1, float('inf'))
    # L2Loss is half the squared error, hence the factor of 2
    rmse = nd.sqrt(2 * loss(clipped_preds.log(), labels.log()).mean())
    return rmse.asscalar()
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    '''
    Train the model, recording the log-RMSE on the training data (and on the
    validation data, when it is supplied) after every epoch.
    '''
    train_ls, test_ls = [], []
    train_iter = gdata.DataLoader(gdata.ArrayDataset(
        train_features, train_labels), batch_size, shuffle=True)
    # Use the Adam optimization algorithm
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    '''
    Train on the full training set, predict on the test set, and store the
    predictions in the submission format expected by Kaggle.
    '''
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    print('train rmse %f' % train_ls[-1])
    preds = net(test_features).asnumpy()
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)
if __name__ == '__main__':
    num_epochs, lr, weight_decay, batch_size = 100, 5, 0, 64
    train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size)
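Note that train() accepts optional validation data, but the script above never supplies any, so hyperparameters like lr=5 are taken on faith. The textbook version of this example tunes them with k-fold cross-validation; the helper below is a minimal sketch of that idea, reusing nd, get_net() and train() from the script (the names get_k_fold_data and k_fold are my own, not part of the original code):

def get_k_fold_data(k, i, X, y):
    '''
    Return the i-th fold as validation data and the remaining
    k-1 folds, concatenated, as training data.
    '''
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = nd.concat(X_train, X_part, dim=0)
            y_train = nd.concat(y_train, y_part, dim=0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, lr, weight_decay, batch_size):
    '''
    Average the final train/validation log-RMSE over k folds.
    '''
    train_l_sum, valid_l_sum = 0.0, 0.0
    for i in range(k):
        X_tr, y_tr, X_va, y_va = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, X_tr, y_tr, X_va, y_va,
                                   num_epochs, lr, weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print('fold %d, train rmse %f, valid rmse %f' % (
            i, train_ls[-1], valid_ls[-1]))
    return train_l_sum / k, valid_l_sum / k

Calling, say, k_fold(5, train_features, train_labels, 100, 5, 0, 64) before train_and_pred would report an average validation log-RMSE, a more trustworthy basis for choosing lr and weight_decay than the training error alone.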
The resulting data is stored in submission.csv; part of it is shown below:
Id | SalePrice |
1461 | 119681.5 |
1462 | 154408.8 |
1463 | 198561 |
1464 | 217160.5 |
1465 | 177359.5 |
1466 | 193065.5 |
1467 | 195170 |
1468 | 187124.9 |
1469 | 195645 |
1470 | 123815.8 |
1471 | 196175 |
1472 | 102948.1 |
1473 | 105485.1 |
1474 | 150235 |
1475 | 92978.13 |
1476 | 306583.7 |
1477 | 247138.8 |
1478 | 282101.5 |
1479 | 270760.2 |
1480 | 402098.3 |
1481 | 298979.9 |
1482 | 211217.1 |
1483 | 192548.7 |
1484 | 176624.5 |
1485 | 201686.7 |
1486 | 211308.1 |
1487 | 291197.3 |
1488 | 244460.3 |
1489 | 197268.9 |
1490 | 236971.6 |
1491 | 209994.9 |
I only started learning MXNet today, so everything here is based on the textbook and official examples: run the demo first, then dig into the details. Overall, the methods MXNet provides feel quite simple; the only thing that takes getting used to is that method calls and instantiation differ from the deep learning frameworks I have used before, which should improve with practice. House price prediction itself needs little explanation, since most data mining practitioners have encountered it in some form; here it only serves as a demo for getting familiar with a new framework. If you need the dataset or have any questions, feel free to contact me.
That's it for today's practice!