
MXNet Framework Study Notes (II): A Hands-On Analysis of Kaggle House Price Prediction

程序员文章站 2022-06-26 20:01:19

      Kaggle is an excellent platform for data analysis and data mining competitions, and house price prediction is one of its contests. I have previously done plenty of analysis on this task with Keras and classical machine-learning methods; since I have recently been working with the MXNet framework, this post tackles house price prediction hands-on with MXNet.

     After registering an account, you can download the required dataset from the Kaggle platform. Opening the training set, the data columns are as follows:

Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice

      After dropping the first column, Id, 79 feature columns remain (SalePrice is the label), all of which can be used as features.
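Before looking at the full script, the preprocessing it applies below, standardizing numeric columns and one-hot encoding categoricals with dummy_na=True, can be previewed on a tiny made-up frame (the column names echo the Kaggle data but the values are invented):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Kaggle data
df = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, 11250.0, None],
    'MSZoning': ['RL', 'RM', None, 'RL'],
})

# Standardize numeric columns, then fill the (now zero-mean) NaNs with 0
numeric = df.dtypes[df.dtypes != 'object'].index
df[numeric] = df[numeric].apply(lambda x: (x - x.mean()) / x.std())
df[numeric] = df[numeric].fillna(0)

# One-hot encode categoricals; dummy_na=True adds an indicator column
# for missing values (here: MSZoning_nan)
df = pd.get_dummies(df, dummy_na=True)
print(sorted(df.columns))
```

Note how the missing MSZoning entry ends up as a 1 in its own indicator column rather than being dropped, which is exactly what dummy_na=True buys us in the real script.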

       The complete implementation is shown below:

#!/usr/bin/env python
# encoding:utf-8


"""
__Author__:沂水寒城
功能: MXNET 模块学习实践(二): 房价预测
"""


import numpy as np
import pandas as pd
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import data as gdata, loss as gloss, nn





train_data = pd.read_csv('house_data/kaggle_house_pred_train.csv')
test_data = pd.read_csv('house_data/kaggle_house_pred_test.csv')
print('train_shape: ', train_data.shape)  #(1460, 81)
print('test_shape: ', test_data.shape)  #(1459, 80)
print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])  #data preview
#Concatenate the 79 feature columns of all training and test samples, dropping the leading Id column
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

#Standardize the continuous numeric features
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))
# After standardization every feature has zero mean, so missing values can simply be replaced with 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)

#Convert discrete values into indicator (one-hot) features
# dummy_na=True treats missing values as valid feature values and creates indicator features for them
all_features = pd.get_dummies(all_features, dummy_na=True)
print('feature_shape: ', all_features.shape) 


# Convert the NumPy-format data into NDArray format
n_train = train_data.shape[0]
train_features = nd.array(all_features[:n_train].values)
test_features = nd.array(all_features[n_train:].values)
train_labels = nd.array(train_data.SalePrice.values).reshape((-1, 1))


#Train the model: a basic linear regression model with the squared loss
loss = gloss.L2Loss()
def get_net(): 
    '''
    Build the network
    '''
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize()
    return net


def log_rmse(net, features, labels):
    '''
    Compute the log root-mean-square error
    '''
    # Clip values below 1 to 1 so that taking the log is numerically more stable
    clipped_preds = nd.clip(net(features), 1, float('inf'))
    # The factor 2 compensates for L2Loss computing half the squared error
    rmse = nd.sqrt(2 * loss(clipped_preds.log(), labels.log()).mean())
    return rmse.asscalar()


def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    '''
    Train the model
    '''
    train_ls, test_ls = [], []
    train_iter = gdata.DataLoader(gdata.ArrayDataset(
        train_features, train_labels), batch_size, shuffle=True)
    # Use the Adam optimizer
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
 

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    '''
    Train on the full training set and store the predictions
    '''
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    #d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    preds = net(test_features).asnumpy()
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)



if __name__ == '__main__':
    num_epochs, lr, weight_decay, batch_size = 100, 5, 0, 64
    train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size)
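For reference, the metric implemented in log_rmse above, the root-mean-square error between log prices with predictions clipped below 1, can be sketched in plain NumPy on made-up numbers (no MXNet needed):

```python
import numpy as np

def log_rmse_np(preds, labels):
    # Clip predictions below 1 to 1 so that taking the log stays numerically stable
    preds = np.clip(preds, 1.0, None)
    return float(np.sqrt(np.mean((np.log(preds) - np.log(labels)) ** 2)))

# Made-up predictions and labels, including one degenerate near-zero prediction
preds = np.array([120000.0, 0.5, 210000.0])
labels = np.array([119681.5, 154408.8, 198561.0])
print(log_rmse_np(preds, labels))
```

The clipping matters: without it, a single non-positive prediction would make the log (and hence the whole metric) undefined, which is why the script clips before taking logs rather than after.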

      The results are stored in submission.csv; part of the output is shown below:

Id SalePrice
1461 119681.5
1462 154408.8
1463 198561
1464 217160.5
1465 177359.5
1466 193065.5
1467 195170
1468 187124.9
1469 195645
1470 123815.8
1471 196175
1472 102948.1
1473 105485.1
1474 150235
1475 92978.13
1476 306583.7
1477 247138.8
1478 282101.5
1479 270760.2
1480 402098.3
1481 298979.9
1482 211217.1
1483 192548.7
1484 176624.5
1485 201686.7
1486 211308.1
1487 291197.3
1488 244460.3
1489 197268.9
1490 236971.6
1491 209994.9

         I only started learning the MXNet framework today, so this is entirely a matter of running demos from the textbook and official examples and then working through the details. The APIs MXNet provides feel fairly straightforward; the only thing that takes getting used to is that method calls and instantiation differ from the deep-learning frameworks I have used before, which should improve with practice. House price prediction itself needs little explanation, since most data-mining practitioners have encountered it at some point; here it simply serves as a demo for getting familiar with a new framework. Feel free to contact me if you need the dataset or have any questions.

      That's it for today's practice!