Predicting PM2.5 with Machine Learning: A Multiple Linear Regression Model Implemented in NumPy

Programmer Article Station (程序员文章站), 2022-05-02 16:45:25

Description

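  • The samples for this assignment are observation data downloaded from the air quality monitoring network of the Environmental Protection Administration, Executive Yuan.
  • The goal is to use a linear regression model to predict PM2.5 values over a period of time.
  1. This assignment uses the observation records of the Fengyuan monitoring station, split into a train set and a test set.
  2. The train set contains all observations from the first 20 days of each month at the Fengyuan station. The test set is sampled from the remaining observations at the same station.
    • train.csv: the complete data for the first 20 days of each month.
    • test.csv: sampled from the remaining observations. Every 10 consecutive hours form one record: all observations from the first 9 hours are the features, and the PM2.5 value of the 10th hour is the answer. There are 240 distinct test records in total; predict the PM2.5 value for each of them from its features.
  3. The data contains 18 observed quantities: AMB_TEMP, CH4, CO, NMHC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR.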
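
To make the record layout concrete, here is a minimal sketch of how one 10-hour record splits into features and an answer. The array `window` is synthetic (random numbers, not real readings), and `PM25_ROW = 9` is my assumption that the rows follow the order of the list above.

```python
import numpy as np

# Hypothetical stand-in for one record: 18 observed quantities x 10 hours.
rng = np.random.default_rng(0)
window = rng.random((18, 10))

PM25_ROW = 9  # assumed 0-based position of PM2.5 in the list above

# The first 9 hours of every quantity, flattened row by row, are the features.
features = window[:, :9].reshape(-1)
# The PM2.5 reading of the 10th hour is the value to predict.
answer = window[PM25_ROW, 9]

print(features.shape)
```

The code later in this post drops the RAINFALL row before flattening, which leaves 17 × 9 = 153 features per example instead of 18 × 9 = 162.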

Source Code

All the source code is available on my GitHub: https://github.com/sunlanchang/blog/blob/master/LinearRegressionModel.ipynb. Please contact me if you have any questions; I am happy to discuss any issue related to this problem.

Preprocess train data

import pandas as pd
import numpy as np
from tqdm import tqdm

Before you start, move train.csv and test.csv into the current directory.

df_train = pd.read_csv('train.csv',encoding = 'Big5')
df_train.describe()

Concatenate each day's data into a single DataFrame

df_train_cat = pd.DataFrame()
for time in df_train['日期'].unique():
    tmp = df_train.loc[df_train['日期'] == str(time), '0':'23']
    tmp_col_name = list(tmp.columns)
    tmp.columns = [time+'_'+col_name for col_name in tmp_col_name]
    tmp.reset_index(drop=True, inplace=True)  # make the row index identical across days before concatenating
    df_train_cat = pd.concat([df_train_cat, tmp], axis=1)
df_train_cat.drop([10], inplace=True)  # remove the RAINFALL row (index 10), which contains 'NR' strings
df_train_cat = df_train_cat.astype('float')
print(df_train_cat.shape)
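
Calling pd.concat inside the loop copies the growing DataFrame on every iteration. A possible alternative is to collect the per-day blocks in a list and concatenate once. The sketch below runs on a tiny synthetic frame; the dates, the `item` column, and the three measurement names are made up for illustration, and only the '日期' column and the '0' to '23' hour columns mirror the real file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv: 2 days x 3 measurements x 24 hourly columns.
hours = [str(h) for h in range(24)]
rows = []
for day in ['2014/1/1', '2014/1/2']:
    for item in ['CO', 'PM2.5', 'SO2']:
        rows.append([day, item] + list(np.arange(24.0)))
df = pd.DataFrame(rows, columns=['日期', 'item'] + hours)

# Collect one block per day, then concatenate a single time at the end.
blocks = []
for day, group in df.groupby('日期', sort=False):
    block = group.loc[:, '0':'23'].reset_index(drop=True)
    block.columns = [day + '_' + h for h in hours]
    blocks.append(block)
df_cat = pd.concat(blocks, axis=1)

print(df_cat.shape)
```

The result has one row per measurement and 24 columns per day, the same layout as df_train_cat above.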

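Each training example is built from a 10-hour window of df_train_cat: the first 9 hours are the features, and the PM2.5 value of the 10th hour is the label.

The training data has shape (576, 153), which means 576 examples with 153 features each (17 measurements × 9 hours).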

label = []
columns = list(df_train_cat.columns)
flag = True
for start in range(0, df_train_cat.shape[1], 10):
    train_data_2d = df_train_cat.loc[:, columns[start]:columns[start+8]]
    label.append(df_train_cat.loc[9, columns[start+9]])  # PM2.5 (row 9) at the 10th hour; the 9th hour is already a feature
    if flag:
        train_data_example_1d = train_data_2d.values.reshape(1,-1)
        flag = False
    else:
        train_data_example_1d = np.vstack((train_data_example_1d, train_data_2d.values.reshape(1,-1)))
label = np.array(label).reshape(-1,1)
train_data_all = train_data_example_1d
print(train_data_all.shape)
print(label.shape)
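
The same non-overlapping windows can also be cut without a Python loop. This is a sketch on a synthetic array rather than the notebook's code: `data` stands in for df_train_cat.values, and `PM25_ROW = 9` assumes PM2.5 is still the 10th row after the RAINFALL row is dropped.

```python
import numpy as np

n_items, n_windows = 17, 4  # toy sizes; the real data has 576 windows
data = np.arange(n_items * n_windows * 10, dtype=float).reshape(n_items, n_windows * 10)
PM25_ROW = 9  # assumed row position of PM2.5

# (items, windows * 10) -> (items, windows, 10) -> (windows, items, 10)
windows = data.reshape(n_items, n_windows, 10).transpose(1, 0, 2)

# Features: the first 9 hours of every item, flattened per window.
X = windows[:, :, :9].reshape(n_windows, -1)
# Labels: PM2.5 at the 10th hour of every window.
y = windows[:, PM25_ROW, 9].reshape(-1, 1)

print(X.shape, y.shape)
```

Each row of X is flattened item by item, the same order that reshape(1, -1) produces in the loop above.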

Split the labeled data into a training set and a validation set

val_proportion = 0.2 #the proportion of validation data
mid = int(val_proportion * train_data_all.shape[0])
indices = np.random.permutation(train_data_all.shape[0])
val_idx, train_idx  = indices[:mid], indices[mid:]
X_train, y_train = train_data_all[train_idx,:], label[train_idx,:]
X_val, y_val = train_data_all[val_idx, :], label[val_idx, :]
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
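
The split above changes on every run because the permutation is not seeded. If you want a reproducible split, one option is NumPy's Generator API with a fixed seed. The sketch below uses toy arrays, and the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, fixed for reproducibility

# Toy data: 10 examples with 2 features each.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float).reshape(-1, 1)

val_proportion = 0.2
mid = int(val_proportion * X.shape[0])
indices = rng.permutation(X.shape[0])
val_idx, train_idx = indices[:mid], indices[mid:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
print(X_train.shape, X_val.shape)
```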

Preprocess the test data

df_test = pd.read_csv('test.csv', names=[num for num in range(11)], encoding = 'Big5')
df_test.head()

df_test.describe()

start_idx = 0
df_test_cat = pd.DataFrame()
for id_ in df_test[0].unique():
    tmp = df_test.loc[df_test[0] == str(id_), 2:10]
    tmp.columns = list(range(start_idx, start_idx + 9))
    start_idx += 9
    tmp.reset_index(drop=True, inplace=True)  # make the row index identical across records before concatenating
    df_test_cat = pd.concat([df_test_cat, tmp], axis=1)
df_test_cat.drop([10], inplace=True) #remove NR row
df_test_cat = df_test_cat.astype('float')
print(df_test_cat.shape)
columns = list(df_test_cat.columns)
flag = True
for start in range(0, df_test_cat.shape[1], 9):
    test_data_2d = df_test_cat.loc[:, columns[start]:columns[start+8]]
    if flag:
        test_data_example_1d = test_data_2d.values.reshape(1,-1)
        flag = False
    else:
        test_data_example_1d = np.vstack((test_data_example_1d, test_data_2d.values.reshape(1,-1)))
X_test = test_data_example_1d
X_test.shape

Implement Linear Regression

Standardize the training and validation data

def standardization(data):
    mu = np.mean(data, axis=0)
    sigma = np.std(data, axis=0)
    return (data - mu) / sigma
X_train = standardization(X_train)
X_val = standardization(X_val)
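
The code above standardizes each set with its own mean and standard deviation. A common alternative, which is not what this notebook does, is to compute the statistics on the training set only and reuse them for validation and test data, so that all sets are on the same scale. Below is a minimal sketch with synthetic data; `fit_standardizer`, `apply_standardizer`, and the `eps` guard are hypothetical helpers of mine.

```python
import numpy as np

def fit_standardizer(data, eps=1e-8):
    """Return per-feature mean and std; eps guards against zero variance."""
    mu = np.mean(data, axis=0)
    sigma = np.std(data, axis=0) + eps
    return mu, sigma

def apply_standardizer(data, mu, sigma):
    """Scale data with statistics that were computed elsewhere."""
    return (data - mu) / sigma

# Synthetic stand-ins for the raw training and validation features.
rng = np.random.default_rng(0)
X_train_raw = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_val_raw = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

mu, sigma = fit_standardizer(X_train_raw)
X_train_std = apply_standardizer(X_train_raw, mu, sigma)
X_val_std = apply_standardizer(X_val_raw, mu, sigma)  # training statistics reused
```

With this variant the validation columns will not have exactly zero mean and unit variance, which is expected: they are measured against the training distribution.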

Train Linear Regression model

Note: W is an n×1 matrix (see the code below), while b is a scalar. When I made b an m×1 vector, I ran into many errors caused by its shape. To compute the gradient of b, you need to sum the residual vector over all examples, as the code does. This confused me a lot at first.
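
For reference, here is how I would write down the gradients that the training code computes. The notation is mine: X is the m×n feature matrix, y the m×1 label vector, and 1 the m×1 vector of ones.

```latex
\hat{y} = XW + b\,\mathbf{1}, \qquad
L(W, b) = \frac{1}{2m}\,(\hat{y} - y)^{\top}(\hat{y} - y)

\frac{\partial L}{\partial W} = \frac{1}{m}\,X^{\top}(\hat{y} - y), \qquad
\frac{\partial L}{\partial b} = \frac{1}{m}\,\mathbf{1}^{\top}(\hat{y} - y)
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(\hat{y}_i - y_i\bigr)
```

The last equality is why multiplying by a row of ones and calling np.sum give the same value for the gradient of b.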

(m, n) = X_train.shape  # m examples, n features
W = np.random.rand(n, 1)
b = 0.0  # scalar bias, broadcast over all m examples

epoch = 200000
lr = 0.0001
m_val = X_val.shape[0]
for ep in range(epoch):
    # forward pass and residual on the training set
    residual = X_train.dot(W) + b - y_train
    # gradients of the loss (1 / 2m) * sum(residual ** 2)
    dW = X_train.T.dot(residual) / m
    db = np.sum(residual) / m  # same value as np.ones((1, m)).dot(residual) / m
    W -= lr * dW
    b -= lr * db

    # compute the losses only when they are printed
    if ep % 10000 == 0:
        loss = np.sum(residual ** 2) / (2 * m)
        residual_val = X_val.dot(W) + b - y_val
        loss_val = np.sum(residual_val ** 2) / (2 * m_val)
        print('train loss: {}, val loss: {}'.format(loss, loss_val))
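
One way to sanity-check a hand-written gradient descent loop is to compare it with the closed-form least-squares solution on a small synthetic problem. The sketch below is not part of the original notebook: the data, the learning rate 0.1, and the 2000 epochs are choices I made so that it converges quickly.

```python
import numpy as np

# Synthetic regression problem with known weights and a bias of 4.0.
rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
true_W = np.array([[1.5], [-2.0], [0.5]])
y = X.dot(true_W) + 4.0 + 0.01 * rng.normal(size=(m, 1))

# Gradient descent with the same update rule as the training loop above.
W = np.zeros((n, 1))
b = 0.0
lr = 0.1
for _ in range(2000):
    residual = X.dot(W) + b - y
    W -= lr * X.T.dot(residual) / m
    b -= lr * np.sum(residual) / m

# Closed form: append a column of ones so the last coefficient plays the role of b.
X_aug = np.hstack([X, np.ones((m, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(np.max(np.abs(W - theta[:n])), abs(b - theta[n, 0]))
```

If both printed differences are close to zero, the gradient formulas and the update step agree with the least-squares optimum.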

Standardize test data

X_test = standardization(X_test)

Predict data

y_predict = np.dot(X_test, W) + b
y_predict

Generate submission csv file

data_submit = {'id': df_test[0].unique(), 'value': y_predict.ravel()}
df_submit = pd.DataFrame(data_submit)
df_submit.head()

df_submit.to_csv('submission.csv', index=False)
!head submission.csv
