Predicting PM2.5 with Machine Learning: A Multiple Linear Regression Model Implemented in NumPy
2022-05-02 16:45:25
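Description
- The samples for this assignment are observation data downloaded from the air quality monitoring network of the Environmental Protection Administration, Executive Yuan.
- The goal is to use a linear regression model to predict PM2.5 values over a period of time.
- The assignment uses the records of the Fengyuan observation station, split into a train set and a test set.
- The train set contains all observations from the first 20 days of each month at the Fengyuan station. The test set is sampled from the station's remaining observations.
- train.csv: the complete data for the first 20 days of each month.
- test.csv: sampled from the remaining observations. Every 10 consecutive hours form one record: all observations from the first 9 hours are the features, and the PM2.5 value of the 10th hour is the answer. There are 240 distinct test records in total; predict the PM2.5 value for each of them from its features.
- The data contains 18 observed items: AMB_TEMP, CH4, CO, NMHC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR.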
Source Code
All the source code is on my GitHub: https://github.com/sunlanchang/blog/blob/master/LinearRegressionModel.ipynb. Please contact me if you have any questions; I am happy to discuss any issue with this problem.
Preprocess train data
import pandas as pd
import numpy as np
from tqdm import tqdm
Before you start, place train.csv and test.csv in the current directory.
df_train = pd.read_csv('train.csv',encoding = 'Big5')
df_train.describe()
Concatenate each day's data into a single DataFrame
df_train_cat = pd.DataFrame()
for time in df_train['日期'].unique():
    # rows belonging to one day, hour columns '0' through '23'
    tmp = df_train.loc[df_train['日期'] == str(time), '0':'23']
    tmp_col_name = list(tmp.columns)
    tmp.columns = [time + '_' + col_name for col_name in tmp_col_name]
    tmp.reset_index(drop=True, inplace=True)  # ensure the index matches when concatenating
    df_train_cat = pd.concat([df_train_cat, tmp], axis=1)
df_train_cat.drop([10], inplace=True)  # remove row 10 (RAINFALL), which holds non-numeric 'NR' entries
df_train_cat = df_train_cat.astype('float')
print(df_train_cat.shape)
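The per-day concatenation above can also be sketched with a plain NumPy reshape. This is only an illustration on made-up numbers, and it assumes the rows are sorted by day with the same item order inside every day; `days_to_wide` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np

def days_to_wide(raw, n_days, n_items):
    """Turn (n_days * n_items, n_hours) rows into (n_items, n_days * n_hours)."""
    n_hours = raw.shape[1]
    by_day = raw.reshape(n_days, n_items, n_hours)  # axes: [day, item, hour]
    by_item = by_day.transpose(1, 0, 2)             # axes: [item, day, hour]
    return by_item.reshape(n_items, n_days * n_hours)

# Toy input: 2 days, 3 items, 4 hours per day (made-up values).
raw = np.arange(24, dtype=float).reshape(6, 4)
wide = days_to_wide(raw, n_days=2, n_items=3)
print(wide.shape)
print(wide[0])
```

Each row of `wide` holds one item across all days, which matches the layout that the loop builds column block by column block.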
Each training example is built from a 10-hour window of df_train_cat
The training data has shape (576, 153): 576 examples and 153 features
label = []
columns = list(df_train_cat.columns)
flag = True
for start in range(0, df_train_cat.shape[1], 10):
    # features: every remaining item over the first 9 hours of the window
    train_data_2d = df_train_cat.loc[:, columns[start]:columns[start+8]]
    # label: PM2.5 (row 9) in the 10th hour of the window
    label.append(df_train_cat.loc[9, columns[start+9]])
    if flag:
        train_data_example_1d = train_data_2d.values.reshape(1,-1)
        flag = False
    else:
        train_data_example_1d = np.vstack((train_data_example_1d, train_data_2d.values.reshape(1,-1)))
label = np.array(label).reshape(-1,1)
train_data_all = train_data_example_1d
print(train_data_all.shape)
print(label.shape)
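To make the windowing concrete, here is a minimal sketch on toy data. It assumes non-overlapping 10-hour windows, with the first 9 hours of every item as features and the target item's 10th hour as the label; `make_windows` and the toy array are hypothetical and only illustrate the idea.

```python
import numpy as np

def make_windows(data, target_row, window=10):
    """Cut a (n_items, n_hours) array into non-overlapping windows.

    The first window - 1 hours of every item become the features;
    the target item's value in the last hour becomes the label.
    """
    n_items, n_hours = data.shape
    features, labels = [], []
    for w in range(n_hours // window):
        block = data[:, w * window:(w + 1) * window]
        features.append(block[:, :window - 1].reshape(-1))
        labels.append(block[target_row, window - 1])
    return np.array(features), np.array(labels)

# Toy data: 2 hypothetical items observed for 20 hours.
toy = np.arange(40, dtype=float).reshape(2, 20)
X_toy, y_toy = make_windows(toy, target_row=1)
print(X_toy.shape, y_toy)
```

With 2 items and 9 feature hours, each toy example has 18 features; with the 17 numeric items kept in this post, the same layout gives 153.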
Split all labeled data into a training set and a validation set
val_proportion = 0.2 #the proportion of validation data
mid = int(val_proportion * train_data_all.shape[0])
indices = np.random.permutation(train_data_all.shape[0])
val_idx, train_idx = indices[:mid], indices[mid:]
X_train, y_train = train_data_all[train_idx,:], label[train_idx,:]
X_val, y_val = train_data_all[val_idx, :], label[val_idx, :]
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
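Because the split is random, the reported losses change from run to run. A possible way to make the split reproducible is to draw the permutation from a seeded generator; the sketch below assumes 576 examples, and `split_indices` and the seed value are illustrative choices rather than part of the original notebook.

```python
import numpy as np

def split_indices(n_examples, val_proportion=0.2, seed=0):
    """Return (train_idx, val_idx) from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_examples)
    mid = int(val_proportion * n_examples)
    return indices[mid:], indices[:mid]

train_idx_demo, val_idx_demo = split_indices(576)
print(len(train_idx_demo), len(val_idx_demo))
```

Calling the function again with the same seed returns the same indices, so training and validation losses can be compared across runs.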
Preprocess test data
df_test = pd.read_csv('test.csv', names=[num for num in range(11)], encoding = 'Big5')
df_test.head()
df_test.describe()
start_idx = 0
df_test_cat = pd.DataFrame()
for id_ in df_test[0].unique():
    # columns 2..10 hold the 9 hourly values of one test record
    tmp = df_test.loc[df_test[0] == str(id_), 2:10]
    tmp.columns = [num for num in range(start_idx, start_idx + 9)]
    start_idx += 9
    tmp.reset_index(drop=True, inplace=True)  # ensure the index matches when concatenating
    df_test_cat = pd.concat([df_test_cat, tmp], axis=1)
df_test_cat.drop([10], inplace=True)  # remove row 10 (RAINFALL), which holds non-numeric 'NR' entries
df_test_cat = df_test_cat.astype('float')
print(df_test_cat.shape)
columns = list(df_test_cat.columns)
flag = True
for start in range(0, df_test_cat.shape[1], 9):
    test_data_2d = df_test_cat.loc[:, columns[start]:columns[start+8]]
    if flag:
        test_data_example_1d = test_data_2d.values.reshape(1,-1)
        flag = False
    else:
        test_data_example_1d = np.vstack((test_data_example_1d, test_data_2d.values.reshape(1,-1)))
X_test = test_data_example_1d
X_test.shape
Implement Linear Regression
Standardize train and val data
def standardization(data):
    mu = np.mean(data, axis=0)      # per-feature mean
    sigma = np.std(data, axis=0)    # per-feature standard deviation
    return (data - mu) / sigma
X_train = standardization(X_train)
X_val = standardization(X_val)
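The code above standardizes each split with its own mean and standard deviation. A common alternative, sketched here under that assumption, is to compute the statistics on the training split only and reuse them for the validation and test splits, so that all three are scaled identically. `fit_standardizer` and `apply_standardizer` are hypothetical helpers, and the guard for constant features is an added assumption.

```python
import numpy as np

def fit_standardizer(data, eps=1e-12):
    """Return per-feature mean and standard deviation of the given data."""
    mu = np.mean(data, axis=0)
    sigma = np.std(data, axis=0)
    sigma = np.where(sigma < eps, 1.0, sigma)  # avoid dividing by zero for constant features
    return mu, sigma

def apply_standardizer(data, mu, sigma):
    """Scale data with statistics computed elsewhere (e.g. on the training split)."""
    return (data - mu) / sigma

# Toy splits (made-up values); the second feature is constant in the training split.
train_demo = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
val_demo = np.array([[3.0, 12.0]])

mu_demo, sigma_demo = fit_standardizer(train_demo)
train_scaled = apply_standardizer(train_demo, mu_demo, sigma_demo)
val_scaled = apply_standardizer(val_demo, mu_demo, sigma_demo)
print(train_scaled)
print(val_scaled)
```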
Train Linear Regression model
Note: W is an n×1 matrix, while b is a scalar. When I made b an m×1 vector, I ran into many errors caused by its shape. To compute the gradient of b, sum the residuals over all examples, as the code does. This confused me for a while.
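One way to convince yourself that the gradient of a scalar b is the summed residual divided by m is to compare the analytic gradients with finite differences. The sketch below uses random toy data and hypothetical helper names (`loss_fn`, `gradients`); it assumes the same loss as in this post, the sum of squared residuals divided by 2m.

```python
import numpy as np

def loss_fn(X, y, W, b):
    """Half mean squared error: sum of squared residuals / (2 * m)."""
    r = X.dot(W) + b - y
    return np.sum(r ** 2) / (2 * X.shape[0])

def gradients(X, y, W, b):
    """Analytic gradients of loss_fn with respect to W (n x 1) and scalar b."""
    m = X.shape[0]
    r = X.dot(W) + b - y
    return X.T.dot(r) / m, np.sum(r) / m

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(20, 4))
y_chk = rng.normal(size=(20, 1))
W_chk = rng.normal(size=(4, 1))
b_chk = 0.5

dW_chk, db_chk = gradients(X_chk, y_chk, W_chk, b_chk)

# Central finite differences as an independent check.
eps = 1e-6
db_num = (loss_fn(X_chk, y_chk, W_chk, b_chk + eps)
          - loss_fn(X_chk, y_chk, W_chk, b_chk - eps)) / (2 * eps)
dW_num = np.zeros_like(W_chk)
for j in range(W_chk.shape[0]):
    step = np.zeros_like(W_chk)
    step[j] = eps
    dW_num[j] = (loss_fn(X_chk, y_chk, W_chk + step, b_chk)
                 - loss_fn(X_chk, y_chk, W_chk - step, b_chk)) / (2 * eps)

print(np.max(np.abs(dW_chk - dW_num)), abs(db_chk - db_num))
```

Both differences should be tiny, since the loss is quadratic and central differences are exact for it up to rounding error.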
(m, n) = X_train.shape  # m examples, n features
W = np.random.rand(n, 1)
b = 0.0  # scalar bias
epoch = 200000
lr = 0.0001
m_val = X_val.shape[0]
for ep in range(epoch):
    y_hat = X_train.dot(W) + b
    tmp_train = y_hat - y_train  # residuals, shape (m, 1)
    loss = tmp_train.T.dot(tmp_train) / (2 * m)
    dW = np.dot(X_train.T, tmp_train) / m
    # gradient of b: sum the residuals over all examples
    db = np.sum(tmp_train) / m
    # equivalent: np.dot(np.ones(shape=[1, m]), tmp_train) / m
    W += -lr * dW
    b += -lr * db
    if ep % 10000 == 0:
        # validation loss, computed only when it is reported
        y_hat_val = X_val.dot(W) + b
        tmp_val = y_hat_val - y_val
        loss_val = tmp_val.T.dot(tmp_val) / (2 * m_val)
        print('train loss: {}, val loss: {}'.format(loss, loss_val))
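As a sanity check on the gradient descent loop, its result can be compared with the closed-form least-squares solution. The sketch below does this on synthetic data (not the PM2.5 data) with `np.linalg.lstsq`; the data, learning rate, and epoch count are illustrative assumptions chosen so that the toy problem converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
m_demo, n_demo = 200, 3
X_demo = rng.normal(size=(m_demo, n_demo))
true_W = np.array([[2.0], [-1.0], [0.5]])
y_demo = X_demo.dot(true_W) + 3.0 + 0.01 * rng.normal(size=(m_demo, 1))

# Gradient descent with the same update rule as above.
W_gd = np.zeros((n_demo, 1))
b_gd = 0.0
lr_demo = 0.1
for _ in range(2000):
    residual = X_demo.dot(W_gd) + b_gd - y_demo
    W_gd -= lr_demo * X_demo.T.dot(residual) / m_demo
    b_gd -= lr_demo * np.sum(residual) / m_demo

# Closed-form solution: append a column of ones so the bias is fitted too.
A = np.hstack([X_demo, np.ones((m_demo, 1))])
solution, _, _, _ = np.linalg.lstsq(A, y_demo, rcond=None)
W_ls, b_ls = solution[:n_demo], solution[n_demo, 0]

print(np.max(np.abs(W_gd - W_ls)), abs(b_gd - b_ls))
```

If gradient descent on the real data ends far from the least-squares fit, that usually points to too few epochs or too small a learning rate rather than to a problem with the model.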
Standardize test data
X_test = standardization(X_test)
Predict data
y_predict = np.dot(X_test, W) + b
y_predict
Generate submission csv file
data_submit = {'id': df_test[0].unique(), 'value': y_predict.reshape(1, -1)[0]}
df_submit = pd.DataFrame(data_submit)
df_submit.head()
df_submit.to_csv('submission.csv', index=False)
!head submission.csv
References
https://en.wikipedia.org/wiki/Matrix_calculus
https://blog.csdn.net/nomadlx53/article/details/50849941
https://www.jb51.net/article/146990.htm