Modeling Notes 2 --- Linear Regression via Least Squares
1. Regression Analysis
1.1. Simple Linear Regression
Simple linear regression analyzes the relationship between one independent variable and one dependent variable: from scattered sample points we estimate a linear relationship, obtain the regression coefficients by least squares, and then test the resulting regression equation.
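As a quick reminder of what least squares actually computes for one predictor, here is a minimal sketch of the closed-form estimates (the function and variable names are just illustrative):

import numpy as np

def least_squares_fit(x, y):
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # intercept = mean_y - slope * mean_x
    intercept = y.mean() - slope * x.mean()
    return slope, intercept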
P19 Example 2.1.1:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression

def linear_regression(x, y):
    plt.figure()
    plt.scatter(x, y, alpha=0.5)
    plt.title('weight(y) and temperature(x)')
    plt.xlabel('temperature')
    plt.ylabel('weight')
    # fit the model
    lrModel = LinearRegression()
    lrModel.fit(x, y)
    # predict y at the observed x and draw the fitted line
    y0 = lrModel.predict(x)
    plt.plot(x, y0)
    plt.show()
    alpha = lrModel.coef_       # slope
    beta = lrModel.intercept_   # intercept

def test():
    x = np.array([3.5, 1.8, 2.4, 3.0, 3.5, 3.9, 4.4, 4.8, 5.0])
    y = np.array([8.5, 2.57, 3.0, 5.3, 8.9, 11.69, 13.1, 13.6, 13.3])
    linear_regression(x.reshape([len(x), 1]), y.reshape([len(y), 1]))

test()
This uses linear_model from the sklearn library; note that the fit method requires two-dimensional arrays as arguments, hence the reshape calls above.
Result: (scatter plot with the fitted line)
1.2. Testing the Model
After obtaining the model we need to test it: how well does the model fit the data (R²)? Does the model as a whole significantly predict changes in the dependent variable (F-test)? Does each individual independent variable significantly predict the dependent variable (t-test)?
1.2.1 R² --- Goodness of Fit
Goodness of fit is judged by how much of the dependent variable's variation about its mean the model explains: compute the ratio of SSA, the sum of squared differences between the predicted values and the mean, to SST, the sum of squared differences between the actual values and the mean. SSA measures the variation the model explains; SST measures the total variation of the actual values about the mean. The ratio R² = SSA/SST therefore tells us how well the model fits.
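A minimal sketch of this computation (y_pred denotes the model's fitted values; the names are illustrative):

import numpy as np

def r_squared(y, y_pred):
    ssa = np.sum((y_pred - y.mean()) ** 2)  # explained variation (SSA)
    sst = np.sum((y - y.mean()) ** 2)       # total variation (SST)
    return ssa / sst                        # R^2 = SSA / SST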
1.2.2 F-test --- Can the Model Significantly Predict y?
Compute the F statistic from the ratio of SSA to SSE, each divided by its degrees of freedom; if the p-value of this statistic is below 5%, the linear relationship is significant.
1.2.3 t-test --- Are the Regression Coefficients Significant?
A coefficient is significant if its p-value (Pr) is below 0.05.
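Both tests can be checked by hand with scipy.stats; a minimal sketch for one predictor, where the residual degrees of freedom are n - 2 (all names are illustrative; the statsmodels code below reports these automatically):

import numpy as np
from scipy import stats

def f_and_t_tests(y, y_pred, beta_hat, se_beta):
    n = len(y)
    # F-test: explained variance over residual variance
    ssa = np.sum((y_pred - y.mean()) ** 2)   # explained, df = 1
    sse = np.sum((y - y_pred) ** 2)          # residual,  df = n - 2
    f_stat = ssa / (sse / (n - 2))
    f_pvalue = stats.f.sf(f_stat, 1, n - 2)
    # t-test: coefficient divided by its standard error
    t_stat = beta_hat / se_beta
    t_pvalue = 2 * stats.t.sf(abs(t_stat), n - 2)
    return f_pvalue, t_pvalue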
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
x = np.array([3.5, 1.8, 2.4, 3.0, 3.5, 3.9, 4.4, 4.8, 5.0])
y = np.array([8.5, 2.57, 3.0, 5.3, 8.9, 11.69, 13.1, 13.6, 13.3])
data = pd.DataFrame({'x': x, 'y': y})
model = ols('y~x', data).fit()
print(model.summary())
With statsmodels the regression and the tests can be done more simply; the results are as follows:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.945
Model: OLS Adj. R-squared: 0.937
Method: Least Squares F-statistic: 120.0
Date: Tue, 21 Jan 2020 Prob (F-statistic): 1.17e-05
Time: 23:19:13 Log-Likelihood: -12.533
No. Observations: 9 AIC: 29.07
Df Residuals: 7 BIC: 29.46
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -5.4375 1.358 -4.003 0.005 -8.649 -2.226
x 3.9906 0.364 10.954 0.000 3.129 4.852
==============================================================================
Omnibus: 1.239 Durbin-Watson: 1.329
Prob(Omnibus): 0.538 Jarque-Bera (JB): 0.658
Skew: 0.026 Prob(JB): 0.720
Kurtosis: 1.677 Cond. No. 14.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the output, R² = 0.945, so the model fits well; the F value is 120 with a p-value far below 0.05, so the model passes the F-test and there is a significant linear relationship between the independent and dependent variables; both coefficients pass the t-test, with Pr well below 0.05.
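These statistics are also available directly as attributes of the fitted results object, which is handier than reading the printed summary:

print(model.rsquared)                # R^2
print(model.fvalue, model.f_pvalue)  # F statistic and its p-value
print(model.params)                  # intercept and slope
print(model.pvalues)                 # per-coefficient t-test p-values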
1.3. Simple Nonlinear Regression
In many practical problems the relationship between the independent and dependent variables is not linear, so linear regression cannot be applied directly; however, a suitable mathematical transformation can reduce the problem to a linear one. We process the following data set this way.
P22 Example 2.1.2:
Plot the data to inspect it first:
from matplotlib import pyplot as plt

x = [1, 2, 3, 4, 4, 6, 6, 8, 8, 9]
y = [1.85, 1.37, 1.02, 0.75, 0.56, 0.41, 0.31, 0.23, 0.17, 0.10]
plt.figure()
plt.scatter(x, y, alpha=0.5)
plt.show()
From the plot we guess that the relationship may take one of the following forms, each of which a change of variables reduces to a linear regression: y = ax + b, y = a + b/x, y = a*x^b, or y = a*e^(b*x). The OLS results for the four transformed fits:
i. y = ax + b:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.874
Model: OLS Adj. R-squared: 0.858
Method: Least Squares F-statistic: 55.35
Date: Wed, 22 Jan 2020 Prob (F-statistic): 7.34e-05
Time: 02:03:50 Log-Likelihood: 2.1956
No. Observations: 10 AIC: -0.3912
Df Residuals: 8 BIC: 0.2139
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.6846 0.152 11.093 0.000 1.334 2.035
x -0.1976 0.027 -7.440 0.000 -0.259 -0.136
==============================================================================
Omnibus: 0.057 Durbin-Watson: 0.884
Prob(Omnibus): 0.972 Jarque-Bera (JB): 0.209
Skew: 0.132 Prob(JB): 0.901
Kurtosis: 2.343 Cond. No. 13.0
==============================================================================
ii. y = a + b/x (regress y on 1/x):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.890
Model: OLS Adj. R-squared: 0.877
Method: Least Squares F-statistic: 65.02
Date: Wed, 22 Jan 2020 Prob (F-statistic): 4.13e-05
Time: 02:03:50 Log-Likelihood: 2.9057
No. Observations: 10 AIC: -1.811
Df Residuals: 8 BIC: -1.206
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0729 0.099 0.740 0.480 -0.154 0.300
x 1.9952 0.247 8.063 0.000 1.425 2.566
==============================================================================
Omnibus: 1.626 Durbin-Watson: 1.010
Prob(Omnibus): 0.443 Jarque-Bera (JB): 1.044
Skew: 0.541 Prob(JB): 0.593
Kurtosis: 1.844 Cond. No. 4.25
==============================================================================
iii. y = a*x^b (regress ln y on ln x):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.875
Model: OLS Adj. R-squared: 0.859
Method: Least Squares F-statistic: 55.94
Date: Wed, 22 Jan 2020 Prob (F-statistic): 7.07e-05
Time: 02:03:50 Log-Likelihood: -2.6834
No. Observations: 10 AIC: 9.367
Df Residuals: 8 BIC: 9.972
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.0790 0.269 4.008 0.004 0.458 1.700
x -1.2627 0.169 -7.479 0.000 -1.652 -0.873
==============================================================================
Omnibus: 1.445 Durbin-Watson: 0.825
Prob(Omnibus): 0.486 Jarque-Bera (JB): 1.007
Skew: -0.680 Prob(JB): 0.605
Kurtosis: 2.249 Cond. No. 5.15
==============================================================================
iv. y = a*e^(b*x) (regress ln y on x):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.973
Model: OLS Adj. R-squared: 0.969
Method: Least Squares F-statistic: 286.9
Date: Wed, 22 Jan 2020 Prob (F-statistic): 1.50e-07
Time: 02:03:50 Log-Likelihood: 4.9603
No. Observations: 10 AIC: -5.921
Df Residuals: 8 BIC: -5.315
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.9874 0.115 8.573 0.000 0.722 1.253
x -0.3411 0.020 -16.938 0.000 -0.388 -0.295
==============================================================================
Omnibus: 0.110 Durbin-Watson: 2.673
Prob(Omnibus): 0.947 Jarque-Bera (JB): 0.293
Skew: 0.169 Prob(JB): 0.864
Kurtosis: 2.232 Cond. No. 13.0
==============================================================================
Code:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.formula.api import ols

def multiple_regression(x, y):
    # prepare the transformed data sets
    linear_data = pd.DataFrame({'x': x, 'y': y})
    inverse_data = pd.DataFrame({'x': 1/x, 'y': y})
    power_data = pd.DataFrame({'x': np.log(x), 'y': np.log(y)})
    exponential_data = pd.DataFrame({'x': x, 'y': np.log(y)})
    # y = ax + b
    Lmodel = ols('y~x', linear_data).fit()
    print(Lmodel.summary())
    plt.figure()
    plt.title('y = ax + b')
    Lx = np.arange(0.1, 10, 0.1)
    Ly = Lmodel.params['x']*Lx + Lmodel.params['Intercept']
    plt.plot(Lx, Ly)
    plt.scatter(x, y)
    plt.show()
    # y = a + b/x
    Imodel = ols('y~x', inverse_data).fit()
    print(Imodel.summary())
    Ix = np.arange(0.1, 10, 0.1)   # start at 0.1 to avoid dividing by zero
    Iy = Imodel.params['x']/Ix + Imodel.params['Intercept']
    plt.figure()
    plt.title('y = a + b/x')
    plt.plot(Ix, Iy)
    plt.scatter(x, y)
    plt.show()
    # y = a*x^b
    Pmodel = ols('y~x', power_data).fit()
    print(Pmodel.summary())
    Px = np.arange(0.1, 10, 0.1)
    Py = np.exp(Pmodel.params['Intercept'])*np.power(Px, Pmodel.params['x'])
    plt.figure()
    plt.title('y = a*x^b')
    plt.plot(Px, Py)
    plt.scatter(x, y)
    plt.show()
    # y = a*e^(b*x)
    Emodel = ols('y~x', exponential_data).fit()
    print(Emodel.summary())
    Ex = np.arange(0.1, 10, 0.1)
    Ey = np.exp(Emodel.params['Intercept'] + Emodel.params['x']*Ex)
    plt.figure()
    plt.title('y = a*e^(b*x)')
    plt.plot(Ex, Ey)
    plt.scatter(x, y)
    plt.show()

def test():
    x = np.array([1, 2, 3, 4, 4, 6, 6, 8, 8, 9])
    y = np.array([1.85, 1.37, 1.02, 0.75, 0.56, 0.41, 0.31, 0.23, 0.17, 0.10])
    multiple_regression(x, y)

test()
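Comparing the four summaries above, the exponential model y = a*e^(b*x) gives the highest R² (0.973) and the smallest F-test p-value, so it describes this data best.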
Summary:
Workflow for writing a linear regression:
- If the problem is nonlinear, first transform it into a linear one
- Import the tools you need
import numpy as np
import statsmodels.api as sm
- Load the data
x = np.array([[a1, a2,..., an],...,[b1, b2,...,bn]]).T  # transpose so each row is one observation
y = np.array([y1, y2,..., yn])
- Add a constant term and fit (a complete runnable sketch follows below)
x = sm.add_constant(x)
res = sm.OLS(y, x).fit()
print(res.summary())
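A minimal end-to-end sketch of this workflow, using the data from Example 2.1.1 (a single predictor, reshaped into one column):

import numpy as np
import statsmodels.api as sm

x = np.array([3.5, 1.8, 2.4, 3.0, 3.5, 3.9, 4.4, 4.8, 5.0]).reshape(-1, 1)
y = np.array([8.5, 2.57, 3.0, 5.3, 8.9, 11.69, 13.1, 13.6, 13.3])

X = sm.add_constant(x)    # prepend a column of ones for the intercept
res = sm.OLS(y, X).fit()  # estimate the coefficients by least squares
print(res.summary())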