Alibaba Cloud Tianchi AI Training Camp, Machine Learning TASK1 - logistic regression

程序员文章站 2024-03-15 23:17:42


How to plot logistic regression

## Basic numerical library
import numpy as np

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

## Logistic regression model
from sklearn.linear_model import LogisticRegression

## Demo: classification with LogisticRegression

## Build a toy dataset
x_fearures = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])

## Instantiate the logistic regression model
lr_clf = LogisticRegression()

## Fit the model to the toy dataset.
## The linear part is z = w0 + w1*x1 + w2*x2, and the predicted P(y=1) is sigmoid(z).
lr_clf = lr_clf.fit(x_fearures, y_label)

## Inspect the fitted weights (w1, w2)
print('the weight of Logistic Regression:',lr_clf.coef_)

## Inspect the fitted intercept (w0)
print('the intercept(w0) of Logistic Regression:',lr_clf.intercept_)

## Visualize the toy dataset
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
plt.show()
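To make the role of the fitted weights concrete, here is a minimal sketch that recomputes the class-1 probability by hand and derives the decision boundary analytically. The names `X`, `y`, `clf`, `sigmoid`, `p_manual` and `boundary_points` are my own illustrative choices, and the check assumes the default binary behaviour of `LogisticRegression`, where `predict_proba` applies the sigmoid to the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as in the demo above
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)

w1, w2 = clf.coef_[0]      # weights for x1 and x2
w0 = clf.intercept_[0]     # intercept


def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Probability of class 1 computed by hand: sigmoid(w0 + w1*x1 + w2*x2)
p_manual = sigmoid(w0 + X @ clf.coef_[0])
p_sklearn = clf.predict_proba(X)[:, 1]
print('manual matches predict_proba:', np.allclose(p_manual, p_sklearn))

# The decision boundary is where the linear score is zero:
# w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1) / w2
x1 = np.linspace(-3, 3, 5)
x2 = -(w0 + w1 * x1) / w2
boundary_points = np.c_[x1, x2]
p_boundary = clf.predict_proba(boundary_points)[:, 1]
print('probability on the boundary:', p_boundary)
```

Every point on that line should get a class-1 probability of 0.5, which is exactly the level that the contour plot in the next part draws.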


# Visualize the decision boundary
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

nx, ny = 200, 100
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()

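# Build a grid over the plotted area, i.e. enumerate every grid point inside the axis limits
x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx),np.linspace(y_min, y_max, ny))

# np.c_[a, b] concatenates arrays column-wise.
# ravel() flattens an array. With the default indexing, x_grid.shape is (ny, nx) = (100, 200),
# but predict_proba expects shape (20000, 2): one row per grid point, with columns x1 and x2.

# z_proba has shape (20000, 2): one column per class (0 and 1), holding the predicted probability of each.
z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])

# To draw the contour, take the class-1 column and reshape it back to the grid shape (100, 200),
# so that every value lines up with its coordinates.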
z_proba = z_proba[:, 1].reshape(x_grid.shape)
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=1., colors='blue')

plt.show()

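The decision boundary is drawn with the contour idea: it is the contour line on which the predicted probability of class 1 equals 0.5.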
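The shape bookkeeping in the grid step is easy to get wrong, so here is a small NumPy-only sketch of the flatten-then-reshape round trip. The axis limits and the stand-in function `values` are arbitrary assumptions; in the plot above the values come from `predict_proba` instead.

```python
import numpy as np

nx, ny = 200, 100
# Arbitrary axis limits, chosen only for illustration
x_grid, y_grid = np.meshgrid(np.linspace(-3, 3, nx), np.linspace(-2, 3, ny))
print(x_grid.shape)   # (100, 200): rows follow y, columns follow x

# Flatten both grids and stack them column-wise: one (x1, x2) row per grid point
points = np.c_[x_grid.ravel(), y_grid.ravel()]
print(points.shape)   # (20000, 2)

# Stand-in for a per-point model output (here simply x1 + x2)
values = points[:, 0] + points[:, 1]

# Reshape back to the grid so each value sits at its own coordinates
z = values.reshape(x_grid.shape)
print(z.shape)        # (100, 200)
```

Because `ravel()` and `reshape()` use the same row-major order, `z[i, j]` corresponds to the point `(x_grid[i, j], y_grid[i, j])`, which is what `plt.contour` needs.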

Other notes

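yyy = xxx.copy() ## Work on a copy so the original data is not modified (for NumPy arrays and pandas DataFrames, copy() duplicates the data by default, so this is not merely a shallow copy)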
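A short sketch of why the copy matters, using a NumPy array; the variable names `original`, `alias` and `independent` are made up for the illustration.

```python
import numpy as np

original = np.array([1.0, 2.0, 3.0])

alias = original               # plain assignment: both names refer to the same array
independent = original.copy()  # copy(): a separate array with its own data

alias[0] = -1.0                # this also changes `original`
independent[1] = 99.0          # this does not touch `original`

print(original)      # [-1.  2.  3.]
print(independent)   # [ 1. 99.  3.]
```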

Splitting the dataset

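## To evaluate the model properly, split the data into a training set and a test set,
## train on the training set, and validate on the test set.
from sklearn.model_selection import train_test_split

## Keep only the samples of classes 0 and 1 (class 2 is excluded).
## iris_features and iris_target are defined in an earlier step that is not shown here.
iris_features_part = iris_features.iloc[:100]
iris_target_part = iris_target[:100]

## Hold out 20% as the test set (80%/20% split)
x_train, x_test, y_train, y_test = train_test_split(iris_features_part, iris_target_part, test_size = 0.2, random_state = 2020)

## Define the logistic regression model
clf = LogisticRegression(random_state=0, solver='lbfgs')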

For how to choose the solver parameter here, see the expert's link mentioned above!

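You might think that since newton-cg, lbfgs and sag come with so many restrictions, we could simply pick liblinear whenever the sample size is not large. Wrong, because liblinear has a weakness of its own! Logistic regression comes in binary and multiclass forms. For multiclass logistic regression, the two common schemes are one-vs-rest (OvR) and many-vs-many (MvM; scikit-learn implements the alternative to OvR as the multinomial, i.e. softmax, formulation), and MvM is generally somewhat more accurate than OvR. Annoyingly, liblinear only supports OvR and not MvM, so if we need relatively accurate multiclass logistic regression we cannot choose liblinear, which in turn means giving up liblinear's L1 regularization. In summary: liblinear supports L1 and L2 but only OvR for multiclass; "lbfgs", "sag" and "newton-cg" support only L2, and support both OvR and MvM for multiclass. The saga solver, which this passage does not cover, supports L1 together with the multinomial scheme.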
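The solver and penalty pairings described above can be probed directly. The following is a rough sketch, assuming the binary iris subset used in this post is loaded through `load_iris`; the helper `can_fit` and the `results` dictionary are hypothetical names introduced only for this illustration. Warnings are silenced because some scikit-learn versions emit deprecation or convergence warnings for these arguments.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Binary subset: the first 100 iris samples contain only classes 0 and 1
X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]


def can_fit(solver, penalty):
    """Return True if this solver/penalty combination fits without a ValueError."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            LogisticRegression(solver=solver, penalty=penalty, max_iter=1000).fit(X, y)
        return True
    except ValueError:
        return False


results = {
    (solver, penalty): can_fit(solver, penalty)
    for solver in ("liblinear", "lbfgs")
    for penalty in ("l1", "l2")
}
for (solver, penalty), ok in results.items():
    print(f"solver={solver:<9} penalty={penalty}: {'ok' if ok else 'rejected'}")
```

Following the summary above, liblinear should accept both penalties, while lbfgs is expected to accept L2 and reject L1.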

Confusion matrix

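from sklearn import metrics

## Train on the training set, then predict on both the training set and the test set
clf.fit(x_train, y_train)
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Evaluate with accuracy: the share of predictions that are correct
print('The training accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))
print('The test accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))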

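## Confusion matrix: counts of every (true label, predicted label) combination.
## confusion_matrix expects (y_true, y_pred); rows are true labels and columns are predicted labels,
## which matches the axis labels of the heatmap below.
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize the result as a heatmap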
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
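The snippets in this section depend on variables from steps that are not shown in this post, so here is a self-contained sketch of the whole split, fit and evaluate flow without plotting. It assumes the iris data comes from `sklearn.datasets.load_iris` and uses plain NumPy arrays instead of the DataFrame used above; `cm` and `cm_swapped` are illustrative names.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the iris data is loaded via load_iris; the first 100 rows are classes 0 and 1
X, y = load_iris(return_X_y=True)
X_part, y_part = X[:100], y[:100]

# 80%/20% train/test split, same settings as in the post
x_train, x_test, y_train, y_test = train_test_split(
    X_part, y_part, test_size=0.2, random_state=2020)

clf = LogisticRegression(random_state=0, solver='lbfgs')
clf.fit(x_train, y_train)
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

train_acc = metrics.accuracy_score(y_train, train_predict)
test_acc = metrics.accuracy_score(y_test, test_predict)
print('train accuracy:', train_acc)
print('test accuracy:', test_acc)

# Argument order is (y_true, y_pred): rows are true labels, columns are predictions
cm = metrics.confusion_matrix(y_test, test_predict, labels=[0, 1])
# Swapping the arguments yields the transpose, which would mislabel the heatmap axes
cm_swapped = metrics.confusion_matrix(test_predict, y_test, labels=[0, 1])
print(cm)
```

When the classifier makes no mistakes the matrix is diagonal and the argument order makes no visible difference, which is why a swapped call can go unnoticed on an easy dataset like this one.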