Machine Learning Notes 9: Logistic Regression in Practice

程序员文章站 (Programmer Article Station), 2024-01-18 22:23:04


I. Improved Stochastic Gradient Ascent

Recall the gradient ascent algorithm from the previous section:

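def gradAscent(dataMatIn,classLabels):
    # Convert the input to a numpy matrix
    dataMatrix = np.mat(dataMatIn)
    # Convert the labels to a numpy matrix and transpose it into a column vector
    labelMat = np.mat(classLabels).transpose()
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m,n = np.shape(dataMatrix)
    # Step size (learning rate), which controls how large each update is
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n,1))
    # Vectorized gradient ascent update
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = (labelMat - h)
        weights = weights + alpha * dataMatrix.transpose() * error
    # Convert the matrix to an array and return the weights (the optimal parameters)
    return weights.getA()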

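Suppose the data set has 100 samples, so dataMatrix is a 100×3 matrix. Every time h is computed, the matrix product dataMatrix * weights costs 100×3 = 300 multiplications and 100×2 = 200 additions. Updating the regression coefficients (the optimal parameters) weights likewise uses the whole data set and needs another matrix product. This is acceptable for a data set of around 100 samples, but with billions of samples and thousands of features the computational cost becomes prohibitive. So the algorithm needs improving: can we update the regression coefficients without using every sample each time, and use just one sample point per update instead? Doing so cuts the computation substantially, and the resulting method is called stochastic gradient ascent.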
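To make the cost difference concrete, here is a minimal sketch on synthetic data. The random 100×3 matrix, the labels, and the helper names `batch_step` and `single_sample_step` are assumptions made for illustration; they are not the author's data set or code. A full-batch step touches every row, while a single-sample step touches only one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_step(X, y, w, alpha):
    """One full-batch gradient ascent step: uses all m rows of X."""
    h = sigmoid(X @ w)                      # m*n multiplications
    return w + alpha * (X.T @ (y - h))

def single_sample_step(x_i, y_i, w, alpha):
    """One stochastic step: uses a single row, so h needs only n multiplications."""
    h = sigmoid(np.dot(x_i, w))
    return w + alpha * (y_i - h) * x_i

# Synthetic data for illustration only: a bias column plus two random features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)
w = np.ones(3)

w_batch = batch_step(X, y, w, alpha=0.001)
w_single = single_sample_step(X[0], y[0], w, alpha=0.01)

mults_batch = X.shape[0] * X.shape[1]       # multiplications to compute h for all rows
mults_single = X.shape[1]                   # multiplications to compute h for one row
print(mults_batch, mults_single)
```

The per-update cost of the single-sample step does not depend on the number of samples, which is what makes it attractive for very large data sets.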

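1 Stochastic Gradient Ascent

The first improvement is that alpha is adjusted on every update. Although alpha shrinks as the iterations go on, it never reaches 0, because the formula contains a constant term (0.01). This guarantees that new data still has some influence after many iterations. If the problem being handled changes over time, the constant can be increased so that new values have a larger effect on the regression coefficients. Also note how alpha is reduced: alpha = 4/(1.0+j+i) + 0.01, where j is the number of passes completed so far and i is the index of the sample within the current pass, so the variable part shrinks in proportion to 1/(1+j+i). Because i restarts from 0 at the beginning of every pass, alpha does not decrease strictly monotonically. The second improvement is that each update of the regression coefficients (the optimal parameters) uses only one sample point, chosen at random, and a sample that has already been used in the current pass is not used again in that pass. This effectively reduces the amount of computation while preserving the quality of the fit.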
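The two ideas above can be sketched in isolation. The helper names `alpha_schedule` and `sample_order` are hypothetical, and `rng.integers` is used here in place of the `np.random.uniform` call in the listing; this is an illustrative sketch rather than the author's implementation.

```python
import numpy as np

def alpha_schedule(j, i):
    """Step size for pass j, inner step i (both zero-based)."""
    return 4 / (1.0 + j + i) + 0.01

first = alpha_schedule(0, 0)       # largest step, at the very first update
late = alpha_schedule(149, 99)     # last update of a 150-pass run over 100 samples

def sample_order(m, rng):
    """Visit each of the m sample indices exactly once, in random order."""
    data_index = list(range(m))    # indices not yet used in this pass
    order = []
    while data_index:
        pos = int(rng.integers(len(data_index)))  # position in the remaining pool
        order.append(data_index.pop(pos))         # map the position back to a sample index
    return order

order = sample_order(10, np.random.default_rng(0))
print(first, late, order)
```

Note that the random position is mapped back through `data_index` before it is used; indexing the data with the raw position would allow the same sample to be drawn twice in one pass.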

#!/usr/bin/env python
# -*- coding:utf-8 -*-
#@Time  : 2020/3/16 10:06
#@Author: fangyuan
#@File  : Logistic回归绘制决策边界.py
import numpy as np
import matplotlib.pyplot as plt

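def loadDataSet():
    # Feature rows
    dataMat = []
    # Class labels
    labelMat = []
    # Open the data file; the with-block closes it automatically
    with open('Logistic') as fr:
        # Read line by line
        for line in fr.readlines():
            # Strip the newline and split the line into fields
            lineArr = line.strip().split()
            # Add the features, with a leading 1.0 for the intercept term
            dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
            # Add the label
            labelMat.append(int(lineArr[2]))
    return dataMat,labelMat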

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

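def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m,n = np.shape(dataMatrix)
    # Earlier we used a matrix product over all samples; here it is an
    # element-wise array product between the coefficients and one random sample
    weights = np.ones(n)                                      # initialize the parameters
    for j in range(numIter):
        dataIndex = list(range(m))                            # samples not yet used in this pass
        for i in range(m):
            # alpha shrinks as j and i grow but never reaches 0
            alpha = 4/(1.0+j+i)+0.01
            randIndex = int(np.random.uniform(0,len(dataIndex)))   # random position in the pool
            sampleIndex = dataIndex[randIndex]                # map the position to a sample index
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights)) # compute h for the chosen sample
            error = classLabels[sampleIndex] - h              # compute the error
            weights = weights + alpha * error * dataMatrix[sampleIndex]   # update the coefficients
            del(dataIndex[randIndex])                         # remove the sample from the pool
    return weights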

def plotBestFit(weights):
    # Load the data set
    dataMat,labelMat = loadDataSet()
    # Convert to a numpy array
    dataArr = np.array(dataMat)
    # Number of samples
    n = np.shape(dataMat)[0]
    # Positive samples
    xcord1 = []
    ycord1 = []
    # Negative samples
    xcord2 = []
    ycord2 = []
    # Split the points by label
    for i in range(n):
        # Label 1: positive sample
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        # Label 0: negative sample
        else:
            xcord2.append(dataArr[i,1])
            ycord2.append(dataArr[i,2])
    fig = plt.figure()
    # Add a subplot
    ax = fig.add_subplot(111)
    # Plot the positive samples
    ax.scatter(xcord1,ycord1,s = 20,c = 'red',marker = 's',alpha =.5)
    # Plot the negative samples
    ax.scatter(xcord2,ycord2,s = 20,c = 'green',alpha=.5)
    # Decision boundary: w0 + w1*x1 + w2*x2 = 0, solved for x2
    x = np.arange(-3.0,3.0,0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x,y)
    # Title
    plt.title('BestFit')
    # Axis labels
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

if __name__ == '__main__':
    dataMat,labelMat = loadDataSet()
    weights = stocGradAscent1(np.array(dataMat),labelMat)
    print(weights)
    plotBestFit(weights)
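The line drawn by plotBestFit is the set of points where the sigmoid input is zero, that is, w0 + w1*x1 + w2*x2 = 0, which gives x2 = (-w0 - w1*x1) / w2. A minimal sketch of that relationship follows; the weight values are made up for illustration and are not the output of the training run, and `boundary_x2` and `classify` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_x2(weights, x1):
    """x2 coordinate of the decision boundary at a given x1."""
    w0, w1, w2 = weights
    return (-w0 - w1 * x1) / w2

def classify(weights, x1, x2):
    """Return 1 if the predicted probability exceeds 0.5, else 0."""
    prob = sigmoid(weights[0] + weights[1] * x1 + weights[2] * x2)
    return 1 if prob > 0.5 else 0

weights = np.array([4.0, 0.5, -0.6])    # made-up values, for illustration only
x1 = 1.0
x2_on_line = boundary_x2(weights, x1)
# On the boundary the model is exactly undecided: probability 0.5.
p_on_line = sigmoid(weights[0] + weights[1] * x1 + weights[2] * x2_on_line)
print(x2_on_line, p_on_line)
```

Points on one side of the line get a probability above 0.5 and are assigned class 1; points on the other side are assigned class 0.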

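2 Regression Coefficients versus Iteration Count

The classification result looks good as well. However, it does not show how the regression coefficients change with the number of iterations, so we cannot see directly how each method converges. We therefore write a program that plots the regression coefficients against the iteration count: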

#!/usr/bin/env python
# -*- coding:utf-8 -*-
#@Time  : 2020/3/16 16:47
#@Author: fangyuan
#@File  : Logistic回归系数与迭代次数展示.py

from matplotlib.font_manager import FontProperties
import numpy as np
import matplotlib.pyplot as plt

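def loadDataSet():
    # Feature rows
    dataMat = []
    # Class labels
    labelMat = []
    # Open the data file; the with-block closes it automatically
    with open('Logistic') as fr:
        # Read line by line
        for line in fr.readlines():
            # Strip the newline and split the line into fields
            lineArr = line.strip().split()
            # Add the features, with a leading 1.0 for the intercept term
            dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
            # Add the label
            labelMat.append(int(lineArr[2]))
    return dataMat,labelMat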

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

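def gradAscent(dataMatIn,classLabels):
    # Convert the input to a numpy matrix
    dataMatrix = np.mat(dataMatIn)
    # Convert the labels to a numpy matrix and transpose it into a column vector
    labelMat = np.mat(classLabels).transpose()
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m,n = np.shape(dataMatrix)
    # Step size (learning rate), which controls how large each update is
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n,1))
    # History of the coefficients after every update
    weights_array = np.array([])
    # Vectorized gradient ascent update
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = (labelMat - h)
        weights = weights + alpha * dataMatrix.transpose() * error
        weights_array = np.append(weights_array,weights)
    # One row per iteration, one column per coefficient
    weights_array = weights_array.reshape(maxCycles,n)
    # Convert the matrix to an array; return the weights and their history
    return weights.getA(),weights_array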

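def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m,n = np.shape(dataMatrix)
    # Earlier we used a matrix product over all samples; here it is an
    # element-wise array product between the coefficients and one random sample
    weights = np.ones(n)                                      # initialize the parameters
    # History of the coefficients after every update
    weights_array = np.array([])
    for j in range(numIter):
        dataIndex = list(range(m))                            # samples not yet used in this pass
        for i in range(m):
            # alpha shrinks as j and i grow but never reaches 0
            alpha = 4/(1.0+j+i)+0.01
            randIndex = int(np.random.uniform(0,len(dataIndex)))   # random position in the pool
            sampleIndex = dataIndex[randIndex]                # map the position to a sample index
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights)) # compute h for the chosen sample
            error = classLabels[sampleIndex] - h              # compute the error
            weights = weights + alpha * error * dataMatrix[sampleIndex]   # update the coefficients
            # Append the current coefficients to the history
            weights_array = np.append(weights_array,weights,axis=0)
            del(dataIndex[randIndex])                         # remove the sample from the pool
    # One row per update, one column per coefficient
    weights_array = weights_array.reshape(numIter*m,n)
    return weights,weights_array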

def plotWeights(weights_array1,weights_array2):
    # Split the figure into 3 rows and 2 columns without shared x or y axes; figure size is (20,10)
    # axs[0][0] is the subplot in the first row, first column
    fig, axs = plt.subplots(nrows=3, ncols=2,sharex=False, sharey=False, figsize=(20,10))
    # Left column: weights_array1, right column: weights_array2.
    # The titles follow the call in __main__, which passes the stochastic history first.
    # The labels are plain English, so no CJK font file is needed.
    columns = [(weights_array1,'Improved stochastic gradient ascent: coefficients vs. iterations'),
               (weights_array2,'Gradient ascent: coefficients vs. iterations')]
    for col,(weights_array,title) in enumerate(columns):
        x = np.arange(0, len(weights_array), 1)
        # Plot w0, w1 and w2 against the iteration count, one row each
        for row in range(3):
            axs[row][col].plot(x,weights_array[:,row])
            ylabel_text = axs[row][col].set_ylabel('W%d' % row)
            plt.setp(ylabel_text, size=20, weight='bold', color='black')
        title_text = axs[0][col].set_title(title)
        plt.setp(title_text, size=20, weight='bold', color='black')
        xlabel_text = axs[2][col].set_xlabel('Iterations')
        plt.setp(xlabel_text, size=20, weight='bold', color='black')

    plt.show()

if __name__ == '__main__':
    dataMat,labelMat = loadDataSet()
    weights1,weights_array1 = stocGradAscent1(np.array(dataMat),labelMat)
    weights2,weights_array2 = gradAscent(dataMat,labelMat)
    # print(weights2)
    print(weights_array1)
    print(weights_array2)
    plotWeights(weights_array1,weights_array2)

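Because the improved stochastic gradient ascent picks sample points at random, every run gives a slightly different result, but the overall trend is the same. The improved algorithm converges better. Why? There are 100 sample points, and the improved algorithm makes 150 passes over them. The plot shows 15,000 iterations because the regression coefficients are updated once per sample, so 150 passes amount to 150*100 = 15,000 updates. In the left-hand panels (improved stochastic gradient ascent), the coefficients have already converged after about 2,000 updates, which corresponds to roughly 20 passes over the whole data set. At that point training is effectively done.

Now look at the right-hand panels (gradient ascent). Every update in gradient ascent traverses the whole data set, and the plot shows the coefficients converging only after a little more than 300 iterations. Rounding down, call it converged after 300 passes over the data set.

The comparison speaks for itself: the improved stochastic gradient ascent starts to converge around the 20th pass over the data set, while gradient ascent does so only around the 300th. Now imagine a much larger data set: which one would you rather run?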
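The bookkeeping behind this comparison can be sketched as follows. The helper `first_stable_index`, its tolerance, and the exponentially decaying toy trajectory are all assumptions made for illustration; they are not part of the author's code, and the toy numbers are unrelated to the real plots.

```python
import numpy as np

def updates_per_run(num_passes, num_samples, stochastic):
    """Number of coefficient updates: one per sample if stochastic, one per pass otherwise."""
    return num_passes * num_samples if stochastic else num_passes

def first_stable_index(weights_array, tol=0.05):
    """Smallest row k such that row k and every later row stay within tol of the final row."""
    diffs = np.max(np.abs(weights_array - weights_array[-1]), axis=1)
    unstable = np.nonzero(diffs > tol)[0]
    return 0 if unstable.size == 0 else int(unstable[-1]) + 1

stochastic_updates = updates_per_run(150, 100, stochastic=True)
batch_updates = updates_per_run(500, 100, stochastic=False)

# Toy history: three coefficients that decay smoothly towards 1.0.
k = np.arange(100)
toy_history = 1.0 + np.exp(-k / 10.0)[:, None] * np.ones(3)
stable_from = first_stable_index(toy_history, tol=0.05)
print(stochastic_updates, batch_updates, stable_from)
```

Applied to the histories returned by stocGradAscent1 and gradAscent, dividing the stochastic result by the number of samples would give a rough pass count that can be compared with the batch result; the choice of tolerance is a judgment call.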

II. Predicting Horse Mortality from Colic Symptoms

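Python Basics

Common errors:

NameError: accessing a variable that has not been defined
ZeroDivisionError: dividing by zero
SyntaxError: invalid syntax
IndexError: using an index outside the range of a sequence
KeyError: looking up a dictionary key that does not exist
IOError: input/output error, for example reading a file that does not exist (in Python 3, IOError is an alias of OSError)
AttributeError: accessing an attribute the object does not have
ValueError: passing an argument of the right type but with an invalid value, for example a non-numeric string such as 'abc' to int()
FN+NUM LOCK: toggles the numeric keypad between arrow-key mode and digit-input mode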
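The exceptions listed above can each be triggered with a one-line snippet. The sketch below is illustrative only: the helper `raised_by` is hypothetical, and the file name is assumed not to exist in the working directory.

```python
def raised_by(action):
    """Call action() and return the exception it raises, or None if it succeeds."""
    try:
        action()
    except Exception as exc:
        return exc
    return None

examples = {
    "NameError": lambda: undefined_name,                     # name was never defined
    "ZeroDivisionError": lambda: 1 / 0,
    "SyntaxError": lambda: compile("x = (", "<demo>", "exec"),
    "IndexError": lambda: [1, 2, 3][5],
    "KeyError": lambda: {"a": 1}["b"],
    "IOError": lambda: open("no_such_file_for_demo.txt"),    # assumed missing file
    "AttributeError": lambda: (1).no_such_attribute,
    "ValueError": lambda: int("abc"),
}

for name, action in examples.items():
    exc = raised_by(action)
    print(f"{name:18s} -> {type(exc).__name__}: {exc}")
```

For the missing file, Python 3 reports FileNotFoundError, which is a subclass of OSError, so an `except IOError` clause still catches it.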
