《机器学习实战》5.Logistic回归源码实现

程序员文章站 2024-02-03 21:13:16

...

结合源码分析第五章中实现的Demo

运行环境：Anaconda——Jupyter Notebook

Python版本为：3.6.2（原书代码实现为2.x 所以在一些代码上略有改动）

参考资料:

Apachecn 专注于优秀项目维护的开源组织

Hands-On Machine Learning with Scikit-Learn and TensorFlow

阅读本文你将获得如下知识：

1.LR的基本思想
2.LR的优缺点
3.用于求解无约束条件的最优化方法——GD以及SGD
4.处理数据集中缺失值的基本思路
5.一个具体的实例，了解LR使用的整个流程

前言

Logistic回归其实更常被用作在分类中。它与传统的线性回归模型相比，多了一步将输出值映射到Sigmoid函数上，进而得到一个范围在0~1之间的数值。我们可以定义大于0.5的数据为1（正）类，而小于0.5被归为0（负）类。由于函数本身的特殊性质，它更加稳定，并且它本质上仍然是线性模型。

优点：
* 计算代价不高，容易理解和实现

缺点：
* 容易欠拟合，分类精度可能不高

我们来具体看一下这个函数的样子：

《机器学习实战》5.Logistic回归源码实现

那么LR的训练过程做了什么呢？就是找到一个最佳的分类回归系数(Weights)

我们将Sigmoid函数的输入记作Z，现实中，某事件的发生一般都是与多种因素相关联的，所以：

Z=W0X0 + W1X1 + … +WnXn 我们可以写成向量的形式：《机器学习实战》5.Logistic回归源码实现

所以这里的W就是我们要找的最佳分类回归系数啦。

如何寻找该系数呢？这里我们使用梯度下降（上升）法，这里下降与上升只是一个符号的区别，但是本质上都是沿着该函数的梯度方向探寻，这里讲了方向，并未讲到移动量的大小。我们可以称其为步长a。这里步长的选取是有说法的，如果选取太大会导致结果发散；如果选取得太小呢又会使得时间复杂度很高。

停止条件呢？通常我们可以设置迭代次数，也可以设置一个迭代终止的误差条件。

使用梯度上升找到最佳参数

from numpy import *
import matplotlib.pyplot as plt
#Logistic回归梯度上升优化算法
#读取数据
def loadDataSet():
    dataMat = []
    labelMat = []
    fr = open('TestSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split('\t')
        # 为了方便计算，我们将 X0 的值设为 1.0 ，也就是在每一行的开头添加一个 1.0 作为 X0
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

#定义sigmoid函数
def sigmoid(inX):
    return 1.0/(1+exp(-inX))

#输入数据特征与数据的类别标签
#返回最佳回归系数(weights)
def gradAscent(dataMatIn, classLabels):
    #转换为numpy型
    dataMatrix = mat(dataMatIn) 
    # 转化为矩阵[[0,1,0,1,0,1.....]]，并转制[[0],[1],[0].....] 
    # transpose() 行列转置函数
    # 将行向量转化为列向量   =>  矩阵的转置
    labelMat = mat(classLabels).transpose()
    # m->数据量，样本数 n->特征数
    m,n = shape(dataMatrix)
    alpha = 0.001 #步长
    maxCycles = 500 #迭代次数
    #初始化权值向量，每个维度均为1.0
    weights = ones((n,1))
    for k in range(maxCycles):
        #求当前sigmoid函数的值
        h = sigmoid(dataMatrix * weights)
        error = (labelMat - h)
        weights = weights + alpha * dataMatrix.transpose() * error
    return array(weights)

肯定有读者和我一样，一开始不理解倒数第二行weights为什么会那样赋值更新？不是求梯度嘛？最后怎么梯度怎么变成 dataMatrix.transpose() error* 其实呢这里有一个数学推导,可以自己查一下（涉及最大似然估计估计模型参数）

最后我们得到一组Weights：

array([[ 4.12414349],
       [ 0.48007329],
       [-0.6168482 ]])

分析数据：画出决策边界

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    # n->数据量，样本数
    n = shape(dataArr)[0]
    #xcord1,ycord1代表正例特征
    #xcord2,ycord2代表负例特征
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    #循环筛选出正负集
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s = 30, c = 'red', marker = 's')
    ax.scatter(xcord2, ycord2, s = 30, c = 'green')
    #设定边界直线x和y的值
    x = arange(-3.0, 3.0, 0.1)
    """
    y的由来？
    首先理论上是这个样子的。
    dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
    w0*x0+w1*x1+w2*x2=f(x)
    x0最开始就设置为1， x2就是我们画图的y值。0是两个类别的分界处
    所以： w0+w1*x+w2*y=0 => y = (-w0-w1*x)/w2   
    """
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

看一下输出结果：《机器学习实战》5.Logistic回归源码实现

训练算法：随机梯度上升

在GD中每次更新回归系数都要遍历整个数据集，如果数据量庞大的话，该方法的计算复杂度就很高。一种改进的方法是一次仅使用一个样本点来更新回归系数，我们叫做随机梯度上升法，并且它是一种在线学习算法。前面的梯度上升法我们叫做(Batch learning)批处理

我们看一下到底什么是Batch learning？有这样的一段话：

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning

简而言之就是它不行 ==(

那么我们随机梯度上升牛在哪里？记住它是一种在线学习方式哦。

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources。

def stoGradAscent0(dataMatrix, classLabels):
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)
    sumWeights = []
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

两者在代码上的区别：

1.后者的变量h和误差error都是向量
2.前者没有矩阵的转换过程，所有变量的数据类型都是Numpy数组

可视化：

《机器学习实战》5.Logistic回归源码实现

从结果上来看，该方法不如直接使用梯度上升法那样完美。

直接这样比较是不公平的，我们知道在梯度上升法中迭代了500次。

一个判断优化算法优劣的可靠方法是看它是否收敛，也就是说参数是否达到了稳定值，是否还会不断的变化？通过分析，我们使用改进的随机梯度上升算法

改进的随机梯度上升算法

#改进的随机梯度上升法（随机化）
#使用随机的一个样本来更新回归系数
#numIter迭代
def stoGradAscent1(dataMatrix, classLabels, numIter = 150):
    #m为行 n为列
    m,n = shape(dataMatrix)
    weights = ones(n)
    # 随机梯度, 循环150,观察是否收敛
    for j in range (numIter):
        # [0, 1, 2 .. m-1]
        dataIndex = range(m)
        for i in range(m):
            #alpha会随着迭代次数不断减小，但不会是0
            alpha = 4 / (1.0 + j + i) + 0.0001
            # random.uniform(x, y) 方法将随机生成下一个实数，它在[x,y]范围内,x是这个范围内的最小值，y是这个范围内的最大值。
            randIndex = int(random.uniform(0, len(dataIndex)))
            #随机选取更新
            h = sigmoid(sum(dataMatrix[randIndex] * weights))
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del (list(dataIndex)[randIndex])
    return weights

该程序在原有基础上做了三点优化：

1.alpha每次迭代的时候都会调整
2.随机选取样本来更新回归系数，这种做法会减少周期性波动
3.增加了迭代参数的设置。默认为150次，如果给定，将会按照新的参数值进行迭代。

我们来看一下可视化：
《机器学习实战》5.Logistic回归源码实现

在计算量更少的情况下，也得到了一个很好的效果。

示例从疝气病症预测病马的死亡率

由于数据中会有缺失值，因此我们要处理数据中的缺失值
这里只是列了几个方法，其实数据清洗以及特征提取是一项很庞大的工程。

1.使用可用特征的均值来填补缺失值
2.使用特殊值来填补缺失值
3.忽略有缺失值的样本
4.使用相似样本的均值填补缺失值
5.使用另外的机器学习算法来预测缺失值

用Logistic回归进行分类

#输入：特征向量与回归系数
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    #大于0.5 返回 1；否则返回0
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('HorseColicTraining.txt')
    frTest = open('HorseColicTest.txt')
    trainingSet = []
    trainingLabels = []
    # trainingSet 中存储训练数据集的特征，trainingLabels 存储训练数据集的样本对应的分类标签
    for line in frTrain.readlines():
        currLine = line.strip().split('\t') #分割
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        #存入训练样本特征
        trainingSet.append(lineArr)
        #存入训练样本标签
        trainingLabels.append(float(currLine[21]))
    #使用改进后的随机梯度下降算法得到回归系数
    trainingWeights = stoGradAscent1(array(trainingSet), trainingLabels, 500)

    #统计测试集预测错误样本数量和样本总数
    errorCount = 0;
    numTestVec = 0.0
    for line in frTest.readlines():
        #循环一次，样本数加1
        numTestVec += 1.0
        currLine = line.strip().split('\t') #分割
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        # 利用分类预测函数对该样本进行预测，并与样本标签进行比较
        if int(classifyVector(array(lineArr), trainingWeights)) != int(currLine[21]):
            #如果预测错误，错误数加1
            errorCount += 1
    #计算错误率
    errorRate = (float(errorCount) / numTestVec)
    print('the error rate of this test is : %f' % errorRate)
    return errorRate


def multiTest():
    numTests = 10
    errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print('after %d iterations the average error rete is : %f ' % (numTests,errorSum / float(numTests)))

最后一个函数multiTest()，它的功能是调用colicTest()10次，并求一个平均值。
看一下结果：

he error rate of this test is : 0.328358
the error rate of this test is : 0.373134
the error rate of this test is : 0.328358
the error rate of this test is : 0.298507
the error rate of this test is : 0.388060
the error rate of this test is : 0.313433
the error rate of this test is : 0.268657
the error rate of this test is : 0.253731
the error rate of this test is : 0.313433
the error rate of this test is : 0.283582
after 10 iterations the average error rete is : 0.314925

事实上这个结果并不差，因为原始的数据集中有30%的数据缺失。

此外，我们通过调整步长与迭代次数该结果肯定也会有改变。

相关标签：机器学习实战机器学习

上一篇： dom4j 解析XML文件

下一篇：使用pyinstaller打包py文件成.exe程序详解（附pyinstaller命令参数及打包实例）

《机器学习实战》5.Logistic回归源码实现

前言

使用梯度上升找到最佳参数

分析数据：画出决策边界

训练算法：随机梯度上升

改进的随机梯度上升算法

示例从疝气病症预测病马的死亡率