CNTK: Logistic Regression

程序员文章站 2024-03-26 09:06:05

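Logistic regression is one of the first topics covered when learning machine learning. This tutorial is aimed at readers who are new to both machine learning and the CNTK platform, and it uses the Python API. A BrainScript version of the example is available separately.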

Introduction:

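Problem description: A cancer hospital has provided data and wants us to determine whether a patient has a fatal malignant tumor or a benign one. This kind of problem is known as a classification problem. To help classify each patient, we are given their age and the size of the tumor. Intuitively, younger patients and/or patients with small tumors are less likely to have a malignant tumor. In the plot below, red indicates malignant and blue indicates benign. Note: this is a simplified example for learning; in real life, a diagnosis or treatment decision requires many features from different tests and examinations, together with a doctor's expertise.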

from IPython.display import Image
Image(url="https://www.cntk.ai/jup/cancer_data_plot.jpg", width=400, height=400)

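Goal: We want to learn a classifier that automatically labels any patient as benign or malignant based on two features (age and tumor size). In this tutorial we will create a linear classifier.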

The classification result looks like this:

Image(url ="https://www.cntk.ai/jup/cancer_classify_plot.jpg" , width = 400, height = 400)

In the plot above, the green line represents the model learned from the data; it separates the blue points from the red points.

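Any learning algorithm typically has five stages: data reading, data preprocessing, model creation, learning the model parameters, and evaluating the model (also known as testing or prediction).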

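1. Data reading: we generate a simulated dataset in which each sample has two features (shown below), representing age and tumor size.
2. Data preprocessing: features such as size or age often need to be scaled; typically the data is scaled to between 0 and 1.
3. Model creation: this tutorial introduces a basic linear model.
4. Learning the model: this is also known as training. A linear model can be fitted in various ways; in CNTK we use stochastic gradient descent (SGD).
5. Evaluation: also known as testing, this runs the trained model on data that was not used for training.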
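The preprocessing step can be illustrated with a small min-max scaling sketch in NumPy. The helper name `min_max_scale` and the raw values are hypothetical and chosen only for illustration; the tutorial itself generates data that is used without this step.

```python
import numpy as np

def min_max_scale(x):
    """Scale each column of x linearly into the range [0, 1]."""
    x = np.asarray(x, dtype=np.float32)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero
    col_range[col_range == 0] = 1.0
    return (x - col_min) / col_range

# Hypothetical raw values: age in years, tumor size in cm
raw = np.array([[25, 1.0], [40, 2.5], [70, 4.0]])
scaled = min_max_scale(raw)
print(scaled)
```

After scaling, both columns share the same range, so neither feature dominates the weighted sum purely because of its units.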

Logistic regression

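Logistic regression is a fundamental machine learning technique that uses a linearly weighted combination of features to produce the probability of each class. Here the classifier outputs a probability in the range [0, 1], which is compared with a chosen threshold (most often 0.5) to produce a binary label, 0 or 1. This tutorial covers binary classification, but the method extends to multi-class problems.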
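As a minimal sketch of the thresholding step, the following NumPy snippet maps a few weighted sums to probabilities with the sigmoid function and compares them with a threshold of 0.5. The score values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weighted sums for three patients
scores = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(scores)

# Compare each probability with the threshold to obtain a binary label
threshold = 0.5
labels_pred = (probs >= threshold).astype(int)
print(probs, labels_pred)
```

A score of 0 maps to a probability of exactly 0.5, so with `>=` it falls on the positive side; whether the boundary case counts as 0 or 1 is a convention.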

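As the figure above shows, the contributions of the different input features are linearly weighted. The resulting sum is mapped to the range [0, 1] by the sigmoid function; for problems with more than two classes, the softmax function can be used instead.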

Check that CNTK is installed, and which version:

from __future__ import print_function
import numpy as np
import sys
import os

import cntk as C

if 'TEST_DEVICE' in os.environ:
    if os.environ['TEST_DEVICE'] == 'cpu':
        C.device.try_set_default_device(C.device.cpu())
    else:
        C.device.try_set_default_device(C.device.gpu(0))
if not C.__version__ == "2.0":
    raise Exception("this notebook was designed to work with 2.0. Current Version: " + C.__version__)

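Data generation

We use the numpy library to generate some simulated cancer data. Two input features and two label classes are defined. In this example every training sample carries one label, benign or malignant, so this is a binary classification problem.

Define the network

input_dim = 2
num_output_classes = 2

Features and labels

In this tutorial the data is generated with the numpy library.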

from __future__ import print_function
import numpy as np
import sys
import os

import cntk as C
# Plot the data 
import matplotlib.pyplot as plt
# Define the network
input_dim = 2
num_output_classes = 2

# Ensure that we always get the same results
np.random.seed(0)

# Helper function to generate a random data sample
def generate_random_data_sample(sample_size, feature_dim, num_classes):
    # Create synthetic data using NumPy. 
    Y = np.random.randint(size=(sample_size, 1), low=0, high=num_classes)

    # Make sure that the data is separable 
    X = (np.random.randn(sample_size, feature_dim)+3) * (Y+1)
    
    # Specify the data type to match the input variable used later in the tutorial 
    # (default type is double)
    X = X.astype(np.float32)    
    
    # convert class 0 into the vector "1 0 0", 
    # class 1 into the vector "0 1 0", ...
    class_ind = [Y==class_number for class_number in range(num_classes)]
    Y = np.asarray(np.hstack(class_ind), dtype=np.float32)
    return X, Y  
# Generate a small sample of features and one-hot labels to visualize
mysamplesize = 32
features, labels = generate_random_data_sample(mysamplesize, input_dim, num_output_classes)

# let 0 represent malignant/red and 1 represent benign/blue 
colors = ['r' if label == 0 else 'b' for label in labels[:,0]]

plt.scatter(features[:,0], features[:,1], c=colors)
plt.xlabel("Age (scaled)")
plt.ylabel("Tumor size (in cm)")
plt.show()

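To make every run produce the same result, the random number generator is seeded, which guarantees that the same random numbers are generated each time. The data is then generated with numpy and visualized with matplotlib.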
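The two ideas in the helper above, seeding and one-hot encoding, can be checked in isolation. This is a small sketch with made-up class labels; it reuses the same `np.hstack` construction as `generate_random_data_sample`.

```python
import numpy as np

# Seeding twice with the same value reproduces the same random draws
np.random.seed(0)
first = np.random.randint(low=0, high=2, size=(5, 1))
np.random.seed(0)
second = np.random.randint(low=0, high=2, size=(5, 1))
same = np.array_equal(first, second)

# One-hot encoding: each class index becomes a row with a single 1
Y = np.array([[0], [1], [1], [0]])
class_ind = [Y == class_number for class_number in range(2)]
one_hot = np.asarray(np.hstack(class_ind), dtype=np.float32)
print(same)
print(one_hot)
```

Each row of `one_hot` has a 1 in the column of its class, which is the label format the loss function in this tutorial expects.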

Model creation

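Its mathematical form is:

$$z = \sum_{i=1}^{n} w_i \times x_i + b = \mathbf{w} \cdot \mathbf{x} + b$$

Here w is the weight vector of length n and b is the bias. The sigmoid or softmax function can then map this sum to the range 0 to 1.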
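A quick numeric check may help here: the sum form and the dot-product form of the formula give the same value. The weights, features, and bias below are arbitrary illustrative numbers, not learned parameters.

```python
import numpy as np

# Hypothetical weights, features, and bias for n = 2
w = np.array([0.5, -1.0])
x = np.array([2.0, 3.0])
b = 0.25

# Explicit sum over i of w_i * x_i, plus the bias
z_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# The same quantity written as a dot product
z_dot = np.dot(w, x) + b
print(z_sum, z_dot)
```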

Define the input

feature = C.input_variable(input_dim, np.float32)

If the input were instead an image of 10 x 5 pixels, the call would be written as C.input_variable(10*5, np.float32).
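An input declared with 10*5 dimensions expects each sample as a flat vector of 50 values. A minimal NumPy sketch of that flattening, using a placeholder image filled with consecutive numbers:

```python
import numpy as np

# Placeholder 10 x 5 "image"; real pixel data would go here
image = np.arange(50, dtype=np.float32).reshape(10, 5)

# Flatten row by row into a vector of length 10 * 5
flat = image.reshape(-1)
print(flat.shape)
```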

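Network setup

The linear_layer function is a straightforward implementation of the formula above. It performs two operations:

1. Multiply the features x by the weights w using the times operation

2. Add the bias b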

feature = C.input_variable(input_dim, np.float32)
# Define a dictionary to store the model parameters
mydict = {}

def linear_layer(input_var, output_dim):
    
    input_dim = input_var.shape[0]
    weight_param = C.parameter(shape=(input_dim, output_dim))
    bias_param = C.parameter(shape=(output_dim))
    
    mydict['w'], mydict['b'] = weight_param, bias_param

    return C.times(input_var, weight_param) + bias_param
output_dim = num_output_classes
z = linear_layer(feature, output_dim)

z represents the output of the network.

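Learning model parameters

The network is now built, but the parameters w and b still have to be learned. To do so, we first use the softmax function to map z to the range 0 to 1. Softmax is an activation function that normalizes its inputs.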
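To make the normalization concrete, here is a plain NumPy softmax; it is a sketch for illustration and not CNTK's implementation. Subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax: exponentiate, then normalize each row to sum to 1."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Two hypothetical network outputs with two classes each
p = softmax([[2.0, 0.0], [0.0, 0.0]])
print(p)
print(p.sum(axis=1))
```

Equal scores give equal probabilities, and for two classes the first probability equals the sigmoid of the score difference.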

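Training

The softmax function outputs the probability of each class. To train the classifier, we need to define a loss function that measures the error between the output and the true label, and then minimize it.

Cross-entropy is a commonly used loss function. Its mathematical form is:

$$H(p) = -\sum_{j=1}^{|y|} y_j \log(p_j)$$

where p is the predicted probability computed by softmax and y is the true label.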
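The formula can be evaluated by hand for a one-hot label. The sketch below uses made-up probability vectors to compare a confident correct prediction with an uninformative one.

```python
import numpy as np

def cross_entropy(y, p):
    """H(p) = -sum_j y_j * log(p_j) for a one-hot label y and probabilities p."""
    return float(-np.sum(y * np.log(p)))

# True class is the second one
y = np.array([0.0, 1.0])

confident = cross_entropy(y, np.array([0.1, 0.9]))
uniform = cross_entropy(y, np.array([0.5, 0.5]))
print(confident, uniform)
```

The uniform case evaluates to ln 2 ≈ 0.6931, the loss of a two-class model that assigns equal probability to both classes. The more probability the model puts on the true class, the closer the loss gets to 0.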

label = C.input_variable(num_output_classes, np.float32)
loss = C.cross_entropy_with_softmax(z, label)

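Evaluation

To evaluate the classification result, we can compute classification_error, which is 0 for a sample if the model predicts it correctly and 1 otherwise.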
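Averaged over a minibatch, this gives the fraction of misclassified samples. A NumPy sketch of the same idea, with hypothetical scores and one-hot labels (the function here is a stand-in written for illustration, not CNTK's operator):

```python
import numpy as np

def classification_error(scores, one_hot_labels):
    """Fraction of samples whose highest-scoring class differs from the label."""
    predicted = np.argmax(scores, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted != actual))

# Hypothetical network outputs for four samples
scores = np.array([[2.0, -1.0], [0.3, 0.8], [1.5, 0.2], [-0.5, 0.1]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=np.float32)

err = classification_error(scores, labels)
print(err)
```

Only the third sample is misclassified here, so the error is 1 out of 4.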

eval_error = C.classification_error(z, label)

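Training

During training we try to make the loss as small as possible. Here we use stochastic gradient descent (SGD). Typically, training starts from a random initialization of the model parameters. The error between the prediction and the true label is then computed, and gradient descent is applied to produce a new set of model parameters.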
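The update loop can be sketched without CNTK to show what a learner does on each minibatch. Everything below is an illustrative assumption: the helper names, the zero initialization, the learning rate of 0.02, and the number of steps are not taken from the tutorial. The gradient of the mean softmax cross-entropy with respect to the scores is (p - y) / N, which is what the loop uses.

```python
import numpy as np

rng = np.random.RandomState(0)

def make_batch(n):
    """Synthetic two-class data in the same spirit as the tutorial's generator."""
    y = rng.randint(0, 2, size=(n, 1))
    x = (rng.randn(n, 2) + 3) * (y + 1)
    one_hot = np.hstack([y == 0, y == 1]).astype(np.float64)
    return x, one_hot

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_loss(x, y, W, b):
    p = softmax(x @ W + b)
    return float(-np.mean(np.sum(y * np.log(p + 1e-12), axis=1)))

# Assumed starting point and step size (illustrative values)
W = np.zeros((2, 2))
b = np.zeros(2)
learning_rate = 0.02

x_eval, y_eval = make_batch(500)
loss_before = mean_loss(x_eval, y_eval, W, b)

for _ in range(2000):
    x, y = make_batch(25)
    p = softmax(x @ W + b)
    grad_z = (p - y) / len(x)            # gradient of the mean loss w.r.t. scores
    W -= learning_rate * (x.T @ grad_z)  # step against the weight gradient
    b -= learning_rate * grad_z.sum(axis=0)

loss_after = mean_loss(x_eval, y_eval, W, b)
print(loss_before, loss_after)
```

Each iteration looks at only 25 samples, so individual steps are noisy, which is why the tutorial smooths the reported loss with a moving average.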

# Define a utility function to compute the moving average.
# A more efficient implementation is possible with np.cumsum() function
def moving_average(a, w=10):
    if len(a) < w: 
        return a[:]    
    return [val if idx < w else sum(a[(idx-w):idx])/w for idx, val in enumerate(a)]


# Define a utility that prints the training progress
def print_training_progress(trainer, mb, frequency, verbose=1):
    training_loss, eval_error = "NA", "NA"

    if mb % frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        if verbose: 
            print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}".format(mb, training_loss, eval_error))
        
    return mb, training_loss, eval_error
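The comment above mentions that `np.cumsum()` allows a more efficient moving average. One possible version is sketched below; the name `moving_average_cumsum` is hypothetical, and a copy of the loop-based logic is included only so the two can be compared. Like the original, it keeps the first `w` values unchanged and averages the `w` values preceding each later index.

```python
import numpy as np

def moving_average_loop(a, w=10):
    """Reference: same logic as the list-comprehension version."""
    if len(a) < w:
        return a[:]
    return [val if idx < w else sum(a[(idx - w):idx]) / w
            for idx, val in enumerate(a)]

def moving_average_cumsum(a, w=10):
    """Same result using a prefix sum instead of re-summing each window."""
    a = np.asarray(a, dtype=np.float64)
    if len(a) < w:
        return a.copy()
    # c[k] holds the sum of a[:k]
    c = np.concatenate(([0.0], np.cumsum(a)))
    out = a.copy()
    idx = np.arange(w, len(a))
    # sum(a[idx-w:idx]) equals c[idx] - c[idx-w]
    out[w:] = (c[idx] - c[idx - w]) / w
    return out

data = [float(v) for v in range(1, 31)]
loop_result = moving_average_loop(data, w=10)
cumsum_result = moving_average_cumsum(data, w=10)
print(np.allclose(loop_result, cumsum_result))
```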

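Run the training

With the steps above, the logistic regression model is now set up. Typically a large portion of the observed data is used for training, for example 70% of the total, and the rest is held out to evaluate the model.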
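For a fixed dataset, a 70/30 split could look like the sketch below. The tutorial itself generates fresh random data for every minibatch instead, so this is only an illustration with placeholder data and hypothetical variable names.

```python
import numpy as np

rng = np.random.RandomState(0)

# Placeholder dataset: 100 samples with two features and a class index
num_samples = 100
X = rng.randn(num_samples, 2).astype(np.float32)
y = rng.randint(0, 2, size=num_samples)

# Shuffle the indices, then take the first 70% for training
perm = rng.permutation(num_samples)
split = (num_samples * 7) // 10
train_idx, test_idx = perm[:split], perm[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)
```

Shuffling before splitting avoids a biased split when the data is stored in some order, for example sorted by class.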

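# Instantiate the trainer that drives the model training.
# Note: the trainer is used below but is not defined anywhere in the text above.
# This is a sketch using the CNTK 2.0 API; the learning rate of 0.5 is an assumed value.
learning_rate = 0.5
lr_schedule = C.learning_rate_schedule(learning_rate, C.UnitType.minibatch)
learner = C.sgd(z.parameters, lr_schedule)
trainer = C.Trainer(z, (loss, eval_error), [learner])

# Initialize the parameters for the trainer
minibatch_size = 25
num_samples_to_train = 20000
num_minibatches_to_train = int(num_samples_to_train / minibatch_size)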

from collections import defaultdict

# Run the trainer and perform model training
training_progress_output_freq = 50
plotdata = defaultdict(list)

for i in range(0, num_minibatches_to_train):
    features, labels = generate_random_data_sample(minibatch_size, input_dim, num_output_classes)
    
    # Assign the minibatch data to the input variables and train the model on the minibatch
    trainer.train_minibatch({feature : features, label : labels})
    batchsize, loss, error = print_training_progress(trainer, i, 
                                                     training_progress_output_freq, verbose=1)
    
    if not (loss == "NA" or error =="NA"):
        plotdata["batchsize"].append(batchsize)
        plotdata["loss"].append(loss)
        plotdata["error"].append(error)

The output is:

Minibatch: 0, Loss: 0.6931, Error: 0.32
Minibatch: 50, Loss: 4.4290, Error: 0.36
Minibatch: 100, Loss: 0.4585, Error: 0.16
Minibatch: 150, Loss: 0.7228, Error: 0.32
Minibatch: 200, Loss: 0.1290, Error: 0.08
Minibatch: 250, Loss: 0.1321, Error: 0.08
Minibatch: 300, Loss: 0.1012, Error: 0.04
Minibatch: 350, Loss: 0.1076, Error: 0.04
Minibatch: 400, Loss: 0.3087, Error: 0.08
Minibatch: 450, Loss: 0.3219, Error: 0.12
Minibatch: 500, Loss: 0.4076, Error: 0.20
Minibatch: 550, Loss: 0.6784, Error: 0.24
Minibatch: 600, Loss: 0.2988, Error: 0.12
Minibatch: 650, Loss: 0.1676, Error: 0.12
Minibatch: 700, Loss: 0.2772, Error: 0.12
Minibatch: 750, Loss: 0.2309, Error: 0.04

# Compute the moving average loss to smooth out the noise in SGD
plotdata["avgloss"] = moving_average(plotdata["loss"])
plotdata["avgerror"] = moving_average(plotdata["error"])

# Plot the training loss and the training error
import matplotlib.pyplot as plt

plt.figure(1)
plt.subplot(211)
plt.plot(plotdata["batchsize"], plotdata["avgloss"], 'b--')
plt.xlabel('Minibatch number')
plt.ylabel('Loss')
plt.title('Minibatch run vs. Training loss')

plt.show()

plt.subplot(212)
plt.plot(plotdata["batchsize"], plotdata["avgerror"], 'r--')
plt.xlabel('Minibatch number')
plt.ylabel('Label Prediction Error')
plt.title('Minibatch run vs. Label Prediction Error')
plt.show()

Evaluate the model

To evaluate the model, we feed the remaining data into the trained model and compare the predictions with the true labels.

# Run the trained model on a newly generated dataset
test_minibatch_size = 25
features, labels = generate_random_data_sample(test_minibatch_size, input_dim, num_output_classes)

trainer.test_minibatch({feature : features, label : labels})

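Here the error on the test minibatch is 0.12. This is a key metric: if the test error greatly exceeds the training error, it indicates that the model has overfitted the training data.

Prediction and evaluation

Check how many predictions are wrong: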

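# `result` is not defined in the text above; it is assumed here to be the
# softmax output of the trained model on the test features.
out = C.softmax(z)
result = out.eval({feature : features})

print("Label    :", [np.argmax(label) for label in labels])
print("Predicted:", [np.argmax(x) for x in result[0]])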

Visualization

# Model parameters
print(mydict['b'].value)

bias_vector   = mydict['b'].value
weight_matrix = mydict['w'].value

# Plot the data 
import matplotlib.pyplot as plt

# let 0 represent malignant/red, and 1 represent benign/blue
colors = ['r' if label == 0 else 'b' for label in labels[:,0]]
plt.scatter(features[:,0], features[:,1], c=colors)
plt.plot([0, bias_vector[0]/weight_matrix[0][1]], 
         [ bias_vector[1]/weight_matrix[0][0], 0], c = 'g', lw = 3)
plt.xlabel("Patient age (scaled)")
plt.ylabel("Tumor size (in cm)")
plt.show()
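For a two-class model of the form z = xW + b, one way to reason about the separating line is that it lies where both class scores are equal. The sketch below derives the axis intercepts from that condition; it is independent of the plotting code above, and the parameter values are hypothetical rather than taken from the training run.

```python
import numpy as np

# Hypothetical trained parameters (not from the run above)
W = np.array([[-1.0, 1.0],
              [-0.5, 0.5]])
b = np.array([6.0, -6.0])

# Scores are equal where (W[:, 0] - W[:, 1]) . x + (b[0] - b[1]) = 0
dw = W[:, 0] - W[:, 1]
db = b[0] - b[1]

# Intercepts of that line with the two feature axes
x1_intercept = -db / dw[0]
x2_intercept = -db / dw[1]

# Sanity check: both class scores agree at the intercept points
z_on_x1 = np.array([x1_intercept, 0.0]) @ W + b
z_on_x2 = np.array([0.0, x2_intercept]) @ W + b
print(x1_intercept, x2_intercept)
```

Connecting the two intercept points gives the line on which the model is undecided between the two classes.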
