
ML.NET Machine Learning Framework Study Notes [9]: Automated Learning (AutoML)

程序员文章站 2022-06-28 21:10:32

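I. Overview

In this article we first build a wine-quality prediction program with a regression algorithm, then implement it again with AutoML, and learn how to use AutoML by comparing the two implementations.

The data is the UCI Wine Quality Dataset, hosted on the competition site kaggle.com: https://www.kaggle.com/c/uci-wine-quality-dataset/data

The inputs are physicochemical measurements of each wine, such as alcohol content, and the output is the score given by wine tasters. The fields are described as follows: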

Data fields
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Other:
13 - id (unique ID for each sample, needed for submission)

   

II. Code

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Regression_WineQuality
{
    // One row of the CSV file; LoadColumn gives the zero-based column index
    public class WineData
    {
        [LoadColumn(0)]
        public float FixedAcidity;

        [LoadColumn(1)]
        public float VolatileAcidity;

        [LoadColumn(2)]
        public float CitricAcid;

        [LoadColumn(3)]
        public float ResidualSugar;

        [LoadColumn(4)]
        public float Chlorides;

        [LoadColumn(5)]
        public float FreeSulfurDioxide;

        [LoadColumn(6)]
        public float TotalSulfurDioxide;

        [LoadColumn(7)]
        public float Density;

        [LoadColumn(8)]
        public float PH;

        [LoadColumn(9)]
        public float Sulphates;

        [LoadColumn(10)]
        public float Alcohol;

        // The taster's score is the value we want to predict
        [LoadColumn(11)]
        [ColumnName("Label")]
        public float Quality;

        [LoadColumn(12)]
        public float Id;
    }

    public class WinePrediction
    {
        // Regression trainers write their prediction to the "Score" column
        [ColumnName("Score")]
        public float PredictionQuality;
    }

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "mlmodel", "model.zip");

        static void Main(string[] args)
        {
            Train();
            Prediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void Train()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data: load the full file and hold out 20% for testing
            string trainDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-full.csv");
            var fullData = mlContext.Data.LoadFromTextFile<WineData>(path: trainDataPath, separatorChar: ',', hasHeader: true);

            var trainTestData = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.2);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            // Build the learning pipeline and fit the model to the training data
            var dataProcessPipeline = mlContext.Transforms.DropColumns("Id")
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.FreeSulfurDioxide)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.TotalSulfurDioxide)))
                .Append(mlContext.Transforms.Concatenate("Features", new string[] { nameof(WineData.FixedAcidity),
                                                                                    nameof(WineData.VolatileAcidity),
                                                                                    nameof(WineData.CitricAcid),
                                                                                    nameof(WineData.ResidualSugar),
                                                                                    nameof(WineData.Chlorides),
                                                                                    nameof(WineData.FreeSulfurDioxide),
                                                                                    nameof(WineData.TotalSulfurDioxide),
                                                                                    nameof(WineData.Density),
                                                                                    nameof(WineData.PH),
                                                                                    nameof(WineData.Sulphates),
                                                                                    nameof(WineData.Alcohol)}));

            var trainer = mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "Label", featureColumnName: "Features");
            var trainingPipeline = dataProcessPipeline.Append(trainer);
            var trainedModel = trainingPipeline.Fit(trainData);

            // Evaluate on the held-out test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            // PrintRegressionMetrics is a console helper that is not listed in this article
            PrintRegressionMetrics(trainer.ToString(), metrics);

            // Save the model
            Console.WriteLine("====== Save model to local file =========");
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void Prediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Load the saved model and wrap it in a single-row prediction engine
            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricAcid = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine data quality is: {wineQuality.PredictionQuality}");
        }
    }
}

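The Poisson regression algorithm was already covered in the earlier article of this series on rating facial attractiveness. This program introduces nothing new, so I will not explain it again; its main purpose is to serve as a baseline for comparison with the AutoML code below.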

 

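III. Automated Learning (AutoML)

The overall machine learning workflow is much the same from project to project: prepare the data, decide on the features, choose an algorithm, train, and so on. Along the way we keep running into the same questions: which algorithm should we choose, and how should its parameters be configured? Automated learning addresses exactly this. The framework repeats the cycle of data selection, algorithm selection, parameter tuning and evaluation many times, and from those runs picks the model with the best evaluation result.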
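The "try many candidates, evaluate each one, keep the best" cycle described above can be illustrated with a small sketch. This is not how ML.NET implements AutoML internally; it is a minimal stand-in in which the candidates are hand-written functions rather than trained models, and the class name, candidate names and sample numbers are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ModelSearchSketch
{
    // Coefficient of determination: 1 - SS_res / SS_tot
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }

    // Score every candidate on the validation data and return the best one
    public static (string Name, double Score) PickBest(
        IReadOnlyDictionary<string, Func<double, double>> candidates,
        double[] features, double[] labels)
    {
        return candidates
            .Select(c => (Name: c.Key, Score: RSquared(labels, features.Select(c.Value).ToArray())))
            .OrderByDescending(r => r.Score)
            .First();
    }

    public static void Main()
    {
        // Made-up validation data: alcohol content -> quality score
        double[] alcohol = { 9.0, 10.0, 11.0, 12.0 };
        double[] quality = { 5.0, 5.5, 6.0, 6.5 };

        // Hypothetical candidates standing in for "algorithm + parameter" combinations
        var candidates = new Dictionary<string, Func<double, double>>
        {
            ["constant"] = x => 5.75,
            ["linear, slope 0.5"] = x => 0.5 * x + 0.5,
            ["linear, slope 1.0"] = x => 1.0 * x - 4.5,
        };

        var best = PickBest(candidates, alcohol, quality);
        Console.WriteLine($"Best candidate: {best.Name}, R-squared = {best.Score:F4}");
    }
}
```

A real AutoML run additionally trains each candidate and proposes new parameter settings based on earlier results, but the selection step at the end follows the same idea: rank by a validation metric and keep the winner.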

The full code is as follows:

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

namespace Regression_WineQuality
{
    // One row of the CSV file; LoadColumn gives the zero-based column index
    public class WineData
    {
        [LoadColumn(0)]
        public float FixedAcidity;

        [LoadColumn(1)]
        public float VolatileAcidity;

        [LoadColumn(2)]
        public float CitricAcid;

        [LoadColumn(3)]
        public float ResidualSugar;

        [LoadColumn(4)]
        public float Chlorides;

        [LoadColumn(5)]
        public float FreeSulfurDioxide;

        [LoadColumn(6)]
        public float TotalSulfurDioxide;

        [LoadColumn(7)]
        public float Density;

        [LoadColumn(8)]
        public float PH;

        [LoadColumn(9)]
        public float Sulphates;

        [LoadColumn(10)]
        public float Alcohol;

        // The taster's score is the value we want to predict
        [LoadColumn(11)]
        [ColumnName("Label")]
        public float Quality;

        [LoadColumn(12)]
        public float Id;
    }

    public class WinePrediction
    {
        // Regression trainers write their prediction to the "Score" column
        [ColumnName("Score")]
        public float PredictionQuality;
    }

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "mlmodel", "model.zip");
        static readonly string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-train.csv");
        static readonly string TestDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-test.csv");

        static void Main(string[] args)
        {
            TrainAndSave();
            LoadAndPrediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void TrainAndSave()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data: separate training and test files
            var trainData = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
            var testData = mlContext.Data.LoadFromTextFile<WineData>(path: TestDataPath, separatorChar: ',', hasHeader: true);

            // Run the AutoML experiment for at most experimentTime seconds
            var progressHandler = new RegressionExperimentProgressHandler();
            uint experimentTime = 200;

            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
               .CreateRegressionExperiment(experimentTime)
               .Execute(trainData, "Label", progressHandler: progressHandler);

            Debugger.PrintTopModels(experimentResult);

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

            // Evaluate the best run on the test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            // PrintRegressionMetrics is a console helper that is not listed in this article
            Debugger.PrintRegressionMetrics(best.TrainerName, metrics);

            // Save the model
            Console.WriteLine("====== Save model to local file =========");
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void LoadAndPrediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Load the saved model and wrap it in a single-row prediction engine
            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricAcid = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine data quality is: {wineQuality.PredictionQuality}");
        }
    }
}

  

IV. Code Analysis

1. The automated learning process

            var progressHandler = new RegressionExperimentProgressHandler();
            uint experimentTime = 200;

            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
               .CreateRegressionExperiment(experimentTime)
               .Execute(trainData, "Label", progressHandler: progressHandler);

            Debugger.PrintTopModels(experimentResult); // Print the metrics of all models

experimentTime is the maximum time the experiment is allowed to run, in seconds. progressHandler is a reporter object: each time one training run finishes, the framework calls its Report method once.
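Besides passing only a time limit, the experiment can also be created from a settings object. The sketch below is not part of the original project: it assumes a Microsoft.ML.AutoML version that exposes RegressionExperimentSettings with the MaxExperimentTimeInSeconds, OptimizingMetric and Trainers members, and the class name ExperimentSettingsSketch and the choice of two trainers are only examples.

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public static class ExperimentSettingsSketch
{
    // Assumes trainData has a "Label" column, as WineData does in this article
    public static ExperimentResult<RegressionMetrics> Run(MLContext mlContext, IDataView trainData)
    {
        var settings = new RegressionExperimentSettings
        {
            // Same time budget as experimentTime in the article
            MaxExperimentTimeInSeconds = 200,
            // Metric used to decide which run is the best one
            OptimizingMetric = RegressionMetric.RSquared
        };

        // Optional: restrict the search to a few trainers (an arbitrary choice here)
        settings.Trainers.Clear();
        settings.Trainers.Add(RegressionTrainer.FastTreeTweedie);
        settings.Trainers.Add(RegressionTrainer.LightGbm);

        return mlContext.Auto()
            .CreateRegressionExperiment(settings)
            .Execute(trainData, labelColumnName: "Label");
    }
}
```

Narrowing the trainer list spends the time budget on fewer algorithms, which may help when earlier runs have already shown which families work well on the data.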

    // Receives one callback from AutoML per finished training run
    public class RegressionExperimentProgressHandler : IProgress<RunDetail<RegressionMetrics>>
    {
        private int _iterationIndex;

        public void Report(RunDetail<RegressionMetrics> iterationResult)
        {
            _iterationIndex++;
            Console.WriteLine($"Report index:{_iterationIndex},TrainerName:{iterationResult.TrainerName},RuntimeInSeconds:{iterationResult.RuntimeInSeconds}");
        }
    }

The debug output is as follows:

Report index:1,TrainerName:SdcaRegression,RuntimeInSeconds:12.5244426
Report index:2,TrainerName:LightGbmRegression,RuntimeInSeconds:11.2034988
Report index:3,TrainerName:FastTreeRegression,RuntimeInSeconds:14.810409
Report index:4,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:14.7338553
Report index:5,TrainerName:FastForestRegression,RuntimeInSeconds:15.6224459
Report index:6,TrainerName:LbfgsPoissonRegression,RuntimeInSeconds:11.1668197
Report index:7,TrainerName:OnlineGradientDescentRegression,RuntimeInSeconds:10.5353
Report index:8,TrainerName:OlsRegression,RuntimeInSeconds:10.8905459
Report index:9,TrainerName:LightGbmRegression,RuntimeInSeconds:10.5703296
Report index:10,TrainerName:FastTreeRegression,RuntimeInSeconds:19.4470509
Report index:11,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:63.638882
Report index:12,TrainerName:LightGbmRegression,RuntimeInSeconds:10.7710518

When the experiment has finished, we print the metrics of all models with Debugger.PrintTopModels:

    // Requires: using System; using System.Linq; using Microsoft.ML.AutoML; using Microsoft.ML.Data;
    public class Debugger
    {
        private const int Width = 114;

        public static void PrintTopModels(ExperimentResult<RegressionMetrics> experimentResult)
        {
            // Keep only runs with a valid R-squared and rank them from best to worst
            var topRuns = experimentResult.RunDetails
                .Where(r => r.ValidationMetrics != null && !double.IsNaN(r.ValidationMetrics.RSquared))
                .OrderByDescending(r => r.ValidationMetrics.RSquared)
                .ToList();

            Console.WriteLine("Top models ranked by R-squared --");
            PrintRegressionMetricsHeader();
            for (var i = 0; i < topRuns.Count; i++)
            {
                var run = topRuns[i];
                PrintIterationMetrics(i + 1, run.TrainerName, run.ValidationMetrics, run.RuntimeInSeconds);
            }
        }

        public static void PrintRegressionMetricsHeader()
        {
            CreateRow($"{"",-4} {"Trainer",-35} {"RSquared",8} {"Absolute-loss",13} {"Squared-loss",12} {"RMS-loss",8} {"Duration",9}", Width);
        }

        public static void PrintIterationMetrics(int iteration, string trainerName, RegressionMetrics metrics, double? runtimeInSeconds)
        {
            CreateRow($"{iteration,-4} {trainerName,-35} {metrics?.RSquared ?? double.NaN,8:F4} {metrics?.MeanAbsoluteError ?? double.NaN,13:F2} {metrics?.MeanSquaredError ?? double.NaN,12:F2} {metrics?.RootMeanSquaredError ?? double.NaN,8:F2} {runtimeInSeconds.Value,9:F1}", Width);
        }

        // Pads the message to a fixed width and frames it with '|' characters
        public static void CreateRow(string message, int width)
        {
            Console.WriteLine("|" + message.PadRight(width - 2) + "|");
        }
    }

CreateRow only handles the layout of each row. The debug output is as follows:

Top models ranked by R-squared --
|     Trainer                             RSquared Absolute-loss Squared-loss RMS-loss  Duration                 |
|1    FastTreeTweedieRegression             0.4731          0.46         0.41     0.64      63.6                 |
|2    FastTreeTweedieRegression             0.4431          0.49         0.43     0.65      14.7                 |
|3    FastTreeRegression                    0.4386          0.54         0.49     0.70      19.4                 |
|4    LightGbmRegression                    0.4177          0.52         0.45     0.67      10.8                 |
|5    FastTreeRegression                    0.4102          0.51         0.45     0.67      14.8                 |
|6    LightGbmRegression                    0.3944          0.52         0.46     0.68      11.2                 |
|7    LightGbmRegression                    0.3501          0.60         0.57     0.75      10.6                 |
|8    FastForestRegression                  0.3381          0.60         0.58     0.76      15.6                 |
|9    OlsRegression                         0.2829          0.56         0.53     0.73      10.9                 |
|10   LbfgsPoissonRegression                0.2760          0.62         0.63     0.80      11.2                 |
|11   SdcaRegression                        0.2746          0.58         0.56     0.75      12.5                 |
|12   OnlineGradientDescentRegression       0.0593          0.69         0.81     0.90      10.5                 |

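The results show that some algorithms were tried more than once, but each run of the same algorithm used different parameter settings, such as thresholds and tree depth.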

 

2. Getting the best model

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

Once we have the best model, evaluating and saving it work exactly as in the earlier code. Evaluating it on the test data gives:

*************************************************
*       Metrics for FastTreeTweedieRegression regression model
*------------------------------------------------
*       LossFn:        0.67
*       R2 Score:      0.34
*       Absolute loss: .63
*       Squared loss:  .67
*       RMS loss:      .82
*************************************************

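On the test data the best model reaches an R2 score of only 0.34, with an RMS loss of about 0.82 points on the 0 to 10 quality scale, noticeably worse than the 0.47 it scored during validation. A result like this cannot be used in production. The most likely problem is that we have not found the key features that actually determine wine quality.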
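To make these numbers easier to read, the short sketch below works backwards from the reported metrics. It relies only on the standard definitions (RMS loss is the square root of the squared loss, and R2 = 1 - MSE / variance of the labels); the class and method names are made up for illustration, and because the inputs are the rounded values from the output above, the results are approximate.

```csharp
using System;

public static class MetricsReading
{
    // RMS loss is the square root of the mean squared error
    public static double RmsFromMse(double mse) => Math.Sqrt(mse);

    // R2 = 1 - MSE / Var(label), so Var(label) = MSE / (1 - R2)
    public static double LabelVariance(double mse, double rSquared) => mse / (1.0 - rSquared);

    public static void Main()
    {
        double mse = 0.67;       // "Squared loss" from the test output
        double rSquared = 0.34;  // "R2 Score" from the test output

        double rms = RmsFromMse(mse);
        double variance = LabelVariance(mse, rSquared);
        double baselineRms = Math.Sqrt(variance); // error of always predicting the mean score

        Console.WriteLine($"Model RMS loss:    {rms:F2}");         // 0.82
        Console.WriteLine($"Label variance:    {variance:F2}");    // 1.02
        Console.WriteLine($"Baseline RMS loss: {baselineRms:F2}"); // 1.01
    }
}
```

In other words, always guessing the average score would be off by roughly 1.01 points, and the best model only brings that down to about 0.82 points, which is why the result is of little practical use.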

 

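V. Summary

With this article, the ML.NET study notes series comes to an end. The original code used throughout the series comes mainly from: https://github.com/dotnet/machinelearning-samples .

That repository also contains examples of other algorithms, including clustering, matrix factorization and anomaly detection. Their overall workflow is much the same, so with this series as a foundation, interested readers can explore them on their own.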

  

VI. Resources

Source code download: https://github.com/seabluescn/study_ml.net

Regression project name: Regression_WineQuality

AutoML project name: Regression_WineQuality_AutoML
