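ML.NET Machine Learning Framework Study Notes [9]: Automated Machine Learning (AutoML)
I. Overview
In this article we first build a wine quality prediction program with a regression algorithm, then implement it again with AutoML, and compare the two implementations to learn how AutoML is used.
The data is the UCI Wine Quality Dataset from the competition site kaggle.com, available at: https://www.kaggle.com/c/uci-wine-quality-dataset/data
The inputs are physicochemical measurements of each wine, such as alcohol content, and the output is the score given by wine tasters. The fields are described as follows: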
data fields
input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
output variable (based on sensory data):
12 - quality (score between 0 and 10)
other:
13 - id (unique id for each sample, needed for submission)
II. Code
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace regression_winequality
{
    // One row of the CSV file; LoadColumn gives the zero-based column index.
    public class WineData
    {
        [LoadColumn(0)] public float FixedAcidity;
        [LoadColumn(1)] public float VolatileAcidity;
        [LoadColumn(2)] public float CitricAcid;
        [LoadColumn(3)] public float ResidualSugar;
        [LoadColumn(4)] public float Chlorides;
        [LoadColumn(5)] public float FreeSulfurDioxide;
        [LoadColumn(6)] public float TotalSulfurDioxide;
        [LoadColumn(7)] public float Density;
        [LoadColumn(8)] public float PH;
        [LoadColumn(9)] public float Sulphates;
        [LoadColumn(10)] public float Alcohol;

        // The taster's score is the value to predict.
        [LoadColumn(11)] [ColumnName("Label")] public float Quality;

        [LoadColumn(12)] public float Id;
    }

    public class WinePrediction
    {
        // Regression trainers write their prediction to the "Score" column.
        [ColumnName("Score")] public float PredictionQuality;
    }

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "mlmodel", "model.zip");

        static void Main(string[] args)
        {
            Train();
            Prediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void Train()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data: load the full CSV and hold out 20% for testing
            string trainDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-full.csv");
            var fullData = mlContext.Data.LoadFromTextFile<WineData>(path: trainDataPath, separatorChar: ',', hasHeader: true);
            var trainTestData = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.2);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            // Build the learning pipeline and fit the model on the training data
            var dataProcessPipeline = mlContext.Transforms.DropColumns(nameof(WineData.Id))
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.FreeSulfurDioxide)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.TotalSulfurDioxide)))
                .Append(mlContext.Transforms.Concatenate("Features", new string[]
                {
                    nameof(WineData.FixedAcidity),
                    nameof(WineData.VolatileAcidity),
                    nameof(WineData.CitricAcid),
                    nameof(WineData.ResidualSugar),
                    nameof(WineData.Chlorides),
                    nameof(WineData.FreeSulfurDioxide),
                    nameof(WineData.TotalSulfurDioxide),
                    nameof(WineData.Density),
                    nameof(WineData.PH),
                    nameof(WineData.Sulphates),
                    nameof(WineData.Alcohol)
                }));

            var trainer = mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "Label", featureColumnName: "Features");
            var trainingPipeline = dataProcessPipeline.Append(trainer);
            var trainedModel = trainingPipeline.Fit(trainData);

            // Evaluate on the held-out test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            PrintRegressionMetrics(trainer.ToString(), metrics);

            // Save the model (create the target folder first so Save does not fail)
            Console.WriteLine("====== Save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void Prediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricAcid = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine data quality is:{wineQuality.PredictionQuality} ");
        }

        // The listing calls this helper without showing it. This minimal version is
        // an assumption, modeled on the metrics output printed later in this article.
        static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
        {
            Console.WriteLine($"*       Metrics for {name} regression model");
            Console.WriteLine($"*       LossFn:        {metrics.LossFunction:0.##}");
            Console.WriteLine($"*       R2 Score:      {metrics.RSquared:0.##}");
            Console.WriteLine($"*       Absolute loss: {metrics.MeanAbsoluteError:#.##}");
            Console.WriteLine($"*       Squared loss:  {metrics.MeanSquaredError:#.##}");
            Console.WriteLine($"*       RMS loss:      {metrics.RootMeanSquaredError:#.##}");
        }
    }
}
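Poisson regression was already covered in the earlier article on rating facial attractiveness. This program introduces nothing new, so it is not explained again here; its main purpose is to serve as a baseline for comparison with the AutoML code below.
III. Automated Learning
The overall machine learning workflow is much the same from project to project: prepare the data, define the features, choose an algorithm, train, and so on. Two questions keep coming up: which algorithm should we choose, and how should its parameters be configured? AutoML addresses exactly this. The framework repeats the cycle of data selection, algorithm selection, parameter tuning, and evaluation many times, and picks the model that evaluates best.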
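To make the "try many candidates, evaluate each, keep the best" cycle concrete, here is a small conceptual sketch in plain C#. It is not how ML.NET implements AutoML: the class `AutoMlLoopSketch`, the method `PickBest`, and the idea of representing each candidate model as a simple function are all assumptions made for illustration. Real AutoML also tunes hyperparameters and handles data transforms, which this sketch leaves out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the select-evaluate-keep-best cycle.
public static class AutoMlLoopSketch
{
    // R-squared = 1 - (residual sum of squares / total sum of squares).
    public static double RSquared(IReadOnlyList<double> actual, IReadOnlyList<double> predicted)
    {
        double mean = actual.Average();
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < actual.Count; i++)
        {
            ssRes += Math.Pow(actual[i] - predicted[i], 2);
            ssTot += Math.Pow(actual[i] - mean, 2);
        }
        return 1 - ssRes / ssTot;
    }

    // Score every candidate on the validation data and return the best one.
    public static (string Name, double Score) PickBest(
        IReadOnlyDictionary<string, Func<double, double>> candidates,
        IReadOnlyList<double> inputs,
        IReadOnlyList<double> labels)
    {
        return candidates
            .Select(c => (Name: c.Key, Score: RSquared(labels, inputs.Select(c.Value).ToList())))
            .OrderByDescending(r => r.Score)
            .First();
    }
}
```

Calling `PickBest` with two candidates, say a constant predictor and a linear one, returns whichever has the higher R-squared on the supplied data. The real framework does the same kind of ranking over trained pipelines, as the results table in section IV shows.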
The complete code is as follows:
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

namespace regression_winequality
{
    // One row of the CSV file; LoadColumn gives the zero-based column index.
    public class WineData
    {
        [LoadColumn(0)] public float FixedAcidity;
        [LoadColumn(1)] public float VolatileAcidity;
        [LoadColumn(2)] public float CitricAcid;
        [LoadColumn(3)] public float ResidualSugar;
        [LoadColumn(4)] public float Chlorides;
        [LoadColumn(5)] public float FreeSulfurDioxide;
        [LoadColumn(6)] public float TotalSulfurDioxide;
        [LoadColumn(7)] public float Density;
        [LoadColumn(8)] public float PH;
        [LoadColumn(9)] public float Sulphates;
        [LoadColumn(10)] public float Alcohol;

        // The taster's score is the value to predict.
        [LoadColumn(11)] [ColumnName("Label")] public float Quality;

        [LoadColumn(12)] public float Id;
    }

    public class WinePrediction
    {
        // Regression trainers write their prediction to the "Score" column.
        [ColumnName("Score")] public float PredictionQuality;
    }

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "mlmodel", "model.zip");
        static readonly string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-train.csv");
        static readonly string TestDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-test.csv");

        static void Main(string[] args)
        {
            TrainAndSave();
            LoadAndPrediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void TrainAndSave()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data: training and test sets come from separate files
            var trainData = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
            var testData = mlContext.Data.LoadFromTextFile<WineData>(path: TestDataPath, separatorChar: ',', hasHeader: true);

            // Run the AutoML experiment for at most experimentTime seconds.
            // RegressionExperimentProgressHandler and Debugger are listed in section IV.
            var progressHandler = new RegressionExperimentProgressHandler();
            uint experimentTime = 200;

            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
                .CreateRegressionExperiment(experimentTime)
                .Execute(trainData, "Label", progressHandler: progressHandler);

            Debugger.PrintTopModels(experimentResult);

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

            // Evaluate the best run on the test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            Debugger.PrintRegressionMetrics(best.TrainerName, metrics);

            // Save the model (create the target folder first so Save does not fail)
            Console.WriteLine("====== Save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void LoadAndPrediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricAcid = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine data quality is:{wineQuality.PredictionQuality} ");
        }
    }
}
IV. Code Analysis
1. The AutoML process
var progressHandler = new RegressionExperimentProgressHandler();
uint experimentTime = 200;

ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
    .CreateRegressionExperiment(experimentTime)
    .Execute(trainData, "Label", progressHandler: progressHandler);

Debugger.PrintTopModels(experimentResult); // print the metrics of every model
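experimentTime is the time budget for the experiment, in seconds. progressHandler is a progress reporter: each time one training run completes, the framework calls its Report method once.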
using System;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class RegressionExperimentProgressHandler : IProgress<RunDetail<RegressionMetrics>>
{
    private int _iterationIndex;

    // Called by the framework once per completed training run
    public void Report(RunDetail<RegressionMetrics> iterationResult)
    {
        _iterationIndex++;
        Console.WriteLine($"Report index:{_iterationIndex},TrainerName:{iterationResult.TrainerName},RuntimeInSeconds:{iterationResult.RuntimeInSeconds}");
    }
}
The output is as follows:
Report index:1,TrainerName:SdcaRegression,RuntimeInSeconds:12.5244426
Report index:2,TrainerName:LightGbmRegression,RuntimeInSeconds:11.2034988
Report index:3,TrainerName:FastTreeRegression,RuntimeInSeconds:14.810409
Report index:4,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:14.7338553
Report index:5,TrainerName:FastForestRegression,RuntimeInSeconds:15.6224459
Report index:6,TrainerName:LbfgsPoissonRegression,RuntimeInSeconds:11.1668197
Report index:7,TrainerName:OnlineGradientDescentRegression,RuntimeInSeconds:10.5353
Report index:8,TrainerName:OlsRegression,RuntimeInSeconds:10.8905459
Report index:9,TrainerName:LightGbmRegression,RuntimeInSeconds:10.5703296
Report index:10,TrainerName:FastTreeRegression,RuntimeInSeconds:19.4470509
Report index:11,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:63.638882
Report index:12,TrainerName:LightGbmRegression,RuntimeInSeconds:10.7710518
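The time budget can also be supplied through a settings object instead of a bare number, which leaves room to state the metric the experiment should optimize. The sketch below assumes the `RegressionExperimentSettings` type from the `Microsoft.ML.AutoML` package; the wrapper class `ExperimentSettingsSketch` and its `Build` method are hypothetical names used only for this example, and property names may differ between preview versions of the package.

```csharp
using Microsoft.ML.AutoML;

// Hypothetical helper: builds experiment settings with an explicit
// time budget and optimizing metric.
public static class ExperimentSettingsSketch
{
    public static RegressionExperimentSettings Build(uint maxSeconds)
    {
        return new RegressionExperimentSettings
        {
            // Same role as the experimentTime value passed in the listing above
            MaxExperimentTimeInSeconds = maxSeconds,
            // Rank runs by R-squared, the metric used in the results table
            OptimizingMetric = RegressionMetric.RSquared
        };
    }
}
```

Under these assumptions the experiment would be started with `mlContext.Auto().CreateRegressionExperiment(ExperimentSettingsSketch.Build(200)).Execute(trainData, "Label", progressHandler: progressHandler);`, which is otherwise the same call as in the article's code.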
When the experiment finishes, Debugger.PrintTopModels prints the metrics of every model:
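using System;
using System.Linq;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class Debugger
{
    private const int Width = 114;

    public static void PrintTopModels(ExperimentResult<RegressionMetrics> experimentResult)
    {
        // Keep only runs with usable validation metrics, best R-squared first
        var topRuns = experimentResult.RunDetails
            .Where(r => r.ValidationMetrics != null && !double.IsNaN(r.ValidationMetrics.RSquared))
            .OrderByDescending(r => r.ValidationMetrics.RSquared)
            .ToList();

        Console.WriteLine("Top models ranked by R-Squared --");
        PrintRegressionMetricsHeader();
        for (var i = 0; i < topRuns.Count; i++)
        {
            var run = topRuns[i];
            PrintIterationMetrics(i + 1, run.TrainerName, run.ValidationMetrics, run.RuntimeInSeconds);
        }
    }

    public static void PrintRegressionMetricsHeader()
    {
        CreateRow($"{"",-4} {"Trainer",-35} {"RSquared",8} {"Absolute-loss",13} {"Squared-loss",12} {"RMS-loss",8} {"Duration",9}", Width);
    }

    public static void PrintIterationMetrics(int iteration, string trainerName, RegressionMetrics metrics, double? runtimeInSeconds)
    {
        // Missing values are printed as NaN instead of throwing
        CreateRow($"{iteration,-4} {trainerName,-35} {metrics?.RSquared ?? double.NaN,8:F4} {metrics?.MeanAbsoluteError ?? double.NaN,13:F2} {metrics?.MeanSquaredError ?? double.NaN,12:F2} {metrics?.RootMeanSquaredError ?? double.NaN,8:F2} {runtimeInSeconds ?? double.NaN,9:F1}", Width);
    }

    // Pads one table row to a fixed width between two '|' borders
    public static void CreateRow(string message, int width)
    {
        Console.WriteLine("|" + message.PadRight(width - 2) + "|");
    }

    // Called by the main program but not shown in the original listing; this
    // minimal version is an assumption modeled on the test output in this article.
    public static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
    {
        Console.WriteLine($"*       Metrics for {name} regression model");
        Console.WriteLine($"*       LossFn:        {metrics.LossFunction:0.##}");
        Console.WriteLine($"*       R2 Score:      {metrics.RSquared:0.##}");
        Console.WriteLine($"*       Absolute loss: {metrics.MeanAbsoluteError:#.##}");
        Console.WriteLine($"*       Squared loss:  {metrics.MeanSquaredError:#.##}");
        Console.WriteLine($"*       RMS loss:      {metrics.RootMeanSquaredError:#.##}");
    }
}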
CreateRow only handles the layout of each row. The output is as follows:
Top models ranked by R-Squared --
|     Trainer                             RSquared Absolute-loss Squared-loss RMS-loss  Duration |
|1    FastTreeTweedieRegression             0.4731          0.46         0.41     0.64      63.6 |
|2    FastTreeTweedieRegression             0.4431          0.49         0.43     0.65      14.7 |
|3    FastTreeRegression                    0.4386          0.54         0.49     0.70      19.4 |
|4    LightGbmRegression                    0.4177          0.52         0.45     0.67      10.8 |
|5    FastTreeRegression                    0.4102          0.51         0.45     0.67      14.8 |
|6    LightGbmRegression                    0.3944          0.52         0.46     0.68      11.2 |
|7    LightGbmRegression                    0.3501          0.60         0.57     0.75      10.6 |
|8    FastForestRegression                  0.3381          0.60         0.58     0.76      15.6 |
|9    OlsRegression                         0.2829          0.56         0.53     0.73      10.9 |
|10   LbfgsPoissonRegression                0.2760          0.62         0.63     0.80      11.2 |
|11   SdcaRegression                        0.2746          0.58         0.56     0.75      12.5 |
|12   OnlineGradientDescentRegression       0.0593          0.69         0.81     0.90      10.5 |
The results show that some algorithms were tried more than once, each time with different hyperparameters such as thresholds and tree depth.
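Because the same trainer appears several times, it can help to collapse the table to one line per trainer. The sketch below does this in plain C# over the (trainer, R-squared) pairs copied from the table above; the class `BestPerTrainerSketch` and its members are hypothetical names for illustration. With a live experiment the same grouping could presumably be applied to `experimentResult.RunDetails`, keyed on `TrainerName`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: summarizes repeated runs per trainer.
public static class BestPerTrainerSketch
{
    // (trainer name, validation R-squared) pairs taken from the results table.
    public static readonly (string Trainer, double RSquared)[] Runs =
    {
        ("FastTreeTweedieRegression", 0.4731),
        ("FastTreeTweedieRegression", 0.4431),
        ("FastTreeRegression", 0.4386),
        ("LightGbmRegression", 0.4177),
        ("FastTreeRegression", 0.4102),
        ("LightGbmRegression", 0.3944),
        ("LightGbmRegression", 0.3501),
        ("FastForestRegression", 0.3381),
        ("OlsRegression", 0.2829),
        ("LbfgsPoissonRegression", 0.2760),
        ("SdcaRegression", 0.2746),
        ("OnlineGradientDescentRegression", 0.0593),
    };

    // Group runs by trainer, count the attempts, and keep the best R-squared.
    public static List<(string Trainer, int Attempts, double BestRSquared)> Summarize(
        IEnumerable<(string Trainer, double RSquared)> runs)
    {
        return runs
            .GroupBy(r => r.Trainer)
            .Select(g => (Trainer: g.Key, Attempts: g.Count(), BestRSquared: g.Max(r => r.RSquared)))
            .OrderByDescending(s => s.BestRSquared)
            .ToList();
    }
}
```

`Summarize(BestPerTrainerSketch.Runs)` yields eight entries, one per distinct trainer. The spread between a trainer's best and worst attempt (for example 0.4177 versus 0.3501 for LightGbmRegression) gives a rough sense of how much the hyperparameters matter.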
2. Getting the best model
RunDetail<RegressionMetrics> best = experimentResult.BestRun;
ITransformer trainedModel = best.Model;
Once we have the best model, evaluating and saving it works exactly as in the earlier code. Evaluation on the test data gives:
*************************************************
*       Metrics for FastTreeTweedieRegression regression model
*------------------------------------------------
*       LossFn:        0.67
*       R2 Score:      0.34
*       Absolute loss: .63
*       Squared loss:  .67
*       RMS loss:      .82
*************************************************
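We estimate the recognition rate at only about 70%, and the metrics confirm a weak fit: on the test data R² is 0.34 and the RMS loss is 0.82 points on a 0 to 10 scale. A result like this cannot be used in production. The likely cause is that the available features do not capture what really determines wine quality.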
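For reference, the two headline metrics can be written out as follows. These are the standard textbook definitions, given here as an aid to reading the output rather than as a statement of how ML.NET computes them internally; $y_i$ is the taster's score, $\hat{y}_i$ the prediction, $\bar{y}$ the mean score, and $n$ the number of test samples.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMS\ loss} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
                   = \sqrt{\mathrm{squared\ loss}}
```

The printed values are consistent with these definitions: $\sqrt{0.67} \approx 0.82$. Since the reported figures are rounded, the following is only approximate, but an R² of 0.34 with a squared loss of 0.67 implies a score variance of roughly $0.67 / 0.66 \approx 1.0$, so always predicting the mean score would give an RMS loss of about 1.0, compared with 0.82 for the best model.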
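V. Summary
This article concludes the ML.NET Study Notes series. The original code used throughout the series comes mainly from: https://github.com/dotnet/machinelearning-samples .
That repository contains examples of other algorithms as well, including clustering, matrix factorization, and anomaly detection. Their overall workflow is much the same, so with this series as a foundation, interested readers can explore them on their own.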
VI. Resources
Source code: https://github.com/seabluescn/study_ml.net
Regression project name: regression_winequality
AutoML project name: regression_winequality_automl