pyspark逻辑斯蒂回归

程序员文章站 2022-07-14 22:01:41

...

数据集

数据集共包含20000行和6列
数据集是一家运动商品零售网站的在线用户有关的信息，这些数据集包括用户的国家、使用的平台、年龄、新访客/老访客，还有就是网该网站上浏览的网页数量，以及客户最终是否购买产品的信息

数据集探索研究

产看数据结构

print((df.count(),len(df.columns)))
#列名及数据类型
df.printSchema()
#查看数据内容
df.show(5)
#数据的统计指标
df.describe().show()

pyspark逻辑斯蒂回归
可以看到，访客平均年龄是28岁，他们在访问期间大约浏览了9个网页

数据探索分析

df.groupBy('Country').count().show()

Country	count
Malaysia	1218
India	4018
Indonesia	`12178`
Brazil	2586

最大数量的访客来自印度尼西亚

df.groupBy('Platform').count().show()

Platform	count
Yahoo	`9859`
Bing	4360
Google	5781

Yahoo搜索引擎用户的数量最高

df.groupBy('Country').mean().show()

Country	avg(Age)	avg(Repeat_Visitor)	avg(Web_pages_viewed)	avg(Status)
Malaysia	27.792282430213465	0.5730706075533661	`11.192118226600986`	`0.6568144499178982`
India	27.976854156296664	0.5433051269288203	10.727227476356397	0.6212045793927327
Indonesia	28.43159796354081	0.5207751683363442	9.985711939563148	0.5422893742814913
Brazil	30.274168600154677	0.322892498066512	4.921113689095128	0.038669760247486466

转化率最高的国家是马来西亚，其次是印度；平均网页访问量最高的国家是马来西亚，最低的是巴西

df.groupBy('Platform').mean().show()

Platform	avg(Age)	avg(Repeat_Visitor)	avg(Web_pages_viewed)	avg(Status)
Yahoo	28.569226087838523	0.5094837204584644	9.599655137437875	0.5071508266558474
Bing	28.68394495412844	0.4720183486238532	9.114908256880733	0.4559633027522936
Google	28.380038055699707	0.5149628092025601	9.804878048780488	`0.5210171250648676`

Google的转化率最高

df.groupBy('Status').mean().show()

Status	avg(Age)	avg(Repeat_Visitor)	avg(Web_pages_viewed)	avg(Status)
1	26.5435	`0.7019`	`14.5617`	1.0
0	30.5356	0.3039	4.5449	0.0

可以看到，转换状态和重复访客的浏览的页面数量之间存在很强的相关性

特征工程

上面在对数据进行探索分析是我们呢发现有两个变量【国家、搜索引擎】是两个类别变量，机器学习是无法识别类别变量的，所以我们需要进行转换。

转换数值型

我们先使用StringIndexer把类别变量标记为数值形式，StringIndexer会为每列的每一类分别分配一个独特的值，如下

Platform_indexer = StringIndexer(inputCol="Platform", outputCol="Platform_Num").fit(df)
df = Platform_indexer.transform(df)
df.show(3,False)

Country	Age	Repeat_Visitor	Platform	Web_pages_viewed	Status	Platform_Num
India	41	1	Yahoo	21	1	0.0
Brazil	28	1	Yahoo	5	0	0.0
Brazil	40	0	Google	3	0	1.0

可以看到搜索引擎的三个类别值，被表示成了（0.0，1.0，2.0）
三个数值型。

df.groupBy('Platform_Num').count().orderBy('count',ascending=False).show(5,False)

Platform_Num	count
0.0	9859
1.0	5781
2.0	4360

而且我们还可以发现一个规律，就是出现次数越多的类别会分配给小的数值

转换独热编码

下面我们需要进行独热编码

Platform_encoder = OneHotEncoder(inputCol="Platform_Num", outputCol="Platform_Vector")
df = Platform_encoder.fit(df).transform(df)

df.groupBy('Platform_Vector').count().orderBy('count',ascending=False).show(5,False)

Platform_Vector	count
(2,[0],[1.0])	9859
(2,[1],[1.0])	5781
(2,[],[])	4360

这里我们需要重点说明下
正常的独热编码是下面这样的：

搜索引擎	Google	Yahoo	Bing
Google	1	0	0
Yahoo	0	1	0
Bing	0	0	1

spark的独热编码是一种更优的方法：

搜索引擎	Google	Yahoo
Google	1	0
Yahoo	0	1
Bing	0	0

仅用两列来表示，这样更加节省空间，因为计算耗时更短。向量的长度等于元素总数减1

(2,[0],[1.0])表示：
2是指，总长度
[0]，表示第几个位置的元素等于1

[1.0]向量包含几个元素

对于Country列也进行同样的处理

Country_Vector	count
(3,[0],[1.0])	12178
(3,[1],[1.0])	4018
(3,[2],[1.0])	2586
(3,[],[])	1218

创建输入

VectorAssembler创建一个向量，该向量会合并所有输入特征。VectorAssembler只会创建单个特征，这个特征会捕获该行的输入值，因此，并不会分别使用五个输入列，如实际上是将所有的输入列合并成单个特征向量列
我们输入5个自变量来创建单个特征向量列

from pyspark.ml.feature import VectorAssembler

df_assembler = VectorAssembler(inputCols=['platform_vector','country_vector','Age', 'Repeat_Visitor','Web_pages_viewed'], outputCol="features")
df = df_assembler.transform(df)

df.select(['features','Status']).show(10,False)

features	Status
[1.0,0.0,0.0,1.0,0.0,41.0,1.0,21.0]	1
[1.0,0.0,0.0,0.0,1.0,28.0,1.0,5.0]	0
(8,[1,4,5,7],[1.0,1.0,40.0,3.0])	0
(8,[2,5,6,7],[1.0,31.0,1.0,15.0])	1
(8,[1,5,7],[1.0,32.0,15.0])	1
(8,[1,4,5,7],[1.0,1.0,32.0,3.0])	0
(8,[1,4,5,7],[1.0,1.0,32.0,6.0])	0
(8,[1,2,5,7],[1.0,1.0,27.0,9.0])	0
(8,[0,2,5,7],[1.0,1.0,32.0,2.0])	0
(8,[2,5,6,7],[1.0,31.0,1.0,16.0])	1

这样就把模型需要的格式数据集创建好了
model_df=df.select(['features','Status'])

划分数据集

类似于python里面sklearn里面一样直接划分数据集为训练集和测试集

training_df,test_df=model_df.randomSplit([0.75,0.25])

构建和训练模型

from pyspark.ml.classification import LogisticRegression
log_reg=LogisticRegression(labelCol='Status').fit(training_df)
train_results=log_reg.evaluate(training_df).predictions

features	rawPrediction	probability
(8,[0,2,5,7],[1.0,1.0,17.0,1.0])	[6.0234658236573875,-6.0234658236573875]	[0.997584584991536,0.0024154150084640573]
(8,[0,2,5,7],[1.0,1.0,17.0,1.0])	[6.0234658236573875,-6.0234658236573875]	[0.997584584991536,0.0024154150084640573]
(8,[0,2,5,7],[1.0,1.0,17.0,1.0])	[6.0234658236573875,-6.0234658236573875]	[0.997584584991536,0.0024154150084640573]
(8,[0,2,5,7],[1.0,1.0,17.0,2.0])	[5.2664051076117,-5.2664051076117]	[0.9948643762056791,0.0051356237943209525]
(8,[0,2,5,7],[1.0,1.0,17.0,2.0])	[5.2664051076117,-5.2664051076117]	[0.9948643762056791,0.0051356237943209525]

测试集评估

results=log_reg.evaluate(test_df).predictions

#手动计算混淆矩阵
true_postives = results[(results.Status == 1) & (results.prediction == 1)].count()
true_negatives = results[(results.Status == 0) & (results.prediction == 0)].count()
false_positives = results[(results.Status == 0) & (results.prediction == 1)].count()
false_negatives = results[(results.Status == 1) & (results.prediction == 0)].count()

准确率
准确率是评估所有粉类器的最基础指标，不过，准确率并非模型性能的恰当指示器，因为他依赖于目标类别的平衡性
$\frac{(TP+TN)}{TP+TN+FP+FN}$

accuracy=float((true_postives+true_negatives) /(results.count()))
print("accuracy:",accuracy)

accuracy: 0.935444825499294
2. 召回率
召回率反映了我们能够正确预测出的正类别样本数占正类别观测值总数的比例
$\frac{TP}{TP+FN}$

recall = float(true_postives)/(true_postives + false_negatives)
print("recall:",recall)

recall: 0.9317092651757188
3. 精度
精度是指正确预测出的正确正样本数占所有预测的正面观测值总数的比例
$\frac{TP}{TP+FP}$

precision = float(true_postives) / (true_postives + false_positives)
print("precision:",precision)

accuracy: 0.935444825499294

pyspark逻辑斯蒂回归

数据集

数据集探索研究

特征工程

转换数值型

转换独热编码

创建输入

划分数据集

构建和训练模型

测试集评估

pyspark逻辑斯蒂回归

《封号码罗》数据分析与人工智能之sklearn模型LogisticRegression逻辑斯蒂回归（十三）