Spark ML (2): Basic Statistics (Summary Statistics, Correlation Analysis, Hypothesis Testing)
程序员文章站
2024-03-08 09:27:52
I. Overview
Basic statistical methods give a principled picture of the dataset as a whole before any further processing, which improves both the efficiency and the accuracy of the later steps.
II. Summary Statistics
1. Purpose
Before training with Spark's machine-learning library, the summary-statistics functions give a rough overall picture of the dataset.
2. Reference: the official documentation
http://spark.apache.org/docs/2.1.0/mllib-statistics.html
Official example:
***
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
val observations = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)
  )
)
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
***
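For intuition about what colStats computes, the same column statistics can be reproduced in plain Scala without Spark. The sketch below is not MLlib's implementation, just the underlying arithmetic; note that MLlib reports the unbiased sample variance (dividing by n - 1):

```scala
object ColStatsSketch {
  // Column-wise mean of a row-oriented dataset.
  def colMeans(data: Seq[Array[Double]]): Seq[Double] = {
    val n = data.size
    data.head.indices.map(j => data.map(_(j)).sum / n)
  }

  // Column-wise unbiased sample variance (divide by n - 1), as MLlib's colStats does.
  def colVariances(data: Seq[Array[Double]]): Seq[Double] = {
    val n = data.size
    val means = colMeans(data)
    data.head.indices.map { j =>
      data.map(v => math.pow(v(j) - means(j), 2)).sum / (n - 1)
    }
  }

  // Number of nonzero entries per column.
  def colNumNonzeros(data: Seq[Array[Double]]): Seq[Int] =
    data.head.indices.map(j => data.count(_(j) != 0.0))

  def main(args: Array[String]): Unit = {
    // The same three observations as the official example, as plain arrays.
    val observations = Seq(
      Array(1.0, 10.0, 100.0),
      Array(2.0, 20.0, 200.0),
      Array(3.0, 30.0, 300.0)
    )
    println(colMeans(observations))       // Vector(2.0, 20.0, 200.0)
    println(colVariances(observations))   // Vector(1.0, 100.0, 10000.0)
    println(colNumNonzeros(observations)) // Vector(3, 3, 3)
  }
}
```

The printed values match what summary.mean, summary.variance, and summary.numNonzeros report for the official example above.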
3. Beijing rainfall: summarizing the relationship between rainfall and year
(1) Dataset: the file beijing.txt holds the years followed by the annual rainfall values; the rainfall values are:
0.4806,0.4839,0.318,0.4107,0.4835,0.4445,0.3704,0.3389,0.3711,0.2669,0.7317,0.4309,0.7009,0.5725,0.8132,0.5067,0.5415,0.7479,0.6973,0.4422,0.6733,0.6839,0.6653,0.721,0.4888,0.4899,0.5444,0.3932,0.3807,0.7184,0.6648,0.779,0.684,0.3928,0.4747,0.6982,0.3742,0.5112,0.597,0.9132,0.3867,0.5934,0.5279,0.2618,0.8177,0.7756,0.3669,0.5998,0.5271,1.406,0.6919,0.4868,1.1157,0.9332,0.9614,0.6577,0.5573,0.4816,0.9109,0.921
(2) Read the dataset
val txt = sc.textFile("file:///opt/datas/beijing.txt")
(3) Import the required libraries
import org.apache.spark.mllib.{stat,linalg}
import org.apache.spark.mllib.linalg.Vectors
(4) Transform the dataset
scala> val data=txt.flatMap(_.split(",")).map(value=>Vectors.dense(value.toDouble))
data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[5] at map at <console>:29
Check the result:
scala> data.take(10)
res9: Array[org.apache.spark.mllib.linalg.Vector] = Array([2009.0], [2007.0], [2006.0], [2005.0], [2004.0], [2003.0], [2002.0], [2001.0], [2000.0], [1999.0])
(5) Summary statistics: column statistics
scala> stat.Statistics.colStats(data)
res4: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@...
(6) Inspect the summary results (Tab completion on res4. lists the available statistics):
scala> res4.
count max mean min normL1 normL2 numNonzeros variance
III. Correlation Coefficient
1. Purpose: measure the degree of linear correlation between variables; here we use the Pearson correlation coefficient.
2. Beijing annual rainfall (dataset): correlation between year and rainfall. The first line lists the years, the second the matching rainfall values:
2009,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998,1997,1996,1995,1994,1993,1992,1991,1990,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980,1979,1978,1977,1976,1975,1974,1973,1972,1971,1970,1969,1968,1967,1966,1965,1964,1963,1962,1961,1960,1959,1958,1957,1956,1955,1954,1953,1952,1951,1950,1949
0.4806,0.4839,0.318,0.4107,0.4835,0.4445,0.3704,0.3389,0.3711,0.2669,0.7317,0.4309,0.7009,0.5725,0.8132,0.5067,0.5415,0.7479,0.6973,0.4422,0.6733,0.6839,0.6653,0.721,0.4888,0.4899,0.5444,0.3932,0.3807,0.7184,0.6648,0.779,0.684,0.3928,0.4747,0.6982,0.3742,0.5112,0.597,0.9132,0.3867,0.5934,0.5279,0.2618,0.8177,0.7756,0.3669,0.5998,0.5271,1.406,0.6919,0.4868,1.1157,0.9332,0.9614,0.6577,0.5573,0.4816,0.9109,0.921
3. Statistical analysis in Scala
import org.apache.spark.mllib.stat
val txt = sc.textFile("file:///opt/datas/beijing.txt")
val data = txt.flatMap(_.split(",")).map(_.toDouble)
val years = data.filter(_ > 1000)   // every year is > 1000
val values = data.filter(_ <= 1000) // every rainfall value is <= 1000
scala> stat.Statistics.corr(years,values)
Result:
res6: Double = -0.4385405496488065
4. Result analysis:
res6 is -0.4385405496488065, a moderate negative correlation: as the year increases, rainfall tends to decrease (equivalently, earlier years saw more rainfall). This can be cross-checked in Excel.
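As a cross-check on Statistics.corr, the Pearson coefficient can be computed by hand in plain Scala (no Spark). The helper below is a sketch of the standard formula cov(x, y) / (stddev(x) * stddev(y)); the toy series in main are illustrative, not the rainfall data:

```scala
object PearsonSketch {
  // Pearson correlation: sum of centered products over the product
  // of the centered L2 norms (the n factors cancel).
  def corr(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.size > 1, "need two equal-length series")
    val n  = x.size
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  def main(args: Array[String]): Unit = {
    // Two series rising together correlate at +1; opposite directions give -1.
    println(corr(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))) // ≈ 1.0
    println(corr(Seq(1.0, 2.0, 3.0), Seq(6.0, 4.0, 2.0))) // ≈ -1.0
  }
}
```

Applied to the years and values series above, this formula yields the same -0.4385 that Spark reports.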
IV. Hypothesis Testing
1. Concept: hypothesis testing is a statistical method for drawing conclusions about a population from a sample under stated assumptions. The basic procedure is to state a hypothesis (the null hypothesis), compute a test statistic from the data, and decide from the result whether to reject the hypothesis. Common methods include the chi-squared test and the t-test.
2. Spark implements Pearson's chi-squared test, which supports both goodness-of-fit testing and independence testing.
Goodness of fit: checks whether the observed frequency distribution matches a theoretical one.
Independence: checks whether two sampled variables are independent of each other.
3. Test whether handedness is related to gender

              Male  Female
Right-handed   127     147
Left-handed     19      10
4. Implementation
import org.apache.spark.mllib.{linalg,stat}
// Matrices.dense is column-major and takes Doubles:
// Array(127.0, 19.0, 147.0, 10.0) builds the table [[127, 147], [19, 10]].
val data = linalg.Matrices.dense(2, 2, Array(127.0, 19.0, 147.0, 10.0))
scala> stat.Statistics.chiSqTest(data)
Result:
res9: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 3.8587031204632654
pValue = 0.049488567227318536
Strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent.
5. Result analysis
The null hypothesis is that the two variables are independent. If pValue > 0.05, we fail to reject the null hypothesis; if pValue < 0.05, we reject it. Here pValue = 0.049488567227318536 < 0.05, so we reject independence: handedness and gender appear to be related (at the 5% significance level).
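The statistic Spark reports can be verified by hand. The sketch below (plain Scala, no Spark) computes Pearson's chi-squared statistic for a contingency table, deriving each expected count from the row and column totals, with no continuity correction:

```scala
object ChiSqSketch {
  // Pearson chi-squared statistic for a 2D contingency table:
  // sum over cells of (observed - expected)^2 / expected,
  // where expected = rowTotal * colTotal / grandTotal.
  def chiSq(table: Array[Array[Double]]): Double = {
    val rowTotals = table.map(_.sum)
    val colTotals = table.transpose.map(_.sum)
    val total     = rowTotals.sum
    (for {
      i <- table.indices
      j <- table(i).indices
    } yield {
      val expected = rowTotals(i) * colTotals(j) / total
      math.pow(table(i)(j) - expected, 2) / expected
    }).sum
  }

  def main(args: Array[String]): Unit = {
    // The handedness-vs-gender table from the text:
    // rows = right/left-handed, columns = male/female.
    val table = Array(Array(127.0, 147.0), Array(19.0, 10.0))
    println(chiSq(table)) // ≈ 3.8587, matching Spark's chiSqTest output
  }
}
```

Applied to the handedness table, it reproduces Spark's statistic of about 3.8587; the p-value then comes from the chi-squared distribution with (2 - 1) * (2 - 1) = 1 degree of freedom.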