30 Technologies in 30 Days Series (23): SparkR

Programmer Article Site, 2022-07-13 16:54:14


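SparkR originated at AMPLab as an exploration of combining R's ease of use with Spark's scalability. On that basis, the SparkR developer preview was first open-sourced in January 2014. Over the following year SparkR developed rapidly at AMPLab, and thanks to the work of many contributors its performance and usability improved significantly. Recently, SparkR was merged into the Spark project and released as an alpha component in version 1.4.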

SparkR DataFrames

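In Spark 1.4, the core component of SparkR is SparkR DataFrames, a distributed data frame implemented on top of Spark. The data frame is the fundamental structure for data processing in R, and the concept has since spread to other languages through libraries such as Pandas. Projects like dplyr have further removed much of the complexity from data manipulation tasks based on data frames. SparkR DataFrames expose an API similar to dplyr and to native R data frames, while relying on Spark to run distributed computations over large data sets.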
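
To make the dplyr-like flavor of the API concrete, here is a minimal sketch of a few DataFrame operations. It assumes a local Spark installation with the SparkR package available and uses the Spark 1.x entry points (sparkR.init and sparkRSQL.init) that match the version discussed in this article; the choice of R's built-in faithful data set and the variable names are illustrative only.

```r
library(SparkR)

# Start Spark and the SQL context (Spark 1.x-era entry points, an assumption
# that matches the version this article describes).
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Turn R's built-in `faithful` data set into a distributed DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Select one column, much like dplyr's select().
head(select(df, df$eruptions))

# Keep only rows with a short waiting time, much like dplyr's filter().
head(filter(df, df$waiting < 50))

# Count rows per waiting time, much like group_by() followed by summarise().
waitingCounts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waitingCounts, desc(waitingCounts$count)))

# Bring the small aggregated result back as a local R data.frame.
localCounts <- collect(waitingCounts)

sparkR.stop()
```

Each call returns a new distributed DataFrame rather than modifying df; only head() and collect() bring rows back into the local R session.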

SparkR example program

# Load SparkR, then initialize the Spark context and the SQL context.
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

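# Create a distributed DataFrame from R's built-in iris data set.
# SparkR replaces the dots in column names with underscores
# (Sepal.Length becomes Sepal_Length).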
df <- createDataFrame(sqlContext, iris)

# Fit a linear model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Model coefficients are returned in a similar format to R's native glm().
summary(model)
##$coefficients
##                    Estimate
##(Intercept)        2.2513930
##Sepal_Width        0.8035609
##Species_versicolor 1.4587432
##Species_virginica  1.9468169

# Make predictions based on the model.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))
##  Sepal_Length prediction
##1          5.1   5.063856
##2          4.9   4.662076
##3          4.7   4.822788
##4          4.6   4.742432
##5          5.0   5.144212
##6          5.4   5.385281
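
Because results come back as ordinary R objects, they can be passed to R's plotting functions. The following is a minimal sketch of one common pattern, not part of the original example: aggregate on the cluster, collect the small result, and plot it locally. It assumes the same Spark 1.x entry points as above; the column name avg_length and the choice of a bar plot are illustrative.

```r
library(SparkR)

sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, iris)

# Aggregate on the cluster so that only one row per species is returned.
avgBySpecies <- collect(
  agg(groupBy(df, df$Species), avg_length = avg(df$Sepal_Length))
)

# avgBySpecies is now a local R data.frame, so base graphics work as usual.
barplot(avgBySpecies$avg_length,
        names.arg = as.character(avgBySpecies$Species),
        ylab = "Mean sepal length")

sparkR.stop()
```

Collecting only the aggregated rows keeps the data transferred to the R session small, which matters once the source DataFrame no longer fits in local memory.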

SparkR project page: http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html

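Summary: with access to R's visualization capabilities, Spark has finally made a major breakthrough in this area. At the same time, by running on Spark, R can process large data sets much faster.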
