Spark: exploding rows into columns (行转列)
程序员文章站
2024-03-12 17:34:38
StructType
Note: this approach handles the expansion of a nested array column like myScore below.
The data is in JSON format.
/*
root
|-- age: long (nullable = true)
|-- myScore: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- score1: long (nullable = true)
| | |-- score2: long (nullable = true)
|-- name: string (nullable = true)
Sample records: {"name":"Michael", "age":25,"myScore":[{"score1":19,"score2":23},{"score1":58,"score2":50}]}
{"name":"Andy", "age":30,"myScore":[{"score1":29,"score2":33},{"score1":38,"score2":52},{"score1":88,"score2":71}]}
{"name":"Justin", "age":19,"myScore":[{"score1":39,"score2":43},{"score1":28,"score2":53}]}
*/
def explodeDataFrameStruct(df: DataFrame, explodeCol: String, nextLevelCols: Array[String]): DataFrame = {
  // requires: import org.apache.spark.sql.functions.explode
  // explode produces one row per element of the array column
  var dfScore = df.withColumn(explodeCol, explode(df(explodeCol)))
  // promote each struct field to a top-level column
  for (str <- nextLevelCols) {
    dfScore = dfScore.withColumn(str, dfScore(explodeCol + "." + str))
  }
  dfScore.show(2) // show returns Unit, so wrapping it in println would only print "()"
  dfScore
}
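The explode semantics can be illustrated without a Spark session: each element of the myScore array becomes its own row, with the struct fields promoted to columns. A minimal pure-Scala sketch (the case classes Score and Person are hypothetical, introduced here only to model the JSON records above):

```scala
// Hypothetical model of the JSON records shown earlier
case class Score(score1: Long, score2: Long)
case class Person(name: String, age: Long, myScore: List[Score])

// explode(myScore) ~ one output row per array element,
// with score1/score2 lifted to top-level columns
def explodeStruct(people: List[Person]): List[(String, Long, Long, Long)] =
  people.flatMap(p => p.myScore.map(s => (p.name, p.age, s.score1, s.score2)))

val michael = Person("Michael", 25L, List(Score(19L, 23L), Score(58L, 50L)))
// explodeStruct(List(michael)) yields two rows:
// ("Michael", 25, 19, 23) and ("Michael", 25, 58, 50)
```

This mirrors what explodeDataFrameStruct does with explode plus withColumn, just on in-memory collections instead of a DataFrame.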
String
For string-typed columns, split the value with a regular expression, then apply explode to the result.
def explodeDataFrameString(df: DataFrame, explodeCol: String, splitRule: Array[String], nextLevelCols: Array[String]): DataFrame = {
  /*
  Sample input:
  val dft = Seq((1, "scene_id1,scene_name1;scene_id2,scene_name2", "michal"),
    (2, "scene_id1,scene_name1;scene_id2,scene_name2;scene_id3,scene_name3", "john"),
    (3, "scene_id4,scene_name4;scene_id2,scene_name2", "mary"),
    (4, "scene_id6,scene_name6;scene_id5,scene_name5", "lily")
  ).toDF("id", "int_id", "name")
  */
  // requires: import org.apache.spark.sql.functions.{col, explode, split}
  // split on the outer delimiter (splitRule(0)) and create one row per fragment
  var dt = df.withColumn(explodeCol, explode(split(col(explodeCol), splitRule(0))))
  // split each fragment on the inner delimiter (splitRule(1));
  // element i of the result becomes column nextLevelCols(i)
  for (i <- 0 until nextLevelCols.length) {
    dt = dt.withColumn(nextLevelCols(i), split(col(explodeCol), splitRule(1))(i))
  }
  dt
}
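The two-level split above can likewise be sketched in plain Scala: the outer delimiter (";" in the sample data) separates the future rows, and the inner delimiter (",") separates the future columns. The helper name explodeString is hypothetical, not part of the original code:

```scala
// Hypothetical pure-Scala model of the two-level split:
// first split the cell on the outer delimiter to get one "row" per fragment,
// then split each fragment on the inner delimiter into its column values
def explodeString(cell: String, outer: String, inner: String): List[Array[String]] =
  cell.split(outer).toList.map(_.split(inner))

val rows = explodeString("scene_id1,scene_name1;scene_id2,scene_name2", ";", ",")
// rows has two entries; rows(0) is Array("scene_id1", "scene_name1")
```

As in the Spark version, both delimiters are interpreted as regular expressions, so characters like "|" would need escaping.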