Spark in Practice (1): Configuring AWS EMR and Zeppelin Notebook
What is the difference between SparkContext and SparkSession, and which should you use?
SparkContext:
- Used before Spark 2.0.0
- Connects to the cluster through a resource manager such as YARN
- A SparkConf must be passed in to create the SparkContext object
- To use the SQL, Hive, or Streaming APIs, separate contexts must be created (see the PySpark sketch below)
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("RetailDataAnalysis")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
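For contrast, here is a minimal PySpark sketch of the same pre-2.0 pattern, where each API needs its own context (the app name and master URL are reused from the Scala example above; the 10-second batch interval is an arbitrary choice for illustration):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("RetailDataAnalysis").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)      # separate context for the SQL API
hiveContext = HiveContext(sc)    # separate context for the Hive API
ssc = StreamingContext(sc, 10)   # separate context for Streaming, 10s batch interval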
SparkSession:
- Introduced in Spark 2.0.0; recommended going forward
- Gives access to all of Spark's functionality and adds the DataFrame and Dataset APIs
- No separate contexts are needed for SQL, Hive, or Streaming
- Config properties can be set after the session has been initialized
import org.apache.spark.sql.SparkSession

// Creating a Spark session:
val spark = SparkSession
  .builder
  .appName("WorldBankIndex")
  .getOrCreate()

// Configuring properties:
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
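Since the notebook steps later in this post use PySpark, here is the same pattern as a PySpark sketch (the probe query is only an illustration of SQL working directly on the session):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("WorldBankIndex") \
    .getOrCreate()

# configure after initialization, as above
spark.conf.set("spark.sql.shuffle.partitions", 6)

# SQL and DataFrames work on the session itself, no extra contexts needed:
spark.sql("SELECT 1 AS probe").show()

sc = spark.sparkContext   # the underlying SparkContext is still reachable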
Configuring AWS EMR
# 1. Open the AWS console
# 2. Go to the EMR service
# 3. Create a cluster
# 4. Go to advanced options
# 5. Release: emr-5.11.1
# 6. Hadoop: 2.7.3
# 7. Zeppelin: 0.7.3
# 8. Spark: 2.2.1
# 9. Choose a spot price to stay within budget
# 10. Create your key pair, download it, and chmod 400 it (see the sketch after this list)
# 11. Add inbound security group rules: port 22 for SSH, port 8890 for Zeppelin
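As referenced in step 10, a minimal shell sketch for using the key pair, assuming the key file is named my-key.pem and using a placeholder master DNS (EMR's default SSH user is hadoop):

chmod 400 my-key.pem
ssh -i my-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# Zeppelin is then reachable in a browser at http://<master-public-dns>:8890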
Creating a Zeppelin Notebook
# 1. Access the master node's public DNS on port 8890
# 2. Create a new note
# 3. Default Interpreter: spark
# 4. Start the paragraph with the %pyspark interpreter directive
%pyspark
# with the %pyspark interpreter selected, you can run Python code in Zeppelin
for i in [1,2,3]:
print(i)
# the SparkContext is already created for you as sc
sc
# the SparkSession is already created for you as spark
spark
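A quick sanity check that both handles are live (the versions printed should match the releases chosen when creating the cluster):

%pyspark
print(sc.version)      # e.g. 2.2.1 for the emr-5.11.1 release above
print(spark.version)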
# read a file from AWS S3; with s3n URLs the access key and secret key
# are embedded in the URL (both values below are placeholders)
df = spark.read.csv("s3n://MyAccessKey:MySecretKey@my-bucket/file.csv")
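Embedding keys in the URL works but is easy to leak; on EMR the cluster's IAM role normally supplies S3 credentials through EMRFS, so a plain s3:// path is usually enough (the bucket name and read options below are placeholder assumptions):

%pyspark
df = spark.read.csv("s3://my-bucket/file.csv", header=True, inferSchema=True)
df.show(5)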