
Spark in Action (1): Configuring AWS EMR and Zeppelin Notebook


SparkContext vs. SparkSession: what is the difference, and which should you use?

  • SparkContext:
    • Used before Spark 2.0.0
    • Connects to the cluster through a resource manager such as YARN
    • Requires a SparkConf to create a SparkContext object
    • To use the SQL, Hive, or Streaming APIs, you need to create a separate context for each (see the pyspark sketch after these snippets)
    •   import org.apache.spark.{SparkConf, SparkContext}

        val conf = new SparkConf()
          .setAppName("RetailDataAnalysis")
          .setMaster("spark://master:7077")
          .set("spark.executor.memory", "2g")

        val sc = new SparkContext(conf)
      
  • SparkSession:
    • Introduced in Spark 2.0.0; the recommended entry point
    • Exposes all of Spark's functionality and adds the DataFrame and Dataset APIs
    • No separate contexts are needed for SQL, Hive, or Streaming
    • Config properties can still be set after the session is initialized
       // Creating a Spark session:
       import org.apache.spark.sql.SparkSession

       val spark = SparkSession
         .builder
         .appName("WorldBankIndex")
         .getOrCreate()

       // Configuring properties:
       spark.conf.set("spark.sql.shuffle.partitions", 6)
       // note: executor memory only takes effect if set before the executors start
       spark.conf.set("spark.executor.memory", "2g")
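
Since the Zeppelin examples later in this post use %pyspark, here is a minimal pyspark sketch of the same contrast. It is a sketch under assumptions, not the post's original code: the app names and master URL are carried over from the Scala snippets as placeholders, and SQLContext/HiveContext show the pre-2.0 style that SparkSession replaces.

    # Pre-2.0 style: a SparkContext plus a separate context per API.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext, HiveContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("RetailDataAnalysis")
            .setMaster("spark://master:7077")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)

    sql_ctx = SQLContext(sc)        # SQL needs its own context
    hive_ctx = HiveContext(sc)      # so does Hive...
    ssc = StreamingContext(sc, 10)  # ...and Streaming (10-second batches)

    # 2.0+ style: one SparkSession covers SQL, Hive, and the DataFrame/Dataset APIs.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("WorldBankIndex")
             .enableHiveSupport()   # Hive support without a separate HiveContext
             .getOrCreate())        # in a single process this reuses the sc above
    spark.conf.set("spark.sql.shuffle.partitions", 6)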
      

Configuring AWS EMR

# 1. Open the AWS console
# 2. Go to the EMR service
# 3. Create cluster
# 4. Go to advanced options
# 5. Release: emr-5.11.1
# 6. Hadoop: 2.7.3
# 7. Zeppelin: 0.7.3
# 8. Spark: 2.2.1
# 9. Choose spot pricing to save budget
# 10. Create your key pair, download it, and chmod 400 it
# 11. Add inbound security group rules: 22 for SSH, 8890 for Zeppelin
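
The same cluster can also be created programmatically. Below is a hedged boto3 sketch of steps 3-10; the region, instance types, counts, bid price, and key-pair name are placeholder assumptions, not values from this post.

import boto3

emr = boto3.client("emr", region_name="us-east-1")   # placeholder region

response = emr.run_job_flow(
    Name="spark-zeppelin-cluster",
    ReleaseLabel="emr-5.11.1",   # step 5; bundles Hadoop 2.7.3, Zeppelin 0.7.3, Spark 2.2.1
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1,
             "Market": "SPOT", "BidPrice": "0.10"},   # step 9: spot pricing
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m4.large", "InstanceCount": 2,
             "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "Ec2KeyName": "my-key-pair",                  # step 10: your key pair
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

The step-11 inbound rules (22 for SSH, 8890 for Zeppelin) still need to be added to the master's security group, e.g. in the EC2 console.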

Creating a Zeppelin Notebook

# 1. Access the master node's public DNS on port 8890
# 2. Create a new note
# 3. Default Interpreter: spark
%pyspark # 4. start the paragraph with %pyspark to use the PySpark interpreter
# after that, you can run Python code in Zeppelin
for i in [1, 2, 3]:
    print(i)
	
# the spark context is already set
sc

# the spark session is already set
spark
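
A quick way to confirm both handles are live in a %pyspark paragraph (the versions printed depend on your EMR release):

%pyspark
# Zeppelin on EMR pre-binds both handles; print their versions to confirm.
print(sc.version)
print(spark.version)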

# read a file from AWS S3 (access key, secret key, and bucket below are placeholders)
df = spark.read.csv("s3n://MyAccessKey:MySecretKey@my-bucket/file.csv")
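
Embedding keys in the URL leaks the secret into logs and notebook history. On EMR, the cluster's instance profile normally supplies S3 credentials, so a hedged alternative (bucket name is a placeholder) is:

%pyspark
# EMRFS picks up credentials from the instance profile: no keys in the URL.
df = spark.read.csv("s3://my-bucket/file.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)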