Developing a Spark program in Eclipse and running it locally
This post walks through developing a simple Spark application in Eclipse in local mode and running it locally for testing.
1. Download the latest Scala IDE for Eclipse (Windows 64-bit) from http://scala-ide.org/download/sdk.html.
After downloading, unzip it to the D: drive, launch Eclipse, and choose a workspace.
Then create a test project named ScalaDev: right-click the project, choose Properties, select Scala Compiler in the dialog, check Use Project Settings on the right-hand tab, pick a Scala Installation, and click OK to save the configuration.
2. Add the Spark 1.6.0 dependency, spark-assembly-1.6.0-hadoop2.6.0.jar, to the project.
spark-assembly-1.6.0-hadoop2.6.0.jar sits under the lib directory of the spark-1.6.0-bin-hadoop2.6.tgz package.
Right-click the ScalaDev project, choose Build Path -> Configure Build Path, and add the jar there.
Note: if you chose Latest 2.11 bundle (dynamic) as the Scala Installation, the project will report an error: a red cross appears on the ScalaDev project, and the Problems view shows that the Scala compiler version does not match the one Spark was built against:
More than one scala library found in the build path (D:/eclipse/plugins/org.scala-lang.scala-library_2.11.7.v20150622-112736-1fbce4612c.jar, F:/IMF/Big_Data_Software/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar). At least one has an incompatible version. Please update the project build path so it contains only one compatible scala library.
Fix: right-click the Scala Library Container, choose Properties, select Latest 2.10 bundle (dynamic) in the dialog, and save. An sbt-based alternative to this manual jar and version juggling is sketched below.
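As an alternative to wiring the assembly jar and the Scala installation by hand in Eclipse, a build tool can resolve a matching Scala/Spark pair for you. A minimal build.sbt sketch follows; the project name and exact versions are assumptions chosen to match the setup above (the Spark 1.6.0 prebuilt binaries are compiled against Scala 2.10), not something prescribed by the original article:

// build.sbt -- minimal sketch; adjust the versions to whatever your Spark distribution uses
name := "ScalaDev"

version := "0.1"

// Spark 1.6.0's prebuilt binaries target Scala 2.10, so pin the project to a 2.10.x compiler
scalaVersion := "2.10.5"

// spark-core provides the same classes as the spark-assembly jar added manually above
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"

With such a build definition, the Eclipse project metadata can be generated (for example with the sbteclipse plugin) instead of editing the build path manually.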
3. Create the Spark package under src and create the entry class.
Right-click the project and choose New -> Package to create the com.imf.spark package;
right-click the com.imf.spark package and create a Scala Object;
before running the test program, copy the README.md file from the spark-1.6.0-bin-hadoop2.6 directory into D://testspark//. The code is as follows:
package com.imf.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

/**
 * A Spark word count program written in Scala for local testing.
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    /**
     * 1. Create the SparkConf object and set the runtime configuration of the Spark program.
     * For example, setMaster specifies the URL of the Spark cluster Master to connect to;
     * setting it to "local" runs the program locally, which is ideal when machine resources are very limited.
     */
    // Create the SparkConf object
    val conf = new SparkConf()
    // Set the application name, shown in the monitoring UI while the program runs
    conf.setAppName("My First Spark App!")
    // Use local mode so the program runs locally without an installed Spark cluster
    conf.setMaster("local")

    /**
     * 2. Create the SparkContext object.
     * SparkContext is the single entry point to all Spark functionality; every Spark program,
     * whether written in Scala, Java, Python, or R, must have a SparkContext.
     * Its core job is to initialize the components a Spark application needs to run,
     * including the DAGScheduler, TaskScheduler, and SchedulerBackend,
     * and to register the program with the Master.
     * SparkContext is the most important object in the whole application.
     */
    // Create the SparkContext, passing in the SparkConf instance to customize the runtime configuration
    val sc = new SparkContext(conf)

    /**
     * 3. Create an RDD through the SparkContext from the concrete data source (HDFS, HBase, local FS, DB, S3, etc.).
     * RDDs can be created in three basic ways: from an external data source (such as HDFS),
     * from a Scala collection, or by transforming another RDD.
     * The data is split into a series of partitions; the data assigned to each partition is handled by one task.
     */
    // Read the local file with a single partition
    val lines = sc.textFile("D://testspark//README.md", 1)

    /**
     * 4. Apply transformations (map, filter, and other higher-order functions) to the initial RDD to do the actual computation.
     * 4.1 Split each line into individual words.
     */
    // Split every line and flatten the per-line results into one collection of words
    val words = lines.flatMap { line => line.split(" ") }

    /**
     * 4.2 Map each word to a count of 1, i.e. word => (word, 1).
     */
    val pairs = words.map { word => (word, 1) }

    /**
     * 4.3 Sum the counts to get the total number of occurrences of each word in the file.
     */
    // Accumulate the values for identical keys (combining both locally and at the reducer level)
    val wordCounts = pairs.reduceByKey(_ + _)

    // Print the results
    wordCounts.foreach(pair => println(pair._1 + ":" + pair._2))

    sc.stop()
  }
}
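Note that foreach on an RDD prints the pairs in whatever order the partitions are processed. If you would rather see the most frequent words first, a small sketch like the one below (the cutoff of 10 is an arbitrary choice, not part of the original program) can replace the foreach line; collecting is safe here because the word list of a README easily fits in driver memory:

// Collect the counts to the driver, sort by descending count, and print the top 10
wordCounts.collect().sortBy(-_._2).take(10).foreach {
  case (word, count) => println(word + ":" + count)
}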
Run output:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/26 08:23:37 INFO SparkContext: Running Spark version 1.6.0
16/01/26 08:23:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/26 08:23:42 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
	at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	...
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
	at com.dt.spark.WordCount$.main(WordCount.scala:29)
	at com.dt.spark.WordCount.main(WordCount.scala)
16/01/26 08:23:43 INFO Utils: Successfully started service 'sparkDriver' on port 54663.
16/01/26 08:23:43 INFO SparkUI: Started SparkUI at http://192.168.100.102:4040
16/01/26 08:23:43 INFO Executor: Starting executor ID driver on host localhost
16/01/26 08:23:48 INFO FileInputFormat: Total input paths to process : 1
16/01/26 08:23:48 INFO SparkContext: Starting job: foreach at WordCount.scala:56
16/01/26 08:23:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:48), which has no missing parents
16/01/26 08:23:48 INFO HadoopRDD: Input split: file:/D:/testspark/README.md:0+3359
16/01/26 08:23:48 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:48) finished in 0.186 s
16/01/26 08:23:48 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:54), which has no missing parents
package:1
For:2
Programs:1
the:21
Spark:13
to:14
and:10
for:11
run:7
is:6
...
16/01/26 08:23:48 INFO DAGScheduler: ResultStage 1 (foreach at WordCount.scala:56) finished in 0.061 s
16/01/26 08:23:48 INFO DAGScheduler: Job 0 finished: foreach at WordCount.scala:56, took 0.328012 s
16/01/26 08:23:48 INFO SparkUI: Stopped Spark web UI at http://192.168.100.102:4040
16/01/26 08:23:48 INFO SparkContext: Successfully stopped SparkContext
16/01/26 08:23:48 INFO ShutdownHookManager: Shutdown hook called
16/01/26 08:23:48 INFO ShutdownHookManager: Deleting directory C:\Users\vivi\AppData\Local\Temp\spark-56f9ed0a-5671-449a-955a-041c63569ff2
Note: the ERROR in the output above is raised while Hadoop's configuration is being loaded (it looks for winutils.exe); when running locally on Windows without a Hadoop installation it cannot be found, but this does not affect the test. A common workaround is sketched below.
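If you would rather silence the error than ignore it, a common workaround on Windows is to point hadoop.home.dir at a directory that contains winutils.exe before the SparkContext is created. The path below is only an example; you would need to download a winutils.exe built for Hadoop 2.6 and place it under that bin directory yourself:

// Assumed layout: D:\hadoop\bin\winutils.exe (downloaded separately for Hadoop 2.6)
System.setProperty("hadoop.home.dir", "D:\\hadoop")
// ...then create the SparkConf and SparkContext exactly as in the WordCount example above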