Problems Encountered in Spark Development and Their Solutions
1. The "Failed to locate the winutils binary in the hadoop binary path" exception when running Spark on Windows
Solution:
1. Download the Windows build of winutils
Someone has published a Windows build of winutils on GitHub; the project is at https://github.com/srccodes/hadoop-common-2.2.0-bin. Download the project's zip package (the downloaded file is named hadoop-common-2.2.0-bin-master.zip) and extract it to any directory.
2. Configure the environment variables
Add a user variable HADOOP_HOME whose value is the directory the zip was extracted to, then append %HADOOP_HOME%\bin to the system Path variable.
Run the program again and it executes normally. (A code-based alternative is sketched below.)
Source: http://blog.csdn.net/shawnhu007/article/details/51518879
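Setting this from code also works: Hadoop reads the hadoop.home.dir system property, so it can be set before the SparkContext is created. A minimal sketch, assuming the zip was extracted to D:\hadoop-common-2.2.0-bin-master (an assumed path, use your own):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class WinutilsLocalTest {
    public static void main(String[] args) {
        // Point Hadoop at the directory that contains bin\winutils.exe
        // (assumed extraction path, adjust to your machine).
        System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");

        SparkConf conf = new SparkConf().setAppName("WinutilsLocalTest").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... normal Spark code here ...
        sc.stop();
    }
}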
2. To access files on HDFS from Spark, copy Hadoop's core-site.xml and hdfs-site.xml into Spark's conf directory.
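With those two files in place, HDFS paths resolve directly in the job code. A minimal sketch, assuming a hypothetical input file /tmp/input.txt already exists on HDFS:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsReadTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HdfsReadTest");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // fs.defaultFS is picked up from the copied core-site.xml,
        // so the namenode address does not need to be spelled out here.
        long lines = sc.textFile("hdfs:///tmp/input.txt").count(); // hypothetical path
        System.out.println("line count = " + lines);
        sc.stop();
    }
}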
3. Including third-party Jars in a Spark application
① Use Maven's assembly plugin to bundle all third-party Jars into the generated Jar. The drawbacks are that the build is slow and the resulting Jar is large.
② Pass the Jars with the --jars option of spark-submit. The drawback is that the command line becomes very long when many Jars are involved (a programmatic alternative is sketched after this list). For example:
spark-submit --jars ~/lib/hanlp-1.5.3.jar --class "www.bdqn.cn.MyTest" --master spark://hadoop000:7077 ~/lib/SparkTechCount-1.0.jar
③ Set a directory for third-party Jars in Spark's spark-defaults.conf. In this case, however, every machine in the cluster needs the same configuration and its own copy of the Jars:
spark.executor.extraClassPath=/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/external_jars/*
spark.driver.extraClassPath=/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/external_jars/*
④ In spark on yarn (cluster) mode the Jars can reportedly be placed on HDFS; I have not tested this myself and only note it here.
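For completeness, the Jars can also be declared from code with SparkConf.setJars, which distributes them to the cluster much like --jars does. A minimal sketch, assuming ~/lib/hanlp-1.5.3.jar from the example above resolves to /home/hadoop/lib/hanlp-1.5.3.jar:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExtraJarsTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("ExtraJarsTest")
                // Ships the listed Jars to the executors, much like --jars.
                .setJars(new String[]{"/home/hadoop/lib/hanlp-1.5.3.jar"}); // assumed path
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}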
4. In a distributed environment Spark also uses HDFS as its file system. After my test machine had been left alone over a weekend, I logged into the server with Xshell and found that even typing at the command line was laggy. From what I found online the cause is an HDFS problem; the workaround is recorded first:
① At this point the stop-all.sh script can no longer stop HDFS.
② Find the Hadoop-related process IDs from the command line:
ps -ef | grep java | grep hadoop
Then kill the corresponding processes one by one with kill xxxx. I killed them from top to bottom; some material online says they should be stopped in the order below, which may be worth following:
Stop order: jobtracker, tasktracker, namenode, datanode, secondarynamenode
③ Once the processes have been killed, start-all.sh and stop-all.sh work again.
④ A permanent fix found online; whether it actually works is still to be verified:
The most common cause of this problem is that when Hadoop stops, it relies on the PID files of the mapred and dfs daemons on the datanodes. By default those PIDs are saved under /tmp, and Linux periodically (roughly every 7 days to a month) cleans out that directory. Once files such as hadoop-hadoop-jobtracker.pid and hadoop-hadoop-namenode.pid have been deleted, the stop scripts naturally cannot find the corresponding processes anymore.
Two other things can also cause this problem:
①: the environment variable $HADOOP_PID_DIR was changed after Hadoop was started
②: stop-all was run as a different user
Solution:
①: the permanent fix is to edit $HADOOP_HOME/conf/hadoop-env.sh, remove the # in front of export HADOOP_PID_DIR=/var/hadoop/pids, and create /var/hadoop/pids (or whatever directory you choose to point it at)
What to do once the problem has already occurred:
①: at this point the scripts can no longer stop the processes, but they can be stopped manually: on the master and on each datanode run ps -ef | grep java | grep hadoop to find the process IDs and kill them forcibly, then run the start-all script on the master; after that, startup and shutdown work normally again.
I set HADOOP_PID_DIR to:
export HADOOP_PID_DIR=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/pids
5. An exception encountered during Spark development:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 2) had a not serializable result: java.util.ArrayList$SubList
Serialization stack:
- object not serializable (class: java.util.ArrayList$SubList, value: [d])
- field (class: scala.Tuple3, name: _2, type: class java.lang.Object)
- object (class scala.Tuple3, ([a, b],[d],0.5))
- writeObject data (class: java.util.ArrayList)
- object (class java.util.ArrayList, [([a, b],[d],0.5), ([a, b],[c],0.5)])
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 11)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:339)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:46)
at org.dataalgorithms.MyImplementation.MarketBasketAnalyzeDriver.main(MarketBasketAnalyzeDriver.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
The cause was that my Spark code called subList to take a slice of a List. The fix, as explained in an answer I found online:
At some point you're using something like x = myArrayList.subList(a, b);
After this, x will not be serializable, because the SubList object returned by subList() does not implement Serializable. Use x = new ArrayList(myArrayList.subList(a, b)); instead.
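In concrete terms the change looks like the following. A minimal sketch with illustrative names (not the actual code from MarketBasketAnalyzeDriver):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubListFix {
    public static void main(String[] args) {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c"));

        // Problematic: subList() returns an ArrayList$SubList view,
        // which does not implement Serializable, so Spark cannot ship it
        // between driver and executors.
        List<String> view = items.subList(0, 2);

        // Fix: copy the view into a plain ArrayList before putting it
        // into a Tuple or RDD record that has to be serialized.
        List<String> copy = new ArrayList<>(items.subList(0, 2));

        System.out.println(view + " vs " + copy);
    }
}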