A summary of problems encountered in Spark development and their solutions
1. Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
Solution: the Scala version does not match the one your Spark dependency was built against; switch from Scala 2.11 to Scala 2.10.
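One way to align the versions in Maven, as a sketch (2.10.6 and the _2.10 artifact suffix are example values; match them to the Scala version of the Spark build you actually run against):
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.6</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>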
2. Running a Spark program locally on Windows fails with the following error:
Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Solution:
1. Download a Hadoop package that includes winutils.exe:
https://www.barik.net/archive/2015/01/19/172716/
2. Unpack it.
3. Set the HADOOP_HOME environment variable, e.g. HADOOP_HOME=E:\1-work\hadoop-2.6.0
4. Append %HADOOP_HOME%\bin to Path; this directory contains winutils.exe.
5. Restart the machine.
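If changing environment variables is inconvenient, the same directory can also be supplied programmatically before any Spark/Hadoop classes are initialized; a minimal sketch, using the example path from step 3:
// Point Hadoop at the directory whose bin subfolder contains winutils.exe.
// This must run before the SparkContext/SparkSession is created.
System.setProperty("hadoop.home.dir", "E:\\1-work\\hadoop-2.6.0")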
3. Adjusting the log4j log level when debugging Spark locally on Windows
Create a log4j.properties file in the project's source root (for a Maven project, typically src/main/resources) with the following content. Available levels include ERROR, WARN, INFO, DEBUG, etc.
log4j.rootCategory=ERROR, file
log4j.appender.file=org.apache.log4j.ConsoleAppender
# To write the logs to a file instead, use FileAppender:
#log4j.appender.file=org.apache.log4j.FileAppender
#log4j.appender.file.file=spark.log
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=ERROR
org.eclipse.jetty.LEVEL=ERROR
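Alternatively, the level can be overridden at runtime from the driver; a minimal sketch, assuming a local SparkSession:
import org.apache.spark.sql.SparkSession

object LogLevelDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("log-level-demo").getOrCreate()
    // Overrides the log4j root level for this application at runtime.
    spark.sparkContext.setLogLevel("ERROR")
    spark.stop()
  }
}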
4. Spark job submission modes (standalone and YARN)
Submitting a job in standalone mode:
spark-submit --master spark://vm-spider-94:7077 \
  --class com.fangdd.daas.hbase.Hdfs2HBaseTable \
  --total-executor-cores 16 \
  --executor-memory 4g \
  --conf spark.default.parallelism=16 \
  --driver-memory 2g \
  /root/hfile/hbase-data-service-2.0-SNAPSHOT-jar-with-dependencies.jar \
  ${debugFlag} ${renameFlag} ${hdfsPath} ${zkQuorum} ${hbaseMaster} ${hbaseZnode} ${partitions} ${writeBufferSize} ${tgtTableName} ${tmpTableName} ${repartitionNum}
yarn-client mode:
spark-submit --master yarn-client \
  --class com.fangdd.daas.hbase.Hdfs2HBaseTable \
  --total-executor-cores 16 \
  --executor-memory 4g \
  --conf spark.default.parallelism=16 \
  --driver-memory 2g \
  /root/hfile/hbase-data-service-2.0-SNAPSHOT-jar-with-dependencies.jar \
  ${debugFlag} ${renameFlag} ${hdfsPath} ${zkQuorum} ${hbaseMaster} ${hbaseZnode} ${partitions} ${writeBufferSize} ${tgtTableName} ${tmpTableName} ${repartitionNum}
yarn-cluster mode:
spark-submit --master yarn-cluster \
  --class com.fangdd.daas.hbase.Hdfs2HBaseTable \
  --total-executor-cores 16 \
  --executor-memory 4g \
  --conf spark.default.parallelism=16 \
  --driver-memory 2g \
  /root/hfile/hbase-data-service-2.0-SNAPSHOT-jar-with-dependencies.jar \
  ${debugFlag} ${renameFlag} ${hdfsPath} ${zkQuorum} ${hbaseMaster} ${hbaseZnode} ${partitions} ${writeBufferSize} ${tgtTableName} ${tmpTableName} ${repartitionNum}
5. Spark running locally in IDEA only shows Hive's default database and cannot access the other databases
1. The Hive metastore address must be configured. This is done via hive-site.xml: put the properly configured file into the resources directory.
2. The hive-site.xml under resources may not be getting picked up.
Solution: add the following configuration to the pom file:
<resources>
    <resource>
        <directory>src/main/resources/${package.environment}</directory>
    </resource>
    <resource>
        <directory>src/main/resources/</directory>
    </resource>
</resources>
When a Spark Streaming (Kafka) job operates on Hive, the SparkSession declaration must include enableHiveSupport(); otherwise it will also only see Hive's default database:
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).enableHiveSupport().getOrCreate()
If it still does not work, run a Maven clean from IDEA; that usually resolves it.
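For reference, a minimal local sketch of a Hive-enabled SparkSession reading a non-default database (some_db and some_table are placeholder names; hive-site.xml must be on the classpath as described above):
import org.apache.spark.sql.SparkSession

object HiveLocalDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() makes Spark use the Hive metastore configured in hive-site.xml
    // instead of the built-in Derby catalog, so all databases become visible.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hive-local-demo")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()       // should list more than just 'default'
    spark.table("some_db.some_table").show() // placeholder database and table
    spark.stop()
  }
}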
6. Spark connecting to Hive locally in IDEA reports the following error:
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app)
Solution: search for the metastore_db directory with Everything and delete it.
7. println inside map and other operators prints nothing in Spark development
This is because Spark evaluates lazily: map is only a transformation, so an action such as collect must follow it to actually trigger the computation.
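A minimal local sketch of the behavior (names are hypothetical):
import org.apache.spark.sql.SparkSession

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lazy-eval-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))

    // Nothing is printed here: map only records a lazy transformation.
    val doubled = rdd.map { x => println(s"mapping $x"); x * 2 }

    // The action triggers execution, so the println side effects now run.
    doubled.collect().foreach(println)

    spark.stop()
  }
}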
8. Running the jar locally on Windows reports "No FileSystem for scheme: hdfs"
Solution: copy the cluster's Hadoop conf/core-site.xml into the project's source root (i.e. under src), open the file, and append the following:
<property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
</property>
Refresh the project and rerun the program; the problem is resolved.
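The same mapping can also be set programmatically on the SparkContext's Hadoop configuration instead of editing core-site.xml; a sketch under the same assumptions:
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.spark.sql.SparkSession

object HdfsSchemeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("hdfs-scheme-demo").getOrCreate()
    // Register the hdfs:// FileSystem implementation explicitly; the error typically
    // appears when the META-INF/services entries are lost while building a fat jar.
    spark.sparkContext.hadoopConfiguration
      .set("fs.hdfs.impl", classOf[DistributedFileSystem].getName)
  }
}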
9. Running the jar locally on Windows and reading a MySQL JDBC DataFrame reports: Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html
Add the following to the pom file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
Usage in code:
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SparkSession}

def getMysqlJdbcDF(spark: SparkSession, url: String, querySql: String, userName: String, passWord: String): DataFrame = {
  val readConnProperties4 = new Properties()
  readConnProperties4.put("driver", "com.mysql.jdbc.Driver")
  readConnProperties4.put("user", userName)
  readConnProperties4.put("password", passWord)
  readConnProperties4.put("fetchsize", "3")
  spark.read.jdbc(
    url,
    s"($querySql) t", // the parentheses and table alias are required; the subquery can filter the data here
    readConnProperties4)
}
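For completeness, a hypothetical call (the URL, credentials, and query are placeholder values; note that the MySQL driver jar, e.g. mysql-connector-java, must also be on the classpath for com.mysql.jdbc.Driver to load):
val df = getMysqlJdbcDF(
  spark,
  "jdbc:mysql://127.0.0.1:3306/test?useSSL=false",
  "select id, name from user where id > 100",
  "root",
  "123456")
df.show()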