
Spark Streaming 2020(1)Investigation


On my local machine I have a Spark cluster with a Zeppelin notebook, and a 3-node Kafka cluster on rancher-home, rancher-worker1 and rancher-worker2.

Start the Kafka Cluster
Start the ZooKeeper cluster on all 3 machines
> cd /opt/zookeeper
> /opt/zookeeper/bin/zkServer.sh start /opt/zookeeper/conf/zoo.cfg

Check the status on each node
> zkServer.sh status conf/zoo.cfg
This should show 1 leader and 2 followers.
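
For reference, a minimal zoo.cfg for this 3-node ensemble would look roughly like this; dataDir is an assumption, and each node needs a myid file (1, 2 or 3) in that directory:

tickTime=2000
initLimit=10
syncLimit=5
# assumed data directory, must contain the myid file
dataDir=/opt/zookeeper/data
clientPort=2181
server.1=rancher-home:2888:3888
server.2=rancher-worker1:2888:3888
server.3=rancher-worker2:2888:3888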

Start Kafka on all 3 machines
> cd /opt/kafka
> nohup /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties &
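
The broker settings that matter for clustering are roughly these; a sketch of /opt/kafka/config/server.properties for the rancher-home node, with log.dirs assumed:

# unique per broker: 0 on rancher-home, 1 and 2 on the workers
broker.id=0
# advertise this node's own hostname
listeners=PLAINTEXT://rancher-home:9092
# where segment files live; assumed, adjust to the real data directory
log.dirs=/tmp/kafka-logs
# the ZooKeeper ensemble started above
zookeeper.connect=rancher-home:2181,rancher-worker1:2181,rancher-worker2:2181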

Check the Producer and Consumer
> bin/kafka-console-producer.sh --broker-list rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092 --topic cluster1

> bin/kafka-console-consumer.sh --bootstrap-server rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092 --topic cluster1 --from-beginning
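If the cluster1 topic does not exist yet, create it first; with Kafka 2.4 the topic tool talks to the brokers directly, replication factor 3 spreads it across all nodes, and the partition count here is an arbitrary choice:
> bin/kafka-topics.sh --create --bootstrap-server rancher-home:9092 --replication-factor 3 --partitions 3 --topic cluster1
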

Go to sparkmaster_service and start it on rancher-home; go to sparkslave_service and start it on rancher-worker1 and rancher-worker2.
Check the web console:
http://rancher-home:8088/
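
I do not paste the service definitions here; assuming they just wrap the standard Spark scripts under an assumed /opt/spark install, the equivalent manual start would be roughly:

On rancher-home:
> /opt/spark/sbin/start-master.sh --webui-port 8088

On rancher-worker1 and rancher-worker2:
> /opt/spark/sbin/start-slave.sh spark://rancher-home:7077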

Check which process is using port 8080 on CentOS 7
> netstat -nlp | grep 8080
tcp6       0      0 :::8080                :::*                    LISTEN      6206/java

> ps -ef | grep 6206
It turns out that ZooKeeper is using port 8080; since version 3.5, ZooKeeper ships an embedded AdminServer that listens on 8080 by default.
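
That explains why the Spark master web UI cannot sit on its usual 8080 here. If needed, the AdminServer can be moved or turned off in zoo.cfg (both settings are standard since ZooKeeper 3.5):

# move the embedded AdminServer off port 8080
admin.serverPort=8081
# or disable it completely
# admin.enableServer=false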

Package
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.12/2.4.4
https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/2.4.0
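
In pom.xml these map to the following dependencies (versions taken from the links above):

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.4.0</version>
        </dependency>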

Kafka Version
https://spark.apache.org/docs/latest/streaming-kafka-integration.html

An updated example
https://community.cloudera.com/t5/Support-Questions/Scala-Spark-Streaming-with-Kafka-Integration-in-Zeppelin-not/td-p/233507

It seems hard to get streaming working well in Zeppelin, so I will try to make it work in Java and Scala projects instead.
https://github.com/luohuazju/kiko-spark-java
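
The heart of the Java version is a direct stream using the 0-10 integration above. This is a minimal sketch, not the exact code from the project; the class name, the group.id and the 5-second batch interval are my own choices:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        // 5-second micro batches; app name and interval are arbitrary choices
        SparkConf conf = new SparkConf().setAppName("kafka-stream-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers",
                "rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "cluster1-consumer");  // assumed group id
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        // direct stream against the cluster1 topic used in the console test above
        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Arrays.asList("cluster1"), kafkaParams));

        // print the first records of every batch to the driver log
        stream.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}

With enable.auto.commit set to false, offsets are tracked by Spark rather than by the Kafka consumer, which is the usual starting point before deciding on an offset-commit strategy.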

Issue: HDFS is not recognized as a FileSystem type. Error message:
No FileSystem for scheme: hdfs

Solution:
https://www.cnblogs.com/justinzhang/p/4983673.html
https://www.edureka.co/community/3320/java-mapreduce-error-saying-no-filesystem-for-scheme-hdfs
https://brucebcampbell.wordpress.com/2014/12/11/fix-hadoop-hdfs-error-java-io-ioexception-no-filesystem-for-scheme-hdfs-at-org-apache-hadoop-fs-filesystem-getfilesystemclassfilesystem-java2385/
https://www.codelast.com/%E5%8E%9F%E5%88%9B-%E8%A7%A3%E5%86%B3%E8%AF%BB%E5%86%99hdfs%E6%96%87%E4%BB%B6%E7%9A%84%E9%94%99%E8%AF%AF%EF%BC%9Ano-filesystem-for-scheme-hdfs/
https://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file

My changes in pom.xml and settings are as follows:
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

    <properties>
        <spark.version>2.4.4</spark.version>
        <hadoop.version>3.2.1</hadoop.version>
    </properties>

        SparkConf conf = this.getSparkConf();
        SparkContext sc = new SparkContext(conf);
        Configuration hadoopConf = sc.hadoopConfiguration();
        // Register the FileSystem implementations explicitly. In a fat jar the
        // META-INF/services/org.apache.hadoop.fs.FileSystem files from
        // hadoop-common and hadoop-hdfs overwrite each other, so the hdfs and
        // file schemes can go missing unless they are set by hand.
        hadoopConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        hadoopConf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");

With these changes I can solve the HDFS issue and upgrade the project libraries.

But after I solved the HDFS problem, I found that because I run Hadoop in Docker, the Spark cluster cannot talk to port 9866, which is dfs.datanode.address, the DataNode data-transfer port in Hadoop 3.x.
https://kontext.tech/column/hadoop/265/default-ports-used-by-hadoop-services-hdfs-mapreduce-yarn
https://www.stefaanlippens.net/hadoop-3-default-ports.html

I may need to set up an HDFS cluster outside of Docker.
https://acadgild.com/blog/hadoop-3-x-installation-guide
https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/


References:
https://github.com/luohuazju/sillycat-spark
https://www.iteye.com/blog/sillycat-2215237
https://www.iteye.com/blog/sillycat-2406572
https://www.iteye.com/blog/sillycat-2370527

Third parties
https://www.cnblogs.com/luweiseu/p/8045863.html

Zeppelin streaming
https://henning.kropponline.de/2016/12/25/simple-spark-streaming-kafka-example-in-a-zeppelin-notebook/
https://juejin.im/post/5c997d9e5188252da22514e6
https://blog.csdn.net/qwemicheal/article/details/71082663