
Spark Streaming 2020(1)Investigation


On my local machine I have a Spark cluster with a Zeppelin notebook, and a 3-node Kafka cluster on rancher-home, rancher-worker1 and rancher-worker2.

Start the Kafka Cluster
Start the ZooKeeper cluster on all 3 machines
> cd /opt/zookeeper
> /opt/zookeeper/bin/zkServer.sh start /opt/zookeeper/conf/zoo.cfg

Check the status on each node
> zkServer.sh status conf/zoo.cfg
This should show 1 leader and 2 followers.
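
For reference, a minimal zoo.cfg for this 3-node ensemble would look roughly like this; dataDir is an assumption, and each node needs a myid file (1, 2 or 3) in that directory:

tickTime=2000
initLimit=10
syncLimit=5
# assumed data directory, must contain the myid file
dataDir=/opt/zookeeper/data
clientPort=2181
server.1=rancher-home:2888:3888
server.2=rancher-worker1:2888:3888
server.3=rancher-worker2:2888:3888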

Start Kafka on all 3 machines
> cd /opt/kafka
> nohup /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties &
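
The broker settings that matter for clustering are roughly these; a sketch of /opt/kafka/config/server.properties for the rancher-home node, with log.dirs assumed:

# unique per broker: 0 on rancher-home, 1 and 2 on the workers
broker.id=0
# advertise this node's own hostname
listeners=PLAINTEXT://rancher-home:9092
# where segment files live; assumed, adjust to the real data directory
log.dirs=/tmp/kafka-logs
# the ZooKeeper ensemble started above
zookeeper.connect=rancher-home:2181,rancher-worker1:2181,rancher-worker2:2181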

Check the Producer and Consumer
> bin/kafka-console-producer.sh --broker-list rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092 --topic cluster1

> bin/kafka-console-consumer.sh --bootstrap-server rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092 --topic cluster1 --from-beginning
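If the cluster1 topic does not exist yet, create it first; with Kafka 2.4 the topic tool talks to the brokers directly, replication factor 3 spreads it across all nodes, and the partition count here is an arbitrary choice:
> bin/kafka-topics.sh --create --bootstrap-server rancher-home:9092 --replication-factor 3 --partitions 3 --topic cluster1
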

Go to sparkmaster_service and start it on rancher-home; go to sparkslave_service and start it on rancher-worker1 and rancher-worker2.
Check the web console:
http://rancher-home:8088/
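
I do not paste the service definitions here; assuming they just wrap the standard Spark scripts under an assumed /opt/spark install, the equivalent manual start would be roughly:

On rancher-home:
> /opt/spark/sbin/start-master.sh --webui-port 8088

On rancher-worker1 and rancher-worker2:
> /opt/spark/sbin/start-slave.sh spark://rancher-home:7077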

Check which process is using port 8080 on CentOS 7
> netstat -nlp | grep 8080
tcp6       0      0 :::8080                :::*                    LISTEN      6206/java

> ps -ef | grep 6206
It turns out that ZooKeeper is using port 8080; since version 3.5, ZooKeeper ships an embedded AdminServer that listens on 8080 by default.
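
That explains why the Spark master web UI cannot sit on its usual 8080 here. If needed, the AdminServer can be moved or turned off in zoo.cfg (both settings are standard since ZooKeeper 3.5):

# move the embedded AdminServer off port 8080
admin.serverPort=8081
# or disable it completely
# admin.enableServer=false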

Package
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.12/2.4.4
https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/2.4.0
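
In pom.xml these map to the following dependencies (versions taken from the links above):

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.4.0</version>
        </dependency>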

Kafka Version
https://spark.apache.org/docs/latest/streaming-kafka-integration.html

An updated example
https://community.cloudera.com/t5/Support-Questions/Scala-Spark-Streaming-with-Kafka-Integration-in-Zeppelin-not/td-p/233507

It seems hard to get streaming working well in Zeppelin, so I will try to make it work in Java and Scala projects instead.
https://github.com/luohuazju/kiko-spark-java
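
The heart of the Java version is a direct stream using the 0-10 integration above. This is a minimal sketch, not the exact code from the project; the class name, the group.id and the 5-second batch interval are my own choices:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        // 5-second micro batches; app name and interval are arbitrary choices
        SparkConf conf = new SparkConf().setAppName("kafka-stream-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers",
                "rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "cluster1-consumer");  // assumed group id
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        // direct stream against the cluster1 topic used in the console test above
        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Arrays.asList("cluster1"), kafkaParams));

        // print the first records of every batch to the driver log
        stream.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}

With enable.auto.commit set to false, offsets are tracked by Spark rather than by the Kafka consumer, which is the usual starting point before deciding on an offset-commit strategy.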

Issue: HDFS is not recognized as a FileSystem type. Error message:
No FileSystem for scheme: hdfs

Solution:
https://www.cnblogs.com/justinzhang/p/4983673.html
https://www.edureka.co/community/3320/java-mapreduce-error-saying-no-filesystem-for-scheme-hdfs
https://brucebcampbell.wordpress.com/2014/12/11/fix-hadoop-hdfs-error-java-io-ioexception-no-filesystem-for-scheme-hdfs-at-org-apache-hadoop-fs-filesystem-getfilesystemclassfilesystem-java2385/
https://www.codelast.com/%E5%8E%9F%E5%88%9B-%E8%A7%A3%E5%86%B3%E8%AF%BB%E5%86%99hdfs%E6%96%87%E4%BB%B6%E7%9A%84%E9%94%99%E8%AF%AF%EF%BC%9Ano-filesystem-for-scheme-hdfs/
https://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file

My changes in pom.xml and settings are as follows:
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

    <properties>
        <spark.version>2.4.4</spark.version>
        <hadoop.version>3.2.1</hadoop.version>
    </properties>

        SparkConf conf = this.getSparkConf();
        SparkContext sc = new SparkContext(conf);
        Configuration hadoopConf = sc.hadoopConfiguration();
        // Register the FileSystem implementations explicitly. In a fat jar the
        // META-INF/services/org.apache.hadoop.fs.FileSystem files from
        // hadoop-common and hadoop-hdfs overwrite each other, so the hdfs and
        // file schemes can go missing unless they are set by hand.
        hadoopConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        hadoopConf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");

With these changes I can solve the HDFS issue and upgrade the project libraries.

But after I solved the HDFS problem, I found that because I run Hadoop in Docker, the Spark cluster cannot talk to port 9866, which is dfs.datanode.address, the DataNode data-transfer port in Hadoop 3.x.
https://kontext.tech/column/hadoop/265/default-ports-used-by-hadoop-services-hdfs-mapreduce-yarn
https://www.stefaanlippens.net/hadoop-3-default-ports.html

I may need to set up an HDFS cluster outside of Docker.
https://acadgild.com/blog/hadoop-3-x-installation-guide
https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/


References:
https://github.com/luohuazju/sillycat-spark
https://www.iteye.com/blog/sillycat-2215237
https://www.iteye.com/blog/sillycat-2406572
https://www.iteye.com/blog/sillycat-2370527

Third parties
https://www.cnblogs.com/luweiseu/p/8045863.html

Zeppelin streaming
https://henning.kropponline.de/2016/12/25/simple-spark-streaming-kafka-example-in-a-zeppelin-notebook/
https://juejin.im/post/5c997d9e5188252da22514e6
https://blog.csdn.net/qwemicheal/article/details/71082663