Data Solution 2019(10) Spark Cluster Solution with Zeppelin
Spark Standalone Cluster
https://spark.apache.org/docs/latest/spark-standalone.html
Mesos Cluster
https://spark.apache.org/docs/latest/running-on-mesos.html
Hadoop2 YARN
https://spark.apache.org/docs/latest/running-on-yarn.html
K8S
https://spark.apache.org/docs/latest/running-on-kubernetes.html
Zeppelin with Cluster
https://zeppelin.apache.org/docs/latest/interpreter/spark.html
Decide to Set Up Spark Standalone Cluster and Zeppelin
Start the Spark Master Machine
Prepare Spark
> wget http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
> mv spark-2.4.4-bin-hadoop2.7 ~/tool/spark-2.4.4
> sudo ln -s /home/carl/tool/spark-2.4.4 /opt/spark-2.4.4
> sudo ln -s /opt/spark-2.4.4 /opt/spark
> cd /opt/spark
> cp conf/spark-env.sh.template conf/spark-env.sh
There is a lot of sample configuration in there:
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
https://spark.apache.org/docs/latest/spark-standalone.html
Make some changes according to my environment:
> vi conf/spark-env.sh
SPARK_MASTER_HOST=rancher-home
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188
Start the master service
> sbin/start-master.sh
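If the master comes up correctly, its web UI should be reachable on the port configured above. A quick check, assuming curl is available and rancher-home resolves from your shell:
> curl -s http://rancher-home:8088 | head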
Start the Slave on rancher-worker1
> wget http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
> mv spark-2.4.4-bin-hadoop2.7 ~/tool/spark-2.4.4
> sudo ln -s /home/carl/tool/spark-2.4.4 /opt/spark-2.4.4
> sudo ln -s /opt/spark-2.4.4 /opt/spark
Prepare Configuration
> cp conf/spark-env.sh.template conf/spark-env.sh
SPARK_MASTER_HOST=rancher-home
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188
Start the slave and connect it to the master
> sbin/start-slave.sh spark://rancher-home:7077
Stop the slave
> sbin/stop-slave.sh spark://rancher-home:7077
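To verify the cluster end to end, point a Spark shell at the master; this is the standard spark-shell option, and the hostname assumes the setup above:
> bin/spark-shell --master spark://rancher-home:7077
The shell should then show up under Running Applications in the master web UI.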
Make the Spark Cluster Work in Docker
# - SPARK_NO_DAEMONIZE Run the proposed command in the foreground. It will not output a PID file.
SPARK_NO_DAEMONIZE=true
It fails when I start the services inside Docker:
2019-10-28T00:41:42.502359700Z 19/10/28 00:41:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2019-10-28T00:41:43.110823900Z 19/10/28 00:41:43 WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
The root cause is hostname resolution inside the container (the HOSTS file): rancher-home does not resolve to an address the master can bind to. This article explains the issue:
https://cloud.tencent.com/developer/article/1175087
Finally, the master configuration ends up close to this (SPARK_LOCAL_HOSTNAME makes Spark use the given name as its local hostname):
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_LOCAL_HOSTNAME=rancher-home
SPARK_IDENT_STRING=rancher-home
SPARK_PUBLIC_DNS=rancher-home
SPARK_NO_DAEMONIZE=true
SPARK_DAEMON_MEMORY=1g
The Dockerfile is as follows:
#Set up spark master in Docker
#Prepare the OS
FROM centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>
RUN yum -y update
RUN yum install -y wget
#install jdk
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile
RUN mkdir /tool/
WORKDIR /tool/
#add the software spark
RUN wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark
ADD conf/spark-env.sh /tool/spark/conf/
#set up the app
EXPOSE 8088 7077
RUN mkdir -p /app/
ADD start.sh /app/
WORKDIR /app/
CMD [ "./start.sh" ]
The important parts of the Makefile are as follows:
run:
docker run -d -p 7077:7077 -p 8088:8088 \
--hostname rancher-home \
--name $(NAME) $(IMAGE):$(TAG)
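For completeness, a sketch of the variables and build target that the run target above assumes; the NAME, IMAGE, and TAG values here are my guesses, not from the original:
IMAGE=spark-master
TAG=latest
NAME=spark-master

build:
	docker build -t $(IMAGE):$(TAG) .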
The slave machine configuration is as follows:
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188
SPARK_PUBLIC_DNS=rancher-worker1
SPARK_LOCAL_HOSTNAME=rancher-worker1
SPARK_IDENT_STRING=rancher-worker1
SPARK_NO_DAEMONIZE=true
The Dockerfile is as follows:
#Set up spark slave in Docker
#Prepare the OS
FROM centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>
RUN yum -y update
RUN yum install -y wget
#install jdk
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile
RUN mkdir /tool/
WORKDIR /tool/
#add the software spark
RUN wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark
ADD conf/spark-env.sh /tool/spark/conf/
#set up the app
EXPOSE 8188 7177
RUN mkdir -p /app/
ADD start.sh /app/
WORKDIR /app/
CMD [ "./start.sh" ]
Add a host entry pointing to the master machine:
run:
docker run -d -p 7177:7177 -p 8188:8188 \
--name $(NAME) \
--hostname rancher-worker1 \
--add-host=rancher-home:192.168.56.110 $(IMAGE):$(TAG)
The next step is to move this hard-coded configuration into parameters.
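As a sketch of that direction: pass the master URL in as an environment variable at docker run time and read it in start.sh. SPARK_MASTER_URL is a name made up for illustration, not a Spark variable:
#!/bin/bash
# Hypothetical parameterized start.sh; SPARK_MASTER_URL is an invented
# variable that defaults to the master used above.
SPARK_MASTER_URL=${SPARK_MASTER_URL:-spark://rancher-home:7077}
/tool/spark/sbin/start-slave.sh "$SPARK_MASTER_URL"
The container would then be started with something like:
> docker run -d -e SPARK_MASTER_URL=spark://rancher-home:7077 $(IMAGE):$(TAG)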
References:
https://spark.apache.org/docs/latest/cluster-overview.html
https://stackoverflow.com/questions/28664834/which-cluster-type-should-i-choose-for-spark
https://stackoverflow.com/questions/39671117/docker-container-with-apache-spark-in-standalone-cluster-mode
https://github.com/shuaicj/docker-spark-master
https://stackoverflow.com/questions/32719007/spark-spark-public-dns-and-spark-local-ip-on-stand-alone-cluster-with-docker-con