Data Solution 2019(13)Docker Zeppelin Notebook and Memory Configuration

On my Mac, I ran into this error when building my Docker image:

Disk Requirements:
  At least 187MB more space needed on the / filesystem.

I checked my disk space and there is plenty free on my Mac, so the error was probably caused by all the Docker images I have built over time. Here are the commands to clean them up.
Remove all the containers
> docker rm $(docker ps -qa)

Remove all the images
> docker rmi $(docker image ls -qa)
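
Alternatively, newer Docker versions can do the cleanup in one step; a minimal sketch (note that -a removes ALL unused images, not just dangling ones, so use with care):

Remove stopped containers, unused networks, dangling images, and build cache
> docker system prune
Also remove every image not referenced by a container
> docker system prune -a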


Memory and Cores Settings
Partitions: splits of the large dataset
Task: runs in a single executor; all tasks can run in parallel
Executor: a JVM process on a worker node; one node can run multiple executors
Cores: the number of tasks one executor can run concurrently
Cluster Manager: allocates cluster resources, such as executors

Driver: the SparkContext connects to the cluster manager (Standalone here)
Cluster Manager: manages all resources, like executors
Spark acquires the executors and ships our packages/code to every executor
The SparkContext then sends the tasks to the executors

Cores: number of parallel tasks per executor, e.g. 5
Executors: number of executors = total CPU cores / 5
Memory: memory per executor = total memory / number of executors

Executor Total Memory = ExecutorMemory + MemoryOverhead
MemoryOverhead = max(384 MB, 0.07 x spark.executor.memory)
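
As a worked example (my own numbers, not from the original setup): assume a hypothetical worker node with 16 CPU cores and 64 GB of RAM, keeping 5 cores per executor; my_job.py is a placeholder:

# reserve 1 core for the OS
# executors per node  = (16 - 1) / 5 = 3
# memory per executor = 64 GB / 3 ≈ 21 GB
# MemoryOverhead      = max(384 MB, 0.07 x 21 GB) ≈ 1.5 GB
# => spark.executor.memory ≈ 21 GB - 1.5 GB ≈ 19 GB
# (spark.executor.memoryOverhead is enforced on YARN/Kubernetes;
#  on Standalone it is only a sizing guideline)
spark-submit \
  --master spark://rancher-home:7077 \
  --executor-cores 5 \
  --executor-memory 19g \
  my_job.py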

Finally, I got it working with the Zeppelin notebook, a Spark master, and Spark slaves. For example:
192.168.56.110 rancher-home       Zeppelin Notebook, Spark Master
192.168.56.111 rancher-worker1    Spark Slave
192.168.56.112 rancher-worker2    Spark Slave
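
Because the containers below run with --network host and rely on these hostnames, every machine must be able to resolve the others; a sketch of the /etc/hosts entries I would expect on each box (assuming the IPs above):

# /etc/hosts on every machine
192.168.56.110 rancher-home
192.168.56.111 rancher-worker1
192.168.56.112 rancher-worker2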

Spark Master on rancher-home
Dockerfile including the R and Python environments
#Set up spark master in Docker

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#install java
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

#prepare python
RUN yum groupinstall -y "Development tools"
RUN yum -y install git freetype-devel openssl-devel libffi-devel
RUN git clone https://github.com/pyenv/pyenv.git ~/.pyenv
ENV HOME  /root
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN pyenv install 3.7.5
RUN pyenv global 3.7.5

#prepare R
RUN yum install -y epel-release
RUN yum install -y R


RUN            mkdir /tool/
WORKDIR        /tool/

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

ADD  conf/spark-env.sh /tool/spark/conf/

#python libraries
RUN pip install --upgrade pip
RUN pip install pandas
RUN pip install -U pandasql
RUN pip install matplotlib

#R libraries
RUN R -e "install.packages('data.table',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('knitr',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('googleVis',dependencies=TRUE, repos='http://cran.rstudio.com/')"


#set up the app
EXPOSE  8088 7077
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh” ]

Makefile supporting the memory and hostname parameters
HOSTNAME=rancher-home
MEMORY=2g

IMAGE=sillycat/public
TAG=sillycat-sparkmaster-1.0
NAME=sillycat-sparkmaster-1.0

docker-context:

build: docker-context
    docker build -t $(IMAGE):$(TAG) .

run:
    docker run -d \
    -e "SPARK_LOCAL_HOSTNAME=$(HOSTNAME)" \
    -e "SPARK_IDENT_STRING=$(HOSTNAME)" \
    -e "SPARK_PUBLIC_DNS=$(HOSTNAME)" \
    -e "SPARK_DAEMON_MEMORY=$(MEMORY)" \
    --network host \
    --name $(NAME) $(IMAGE):$(TAG)

clean:
    docker stop ${NAME}
    docker rm ${NAME}

logs:
    docker logs ${NAME}

publish:
    docker push ${IMAGE}

start.sh to start the Spark master
#!/bin/sh -ex

#prepare ENV

#start the service
cd /tool/spark
sbin/start-master.sh

Settings in conf/spark-env.sh to support the port number
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_NO_DAEMONIZE=true
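
SPARK_NO_DAEMONIZE=true keeps start-master.sh in the foreground; without it the script would daemonize and the container would exit right after starting.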

I use this command to start the container:
>make run HOSTNAME=rancher-home MEMORY=1g
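
To verify the master is up, I can hit the web UI port configured in conf/spark-env.sh (a quick check, assuming the hostname resolves):

# Spark master web UI (SPARK_MASTER_WEBUI_PORT=8088)
> curl -s http://rancher-home:8088 | grep -i "spark master"
# or tail the container logs
> make logs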

Zeppelin on the rancher-home machine
Dockerfile containing all the libraries and software
#Set up Zeppelin Notebook

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#java
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

#prepare python
RUN yum groupinstall -y "Development tools"
RUN yum -y install git freetype-devel openssl-devel libffi-devel
RUN git clone https://github.com/pyenv/pyenv.git ~/.pyenv
ENV HOME  /root
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN pyenv install 3.7.5
RUN pyenv global 3.7.5

#prepare R
RUN yum install -y epel-release
RUN yum install -y R

RUN            mkdir /tool/
WORKDIR        /tool/

#add the software zeppelin
RUN  wget --no-verbose http://www.gtlib.gatech.edu/pub/apache/zeppelin/zeppelin-0.8.2/zeppelin-0.8.2-bin-all.tgz
RUN  tar -xvzf zeppelin-0.8.2-bin-all.tgz
RUN  ln -s /tool/zeppelin-0.8.2-bin-all /tool/zeppelin

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

#python libraries
RUN pip install --upgrade pip
RUN pip install pandas
RUN pip install -U pandasql
RUN pip install matplotlib

#R libraries
RUN R -e "install.packages('data.table',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('knitr',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('googleVis',dependencies=TRUE, repos='http://cran.rstudio.com/')"

#set up the app
EXPOSE  8080 4040
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh" ]

Makefile to start the container with host network
IMAGE=sillycat/public
TAG=sillycat-zeppelinbook-1.0
NAME=sillycat-zeppelinbook-1.0

docker-context:

build: docker-context
    docker build -t $(IMAGE):$(TAG) .

run:
    docker run -d --privileged=true \
    -v $(shell pwd)/zeppelin/notebook:/tool/zeppelin/notebook \
    -v $(shell pwd)/zeppelin/conf:/tool/zeppelin/conf \
    --network host \
    --name $(NAME) \
    $(IMAGE):$(TAG)

clean:
    docker stop ${NAME}
    docker rm ${NAME}

logs:
    docker logs ${NAME}

publish:
    docker push ${IMAGE}
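
The two bind mounts keep the notebooks and the interpreter configuration on the host, so they survive container rebuilds.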

start.sh to start Zeppelin
#!/bin/sh -ex

#start the service
cd /tool/zeppelin
bin/zeppelin.sh
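
(bin/zeppelin.sh runs in the foreground, unlike bin/zeppelin-daemon.sh, which is what keeps the container alive.)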

Settings in zeppelin/conf/zeppelin-env.sh
export SPARK_HOME=/tool/spark
export MASTER=spark://rancher-home:7077

One very important thing: how to add dependencies
In the interpreter settings, add the dependency:
Artifact: mysql:mysql-connector-java:5.1.47

That only applies to the driver and the notebook; we also need to add the following to make it work on all the slaves:
spark.jars.packages: mysql:mysql-connector-java:5.1.47
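
Alternatively, the same artifact can be supplied through zeppelin-env.sh; a sketch using the SPARK_SUBMIT_OPTIONS hook from the Zeppelin 0.8 Spark interpreter docs:

# in zeppelin/conf/zeppelin-env.sh
export SPARK_SUBMIT_OPTIONS="--packages mysql:mysql-connector-java:5.1.47"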

The Spark Slave will be similar to Master
Dockerfile
#Set up spark slave in Docker

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#install jdk
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

#prepare python
RUN yum groupinstall -y "Development tools"
RUN yum -y install git freetype-devel openssl-devel libffi-devel
RUN git clone https://github.com/pyenv/pyenv.git ~/.pyenv
ENV HOME  /root
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN pyenv install 3.7.5
RUN pyenv global 3.7.5

#prepare R
RUN yum install -y epel-release
RUN yum install -y R


RUN            mkdir /tool/
WORKDIR        /tool/

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

ADD  conf/spark-env.sh /tool/spark/conf/

#python libraries
RUN pip install --upgrade pip
RUN pip install pandas
RUN pip install -U pandasql
RUN pip install matplotlib

#r libraries
RUN R -e "install.packages('data.table',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('knitr',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('googleVis',dependencies=TRUE, repos='http://cran.rstudio.com/')"


#set up the app
EXPOSE  8188 7177
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh” ]

Makefile that needs to connect to the master machine
HOSTNAME=rancher-worker1
MASTER=rancher-home
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g

IMAGE=sillycat/public
TAG=sillycat-sparkslave-1.0
NAME=sillycat-sparkslave-1.0

docker-context:

build: docker-context
    docker build -t $(IMAGE):$(TAG) .

run:
    docker run -d \
    -e "SPARK_PUBLIC_DNS=$(HOSTNAME)" \
    -e "SPARK_LOCAL_HOSTNAME=$(HOSTNAME)" \
    -e "SPARK_IDENT_STRING=$(HOSTNAME)" \
    -e "SPARK_MASTER=$(MASTER)" \
    -e "SPARK_WORKER_CORES=$(SPARK_WORKER_CORES)" \
    -e "SPARK_WORKER_MEMORY=$(SPARK_WORKER_MEMORY)" \
    --name $(NAME) \
    --network host \
    $(IMAGE):$(TAG)

clean:
    docker stop ${NAME}
    docker rm ${NAME}

logs:
    docker logs ${NAME}

publish:
    docker push ${IMAGE}

start.sh to start the Spark slave
#!/bin/sh -ex

#start the service
cd /tool/spark
sbin/start-slave.sh spark://${SPARK_MASTER}:7077
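
start-slave.sh reads SPARK_WORKER_CORES and SPARK_WORKER_MEMORY from the environment, which is why the Makefile passes them with -e; SPARK_MASTER is not a Spark variable, it is only used by this script to build the master URL.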

Settings in conf/spark-env.sh
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188
SPARK_IDENT_STRING=rancher-worker1
SPARK_NO_DAEMONIZE=true

The command to start it is similar to this:
>make run MASTER=rancher-home HOSTNAME=rancher-worker1 SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=2g
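
To confirm the worker registered, check the worker web UI on the port set in conf/spark-env.sh, or look under "Workers" in the master UI (assuming the hostnames resolve):

# Spark worker web UI (SPARK_WORKER_WEBUI_PORT=8188)
> curl -s http://rancher-worker1:8188 | grep -i worker
# the worker should also be listed on the master UI at http://rancher-home:8088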


References:
https://stackoverflow.com/questions/38820979/docker-image-error-downloading-package
Memory
https://www.jianshu.com/p/a8b61f14309f
https://blog.51cto.com/10120275/2364992
https://taoistwar.gitbooks.io/spark-operationand-maintenance-management/content/spark_install/spark_standalone_configuration.html
Zeppelin Login Issue
https://stackoverflow.com/questions/46685400/login-to-zeppelin-issues-with-docker
Zeppelin Dependencies Issue
http://zeppelin.apache.org/docs/0.8.2/interpreter/spark.html#dependencyloading