Data Solution 2019(13)Docker Zeppelin Notebook and Memory Configuration

On my Mac, I ran into this error when building my Docker image:

Disk Requirements:
  At least 187MB more space needed on the / filesystem.

I checked my disk space and there is plenty free on my Mac, so the error was probably caused by all the Docker images I have built over time. Here are the commands to clean them up.
Remove all the containers
> docker rm $(docker ps -qa)

Remove all the images
> docker rmi $(docker image ls -qa)
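
Alternatively, newer Docker versions can do the cleanup in one step; a minimal sketch (note that -a removes ALL unused images, not just dangling ones, so use with care):

Remove stopped containers, unused networks, dangling images, and build cache
> docker system prune
Also remove every image not referenced by a container
> docker system prune -a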


Memory and Cores Settings
Partitions: splits of the large dataset
Task: runs in a single executor; all tasks can run in parallel
Executor: a JVM process on a worker node; one node can run multiple executors
Cores: the number of tasks one executor can run concurrently
Cluster Manager: allocates cluster resources, such as executors

Driver: the SparkContext connects to the cluster manager (Standalone here)
Cluster Manager: manages all resources, like executors
Spark acquires the executors and ships our packages/code to every executor
The SparkContext then sends the tasks to the executors

Cores: number of parallel tasks per executor, e.g. 5
Executors: number of executors = total CPU cores / 5
Memory: memory per executor = total memory / number of executors

Executor Total Memory = ExecutorMemory + MemoryOverhead
MemoryOverhead = max(384 MB, 0.07 x spark.executor.memory)
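
As a worked example (my own numbers, not from the original setup): assume a hypothetical worker node with 16 CPU cores and 64 GB of RAM, keeping 5 cores per executor; my_job.py is a placeholder:

# reserve 1 core for the OS
# executors per node  = (16 - 1) / 5 = 3
# memory per executor = 64 GB / 3 ≈ 21 GB
# MemoryOverhead      = max(384 MB, 0.07 x 21 GB) ≈ 1.5 GB
# => spark.executor.memory ≈ 21 GB - 1.5 GB ≈ 19 GB
# (spark.executor.memoryOverhead is enforced on YARN/Kubernetes;
#  on Standalone it is only a sizing guideline)
spark-submit \
  --master spark://rancher-home:7077 \
  --executor-cores 5 \
  --executor-memory 19g \
  my_job.py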

Finally, I got it working with the Zeppelin notebook, a Spark master, and Spark slaves. For example:
192.168.56.110 rancher-home       Zeppelin Notebook, Spark Master
192.168.56.111 rancher-worker1    Spark Slave
192.168.56.112 rancher-worker2    Spark Slave
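
Because the containers below run with --network host and rely on these hostnames, every machine must be able to resolve the others; a sketch of the /etc/hosts entries I would expect on each box (assuming the IPs above):

# /etc/hosts on every machine
192.168.56.110 rancher-home
192.168.56.111 rancher-worker1
192.168.56.112 rancher-worker2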

Spark Master on rancher-home
Dockerfile including the R and Python environments
#Set up spark master in Docker

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#install java
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

#prepare python
RUN yum groupinstall -y "Development tools"
RUN yum -y install git freetype-devel openssl-devel libffi-devel
RUN git clone https://github.com/pyenv/pyenv.git ~/.pyenv
ENV HOME  /root
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN pyenv install 3.7.5
RUN pyenv global 3.7.5

#prepare R
RUN yum install -y epel-release
RUN yum install -y R


RUN            mkdir /tool/
WORKDIR        /tool/

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

ADD  conf/spark-env.sh /tool/spark/conf/

#python libraries
RUN pip install --upgrade pip
RUN pip install pandas
RUN pip install -U pandasql
RUN pip install matplotlib

#R libraries
RUN R -e "install.packages('data.table',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('knitr',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('googleVis',dependencies=TRUE, repos='http://cran.rstudio.com/')"


#set up the app
EXPOSE  8088 7077
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh” ]

Makefile supporting the memory and hostname parameters
HOSTNAME=rancher-home
MEMORY=2g

IMAGE=sillycat/public
TAG=sillycat-sparkmaster-1.0
NAME=sillycat-sparkmaster-1.0

docker-context:

build: docker-context
    docker build -t $(IMAGE):$(TAG) .

run:
    docker run -d \
    -e "SPARK_LOCAL_HOSTNAME=$(HOSTNAME)" \
    -e "SPARK_IDENT_STRING=$(HOSTNAME)" \
    -e "SPARK_PUBLIC_DNS=$(HOSTNAME)" \
    -e "SPARK_DAEMON_MEMORY=$(MEMORY)" \
    --network host \
    --name $(NAME) $(IMAGE):$(TAG)

clean:
    docker stop ${NAME}
    docker rm ${NAME}

logs:
    docker logs ${NAME}

publish:
    docker push ${IMAGE}

start.sh to start the Spark master
#!/bin/sh -ex

#prepare ENV

#start the service
cd /tool/spark
sbin/start-master.sh

Settings in conf/spark-env.sh to support the port number
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_NO_DAEMONIZE=true
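
SPARK_NO_DAEMONIZE=true keeps start-master.sh in the foreground; without it the script would daemonize and the container would exit right after starting.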

I use this command to start the container:
>make run HOSTNAME=rancher-home MEMORY=1g
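
To verify the master is up, I can hit the web UI port configured in conf/spark-env.sh (a quick check, assuming the hostname resolves):

# Spark master web UI (SPARK_MASTER_WEBUI_PORT=8088)
> curl -s http://rancher-home:8088 | grep -i "spark master"
# or tail the container logs
> make logs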

Zeppelin on the rancher-home machine
Dockerfile containing all the libraries and software
#Set up Zeppelin Notebook

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#java
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

#prepare python
RUN yum groupinstall -y "Development tools"
RUN yum -y install git freetype-devel openssl-devel libffi-devel
RUN git clone https://github.com/pyenv/pyenv.git ~/.pyenv
ENV HOME  /root
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN pyenv install 3.7.5
RUN pyenv global 3.7.5

#prepare R
RUN yum install -y epel-release
RUN yum install -y R

RUN            mkdir /tool/
WORKDIR        /tool/

#add the software zeppelin
RUN  wget --no-verbose http://www.gtlib.gatech.edu/pub/apache/zeppelin/zeppelin-0.8.2/zeppelin-0.8.2-bin-all.tgz
RUN  tar -xvzf zeppelin-0.8.2-bin-all.tgz
RUN  ln -s /tool/zeppelin-0.8.2-bin-all /tool/zeppelin

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

#python libraries
RUN pip install --upgrade pip
RUN pip install pandas
RUN pip install -U pandasql
RUN pip install matplotlib

#R libraries
RUN R -e "install.packages('data.table',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('knitr',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('googleVis',dependencies=TRUE, repos='http://cran.rstudio.com/')"

#set up the app
EXPOSE  8080 4040
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh" ]

Makefile to start the container with host network
IMAGE=sillycat/public
TAG=sillycat-zeppelinbook-1.0
NAME=sillycat-zeppelinbook-1.0

docker-context:

build: docker-context
    docker build -t $(IMAGE):$(TAG) .

run:
    docker run -d --privileged=true \
    -v $(shell pwd)/zeppelin/notebook:/tool/zeppelin/notebook \
    -v $(shell pwd)/zeppelin/conf:/tool/zeppelin/conf \
    --network host \
    --name $(NAME) \
    $(IMAGE):$(TAG)

clean:
    docker stop ${NAME}
    docker rm ${NAME}

logs:
    docker logs ${NAME}

publish:
    docker push ${IMAGE}
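
The two bind mounts keep the notebooks and the interpreter configuration on the host, so they survive container rebuilds.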

start.sh to start Zeppelin
#!/bin/sh -ex

#start the service
cd /tool/zeppelin
bin/zeppelin.sh
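
(bin/zeppelin.sh runs in the foreground, unlike bin/zeppelin-daemon.sh, which is what keeps the container alive.)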

Settings in zeppelin/conf/zeppelin-env.sh
export SPARK_HOME=/tool/spark
export MASTER=spark://rancher-home:7077

One very important thing: how to add dependencies
In the interpreter settings, add the dependency:
Artifact: mysql:mysql-connector-java:5.1.47

That only applies to the driver and the notebook; we also need to add the following to make it work on all the slaves:
spark.jars.packages: mysql:mysql-connector-java:5.1.47
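
Alternatively, the same artifact can be supplied through zeppelin-env.sh; a sketch using the SPARK_SUBMIT_OPTIONS hook from the Zeppelin 0.8 Spark interpreter docs:

# in zeppelin/conf/zeppelin-env.sh
export SPARK_SUBMIT_OPTIONS="--packages mysql:mysql-connector-java:5.1.47"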

The Spark Slave will be similar to Master
Dockerfile
#Set up spark slave in Docker

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#install jdk
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

#prepare python
RUN yum groupinstall -y "Development tools"
RUN yum -y install git freetype-devel openssl-devel libffi-devel
RUN git clone https://github.com/pyenv/pyenv.git ~/.pyenv
ENV HOME  /root
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN pyenv install 3.7.5
RUN pyenv global 3.7.5

#prepare R
RUN yum install -y epel-release
RUN yum install -y R


RUN            mkdir /tool/
WORKDIR        /tool/

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

ADD  conf/spark-env.sh /tool/spark/conf/

#python libraries
RUN pip install --upgrade pip
RUN pip install pandas
RUN pip install -U pandasql
RUN pip install matplotlib

#r libraries
RUN R -e "install.packages('data.table',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('knitr',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('googleVis',dependencies=TRUE, repos='http://cran.rstudio.com/')"


#set up the app
EXPOSE  8188 7177
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh” ]

Makefile that needs to connect to the master machine
HOSTNAME=rancher-worker1
MASTER=rancher-home
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g

IMAGE=sillycat/public
TAG=sillycat-sparkslave-1.0
NAME=sillycat-sparkslave-1.0

docker-context:

build: docker-context
    docker build -t $(IMAGE):$(TAG) .

run:
    docker run -d \
    -e "SPARK_PUBLIC_DNS=$(HOSTNAME)" \
    -e "SPARK_LOCAL_HOSTNAME=$(HOSTNAME)" \
    -e "SPARK_IDENT_STRING=$(HOSTNAME)" \
    -e "SPARK_MASTER=$(MASTER)" \
    -e "SPARK_WORKER_CORES=$(SPARK_WORKER_CORES)" \
    -e "SPARK_WORKER_MEMORY=$(SPARK_WORKER_MEMORY)" \
    --name $(NAME) \
    --network host \
    $(IMAGE):$(TAG)

clean:
    docker stop ${NAME}
    docker rm ${NAME}

logs:
    docker logs ${NAME}

publish:
    docker push ${IMAGE}

start.sh to start the Spark slave
#!/bin/sh -ex

#start the service
cd /tool/spark
sbin/start-slave.sh spark://${SPARK_MASTER}:7077
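
start-slave.sh reads SPARK_WORKER_CORES and SPARK_WORKER_MEMORY from the environment, which is why the Makefile passes them with -e; SPARK_MASTER is not a Spark variable, it is only used by this script to build the master URL.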

Settings in conf/spark-env.sh
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188
SPARK_IDENT_STRING=rancher-worker1
SPARK_NO_DAEMONIZE=true

The command to start it is similar to this:
>make run MASTER=rancher-home HOSTNAME=rancher-worker1 SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=2g
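
To confirm the worker registered, check the worker web UI on the port set in conf/spark-env.sh, or look under "Workers" in the master UI (assuming the hostnames resolve):

# Spark worker web UI (SPARK_WORKER_WEBUI_PORT=8188)
> curl -s http://rancher-worker1:8188 | grep -i worker
# the worker should also be listed on the master UI at http://rancher-home:8088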


References:
https://stackoverflow.com/questions/38820979/docker-image-error-downloading-package
Memory
https://www.jianshu.com/p/a8b61f14309f
https://blog.51cto.com/10120275/2364992
https://taoistwar.gitbooks.io/spark-operationand-maintenance-management/content/spark_install/spark_standalone_configuration.html
Zeppelin Login Issue
https://stackoverflow.com/questions/46685400/login-to-zeppelin-issues-with-docker
Zeppelin Dependencies Issue
http://zeppelin.apache.org/docs/0.8.2/interpreter/spark.html#dependencyloading