Spark (3): Introduction to Spark Standalone Cluster Installation
(1) Some questions beginners have about Spark | http://aperise.iteye.com/blog/2302481 |
(2) Setting up a Spark development environment | http://aperise.iteye.com/blog/2302535 |
(3) Introduction to Spark Standalone cluster installation | http://aperise.iteye.com/blog/2305905 |
(4) spark-shell: reading and writing HDFS, Redis, and HBase | http://aperise.iteye.com/blog/2324253 |
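Spark Cluster Installation
1. Spark cluster modes
1.1 Cluster managers supported by Spark
Hadoop provides the MapReduce programming model for writing programs. A finished MapReduce program has to run as a distributed computation, so Hadoop also provides YARN, a resource management and scheduling framework that distributes MapReduce programs across the nodes of the cluster and manages the resources they use.
Spark is similar. Its programming model is the RDD (Resilient Distributed Dataset), and a Spark program is built from a series of RDDs. Spark also ships its own YARN-like resource management and scheduling framework, the Standalone cluster manager. It is not limited to that, though: Spark applications can also run on Apache Mesos or Hadoop YARN, and Spark 1.6 includes scripts for launching a cluster on Amazon EC2. See the official documentation: http://spark.apache.org/docs/1.6.0/cluster-overview.html
1.2 Spark Standalone cluster modes
My understanding of Spark Standalone Mode is that the program, a series of RDDs, is managed and scheduled by Spark itself, with no dependency on Apache Mesos or Hadoop YARN. See the official documentation: http://spark.apache.org/docs/1.6.0/spark-standalone.html
In this mode, a cluster can be installed in three ways:
- Plain Spark cluster: a single master manages all worker nodes. If the master goes down or misbehaves, the whole computation stops. The master is a single point of failure, and nothing can be recovered afterwards.
- Spark HA cluster based on the local file system: there is still a single master, but a local directory is configured for recovery. While jobs run, registration data for workers and applications is written to this directory, so when the master is started again after a crash, the earlier jobs can be recovered from it.
- Spark HA cluster based on ZooKeeper: several masters are started, but only one is active and manages all workers. When the active master goes down or misbehaves, ZooKeeper's coordination service elects a new master from the standby masters registered with it, and the new master takes over and resumes the jobs.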
2. Installing a Spark cluster
2.1 Cluster environment
2.2 Preparation before installing Spark
1) Turn off the firewall
#CentOS 7: start the firewall
systemctl start firewalld.service
#CentOS 7: restart the firewall
systemctl restart firewalld.service
#CentOS 7: stop the firewall
systemctl stop firewalld.service
#CentOS 7: do not start the firewall at boot
systemctl disable firewalld.service
#CentOS 7: check the firewall state
firewall-cmd --state
#Open firewall ports (when the iptables service is used, the rules go in /etc/sysconfig/iptables)
vi /etc/sysconfig/iptables
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 6379 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 6380 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 6381 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 16379 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 16380 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 16381 -j ACCEPT
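Here I simply turn the firewall off. Run the following as root (stop ends the running service, disable keeps it from starting at boot):
systemctl stop firewalld.service
systemctl disable firewalld.service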
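If turning the firewall off entirely is not acceptable, an alternative is to open only the ports the cluster uses. The sketch below only prints the firewall-cmd calls instead of running them. The port list is an assumption based on the usual Spark and ZooKeeper defaults and does not come from this post, and the function name open_port_commands is made up for illustration. Spark also picks random ports for workers, drivers and executors unless they are pinned in the configuration, so treat this as a starting point only.

```shell
# Print (not run) the firewall-cmd calls that would open the given TCP ports.
open_port_commands() {
  for port in "$@"; do
    echo "firewall-cmd --permanent --add-port=${port}/tcp"
  done
  # Permanent rules only become active after a reload.
  echo "firewall-cmd --reload"
}

# Assumed defaults: Spark master 7077, master web UI 8080, worker web UI 8081,
# ZooKeeper client 2181, ZooKeeper quorum 2888, ZooKeeper leader election 3888.
open_port_commands 7077 8080 8081 2181 2888 3888
```

Review the printed commands, then run them as root on each node.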
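2) Disable SELinux
Purpose: the Spark master starts and stops the slave nodes over SSH, and with SELinux enforcing, passwordless SSH login can be blocked.
Edit /etc/selinux/config. Before the change: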
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=enforcing
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
After the change:
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
#SELINUX=enforcing
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
#SELINUXTYPE=targeted
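The edited /etc/selinux/config is read at the next reboot. To stop enforcement immediately without rebooting, run setenforce 0 as root.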
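To double-check the edited file, the active value can be read back with a short script. This is a minimal sketch: the function name selinux_config_mode is made up for illustration, and the demo runs against a temporary file that mirrors the edited config rather than the real /etc/selinux/config.

```shell
# Print the active (uncommented) SELINUX= value from an SELinux config file.
# Commented lines and the SELINUXTYPE= line do not match the pattern.
selinux_config_mode() {
  sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "${1:-/etc/selinux/config}" | tail -n 1
}

# Demonstrate on a temporary file with the same settings as the edited config.
tmp_cfg=$(mktemp)
printf '%s\n' '#SELINUX=enforcing' 'SELINUX=disabled' '#SELINUXTYPE=targeted' > "$tmp_cfg"
selinux_config_mode "$tmp_cfg"   # prints: disabled
rm -f "$tmp_cfg"
```

On a real node, calling selinux_config_mode with no argument reads /etc/selinux/config.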
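3) Hostname configuration
Purpose: machine IPs in the Spark cluster may change and interrupt the services between nodes, so it is best to configure the cluster by hostname.
Edit /etc/hosts on every machine and map the hostnames as follows:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6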
192.168.185.31 hadoop31
192.168.185.32 hadoop32
192.168.185.33 hadoop33
192.168.185.34 hadoop34
192.168.185.35 hadoop35
On CentOS 7, each machine's own hostname is set in /etc/hostname. Taking node hadoop31 as an example, the file contains:
hadoop31
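The /etc/hosts entries for the five nodes follow a regular pattern, so they can be generated rather than typed by hand. A minimal sketch, assuming the 192.168.185.31-35 / hadoop31-35 numbering used in this post; the function name hosts_entries is made up for illustration.

```shell
# Print one "IP hostname" line per node for the given range of host numbers.
hosts_entries() {
  first=$1
  last=$2
  i=$first
  while [ "$i" -le "$last" ]; do
    echo "192.168.185.$i hadoop$i"
    i=$((i + 1))
  done
}

hosts_entries 31 35
# To apply, as root on each node: hosts_entries 31 35 >> /etc/hosts
```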
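4) Create the hadoop user and group
Purpose: from here on the Spark cluster is managed by a dedicated hadoop user, so that other users cannot shut it down by mistake.
groupadd hadoop
useradd -g hadoop hadoop
#Set the user's password
passwd hadoop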
5) Passwordless SSH login for user hadoop
Purpose: the Spark master manages the slave nodes by logging in to them over SSH, and an ordinary SSH login asks for a password. To let the master manage hundreds or thousands of slave nodes conveniently, the master's public key is copied to the slave nodes so that SSH login works without a password. Here I set up passwordless login between all master and slave machines.
ssh hadoop31
su hadoop
#Generate the public/private key pair. This must be run on every node in the cluster; just press Enter at each prompt
ssh-keygen -t rsa
#When you log in over SSH, the remote machine checks your public key (from /home/hadoop/.ssh/id_rsa.pub on the machine you come from) against its own /home/hadoop/.ssh/authorized_keys. The commands below are run on the master only and are enough to set up passwordless SSH between all master and slave nodes
cd /home/hadoop/.ssh/
#First add the master's public key to authorized_keys
cat id_rsa.pub>>authorized_keys
#Then add each slave node's public key to authorized_keys; I ran this on hadoop31
ssh hadoop@192.168.185.32 cat /home/hadoop/.ssh/id_rsa.pub>> authorized_keys
ssh hadoop@192.168.185.33 cat /home/hadoop/.ssh/id_rsa.pub>> authorized_keys
ssh hadoop@192.168.185.34 cat /home/hadoop/.ssh/id_rsa.pub>> authorized_keys
ssh hadoop@192.168.185.35 cat /home/hadoop/.ssh/id_rsa.pub>> authorized_keys
#The permissions on /home/hadoop/.ssh/authorized_keys must be set
chmod 600 /home/hadoop/.ssh/authorized_keys
#Distribute the master's authorized_keys to the other slave nodes
scp -r /home/hadoop/.ssh/authorized_keys hadoop@192.168.185.32:/home/hadoop/.ssh/
scp -r /home/hadoop/.ssh/authorized_keys hadoop@192.168.185.33:/home/hadoop/.ssh/
scp -r /home/hadoop/.ssh/authorized_keys hadoop@192.168.185.34:/home/hadoop/.ssh/
scp -r /home/hadoop/.ssh/authorized_keys hadoop@192.168.185.35:/home/hadoop/.ssh/
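Once the keys are distributed, it is worth confirming that every node really accepts key-based login before moving on. The sketch below lists the hosts that still fail. BatchMode=yes makes ssh fail instead of prompting for a password. The function name check_passwordless and the SSH_CMD override are made up for illustration; the override only exists so the loop can be tried without a real cluster.

```shell
# Print every host that does NOT accept key-based login for the given user.
# Usage: check_passwordless USER HOST...
check_passwordless() {
  user=$1
  shift
  for host in "$@"; do
    # BatchMode=yes: never prompt for a password, fail instead.
    if ! ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true </dev/null; then
      echo "$host"
    fi
  done
}

# On the master, as user hadoop (empty output means every node is fine):
# check_passwordless hadoop hadoop31 hadoop32 hadoop33 hadoop34 hadoop35
```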
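6) JDK installation
Purpose: Spark needs a Java environment. Install it as follows:
su hadoop
#Download jdk-7u65-linux-x64.gz into /home/hadoop/java and unpack it
cd /home/hadoop/java
tar -zxvf jdk-7u65-linux-x64.gz
#Edit /home/hadoop/.bashrc (vi /home/hadoop/.bashrc) and append the following at the end of the file
export JAVA_HOME=/home/hadoop/java/jdk1.7.0_65
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
#Make the /home/hadoop/.bashrc settings take effect
source /home/hadoop/.bashrc
Many people configure the system-wide /etc/profile instead. I do not recommend that: if someone downgrades or removes the Java environment there, things break. It is better to set the environment variables in the .bashrc of the user that manages the Spark cluster.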
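7) ZooKeeper installation
Purpose: used later by the ZooKeeper-based Spark HA mode
#1 Unpack ZooKeeper
su hadoop
cd /home/hadoop
tar -zxvf zookeeper-3.4.6.tar.gz
#2 Configure /etc/hosts on every node in the cluster as follows: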
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.185.31 hadoop31
192.168.185.32 hadoop32
192.168.185.33 hadoop33
192.168.185.34 hadoop34
192.168.185.35 hadoop35
#3 Create the ZooKeeper data directory on every node in the cluster
ssh hadoop31
cd /home/hadoop
#ZooKeeper data directory
mkdir -p /opt/hadoop/zookeeper
ssh hadoop32
cd /home/hadoop
#ZooKeeper data directory
mkdir -p /opt/hadoop/zookeeper
ssh hadoop33
cd /home/hadoop
#ZooKeeper data directory
mkdir -p /opt/hadoop/zookeeper
ssh hadoop34
cd /home/hadoop
#ZooKeeper data directory
mkdir -p /opt/hadoop/zookeeper
ssh hadoop35
cd /home/hadoop
#ZooKeeper data directory
mkdir -p /opt/hadoop/zookeeper
#4 Configure zoo.cfg
ssh hadoop31
cd /home/hadoop/zookeeper-3.4.6/conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
#Contents:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/hadoop/zookeeper
clientPort=2181
server.1=hadoop31:2888:3888
server.2=hadoop32:2888:3888
server.3=hadoop33:2888:3888
server.4=hadoop34:2888:3888
server.5=hadoop35:2888:3888
#5 On hadoop31, copy the installation to the other nodes
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@hadoop32:/home/hadoop/
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@hadoop33:/home/hadoop/
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@hadoop34:/home/hadoop/
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@hadoop35:/home/hadoop/
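#6 Set myid on every node in the cluster. It must be a number and must match the node's server.N entry in zoo.cfg
ssh hadoop31
echo "1" > /opt/hadoop/zookeeper/myid
ssh hadoop32
echo "2" > /opt/hadoop/zookeeper/myid
ssh hadoop33
echo "3" > /opt/hadoop/zookeeper/myid
ssh hadoop34
echo "4" > /opt/hadoop/zookeeper/myid
ssh hadoop35
echo "5" > /opt/hadoop/zookeeper/myid
#7 Start ZooKeeper (run on every node; hadoop31 is shown)
ssh hadoop31
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh start
#8 Stop ZooKeeper (run on every node; hadoop31 is shown)
ssh hadoop31
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh stop
#9 Check the ZooKeeper status (run on every node; hadoop31 is shown)
ssh hadoop31
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh status
#10 Browse the data stored in ZooKeeper with the client (works from any node)
ssh hadoop31
/home/hadoop/zookeeper-3.4.6/bin/zkCli.sh -server hadoop31:2181,hadoop32:2181,hadoop33:2181,hadoop34:2181,hadoop35:2181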
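Steps 4 and 6 have to agree with each other: the number N in a node's server.N line of zoo.cfg must be the number in that node's myid file. The sketch below derives both from the hostname so they cannot drift apart. It assumes the convention used in this post, where the id is the number in the hostname minus 30 (hadoop31 is 1, hadoop35 is 5). The function names zk_id and zk_server_lines are made up for illustration.

```shell
# Derive the ZooKeeper id from a hostname such as hadoop31 (31 - 30 = 1).
zk_id() {
  echo $(( ${1#hadoop} - 30 ))
}

# Print the server.N lines for zoo.cfg for the given hosts.
zk_server_lines() {
  for host in "$@"; do
    echo "server.$(zk_id "$host")=$host:2888:3888"
  done
}

zk_server_lines hadoop31 hadoop32 hadoop33 hadoop34 hadoop35
# On each node, the matching myid would be written with:
#   zk_id "$(hostname)" > /opt/hadoop/zookeeper/myid
```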
8) Scala installation
cd /home/hadoop/java
wget http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz
tar -zxvf scala-2.11.7.tgz
#Append the following to /home/hadoop/.bashrc
export SCALA_HOME=/home/hadoop/java/scala-2.11.7
export PATH=$PATH:$SCALA_HOME/bin
#Make the settings take effect
source /home/hadoop/.bashrc
#Check that Scala was installed successfully
scala -version
9) Install spark-1.6.0-bin-hadoop2.6
cd /home/hadoop
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
tar -zxf spark-1.6.0-bin-hadoop2.6.tgz
#Append the following to /home/hadoop/.bashrc
export SPARK_HOME=/home/hadoop/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
#Make the settings above take effect
source /home/hadoop/.bashrc
2.3 Installing a plain Spark cluster
1)spark-env.sh
Copy /home/hadoop/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh.template to /home/hadoop/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh and append the following at the end:
export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=10 -Dspark.history.fs.logDirectory=hdfs://bigdatacluster-ha/historyserverforspark"
export SCALA_HOME=/home/hadoop/java/scala-2.11.7
export JAVA_HOME=/home/hadoop/java/jdk1.7.0_65
#export SPARK_MASTER_IP=hadoop31
#export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4
export SPARK_WORKER_INSTANCES=4
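SPARK_WORKER_INSTANCES makes each slave node start 4 worker processes. SPARK_WORKER_CORES caps each worker process at 4 CPU cores, and SPARK_WORKER_MEMORY caps each worker process at 8 GB of memory. After the services start you will really see 4 worker processes on every slave node, using 32 GB of memory and 16 cores in total. Each machine therefore needs more than 16 cores and more than 32 GB of RAM, because some cores and memory must be left for the operating system and other programs.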
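The totals above are simple products, which makes it easy to check other settings before editing spark-env.sh. A minimal sketch; the function name worker_totals is made up for illustration.

```shell
# Per-node totals claimed by the workers:
#   total cores  = instances * cores per worker
#   total memory = instances * memory per worker (GB)
worker_totals() {
  instances=$1
  cores=$2
  mem_gb=$3
  echo "cores=$((instances * cores)) memory_gb=$((instances * mem_gb))"
}

# The values from spark-env.sh above: 4 instances, 4 cores and 8 GB each.
worker_totals 4 4 8   # prints: cores=16 memory_gb=32
```

Whatever the result, the machine needs headroom beyond it for the operating system and other processes.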
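The line export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=NONE" does not have to be added to the file above, because NONE is the default.
hdfs://bigdatacluster-ha/historyserverforspark above is the HDFS directory where Spark stores its execution history. My Hadoop is a ZooKeeper-based HA installation, which I described in http://aperise.iteye.com/admin/blogs/2305809. This directory has to be created on HDFS before it is used.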
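A sketch of creating that history directory, under two assumptions that do not come from this post: the hdfs client is on PATH for the hadoop user, and the directory is taken straight from the SPARK_HISTORY_OPTS value shown in spark-env.sh. When hdfs is not available, the script only prints the command it would run.

```shell
# Same value as in spark-env.sh above.
SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=10 -Dspark.history.fs.logDirectory=hdfs://bigdatacluster-ha/historyserverforspark"

# Cut the directory out of the option string: drop everything up to the
# property name, then drop anything after the next space.
log_dir="${SPARK_HISTORY_OPTS##*-Dspark.history.fs.logDirectory=}"
log_dir="${log_dir%% *}"

if command -v hdfs >/dev/null 2>&1; then
  # -p creates parent directories and does not fail if the directory exists.
  hdfs dfs -mkdir -p "$log_dir"
else
  echo "hdfs not on PATH; would run: hdfs dfs -mkdir -p $log_dir"
fi
```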
2)slaves (conf/slaves, one hostname per line)
hadoop31
hadoop32
hadoop33
hadoop34
hadoop35
3) Distribute the installation to the other machines
ssh hadoop31
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop32:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop33:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop34:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop35:/home/hadoop/java/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop32:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop33:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop34:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop35:/home/hadoop/
4) Start Spark
ssh hadoop31
cd /home/hadoop/spark-1.6.0-bin-hadoop2.6
sbin/start-all.sh
5) Connect with spark-shell
Here there is a single master, on hadoop31. From any machine, connect spark-shell to the cluster like this:
bin/spark-shell --master spark://hadoop31:7077
2.4 Installing a Spark HA cluster based on the local file system
1)spark-env.sh
Copy /home/hadoop/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh.template to /home/hadoop/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh and append the following at the end:
export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=10 -Dspark.history.fs.logDirectory=hdfs://bigdatacluster-ha/historyserverforspark"
export SCALA_HOME=/home/hadoop/java/scala-2.11.7
export JAVA_HOME=/home/hadoop/java/jdk1.7.0_65
#export SPARK_MASTER_IP=hadoop31
#export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4
export SPARK_WORKER_INSTANCES=4
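SPARK_WORKER_INSTANCES makes each slave node start 4 worker processes. SPARK_WORKER_CORES caps each worker process at 4 CPU cores, and SPARK_WORKER_MEMORY caps each worker process at 8 GB of memory. After the services start you will really see 4 worker processes on every slave node, using 32 GB of memory and 16 cores in total. Each machine therefore needs more than 16 cores and more than 32 GB of RAM, because some cores and memory must be left for the operating system and other programs.
In this mode spark.deploy.recoveryMode=FILESYSTEM must be set, together with the directory for the recovery data, spark.deploy.recoveryDirectory=/home/hadoop/sparkexecutedata. The directory has to be created on every machine, as follows: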
mkdir -p /home/hadoop/sparkexecutedata
ssh hadoop32
mkdir -p /home/hadoop/sparkexecutedata
ssh hadoop33
mkdir -p /home/hadoop/sparkexecutedata
ssh hadoop34
mkdir -p /home/hadoop/sparkexecutedata
ssh hadoop35
mkdir -p /home/hadoop/sparkexecutedata
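The two recovery properties named above do not appear in the spark-env.sh listing for this section. A sketch of the line that would carry them, assuming they are passed through SPARK_DAEMON_JAVA_OPTS in the same way as the recoveryMode=NONE example in section 2.3:

```shell
# Assumed addition to conf/spark-env.sh for file-system based recovery.
# The directory is the one created on every machine above.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/home/hadoop/sparkexecutedata"
```

The master reads this when it starts, so it has to be in place before sbin/start-all.sh is run.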
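hdfs://bigdatacluster-ha/historyserverforspark above is the HDFS directory where Spark stores its execution history. My Hadoop is a ZooKeeper-based HA installation, which I described in http://aperise.iteye.com/admin/blogs/2305809. This directory has to be created on HDFS before it is used.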
2)slaves (conf/slaves, one hostname per line)
hadoop31
hadoop32
hadoop33
hadoop34
hadoop35
3) Distribute the installation to the other machines
ssh hadoop31
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop32:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop33:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop34:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop35:/home/hadoop/java/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop32:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop33:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop34:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop35:/home/hadoop/
4) Start Spark
ssh hadoop31
cd /home/hadoop/spark-1.6.0-bin-hadoop2.6
sbin/start-all.sh
5) Connect with spark-shell
Here there is a single master, on hadoop31. From any machine, connect spark-shell to the cluster like this:
bin/spark-shell --master spark://hadoop31:7077
2.5 Installing a Spark HA cluster based on ZooKeeper
1)spark-env.sh
Copy /home/hadoop/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh.template to /home/hadoop/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh and append the following at the end:
export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=10 -Dspark.history.fs.logDirectory=hdfs://bigdatacluster-ha/historyserverforspark"
export SCALA_HOME=/home/hadoop/java/scala-2.11.7
export JAVA_HOME=/home/hadoop/java/jdk1.7.0_65
#export SPARK_MASTER_IP=hadoop31
#export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4
export SPARK_WORKER_INSTANCES=4
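SPARK_WORKER_INSTANCES makes each slave node start 4 worker processes. SPARK_WORKER_CORES caps each worker process at 4 CPU cores, and SPARK_WORKER_MEMORY caps each worker process at 8 GB of memory. After the services start you will really see 4 worker processes on every slave node, using 32 GB of memory and 16 cores in total. Each machine therefore needs more than 16 cores and more than 32 GB of RAM, because some cores and memory must be left for the operating system and other programs.
In this mode spark.deploy.recoveryMode=ZOOKEEPER must be set, together with spark.deploy.zookeeper.url=hadoop31:2181,hadoop32:2181,hadoop33:2181,hadoop34:2181,hadoop35:2181 and spark.deploy.zookeeper.dir=/spark-zk-path, where /spark-zk-path is the ZooKeeper path under which the recovery data is stored.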
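The three ZooKeeper properties named above do not appear in the spark-env.sh listing for this section. A sketch of the line that would carry them, assuming they are passed through SPARK_DAEMON_JAVA_OPTS in the same way as the recoveryMode=NONE example in section 2.3. The ZK_URL helper variable is only there for readability and is not a Spark setting.

```shell
# Assumed addition to conf/spark-env.sh for ZooKeeper based recovery.
# ZK_URL is a local helper variable, not something Spark reads.
ZK_URL="hadoop31:2181,hadoop32:2181,hadoop33:2181,hadoop34:2181,hadoop35:2181"
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=${ZK_URL} -Dspark.deploy.zookeeper.dir=/spark-zk-path"
```

Every master, active or standby, needs the same setting so that they all register under the same ZooKeeper path.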
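hdfs://bigdatacluster-ha/historyserverforspark above is the HDFS directory where Spark stores its execution history. My Hadoop is a ZooKeeper-based HA installation, which I described in http://aperise.iteye.com/admin/blogs/2305809. This directory has to be created on HDFS before it is used.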
2)slaves (conf/slaves, one hostname per line)
hadoop31
hadoop32
hadoop33
hadoop34
hadoop35
3) Distribute the installation to the other machines
ssh hadoop31
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop32:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop33:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop34:/home/hadoop/java/
scp -r /home/hadoop/java/scala-2.11.7 hadoop@hadoop35:/home/hadoop/java/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop32:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop33:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop34:/home/hadoop/
scp -r /home/hadoop/spark-1.6.0-bin-hadoop2.6 hadoop@hadoop35:/home/hadoop/
4) Start ZooKeeper
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh start
ssh hadoop32
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh start
ssh hadoop33
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh start
ssh hadoop34
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh start
ssh hadoop35
/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh start
5) Start Spark
ssh hadoop31
cd /home/hadoop/spark-1.6.0-bin-hadoop2.6
sbin/start-all.sh
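6) Start the standby Spark master
There can be several masters; each additional one just needs to be started separately. For example, to start a master on hadoop32: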
ssh hadoop32
cd /home/hadoop/spark-1.6.0-bin-hadoop2.6
sbin/start-master.sh
7) Connect with spark-shell
Here I have two masters, on hadoop31 and hadoop32. From any machine, connect spark-shell to the cluster like this:
bin/spark-shell --master spark://hadoop31:7077,hadoop32:7077
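With more than one master, the --master value is a single spark:// URL with a comma-separated host:port list. A minimal sketch that builds it from a list of hosts, so that adding another standby master only means adding a hostname; the function name spark_master_url is made up for illustration.

```shell
# Build a spark:// URL from a port and a list of master hosts.
# Usage: spark_master_url PORT HOST...
spark_master_url() {
  port=$1
  shift
  url=""
  for host in "$@"; do
    # Prefix a comma only when url already holds an earlier entry.
    url="${url:+$url,}$host:$port"
  done
  echo "spark://$url"
}

spark_master_url 7077 hadoop31 hadoop32   # prints: spark://hadoop31:7077,hadoop32:7077
# bin/spark-shell --master "$(spark_master_url 7077 hadoop31 hadoop32)"
```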