
Hive + HBase + ZooKeeper + Spark 2.3.0 environment setup

程序员文章站 2024-01-15 23:56:52

Cluster configuration overview

Installation diagram

Component notes

JDK: required by both Hadoop and Spark; the official recommendation is JDK 1.7 or later.
Scala: required by Spark; use a version no older than the one your Spark build targets.
Hadoop: the distributed-system base framework.
Spark: the processing engine for big data held in distributed storage.
ZooKeeper: a coordination service for distributed applications, required by the HBase cluster.
HBase: a distributed storage system for structured data.
Hive: a data warehouse tool built on Hadoop; this guide uses MySQL as its metastore database.

1. Installing and starting Spark 2.3.0

Depending on whether the Spark application's driver program runs inside the cluster, a Spark application can be submitted in Cluster mode or Client mode.
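The mode is chosen at submission time. A sketch using SparkPi, the example application bundled with Spark (this assumes YARN is already up; the jar name matches a Spark 2.3.0 / Scala 2.11 build):

```shell
# Client mode: the driver runs in this shell's JVM on the submitting host.
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_2.11-2.3.0.jar 100

# Cluster mode: the driver runs inside an ApplicationMaster container on the cluster.
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_2.11-2.3.0.jar 100
```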

1. Build Spark 2.3.0 from source

Spark ships with spark-sql, a component that couples Spark with a modified copy of Hive. That bundled Hive conflicts with the Hive 2.3-line releases we use, so we rebuild Spark from source with the Hive parts of spark-sql stripped out. Note that the resulting Spark build can no longer run spark-sql.

Building Spark requires Maven 3.3.9, Scala 2.11.8, and JDK 1.8.

Download link: Spark 2.3.0 source
After downloading, extract it to /opt/workspace/. Also download Maven, configure its environment variables, and point its download mirror at a domestic source.
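The build invocation itself is not shown above; a sketch, following the profile list in the Hive-on-Spark documentation (the `hadoop-2.7` profile may need adjusting for Hadoop 2.9):

```shell
# Sketch: build a Spark 2.3.0 distribution without the bundled Hive classes.
cd /opt/workspace/spark-2.3.0
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz \
    "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
```

The `*-provided` profiles keep Hadoop, Parquet, and ORC jars out of the distribution so the versions already on the cluster are used instead.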

2. Edit the files in the conf directory

1. Configure spark-env.sh

cp   spark-env.sh.template   spark-env.sh

Add the following:

export SCALA_HOME=/opt/workspace/scala-2.11.8
export JAVA_HOME=/opt/workspace/jdk1.8
export HADOOP_HOME=/opt/workspace/hadoop-2.9.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 
export SPARK_HOME=/opt/workspace/spark-2.3.0-bin-hadoop2-without-hive
export SPARK_CONF_DIR=$SPARK_HOME/conf
export SPARK_EXECUTOR_MEMORY=5120M
export SPARK_DIST_CLASSPATH=$(/opt/workspace/hadoop-2.9.1/bin/hadoop classpath)

2. Configure slaves

cp   slaves.template   slaves

List the worker hostnames:

slave1
slave2
slave3

3. Edit the spark-defaults.conf file

Copy it from the template:

cp   spark-defaults.conf.template   spark-defaults.conf

Add the following:

spark.master                     yarn-cluster
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://master:9000/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              4g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

4. Create the directory on HDFS

The configuration above points spark.eventLog.dir at the directory folder on HDFS, so create it (and open its permissions) with the hadoop commands:

$HADOOP_HOME/bin/hadoop   fs  -mkdir  -p   /directory
$HADOOP_HOME/bin/hadoop   fs  -chmod  777  /directory

5. Start Spark

Go to the sbin directory:

[root@master1 sbin]# start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/workspace/spark-2.3.0-bin-hadoop2-without-hive/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
slave3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/workspace/spark-2.3.0-bin-hadoop2-without-hive/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave3.out
slave2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/workspace/spark-2.3.0-bin-hadoop2-without-hive/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave2.out
slave1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/workspace/spark-2.3.0-bin-hadoop2-without-hive/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave1.out
[root@master1 sbin]# jps
9296 QuorumPeerMain
9939 HMaster
28005 ResourceManager
12522 Master
27834 SecondaryNameNode
27626 NameNode
12620 Jps

2. Install and configure MySQL

This cluster runs on CentOS 7, whose default yum repositories do not include MySQL, so the MySQL yum repo must be downloaded first.

1. Download the MySQL repo package

wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm

2. Install the rpm package

rpm -ivh mysql-community-release-el7-5.noarch.rpm

Installing this package adds two MySQL yum repo files:

/etc/yum.repos.d/mysql-community.repo
/etc/yum.repos.d/mysql-community-source.repo

3. Install MySQL server

$ sudo yum install mysql-server
[root@master1 mysql]# service mysqld start
[root@master1 mysql]# mysql -u root
mysql> set password for 'root'@'localhost' =password('123456');
Query OK, 0 rows affected (0.00 sec)
[root@master1 mysql]# mysql -uroot -p123456
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

Important MySQL directories

#(a) data directory
/var/lib/mysql/
#(b) configuration files
/usr/share/mysql (mysql.server script and configuration files)
#(c) client and admin commands
/usr/bin (mysqladmin, mysqldump, and other commands)
#(d) startup script
/etc/rc.d/init.d/ (directory holding the mysql startup script)

3. ZooKeeper setup

1. Download the release

Download page: http://mirror.bit.edu.cn/apache/zookeeper/

mkdir /opt/workspace/zookeeper

Download the stable release and extract it into the /opt/workspace/zookeeper directory.

2. Environment variables

Append the following to /etc/profile:

#Zookeeper Config
export ZK_HOME=/opt/workspace/zookeeper/zookeeper-3.4.12

export PATH=.:${JAVA_HOME}/bin:${SCALA_HOME}/bin:${MAVEN_HOME}/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:${HIVE_HOME}/bin:${SPARK_HOME}/bin:${HBASE_HOME}/bin:$SQOOP_HOME/bin:${ZK_HOME}/bin:$PATH
source /etc/profile

3. Edit the configuration files

1. Create these directories on all servers in the cluster

Switch to the /opt/workspace/ directory and run:

mkdir zookeeper/data
mkdir zookeeper/datalog

Then create a myid file in the /opt/workspace/zookeeper/data directory:

touch myid

For convenience, set the myid contents on master1, slave1, slave2, and slave3 to 1, 3, 4, and 5 respectively.
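A sketch of writing those myid files from master1 (assumes passwordless ssh to the slaves; paths and the hostname-to-id mapping are the ones used in this article):

```shell
# master1's own id:
echo 1 > /opt/workspace/zookeeper/data/myid

# Write the slaves' ids remotely; each id must match a server.N line in zoo.cfg.
for pair in slave1:3 slave2:4 slave3:5; do
  host=${pair%%:*}   # text before the colon, e.g. slave1
  id=${pair##*:}     # text after the colon, e.g. 3
  ssh "$host" "echo $id > /opt/workspace/zookeeper/data/myid"
done
```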

2. Edit the zoo.cfg file

Copy zoo_sample.cfg, rename the copy to zoo.cfg, and set:

dataDir=/opt/workspace/zookeeper/data
dataLogDir=/opt/workspace/zookeeper/datalog

server.1=master1:2888:3888
server.3=slave1:2888:3888
server.4=slave2:2888:3888
server.5=slave3:2888:3888

Notes: clientPort is, as the name says, the TCP port on which clients connect to the ZooKeeper service. dataLogDir holds the sequential transaction log (the WAL), while dataDir holds snapshots of the in-memory data structures, used for fast recovery. For maximum performance it is generally recommended to put dataDir and dataLogDir on different disks, so the log can take full advantage of sequential writes. Both directories must be created by hand; any paths work as long as the configuration matches them.
1. tickTime: the heartbeat interval
The interval at which heartbeats are exchanged between ZooKeeper servers, and between clients and servers; one heartbeat is sent every tickTime. tickTime is in milliseconds.
tickTime=2000
2. initLimit: initial leader/follower connection limit
The maximum number of heartbeats (tickTime intervals) a follower (F) may take to complete its initial connection to the leader (L).
initLimit=10
3. syncLimit: leader/follower sync limit
The maximum number of heartbeats (tickTime intervals) allowed between a request and its reply when followers synchronize with the leader.
syncLimit=5
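Putting the fragments and the settings above together, a complete zoo.cfg for this cluster would look like the following (a sketch; clientPort and the 2888/3888 ports are ZooKeeper's defaults):

```
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
dataDir=/opt/workspace/zookeeper/data
dataLogDir=/opt/workspace/zookeeper/datalog
server.1=master1:2888:3888
server.3=slave1:2888:3888
server.4=slave2:2888:3888
server.5=slave3:2888:3888
```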
As before, copy the zookeeper directory to the other machines.

3. Start ZooKeeper

Switch to the /opt/workspace/zookeeper/zookeeper-3.4.12/bin directory and run:

zkServer.sh start

Note: once ZooKeeper is configured successfully, it must be started on every machine.

After all nodes are up, wait a moment and check their status:

[root@master1 bin]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/workspace/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
Mode: follower
[root@slave3 bin]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/workspace/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
Mode: leader

4. Install HBase

1. Download the release

Download page: http://mirrors.shu.edu.cn/apache/hbase/
Download the stable release.
Extract it to the /opt/workspace directory.

2. Environment variables

Edit the /etc/profile file and add:

# HBase Config
export HBASE_HOME=/opt/workspace/hbase-1.4.6
export PATH=.:${JAVA_HOME}/bin:${SCALA_HOME}/bin:${MAVEN_HOME}/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:${HIVE_HOME}/bin:${SPARK_HOME}/bin:${HBASE_HOME}/bin:$SQOOP_HOME/bin:$PATH

Reload the profile:

source /etc/profile

Check the version:

[root@master1 hbase-1.4.6]# hbase version
HBase 1.4.6
Source code repository git://apurtell-ltm4.internal.salesforce.com/Users/apurtell/src/hbase revision=a55bcbd4fc87ff9cd3caaae25277e0cfdbb344a5
Compiled by apurtell on Tue Jul 24 16:25:52 PDT 2018
From source with checksum 265f6798aa3f8da7100f0d0f243de921

3. Edit the configuration files

Switch to /opt/workspace/hbase-1.4.6/conf.

1. Edit the hbase-env.sh file

Add the following to hbase-env.sh:

export JAVA_HOME=/opt/workspace/jdk1.8
export HADOOP_HOME=/opt/workspace/hadoop-2.9.1
export HBASE_HOME=/opt/workspace/hbase-1.4.6
export HBASE_CLASSPATH=$HADOOP_HOME/etc/hadoop
export HBASE_PID_DIR=/root/hbase/pids
export HBASE_MANAGES_ZK=false
export HBASE_LOG_DIR=${HBASE_HOME}/logs

HBASE_MANAGES_ZK=false disables HBase's bundled ZooKeeper in favor of the external ensemble configured earlier.

2. Edit hbase-site.xml

    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://master1:9000/hbase</value>
        <description>The directory shared by region servers.</description>
    </property>
    <!-- ZooKeeper client port -->
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
    <!-- session timeout -->
    <property>
        <name>zookeeper.session.timeout</name>
        <value>120000</value>
    </property>
    <!-- tolerate clock skew between servers -->
    <property>
        <name>hbase.master.maxclockskew</name>
        <value>150000</value>
    </property>
    <!-- ZooKeeper quorum hosts -->
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>master1,slave1,slave2,slave3</value>
    </property>
    <!-- temporary data path -->
    <property>
        <name>hbase.tmp.dir</name>
        <value>/root/hbase/tmp</value>
    </property>
    <!-- true = distributed mode -->
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <!-- master address -->
    <property>
        <name>hbase.master</name>
        <value>master1:60000</value>
    </property>

hbase.rootdir is the HDFS path used to persist HBase; its host:port must match fs.defaultFS in hadoop/core-site.xml. hbase.cluster.distributed selects HBase's run mode: false for standalone, true for distributed.

3. Edit regionservers

List the region server (worker) hostnames, just like Hadoop's slaves file:

slave1
slave2
slave3

Finally, copy the HBase directory and the /etc/profile changes to the other hosts.

5. Configure Hive

1. Download the Hive release

Check the official compatibility notes and download a matching Hive release.
Hive on Spark official guide
Hive downloads: http://mirrors.hust.edu.cn/apache/hive/
Download 3.1.0, extract it, and move it to the /opt/workspace/ directory.

2. Add the required jars to the lib directory

Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which doesn’t have an assembly jar. To run with YARN mode (either yarn-client or yarn-cluster), link the following jars to HIVE_HOME/lib.

  • scala-library
  • spark-core
  • spark-network-common
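A sketch of linking those jars (the versioned file names assume the Spark 2.3.0 / Scala 2.11 build from section 1; adjust the globs if yours differ):

```shell
SPARK_HOME=/opt/workspace/spark-2.3.0-bin-hadoop2-without-hive
HIVE_HOME=/opt/workspace/hive-3.1.0

# Symlink each required jar from Spark's jars/ directory into Hive's lib/.
for j in scala-library spark-core spark-network-common; do
  ln -s "$SPARK_HOME"/jars/${j}*.jar "$HIVE_HOME"/lib/
done
```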

Also add the MySQL JDBC driver:

cp mysql-connector-java-8.0.12.jar /opt/workspace/hive-3.1.0/lib/

3. Configure the hive-site.xml file

First create a temporary directory /opt/workspace/tmp-hive on the master machine, then generate hive-site.xml from the template in /opt/workspace/hive-3.1.0/conf:

cp hive-default.xml.template hive-site.xml

In hive-site.xml, replace every ${system:java.io.tmpdir} with /opt/workspace/tmp-hive, and every ${system:user.name} with root.
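The two global placeholder replacements can be scripted with sed instead of being done by hand in an editor (paths are this article's; note the escaped `$` so the placeholder is matched literally):

```shell
cd /opt/workspace/hive-3.1.0/conf
# '|' as the sed delimiter avoids clashing with the '/' in the replacement paths.
sed -i 's|\${system:java.io.tmpdir}|/opt/workspace/tmp-hive|g' hive-site.xml
sed -i 's|\${system:user.name}|root|g' hive-site.xml
```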

<!--jdbc -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://10.2.4.60:3306/hive1?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
<!--spark engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
<!-- appended at the end of the file -->
<property> 
    <name>hive.enable.spark.execution.engine</name> 
    <value>true</value> 
</property>
<property> 
    <name>spark.master</name> 
<!--    <value>spark://master:7077</value> -->
    <value>yarn-cluster</value>
</property>
<!--    <property> 
    <name>spark.submit.deployMode</name> 
    <value>client</value> 
</property>
-->
<property> 
    <name>spark.serializer</name> 
    <value>org.apache.spark.serializer.KryoSerializer</value> 
</property>
<property> 
    <name>spark.eventLog.enabled</name> 
    <value>true</value> 
</property>
<property> 
    <name>spark.eventLog.dir</name> 
    <value>hdfs://master:9000/directory</value> 
</property>
<property>
    <name>spark.executor.instances</name>
    <value>3</value>
</property>
<property>
    <name>spark.executor.cores</name>
    <value>5</value>
</property>
<property> 
    <name>spark.executor.memory</name> 
    <value>5120m</value>
</property>
<property>
    <name>spark.driver.cores</name>
    <value>2</value>
</property>
<property>
    <name>spark.driver.memory</name>
    <value>4096m</value>
</property>

4.配置hive-env.sh文件

cp hive-env.sh.template hive-env.sh
export HADOOP_HEAPSIZE=4096
export HADOOP_HOME=/opt/workspace/hadoop-2.9.1
export HIVE_CONF_DIR=/opt/workspace/hive-3.1.0/conf
export HIVE_AUX_JARS_PATH=/opt/workspace/hive-3.1.0/lib

5.初始化hive

初始化

[root@master1 bin]# schematool -initSchema -dbType mysql
Metastore connection URL:        jdbc:mysql://10.2.4.60:3306/hive1?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=UTF-8
Metastore Connection Driver :    com.mysql.cj.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.mysql.sql
Initialization script completed
schemaTool completed

Output like the above means initialization succeeded.

6. Start Hive

[root@master1 hive-3.1.0]# cd bin
[root@master1 bin]# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/workspace/hbase-1.4.6/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/workspace/hadoop-2.9.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-08-23 20:15:27,108 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Hive Session ID = 5d68f66b-dc5a-4f3b-b080-0a4dd57a5a16

Logging initialized using configuration in jar:file:/opt/workspace/hive-3.1.0/lib/hive-common-3.1.0.jar!/hive-log4j2.properties Async: true
Hive Session ID = 03da6ad2-295f-4264-b2e4-b955fe975f2a
hive>

Hive is up.

Errors

1. Hive initialization failed

[root@master1 bin]# schematool -initSchema -dbType mysql
Metastore connection URL:        jdbc:mysql://10.2.4.60:3306/hive1?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=UTF-8
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       root
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
Underlying cause: java.sql.SQLException : Access denied for user 'root'@'master1' (using password: YES)
SQL Error code: 1045
Use --verbose for detailed stacktrace.
*** schemaTool failed ***

Error 1: This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'.
Fix: since the newest JDBC driver is installed, change the JDBC driver name in hive-site.xml to com.mysql.cj.jdbc.Driver:

    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>

Error 2: Access denied for user 'root'@'master1' (using password: YES)
Fix: grant privileges.
Allow user root to connect from any IP:

mysql> grant all privileges on *.* to root@"%" identified by '123456';
Query OK, 0 rows affected (0.00 sec)

Allow user root to connect locally and from master1:

mysql> grant all privileges on *.* to root@"localhost" identified by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to root@"master1" identified by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

Initialize again:

[root@master1 bin]# schematool -initSchema -dbType mysql
Metastore connection URL:        jdbc:mysql://10.2.4.60:3306/hive1?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=UTF-8
Metastore Connection Driver :    com.mysql.cj.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.mysql.sql
Initialization script completed
schemaTool completed

Output like the above means initialization succeeded.
Error 3: jar conflict
Fix: delete log4j-slf4j-impl-2.10.0.jar from $HIVE_HOME/lib.
Do not delete Hadoop's copy, or remotely starting Hadoop via the start-all.sh shell script will fail with a missing log4j class.

2. ZooKeeper startup errors

If some nodes fail to start:

[root@master1 bin]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/workspace/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

The error in zookeeper.out:

2018-08-23 16:53:02,147 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCnxManager@584] - Cannot open channel to 4 at election address slave2/10.2.4.63:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:534)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:454)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:435)
    at java.lang.Thread.run(Thread.java:748)

After starting all nodes, wait a while and check again. If the error persists:
1. Check whether ZooKeeper's port 2181 is already in use:

[root@master1 hive-3.1.0]# netstat -apn | grep 2181
tcp6       0      0 :::2181                 :::*                    LISTEN      9296/java

If port 2181 is occupied, kill the owning process with kill -9 <PID>, using the process ID from the netstat output (9296 above), not the port number.
Try starting again.

2. If the steps above still do not solve the problem, go to the /opt/workspace/zookeeper/data directory, which contains files like the following:

[root@master1 data]# ls
myid  version-2  zookeeper_server.pid

Delete the version-2 directory and the zookeeper_server.pid file.
Try starting again.
