Hadoop + Hudi Integration Deployment Guide
Official site: https://hudi.apache.org/
1. Environment Preparation
Version | Download URL
---|---
hadoop 2.7.3 | https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
spark 2.4.4 | https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
hive 2.3.1 | http://archive.apache.org/dist/hive/hive-2.3.1/apache-hive-2.3.1-bin.tar.gz
presto 0.217 | https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.217/presto-server-0.217.tar.gz
presto-cli-0.217-executable.jar | https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.217/presto-cli-0.217-executable.jar
1. Server specifications
IP | Hostname | Spec
---|---|---
10.0.6.183 | hudi-test-183 | 4c 21g
10.0.6.184 | hudi-test-184 | 4c 21g
10.0.6.185 | hudi-test-185 | 4c 21g
10.0.6.186 | hudi-test-186 | 4c 21g
10.0.6.187 | hudi-test-187 | 4c 21g
2. Cluster layout
Role | 10.0.6.183 | 10.0.6.184 | 10.0.6.185 | 10.0.6.186 | 10.0.6.187
---|---|---|---|---|---
HDFS | NN, DN | DN | 2NN, DN | DN | DN
YARN | NM | NM | NM, RM | NM | NM
Additional prerequisites on every node (a minimal setup sketch follows this list):
JDK 1.8+
Time synchronization
Passwordless SSH between the nodes
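A minimal sketch for the last two prerequisites, assuming a CentOS 7 host, that the five hostnames above are resolvable, and that ntpdate is acceptable for time sync; adjust to your environment:
# Time synchronization (any reachable NTP server will do; chronyd is an equally valid choice)
yum install -y ntpdate
ntpdate ntp.aliyun.com
# Passwordless SSH from the node where the start scripts are run
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in hudi-test-183 hudi-test-184 hudi-test-185 hudi-test-186 hudi-test-187; do
  ssh-copy-id root@$host
done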
2. Hadoop Configuration
1. Extract the archive:
tar -xvf hadoop-2.7.3.tar.gz
2. Configure environment variables (append to /etc/profile):
##HADOOP_HOME
export HADOOP_HOME=/opt/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
3. Reload the environment:
source /etc/profile
4. Verify:
hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /opt/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
2.1 HDFS Configuration
cd $HADOOP_HOME/etc/hadoop
-rw-r--r-- 1 root root 4436 Aug 18 2016 capacity-scheduler.xml
-rw-r--r-- 1 root root 1335 Aug 18 2016 configuration.xsl
-rw-r--r-- 1 root root 318 Aug 18 2016 container-executor.cfg
-rw-r--r-- 1 root root 774 Aug 18 2016 core-site.xml
-rw-r--r-- 1 root root 3589 Aug 18 2016 hadoop-env.cmd
-rw-r--r-- 1 root root 4224 Aug 18 2016 hadoop-env.sh
-rw-r--r-- 1 root root 2598 Aug 18 2016 hadoop-metrics2.properties
-rw-r--r-- 1 root root 2490 Aug 18 2016 hadoop-metrics.properties
-rw-r--r-- 1 root root 9683 Aug 18 2016 hadoop-policy.xml
-rw-r--r-- 1 root root 775 Aug 18 2016 hdfs-site.xml
-rw-r--r-- 1 root root 1449 Aug 18 2016 httpfs-env.sh
-rw-r--r-- 1 root root 1657 Aug 18 2016 httpfs-log4j.properties
-rw-r--r-- 1 root root 21 Aug 18 2016 httpfs-signature.secret
-rw-r--r-- 1 root root 620 Aug 18 2016 httpfs-site.xml
-rw-r--r-- 1 root root 3518 Aug 18 2016 kms-acls.xml
-rw-r--r-- 1 root root 1527 Aug 18 2016 kms-env.sh
-rw-r--r-- 1 root root 1631 Aug 18 2016 kms-log4j.properties
-rw-r--r-- 1 root root 5511 Aug 18 2016 kms-site.xml
-rw-r--r-- 1 root root 11237 Aug 18 2016 log4j.properties
-rw-r--r-- 1 root root 931 Aug 18 2016 mapred-env.cmd
-rw-r--r-- 1 root root 1383 Aug 18 2016 mapred-env.sh
-rw-r--r-- 1 root root 4113 Aug 18 2016 mapred-queues.xml.template
-rw-r--r-- 1 root root 758 Aug 18 2016 mapred-site.xml.template
-rw-r--r-- 1 root root 10 Aug 18 2016 slaves
-rw-r--r-- 1 root root 2316 Aug 18 2016 ssl-client.xml.example
-rw-r--r-- 1 root root 2268 Aug 18 2016 ssl-server.xml.example
-rw-r--r-- 1 root root 2191 Aug 18 2016 yarn-env.cmd
-rw-r--r-- 1 root root 4567 Aug 18 2016 yarn-env.sh
-rw-r--r-- 1 root root 690 Aug 18 2016 yarn-site.xml
2.1.1 Modify hadoop-env.sh
vim hadoop-env.sh
Point HDFS at the JDK by setting:
export JAVA_HOME=/opt/jdk/jdk1.8.0_221
2.1.2 Modify core-site.xml
Specify the NameNode address and the runtime data directory:
<configuration>
<!-- Address of the HDFS NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hudi-test-183:9000</value>
</property>
<!-- Directory for files generated at Hadoop runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-2.7.3/data/tmp</value>
</property>
</configuration>
Default core-site.xml settings:
https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-common/core-default.xml
2.1.3 Modify hdfs-site.xml
Specify the SecondaryNameNode host:
<configuration>
<!-- SecondaryNameNode host -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hudi-test-185:50090</value>
</property>
<!-- Replication factor -->
<property>
<name>dfs.replication</name>
<value>5</value>
</property>
</configuration>
Default hdfs-site.xml settings:
https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
2.1.4 Modify slaves
Specify the DataNode worker nodes (vim slaves; one host per line):
hudi-test-183
hudi-test-184
hudi-test-185
hudi-test-186
hudi-test-187
2.2 MapReduce Cluster Configuration
2.2.1 Modify mapred-env.sh
Specify the JDK path for MapReduce: vim mapred-env.sh
Append the following as the last line:
export JAVA_HOME=/opt/jdk/jdk1.8.0_221
2.2.2 Modify mapred-site.xml
Run MapReduce on the YARN resource scheduler:
mv mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
<!-- Run MR on YARN -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Default mapred-site.xml settings:
https://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
2.3 YARN Cluster Configuration
2.3.1 Modify yarn-env.sh
Add the following (around line 24):
export JAVA_HOME=/opt/jdk/jdk1.8.0_221
2.3.2 Modify yarn-site.xml
Specify the ResourceManager master node:
<configuration>
<!-- Address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hudi-test-185</value>
</property>
<!-- How reducers fetch data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Default yarn-site.xml settings:
https://hadoop.apache.org/docs/r2.9.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
The NodeManager nodes come from the same slaves file modified earlier.
Note:
By default the Hadoop installation directory is owned by 501:dialout, while the cluster is operated as the VM's root user. To avoid permission confusion, change the owner and group of the installation directory:
chown -R root:root /opt/hadoop-2.7.3
Distribute the hadoop-2.7.3 directory to the other nodes (a loop sketch follows the individual commands):
scp -r hadoop-2.7.3 [email protected]:/opt/
scp -r hadoop-2.7.3 [email protected]:/opt/
scp -r hadoop-2.7.3 [email protected]:/opt/
scp -r hadoop-2.7.3 [email protected]:/opt/
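The same distribution can be scripted; a sketch assuming passwordless SSH to the other four nodes and that the HADOOP_HOME entries added earlier live in /etc/profile (pushing /etc/profile wholesale overwrites any node-specific settings, so adapt as needed):
for host in hudi-test-184 hudi-test-185 hudi-test-186 hudi-test-187; do
  scp -r /opt/hadoop-2.7.3 root@$host:/opt/
  scp /etc/profile root@$host:/etc/profile   # push the environment variables as well
  ssh root@$host "source /etc/profile && hadoop version | head -1"
done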
3. Starting the Cluster
3.1 Start HDFS
The first time HDFS starts, the NameNode must be formatted.
On hudi-test-183 run:
hadoop namenode -format
The format succeeded if the log contains the following line (about six lines from the end):
INFO common.Storage: Storage directory /opt/hadoop-2.7.3/data/tmp/dfs/name has been successfully formatted.
Start HDFS:
$HADOOP_HOME/sbin/start-dfs.sh
Check the processes with jps (a cluster-wide check follows the sample output):
20321 NameNode
20689 Jps
20452 DataNode
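To confirm that every node came up with the roles from the planning table, a quick loop over all hosts works too. This is a sketch assuming passwordless SSH and that jps is on the remote non-interactive PATH (otherwise call $JAVA_HOME/bin/jps explicitly):
for host in hudi-test-183 hudi-test-184 hudi-test-185 hudi-test-186 hudi-test-187; do
  echo "== $host =="
  ssh root@$host "jps | grep -v Jps"
done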
3.2 Start YARN
On hudi-test-185 run:
$HADOOP_HOME/sbin/start-yarn.sh
Note: the NameNode and the ResourceManager are not on the same machine here, so YARN must not be started from the NameNode; start it on the machine that hosts the ResourceManager (hudi-test-185).
3.3 Cluster tests
1. A simple HDFS test
hdfs dfs -mkdir -p /test/input
# Create a file in the local home directory
cd /root
vim test.txt
hello hdfs
# Upload the local file to HDFS
hdfs dfs -put /root/test.txt /test/input
# Download the file from HDFS back to the local filesystem
hdfs dfs -get /test/input/test.txt
2. A first MapReduce job
Create a wcinput directory under the HDFS root:
hdfs dfs -mkdir /wcinput
touch wc.txt
hadoop mapreduce yarn
hdfs hadoop mapreduce
mapreduce yarn boomlee
boomlee
boomlee
Upload it to HDFS:
hdfs dfs -put wc.txt /wcinput
cd into $HADOOP_HOME and run:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /wcinput /wcoutput
Check the result:
hdfs dfs -cat /wcoutput/part-r-00000
boomlee 3
hadoop 2
hdfs 1
mapreduce 3
yarn 2
3.4 Configure the history server
Logs of jobs that ran on YARN are not otherwise viewable; to inspect the history of finished applications, configure the JobHistory server as follows:
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hudi-test-183:10020</value>
</property>
<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hudi-test-183:19888</value>
</property>
Distribute mapred-site.xml to the other four nodes:
scp -r mapred-site.xml [email protected]:/opt/hadoop-2.7.3/etc/hadoop
Start the history server:
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
Run jps to check; the process is named JobHistoryServer.
View the JobHistory web UI (a curl smoke test follows the URL):
http://10.0.6.183:19888/jobhistory
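A quick command-line smoke test of the web UI, assuming curl is installed (an HTTP code of 200 means the page is being served):
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.6.183:19888/jobhistory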
4. Hive Deployment
Hive website: http://hive.apache.org
Downloads: http://archive.apache.org/dist/hive/
Documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
4.1 Install MySQL
Hive stores its metadata in MySQL; this guide uses MySQL 5.7.26. Installation steps:
1. Remove MariaDB
# Check whether mariadb is installed
rpm -aq | grep mariadb
# Remove mariadb; -e removes the package, --nodeps skips dependency checks
rpm -e --nodeps mariadb-libs
2. Install dependencies
yum install perl -y
yum install net-tools -y
3. Install MySQL
1. Download the yum repository package:
wget -i -c http://dev.mysql.com/get/mysql57-community-release-el7-10.noarch.rpm
2. Install with yum:
yum -y install mysql57-community-release-el7-10.noarch.rpm
yum -y install mysql-community-server
3. Start MySQL:
systemctl start mysqld.service
4. Find the default root password:
grep "password" /var/log/mysqld.log
5. Log in to MySQL:
mysql -uroot -p
Relax the password policy:
set global validate_password_policy=LOW;
set global validate_password_length=6;
Change the root password:
ALTER USER 'root'@'localhost' IDENTIFIED BY '123456';
6. Enable remote access to MySQL
% matches any host; the password is 123456:
grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
7. Create the hive user:
CREATE USER 'hive'@'%' IDENTIFIED BY '123456';
GRANT ALL ON *.* TO 'hive'@'%';
FLUSH PRIVILEGES;
exit;
4.2 Install Hive
Installation steps:
1. Download, upload, and extract the archive
2. Set environment variables
3. Modify the Hive configuration
4. Copy the JDBC driver
5. Initialize the metastore database
4.2.1 Extract and rename
tar -xvf apache-hive-2.3.1-bin.tar.gz
mv apache-hive-2.3.1-bin apache-hive-2.3.1
4.2.2 Set environment variables
vim /etc/profile
##HIVE
export HIVE_HOME=/opt/apache-hive-2.3.1
export PATH=$PATH:$HIVE_HOME/bin
Reload the environment:
source /etc/profile
4.2.3 Modify the Hive configuration
cd $HIVE_HOME/conf
vi hive-site.xml
Add the following:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Where Hive metadata is stored -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hudi-test-183:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<!-- JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- Metastore database username -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<!-- Metastore database password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
4.2.4 Copy the MySQL JDBC driver
Copy mysql-connector-java-5.1.*.jar into $HIVE_HOME/lib.
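For example (the connector version shown here, 5.1.49, is only an illustration; use whichever 5.1.x jar you actually downloaded):
# Copy the JDBC driver into Hive's lib directory; adjust the file name to your download
cp mysql-connector-java-5.1.49.jar $HIVE_HOME/lib/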
4.2.5 Initialize the metastore database
Run:
schematool -dbType mysql -initSchema
The output should look like:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.1/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://hudi-test-183:3306/hivemetadata?createDatabaseIfNotExist=true&useSSL=false
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
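As an optional check, assuming the mysql client is available on this node, the metastore tables should now exist in the hivemetadata database:
mysql -uhive -p123456 -e "USE hivemetadata; SHOW TABLES;" | head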
4.2.6 Start Hive
Start the HDFS and YARN services before starting Hive.
hive
Then run:
show functions;
Output like the following indicates Hive is working:
OK
!
!=
$sum0
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
aes_decrypt
4.2.7 Hive property configuration
1. Warehouse location
vim $HIVE_HOME/conf/hive-site.xml
<property>
<!-- Default data warehouse location on HDFS -->
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
2. Show the current database
<property>
<!-- Show the current database in the CLI prompt -->
<name>hive.cli.print.current.db</name>
<value>true</value>
<description>Whether to include the current database in the Hive prompt.</description>
</property>
3. Show column headers
<property>
<!-- Print column headers in the CLI -->
<name>hive.cli.print.header</name>
<value>true</value>
</property>
By default, Hive logs are written under /tmp/root. To change the log directory:
vim $HIVE_HOME/conf/hive-log4j2.properties
Add the following line:
property.hive.log.dir = /opt/apache-hive-2.3.1/logs
5. Spark
Website: http://spark.apache.org/
Documentation: http://spark.apache.org/docs/latest/
Downloads: http://spark.apache.org/downloads.html
Download the Spark package from the archive:
https://archive.apache.org/dist/spark/
5.1 Extract
tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
Rename:
mv spark-2.4.4-bin-hadoop2.7 spark-2.4.4
5.2 Configure environment variables
vim /etc/profile
##SPARK
export SPARK_HOME=/opt/spark-2.4.4
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Reload the environment:
source /etc/profile
5.3 Modify the configuration
The files live in $SPARK_HOME/conf.
Files to modify:
slaves
spark-defaults.conf
spark-env.sh
log4j.properties
1. Modify slaves
vim $SPARK_HOME/conf/slaves
hudi-test-183
hudi-test-184
hudi-test-185
hudi-test-186
hudi-test-187
2. Modify spark-defaults.conf
vim $SPARK_HOME/conf/spark-defaults.conf
spark.master spark://hudi-test-183:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hudi-test-183:9000/spark-eventlog
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 512m
Notes:
spark.master: the master node (default port 7077)
spark.eventLog.enabled: enable the event log
spark.eventLog.dir: where event logs are stored
spark.serializer: an efficient serializer
spark.driver.memory: driver memory (default 1g)
Create the HDFS directory for event logs: hdfs dfs -mkdir /spark-eventlog
3. Modify spark-env.sh
vim $SPARK_HOME/conf/spark-env.sh
export JAVA_HOME=/opt/jdk/jdk1.8.0_221
export HADOOP_HOME=/opt/hadoop-2.7.3
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.7.3/bin/hadoop classpath)
export SPARK_MASTER_HOST=hudi-test-183
export SPARK_MASTER_PORT=7077
Note:
This guide uses spark-2.4.4-bin-hadoop2.7.tgz. If you use the without-hadoop build instead, you must tell Spark where the Hadoop jars are, as done above via SPARK_DIST_CLASSPATH.
4. Distribute spark to the other nodes and update their environment variables:
scp -r spark-2.4.4/ hudi-test-184:$PWD
scp -r spark-2.4.4/ hudi-test-185:$PWD
scp -r spark-2.4.4/ hudi-test-186:$PWD
scp -r spark-2.4.4/ hudi-test-187:$PWD
5. Start the cluster
sh $SPARK_HOME/sbin/start-all.sh
Test the cluster (run-example is in $SPARK_HOME/bin):
run-example SparkPi 10
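The same example can also be submitted explicitly with spark-submit against the standalone master configured above, which mirrors the YARN test used later:
spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://hudi-test-183:7077 \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100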
6. Spark on YARN Configuration
Reference: http://spark.apache.org/docs/latest/running-on-yarn.html
Required services: HDFS and YARN.
Stop the Standalone services first (the cluster's Master and Worker processes); the two cluster managers should not run side by side.
Stop the Master and Workers:
sbin/stop-all.sh
1. Modify the YARN configuration
Edit yarn-site.xml:
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
- yarn.nodemanager.pmem-check-enabled: whether to run a thread that checks the physical memory used by each task and kills any task that exceeds its allocation; default is true.
- yarn.nodemanager.vmem-check-enabled: whether to run a thread that checks the virtual memory used by each task and kills any task that exceeds its allocation; default is true.
Distribute the file to the other nodes and restart YARN:
scp -r yarn-site.xml hudi-test-184:$PWD
scp -r yarn-site.xml hudi-test-185:$PWD
scp -r yarn-site.xml hudi-test-186:$PWD
scp -r yarn-site.xml hudi-test-187:$PWD
Restart the YARN service on the 185 server:
./stop-yarn.sh
./start-yarn.sh
2. Modify the Spark configuration
# This entry must be present in spark-env.sh
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
# spark-defaults.conf (the following entries are optimizations)
# Integrate with the Hadoop history server
spark.yarn.historyServer.address hudi-test-183:18080
spark.history.ui.port 18080
# Add (optimization)
spark.yarn.jars hdfs:///spark-yarn/jars/*.jar
# Upload the jars under $SPARK_HOME/jars to HDFS
hdfs dfs -mkdir -p /spark-yarn/jars/
cd $SPARK_HOME/jars
hdfs dfs -put * /spark-yarn/jars/
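A quick check that the upload worked (the HDFS file count should match the number of jars in the local directory):
# Output of hdfs dfs -count is: DIR_COUNT FILE_COUNT CONTENT_SIZE PATH
hdfs dfs -count /spark-yarn/jars/
ls $SPARK_HOME/jars | wc -l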
Distribute spark-defaults.conf to the other nodes:
scp -r spark-defaults.conf hudi-test-184:$PWD
Restart (or start) the Spark history server from $SPARK_HOME/sbin; by default it reads event logs from the local /tmp/spark-events directory:
stop-history-server.sh
start-history-server.sh
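Note that spark.eventLog.dir above points at HDFS while the history server reads local /tmp/spark-events by default; if the history UI should show the HDFS event logs, the two can be aligned in spark-defaults.conf. This is an optional tweak, not part of the original steps:
# Point the history server at the same HDFS directory used for event logs
spark.history.fs.logDirectory hdfs://hudi-test-183:9000/spark-eventlog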
YARN web UI: http://10.0.6.185:8088/cluster
Test:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100
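For a variant closer to production, the same job can be submitted in cluster deploy mode, where the driver itself also runs inside YARN:
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100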