Hadoop + Hudi Integration Deployment Guide

Official Hudi site: https://hudi.apache.org/

1. Environment Preparation

Component / Version                Download URL
hadoop 2.7.3                       https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
spark 2.4.4                        https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
hive 2.3.1                         http://archive.apache.org/dist/hive/hive-2.3.1/apache-hive-2.3.1-bin.tar.gz
presto 0.217                       https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.217/presto-server-0.217.tar.gz
presto-cli-0.217-executable.jar    https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.217/presto-cli-0.217-executable.jar

1. Server specifications

IP           Hostname        Spec
10.0.6.183   hudi-test-183   4c 21g
10.0.6.184   hudi-test-184   4c 21g
10.0.6.185   hudi-test-185   4c 21g
10.0.6.186   hudi-test-186   4c 21g
10.0.6.187   hudi-test-187   4c 21g

2. Service layout

        10.0.6.183   10.0.6.184   10.0.6.185   10.0.6.186   10.0.6.187
HDFS    NN, DN       DN           2NN, DN      DN           DN
YARN    NM           NM           RM, NM       NM           NM

JDK 1.8+

Time synchronization across all nodes

Passwordless SSH between all nodes (a short sketch of these last two follows)
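A minimal sketch of the time-sync and passwordless-SSH prerequisites, assuming CentOS 7, the root user, and that the key is generated on hudi-test-183 (the NTP server below is only an example):

# Time synchronization: run on every node
yum install -y ntpdate
ntpdate ntp.aliyun.com

# Passwordless SSH: generate a key on hudi-test-183 and copy it to every node
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for host in hudi-test-183 hudi-test-184 hudi-test-185 hudi-test-186 hudi-test-187; do
    ssh-copy-id root@$host
done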

2. Hadoop Configuration

1. Extract

tar -xvf hadoop-2.7.3.tar.gz

2. Configure environment variables (append to /etc/profile)

##HADOOP_HOME
export HADOOP_HOME=/opt/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

3. Apply the environment variables

source /etc/profile

4. Verify

hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /opt/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar

2.1 HDFS 配置

cd $HADOOP_HOME/etc/hadoop
ls -l
-rw-r--r-- 1 root root  4436 Aug 18  2016 capacity-scheduler.xml
-rw-r--r-- 1 root root  1335 Aug 18  2016 configuration.xsl
-rw-r--r-- 1 root root   318 Aug 18  2016 container-executor.cfg
-rw-r--r-- 1 root root   774 Aug 18  2016 core-site.xml
-rw-r--r-- 1 root root  3589 Aug 18  2016 hadoop-env.cmd
-rw-r--r-- 1 root root  4224 Aug 18  2016 hadoop-env.sh
-rw-r--r-- 1 root root  2598 Aug 18  2016 hadoop-metrics2.properties
-rw-r--r-- 1 root root  2490 Aug 18  2016 hadoop-metrics.properties
-rw-r--r-- 1 root root  9683 Aug 18  2016 hadoop-policy.xml
-rw-r--r-- 1 root root   775 Aug 18  2016 hdfs-site.xml
-rw-r--r-- 1 root root  1449 Aug 18  2016 httpfs-env.sh
-rw-r--r-- 1 root root  1657 Aug 18  2016 httpfs-log4j.properties
-rw-r--r-- 1 root root    21 Aug 18  2016 httpfs-signature.secret
-rw-r--r-- 1 root root   620 Aug 18  2016 httpfs-site.xml
-rw-r--r-- 1 root root  3518 Aug 18  2016 kms-acls.xml
-rw-r--r-- 1 root root  1527 Aug 18  2016 kms-env.sh
-rw-r--r-- 1 root root  1631 Aug 18  2016 kms-log4j.properties
-rw-r--r-- 1 root root  5511 Aug 18  2016 kms-site.xml
-rw-r--r-- 1 root root 11237 Aug 18  2016 log4j.properties
-rw-r--r-- 1 root root   931 Aug 18  2016 mapred-env.cmd
-rw-r--r-- 1 root root  1383 Aug 18  2016 mapred-env.sh
-rw-r--r-- 1 root root  4113 Aug 18  2016 mapred-queues.xml.template
-rw-r--r-- 1 root root   758 Aug 18  2016 mapred-site.xml.template
-rw-r--r-- 1 root root    10 Aug 18  2016 slaves
-rw-r--r-- 1 root root  2316 Aug 18  2016 ssl-client.xml.example
-rw-r--r-- 1 root root  2268 Aug 18  2016 ssl-server.xml.example
-rw-r--r-- 1 root root  2191 Aug 18  2016 yarn-env.cmd
-rw-r--r-- 1 root root  4567 Aug 18  2016 yarn-env.sh
-rw-r--r-- 1 root root   690 Aug 18  2016 yarn-site.xml

2.1.1 Edit hadoop-env.sh

vim hadoop-env.sh

Point HDFS at the JDK installation by setting:

export JAVA_HOME=/opt/jdk/jdk1.8.0_221

2.1.2 Edit core-site.xml

Specify the NameNode address and the directory for Hadoop runtime data (edit core-site.xml):

<configuration>
<!-- Address of the HDFS NameNode -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hudi-test-183:9000</value>
</property>
<!-- Directory for files generated at Hadoop runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-2.7.3/data/tmp</value>
</property>
</configuration>

Default core-site.xml settings:

https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-common/core-default.xml

2.1.3 Edit hdfs-site.xml

Specify the SecondaryNameNode and the replication factor (edit hdfs-site.xml):

<configuration>
<!-- Host of the SecondaryNameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hudi-test-185:50090</value>
</property>
<!-- Replication factor -->
<property>
    <name>dfs.replication</name>
    <value>5</value>
</property>
</configuration>

Official default settings:

https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

2.1.4 Edit slaves

Specify the DataNode worker nodes (vim slaves; one hostname per line):

hudi-test-183
hudi-test-184
hudi-test-185
hudi-test-186
hudi-test-187

2.2 MapReduce Configuration

2.2.1 Edit mapred-env.sh

Specify the JDK path used by MapReduce: run vim mapred-env.sh and append the following line at the end:

export JAVA_HOME=/opt/jdk/jdk1.8.0_221

2.2.2 Edit mapred-site.xml

Configure MapReduce to run on the YARN resource scheduler:

mv mapred-site.xml.template mapred-site.xml
vim mapred-site.xml

<!-- Run MapReduce on YARN -->
<configuration>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
</configuration>

Default mapred-site.xml settings:

https://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

2.3 YARN Configuration

2.3.1 Edit yarn-env.sh

Add the following (around line 24):

export JAVA_HOME=/opt/jdk/jdk1.8.0_221

2.3.2 Edit yarn-site.xml

Specify the ResourceManager master node:

<configuration>
<!-- Address of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hudi-test-185</value>
</property>
<!-- How reducers fetch data (shuffle service) -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
</configuration>

Default yarn-site.xml settings:

https://hadoop.apache.org/docs/r2.9.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

The NodeManager nodes come from the slaves file, which has already been edited above.

Note:

The extracted Hadoop directory may be owned by a default user and group (e.g. 501:dialout), while the cluster here is operated as the root user. To avoid ownership confusion, change the owner and group of the Hadoop installation directory:

chown -R root:root /opt/hadoop-2.7.3

Distribute the hadoop-2.7.3 directory to the other nodes:

scp -r hadoop-2.7.3 root@10.0.6.184:/opt/
scp -r hadoop-2.7.3 root@10.0.6.185:/opt/
scp -r hadoop-2.7.3 root@10.0.6.186:/opt/
scp -r hadoop-2.7.3 root@10.0.6.187:/opt/
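Equivalently, the copy can be scripted; a sketch assuming the same four target nodes and root SSH access:

for ip in 10.0.6.184 10.0.6.185 10.0.6.186 10.0.6.187; do
    scp -r /opt/hadoop-2.7.3 root@$ip:/opt/
done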

3. Start the Cluster

3.1 Start HDFS

The first time HDFS is started, the NameNode must be formatted.

Run on hudi-test-183:

hadoop namenode -format

If the log contains the following line (about six lines from the end), the format succeeded:

INFO common.Storage: Storage directory /opt/hadoop-2.7.3/data/tmp/dfs/name has been successfully formatted.

Start HDFS:

$HADOOP_HOME/sbin/start-dfs.sh

Check with jps:

20321 NameNode
20689 Jps
20452 DataNode
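To confirm that all five DataNodes registered with the NameNode, the cluster report and the NameNode web UI (port 50070 on Hadoop 2.x) can be checked; for example:

hdfs dfsadmin -report | grep -E 'Live datanodes|^Name:'
# NameNode web UI: http://10.0.6.183:50070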

3.2 Start YARN

Run on hudi-test-185 (the node where the ResourceManager runs):

$HADOOP_HOME/sbin/start-yarn.sh

Note: the NameNode and the ResourceManager are not on the same machine here, so YARN must not be started on the NameNode; start it on the node where the ResourceManager runs (hudi-test-185).
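Given the layout in section 1, jps on hudi-test-185 should show roughly the following once HDFS and YARN are both up (process names only; PIDs will differ):

jps
# Expected on hudi-test-185: ResourceManager, NodeManager, DataNode, SecondaryNameNode, Jps
# The other nodes run NodeManager plus their HDFS roles (NameNode/DataNode)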

3.3 Cluster Tests

1. Basic HDFS test

hdfs dfs -mkdir -p /test/input
# Create a file in the local home directory
cd /root
vim test.txt        # file content: hello hdfs
# Upload the local file to HDFS
hdfs dfs -put /root/test.txt /test/input
# Download the file from HDFS back to the local filesystem
hdfs dfs -get /test/input/test.txt
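For a quick check, the uploaded file can also be listed and read back directly from HDFS:

hdfs dfs -ls /test/input
hdfs dfs -cat /test/input/test.txt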
2. A first MapReduce distributed computation

Create a wcinput directory under the HDFS root:

hdfs dfs -mkdir /wcinput

Create a local wc.txt with the following content:

touch wc.txt
vim wc.txt

hadoop mapreduce yarn
hdfs hadoop mapreduce
mapreduce yarn boomlee
boomlee
boomlee

Upload it to HDFS:

hdfs dfs -put wc.txt /wcinput

cd into $HADOOP_HOME and run:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /wcinput /wcoutput

View the result:

hdfs dfs -cat /wcoutput/part-r-00000
boomlee	3
hadoop	2
hdfs	1
mapreduce	3
yarn	2

3.4 Configure the Job History Server

Logs produced by jobs running on YARN are not viewable by default. To review the history of completed applications, configure the job history server as follows:

vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<!-- Job history server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hudi-test-183:10020</value>
</property>
<!-- Job history server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hudi-test-183:19888</value>
</property>

Distribute mapred-site.xml to the other four nodes:

scp -r mapred-site.xml root@10.0.6.184:/opt/hadoop-2.7.3/etc/hadoop    # repeat for 185, 186, 187

Start the history server:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

Check with jps:

The process name is JobHistoryServer.

View the JobHistory web UI:

http://10.0.6.183:19888/jobhistory

4. Hive Deployment

Hive official site: http://hive.apache.org
Downloads: http://archive.apache.org/dist/hive/
Documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

4.1 MySQL Installation

Hive stores its metadata in MySQL (version 5.7.26 here). Installation steps:

1. Remove MariaDB

# Check whether mariadb is installed
rpm -aq | grep mariadb
# Remove mariadb: -e removes the package; --nodeps skips dependency checks
rpm -e --nodeps mariadb-libs

2. Install dependencies

yum install perl -y
yum install net-tools -y

3. Install MySQL

1. Download the repository package

wget -c http://dev.mysql.com/get/mysql57-community-release-el7-10.noarch.rpm

2. Install with yum

yum -y install mysql57-community-release-el7-10.noarch.rpm
yum -y install mysql-community-server

3. Start MySQL

systemctl start mysqld.service

4. Find the default root password

 grep "password" /var/log/mysqld.log

5. Log in to MySQL

mysql -uroot -p

Relax the password policy:

set global validate_password_policy=LOW;
set global validate_password_length=6;

Change the root password:

ALTER USER 'root'@'localhost' IDENTIFIED BY '123456';

6. Enable remote access to MySQL ('%' matches all IPs; the password is 123456)

grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;

7. Create the hive user

CREATE USER 'hive'@'%' IDENTIFIED BY '123456';
GRANT ALL ON *.* TO 'hive'@'%';
FLUSH PRIVILEGES;
exit;
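A quick sanity check that the hive account accepts remote (TCP) logins, assuming the hostname and password used above:

mysql -h hudi-test-183 -uhive -p123456 -e "SELECT VERSION();"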

4.2 Hive Installation

Installation steps:
1. Download, upload, and extract
2. Set environment variables
3. Edit the Hive configuration
4. Copy the JDBC driver
5. Initialize the metastore database

4.2.1 Extract and rename

tar -xvf apache-hive-2.3.1-bin.tar.gz
mv apache-hive-2.3.1-bin apache-hive-2.3.1

4.2.2 Set environment variables

vim /etc/profile

##HIVE
export HIVE_HOME=/opt/apache-hive-2.3.1
export PATH=$PATH:$HIVE_HOME/bin

Apply the environment variables:

source /etc/profile

4.2.3 Edit the Hive configuration

cd $HIVE_HOME/conf
vi hive-site.xml

Add the following:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- JDBC URL of the Hive metastore database -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hudi-test-183:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
<!-- JDBC driver class -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
<!-- Metastore database username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
    </property>
<!-- Metastore database password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

4.2.4 Copy the MySQL JDBC driver

Copy mysql-connector-java-5.1.*.jar into $HIVE_HOME/lib.
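For example, assuming the connector jar has already been downloaded into the current directory (the exact 5.1.x version may differ):

cp mysql-connector-java-5.1.*.jar $HIVE_HOME/lib/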

4.2.5 Initialize the metastore database

Run:

schematool -dbType mysql -initSchema

The log should look like this:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.1/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:	 jdbc:mysql://hudi-test-183:3306/hivemetadata?createDatabaseIfNotExist=true&useSSL=false
Metastore Connection Driver :	 com.mysql.jdbc.Driver
Metastore connection User:	 hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
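Optionally, confirm that the metastore tables were created in the hivemetadata database (a quick check using the credentials configured above):

mysql -uhive -p123456 -e "SHOW TABLES IN hivemetadata;" | head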

4.2.6 Start Hive

Before starting Hive, make sure the HDFS and YARN services are already running.

hive

Run:

show functions;

If output like the following appears, Hive is working:

OK
!
!=
$sum0
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
aes_decrypt
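As a further optional smoke test (the database and table names below are arbitrary examples), a small table can be created and queried from the shell; the COUNT query also exercises MapReduce on YARN:

hive -e "
CREATE DATABASE IF NOT EXISTS smoke_test;
CREATE TABLE IF NOT EXISTS smoke_test.t1 (id INT, name STRING);
INSERT INTO smoke_test.t1 VALUES (1, 'hudi');
SELECT COUNT(*) FROM smoke_test.t1;
"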

4.2.7 Hive Property Configuration

1. Warehouse data location

vim $HIVE_HOME/conf/hive-site.xml
<property>
<!-- Default data storage location (on HDFS) -->
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>

2. Show the current database in the CLI

<property>
<!-- Show the current database in the command-line prompt -->
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
</property>

3. Show column headers in query results

<property>
<!-- Print column headers in command-line output -->
    <name>hive.cli.print.header</name>
    <value>true</value>
</property>

Hive logs are written to /tmp/root by default. To change the log directory:

vim $HIVE_HOME/conf/hive-log4j2.properties

Add the following:

property.hive.log.dir = /opt/apache-hive-2.3.1/logs

5. Spark

Official site: http://spark.apache.org/
Documentation: http://spark.apache.org/docs/latest/
Downloads: http://spark.apache.org/downloads.html

Download the Spark package from:
https://archive.apache.org/dist/spark/

5.1 Extract

tar -xvf spark-2.4.4-bin-hadoop2.7.tgz

Rename:

mv spark-2.4.4-bin-hadoop2.7 spark-2.4.4

5.2 Configure environment variables

vim /etc/profile

##SPARK
export SPARK_HOME=/opt/spark-2.4.4
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply the environment variables:

source /etc/profile

5.3 Edit the Spark configuration

Configuration directory: $SPARK_HOME/conf
Files to edit:

slaves

spark-defaults.conf

spark-env.sh

log4j.properties

1. Edit slaves

vim $SPARK_HOME/conf/slaves
hudi-test-183
hudi-test-184
hudi-test-185
hudi-test-186
hudi-test-187

2. Edit spark-defaults.conf

vim $SPARK_HOME/conf/spark-defaults.conf
spark.master spark://hudi-test-183:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hudi-test-183:9000/spark-eventlog
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 512m

Notes:

spark.master: defines the master node; the default port is 7077
spark.eventLog.enabled: enables event logging
spark.eventLog.dir: where the event logs are stored
spark.serializer: an efficient serializer
spark.driver.memory: the driver memory size (default 1g)

Create the HDFS directory: hdfs dfs -mkdir /spark-eventlog

3. Edit spark-env.sh

vim $SPARK_HOME/conf/spark-env.sh
export JAVA_HOME=/opt/jdk/jdk1.8.0_221
export HADOOP_HOME=/opt/hadoop-2.7.3
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.7.3/bin/hadoop classpath)
export SPARK_MASTER_HOST=hudi-test-183
export SPARK_MASTER_PORT=7077

Note:

This guide uses spark-2.4.4-bin-hadoop2.7.tgz. If you use the spark-2.4.4-bin-without-hadoop build instead, you must tell Spark where the Hadoop jars are located (the SPARK_DIST_CLASSPATH line above does this).

4. Distribute Spark to the other nodes and update their environment variables

scp -r spark-2.4.4/ hudi-test-184:$PWD
scp -r spark-2.4.4/ hudi-test-185:$PWD
scp -r spark-2.4.4/ hudi-test-186:$PWD
scp -r spark-2.4.4/ hudi-test-187:$PWD

5. Start the cluster

sh $SPARK_HOME/sbin/start-all.sh

Test the cluster (run-example is under $SPARK_HOME/bin, which is already on the PATH):

run-example SparkPi 10
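The same example can also be submitted explicitly against the standalone master; a sketch (the examples jar path matches the spark-2.4.4 layout used for the YARN test later):

spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://hudi-test-183:7077 \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100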

6. Spark on YARN Configuration

Reference: http://spark.apache.org/docs/latest/running-on-yarn.html

Required services: HDFS and YARN.
The standalone services (the Master and Worker processes) must be stopped; the two cluster managers should not run side by side.

Stop the Master and Worker processes:

sbin/stop-all.sh

1. Edit the YARN configuration

Edit yarn-site.xml:

vim $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add:

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

  • yarn.nodemanager.pmem-check-enabled: whether a thread checks the physical memory each task uses and kills any task that exceeds its allocation; default true.
  • yarn.nodemanager.vmem-check-enabled: whether a thread checks the virtual memory each task uses and kills any task that exceeds its allocation; default true.

Distribute yarn-site.xml to the other nodes and restart YARN:

scp -r yarn-site.xml hudi-test-184:$PWD
scp -r yarn-site.xml hudi-test-185:$PWD
scp -r yarn-site.xml hudi-test-186:$PWD
scp -r yarn-site.xml hudi-test-187:$PWD

Restart the YARN services on hudi-test-185 (the ResourceManager node):

./stop-yarn.sh
./start-yarn.sh

2. Edit the Spark configuration

# This line is required in spark-env.sh
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop

# spark-defaults.conf (the settings below are optimizations)
# Integrate with the Spark history server
spark.yarn.historyServer.address hudi-test-183:18080
spark.history.ui.port 18080

# Add (optimization): read the Spark jars from HDFS
spark.yarn.jars hdfs:///spark-yarn/jars/*.jar

# Upload the jars under $SPARK_HOME/jars to HDFS
hdfs dfs -mkdir -p /spark-yarn/jars/
cd $SPARK_HOME/jars
hdfs dfs -put * /spark-yarn/jars/

Distribute spark-defaults.conf to the other nodes:

scp -r spark-defaults.conf hudi-test-184:$PWD    # repeat for 185, 186, 187

Restart/start the Spark history server (scripts under $SPARK_HOME/sbin):

# By default the history server reads event logs from /tmp/spark-events;
# point spark.history.fs.logDirectory at the HDFS event-log directory configured above if needed
stop-history-server.sh
start-history-server.sh

YARN web UI: http://10.0.6.185:8088/cluster

Run a test job on YARN:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100
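The same job can also be run in cluster deploy mode, with the driver running inside YARN; its status can then be followed from the YARN CLI or the web UI above. A sketch:

spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100

# List YARN applications and their final status
yarn application -list -appStates ALL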