Hadoop Cluster (CDH4) in Practice: (1) Hadoop (HDFS) Setup
Series contents
Hadoop Cluster (CDH4) in Practice: (0) Preface
Hadoop Cluster (CDH4) in Practice: (1) Hadoop (HDFS) Setup
Hadoop Cluster (CDH4) in Practice: (2) HBase & Zookeeper Setup
Hadoop Cluster (CDH4) in Practice: (3) Hive Setup
Hadoop Cluster (CDH4) in Practice: (4) Oozie Setup
Hadoop Cluster (CDH4) in Practice: (5) Sqoop Installation
This article
Hadoop Cluster (CDH4) in Practice: (1) Hadoop (HDFS) Setup
References
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/CDH4-Installation-Guide.html
Environment
OS: CentOS 6.4 x86_64
Servers:
hadoop-master: 172.17.20.230, 10 GB RAM
- namenode
hadoop-secondary: 172.17.20.234, 10 GB RAM
- secondarynamenode, jobtracker
hadoop-node-1: 172.17.20.231, 10 GB RAM
- datanode, tasktracker
hadoop-node-2: 172.17.20.232, 10 GB RAM
- datanode, tasktracker
hadoop-node-3: 172.17.20.233, 10 GB RAM
- datanode, tasktracker
A brief introduction to the roles above:
namenode - manages the namespace of the entire HDFS
secondarynamenode - performs periodic checkpoints of the namenode's metadata (often thought of as a backup for the namenode, but it is not a hot standby)
jobtracker - manages MapReduce jobs across the cluster
datanode - stores and serves HDFS data blocks
tasktracker - executes MapReduce tasks on a node
A convention used throughout this article, to avoid confusion when configuring multiple servers:
Any command that starts with a bare $ and no hostname must be executed on all servers, unless a trailing // comment or the step heading says otherwise.
1. Choosing the right packages
To make deploying the Hadoop cluster easier and more consistent, we use Cloudera's integrated packaging.
Cloudera has done a great deal of integration and tuning work on Hadoop and its related systems, which avoids many of the bugs caused by version mismatches between components.
This is also what many experienced Hadoop administrators recommend.
https://ccp.cloudera.com/display/DOC/Documentation/
2. Installing the Java environment
Hadoop is largely written in Java, so a JVM is required.
Log in at www.oracle.com (you will need to create an account) and download a 64-bit JDK, e.g. jdk-7u45-linux-x64.rpm, from:
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
$ sudo rpm -ivh jdk-7u45-linux-x64.rpm
$ sudo vim /etc/profile.d/java.sh
export JAVA_HOME=/usr/java/jdk1.7.0_45
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
$ sudo chmod +x /etc/profile.d/java.sh
$ source /etc/profile
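Before moving on, it is worth confirming that the new JDK is the one actually being picked up; the version string below matches the 7u45 package and will differ if you installed another build:
$ java -version
java version "1.7.0_45"
$ echo $JAVA_HOME
/usr/java/jdk1.7.0_45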
3. Configuring the Hadoop installation source
$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
$ cd /etc/yum.repos.d/
$ sudo wget http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo
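To verify that the repository is now visible to yum (the repo id comes from the downloaded cloudera-cdh4.repo file; check that file if the grep matches nothing):
$ yum repolist | grep -i cloudera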
4. Installing the Hadoop packages, with MRv1 framework support
$ sudo yum install hadoop-hdfs-namenode //only on hadoop-master
$ sudo yum install hadoop-hdfs-secondarynamenode //only on hadoop-secondary
$ sudo yum install hadoop-0.20-mapreduce-jobtracker //only on hadoop-secondary
$ sudo yum install hadoop-hdfs-datanode //only on the hadoop-node servers
$ sudo yum install hadoop-0.20-mapreduce-tasktracker //only on the hadoop-node servers
$ sudo yum install hadoop-client
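A quick way to double-check what ended up on each host; under CDH4, hadoop version should report a 2.0.0-cdh4.x build:
$ rpm -qa | grep hadoop
$ hadoop version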
5. Creating the Hadoop configuration files
$ sudo cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster
6. Activating the new configuration
$ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
$ cd /etc/hadoop/conf
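You can confirm which configuration directory is active at any time; after the commands above, /etc/hadoop/conf should point at conf.my_cluster:
$ sudo alternatives --display hadoop-conf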
7. Adding hosts entries and setting the matching hostnames
$ sudo vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.17.20.230 hadoop-master
172.17.20.234 hadoop-secondary
172.17.20.231 hadoop-node-1
172.17.20.232 hadoop-node-2
172.17.20.233 hadoop-node-3
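The step heading also calls for setting each server's hostname to match these entries. On CentOS 6 this is typically done as follows (shown here for hadoop-master; repeat with the matching name on each server):
$ sudo hostname hadoop-master
$ sudo vim /etc/sysconfig/network
HOSTNAME=hadoop-master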
8. Installing LZO support
$ cd /etc/yum.repos.d
$ sudo wget http://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo
$ sudo yum install hadoop-lzo-cdh4
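Installing the package only puts the LZO libraries in place; to actually use LZO compression, the codec classes are normally registered in core-site.xml (configured in step 11 below). A sketch of the usual two properties, assuming the class names shipped by the hadoop-lzo package:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>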
9. Configuring the files under hadoop/conf
$ sudo vim /etc/hadoop/conf/masters
hadoop-master
$ sudo vim /etc/hadoop/conf/slaves
hadoop-node-1
hadoop-node-2
hadoop-node-3
10. Creating the local directories for HDFS and MapReduce
$ sudo mkdir -p /data/{1,2,3,4}/mapred/local
$ sudo chown -R mapred:hadoop /data/{1,2,3,4}/mapred/local
$ sudo mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/ns /data/{1,2,3,4}/dfs/dn
$ sudo chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/ns /data/{1,2,3,4}/dfs/dn
$ sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/ns /data/{1,2,3,4}/dfs/dn
$ sudo mkdir /data/tmp
$ sudo chmod 1777 /data/tmp
11. Configuring core-site.xml
$ sudo vim /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/tmp/hadoop-${user.name}</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>
</configuration>
12. Configuring hdfs-site.xml
$ sudo vim /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn,/nfsmount/dfs/nn</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>hadoop-master:50070</value>
  </property>
  <property>
    <name>fs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
  <property>
    <name>fs.namenode.checkpoint.dir</name>
    <value>/data/1/dfs/ns</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop-secondary:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>supergroup</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>
13. Configuring mapred-site.xml
$ sudo vim /etc/hadoop/conf/mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-secondary:8021</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
  </property>
</configuration>
14. Formatting the HDFS distributed filesystem
$ sudo -u hdfs hadoop namenode -format //run once, on hadoop-master only
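On success, the namenode prints a message along the lines of "Storage directory /data/1/dfs/nn has been successfully formatted", and the name directory gets populated, which you can spot-check:
$ sudo -u hdfs ls /data/1/dfs/nn/current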
15. Starting the Hadoop daemons
Start the namenode on hadoop-master:
$ sudo /etc/init.d/hadoop-hdfs-namenode start
Start the secondarynamenode and jobtracker on hadoop-secondary:
$ sudo /etc/init.d/hadoop-hdfs-secondarynamenode start
$ sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start
Start the datanode and tasktracker on each hadoop-node:
$ sudo /etc/init.d/hadoop-hdfs-datanode start
$ sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start
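To confirm that each daemon actually stayed up, the init scripts accept a status action, and running jps as root shows the JVMs of the hdfs and mapred service users; on a node server you would expect to see DataNode and TaskTracker:
$ sudo service hadoop-hdfs-datanode status
$ sudo jps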
16. Creating mapred.system.dir and /tmp in HDFS
The following HDFS operations need to be run only once, on any one of the hosts:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
$ sudo -u hdfs hadoop fs -ls -R /
$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
17. Configuring HADOOP_MAPRED_HOME
$ sudo vim /etc/profile.d/hadoop.sh
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
$ source /etc/profile
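A quick sanity check that the variable is visible in the current shell:
$ echo $HADOOP_MAPRED_HOME
/usr/lib/hadoop-0.20-mapreduce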
18. Checking the status of the whole cluster
Via the web UI: http://hadoop-master:50070
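The jobtracker has its own web UI as well, by default on port 50030 of hadoop-secondary. The same health information is also available from the command line, and a small example job makes a good end-to-end smoke test; dfsadmin -report should list all three datanodes as live, and the examples jar path below assumes the CDH4 MRv1 package layout:
$ sudo -u hdfs hadoop dfsadmin -report
$ sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 100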
19. At this point, the Hadoop (HDFS) setup is complete.