Hadoop伪分布式运行
Hadoop可以在单节点上以所谓的伪分布式模式运行,此时每一个Hadoop守护进程都作为一个独立的Java进程运行。本文通过自动化脚本配置Hadoop伪分布式模式。测试环境为VMware中的Centos 6.3, Hadoop 1.2.1.其他版本未测试。 伪分布式配置脚本 包括配置core-site.
Hadoop可以在单节点上以所谓的伪分布式模式运行,此时每一个Hadoop守护进程都作为一个独立的Java进程运行。本文通过自动化脚本配置Hadoop伪分布式模式。测试环境为VMware中的Centos 6.3, Hadoop 1.2.1.其他版本未测试。
伪分布式配置脚本
包括配置core-site.xml,hdfs-site.xml及mapred-site.xml,配置ssh免密码登陆。[1]
#!/bin/bash # Usage: Hadoop伪分布式配置 # History: # 20140426 annhe 完成基本功能 # Check if user is root if [ $(id -u) != "0" ]; then printf "Error: You must be root to run this script!\n" exit 1 fi #同步时钟 rm -rf /etc/localtime ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime #yum install -y ntp ntpdate -u pool.ntp.org &>/dev/null echo -e "Time: `date` \n" #默认为单网卡结构,多网卡的暂不考虑 IP=`ifconfig eth0 |grep "inet\ addr" |awk '{print $2}' |cut -d ":" -f2` #伪分布式 function PseudoDistributed () { cd /etc/hadoop/ #恢复备份 mv core-site.xml.bak core-site.xml mv hdfs-site.xml.bak hdfs-site.xml mv mapred-site.xml.bak mapred-site.xml #备份 mv core-site.xml core-site.xml.bak mv hdfs-site.xml hdfs-site.xml.bak mv mapred-site.xml mapred-site.xml.bak #使用下面的core-site.xml cat > core-site.xmleof #使用下面的hdfs-site.xml cat > hdfs-site.xml fs.default.name hdfs://$IP:9000 eof #使用下面的mapred-site.xml cat > mapred-site.xml dfs.replication 1 eof } #配置ssh免密码登陆 function PassphraselessSSH () { #不重复生成私钥 [ ! -f ~/.ssh/id_dsa ] && ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa cat ~/.ssh/authorized_keys |grep "`cat ~/.ssh/id_dsa.pub`" &>/dev/null && r=0 || r=1 #没有公钥的时候才添加 [ $r -eq 1 ] && cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys chmod 644 ~/.ssh/authorized_keys } #执行 function Execute () { #格式化一个新的分布式文件系统 hadoop namenode -format #启动Hadoop守护进程 start-all.sh echo -e "\n========================================================================" echo "hadoop log dir : $HADOOP_LOG_DIR" echo "NameNode - http://$IP:50070/" echo "JobTracker - http://$IP:50030/" echo -e "\n=========================================================================" } PseudoDistributed 2>&1 | tee -a pseudo.log PassphraselessSSH 2>&1 | tee -a pseudo.log Execute 2>&1 | tee -a pseudo.log mapred.job.tracker $IP:9001
脚本测试结果
[root@hadoop hadoop]# ./pseudo.sh 14/04/26 23:52:30 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = hadoop/216.34.94.184 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 1.2.1 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:27:42 PDT 2013 STARTUP_MSG: java = 1.7.0_51 ************************************************************/ Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) y Format aborted in /tmp/hadoop-root/dfs/name 14/04/26 23:52:40 INFO namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at hadoop/216.34.94.184 ************************************************************/ starting namenode, logging to /var/log/hadoop/root/hadoop-root-namenode-hadoop.out localhost: starting datanode, logging to /var/log/hadoop/root/hadoop-root-datanode-hadoop.out localhost: starting secondarynamenode, logging to /var/log/hadoop/root/hadoop-root-secondarynamenode-hadoop.out starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-hadoop.out localhost: starting tasktracker, logging to /var/log/hadoop/root/hadoop-root-tasktracker-hadoop.out ======================================================================== hadoop log dir : /var/log/hadoop/root NameNode - http://192.168.60.128:50070/ JobTracker - http://192.168.60.128:50030/ =========================================================================
通过宿主机上的浏览器访问NameNode和JobTracker的网络接口
运行测试程序
将输入文件拷贝到分布式文件系统:
$ hadoop fs -put input input
通过网络接口查看hdfs
运行示例程序
[root@hadoop hadoop]# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount input output
通过JobTracker网络接口查看执行状态
执行结果
[root@hadoop hadoop]# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount input out2 14/04/27 03:34:56 INFO input.FileInputFormat: Total input paths to process : 2 14/04/27 03:34:56 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/04/27 03:34:56 WARN snappy.LoadSnappy: Snappy native library not loaded 14/04/27 03:34:57 INFO mapred.JobClient: Running job: job_201404270333_0001 14/04/27 03:34:58 INFO mapred.JobClient: map 0% reduce 0% 14/04/27 03:35:49 INFO mapred.JobClient: map 100% reduce 0% 14/04/27 03:36:16 INFO mapred.JobClient: map 100% reduce 100% 14/04/27 03:36:19 INFO mapred.JobClient: Job complete: job_201404270333_0001 14/04/27 03:36:19 INFO mapred.JobClient: Counters: 29 14/04/27 03:36:19 INFO mapred.JobClient: Job Counters 14/04/27 03:36:19 INFO mapred.JobClient: Launched reduce tasks=1 14/04/27 03:36:19 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=72895 14/04/27 03:36:19 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/04/27 03:36:19 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/04/27 03:36:19 INFO mapred.JobClient: Launched map tasks=2 14/04/27 03:36:19 INFO mapred.JobClient: Data-local map tasks=2 14/04/27 03:36:19 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=24880 14/04/27 03:36:19 INFO mapred.JobClient: File Output Format Counters 14/04/27 03:36:19 INFO mapred.JobClient: Bytes Written=25 14/04/27 03:36:19 INFO mapred.JobClient: FileSystemCounters 14/04/27 03:36:19 INFO mapred.JobClient: FILE_BYTES_READ=55 14/04/27 03:36:19 INFO mapred.JobClient: HDFS_BYTES_READ=260 14/04/27 03:36:19 INFO mapred.JobClient: FILE_BYTES_WRITTEN=164041 14/04/27 03:36:19 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=25 14/04/27 03:36:19 INFO mapred.JobClient: File Input Format Counters 14/04/27 03:36:19 INFO mapred.JobClient: Bytes Read=25 14/04/27 03:36:19 INFO mapred.JobClient: Map-Reduce Framework 14/04/27 03:36:19 INFO mapred.JobClient: Map output materialized bytes=61 14/04/27 03:36:19 INFO mapred.JobClient: Map input records=2 14/04/27 03:36:19 INFO mapred.JobClient: Reduce shuffle bytes=61 14/04/27 03:36:19 INFO mapred.JobClient: Spilled Records=8 14/04/27 03:36:19 INFO mapred.JobClient: Map output bytes=41 14/04/27 03:36:19 INFO mapred.JobClient: Total committed heap usage (bytes)=414441472 14/04/27 03:36:19 INFO mapred.JobClient: CPU time spent (ms)=2910 14/04/27 03:36:19 INFO mapred.JobClient: Combine input records=4 14/04/27 03:36:19 INFO mapred.JobClient: SPLIT_RAW_BYTES=235 14/04/27 03:36:19 INFO mapred.JobClient: Reduce input records=4 14/04/27 03:36:19 INFO mapred.JobClient: Reduce input groups=3 14/04/27 03:36:19 INFO mapred.JobClient: Combine output records=4 14/04/27 03:36:19 INFO mapred.JobClient: Physical memory (bytes) snapshot=353439744 14/04/27 03:36:19 INFO mapred.JobClient: Reduce output records=3 14/04/27 03:36:19 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2195972096 14/04/27 03:36:19 INFO mapred.JobClient: Map output records=4
查看结果
[root@hadoop hadoop]# hadoop fs -cat out2/* hadoop 1 hello 2 world 1
也可以将分布式文件系统上的文件拷贝到本地查看
[root@hadoop hadoop]# hadoop fs -get out2 out4 [root@hadoop hadoop]# cat out4/* cat: out4/_logs: Is a directory hadoop 1 hello 2 world 1
完成全部操作后,停止守护进程:
[root@hadoop hadoop]# stop-all.sh stopping jobtracker localhost: stopping tasktracker stopping namenode localhost: stopping datanode localhost: stopping secondarynamenode
遇到的问题
宿主机不能访问网络接口
因为开启了iptables,所以需要添加相应端口,当然测试环境也可以直接将iptables关闭。
# Firewall configuration written by system-config-firewall # Manual customization of this file is not recommended. *filter :INPUT ACCEPT [0:0] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [0:0] -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT -A INPUT -p icmp -j ACCEPT -A INPUT -i lo -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50075 -j ACCEPT -A INPUT -j REJECT --reject-with icmp-host-prohibited -A FORWARD -j REJECT --reject-with icmp-host-prohibited COMMIT
Browse the filesystem跳转地址不对
NameNode网络接口点击Browse the filesystem,跳转到localhost:50075。[2][3]
修改core-site.xml,将hdfs://localhost:9000改成虚拟机ip地址。(上面的脚本已经改写为自动配置为IP)。
根据几次改动的情况,这里也是可以填写域名的,只是要在访问的机器上能解析这个域名。因此公网环境中有DNS服务器的应该是可以设置域名的。
执行reduce的时候卡死
在/etc/hosts中添加主机名对应的ip地址 [4][5]。(已更新Hadoop安装脚本,会自动配置此项)
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6 127.0.0.1 hadoop #添加这一行
参考文献
[1]. Hadoop官方文档.?http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
[2]. *.?http://*.com/questions/15254492/wrong-redirect-from-hadoop-hdfs-namenode-to-localhost50075
[3]. Iteye.?http://yymmiinngg.iteye.com/blog/706909
[4].*.?http://*.com/questions/10165549/hadoop-wordcount-example-stuck-at-map-100-reduce-0
[5]. 李俊的博客.?http://www.colorlight.cn/archives/32
本文遵从CC版权协定,转载请以链接形式注明出处。
本文链接地址: http://www.annhe.net/article-2682.html
上一篇: 如何用PHP程序保护自己的数据
推荐阅读
-
Hadoop伪分布式运行
-
hadoop安装记录 博客分类: hadoop,分布式
-
【hadoop】hadoop集群介绍 和 完全分布式部署步骤
-
Hadoop学习笔记(二)Hadoop 分布式文件系统 HDFS:1.HDFS基础
-
hadoop 1.2.1 完全分布式安装注意要点 博客分类: hadoop学习hadoop环境搭建 hadoop 完全分布式安装
-
JAVA线程池管理及分布式HADOOP调度框架搭建 博客分类: java架构hadoopjeetask 分布式线程池java架构hadoop
-
Hadoop 上运行基于中文分词算法的 MapReduce 程序,进行词频分析。
-
学习Hadoop权威指南之Hadoop运行MapReduce日志查看 博客分类: hadoop hadoop大数据
-
ZooKeeper的配置 博客分类: ZooKeeper云计算 ZooKeeper云计算集群伪分布式Hadoop
-
Hadoop2.2.0伪分布式完全安装手册 博客分类: hadoop hadoophadoop2.2.0单机伪分布式安装