Hadoop DataNode communication failure

程序员文章站 2022-05-30 10:50:42


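A few days ago our Hadoop cluster became very unstable: one data node would frequently drop out. Running jps on the affected node showed that the tasktracker and datanode processes were both still alive and had not crashed. The log showed: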
org.apache.hadoop.ipc.Client: Retrying connect to server: xxxxx/192.168.0.xxxx:9001. Already tried 9 time(s).
org.apache.hadoop.ipc.Client: Retrying connect to server: xxxxx/192.168.0.xxxx:9001. Already tried 8 time(s).
org.apache.hadoop.ipc.Client: Retrying connect to server: xxxxx/192.168.0.xxxx:9001. Already tried 7 time(s).
org.apache.hadoop.ipc.Client: Retrying connect to server: xxxxx/192.168.0.xxxx:9001. Already tried 6 time(s).
org.apache.hadoop.ipc.Client: Retrying connect to server: xxxxx/192.168.0.xxxx:9001. Already tried 5 time(s).
In other words, the node could not communicate with the namenode.
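As a rough aid for spotting this pattern, the retry lines quoted above can be tallied per server. This is a minimal sketch: the regular expression is written against the log format shown here only, and the function name summarize_retries is made up for illustration.

```python
import re
from collections import Counter

# Matches the retry lines quoted above; the server field is kept verbatim.
RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<server>\S+?)\. "
    r"Already tried (?P<tries>\d+) time\(s\)"
)

def summarize_retries(lines):
    """Return {server: (number of retry lines, highest 'Already tried' value)}."""
    counts = Counter()
    max_tries = {}
    for line in lines:
        m = RETRY_RE.search(line)
        if m is None:
            continue  # not a retry line
        server = m.group("server")
        counts[server] += 1
        max_tries[server] = max(max_tries.get(server, 0), int(m.group("tries")))
    return {s: (counts[s], max_tries[s]) for s in counts}

# Two of the lines from the log excerpt, with the masked host left as-is.
sample = [
    "org.apache.hadoop.ipc.Client: Retrying connect to server: "
    "xxxxx/192.168.0.xxxx:9001. Already tried 9 time(s).",
    "org.apache.hadoop.ipc.Client: Retrying connect to server: "
    "xxxxx/192.168.0.xxxx:9001. Already tried 8 time(s).",
]
print(summarize_retries(sample))
```

On a real node the lines would come from the daemon's log file rather than from a hard-coded list.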
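As far as the cluster itself was concerned, nothing had been changed recently.
I first stopped the data node with hadoop-daemon.sh stop datanode and hadoop-daemon.sh stop tasktracker.
Then I started it again with hadoop-daemon.sh start datanode and hadoop-daemon.sh start tasktracker.
Both steps completed normally, with no error messages.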
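The stop/start sequence above could be wrapped in a small helper so it runs the same way every time. This is only a sketch: it assumes hadoop-daemon.sh is on the PATH of the user running it, and the function names are invented here. By default it only prints the commands.

```python
import subprocess

# Daemon names and the hadoop-daemon.sh script are taken from the text above.
DAEMONS = ("datanode", "tasktracker")

def restart_commands(daemons=DAEMONS, script="hadoop-daemon.sh"):
    """Build the stop commands for every daemon, followed by the start commands."""
    stops = [[script, "stop", d] for d in daemons]
    starts = [[script, "start", d] for d in daemons]
    return stops + starts

def restart(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in restart_commands():
        print(" ".join(cmd))
        if not dry_run:
            # Assumption: the script is on PATH; check=True stops at the first failure.
            subprocess.run(cmd, check=True)

restart()  # dry run: prints the four commands without executing them
```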
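However, after running for a while, or after one or two MR jobs, the load on the server hosting that data node would spike.
After that it could no longer communicate with the namenode.
So I checked the possible causes one by one.
The node configuration and the HDFS settings showed nothing abnormal. While checking the server's own configuration, though, I noticed something odd.
An entry had been added to /etc/hosts: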
127.0.1.1 xxxxxx
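127.0.1.1 is a loopback address; on Debian it is conventionally mapped to the machine's own hostname in /etc/hosts. This entry caused Hadoop's hostname resolution to go wrong. Nobody knows who added it.
After commenting out the entry the problem persisted, so the only option was to reboot the server. After the reboot everything was back to normal.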
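The /etc/hosts check described above can be made repeatable with a few lines of code. The sketch below reports every loopback address that a hosts file maps to a given hostname. The host names datanode1 and namenode and the address 192.168.0.10 in the sample are placeholders, not values from the cluster, and loopback_entries is a name chosen for this example.

```python
import ipaddress

def loopback_entries(hosts_text, hostname):
    """Return the loopback IPs that hosts_text maps to hostname."""
    hits = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip_text, *names = line.split()
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # first field is not a plain IP address
        if ip.is_loopback and hostname in names:
            hits.append(ip_text)
    return hits

# Placeholder hosts file; the commented-out line is ignored.
sample = """127.0.0.1 localhost
127.0.1.1 datanode1
# 127.0.1.1 old-entry
192.168.0.10 namenode
"""
print(loopback_entries(sample, "datanode1"))
```

On an actual node the inputs would be the contents of /etc/hosts and the value of socket.gethostname(); a non-empty result means the hostname is mapped to a loopback address.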

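This shows how important it is to keep a Hadoop cluster's environment clean. The incident also gave me useful experience for diagnosing Hadoop cluster failures in the future. A Hadoop cluster's own configuration rarely changes much, but Hadoop depends heavily on the server environment, so checking whether the server environment has changed is a good place to start when troubleshooting. Noting it here for future reference.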
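One way to make "has the server environment changed?" easy to answer is to keep a checksum baseline of the files that matter and compare against it later. This is a sketch of that idea, not something the original setup used; which files to watch (for example /etc/hosts) is an assumption left to the operator, and the demo below works on a temporary file.

```python
import hashlib
import os
import tempfile
from pathlib import Path

def snapshot(paths):
    """Map each path to the SHA-256 of its contents (None if the file is missing)."""
    result = {}
    for p in map(Path, paths):
        result[str(p)] = (
            hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None
        )
    return result

def changed(baseline, current):
    """Return the sorted paths whose digest differs between two snapshots."""
    keys = baseline.keys() | current.keys()
    return sorted(k for k in keys if baseline.get(k) != current.get(k))

# Demo on a temporary stand-in for a hosts file.
with tempfile.TemporaryDirectory() as d:
    hosts = os.path.join(d, "hosts")
    Path(hosts).write_text("127.0.0.1 localhost\n")
    before = snapshot([hosts])
    Path(hosts).write_text("127.0.0.1 localhost\n127.0.1.1 xxxxxx\n")
    after = snapshot([hosts])
    print(changed(before, after) == [hosts])
```

Storing the baseline somewhere off the node and rerunning the comparison when a node misbehaves would have flagged the extra /etc/hosts entry right away.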

Original article: "Hadoop DataNode communication failure" (hadoop数据节点通信异常). Thanks to the original author for sharing.
