[HBase]记一个state异常 hbaserpcstate
程序员文章站
2022-06-17 13:33:49
...
昨天联调的时候发现一个问题,请求发送后,页面上一直不返回数据,好像一直在waiting。用stack看了下,第一个线程在等待hbase RPC请求返回数据
"http-bio-8080-exec-5" daemon prio=10 tid=0x00002aaab3661000 nid=0x2bb6 in Object.wait() [0x0000000045258000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000000d2067280> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call) at java.lang.Object.wait(Object.java:485) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:907) - locked <0x00000000d2067280> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call) at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:157) at $Proxy9.getProtocolVersion(Unknown Source) at org.apache.hadoop.hbase.ipc.SecureRpcEngine.getProxy(SecureRpcEngine.java:198) at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:335) at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:312) at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:364) at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1293) - locked <0x00000000d1d791e0> (a java.lang.String) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1
而其他线程则处于block状态
"http-bio-8080-exec-8" daemon prio=10 tid=0x0000000041d07800 nid=0x2f5b waiting for monitor entry [0x0000000046d74000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1283) - waiting to lock <0x00000000d1d791e0> (a java.lang.String) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1249) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1236) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:945) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:942) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:845)
因为之前看过对应的代码,请见 http://iwinit.iteye.com/blog/1815347 和 http://iwinit.iteye.com/blog/1811265
看代码是region location的时候RPC meta表所在regionserver的时候卡住了。因为系统刚起来的时候没有cache,所以请求都需要region location,.meta.只在一台rs上,所以都block住了。再看之前的RPC分析文章,第一个请求确实已经发出去了,处于wait状态,也就是等待IO线程notify。奇怪的是,jstack里并没有IO线程(IPC关键字)。也就是说第一个请求线程永远无法被唤醒了:(。Log也没有任何记录,开始各种debug,因为我们客户端是自己的client,SecureConnection,其receiveResponse方法如下
try { int id = in.readInt(); // try to read an id if (LOG.isDebugEnabled()) LOG.debug(getName() + " got value #" + id); //这里把call删除掉了,后续IO线程会因为空闲而自动关闭退出 Call call = calls.remove(id); //协议中的state字段 int state = in.readInt(); // read call status if (LOG.isDebugEnabled()) { LOG.debug("call #"+id+" state is " + state); } //对state处理,注意这里没有从的else处理,也就是说如果state不是这3种的话,直接就不处理了,但是call请求确已经被删除掉了。。。 if (state == Status.SUCCESS.state) { Writable value = ReflectionUtils.newInstance(valueClass, conf); value.readFields(in); // read value if (LOG.isDebugEnabled()) { LOG.debug("call #"+id+", response is:\n"+value.toString()); } call.setValue(value); } else if (state == Status.ERROR.state) { call.setException(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); } else if (state == Status.FATAL.state) { // Close the connection markClosed(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); } } catch (IOException e) {
debug的时候发现server端范围的state是5xxxxxx的数字。。。原因是sever段没有启用secure导致协议不一致了。。client没有容错导致。
上一篇: springboot admin 配置
下一篇: 批处理命令教学之more命令