Tomcat启动失败的问题排查与解决

程序员文章站 2022-04-06 22:39:54

前言最近在某应用更新代码后部分机器发布失败，发布失败的机器上tomcat一直没有启动成功，日志卡在deploying web application，重启数次之后仍然是...

前言

最近在某应用更新代码后部分机器发布失败，发布失败的机器上tomcat一直没有启动成功，日志卡在deploying web application，重启数次之后仍然是一样的情况。所以进行排查问题，下面记录了所有的排查过程，需要的朋友们可以参考学习。

排查过程

1. tomcat启动线程卡住

下文中tomcat启动线程代指线程名为localhost-startstop-$id的线程。

使用jstack打印出tomcat的线程堆栈：

jstack `jps |grep bootstrap |awk '{print $1}'` > jstack.log

从jstack.log里面可以看到线程localhost-startstop-1处于waiting状态，堆栈如下：

"localhost-startstop-1" #26 daemon prio=5 os_prio=0 tid=0x00007fe6c8002000 nid=0x3dc1 waiting on condition [0x00007fe719c1e000]
 java.lang.thread.state: waiting (parking)
 at sun.misc.unsafe.park(native method)
 - parking to wait for <0x00000007147be150> (a xxx.heartbeat.network.client.futureresult)
 at java.util.concurrent.locks.locksupport.park(locksupport.java:175)
 at java.util.concurrent.futuretask.awaitdone(futuretask.java:429)
 at java.util.concurrent.futuretask.get(futuretask.java:191)
 at xxx.heartbeatcontainer.invoke(heartbeatcontainer.java:183)
 at xxx.heartbeatcontainer.registry(heartbeatcontainer.java:130)

对应的代码如下：

final responsefuture<xxxmessage<result>> future = responsefuturefactory.newresponsefuture(request);
channel.writeandflush(request);
xxxmessage<result> response = future.get();

线程一直卡在future.get()没有返回。这个步骤是在等待客户端向xxx-server发送的注册请求的返回。

2. xxx注册请求没返回

用tcpdump抓了下包（xxx-server的服务端口是yyy）：

tcpdump -x -s0 -i bond0 port yyy

发现只有建连接的包，没有length != 0的数据包：

ip app-ip.56599 > xxx-server-ip.yyy: flags [s], seq 3536490816, win 14600, options [mss 1460,sackok,ts val 3049061547 ecr 0], length 0
ip xxx-server-ip.yyy > app-ip.56599: flags [s.], seq 2500877640, ack 3536490817, win 14480, options [mss 1460,sackok,ts val 1580197458 ecr 3049061547], length 0
ip app-ip.56599 > xxx-server-ip.yyy: flags [.], ack 1, win 14600, options [nop,nop,ts val 3049061548 ecr 1580197458], length 0

所以，推断注册请求没返回的原因是请求压根儿没有发送出去。

3. xxx注册请求没发送出去

xxx代码里面调用了channel.writeandflush，但是数据却没有发送出去。这块的代码，更友好的做法应该是writeandflush之后对返回的channelfuture注册一个listener，在write操作完成之后的回调里面判断状态。

在netty大神 – @yh的指导下用btrace跟了一下netty的代码。

在tomcat启动逻辑相关脚本bin/catalina.sh里面加上参数让btrace agent和tomcat一起启动：

java_opts="$java_opts -javaagent:${btrace_home}/build/btrace-agent.jar=script=${btrace_home}/scripts/hangdebug.class,stdout=true,debug=true,noserver=true"

hangdebug.class里面包含了一些需要查看的方法，下面是排查没有发送请求原因的步骤：

首先发现没有调用接口io.netty.channel.channel.unsafe的write方法，验证了请求没有发送出去的推论；
然后发现调用接口io.netty.channel.channeloutboundhandler的write方法时报错；
最后定位到调用类io.netty.handler.codec.messagetobyteencoder的write方法时抛出了异常，异常堆栈为：

io.netty.handler.codec.encoderexception: java.lang.nosuchmethoderror: io.netty.util.internal.mathutil.safefindnextpositivepoweroftwo(i)i
 io.netty.handler.codec.messagetobyteencoder.write(messagetobyteencoder.java:125)
 ...
caused by: java.lang.nosuchmethoderror: 
 io.netty.util.internal.mathutil.safefindnextpositivepoweroftwo(i)i
 io.netty.buffer.poolthreadcache$memoryregioncache.<init>(poolthreadcache.java:372)
 ...

这个时候，问题的原因比较明确了：
io.netty.util.internal.mathutil.safefindnextpositivepoweroftwo这个方法没有找到。

最后找到问题的btrace method如下：

@onmethod(
 clazz = "+io.netty.channel.channeloutboundhandler",
 method = "write",
 location = @location(value = kind.error)
)
public static void errorchanneloutboundhandlerwrite(@probeclassname string classname, throwable cause) {
 println("error channeloutboundhandler.write, real class: " + classname);
 threads.jstack(cause);
 println("=====================");
}

这里有一个问题：为什么这个异常日志里面没有打印呢？

这个问题可以从io.netty.channel.abstractchannelhandlercontext代码里找到答案：

private void invokewrite(object msg, channelpromise promise) {
 try {
 ((channeloutboundhandler)this.handler()).write(this, msg, promise);
 } catch (throwable var4) {
 notifyoutboundhandlerexception(var4, promise);
 }
}

notifyoutboundhandlerexception会去通知对应的listener，xxx的这段老代码没有注册listener，所以没有打印出这个异常。

4. nosuchmethoderror原因

再次查看了下$webapp-dir/web-inf/lib下netty的版本：

netty-3.10.6.final.jar
netty-all-4.1.4.final.jar
netty-buffer-4.1.5.final.jar
netty-codec-4.1.5.final.jar
netty-codec-http-4.1.5.final.jar
netty-common-4.1.5.final.jar
netty-handler-4.1.5.final.jar
netty-resolver-4.1.5.final.jar
netty-transport-4.1.5.final.jar
transport-netty3-client-5.0.0.jar
transport-netty4-client-5.0.0.jar

比较扎眼的是netty-all-4.1.4.final.jar的版本和其它jar包版本不太一致。需要进一步确认一下，io.netty.buffer.poolthreadcache$memoryregioncache和io.netty.util.internal.mathutil这两个类分别是从哪个jar包中加载的。

在tomcat启动逻辑相关脚本bin/catalina.sh里面加上启动参数，打印class加载的日志：

java_opts="$java_opts -verbose:class"

可以看到：

...
[loaded io.netty.buffer.poolthreadcache$memoryregioncache from file:$webapp-dir/web-inf/lib/web-inf/lib/netty-buffer-4.1.5.final.jar]
...
[loaded io.netty.util.internal.mathutil from file:$webapp-dir/web-inf/lib/netty-all-4.1.4.final.jar]
...

从netty-all-4.1.4.final.jar中加载的io.netty.util.internal.mathutil，是没有safefindnextpositivepoweroftwo这个方法的（正常情况下，应该从netty-common-4.1.5.final.jar中加载这个类）。

至此为止，弄清楚了启动卡住的原因：

netty包加载问题 => xxx调用channel.writeandflush发送注册请求时异常 => 没有回包，future.get()一直卡住 => tomcat启动线程卡住

还有一个令人费解的现象：为什么有的机器启动正常，有的机器启动不正常呢？

5. 不同机器表现不同

再回头看一下启动有问题的机器上netty相关jar包的顺序，这里我们使用ls -f命令（只关注和问题相关的jar包）：

$ ls -f |grep netty
netty-buffer-4.1.5.final.jar
netty-all-4.1.4.final.jar
...
netty-common-4.1.5.final.jar
...

ls加-f参数的含义可以通过man手册看到：

-f do not sort, enable -au, disable -ls --color

意思是直接使用系统调用getdents的返回，不再做排序。从man手册可以看到，ls默认排序方法是sort entries alphabetically if none。

nosuchmethoderror的原因是：从netty-buffer-4.1.5.final.jar中加载了io.netty.buffer.poolthreadcache$memoryregioncache，这个类是会调用io.netty.util.internal.mathutil.safefindnextpositivepoweroftwo这个方法的；从netty-all-4.1.4.final.jar加载的io.netty.util.internal.mathutil没有这个方法。

对比看下启动正确的机器上的netty相关jar包的顺序：

$ ls -f |grep netty
...
netty-all-4.1.4.final.jar
...
netty-common-4.1.5.final.jar
netty-buffer-4.1.5.final.jar
...

由此可以看出所有netty相关的class均从netty-all-4.1.4.final.jar中加载，不会有不兼容的问题产生。

要么问题来了：为什么在ext4中，拥有相同目录项的目录，ls -f出来的顺序是不一样的呢？

这个问题我暂时也回答不上来，至少我还没有拿到令自己信服的代码级别的解释。

嗯，没有代码的解释不是解释，没有deadline的任务不是任务，没有流程图或分享的源码阅读不是源码阅读，没有报告的性能测试不是性能测试。

这里有一个基于现象的解释，我觉得还比较靠谱：

on modern filesystems where directory data structures are based on a search tree or hash table, the order is practically unpredictable.

我们可以做的

事后诸葛亮时间 :) 开玩笑的，遇事多review下才能少犯错误。

基础组件：多考虑失败的情况，不吞异常；可能阻塞的操作考虑超时时间（自勉）
发布系统：能够添加一些规则，哪些包不能共存，比如上述问题中的netty-all和netty-common这些
容器隔离：隔离中间件使用的三方包和业务使用的三方包

总结

以上就是这篇文章的全部内容了，希望本文的内容对大家的学习或者工作能带来一定的帮助，如果有疑问大家可以留言交流，谢谢大家对的支持。

上一篇：小米手表color2和color运动版哪个好?小米手表color2和color运动版对比评测

下一篇：关于tomcat点击startup.bat后闪退问题的解决办法

Tomcat启动失败的问题排查与解决

eclipse部署tomcat服务器无法启动问题的解决方法

Tomcat启动成功访问主页失败的原因解决方案

Eclipse启动Tomcat超时问题的解决方法

解决centos7中tomcat启动与本机访问问题

解决redis服务启动失败的问题

关于使用spingMVC框架启动tomcat服务器失败的问题

Tomcat启动成功访问主页失败的原因解决方案

解决 tomcat启动访问页面报java.lang.NoSuchMethodError javax.el.ExpressionFactory.newInstance()的问题