记一次线上异常的排查和定位
程序员文章站
2022-04-28 17:29:09
...
title: 记一次线上异常的排查和定位
date: 2019-02-09 20:30:02
tags: debug
某段时间,线上服务不时会爆出dwr数据异常。
初步断定是后端接口返回异常所致。
由于当时我们的服务已经接入ELK,所以第一时间登录kibana后台,按照lucene语法搜索
(Tag:“study_online”) AND (message:“exception”),定位到服务器的异常
异常信息(由于当时的异常日志已经被冲掉了,这是我从网上摘录的):
java.lang.IllegalStateException: Timed out waiting to add Cmd: 1 Opaque: 1147840 Key: com.stubhub.user.business.entity.UserSession|63C21A07311389A1EC361D834BF46E72 Cas: 0 Exp: 7200 Flags: 1 Data Length: 1748(max wait=10000ms)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:362)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:1267)
at com.couchbase.client.CouchbaseConnection.addOperation(CouchbaseConnection.java:277)
at net.spy.memcached.MemcachedConnection.enqueueOperation(MemcachedConnection.java:1185)
at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:328)
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:929)
at com.stubhub.common.cache.store.couchbase.CouchbaseStore.put(CouchbaseStore.java:148)
at com.stubhub.common.cache.store.CompositeCacheStore.put(CompositeCacheStore.java:69)
at com.stubhub.common.session.impl.UserSessionCacheManagerImpl.putCrossModuleValue(UserSessionCacheManagerImpl.java:48)
at com.stubhub.user.business.manager.impl.UserSessionCacheMgrImpl.putUserSessionToStore(UserSessionCacheMgrImpl.java:269)
at com.stubhub.user.business.manager.impl.UserSessionCacheMgrImpl.createUserSession(UserSessionCacheMgrImpl.java:806)
at sun.reflect.GeneratedMethodAccessor1305.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
从日志中判断是因为memcached超时抛出的异常。
首先查看工程的memcached设置,代码如下
<bean id="memcachedClient" class="com.xxx.xxx.cache.impl.KeyPrefixSupportedMemcachedClientFactory">
<property name="servers" value="${memcached_address_list}" />
<property name="namespace" value="study_" />
<property name="protocol" value="BINARY" />
<property name="transcoder">
<bean class="net.spy.memcached.transcoders.SerializingTranscoder">
<property name="compressionThreshold" value="16384" />
</bean>
</property>
<property name="maxReconnectDelay" value="60" />
<property name="opTimeout" value="200"/>
<property name="opQueueMaxBlockTime" value="400"/>
<property name="timeoutExceptionThreshold" value="20"/>
<property name="hashAlg">
<value type="net.spy.memcached.DefaultHashAlgorithm">KETAMA_HASH</value>
</property>
<property name="locatorType">
<value type="net.spy.memcached.ConnectionFactoryBuilder.Locator">CONSISTENT</value>
</property>
<property name="failureMode">
<value type="net.spy.memcached.FailureMode">Redistribute</value>
</property>
<property name="useNagleAlgorithm" value="false" />
</bean>
可以看到超时时间opTimeout设置的是200毫秒,而opQueueMaxBlockTime是400毫秒
opTimeout是操作超时时间,opQueueMaxBlockTime是指操作的最大阻塞时间。
那究竟是哪个时间超时导致的呢?
可以通过下载memcacheClient源码,搜索“Timed out waiting to add”定位到如下源码
/*
* (non-Javadoc)
*
* @see net.spy.memcached.MemcachedNode#addOp(net.spy.memcached.ops.Operation)
*/
public final void addOp(Operation op) {
try {
if (!authLatch.await(1, TimeUnit.SECONDS)) {
op.cancel();
getLogger().warn("Operation canceled because authentication "
+ "or reconnection and authentication has "
+ "taken more than one second to complete.");
getLogger().debug("Canceled operation %s", op.toString());
return;
}
if (!inputQueue.offer(op, opQueueMaxBlockTime, TimeUnit.MILLISECONDS)) {
throw new IllegalStateException("Timed out waiting to add " + op
+ "(max wait=" + opQueueMaxBlockTime + "ms)");
}
} catch (InterruptedException e) {
// Restore the interrupted status
Thread.currentThread().interrupt();
throw new IllegalStateException("Interrupted while waiting to add " + op);
}
}
从源码中可以知晓,原因是inputQueue.offer没有正常返回导致的,超过了opQueueMaxBlockTime的时间限制
知道了原因,下面就开始思考怎么解决吧