欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

记一次线上异常的排查和定位

程序员文章站 2022-04-28 17:29:09
...

title: 记一次线上异常的排查和定位
date: 2019-02-09 20:30:02
tags: debug

某段时间,线上服务不时会爆出dwr数据异常。
初步断定是后端接口返回异常所致。

由于当时我们的服务已经接入ELK,所以第一时间登录kibana后台,按照lucene语法搜索
(Tag:“study_online”) AND (message:“exception”),定位到服务器的异常

异常信息(由于当时的异常日志已经被冲掉了,这是我从网上摘录的):

java.lang.IllegalStateException: Timed out waiting to add Cmd: 1 Opaque: 1147840 Key: com.stubhub.user.business.entity.UserSession|63C21A07311389A1EC361D834BF46E72 Cas: 0 Exp: 7200 Flags: 1 Data Length: 1748(max wait=10000ms) 
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:362) 
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:1267) 
at com.couchbase.client.CouchbaseConnection.addOperation(CouchbaseConnection.java:277) 
at net.spy.memcached.MemcachedConnection.enqueueOperation(MemcachedConnection.java:1185) 
at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:328) 
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:929) 
at com.stubhub.common.cache.store.couchbase.CouchbaseStore.put(CouchbaseStore.java:148) 
at com.stubhub.common.cache.store.CompositeCacheStore.put(CompositeCacheStore.java:69) 
at com.stubhub.common.session.impl.UserSessionCacheManagerImpl.putCrossModuleValue(UserSessionCacheManagerImpl.java:48) 
at com.stubhub.user.business.manager.impl.UserSessionCacheMgrImpl.putUserSessionToStore(UserSessionCacheMgrImpl.java:269) 
at com.stubhub.user.business.manager.impl.UserSessionCacheMgrImpl.createUserSession(UserSessionCacheMgrImpl.java:806) 
at sun.reflect.GeneratedMethodAccessor1305.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 

从日志中判断是因为memcached超时抛出的异常。
首先查看工程的memcached设置,代码如下

	<bean id="memcachedClient" class="com.xxx.xxx.cache.impl.KeyPrefixSupportedMemcachedClientFactory">
		<property name="servers" value="${memcached_address_list}" />
		<property name="namespace" value="study_" />
		<property name="protocol" value="BINARY" />
		<property name="transcoder">
			<bean class="net.spy.memcached.transcoders.SerializingTranscoder">
				<property name="compressionThreshold" value="16384" />
			</bean>
		</property>
		<property name="maxReconnectDelay" value="60" />
		<property name="opTimeout" value="200"/>
		<property name="opQueueMaxBlockTime" value="400"/>
		<property name="timeoutExceptionThreshold" value="20"/>
		<property name="hashAlg">
			<value type="net.spy.memcached.DefaultHashAlgorithm">KETAMA_HASH</value>
		</property>
		<property name="locatorType">
            <value type="net.spy.memcached.ConnectionFactoryBuilder.Locator">CONSISTENT</value>
        </property>
		<property name="failureMode">
            <value type="net.spy.memcached.FailureMode">Redistribute</value>
        </property>
		<property name="useNagleAlgorithm" value="false" />
	</bean>

可以看到超时时间opTimeout设置的是200毫秒,而opQueueMaxBlockTime是400毫秒
opTimeout是操作超时时间,opQueueMaxBlockTime是指操作的最大阻塞时间。
那究竟是哪个时间超时导致的呢?
可以通过下载memcacheClient源码,搜索“Timed out waiting to add”定位到如下源码

  /*
   * (non-Javadoc)
   *
   * @see net.spy.memcached.MemcachedNode#addOp(net.spy.memcached.ops.Operation)
   */
  public final void addOp(Operation op) {
    try {
      if (!authLatch.await(1, TimeUnit.SECONDS)) {
        op.cancel();
        getLogger().warn("Operation canceled because authentication "
                + "or reconnection and authentication has "
                + "taken more than one second to complete.");
        getLogger().debug("Canceled operation %s", op.toString());
        return;
      }
      if (!inputQueue.offer(op, opQueueMaxBlockTime, TimeUnit.MILLISECONDS)) {
        throw new IllegalStateException("Timed out waiting to add " + op
            + "(max wait=" + opQueueMaxBlockTime + "ms)");
      }
    } catch (InterruptedException e) {
      // Restore the interrupted status
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting to add " + op);
    }
  }

从源码中可以知晓,原因是inputQueue.offer没有正常返回导致的,超过了opQueueMaxBlockTime的时间限制

知道了原因,下面就开始思考怎么解决吧