记一次线上异常的排查和定位

程序员文章站 2022-04-28 17:29:09

...

title: 记一次线上异常的排查和定位
date: 2019-02-09 20:30:02
tags: debug

某段时间，线上服务不时会爆出dwr数据异常。
初步断定是后端接口返回异常所致。

由于当时我们的服务已经接入ELK，所以第一时间登录kibana后台，按照lucene语法搜索
（Tag:“study_online”） AND (message:“exception”)，定位到服务器的异常

异常信息(由于当时的异常日志已经被冲掉了，这是我从网上摘录的)：

java.lang.IllegalStateException: Timed out waiting to add Cmd: 1 Opaque: 1147840 Key: com.stubhub.user.business.entity.UserSession|63C21A07311389A1EC361D834BF46E72 Cas: 0 Exp: 7200 Flags: 1 Data Length: 1748(max wait=10000ms) 
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:362) 
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:1267) 
at com.couchbase.client.CouchbaseConnection.addOperation(CouchbaseConnection.java:277) 
at net.spy.memcached.MemcachedConnection.enqueueOperation(MemcachedConnection.java:1185) 
at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:328) 
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:929) 
at com.stubhub.common.cache.store.couchbase.CouchbaseStore.put(CouchbaseStore.java:148) 
at com.stubhub.common.cache.store.CompositeCacheStore.put(CompositeCacheStore.java:69) 
at com.stubhub.common.session.impl.UserSessionCacheManagerImpl.putCrossModuleValue(UserSessionCacheManagerImpl.java:48) 
at com.stubhub.user.business.manager.impl.UserSessionCacheMgrImpl.putUserSessionToStore(UserSessionCacheMgrImpl.java:269) 
at com.stubhub.user.business.manager.impl.UserSessionCacheMgrImpl.createUserSession(UserSessionCacheMgrImpl.java:806) 
at sun.reflect.GeneratedMethodAccessor1305.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

从日志中判断是因为memcached超时抛出的异常。
首先查看工程的memcached设置，代码如下

	<bean id="memcachedClient" class="com.xxx.xxx.cache.impl.KeyPrefixSupportedMemcachedClientFactory">
		<property name="servers" value="${memcached_address_list}" />
		<property name="namespace" value="study_" />
		<property name="protocol" value="BINARY" />
		<property name="transcoder">
			<bean class="net.spy.memcached.transcoders.SerializingTranscoder">
				<property name="compressionThreshold" value="16384" />
			</bean>
		</property>
		<property name="maxReconnectDelay" value="60" />
		<property name="opTimeout" value="200"/>
		<property name="opQueueMaxBlockTime" value="400"/>
		<property name="timeoutExceptionThreshold" value="20"/>
		<property name="hashAlg">
			<value type="net.spy.memcached.DefaultHashAlgorithm">KETAMA_HASH</value>
		</property>
		<property name="locatorType">
            <value type="net.spy.memcached.ConnectionFactoryBuilder.Locator">CONSISTENT</value>
        </property>
		<property name="failureMode">
            <value type="net.spy.memcached.FailureMode">Redistribute</value>
        </property>
		<property name="useNagleAlgorithm" value="false" />
	</bean>

可以看到超时时间opTimeout设置的是200毫秒，而opQueueMaxBlockTime是400毫秒
opTimeout是操作超时时间，opQueueMaxBlockTime是指操作的最大阻塞时间。
那究竟是哪个时间超时导致的呢？
可以通过下载memcacheClient源码，搜索“Timed out waiting to add”定位到如下源码

  /*
   * (non-Javadoc)
   *
   * @see net.spy.memcached.MemcachedNode#addOp(net.spy.memcached.ops.Operation)
   */
  public final void addOp(Operation op) {
    try {
      if (!authLatch.await(1, TimeUnit.SECONDS)) {
        op.cancel();
        getLogger().warn("Operation canceled because authentication "
                + "or reconnection and authentication has "
                + "taken more than one second to complete.");
        getLogger().debug("Canceled operation %s", op.toString());
        return;
      }
      if (!inputQueue.offer(op, opQueueMaxBlockTime, TimeUnit.MILLISECONDS)) {
        throw new IllegalStateException("Timed out waiting to add " + op
            + "(max wait=" + opQueueMaxBlockTime + "ms)");
      }
    } catch (InterruptedException e) {
      // Restore the interrupted status
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting to add " + op);
    }
  }

从源码中可以知晓，原因是inputQueue.offer没有正常返回导致的，超过了opQueueMaxBlockTime的时间限制

知道了原因，下面就开始思考怎么解决吧

上一篇：在python中将字符串转为json对象并取值的方法

下一篇： [NIO.2] 第二十篇使用命令行创建链接

记一次线上异常的排查和定位

title: 记一次线上异常的排查和定位
date: 2019-02-09 20:30:02
tags: debug

一次线上Redis类转换异常排查引发的思考

记一次线上CPU持续飙升的问题排查

一次线上Redis类转换异常排查引发的思考

记一次线上异常的排查和定位

记一次线上CPU持续飙升的问题排查

记一次线上异常的排查和定位

title: 记一次线上异常的排查和定位 date: 2019-02-09 20:30:02 tags: debug

一次线上Redis类转换异常排查引发的思考

记一次线上CPU持续飙升的问题排查

一次线上Redis类转换异常排查引发的思考

记一次线上异常的排查和定位

记一次线上CPU持续飙升的问题排查

title: 记一次线上异常的排查和定位
date: 2019-02-09 20:30:02
tags: debug