Notes on a production OOM
A production incident caused by reading a dynamic property through the Hystrix API: an OOM that restarted the system
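Symptoms
Production runs 8 instances. One afternoon every instance suddenly failed and restarted automatically, and operations reported that the system had hit an OOM. The instances recovered once the business peak had passed, so the incident was not taken seriously that day. On Monday morning, during the next business peak, some instances hit the OOM again and restarted.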
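Troubleshooting
I asked operations to export the dump file from a failed instance and, in the meantime, ran top -Hp on a healthy instance to check CPU and memory. One VM thread was at 100% CPU and sometimes above it; my guess was a GC thread. Watching GC with jstat -gcutil showed that young GC was very frequent, about once every 3 s on average, while full GC was fairly stable. The container monitoring page showed memory usage growing linearly from 9:50 until the instance failed, which pointed to a memory leak.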
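The GC frequency check above can also be approximated from inside the JVM. The following is a minimal sketch, not what was run during the incident: it reads the standard GarbageCollectorMXBean counters and reports collections per second over a sampling window. The class name GcSampler and the 3-second window are assumptions chosen for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper; the class and method names are hypothetical.
class GcSampler {

    /** Sums collection counts over all collectors; a count of -1 means "undefined" and is skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Collections per second observed during the sampling window. */
    static double collectionsPerSecond(long windowMillis) throws InterruptedException {
        long before = totalCollections();
        Thread.sleep(windowMillis);
        return (totalCollections() - before) * 1000.0 / windowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        // Per-collector totals since JVM start, similar in spirit to the jstat counters.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("collections/s over 3s: %.2f%n", collectionsPerSecond(3000));
    }
}
```

A rate that stays high while heap usage after each collection keeps climbing is the pattern described above.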
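Analyzing the dump file
The two classes occupying the most memory were Object and com.netflix.hystrix.strategy.properties.archaius.HystrixDynamicPropertiesArchaius.StringDynamicProperty.
The previous release had reworked our Hystrix rate limiting, so Hystrix was the first suspect.
Inside StringDynamicProperty, two fields, callbacks and validators, accounted for essentially 100% of the memory. Both are of type CopyOnWriteArraySet.
Searching the thread stack file for StringDynamicProperty turned up many threads associated with this class, which led to the following code: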
HystrixDynamicProperties dynamicProperties = HystrixPlugins.getInstance().getDynamicProperties();
HystrixDynamicProperty<String> a = dynamicProperties.getString("hystrix.switch", "null");
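This code runs in a Filter that checks whether the Hystrix switch is on; if it is, the request is wrapped in a HystrixCommand so that it can be rate limited.
The switch was added in the previous release, so I was 99% sure the problem was here.
The problem with Hystrix dynamic property lookup
Hystrix manages its configuration through com.netflix.archaius, which supports dynamic updates. The code above uses its property-access API.
archaius provides this API for reading a property: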
DynamicStringProperty property1 = DynamicPropertyFactory.getInstance().getStringProperty("my.name", "null");
Tracing the call leads to:
com.netflix.config.DynamicPropertyFactory#getStringProperty(java.lang.String, java.lang.String, java.lang.Runnable)
...
DynamicStringProperty property = new DynamicStringProperty(propName, defaultValue);
...
com.netflix.config.PropertyWrapper#PropertyWrapper:
private static final IdentityHashMap<Class<? extends PropertyWrapper>, Object> SUBCLASSES_WITH_NO_CALLBACK
        = new IdentityHashMap<Class<? extends PropertyWrapper>, Object>();
// a: obtain the DynamicProperty instance
this.prop = DynamicProperty.getInstance(propName);
// b: decide whether to register a callback; this is where the OOM originates
if (!SUBCLASSES_WITH_NO_CALLBACK.containsKey(c)) {
    Runnable callback = new Runnable() {
        public void run() {
            propertyChanged();
        }
    };
    this.prop.addCallback(callback);
    callbacks.add(callback);
    this.prop.addValidator(new PropertyChangeValidator() {
        @Override
        public void validate(String newValue) {
            PropertyWrapper.this.validate(newValue);
        }
    });
    try {
        if (this.prop.getString() != null) {
            this.validate(this.prop.getString());
        }
    } catch (ValidationException e) {
        logger.warn("Error validating property at initialization. Will fallback to default value.", e);
        prop.updateValue(defaultValue);
    }
}
a: com.netflix.config.DynamicProperty#getInstance
private static final ConcurrentHashMap<String, DynamicProperty> ALL_PROPS
        = new ConcurrentHashMap<String, DynamicProperty>();
if (dynamicPropertySupportImpl == null) {
    DynamicPropertyFactory.getInstance();
}
// c: the cache, the main culprit behind the OOM
DynamicProperty prop = ALL_PROPS.get(propName);
if (prop == null) {
    prop = new DynamicProperty(propName);
    DynamicProperty oldProp = ALL_PROPS.putIfAbsent(propName, prop);
    if (oldProp != null) {
        prop = oldProp;
    }
}
return prop;
b: PropertyWrapper.SUBCLASSES_WITH_NO_CALLBACK
// Registry of the PropertyWrapper subclasses that do NOT need a callback
private static final IdentityHashMap<Class<? extends PropertyWrapper>, Object> SUBCLASSES_WITH_NO_CALLBACK
        = new IdentityHashMap<Class<? extends PropertyWrapper>, Object>();
// The classes below are registered by default as not needing a callback
static {
    PropertyWrapper.registerSubClassWithNoCallback(DynamicIntProperty.class);
    PropertyWrapper.registerSubClassWithNoCallback(DynamicStringProperty.class);
    PropertyWrapper.registerSubClassWithNoCallback(DynamicBooleanProperty.class);
    PropertyWrapper.registerSubClassWithNoCallback(DynamicFloatProperty.class);
    PropertyWrapper.registerSubClassWithNoCallback(DynamicLongProperty.class);
    PropertyWrapper.registerSubClassWithNoCallback(DynamicDoubleProperty.class);
}
...
// When the property is read through HystrixPlugins.getInstance().getDynamicProperties().getXxx(),
// c is HystrixDynamicPropertiesArchaius.StringDynamicProperty, which is not registered in
// SUBCLASSES_WITH_NO_CALLBACK, so every call enters this branch and adds a callback and a validator
if (!SUBCLASSES_WITH_NO_CALLBACK.containsKey(c)) {
    Runnable callback = new Runnable() {
        public void run() {
            propertyChanged();
        }
    };
    // d: registers the callback on the shared DynamicProperty (prop), adding it to prop's callbacks
    this.prop.addCallback(callback);
    // e: also records the same callback in this wrapper's own callbacks field
    callbacks.add(callback);
    this.prop.addValidator(new PropertyChangeValidator() {
        @Override
        public void validate(String newValue) {
            PropertyWrapper.this.validate(newValue);
        }
    });
    try {
        if (this.prop.getString() != null) {
            this.validate(this.prop.getString());
        }
    } catch (ValidationException e) {
        logger.warn("Error validating property at initialization. Will fallback to default value.", e);
        prop.updateValue(defaultValue);
    }
}
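// c: the cache, the main culprit behind the OOM
DynamicProperty.ALL_PROPS
private static final ConcurrentHashMap<String, DynamicProperty> ALL_PROPS
        = new ConcurrentHashMap<String, DynamicProperty>();
ALL_PROPS is a thread-safe hash map, but it holds each DynamicProperty through a strong reference, and an entry is put in the first time a property is accessed. One possible improvement would be to hold the values through weak references, so that the GC can reclaim a DynamicProperty that is no longer strongly referenced elsewhere. Note that weakly referenced objects become collectable at any GC, not only just before an OOM; SoftReference is the reference type that is guaranteed to be cleared before an OutOfMemoryError is thrown:
private static final ConcurrentHashMap<String, WeakReference<DynamicProperty>> ALL_PROPS
        = new ConcurrentHashMap<>();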
d/e:
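The DynamicProperty is always taken from the cache, while the wrapper class here is HystrixDynamicPropertiesArchaius.StringDynamicProperty, which is not registered in SUBCLASSES_WITH_NO_CALLBACK by default. So every call to HystrixPlugins.getInstance().getDynamicProperties().getString("hystrix.switch", "null") adds another callback and another validator to that property's callbacks/validators. Each one is a new anonymous instance, so the sets cannot deduplicate them. On top of that, CopyOnWriteArraySet is copy-on-write: every add copies the whole backing array, which is expensive in memory and makes writes very slow under high concurrency.
As a result, that one DynamicProperty object in DynamicProperty.ALL_PROPS keeps growing, the GC cannot reclaim it, and the system eventually runs out of memory.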
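The accumulation described above can be reproduced with the JDK alone. The following is a simplified sketch, not the archaius source: CachedProperty stands in for DynamicProperty, Wrapper stands in for a PropertyWrapper subclass that is not in the no-callback registry, and all names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

// Simplified, hypothetical model of the pattern; not the archaius implementation.
class CallbackLeakDemo {

    /** Stands in for DynamicProperty: one shared instance per name, held by a static cache. */
    static final class CachedProperty {
        static final Map<String, CachedProperty> ALL_PROPS = new ConcurrentHashMap<>();
        final Set<Runnable> callbacks = new CopyOnWriteArraySet<>();

        static CachedProperty getInstance(String name) {
            return ALL_PROPS.computeIfAbsent(name, n -> new CachedProperty());
        }
    }

    /** Stands in for a wrapper that registers a callback in its constructor. */
    static final class Wrapper {
        final CachedProperty prop;

        Wrapper(String name) {
            this.prop = CachedProperty.getInstance(name);
            // Every anonymous Runnable is a distinct object, so the set cannot deduplicate it.
            // It also captures this Wrapper, keeping it reachable from the static cache.
            this.prop.callbacks.add(new Runnable() {
                @Override
                public void run() {
                    onChange();
                }
            });
        }

        void onChange() {
            // No-op in this model.
        }
    }

    /** Simulates requests that each look the property up again, then reports the set size. */
    static int callbacksAfter(String name, int requests) {
        for (int i = 0; i < requests; i++) {
            new Wrapper(name);
        }
        return CachedProperty.getInstance(name).callbacks.size();
    }

    public static void main(String[] args) {
        System.out.println(callbacksAfter("hystrix.switch", 5_000)); // prints 5000
    }
}
```

The callbacks set grows by one per lookup and nothing ever removes an entry, which matches the linear memory growth seen in the container monitoring.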
For an analysis of the archaius source code, see
https://blog.csdn.net/f641385712/category_9911741.html
Solutions
1 Register StringDynamicProperty in SUBCLASSES_WITH_NO_CALLBACK
PropertyWrapper.registerSubClassWithNoCallback(HystrixDynamicPropertiesArchaius.StringDynamicProperty.class);
However, HystrixDynamicPropertiesArchaius.StringDynamicProperty is a private class and cannot be referenced from outside, so this approach does not work.
2 Use DynamicPropertyFactory
DynamicStringProperty property1 = DynamicPropertyFactory.getInstance().getStringProperty("hystrix.switch", "off");
DynamicPropertyFactory wraps the property in a DynamicStringProperty, and that class does not register callbacks or validators.
3 Read the configuration through another framework, such as disconfig
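To see why option 2 avoids the leak while the Hystrix wrapper does not, it helps to model the class check on its own. The sketch below is a simplified illustration, not the archaius source: the registry is keyed by the exact runtime class, so only wrappers whose class was registered skip callback registration. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Simplified, hypothetical model of the no-callback registry; not the archaius implementation.
class NoCallbackRegistryDemo {

    // Stands in for the callbacks held by the shared, cached property.
    static final List<Runnable> SHARED_CALLBACKS = new ArrayList<>();

    // Classes listed here are known not to need a change callback.
    static final Map<Class<?>, Object> NO_CALLBACK = new IdentityHashMap<>();

    abstract static class Wrapper {
        Wrapper() {
            // getClass() returns the runtime class, so only exactly registered classes match.
            if (!NO_CALLBACK.containsKey(getClass())) {
                SHARED_CALLBACKS.add(this::onChange);
            }
        }

        void onChange() {
            // No-op in this model.
        }
    }

    /** Plays the role of DynamicStringProperty: registered, so no callback is added. */
    static final class RegisteredWrapper extends Wrapper { }

    /** Plays the role of the private Hystrix wrapper: not registered. */
    static final class UnregisteredWrapper extends Wrapper { }

    static {
        NO_CALLBACK.put(RegisteredWrapper.class, Boolean.TRUE);
    }

    /** Clears the shared list, performs the lookups, and returns how many callbacks were added. */
    static int registeredAfter(Supplier<Wrapper> factory, int lookups) {
        SHARED_CALLBACKS.clear();
        for (int i = 0; i < lookups; i++) {
            factory.get();
        }
        return SHARED_CALLBACKS.size();
    }

    public static void main(String[] args) {
        System.out.println(registeredAfter(RegisteredWrapper::new, 1_000));   // prints 0
        System.out.println(registeredAfter(UnregisteredWrapper::new, 1_000)); // prints 1000
    }
}
```

Because the check is by exact class, a wrapper defined in another library stays outside the registry unless that library (or code with access to the class) registers it.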
As for how the configuration for HystrixCommandProperties, HystrixThreadPoolProperties and HystrixTimerThreadPoolProperties is read, I will debug that later.