欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

[内存泄漏]Objects remaining in slab_test on kmem_cache_close()

程序员文章站 2022-03-02 14:37:37
...

公司产品在重启服务时,log里有如下Call Trace。

Jul 29 09:27:39 cpe kernel: [  997.113469] =============================================================================
Jul 29 09:27:39 cpe kernel: [  997.113474] BUG slab_test (Tainted: G           OE  ------------  ): Objects remaining in slab_test on kmem_cache_close()
Jul 29 09:27:39 cpe kernel: [  997.113477] -----------------------------------------------------------------------------
Jul 29 09:27:39 cpe kernel: [  997.113477] 
Jul 29 09:27:39 cpe kernel: [  997.113480] Disabling lock debugging due to kernel taint
Jul 29 09:27:39 cpe kernel: [  997.113484] INFO: Slab 0xffffea0004d9cbc0 objects=32 used=1 fp=0xffff88013672ff80 flags=0x2fffff00000080
Jul 29 09:27:39 cpe kernel: [  997.113489] CPU: 2 PID: 15480 Comm: rmmod Tainted: G    B      OE  ------------   3.10.0-514.26.2.el7.x86_64 #1
Jul 29 09:27:39 cpe kernel: [  997.113492] Hardware name: AEWIN Net Work SCB-1615/Aptio CRB, BIOS 5.6.5 11/12/2018
Jul 29 09:27:39 cpe kernel: [  997.113495]  ffffea0004d9cbc0 000000000887defe ffff880094d1bd10 ffffffff81687133
Jul 29 09:27:39 cpe kernel: [  997.113502]  ffff880094d1bde8 ffffffff811da714 ffff880100000020 ffff880094d1bdf8
Jul 29 09:27:39 cpe kernel: [  997.113507]  ffff880094d1bda8 656a624f029c5ac0 616d657220737463 6e6920676e696e69
Jul 29 09:27:39 cpe kernel: [  997.113512] Call Trace:
Jul 29 09:27:39 cpe kernel: [  997.113521]  [<ffffffff81687133>] dump_stack+0x19/0x1b
Jul 29 09:27:39 cpe kernel: [  997.113529]  [<ffffffff811da714>] slab_err+0xb4/0xe0
Jul 29 09:27:39 cpe kernel: [  997.113535]  [<ffffffff8118a200>] ? alloc_pages_exact+0x10/0x40
Jul 29 09:27:39 cpe kernel: [  997.113540]  [<ffffffff811db0bc>] ? __free_slab+0xdc/0x1f0
Jul 29 09:27:39 cpe kernel: [  997.113545]  [<ffffffff811dc150>] ? kmem_cache_alloc_bulk+0x140/0x140
Jul 29 09:27:39 cpe kernel: [  997.113550]  [<ffffffff811dda13>] ? __kmalloc+0x1f3/0x240
Jul 29 09:27:39 cpe kernel: [  997.113554]  [<ffffffff811e00eb>] ? kmem_cache_close+0x12b/0x2f0
Jul 29 09:27:39 cpe kernel: [  997.113558]  [<ffffffff811e010c>] kmem_cache_close+0x14c/0x2f0
Jul 29 09:27:39 cpe kernel: [  997.113563]  [<ffffffff811e02c4>] __kmem_cache_shutdown+0x14/0x80
Jul 29 09:27:39 cpe kernel: [  997.113569]  [<ffffffff811a5e14>] kmem_cache_destroy+0x44/0xf0
Jul 29 09:27:39 cpe kernel: [  997.113605]  [<ffffffffa07ddd21>] _TEST_PoolDestroy_NL+0x38/0x3a [lwconn]
Jul 29 09:27:39 cpe kernel: [  997.113636]  [<ffffffffa07b380c>] TEST_PoolDestroy+0x8c/0xa0 [lwconn]
Jul 29 09:27:39 cpe kernel: [  997.113662]  [<ffffffffa0751f30>] TEST_RuleExit+0x60/0x80 [lwconn]
Jul 29 09:27:39 cpe kernel: [  997.113686]  [<ffffffffa074001a>] TEST_Exit+0x5a/0x70 [lwconn]
Jul 29 09:27:39 cpe kernel: [  997.113708]  [<ffffffffa073706f>] cleanup_module+0x2f/0xd0 [lwconn]
Jul 29 09:27:39 cpe kernel: [  997.113714]  [<ffffffff810fe3db>] SyS_delete_module+0x16b/0x2d0
Jul 29 09:27:39 cpe kernel: [  997.113720]  [<ffffffff81697809>] system_call_fastpath+0x16/0x1b
Jul 29 09:27:39 cpe kernel: [  997.113726] INFO: Object 0xffff88013672f100 @offset=256

很明显 kmem_cache slab_test中有object没有释放,然后在销毁slab_test时,会打印该Call Trace。而该 kmem_cache slab_test并不会释放,造成了内存泄漏。
业务重新启动后,会再次申请 kmem_cache slab_test。于是我们在/proc/slabinfo中会看到2个 kmem_cache slab_test。

[[email protected] ~]# cat /proc/slabinfo |grep slab_test
slab_test    320    320    128   32    1 : tunables    0    0    0 : slabdata     10     10      0
slab_test      1     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0

起初怀疑是卸载模块时,相关object因为代码bug没有释放完全。分析相关代码,发现处理逻辑无误,确实遍历链表将所有object都释放了。

而这个现象后续一致没再出现,也没找到复现条件。导致陷入僵局。

有天在思考中突然想,为何不把未释放的object内存dump出来?还原出数据结构来分析问题,说不定能有新的进展。

于是 echo c > /proc/sysrq-trigger 强制系统crash,产生core文件。使用crash工具打开。

crash> kmem -s |grep slab_test
ffff8800517ea300 slab_test         84        267       320     10     4k
ffff8801394bca00 slab_test         84          1        32      1     4k

通过kmem_cache 获取node地址,再得到node

crash> struct kmem_cache.node ffff8801394bca00
  node = {0xffff8801394bd280, ...//省略部分...}
crash> struct kmem_cache_node 0xffff8801394bd280
struct kmem_cache_node {
  list_lock = {
    {
      rlock = {
        raw_lock = {
          {
            head_tail = 33161722, 
            tickets = {
              head = 506, 
              tail = 506
            }
          }
        }
      }
    }
  }, 
  nr_partial = 1, 
  partial = {
    next = 0xffffea0004d9cbe0,         //挂在链表上的page
    prev = 0xffffea0004d9cbe0
  }, 
  nr_slabs = {
    counter = 1
  }, 
  total_objects = {
    counter = 32
  }, 
  full = {
    next = 0xffff8801394bd2b0, 
    prev = 0xffff8801394bd2b0
  }
}

page->lru的偏移是0x20, 所以我们可以得到page的指针为0xffffea0004d9cbc0。

crash> kmem 0xffffea0004d9cbc0
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff8800517ea300 slab_test         84        267       320     10     4k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffea0004d9cbc0  ffff88013672f000     0     32          1    31
  FREE / [ALLOCATED]
   ffff88013672f000  
   ffff88013672f080  
  [ffff88013672f100]				//这是未释放的object地址
   ffff88013672f180  
   ffff88013672f200  
//...省略部分
   ffff88013672ff00  
   ffff88013672ff80  

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea0004d9cbc0 13672f000                0 ffff88013672ff80  1 2fffff00000080 slab

找到未释放object地址后,我们就可以读出其内容了。

crash> rd ffff88013672f100 12
ffff88013672f100:  000000004a41434b 0000000000000054   KCAJ....T.......
ffff88013672f110:  0000000000000000 0000000000000000   ................
ffff88013672f120:  ffff88013672f120 ffff88013672f120    .r6.... .r6....		//这是一个list_head,从值上看,可以指导是个刚初始化的。
ffff88013672f130:  000000000000000a ffff880036740800   ..........t6....
ffff88013672f140:  0000000000000000 0000000000000000   ................
ffff88013672f150:  0000000058494e47 0000000000000000   GNIX............

对比object的结构定义,发现结构list_head刚初始化,还未挂载到相应链表上。从这里突破,确定问题出在“object申请”和“插入到链表操作”之间。范围大大缩小,查看相关代码,发现在“插入到链表”之前,有个操作失败后,直接返回,没有对应的释放object操作,最终问题原因找到。

后记:
我在回溯定位过程时,发现走了点弯路,我是通过slab–>kmem_cache–>kmem_cache_node–>page–>object指针。实际上log中的Call Trace已经打印了问题object的指针。我们可以直接读取该Object地址。 ????

Jul 29 09:27:39 cpe kernel: [ 997.113726] INFO: Object 0xffff88013672f100 @offset=256