Alibaba Cloud container fails to start: failed to unshare namespaces, running exec setns process for init, Un

程序员文章站 2022-04-01 20:47:59

A container failed to start on one node of an Alibaba Cloud Swarm cluster. The logs and events showed the following errors:

"failed to unshare namespaces: Cannot allocate memory"

Failed to start container: Error response from daemon: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:247: running exec setns process for init caused \"exit status 34\""

Starting the container manually produced the same error:

Error response from daemon: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:247: running exec setns process for init caused \"exit status 34\""

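The first error says that a memory allocation failed, apparently because not enough memory was available.

The second error says that running the exec setns step failed while the container process was being started. Starting a container also means starting a process and setting up its namespaces, so this failure may be memory-related as well.

(As an aside, runC is essentially libcontainer with a lightweight client on top. A container provides an execution environment that shares the host's kernel but is isolated from the other processes on the system. Docker produces this environment by calling the libcontainer package to manage and allocate namespaces, cgroups, capabilities, and the filesystem.)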

On the affected node, dmesg shows the following in the kernel log:

[2751018.215519] docker_gwbridge: port 16(veth4c02c8b) entered disabled state
[2751018.239907] runc:[1:CHILD]: page allocation failure: order:6, mode:0x10c0d0 // failed to allocate 2^6 contiguous pages (2^6 * page_size)
[2751018.239917] CPU: 5 PID: 839206 Comm: runc:[1:CHILD] Tainted: G E ------------ T 3.10.0-514.26.2.el7.x86_64 #1
[2751018.239919] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[2751018.239931] Call Trace:
[2751018.239940] [] dump_stack+0x19/0x1b
[2751018.239945] [] warn_alloc_failed+0x110/0x180
[2751018.239990] [] kmem_cache_create_memcg+0x110/0x230
[2751018.239994] [] kmem_cache_create+0x2b/0x30
[2751018.240003] [] nf_conntrack_init_net+0x101/0x250 [nf_conntrack]
[2751018.240009] [] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]
[2751018.240025] [] create_new_namespaces+0xf9/0x180
[2751018.240028] [] unshare_nsproxy_namespaces+0x5a/0xc0
[2751018.240032] [] SyS_unshare+0x193/0x300
[2751018.240036] [] system_call_fastpath+0x16/0x1b
[2751018.240038] Mem-Info:
[2751018.240044] active_anon:1311792 inactive_anon:68376 isolated_anon:0
active_file:59538 inactive_file:93653 isolated_file:0
unevictable:1 dirty:8384 writeback:0 unstable:0
slab_reclaimable:1510499 slab_unreclaimable:802998
mapped:104765 shmem:68700 pagetables:50862 bounce:0
free:66051 free_pcp:32 free_cma:0
[2751018.240049] Node 0 DMA free:15860kB min:64kB low:80kB high:96kB all_unreclaimable? yes
[2751018.240056] lowmem_reserve[]: 0 2815 15868 15868
[2751018.240060] Node 0 DMA32 free:153112kB min:11976kB low:14968kB high:17964kB active_anon:925112kB inactive_anon:41540kB active_file:43200kB inactive_file:180168kB unevictable:0kB isolated(anon):0kB isolated(file):0kB all_unreclaimable? no
[2751018.240125] lowmem_reserve[]: 0 0 13053 13053
[2751018.240129] Node 0 Normal free:95232kB min:55536kB low:69420kB high:83304kB active_anon:4322056kB inactive_anon:231964kB active_file:194952kB inactive_file:194444kB unevictable:4kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13367060kB mlocked:4kB dirty:1936kB writeback:0kB mapped:366104kB shmem:232948kB slab_reclaimable:5108320kB slab_unreclaimable:2751412kB kernel_stack:27808kB pagetables:175532kB unstable:0kB bounce:0kB free_pcp:264kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:27 all_unreclaimable? no
[2751018.240137] lowmem_reserve[]: 0 0 0 0
[2751018.240140] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15860kB
[2751018.240149] Node 0 DMA32: 1515*4kB (UEM) 2456*8kB (UEM) 907*16kB (UEM) 2236*32kB (UEM) 567*64kB (UEM) 52*128kB (UEM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 154972kB
[2751018.240159] Node 0 Normal: 15952*4kB (UEM) 2235*8kB (UEM) 859*16kB (UEM) 17*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 95976kB
[2751018.240167] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[2751018.240169] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[2751018.240170] 221859 total pagecache pages
[2751018.240171] 0 pages in swap cache
[2751018.240172] Swap cache stats: add 0, delete 0, find 0/0
[2751018.240173] Free swap = 0kB
[2751018.240174] Total swap = 0kB
[2751018.240175] 4194174 pages RAM
[2751018.240176] 0 pages HighMem/MovableOnly
[2751018.240177] 127317 pages reserved
[2751018.240179] kmem_cache_create(nf_conntrack_ffff88036ea66180) failed with error -12
[2751018.240181] CPU: 5 PID: 839206 Comm: runc:[1:CHILD] Tainted: G E ------------ T 3.10.0-514.26.2.el7.x86_64 #1
[2751018.240183] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[2751018.240184] ffff8803b2254e60 000000009f4901bf ffff880066f2fd60 ffffffff81687133
[2751018.240186] ffff880066f2fdb0 ffffffff811a6322 0000000000080000 0000000000000000
[2751018.240188] 00000000fffffff4 ffff88036ea66180 ffffffff81ae6580 ffff88036ea66180
[2751018.240190] Call Trace:
[2751018.240193] [] dump_stack+0x19/0x1b
[2751018.240196] [] kmem_cache_create_memcg+0x162/0x230
[2751018.240198] [] kmem_cache_create+0x2b/0x30
[2751018.240203] [] nf_conntrack_init_net+0x101/0x250 [nf_conntrack]
[2751018.240206] [] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]
[2751018.240224] [] copy_net_ns+0x7c/0x130
[2751018.240227] [] create_new_namespaces+0xf9/0x180
[2751018.240229] [] unshare_nsproxy_namespaces+0x5a/0xc0
[2751018.240231] [] SyS_unshare+0x193/0x300
[2751018.240235] Unable to create nf_conn slab cache

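The key terms in the call stack (unshare, namespace, kmem_cache_create, page allocation failure) match the errors reported when the container failed to start.

Since a memory allocation is failing, let's see how much memory is still available: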

[root@ ~]# free -mh
              total        used        free      shared  buff/cache   available
Mem:            15G        5.5G        270M        278M        9.7G        6.3G
Swap:            0B          0B          0B

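Although free shows only 270M, free plus buff/cache comes to about 10G (and the available column estimates 6.3G), which should be more than enough to start one container.

Going back to the log, there is the line "page allocation failure: order:6", so let's look at the free lists of the kernel's buddy allocator: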

[root@beta_004 ~]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      1      1      1      1      0      1      1      3
Node 0, zone    DMA32   8319   4025    634    266      0      0      0      0      0      0      0
Node 0, zone   Normal   8860  11464    488    174     27      0      0      0      0      0      0

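To manage and allocate memory of different sizes quickly, the kernel's buddy allocator keeps free memory in blocks of 2^order * PAGE_SIZE, and each column of buddyinfo is the number of free blocks at one order, starting from order 0. The output shows that no free blocks of order 5 or higher remain (ignore the DMA zone and look at Normal). Plenty of small blocks are left, for example 8860 blocks of 4K and 11464 blocks of 8K. The failing request was order 6, which with 4 kB pages means 2^6 * 4 kB = 256 kB of contiguous memory, and no free list can supply that.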
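The reading of buddyinfo described above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes 4 kB pages (as the "4kB" columns in the kernel log suggest) and the usual layout of one column per order starting at order 0. The function names are made up for illustration, and the check ignores watermarks, which the real allocator also takes into account.

```python
PAGE_KB = 4  # assumption: 4 kB pages, as the "4kB" columns in the kernel log suggest


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        # Expected shape: Node <n> zone <name> <count> <count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        zones[(int(parts[1]), parts[3])] = [int(c) for c in parts[4:]]
    return zones


def largest_free_order(counts):
    """Highest order that still has at least one free block, or -1 if none."""
    return max((order for order, n in enumerate(counts) if n > 0), default=-1)


def can_satisfy(counts, order):
    """True if a free block of the requested order or larger exists.

    Simplification: watermark checks done by the real allocator are ignored.
    """
    return any(n > 0 for n in counts[order:])


# Normal-zone row from the failing node, copied from the output above.
sample = "Node 0, zone Normal 8860 11464 488 174 27 0 0 0 0 0 0"
normal = parse_buddyinfo(sample)[(0, "Normal")]

print("largest free order:", largest_free_order(normal))
print("order 6 satisfiable:", can_satisfy(normal, 6))
print("order 6 block size:", (2 ** 6) * PAGE_KB, "kB")
```

On a live system the same functions could be fed the contents of /proc/buddyinfo instead of the sample string.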

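In the kernel log, free blocks were already exhausted from order 4 upward: 15952 blocks of 4K, 2235 blocks of 8K, and so on. The log and the "cat /proc/buddyinfo" output were captured at different times, so the per-size counts differ, but the overall picture is the same: heavy fragmentation and no large blocks.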

[2751018.240159] Node 0 Normal: 15952*4kB (UEM) 2235*8kB (UEM) 859*16kB (UEM) 17*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 95976kB

On a healthy node, by contrast, cat /proc/buddyinfo shows free blocks at every order in the Normal zone:

[root@iZ2zehr0ayevgplh6owqe3Z ~]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      0      1      2      1      1      0      1      1      3
Node 0, zone    DMA32    571    288    125     38     19     12      5      5    401      0      0
Node 0, zone   Normal   2121    754    758    419    735   1411   1202   1160   1075      3      1

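What can be done to reduce memory fragmentation?

1. The extfrag_ parameters and settings described in https://www.kernel.org/doc/Documentation/sysctl/vm.txt

In kernel memory management, this kind of fragmentation is called external fragmentation, abbreviated extfrag.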

extfrag_threshold

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.

[root@beta_004 vm]# cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone DMA32 -1.000 -1.000 -1.000 -1.000 0.864 0.932 0.966 0.983 0.992 0.996 0.998
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 0.919 0.960 0.980 0.990 0.995 0.998 0.999

[root@beta_004 vm]# cat extfrag_threshold
500

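In the DMA32 and Normal zones, only the first four orders show -1 in extfrag_index, meaning allocations at those orders will succeed as long as watermarks are met. The remaining seven values lie between 0.864 and 0.999, that is, close to 1 (1000 on the scale used by extfrag_threshold). According to the documentation quoted above, that indicates allocations at those orders would fail because of fragmentation, not because of a lack of memory.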
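The interpretation rules from vm.txt can be applied mechanically to each value. The sketch below is a rough illustration only: the function names are made up, and the 0.5 cutoff between "towards 0" and "towards 1" is our own assumption, since the documentation only describes tendencies.

```python
def classify(index):
    """Interpret one extfrag_index value along the lines of vm.txt."""
    if index < 0:
        return "ok"              # -1: succeeds as long as watermarks are met
    if index >= 0.5:             # assumed cutoff, not taken from the kernel
        return "fragmentation"   # tending towards 1 (1000 on the sysctl scale)
    return "lack of memory"      # tending towards 0


def classify_line(line):
    """Return (zone name, [verdict per order]) for one extfrag_index line."""
    parts = line.replace(",", " ").split()
    return parts[3], [classify(float(v)) for v in parts[4:]]


# Normal-zone row copied from the extfrag_index output above.
line = ("Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 "
        "0.919 0.960 0.980 0.990 0.995 0.998 0.999")
zone, verdicts = classify_line(line)
for order, verdict in enumerate(verdicts):
    print(zone, "order", order, verdict)
```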

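The default extfrag_threshold is 500. We tried changing it to 0.5, waited ten minutes, and saw no real change in buddyinfo. Note that the parameter is an integer on the same 0 to 1000 scale as the index, so 0.5 may not have taken effect as intended; in any case the index values above are already well over 500, so the threshold was probably not what was holding compaction back. More likely the memory was simply too fragmented to be merged any further.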
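To see how an index value relates to the threshold, the index can be recomputed from buddyinfo counts. The sketch below follows our reading of the fragmentation index calculation in the kernel's mm/vmstat.c and should be treated as an approximation of it, not a reference; the function name is made up, and the input is the Normal-zone row from the first buddyinfo output, which was captured at a different time than extfrag_index, so the numbers will not match exactly.

```python
EXTFRAG_THRESHOLD = 500  # default value, as shown by "cat extfrag_threshold"


def fragmentation_index(counts, order):
    """Fragmentation index on the 0..1000 scale, or -1000 if the request can succeed.

    counts[i] is the number of free blocks of order i in one zone.
    """
    free_blocks_total = sum(counts)
    if free_blocks_total == 0:
        return 0
    if any(n > 0 for n in counts[order:]):
        return -1000  # a suitable block exists, so the index is not meaningful
    free_pages = sum(n << i for i, n in enumerate(counts))
    requested = 1 << order
    return 1000 - (1000 + free_pages * 1000 // requested) // free_blocks_total


# Normal-zone row from the first buddyinfo output of the failing node.
normal = [8860, 11464, 488, 174, 27, 0, 0, 0, 0, 0, 0]

for order in (4, 5, 6):
    index = fragmentation_index(normal, order)
    print("order", order, "index", index,
          "above threshold:", index > EXTFRAG_THRESHOLD)
```

With plenty of small free blocks and no large ones, the index for the failing orders lands close to 1000, far above the default threshold.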

2. The compact_memory parameter described in https://www.kernel.org/doc/Documentation/sysctl/vm.txt

compact_memory

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important for example in the allocation of
huge pages although processes will also directly compact memory as required.

This option is enabled in the kernel we are running:

[root@beta_004 vm]# uname -a
Linux beta_004 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@beta_004 vm]# cat /boot/config-3.10.0-514.26.2.el7.x86_64 | grep COMPACTION

CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y

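The corresponding control file is /proc/sys/vm/compact_memory, which is write-only. Compaction is triggered with: echo 1 > /proc/sys/vm/compact_memory

Checking buddyinfo again, nothing improved: the number of small free blocks grew, but blocks of order 3 and above became even scarcer.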

[root@beta_004 vm]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      1      1      1      1      0      1      1      3
Node 0, zone    DMA32   7502   3725   2606    144      0      0      0      0      0      0      0
Node 0, zone   Normal  26138  10751   1001     22      0      0      0      0      0      0      0
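The two Normal-zone rows, from before and after writing to compact_memory, can be compared numerically. This is a small sketch assuming 4 kB pages; the helper names are made up for illustration, and the two snapshots were of course taken at different times under a live workload, so the comparison is only indicative.

```python
PAGE_KB = 4  # assumption: 4 kB pages


def free_kb(counts):
    """Total free memory in kB represented by one buddyinfo row."""
    return sum(n * (PAGE_KB << order) for order, n in enumerate(counts))


def blocks_at_or_above(counts, order):
    """Number of free blocks of the given order or larger."""
    return sum(counts[order:])


# Normal-zone rows copied from the two buddyinfo outputs of the failing node.
before = [8860, 11464, 488, 174, 27, 0, 0, 0, 0, 0, 0]
after = [26138, 10751, 1001, 22, 0, 0, 0, 0, 0, 0, 0]

for label, counts in (("before", before), ("after", after)):
    print(label, free_kb(counts), "kB free,",
          blocks_at_or_above(counts, 3), "blocks of order >= 3")
```

The total amount of free memory in the zone is not the limiting factor here; what matters for the failing request is the count in the higher-order columns.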

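3. https://events.static.linuxfound.org/sites/events/files/slides/%5BELC-2015%5D-System-wide-Memory-Defragmenter.pdf A defragmentation approach presented by an engineer from Samsung India at a Linux conference in 2015. It has not been merged into current kernel versions, so we could not try it.

In short, it adds a kernel tunable, used as echo 1 > /proc/sys/vm/shrink_memory followed by echo 1 > /proc/sys/vm/compact_memory

The change in the compact_ counters can be watched with cat /proc/vmstat | grep compact; typically some pages or blocks get moved, and fragmentation drops to some degree.

After trying these approaches, the container still would not start. So we went back to the problem itself: the failure occurs when a large block of memory is allocated, so where does the container ask for one? A careful review of the configuration showed that the module was configured with a maximum memory of 256M, and that was the only candidate. Reducing it to 192M still failed. Finally we tried removing the setting altogether, and the redeployment succeeded.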
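Watching the counters is easier with a before/after difference. The sketch below is only an illustration: the function names are made up, the counter names that appear depend on the kernel version, and the two snapshots are invented numbers, not measurements from the node in this article.

```python
def compact_counters(vmstat_text):
    """Return {name: value} for the compact_* lines of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].startswith("compact_"):
            counters[fields[0]] = int(fields[1])
    return counters


def counter_delta(before_text, after_text):
    """Per-counter increase between two /proc/vmstat snapshots."""
    before = compact_counters(before_text)
    after = compact_counters(after_text)
    return {name: after[name] - before.get(name, 0) for name in after}


# Hypothetical snapshots: these values are invented for illustration only.
snap1 = "nr_free_pages 66051\ncompact_stall 120\ncompact_fail 100\ncompact_success 20\n"
snap2 = "nr_free_pages 70000\ncompact_stall 125\ncompact_fail 104\ncompact_success 21\n"

print(counter_delta(snap1, snap2))
```

On a live system the two texts would come from reading /proc/vmstat before and after writing to compact_memory.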

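After this long detour, it turned out that, besides the system running short of contiguous memory, the problem was tied to the container's own configuration. The limit is meant to keep a running container from requesting and holding memory without bound; unexpectedly, in this particular situation it prevented the deployment from succeeding.

Items that need further study:

1. The meaning of the fields in slabtop, especially Active / Total Slabs (% used), which is always 100%. Some objects also show very high usage (99%). Does this matter?

2. slab_reclaimable accounts for more than 5G of memory. It does not seem to be the same thing as the per-order free blocks shown in buddyinfo. How are these reclaimable objects reclaimed, and when? The defragmentation attempts above did not seem to have any effect on this number.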

slab_reclaimable:5108320kB

3. Some say this is a kernel bug that may have been fixed somewhere between 4.9.12 and 4.9.18; this remains to be confirmed.

Further investigation indicates that I'm probably hitting this kernel bug: OOM but no swap used. – Mark Feb 24 at 21:36
For anyone else who's experiencing this issue, the bug appears to have been fixed somewhere between 4.9.12 and 4.9.18. – Mark Apr 11 at 20:23
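As a starting point for item 2, the page counts in the Mem-Info block of the kernel log can be converted into sizes to see where the memory sits. This is a minimal sketch assuming 4 kB pages; the values are copied from the Mem-Info lines above, and the helper name is made up for illustration.

```python
PAGE_KB = 4  # assumption: 4 kB pages

# Page counts copied from the Mem-Info lines of the kernel log (all zones).
mem_info_pages = {
    "active_anon": 1311792,
    "inactive_anon": 68376,
    "active_file": 59538,
    "inactive_file": 93653,
    "slab_reclaimable": 1510499,
    "slab_unreclaimable": 802998,
    "free": 66051,
}


def pages_to_gib(pages):
    """Convert a page count to GiB."""
    return pages * PAGE_KB / (1024 * 1024)


# Largest consumers first.
for name, pages in sorted(mem_info_pages.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {pages_to_gib(pages):6.2f} GiB")
```

By this count the two slab categories together hold more memory than anonymous pages do, which is worth keeping in mind when reading the buff/cache column of free.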
