CentOS 7 (3.10.0-123.el7.x86_64) reboot problem

程序员文章站 2022-06-16 15:13:20

CentOS 7 (3.10.0-123.el7.x86_64) reboot problem	http://aperise.iteye.com/blog/2326082
CentOS 7 (3.10.0-327.el7.x86_64) reboot problem	http://aperise.iteye.com/blog/2425717

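1. Problem

We bought new servers (2U, 2 CPUs with 6 cores each, 8 × 16 GB RAM, 5 × 2 TB disks), installed CentOS 7, and set up a Hadoop cluster and a Spark cluster. Recently, while running Spark jobs, I found that after the jobs had run a certain number of times, one server in the cluster, chosen seemingly at random, would reboot itself for no apparent reason.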

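2. Initial approach

1) Since the reboot happened at the operating-system level, the first place to look should be the operating system's crash logs;

2) Whatever else is going on, a Spark program has no privilege to reboot a server on its own; if resources were insufficient, the operating system should at the very least refuse Spark's resource requests;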

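I contacted the system operations engineer and asked him to check whether the server had any crash logs. The answer was that there were no crash logs at all, which made me wonder whether Spark had exhausted the server's resources to the point that the server could no longer even write a crash log.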

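System operations is not my area of expertise, so I largely trusted the operations engineer's answer, or at least did not question it at first (which in the end proved to be a fatal misjudgment). I therefore spent a great deal of time examining the resource consumption (CPU, memory, I/O) of the Hadoop and Spark clusters. During this period I mainly used the nmon tool to capture detailed metrics from every server, ran Spark jobs relentlessly to reproduce the problem, and then analyzed the nmon logs.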

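3. Capturing server performance metrics with nmon

nmon is a free tool for analyzing AIX and Linux performance, so I will also briefly introduce how to use it here. The version I downloaded consists mainly of the following two files:

nmon_x86_64_centos6.centos6: the nmon tool, which captures server resource logs; each log is saved as <hostname>_<date>_<time>.nmon (date as year, month, day; time as hour, minute)
nmon analyser v40.xlsm: converts the "<hostname>_<date>_<time>.nmon" file above into readable Excel charts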

3.1 nmon command-line options

Run the following commands to print nmon's usage help:

cd /home/hadoop/nmon
./nmon_x86_64_centos6 -h

The following information is printed:

Hint: nmon_x86_64_centos6 [-h] [-s <seconds>] [-c <count>] [-f -d <disks> -t -r <name>] [-x]

-h FULL help information
Interactive-Mode:
read startup banner and type: "h" once it is running
For Data-Collect-Mode (-f)
-f spreadsheet output format [note: default -s300 -c288]
optional
-s <seconds> between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-d <disks> to increase the number of disks [default 256]
-t spreadsheet includes top processes
-x capacity planning (15 min for 1 day = -fdt -s 900 -c 96)

Version - nmon 14i

For Interactive-Mode
-s <seconds> time between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-g <filename> User Defined Disk Groups [hit g to show them]
- file = on each line: group_name <disks list> space separated
- like: database sdb sdc sdd sde
- upto 64 disk groups, 512 disks per line
- disks can appear more than once and in many groups
-b black and white [default is colour]
example: nmon_x86_64_centos6 -s 1 -c 100

For Data-Collect-Mode = spreadsheet format (comma separated values)
Note: use only one of f,F,z,x or X and make it the first argument
-f spreadsheet output format [note: default -s300 -c288]
output file is <hostname>_YYYYMMDD_HHMM.nmon
-F <filename> same as -f but user supplied filename
-r <runname> used in the spreadsheet file [default hostname]
-t include top processes in the output
-T as -t plus saves command line arguments in UARG section
-s <seconds> between snap shots
-c <number> of snapshots before nmon stops
-d <disks> to increase the number of disks [default 256]
-l <dpl> disks/line default 150 to avoid spreadsheet issues. EMC=64.
-g <filename> User Defined Disk Groups (see above) - see BBBG & DG lines
-N include NFS Network File System
-I <percent> Include process & disks busy threshold (default 0.1)
don't save or show proc/disk using less than this percent
-m <directory> nmon changes to this directory before saving to file
example: collect for 1 hour at 30 second intervals with top procs
nmon_x86_64_centos6 -f -t -r Test1 -s30 -c120

To load into a spreadsheet:
sort -A *nmon >stats.csv
transfer the stats.csv file to your PC
Start spreadsheet & then Open type=comma-separated-value ASCII file
The nmon analyser or consolidator does not need the file sorted.

Capacity planning mode - use cron to run each day
-x sensible spreadsheet output for CP = one day
every 15 mins for 1 day ( i.e. -ft -s 900 -c 96)
-X sensible spreadsheet output for CP = busy hour
every 30 secs for 1 hour ( i.e. -ft -s 30 -c 120)

Interactive Mode Commands
key --- Toggles to control what is displayed ---
h = Online help information
r = Machine type, machine name, cache details and OS version + LPAR
c = CPU by processor stats with bar graphs
l = long term CPU (over 75 snapshots) with bar graphs
m = Memory stats
L = Huge memory page stats
V = Virtual Memory and Swap stats
k = Kernel Internal stats
n = Network stats and errors
N = NFS Network File System
d = Disk I/O Graphs
D = Disk I/O Stats
o = Disk I/O Map (one character per disk showing how busy it is)
o = User Defined Disk Groups
j = File Systems
t = Top Process stats use 1,3,4,5 to select the data & order
u = Top Process full command details
v = Verbose mode - tries to make recommendations
b = black and white mode (or use -b option)
. = minimum mode i.e. only busy disks and processes

key --- Other Controls ---
+ = double the screen refresh time
- = halves the screen refresh time
q = quit (also x, e or control-C)
0 = reset peak counts to zero (peak = ">")
space = refresh screen now

Startup Control
If you find you always type the same toggles every time you start
then place them in the NMON shell variable. For example:
export NMON=cmdrvtan

Others:
a) To you want to stop nmon - kill -USR2 <nmon-pid>
b) Use -p and nmon outputs the background process pid
c) To limit the processes nmon lists (online and to a file)
Either set NMONCMD0 to NMONCMD63 to the program names
or use -C cmd:cmd:cmd etc. example: -C ksh:vi:syncd
d) If you want to pipe nmon output to other commands use a FIFO:
mkfifo /tmp/mypipe
nmon -F /tmp/mypipe &
grep /tmp/mypipe
e) If nmon fails please report it with:
1) nmon version like: 14i
2) the output of cat /proc/cpuinfo
3) some clue of what you were doing
4) I may ask you to run the debug version

Developer Nigel Griffiths
Feedback welcome - on the current release only and state exactly the problem
No warranty given or implied.

3.2 Installing nmon on the server

Copy nmon_x86_64_centos6.centos6 to a directory on the server, for example /home/hadoop/nmon.

3.3 Capturing logs on the server

Run the following commands to capture the server metrics:

cd /home/hadoop/nmon
./nmon_x86_64_centos6 -f -t -r name_view_in_excel_sheet -s 15 -c 960
ls

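The command above takes one sample every 15 seconds, 960 samples in total (about 4 hours). "name_view_in_excel_sheet" is the name shown later in the Excel charts; it is usually set to the server's hostname.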
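The relationship between `-s`, `-c`, and the total capture window can be checked with a little shell arithmetic. This is only a minimal sketch: the function names `nmon_duration_seconds` and `nmon_count_for` are hypothetical helpers introduced here for illustration and are not part of nmon.

```shell
#!/bin/sh
# Hypothetical helpers (not part of nmon) relating -s and -c to the capture window.

# Total capture time in seconds for a given interval (-s) and count (-c).
nmon_duration_seconds() {
    interval=$1
    count=$2
    echo $(( interval * count ))
}

# Number of snapshots (-c) needed to cover N hours at a given interval,
# rounded up so the whole window is covered.
nmon_count_for() {
    hours=$1
    interval=$2
    echo $(( (hours * 3600 + interval - 1) / interval ))
}

nmon_duration_seconds 15 960   # seconds covered by -s 15 -c 960
nmon_count_for 24 15           # -c value for a full day at 15-second intervals
```

The value printed by the second call could be passed as `-c` if a longer capture is needed while trying to reproduce the reboot.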

3.4 Converting the server log file into Excel charts

Step 1: Open the file "nmon analyser v40.xlsm", click the "Analyze nmon data" button, and select the performance log file obtained above, "hadoop31_160921_2357.nmon".

This step reads the contents of "hadoop31_160921_2357.nmon", presents them as Excel charts, and finally generates an Excel file.

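nmon showed that the resources consumed by the Hadoop and Spark clusters were normal. The only abnormal thing was that after every Spark run, the memory held in cache on each server reached a startling 80 GB or more. What is more, even with all Hadoop and Spark services shut down, the cache was not released automatically for several days. I also ran an experiment here, releasing the cache manually each time as follows:

Manually releasing the cache: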

free -m
sync
echo 1 > /proc/sys/vm/drop_caches
clear
free -m

When the Spark jobs were run after releasing the cache this way, the servers never rebooted on their own.
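The before-and-after check done by hand with `free -m` can also be scripted. The sketch below is an illustration only. It assumes that the `Cached:` field of `/proc/meminfo` is an acceptable stand-in for the cache figure seen in nmon, and the helper names `cached_mb` and `cache_exceeds_mb` are hypothetical.

```shell
#!/bin/sh
# Print the page cache size in MB from a meminfo-format file.
# The file defaults to /proc/meminfo; the parameter lets the parsing be tried on a sample.
cached_mb() {
    meminfo=${1:-/proc/meminfo}
    awk '$1 == "Cached:" { printf "%d\n", $2 / 1024 }' "$meminfo"
}

# Exit status 0 when the cache is larger than a threshold given in MB (hypothetical helper).
cache_exceeds_mb() {
    threshold=$1
    meminfo=${2:-/proc/meminfo}
    [ "$(cached_mb "$meminfo")" -gt "$threshold" ]
}

# Example: report when more than 80 GB (81920 MB) is held in the page cache.
if [ -r /proc/meminfo ] && cache_exceeds_mb 81920; then
    echo "page cache above 80 GB: $(cached_mb) MB"
fi
```

The script only reports. Whether to follow up with `sync` and `echo 1 > /proc/sys/vm/drop_caches` is left as a manual decision, since that write requires root.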

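All in all, this was at least good practice in using nmon to analyze system performance. The nmon analysis showed that the cache was not being released, which pointed the root cause back at the operating system, since by rights the operating system should gradually release the cache.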

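4. Back to square one

1) I began to doubt the operations engineer's judgment, because the system was rebooting for no apparent reason and without any warning, yet no logs could be found;

2) Since no logs could be found, I had to get the operations engineer to actively track them down: first, the operating-system crash logs; second, core dumps of the Java programs;

3) I contacted the operations engineer again, determined to persuade him to find those two kinds of logs;

4) He was persuaded and began helping to retrieve them;

5) Good news arrived: the operations engineer obtained the system crash log, which revealed the following fatal error: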

Looking at the file vmcore-dmesg.txt revealed the following error (a scary kernel BUG):

[Screenshot: kernel BUG at mm/page_alloc.c:3765!]
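Searching for this kind of line in the crash output can be scripted. The following is a rough sketch under stated assumptions: `/var/crash` is assumed as the crash directory because it is the usual kdump target, but the post does not say where the file was found, and `find_kernel_bugs` is a hypothetical helper name.

```shell
#!/bin/sh
# List "kernel BUG at ..." lines in every vmcore-dmesg.txt under a crash directory.
# /var/crash is an assumption (the usual kdump target); the post does not state the path.
find_kernel_bugs() {
    crash_dir=${1:-/var/crash}
    find "$crash_dir" -type f -name 'vmcore-dmesg.txt' \
        -exec grep -H 'kernel BUG at' {} +
}

if [ -d /var/crash ]; then
    find_kernel_bugs /var/crash || echo "no kernel BUG lines found under /var/crash"
fi
```

Each match is printed with the path of the file it came from, so the crash it belongs to can be identified from the directory name.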
This is a kernel-level bug in CentOS 7. My Linux kernel version is as follows:

[Screenshot: kernel version 3.10.0-123.el7.x86_64]
The official CentOS 7 site describes it as follows:

Contention on a page table entry in memory triggered the kernel crash.
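To see whether another node in the cluster runs the same kernel release, a check along the following lines may help. It is a minimal sketch that flags only the exact release discussed in this post. Whether other releases are affected is not established here, and `is_affected_kernel` is a hypothetical helper name.

```shell
#!/bin/sh
# Kernel release named in this post as the one that crashed.
AFFECTED_RELEASE="3.10.0-123.el7.x86_64"

# Exit status 0 when the given release string (default: the running kernel) matches it.
is_affected_kernel() {
    release=${1:-$(uname -r)}
    [ "$release" = "$AFFECTED_RELEASE" ]
}

if is_affected_kernel; then
    echo "running the kernel release from this post: $(uname -r)"
else
    echo "running a different kernel release: $(uname -r)"
fi
```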