CentOS 7 (3.10.0-123.el7.x86_64) reboot problem
CentOS 7 (3.10.0-123.el7.x86_64) reboot problem | http://aperise.iteye.com/blog/2326082 |
CentOS 7 (3.10.0-327.el7.x86_64) reboot problem | http://aperise.iteye.com/blog/2425717 |
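1. The problem
We bought new servers (2U, 2 CPUs, 6 cores per CPU, 8 x 16 GB RAM, 5 x 2 TB disks) and installed CentOS 7 on them. After setting up a Hadoop cluster and a Spark cluster, I started running Spark jobs and found that once a job had run a certain number of times, one of the servers, picked seemingly at random and for no apparent reason, would reboot by itself.
2. Initial approach
1) Since the reboot happens at the operating-system level, the first place to look is the OS crash log;
2) Whatever else is going on, a Spark program has no authority to reboot a server. If resources were running short, the OS should at the very least refuse Spark's resource requests.
I contacted the system operations engineer and asked him to check whether the servers had any crash logs. The answer was that there were none, and he suggested I consider whether Spark had exhausted the server's resources so thoroughly that the server could no longer even write a crash log.
System operations is not my area of expertise, so I trusted the engineer's answer, or at least did not doubt it at first (which in the end proved to be a fatal misjudgment). I therefore spent a great deal of time examining the resource consumption (CPU, memory, I/O) of the Hadoop and Spark clusters. During this period I mainly used nmon to capture detailed metrics on every server, ran Spark jobs flat out to reproduce the problem, and then analyzed the nmon logs.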
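3. Capturing server performance metrics with nmon
nmon is a free tool for analyzing AIX and Linux performance, so I will briefly cover how to use it here as well. The version I downloaded consists mainly of the following two files:
- nmon_x86_64_centos6.centos6: the nmon tool itself, which captures server resource logs and saves them as hostname_date_time.nmon (date as year, month, day; time as hour, minute)
- nmon analyser v40.xlsm: converts the "hostname_date_time.nmon" file above into readable Excel charts
3.1 nmon command-line options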
./nmon_x86_64_centos6 -h
-h FULL help information
Interactive-Mode:
read startup banner and type: "h" once it is running
For Data-Collect-Mode (-f)
-f spreadsheet output format [note: default -s300 -c288]
optional
-s <seconds> between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-d <disks> to increase the number of disks [default 256]
-t spreadsheet includes top processes
-x capacity planning (15 min for 1 day = -fdt -s 900 -c 96)
Version - nmon 14i
For Interactive-Mode
-s <seconds> time between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-g <filename> User Defined Disk Groups [hit g to show them]
- file = on each line: group_name <disks list> space separated
- like: database sdb sdc sdd sde
- upto 64 disk groups, 512 disks per line
- disks can appear more than once and in many groups
-b black and white [default is colour]
example: nmon_x86_64_centos6 -s 1 -c 100
For Data-Collect-Mode = spreadsheet format (comma separated values)
Note: use only one of f,F,z,x or X and make it the first argument
-f spreadsheet output format [note: default -s300 -c288]
output file is <hostname>_YYYYMMDD_HHMM.nmon
-F <filename> same as -f but user supplied filename
-r <runname> used in the spreadsheet file [default hostname]
-t include top processes in the output
-T as -t plus saves command line arguments in UARG section
-s <seconds> between snap shots
-c <number> of snapshots before nmon stops
-d <disks> to increase the number of disks [default 256]
-l <dpl> disks/line default 150 to avoid spreadsheet issues. EMC=64.
-g <filename> User Defined Disk Groups (see above) - see BBBG & DG lines
-N include NFS Network File System
-I <percent> Include process & disks busy threshold (default 0.1)
don't save or show proc/disk using less than this percent
-m <directory> nmon changes to this directory before saving to file
example: collect for 1 hour at 30 second intervals with top procs
nmon_x86_64_centos6 -f -t -r Test1 -s30 -c120
To load into a spreadsheet:
sort -A *nmon >stats.csv
transfer the stats.csv file to your PC
Start spreadsheet & then Open type=comma-separated-value ASCII file
The nmon analyser or consolidator does not need the file sorted.
Capacity planning mode - use cron to run each day
-x sensible spreadsheet output for CP = one day
every 15 mins for 1 day ( i.e. -ft -s 900 -c 96)
-X sensible spreadsheet output for CP = busy hour
every 30 secs for 1 hour ( i.e. -ft -s 30 -c 120)
Interactive Mode Commands
key --- Toggles to control what is displayed ---
h = Online help information
r = Machine type, machine name, cache details and OS version + LPAR
c = CPU by processor stats with bar graphs
l = long term CPU (over 75 snapshots) with bar graphs
m = Memory stats
L = Huge memory page stats
V = Virtual Memory and Swap stats
k = Kernel Internal stats
n = Network stats and errors
N = NFS Network File System
d = Disk I/O Graphs
D = Disk I/O Stats
o = Disk I/O Map (one character per disk showing how busy it is)
g = User Defined Disk Groups
j = File Systems
t = Top Process stats use 1,3,4,5 to select the data & order
u = Top Process full command details
v = Verbose mode - tries to make recommendations
b = black and white mode (or use -b option)
. = minimum mode i.e. only busy disks and processes
key --- Other Controls ---
+ = double the screen refresh time
- = halves the screen refresh time
q = quit (also x, e or control-C)
0 = reset peak counts to zero (peak = ">")
space = refresh screen now
Startup Control
If you find you always type the same toggles every time you start
then place them in the NMON shell variable. For example:
export NMON=cmdrvtan
Others:
a) To you want to stop nmon - kill -USR2 <nmon-pid>
b) Use -p and nmon outputs the background process pid
c) To limit the processes nmon lists (online and to a file)
Either set NMONCMD0 to NMONCMD63 to the program names
or use -C cmd:cmd:cmd etc. example: -C ksh:vi:syncd
d) If you want to pipe nmon output to other commands use a FIFO:
mkfifo /tmp/mypipe
nmon -F /tmp/mypipe &
grep /tmp/mypipe
e) If nmon fails please report it with:
1) nmon version like: 14i
2) the output of cat /proc/cpuinfo
3) some clue of what you were doing
4) I may ask you to run the debug version
Developer Nigel Griffiths
Feedback welcome - on the current release only and state exactly the problem
No warranty given or implied.
3.2 Installing nmon on the server
Copy nmon_x86_64_centos6.centos6 to a directory on the server, for example /home/hadoop/nmon.
3.3 Capturing logs on the server
./nmon_x86_64_centos6 -f -t -r name_view_in_excel_sheet -s 15 -c 960
ls
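The command above takes a snapshot every 15 seconds, 960 times in total, which covers 4 hours. "name_view_in_excel_sheet" is the name shown later on the charts in Excel; it is usually set to the server's hostname.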
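As a quick sanity check on the -s/-c arithmetic, here is a minimal sketch that works out how long a capture will run and prints the matching command line. The helper names capture_seconds and nmon_command are made up for illustration and are not part of nmon; the binary name and flags are the ones used above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of nmon) for planning a capture run.

# Total capture time in seconds = interval * number of snapshots.
capture_seconds() {
    local interval=$1 count=$2
    echo $(( interval * count ))
}

# Print the nmon command line for a run name, interval and snapshot count.
# This only prints the command; it does not start nmon.
nmon_command() {
    local run_name=$1 interval=$2 count=$3
    echo "./nmon_x86_64_centos6 -f -t -r ${run_name} -s ${interval} -c ${count}"
}

# Example: 15-second interval, 960 snapshots, run name = this machine's hostname.
secs=$(capture_seconds 15 960)
echo "capture runs for ${secs} s ($(( secs / 3600 )) h)"
nmon_command "$(uname -n)" 15 960
```

Choosing the interval and count this way makes it easy to size a capture so that it spans a whole batch of Spark jobs.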
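3.4 Converting the log file into Excel charts
Step 1: open "nmon analyser v40.xlsm", click the "Analyze nmon data" button, and select the performance log captured above, "hadoop31_160921_2357.nmon".
This step reads the contents of "hadoop31_160921_2357.nmon", renders them as Excel charts, and finally produces an Excel file.
nmon showed that the resources consumed by the Hadoop and Spark clusters were otherwise normal. The one abnormal thing was that after every Spark job, the memory held in cache on each server reached a startling 80 GB or more, and even with every Hadoop and Spark service shut down, the cache was not released on its own for days. I also ran an experiment in which I released the cache manually each time, as follows: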
sync
echo 1 > /proc/sys/vm/drop_caches
clear
free -m
With the cache released this way before each Spark job, the servers never once rebooted on their own.
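The manual steps above can be wrapped in a small script that first checks how much memory the page cache is holding. This is only a sketch: the function names and the idea of a size limit are my own assumptions, and writing to /proc/sys/vm/drop_caches requires root.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the function names and the limit are illustrative.

# Print the page-cache size in MiB from a meminfo-format file
# (defaults to /proc/meminfo, whose "Cached:" line is reported in kB).
cached_mib() {
    awk '$1 == "Cached:" { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Drop the page cache only when it exceeds a limit given in MiB.
# Must be run as root, otherwise the write to drop_caches fails.
drop_cache_if_over() {
    local limit_mib=$1
    local current
    current=$(cached_mib)
    if [ "${current}" -gt "${limit_mib}" ]; then
        sync                                  # flush dirty pages to disk first
        echo 1 > /proc/sys/vm/drop_caches     # 1 = free the page cache only
    fi
}

# Report the current value; nothing is dropped unless drop_cache_if_over is called.
if [ -r /proc/meminfo ]; then
    echo "page cache: $(cached_mib) MiB"
fi
```

For example, running `drop_cache_if_over 81920` as root would release the cache only once it has grown past 80 GiB. This is a workaround for reproducing the experiment, not a fix.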
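In short, this turned out to be good practice in using nmon to analyze system performance. nmon showed that the cache was not being released, which pointed the root cause back at the operating system: in principle the OS should reclaim the cache over time.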
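The cache figure can also be read straight from a .nmon file without Excel. The sketch below assumes the Linux nmon layout in which the MEM section starts with a header row naming its columns (one of them "cached", in MB) followed by rows keyed T0001, T0002 and so on. That layout is my assumption about the file format, so check it against your own capture first. peak_cached_mb is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumes MEM rows in the .nmon file look like:
#   MEM,<description>,...,cached,...      (header row with column names)
#   MEM,T0001,...                         (one data row per snapshot)

# Print the largest "cached" value (MB) found in the MEM section.
peak_cached_mb() {
    awk -F, '
        # Header row: find the "cached" column by name, not by position.
        $1 == "MEM" && $2 !~ /^T[0-9]+$/ {
            for (i = 3; i <= NF; i++) if ($i == "cached") col = i
            next
        }
        # Data rows: keep the maximum seen so far.
        $1 == "MEM" && col && $col + 0 > max { max = $col + 0 }
        END { if (col) print max + 0 }
    ' "$1"
}

# Usage: pass a .nmon file as the first argument to this script.
if [ "$#" -gt 0 ]; then
    peak_cached_mb "$1"
fi
```

Looking the column up by name keeps the sketch working even if a different nmon build orders the MEM columns differently.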
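4. Back to square one
1) I began to doubt the system operations engineer's judgment, because it made no sense for a system to reboot out of nowhere with no warning and no logs at all;
2) Since no logs had been found, the engineer needed to go and actively look for them: first the OS-level crash log, and second the core dump of the Java processes;
3) I contacted the system operations engineer again, determined to persuade him to find those two logs;
4) I succeeded in persuading him, and he began helping to retrieve them;
5) Good news arrived: the engineer obtained the system crash log, which revealed the following fatal error:
The file vmcore-dmesg.txt contains the following error (a frightening kernel BUG):
It is a kernel-level bug in CentOS 7. My Linux kernel version is as follows:
The official CentOS 7 site describes it as follows:
Contention on a page table entry in memory triggered the kernel crash.
6) Searching online, I found that others had hit a similar problem on CentOS 7 (Linux version 3.10.0-123.el7.x86_64): "RH5885 V3 CentOS7.0 (Redhat7.0): kernel problem causes the system to reboot automatically".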
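With that, the problem that had troubled us for so long was finally solved. The fix, without question, is to upgrade the CentOS 7 kernel.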
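In hindsight, the check for crash logs is something I could have scripted myself. The sketch below assumes kdump writes to /var/crash, which is the usual CentOS 7 default but is configurable, and the function names are made up. It reports the running kernel, whether a crash kernel is loaded at all, and any "kernel BUG at" lines found in vmcore-dmesg.txt files.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; /var/crash is an assumed (default) kdump target.

# Succeed when a crash kernel is loaded, i.e. kdump is able to capture a vmcore.
kdump_armed() {
    [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = "1" ]
}

# Print every "kernel BUG at" line from vmcore-dmesg.txt files under a
# crash directory, prefixed with the file it came from.
find_kernel_bugs() {
    local crash_dir=${1:-/var/crash}
    [ -d "${crash_dir}" ] || return 0
    find "${crash_dir}" -type f -name 'vmcore-dmesg.txt' \
        -exec grep -H 'kernel BUG at' {} + 2>/dev/null || true
}

echo "running kernel: $(uname -r)"
if kdump_armed; then
    echo "kdump: crash kernel loaded"
else
    echo "kdump: no crash kernel loaded, a crash would leave no vmcore"
fi
find_kernel_bugs /var/crash
```

Running this on each node right after an unexpected reboot gives a quick yes or no on whether a crash dump exists before anyone starts chasing application-level causes.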