Notes on a gdb debugging session

程序员文章站 2022-04-15 09:06:12

Problem description

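Today a program I was working on crashed at run time with a segmentation fault, and the crash was only reported after the program had returned from main().
The only output was: Segmentation fault (core dumped)
I had no idea where to start.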

Generating a core file

By default the system does not generate a core file. Running ulimit -c unlimited removes the size limit on core files for the current shell session.
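The same limit can also be raised from inside the program, which helps when the process is not started from a shell you control. The following is only a minimal sketch, assuming a POSIX system; enable_core_dumps is a hypothetical helper name, not something from the melon code base. It raises the soft limit to whatever the hard limit allows, so it still produces no core file if the hard limit is 0.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE (POSIX)

// Hypothetical helper: raise the soft core-file size limit of this process
// to its hard limit. Returns true if both system calls succeeded.
bool enable_core_dumps() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    // An unprivileged process may raise the soft limit up to the hard limit,
    // but not beyond it.
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Calling it once at the top of main() would be the natural place, before anything that might crash.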

Inspecting the core file with gdb

gdb ./example/sudoku_batch_test core
gdb prints the following:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___libc_free (mem=0x313030303030300a) at malloc.c:2951
2951    malloc.c: No such file or directory.
(gdb)

So the crash happened in malloc.c, inside glibc's free(), but gdb reports that it cannot find the source file malloc.c.

First, install the debug symbols for glibc:
sudo apt-get install libc6-dbg

Next, fetch the glibc source:
sudo apt-get source libc6-dev
When this finishes, the current directory contains a new glibc-2.23 folder holding the glibc source code.

With the source in place, go back to the gdb prompt and enter directory glibc-2.23/malloc/ to add glibc-2.23/malloc/ to gdb's source search path. The session now looks like this:

warning: core file may not match specified executable file.
[New LWP 24491]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./example/sudoku_batch_test ../example/test1000 127.0.0.1 1'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___libc_free (mem=0x313030303030300a) at malloc.c:2951
2951    malloc.c: No such file or directory.
(gdb) directory glibc-2.23/malloc/
Source directories searched: /root/work/melon/build/glibc-2.23/malloc:$cdir:$cwd
(gdb)

Now gdb can show the source at the crash site. Run list:

(gdb) l
warning: Source file is more recent than executable.
2946      if (mem == 0)                              /* free(0) has no effect */
2947        return;
2948    
2949      p = mem2chunk (mem);
2950    
2951      if (chunk_is_mmapped (p))                       /* release mmapped memory. */
2952        {
2953          /* see if the dynamic brk/mmap threshold needs adjusting */
2954          if (!mp_.no_dyn_threshold
2955              && p->size > mp_.mmap_threshold
(gdb)
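One detail in this output is worth a closer look. mem2chunk on line 2949 is only pointer arithmetic, so chunk_is_mmapped on line 2951 is the first place where free() actually reads through the pointer, and the mem value 0x313030303030300a from frame #0 does not look like a heap address at all. The sketch below lays that value out as the bytes it would occupy in memory; it assumes a little-endian machine (the x86_64 library paths above suggest one), and bytes_in_memory_order is a hypothetical helper written for this note.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: return the 8 bytes a 64-bit value occupies in memory
// on a little-endian machine, lowest-order byte first.
std::string bytes_in_memory_order(std::uint64_t value) {
    std::string out;
    for (int i = 0; i < 8; ++i) {
        // Shift the i-th byte down to the low 8 bits and keep only those bits.
        out.push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
    }
    return out;
}
```

For 0x313030303030300a the result is a newline followed by the characters 0000001, in other words printable text rather than an address. That is a hint that the pointer was never set to a real allocation and simply holds whatever bytes were already in that memory.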

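We now know the crash is at line 2951, but that alone does not say where the bad pointer came from. It occurred to me that the call stack might say more.
So I ran backtrace (or bt):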

(gdb) bt
#0  __GI___libc_free (mem=0x313030303030300a) at malloc.c:2951
#1  0x000000000048bc9d in melon::coroutine::~coroutine (this=0x1fc9120, __in_chrg=<optimized out>)
    at /root/work/melon/src/coroutine.cpp:56
#2  0x000000000048d099 in std::_Sp_counted_ptr<melon::coroutine*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (
    this=0x1fc8190) at /usr/include/c++/5/bits/shared_ptr_base.h:374
#3  0x00000000004630f1 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x1fc8190)
    at /usr/include/c++/5/bits/shared_ptr_base.h:150
#4  0x0000000000461f32 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f07f4ff1770, 
    __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#5  0x00000000004749ed in std::__shared_ptr<melon::coroutine, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
    this=0x7f07f4ff1768, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:925
#6  0x0000000000474a39 in std::shared_ptr<melon::coroutine>::~shared_ptr (this=0x7f07f4ff1768, 
    __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr.h:93
#7  0x00007f07f40915ff in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:155
#8  0x00007f07f4090f27 in __run_exit_handlers (status=0, listp=0x7f07f441b5f8 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true) at exit.c:40
#9  0x00007f07f4091045 in __GI_exit (status=<optimized out>) at exit.c:104
#10 0x00007f07f4077837 in __libc_start_main (main=0x45f1c4 <main(int, char**)>, argc=4, argv=0x7ffcfb2ab218, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffcfb2ab208)
    at ../csu/libc-start.c:325
#11 0x000000000045ec89 in _start ()

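That pins it down. When a thread ends or the program exits, __GI___call_tls_dtors is called to destroy thread-local objects, and I had indeed declared a coroutine::ptr variable with the thread_local keyword. Frames #2 to #6 are the shared_ptr releasing that object.
Frame #1 (0x000000000048bc9d in melon::coroutine::~coroutine) shows that the crash comes from a free() call in the destructor of the melon::coroutine class.
At this point the cause was fairly clear. The coroutine destructor frees the stack_ pointer: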

 53 coroutine::~coroutine() {
 54     log_debug << "destroy coroutine:" << name_;
 55     if (stack_) {
 56         free(stack_);
 57     }
 58 }
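The teardown order seen in the backtrace, where a thread_local shared_ptr is destroyed as its thread exits, can be reproduced in isolation. This is only a sketch under assumptions: Tracker, g_destroyed, t_current and destroyed_after_worker_exit are hypothetical names invented for the demonstration, and it uses a worker thread rather than the main thread so that the effect can be observed before the process ends. Depending on the toolchain it may need to be built with -pthread.

```cpp
#include <atomic>
#include <memory>
#include <thread>

// Counts destructor calls so the effect is observable from another thread.
std::atomic<int> g_destroyed{0};

struct Tracker {
    ~Tracker() { g_destroyed.fetch_add(1); }
};

// Each thread gets its own copy, destroyed when that thread exits.
thread_local std::shared_ptr<Tracker> t_current;

// Starts a worker that fills in its own thread-local pointer, waits for the
// worker to finish, and returns how many Tracker objects were destroyed.
int destroyed_after_worker_exit() {
    std::thread worker([] { t_current = std::make_shared<Tracker>(); });
    worker.join();  // by now the worker's thread_local destructors have run
    return g_destroyed.load();
}
```

Nothing in the worker's code destroys the Tracker explicitly; the runtime does it during thread exit. For the main thread the same thing happens inside exit(), which is why the crash in this post only showed up after main() had returned.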

The class has two constructors. One of them looks like this:

coroutine::coroutine()
    :c_id_(++t_coroutine_id),
    name_("main-" + std::to_string(c_id_)),
    cb_(nullptr),
    state_(coroutinestate::init) {
    // note: stack_ does not appear in the initializer list above

    if (getcontext(&context_)) {
        log_error << "getcontext: errno=" << errno
                << " error string:" << strerror(errno);
    }
}

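Through carelessness I had made a very basic mistake: this constructor never initialized the stack_ pointer, so the destructor passed whatever garbage it contained to free() (the mem=0x313030303030300a in frame #0). After initializing stack_ to nullptr, the problem went away.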
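One way to make this kind of omission harder to repeat is a default member initializer, so that every constructor starts from nullptr unless it sets the pointer itself. The sketch below is not the melon code: DemoCoroutine is a hypothetical stand-in that models only the stack handling, and it assumes the default-constructed "main" coroutine owns no heap stack.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the class in this post; only stack_ is modelled.
class DemoCoroutine {
public:
    // "Main" coroutine: assumed to own no heap buffer, so stack_ stays null.
    DemoCoroutine() = default;

    // Worker coroutine: owns a heap-allocated stack of the given size.
    explicit DemoCoroutine(std::size_t stack_size)
        : stack_(std::malloc(stack_size)) {}

    // Copying would free the same buffer twice, so it is disabled.
    DemoCoroutine(const DemoCoroutine&) = delete;
    DemoCoroutine& operator=(const DemoCoroutine&) = delete;

    ~DemoCoroutine() {
        std::free(stack_);  // free(nullptr) is a no-op
    }

    const void* stack() const { return stack_; }

private:
    // Default member initializer: used by any constructor that does not
    // initialize stack_ explicitly.
    void* stack_ = nullptr;
};
```

With this layout a default-constructed object reports a null stack() and its destructor is harmless, which matches the free(0) early return visible at line 2946 of malloc.c above.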

Summary

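For this kind of problem, opening the core file in gdb will usually locate the crash site. If the crash site is not where the bug actually is, the call stack will usually lead to the cause.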
