谈谈 ANR 之 Service 超时

程序员文章站 2022-05-03 22:59:14

不管是系统开发还是应用开发，ANR 一直是挥之不去的存在！本文从 ANR 的触发机制以及一个简单的范例来探讨 ANR 的原理和处理方法！ ......

1. 核心源码

关键类	路径（/frameworks/base/）
activeservices.java	services/core/java/com/android/server/am/activeservices.java
activitymanagerservice.java	services/core/java/com/android/server/am/activitymanagerservice.java
apperrors.java	services/core/java/com/android/server/am/apperrors.java

2. anr 基础认知

2.1 anr 是什么？

anr(application not responding)，应用程序无响应，简单一个定义，却涵盖了很多 android 系统的设计思想。

首先，anr 属于应用程序的范畴，这不同于 snr(system not respoding)，snr 反映的问题是系统进程(system_server)失去了响应能力，而 anr 明确将问题圈定在应用程序。snr 由 watchdog 机制保证，anr 由消息处理机制保证，android 在系统层实现了一套精密的机制来发现 anr，核心原理是 消息调度和超时处理 。

其次，anr 机制 主体实现在系统层。所有与 anr 相关的消息，都会经过系统进程(system_server)调度，然后派发到应用进程完成对消息的实际处理，同时，系统进程设计了不同的超时限制来跟踪消息的处理。一旦应用程序处理消息不当，超时限制就起作用了，它收集一些系统状态，例如：cpu/io使用情况、进程函数调用栈，并且报告用户有进程 无响应了（anr 对话框）。

然后，anr 问题 本质是一个性能问题。anr 机制实际上对应用程序主线程的限制，要求主线程在限定的时间内处理完一些最常见的操作(启动服务、处理广播、处理输入)，如果处理超时，则认为主线程已经失去了响应其他操作的能力。主线程中的 耗时操作，例如：密集cpu运算、大量io、复杂界面布局等，都会降低应用程序的响应能力。

最后，部分 anr 问题是很难分析的，有时候由于系统底层的一些影响，导致消息调度失败，出现问题的场景又难以复现。这类 anr 问题往往需要花费大量的时间去了解系统的一些行为，超出了 anr 机制本身的范畴。

2.2 anr 机制

分析一些初级的 anr 问题，只需要简单理解最终输出的日志即可，但对于一些由系统问题(例如：cpu 负载过高、进程卡死)引发的 anr，就需要对整个 anr 机制有所了解，才能定位出问题的原因。

anr 机制可以分为两部分：

✎ anr的监测：android 对于不同的 anr 类型(broadcast，service，inputevent)都有一套监测机制。

✎ anr的报告：在监测到 anr 以后，需要显示 anr 对话框、输出日志(发生 anr 时的进程函数调用栈、cpu 使用情况等)。

2.3 anr 的触发原因

前面我们说过，出现 anr 之后一个直观现象就是系统会展示出一个 anr 对话框。

谷歌文档中对 anr 产生的原因是这么描述的：

android 系统中的应用被 activitymanagerservice 及 windowmanagerservice 两个系统服务监控着，系统会在如下两种情况展示出 anr 的对话框！

      ✎ keydispatchtimeout ( 5 seconds ) ：按键或触摸事件在特定时间内无响应。
      ✎ broadcasttimeout ( 10 seconds )：broadcastreceiver 在特定时间内无法处理完成。
      ✎ servicetimeout ( 20 seconds ) ：service 在特定的时间内无法处理完成。

3. service 超时监测机制

service 运行在应用程序的主线程，如果 service 的执行时间超过 20 秒，则会引发 anr。

当发生 service anr 时，一般可以先排查一下在 service 的生命周期函数中有没有做 耗时的操作，例如复杂的运算、io 操作等。如果应用程序的代码逻辑查不出问题，就需要深入检查当前系统的状态：cpu 的使用情况、系统服务的状态等，判断当时发生 anr 进程是否受到 系统运行异常 的影响。

那么，系统是如何检测 service 超时的呢？android 是通过设置定时消息实现的。定时消息是由 ams 的消息队列处理的，ams 有 service 运行的上下文信息，所以在 ams 中设置一套超时检测机制也是合情合理的。

service anr 机制相对最为简单，主体实现在activeservices中。

在 service 的启动流程中，service 进程 attach 到 system_server 进程后会调用 realstartservicelocked() 方法。

3.1 realstartservicelocked

// frameworks/base/services/core/java/com/android/server/am/activeservices.java

public final class activeservices {

    private final void realstartservicelocked(servicerecord r,
            processrecord app, boolean execinfg) throws remoteexception {

        // 发送 delay 消息（service_timeout_msg）
        bumpserviceexecutinglocked(r, execinfg, "create");

        boolean created = false;
        try {
            
            // 最终执行服务的 oncreate() 方法
            app.thread.schedulecreateservice(r, r.serviceinfo, mam.
                compatibilityinfoforpackagelocked(r.serviceinfo.applicationinfo),
                app.repprocstate);

            ... ...
        }
    }
    
}

3.2 bumpserviceexecutinglocked

private final void bumpserviceexecutinglocked(...) {

    scheduleservicetimeoutlocked(r.app);
    
}

3.3 scheduleservicetimeoutlocked

void scheduleservicetimeoutlocked(processrecord proc) {
    if (proc.executingservices.size() == 0 || proc.thread == null) {
        return;
    }
    message msg = mam.mhandler.obtainmessage(
            activitymanagerservice.service_timeout_msg);
    msg.obj = proc;
    // 当超时后仍没有 remove 该 service_timeout_msg 消息，
    // 通过 ams.mainhandler 抛出一个定时消息。
    mam.mhandler.sendmessagedelayed(msg,
            proc.execservicesfg ? service_timeout : service_background_timeout);
}

上述方法通过 ams.mainhandler 抛出一个定时消息 service_timeout_msg。

3.4 servicedoneexecutinglocked

前台进程中执行 service，超时时间是 service_timeout（20 秒）

    // how long we wait for a service to finish executing.
    static final int service_timeout = 20*1000;

后台进程中执行 service，超时时间是 service_background_timeout（200 秒）

    // how long we wait for a service to finish executing.
    static final int service_background_timeout = service_timeout * 10;

当 service 的生命周期结束时（不会 anr），会调用 servicedoneexecutinglocked() 方法，之前抛出的 service_timeout_msg 消息在这个方法中会被清除。

void servicedoneexecutinglocked(servicerecord r, int type, int startid, int res) {
    boolean indestroying = mdestroyingservices.contains(r);
    if (r != null) {
        ... ...     
        servicedoneexecutinglocked(r, indestroying, indestroying);
    }
}

private void servicedoneexecutinglocked(servicerecord r, boolean indestroying,
          boolean finishing) {  
    ... ...
    if (r.executenesting <= 0) {
        if (r.app != null) {    
            ... ...
            // 当前服务所在进程中没有正在执行的service，清除 service_timeout_msg 消息
            if (r.app.executingservices.size() == 0) {
                mam.mhandler.removemessages(
                             activitymanagerservice.service_timeout_msg, r.app);            
                ... ...
            }
    ... ...
}

3.5 handlemessage

如果没有 remove 掉 service_timeout_msg 呢？接下来我们看看对于 anr 的处理逻辑。

在 system_server 进程中有一个 handler 线程，名叫 activitymanager。

如果在超时时间内，service_timeout_msg 没有被清除，便会向该 handler 线程发送一条信息 service_timeout_msg。

final class mainhandler extends handler {
    ... ...

    @override
    public void handlemessage(message msg) {
        switch (msg.what) {
            ... ...
            
            case service_timeout_msg: {
                mservices.servicetimeout((processrecord)msg.obj);
            } break;
        ... ...
    }
}

3.6 servicetimeout

void servicetimeout(processrecord proc) {
    string anrmessage = null;

    synchronized(mam) {
        ... ...
        
        long nexttime = 0;

        // 寻找运行超时的 service
        for (int i = proc.executingservices.size() - 1; i >= 0; i--) {
            servicerecord sr = proc.executingservices.valueat(i);
            if (sr.executingstart < maxtime) {
                timeout = sr;
                break;
            }
            if (sr.executingstart > nexttime) {
                nexttime = sr.executingstart;
            }
        }

        // 判断执行 service 超时的进程是否在最近运行进程列表，如果不在，则忽略这个 anr
        if (timeout != null && mam.mlruprocesses.contains(proc)) {
            slog.w(tag, "timeout executing service: " + timeout);
            stringwriter sw = new stringwriter();
            printwriter pw = new fastprintwriter(sw, false, 1024);
            pw.println(timeout);
            timeout.dump(pw, "    ");
            pw.close();
            mlastanrdump = sw.tostring();
            mam.mhandler.removecallbacks(mlastanrdumpclearer);
            mam.mhandler.postdelayed(mlastanrdumpclearer, 
                                           last_anr_lifetime_duration_msecs);
            anrmessage = "executing service " + timeout.shortname;
        ... ...
    }

    if (anrmessage != null) {
        // 当存在 timeout 的 service，则执行 appnotresponding
        mam.mapperrors.appnotresponding(proc, null, null, false, anrmessage);
    }
}

上述方法会找到当前进程已经超时的 service，经过一些判定后，决定要报告 anr，最终调用 ams.appnotresponding() 方法。

走到这一步，anr 机制已经完成了监测报告任务，剩下的任务就是 anr 结果的输出，我们称之为 anr 的报告机制。anr 的报告机制是通过 ams.appnotresponding() 完成的，broadcast 和 inputevent 类型的 anr 最终也都会调用这个方法。

4. anr 信息收集过程

接下来我们看看 android anr 的信息收集过程！

4.1 appnotresponding

// frameworks/base/services/core/java/com/android/server/am/apperrors.java

class apperrors {

    final void appnotresponding(processrecord app, activityrecord activity,
            activityrecord parent, boolean abovesystem, final string annotation) {
        ... ...

        long anrtime = systemclock.uptimemillis();
        if (activitymanagerservice.monitor_cpu_usage) {
            mservice.updatecpustatsnow();   // 更新 cpu 统计信息
        }

        boolean showbackground = settings.secure.
                getint(mcontext.getcontentresolver(),
                           settings.secure.anr_show_background, 0) != 0;

        boolean issilentanr;

        synchronized (mservice) {
            if (mservice.mshuttingdown) {
                return;
            } else if (app.notresponding) {
                return;
            } else if (app.crashing) {
                return;
            } else if (app.killedbyam) {
                return;
            } else if (app.killed) {
                return;
            }

            // in case we come through here for the same app before completing
            // this one, mark as anring now so we will bail out.
            app.notresponding = true;

            // 记录 anr 到 eventlog
            eventlog.writeevent(eventlogtags.am_anr, app.userid, app.pid,
                    app.processname, app.info.flags, annotation);

            // 将当前进程添加到 firstpids
            firstpids.add(app.pid);

            // don't dump other pids if it's a background anr
            issilentanr = !showbackground 
                                  && !isinterestingforbackgroundtraces(app);
            if (!issilentanr) {
                int parentpid = app.pid;
                if (parent != null && parent.app != null && parent.app.pid > 0) {
                    parentpid = parent.app.pid;
                }
                if (parentpid != app.pid) firstpids.add(parentpid);

                // 将 system_server 进程添加到 firstpids
                if (my_pid != app.pid 
                                && my_pid != parentpid) firstpids.add(my_pid);

                for (int i = mservice.mlruprocesses.size() - 1; i >= 0; i--) {
                    processrecord r = mservice.mlruprocesses.get(i);
                    if (r != null && r.thread != null) {
                        int pid = r.pid;
                        if (pid > 0 && pid != app.pid 
                                       && pid != parentpid && pid != my_pid) {
                            if (r.persistent) {
                                // 将 persistent 进程添加到 firstpids
                                firstpids.add(pid);
                            } else if (r.treatlikeactivity) {
                                firstpids.add(pid);
                            } else {
                                // 其他进程添加到 lastpids
                                lastpids.put(pid, boolean.true);
                            }
                        }
                    }
                }
            }
        }

        // 记录 anr 输出到 main log
        stringbuilder info = new stringbuilder();
        info.setlength(0);
        info.append("anr in ").append(app.processname);
        if (activity != null && activity.shortcomponentname != null) {
            info.append(" (").append(activity.shortcomponentname).append(")");
        }
        info.append("\n");
        info.append("pid: ").append(app.pid).append("\n");
        if (annotation != null) {
            info.append("reason: ").append(annotation).append("\n");
        }
        if (parent != null && parent != activity) {
            info.append("parent: ").append(parent.shortcomponentname).append("\n");
        }

        // 创建 cpu tracker 对象 
        processcputracker processcputracker = new processcputracker(true);

        ... ...

        // 输出 traces 信息
        file tracesfile = activitymanagerservice.dumpstacktraces(
                true, firstpids,
                (issilentanr) ? null : processcputracker,
                (issilentanr) ? null : lastpids,
                nativepids);

        string cpuinfo = null;
        if (activitymanagerservice.monitor_cpu_usage) {
            mservice.updatecpustatsnow();
            synchronized (mservice.mprocesscputracker) {
                cpuinfo = mservice.mprocesscputracker.printcurrentstate(anrtime);
            }
            // 记录当前 cpu 负载情况
            info.append(processcputracker.printcurrentload());
            info.append(cpuinfo);
        }

        // 记录从 anr 时间开始的 cpu 使用情况
        info.append(processcputracker.printcurrentstate(anrtime));

        // 输出当前 anr 的 reason，以及 cpu 使用率、负载信息
        slog.e(tag, info.tostring());
        if (tracesfile == null) {
            process.sendsignal(app.pid, process.signal_quit);
        }
        ... ...
                        
        // 将 traces 文件和 cpu 使用率信息保存到 dropbox，即 data/system/dropbox 目录
        mservice.adderrortodropbox("anr", app, app.processname,
                          activity, parent, annotation, cpuinfo, tracesfile, null);
        ... ...

        synchronized (mservice) {
            mservice.mbatterystatsservice.noteprocessanr(app.processname, app.uid);

            // 后台 anr 的情况, 直接杀掉
            if (issilentanr) {
                app.kill("bg anr", true);
                return;
            }

            // 设置 app 的 anr 状态，病查询错误报告 receiver
            makeappnotrespondinglocked(app,
                    activity != null ? activity.shortcomponentname : null,
                    annotation != null ? "anr " + annotation : "anr",
                    info.tostring());

            // 弹出 anr 对话框
            message msg = message.obtain();
            msg.what = activitymanagerservice.show_not_responding_ui_msg;
            msg.obj = new appnotrespondingdialog.data(app, activity, abovesystem);

            // 向 ui 线程发送，内容为 show_not_responding_msg 的消息
            mservice.muihandler.sendmessage(msg);
        }
    }
    
}

当发生 anr 时, 会按顺序依次执行：

       ✒ 1、输出 anr reason 信息到 eventlog，也就是说 anr 触发的时间点最接近的就是 eventlog 中输出的 am_anr 信息;
       ✒ 2、收集并输出重要进程列表中的各个线程的 traces 信息，该方法较耗时；
       ✒ 3、输出当前各个进程的 cpu 使用情况以及 cpu 负载情况；
       ✒ 4、将 traces 文件和 cpu 使用情况信息保存到 dropbox，即 data/system/dropbox 目录；
       ✒ 5、根据进程类型，来决定直接后台杀掉，还是弹框告知用户。

anr输出重要进程的traces信息，这些进程包含：

       ✒ 1、firstpids 队列：第一个是 anr 进程，第二个是 system_server，剩余是所有 persistent 进程；
       ✒ 2、native 队列：是指 /system/bin/ 目录的 mediaserver、sdcard 以及 surfaceflinger 进程；
       ✒ 3、lastpids 队列: 是指 mlruprocesses 中的不属于 firstpids 的所有进程。

4.2 dumpstacktraces

继续看看 dump 出 trace 信息的流程：

// activitymanagerservice.java

    public static file dumpstacktraces(boolean cleartraces, ... ,nativepids) {
        ... ...

        if (tracesdirprop.isempty()) {
            // 默认为 data/anr/traces.txt
            string globaltracespath = 
                          systemproperties.get("dalvik.vm.stack-trace-file", null);

            tracesfile = new file(globaltracespath);
            try {
                if (cleartraces && tracesfile.exists()) {
                    tracesfile.delete();      // 删除已存在的 traces 文件
                }

                // 这里会保证 data/anr/traces.txt 文件内容是全新的方式，而非追加
                tracesfile.createnewfile();   // 创建 traces 文件
                fileutils.setpermissions(globaltracespath, 0666, -1, -1);
            } catch (ioexception e) {
                slog.w(tag, "unable to prepare anr traces file: " + tracesfile, e);
                return null;
            }
        } else {
        }

        // 输出 trace 内容
        dumpstacktraces(tracesfile.getabsolutepath(), firstpids, nativepids,
                                         extrapids, usetombstonedforjavatraces);
        return tracesfile;
    }

4.3 dumpstacktraces

// activitymanagerservice.java

    private static void dumpstacktraces(string tracesfile, ...) {

        final dumpstackfileobserver observer;
        if (usetombstonedforjavatraces) {
            observer = null;
        } else {
            observer = new dumpstackfileobserver(tracesfile);
        }

        // we must complete all stack dumps within 20 seconds.
        long remainingtime = 20 * 1000;
        try {
            if (observer != null) {
                observer.startwatching();
            }

            // 首先，获取 firstpids 进程的 stacks
            if (firstpids != null) {
                int num = firstpids.size();
                for (int i = 0; i < num; i++) {
                    final long timetaken;
                    if (usetombstonedforjavatraces) {
                        timetaken = dumpjavatracestombstoned(firstpids.get(i),
                                                   tracesfile, remainingtime);
                    } else {
                        timetaken = observer.dumpwithtimeout(firstpids.get(i),
                                                               remainingtime);
                    }
                    ... ... 
                }
            }

            // 下一步，获取 native 进程的 stacks
            if (nativepids != null) {
                for (int pid : nativepids) {
                    ... ...
                    
                    // 输出 native 进程的 trace
                    debug.dumpnativebacktracetofiletimeout(
                            pid, tracesfile, (int) (nativedumptimeoutms / 1000));
                            
                    final long timetaken = systemclock.elapsedrealtime() - start;
                    ... ...
                }
            }

            // lastly, dump stacks for all extra pids from the cpu tracker.
            if (extrapids != null) {
                ... ...
                }
            }
        } finally {
            if (observer != null) {
                observer.stopwatching();
            }
        }
    }

4.4 小结

触发 anr 时系统会输出关键信息：

       ✒ 1、将 am_anr 信息，输出到 eventlog；
       ✒ 2、获取重要进程 trace 信息，保存到 /data/anr/traces.txt；
       ✒ 3、anr reason 以及 cpu 使用情况信息，输出到 main log;
       ✒ 4、再将 cpu使用情况 和进程 trace 文件信息，再保存到 /data/system/dropbox。

5. 总结

当 service 出现 anr 时，最终调用到 ams.appnotresponding()方法。

✒ 1、对于前台服务，则超时为 service_timeout = 20s；

✒ 2、对于后台服务，则超时为 service_background_timeout = 200s；

✒ 3、service 超时检测机制：超过一定时间没有执行完相应操作来触发延时消息，则会触发 anr;

上一篇： vue项目部署上线

下一篇： Bootstrap源码解读导航条（7）