Reposted

Debugging Series 6: The Software Watchdog

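1. Overview

Android has a hardware watchdog that periodically checks whether key hardware is working properly. Similarly, the framework layer has a software Watchdog that periodically checks whether critical system services have deadlocked. The software Watchdog does two things:

  • Listens for the reboot broadcast
  • Monitors the critical system services registered in mMonitors for deadlocks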

2. Startup Flow

2.1 startOtherServices

[-> SystemServer.java]

    private void startOtherServices() {
        ...
        // Create the Watchdog instance [see section 2.2]
        final Watchdog watchdog = Watchdog.getInstance();
        // Register the reboot broadcast receiver [see section 2.3]
        watchdog.init(context, mActivityManagerService);
        ...
        mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY);
        ...
        mActivityManagerService.systemReady(new Runnable() {
            @Override
            public void run() {
                mSystemServiceManager.startBootPhase(
                        SystemService.PHASE_ACTIVITY_MANAGER_READY);
                ...
                // Start the Watchdog thread [see section 3.1]
                Watchdog.getInstance().start();
                mSystemServiceManager.startBootPhase(
                        SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
            }
        });
    }

2.2 getInstance

[-> Watchdog.java]

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            // Singleton: create the instance on first use [see section 2.2.1]
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }

2.2.1 Creating the Watchdog

[-> Watchdog.java]

    public class Watchdog extends Thread {
        ...

        private Watchdog() {
            super("watchdog");
            // [see section 2.2.2]
            mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                    "foreground thread", DEFAULT_TIMEOUT);
            mHandlerCheckers.add(mMonitorChecker);
            mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                    "main thread", DEFAULT_TIMEOUT));
            mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                    "ui thread", DEFAULT_TIMEOUT));
            mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                    "i/o thread", DEFAULT_TIMEOUT));
            mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                    "display thread", DEFAULT_TIMEOUT));
            // [see section 2.2.3]
            addMonitor(new BinderThreadMonitor());
        }
    }

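Watchdog extends Thread, and the thread it creates is named "watchdog". mHandlerCheckers is the list that holds every HandlerChecker object.

The threads monitored by the Watchdog are:

    Thread name         Handler                     Meaning
    foreground thread   FgThread.getHandler         Foreground thread
    main thread         new Handler(MainLooper)     Main thread
    ui thread           UiThread.getHandler         UI thread
    i/o thread          IoThread.getHandler         I/O thread
    display thread      DisplayThread.getHandler    Display thread

DEFAULT_TIMEOUT is 60s by default and 10s when debugging, which makes it easier to uncover latent ANR problems.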
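To make the relationship between the two timeouts concrete, here is a minimal sketch. It assumes that the check interval is half of DEFAULT_TIMEOUT, which matches the 30s value noted in the run() listing in section 3.1. The class WatchdogTimeouts and its method names are hypothetical and exist only for illustration.

```java
// Illustrative only: WatchdogTimeouts is a hypothetical class, not part of AOSP.
final class WatchdogTimeouts {
    // Per-checker timeout in milliseconds: 10s when debugging, 60s otherwise.
    static long defaultTimeoutMillis(boolean debug) {
        return debug ? 10_000L : 60_000L;
    }

    // Assumption: the watchdog thread evaluates its checkers every half timeout.
    static long checkIntervalMillis(boolean debug) {
        return defaultTimeoutMillis(debug) / 2;
    }

    public static void main(String[] args) {
        System.out.println("normal: timeout=" + defaultTimeoutMillis(false)
                + " ms, check interval=" + checkIntervalMillis(false) + " ms");
        System.out.println("debug:  timeout=" + defaultTimeoutMillis(true)
                + " ms, check interval=" + checkIntervalMillis(true) + " ms");
    }
}
```

With these values a blocked thread is noticed at the first evaluation after half the timeout and declared overdue at the evaluation after the full timeout.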

2.2.2 HandlerChecker

[-> Watchdog.java]

    public final class HandlerChecker implements Runnable {
        ...
        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }
    }

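The mMonitors list of a HandlerChecker records the Monitor objects (services) that the Watchdog is currently monitoring.

2.2.3 Monitoring Binder Threads

Binder threads are monitored through addMonitor(new BinderThreadMonitor()).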

2.2.3.1 addMonitor

    public class Watchdog extends Thread {
        public void addMonitor(Monitor monitor) {
            synchronized (this) {
                if (isAlive()) {
                    throw new RuntimeException("Monitors can't be added once the Watchdog is running");
                }
                mMonitorChecker.addMonitor(monitor);
            }
        }

        public final class HandlerChecker implements Runnable {
            private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();

            public void addMonitor(Monitor monitor) {
                mMonitors.add(monitor);
            }
            ...
        }
    }

This adds the monitor to mMonitors, a member list of HandlerChecker.

2.2.3.2 BinderThreadMonitor

    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }

blockUntilThreadAvailable ultimately calls into IPCThreadState, where it waits until a binder thread is free:

    void IPCThreadState::blockUntilThreadAvailable() {
        pthread_mutex_lock(&mProcess->mThreadCountLock);
        while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
            // Wait until the number of executing binder threads drops below
            // the process's maximum number of binder threads (16)
            pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
        }
        pthread_mutex_unlock(&mProcess->mThreadCountLock);
    }

So addMonitor(new BinderThreadMonitor()) attaches the Binder thread check to the foreground thread's checker (mMonitorChecker), which then verifies that the Binder thread pool is working normally.
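For readers more at home in Java than in pthreads, the same wait-until-a-thread-is-free logic can be sketched with a ReentrantLock and a Condition. This is only an analogue of the native code: BinderPoolGate, beginTransaction, endTransaction, and demo are hypothetical names, and the real bookkeeping lives in the native Binder library.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical Java analogue of the native thread-count bookkeeping; not AOSP code.
final class BinderPoolGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition threadCountDecrement = lock.newCondition();
    private final int maxThreads;
    private int executingThreads; // guarded by lock

    BinderPoolGate(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A pool thread starts handling a transaction.
    void beginTransaction() {
        lock.lock();
        try {
            executingThreads++;
        } finally {
            lock.unlock();
        }
    }

    // A pool thread finishes a transaction and wakes up any waiters.
    void endTransaction() {
        lock.lock();
        try {
            executingThreads--;
            threadCountDecrement.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Returns only once fewer than maxThreads transactions are executing.
    void blockUntilThreadAvailable() throws InterruptedException {
        lock.lock();
        try {
            while (executingThreads >= maxThreads) {
                threadCountDecrement.await();
            }
        } finally {
            lock.unlock();
        }
    }

    // Returns {blocked while the pool is exhausted, blocked after a thread is released}.
    static boolean[] demo() {
        BinderPoolGate gate = new BinderPoolGate(1);
        gate.beginTransaction(); // the only pool thread is now busy
        Thread monitor = new Thread(() -> {
            try {
                gate.blockUntilThreadAvailable();
            } catch (InterruptedException ignored) {
                // The demo never interrupts this thread.
            }
        });
        monitor.setDaemon(true);
        monitor.start();
        boolean blockedBefore;
        boolean blockedAfter;
        try {
            monitor.join(200);
            blockedBefore = monitor.isAlive();
            gate.endTransaction(); // a pool thread frees up
            monitor.join(2000);
            blockedAfter = monitor.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new boolean[] { false, true };
        }
        return new boolean[] { blockedBefore, blockedAfter };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("blocked while pool is exhausted: " + result[0]);
        System.out.println("blocked after a thread is released: " + result[1]);
    }
}
```

If every pool thread stays busy, the monitor call never returns, and that is precisely the condition the Watchdog detects as a timeout on the foreground thread's checker.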

2.2.4 Monitor

    public class Watchdog extends Thread {
        public interface Monitor {
            void monitor();
        }
    }

Every system service that the Watchdog can monitor implements the Watchdog.Monitor interface. The classes that implement it are:

  • ActivityManagerService
  • PowerManagerService
  • WindowManagerService
  • InputManagerService
  • NetworkManagementService
  • MountService
  • NativeDaemonConnector
  • BinderThreadMonitor
  • MediaProjectionManagerService
  • MediaRouterService
  • MediaSessionService
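For most of these services, monitor() does nothing more than acquire and release the service's main lock; ActivityManagerService, for instance, uses an empty synchronized (this) block. If some other thread holds that lock indefinitely, monitor() never returns and the Watchdog notices. The sketch below models this idea in plain Java. DemoMonitor and DemoService are hypothetical stand-ins for Watchdog.Monitor and a real system service.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for Watchdog.Monitor; not AOSP code.
interface DemoMonitor {
    void monitor();
}

// Hypothetical service whose health check is "can I take my own lock?".
final class DemoService implements DemoMonitor {
    private final Object lock = new Object();

    // Simulates a service call that holds the main lock until release is counted down.
    void holdLockUntil(CountDownLatch acquired, CountDownLatch release) {
        synchronized (lock) {
            acquired.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    @Override
    public void monitor() {
        synchronized (lock) {
            // Nothing to do: acquiring the lock is the health check.
        }
    }

    // Returns true if m.monitor() completes within timeoutMillis.
    static boolean monitorReturnsWithin(DemoMonitor m, long timeoutMillis) {
        Thread checker = new Thread(m::monitor, "checker");
        checker.setDaemon(true);
        checker.start();
        try {
            checker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !checker.isAlive();
    }

    // Runs the health check while another thread holds the service lock.
    static boolean healthyWhileLockHeld(DemoService service, long timeoutMillis) {
        CountDownLatch acquired = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> service.holdLockUntil(acquired, release), "stuck-call");
        holder.setDaemon(true);
        holder.start();
        try {
            acquired.await();
            return monitorReturnsWithin(service, timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // let the simulated stuck call finish
        }
    }

    public static void main(String[] args) {
        DemoService service = new DemoService();
        System.out.println("lock free, healthy: " + monitorReturnsWithin(service, 1000));
        System.out.println("lock held, healthy: " + healthyWhileLockHeld(service, 200));
    }
}
```

The design is deliberately cheap: the check costs one uncontended lock acquisition when the service is healthy, and it needs no knowledge of what the service actually does.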

2.3 init

[-> Watchdog.java]

    public void init(Context context, ActivityManagerService activity) {
        mResolver = context.getContentResolver();
        mActivity = activity;
        // Register the reboot broadcast receiver [see section 2.3.1]
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

2.3.1 RebootRequestReceiver

[-> Watchdog.java]

    final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                // [see section 2.3.2]
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
            Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
        }
    }

2.3.2 rebootSystem

[-> Watchdog.java]

    void rebootSystem(String reason) {
        Slog.i(TAG, "Rebooting system because: " + reason);
        IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            // Perform the reboot through PowerManager
            pms.reboot(false, reason, false);
        } catch (RemoteException ex) {
        }
    }

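The reboot itself is ultimately carried out by PowerManagerService; the detailed reboot flow will be covered in a separate article.

2.4 Summary

This stage obtains the Watchdog instance and registers the reboot broadcast receiver:

  • mHandlerCheckers is the list of all HandlerChecker objects, covering the handlers of the foreground, main, ui, i/o, and display threads;
  • mMonitors records every Monitor the Watchdog is currently monitoring; at this point that is BinderThreadMonitor;
  • The reboot broadcast receiver is registered, and the reboot is ultimately performed by PowerManagerService.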

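3. Watchdog

The core of the Watchdog is run(): when the system has been hung for more than 1 minute, it dumps diagnostic information and then kills system_server, unless a debugger is attached or restarts are not allowed.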

3.1 run

    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                // timeout = 30s
                long timeout = CHECK_INTERVAL;
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // [see section 3.2]
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                long start = SystemClock.uptimeMillis();
                // Wait for 30s
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // Get the wait state [see section 3.3]
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // First time we see the half-timeout state
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        // so dump the stack traces [see section 3.4]
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }
                // Get the blocked checkers
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // If waitedHalf is true, append to the existing traces instead of
            // clearing them [see section 3.4]
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
            // The system has already been blocked for a minute, so waiting 2 more
            // seconds to make sure the stack traces are written does no harm
            SystemClock.sleep(2000);

            if (RECORD_KERNEL_THREADS) {
                // Dump the kernel stacks [see section 3.5]
                dumpKernelStackTraces();
            }

            // Ask the kernel to dump all blocked threads [see section 3.6]
            doSysRq('l');
            // Write the DropBox entry [see section 3.7]
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                // Give the dropbox thread 2s to do its work
                dropboxThread.join(2000);
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                // Report the blocked state to the activity controller
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // A return value of 1 means keep waiting, -1 means kill the system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        waitedHalf = false; // keep waiting
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process when no debugger is attached
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                // Print the stack trace of every blocked thread
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                // Kill the system_server process [see section 3.8]
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
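The inner while (timeout > 0) loop deserves a closer look: wait() can return early (a notify, an interrupt, or a spurious wake-up), so the remaining time is recomputed after every wake-up until the full interval has elapsed. Below is a minimal plain-Java sketch of that pattern. IntervalWait and its methods are hypothetical names, and System.nanoTime() is used as a stand-in for SystemClock.uptimeMillis(), which exists only on Android.

```java
// Hypothetical illustration of the watchdog's "wait for the full interval" loop.
final class IntervalWait {
    // Blocks for at least intervalMillis, re-waiting after every early wake-up.
    // Returns the elapsed time in milliseconds.
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        synchronized (lock) {
            long timeout = intervalMillis;
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // Like the original loop, tolerate the interruption and keep waiting.
                }
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
                timeout = intervalMillis - elapsedMillis;
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    // Wakes the waiter after roughly 50 ms and shows that it still waits about 300 ms.
    static long demoWithEarlyNotify() {
        final Object lock = new Object();
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                return;
            }
            synchronized (lock) {
                lock.notifyAll(); // early wake-up, well before the interval ends
            }
        });
        notifier.setDaemon(true);
        notifier.start();
        return waitFullInterval(lock, 300);
    }

    public static void main(String[] args) {
        System.out.println("waited " + demoWithEarlyNotify() + " ms despite an early notify");
    }
}
```

Without the recomputation, a single stray notify would make the watchdog evaluate its checkers too early and shorten the effective timeout.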

3.2 scheduleCheckLocked

    public final class HandlerChecker implements Runnable {
        ...
        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                return; // A check is already in progress, so there is no need to post again
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            // Post the message at the very front of the message queue [see section 3.2.1]
            mHandler.postAtFrontOfQueue(this);
        }
    }

postAtFrontOfQueue(this) takes a Runnable as its argument. Following the message mechanism, the target thread's Looper then calls back the run method of HandlerChecker.

3.2.1 HandlerChecker.run

    public final class HandlerChecker implements Runnable {
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                // Call back the monitor method of the concrete service
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

The method called back here is, for example, BinderThreadMonitor.monitor.
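The handshake between scheduleCheckLocked and HandlerChecker.run can be modeled outside Android as a rough sketch. Here a single-threaded ExecutorService stands in for the monitored thread's Handler (so there is no front-of-queue posting), and MiniChecker with all its methods is a hypothetical simplification: the checker posts itself to the monitored thread, and if the thread is stuck, the completed flag never flips back to true.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical plain-Java model of HandlerChecker; an ExecutorService replaces the Handler.
final class MiniChecker implements Runnable {
    private final ExecutorService handler;
    private boolean completed = true; // guarded by this

    MiniChecker(ExecutorService handler) {
        this.handler = handler;
    }

    // Posts this checker to the monitored thread unless a check is already pending.
    synchronized void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        handler.execute(this);
    }

    // Runs on the monitored thread; getting here proves the thread is responsive.
    @Override
    public void run() {
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    // Schedules a check, waits waitMillis, and reports whether the check ran.
    static boolean checkOnce(MiniChecker checker, long waitMillis) {
        checker.scheduleCheck();
        try {
            Thread.sleep(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return checker.isCompleted();
    }

    // Returns {result on an idle thread, result while the thread is blocked}.
    static boolean[] demo() {
        ExecutorService monitored = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "monitored");
            t.setDaemon(true);
            return t;
        });
        MiniChecker checker = new MiniChecker(monitored);
        boolean healthy = checkOnce(checker, 200);

        CountDownLatch unblock = new CountDownLatch(1);
        monitored.execute(() -> {
            try {
                unblock.await(); // simulate a task that hangs the thread
            } catch (InterruptedException ignored) {
                // The demo never interrupts this thread.
            }
        });
        boolean whileBlocked = checkOnce(checker, 200);

        unblock.countDown();
        monitored.shutdown();
        return new boolean[] { healthy, whileBlocked };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("idle thread completed the check: " + result[0]);
        System.out.println("blocked thread completed the check: " + result[1]);
    }
}
```

The real implementation posts at the front of the queue so that a long backlog of ordinary messages does not delay the check; this sketch simply queues the check behind whatever is pending.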

3.3 evaluateCheckerCompletionLocked

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            // [see section 3.3.1]
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

This returns the largest wait-state value among all checkers in the mHandlerCheckers list.

3.3.1 getCompletionStateLocked

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

  • COMPLETED = 0: the check has completed;
  • WAITING = 1: the wait has lasted less than half of DEFAULT_TIMEOUT, i.e. less than 30s;
  • WAITED_HALF = 2: the wait has lasted between 30s and 60s;
  • OVERDUE = 3: the wait has lasted 60s or longer.
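As a worked example of the thresholds, the standalone sketch below reproduces the state logic with the latency passed in as a parameter, plus the worst-state-wins rule of evaluateCheckerCompletionLocked. CompletionStates, stateFor, and worstState are hypothetical names chosen for this illustration; the constants are the values listed above.

```java
// Hypothetical standalone version of the completion-state logic.
final class CompletionStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Maps one checker's status to a state, given how long its check has been pending.
    static int stateFor(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // The overall state is the worst (largest) state of any checker.
    static int worstState(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        long waitMax = 60_000L; // DEFAULT_TIMEOUT in the non-debug case
        System.out.println("10s pending: " + stateFor(false, 10_000L, waitMax));
        System.out.println("30s pending: " + stateFor(false, 30_000L, waitMax));
        System.out.println("60s pending: " + stateFor(false, 60_000L, waitMax));
        System.out.println("overall: " + worstState(COMPLETED, WAITED_HALF, WAITING));
    }
}
```

Because the states are ordered by severity, a single stuck checker is enough to move the whole Watchdog into WAITED_HALF or OVERDUE, no matter how healthy the other threads are.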

3.4 AMS.dumpStackTraces

    public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
            ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
        // Defaults to /data/anr/traces.txt
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath == null || tracesPath.length() == 0) {
            return null;
        }

        File tracesFile = new File(tracesPath);
        try {
            // If clearTraces is set, delete any existing traces file
            if (clearTraces && tracesFile.exists()) tracesFile.delete();
            // Create the traces file
            tracesFile.createNewFile();
            // -rw-rw-rw-
            FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);
        } catch (IOException e) {
            return null;
        }
        // Write the trace content
        dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
        return tracesFile;
    }

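The details of how the trace content is produced are skipped here; the conclusions are:

  1. Call Process.sendSignal() to send SIGNAL_QUIT to the target processes;
  2. Call backtrace.dump_backtrace() for each of the three processes /system/bin/mediaserver, /system/bin/sdcard, and /system/bin/surfaceflinger to output their backtraces;
  3. Collect CPU usage statistics;
  4. Call Process.sendSignal() to send SIGNAL_QUIT to the other processes.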

3.5 dumpKernelStackTraces

    private File dumpKernelStackTraces() {
        // The path is /data/anr/traces.txt
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath == null || tracesPath.length() == 0) {
            return null;
        }

        native_dumpKernelStacks(tracesPath);
        return new File(tracesPath);
    }

native_dumpKernelStacks calls into android_server_Watchdog.dumpKernelStacks.

3.6 doSysRq

    private void doSysRq(char c) {
        try {
            FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
            sysrq_trigger.write(c);
            sysrq_trigger.close();
        } catch (IOException e) {
            Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
        }
    }

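Writing a character to the /proc/sysrq-trigger node makes the kernel run the corresponding SysRq command. The 'l' command used here writes a backtrace of every active CPU to the kernel log (dumping blocked tasks is the separate 'w' command).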
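The write itself is just one character sent to a file. The sketch below makes the target path a parameter so the logic can be tried against an ordinary temporary file; on a real device the target would be /proc/sysrq-trigger and the caller would need sufficient privileges. SysRqWriter, writeCommand, and demoOnTempFile are hypothetical names, and try-with-resources is used so the writer is closed even if the write fails.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that mirrors doSysRq with a configurable target path.
final class SysRqWriter {
    // Writes one command character to triggerPath; returns false if the write fails.
    static boolean writeCommand(String triggerPath, char command) {
        try (FileWriter trigger = new FileWriter(triggerPath)) {
            trigger.write(command);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Exercises writeCommand against a temporary file and returns what was written.
    static String demoOnTempFile() {
        try {
            Path tmp = Files.createTempFile("sysrq-trigger", ".txt");
            try {
                if (!writeCommand(tmp.toString(), 'l')) {
                    return null;
                }
                return new String(Files.readAllBytes(tmp), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("wrote: " + demoOnTempFile());
    }
}
```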

3.7 dropBox

DropBox was covered in detail in the DropBox source-code article. It writes files to /data/system/dropbox, for example system_app_crash.

3.8 killProcess

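Process.killProcess was covered in detail in the article on how process killing is implemented; it kills the target process by sending it signal 9.

Killing the system_server process causes the zygote process to kill itself, which in turn triggers init to restart Zygote. This is what shows up as a framework restart on the phone.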

4. Summary

If the Watchdog finds a check blocked for 1 minute, it outputs:

  1. AMS.dumpStackTraces
    • kill -3
    • backtrace.dump_backtrace()
  2. dumpKernelStackTraces, through android_server_Watchdog.dumpKernelStacks
  3. dropBox

Original article: http://gityuan.com/2016/06/21/watchdog/