Android系统中,有硬件WatchDog用于定时检测关键硬件是否正常工作,类似地,在framework层有一个软件WatchDog用于定期检测关键系统服务是否发生死锁事件。关于 软件WatchDog功能:
[-> SystemServer.java]
private void startOtherServices() { ... //创建watchdog【见小节2.2】 final Watchdog watchdog = Watchdog.getInstance(); //注册reboot广播【见小节2.3】 watchdog.init(context, mActivityManagerService); ... mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY); ... mActivityManagerService.systemReady(new Runnable() { @Override public void run() { mSystemServiceManager.startBootPhase( SystemService.PHASE_ACTIVITY_MANAGER_READY); ... // watchdog启动【见小节3.1】 Watchdog.getInstance().start(); mSystemServiceManager.startBootPhase( SystemService.PHASE_THIRD_PARTY_APPS_CAN_START); } } }
[-> Watchdog.java]
public static Watchdog getInstance() { if (sWatchdog == null) { //单例模式,创建实例对象【见小节2.2.1 】 sWatchdog = new Watchdog(); } return sWatchdog; }
[-> Watchdog.java]
public class Watchdog extends Thread { ... private Watchdog() { super("watchdog"); //【见小节2.2.2 】 mMonitorChecker = new HandlerChecker(FgThread.getHandler(), "foreground thread", DEFAULT_TIMEOUT); mHandlerCheckers.add(mMonitorChecker); mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()), "main thread", DEFAULT_TIMEOUT)); mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(), "ui thread", DEFAULT_TIMEOUT)); mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(), "i/o thread", DEFAULT_TIMEOUT)); mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(), "display thread", DEFAULT_TIMEOUT)); //【见小节2.2.3】 addMonitor(new BinderThreadMonitor()); } }
Watchdog继承于Thread,创建的线程名为”watchdog”。 mHandlerCheckers
是记录着所有的HandlerChecker对象的列表。
Watchdog监控的线程有:
线程名 | 对应handler | 含义 |
---|---|---|
foreground thread | FgThread.getHandler | 前台线程 |
main thread | new Handler(MainLooper) | 主线程 |
ui thread | UiThread.getHandler | UI线程 |
i/o thread | IoThread.getHandler | i/o线程 |
display thread | DisplayThread.getHandler | 显示线程 |
DEFAULT_TIMEOUT默认为60s,调试时为10s,方便找出潜在的ANR问题。
[-> Watchdog.java]
public final class HandlerChecker implements Runnable { … HandlerChecker(Handler handler, String name, long waitMaxMillis) { mHandler = handler; mName = name; mWaitMax = waitMaxMillis; mCompleted = true; } }
mMonitors
记录所有Watchdog目前正在监控的服务。
通过addMonitor(new BinderThreadMonitor())来监控Binder线程
public class Watchdog extends Thread { public void addMonitor(Monitor monitor) { synchronized (this) { if (isAlive()) { throw new RuntimeException("Monitors can't be added once the Watchdog is running"); } mMonitorChecker.addMonitor(monitor); } } public final class HandlerChecker implements Runnable { private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>(); public void addMonitor(Monitor monitor) { mMonitors.add(monitor); } ... } }
将monitor添加到HandlerChecker的成员变量 mMonitors
列表中。
private static final class BinderThreadMonitor implements Watchdog.Monitor { public void monitor() { Binder.blockUntilThreadAvailable(); } }
blockUntilThreadAvailable最终调用的是IPCThreadState,等待有空闲的binder线程
void IPCThreadState::blockUntilThreadAvailable() { pthread_mutex_lock(&mProcess->mThreadCountLock); while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) { //等待正在执行的binder线程小于进程最大binder线程上限(16个) pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock); } pthread_mutex_unlock(&mProcess->mThreadCountLock); }
可见addMonitor(new BinderThreadMonitor())是将Binder线程添加到前台线程的handler(mMonitorChecker)来检查是否工作正常。
public class Watchdog extends Thread { public interface Monitor { void monitor(); } }
能够被Watchdog监控的系统服务都实现了Watchdog.Monitor接口。 实现该接口类:
[-> Watchdog.java]
public void init(Context context, ActivityManagerService activity) { mResolver = context.getContentResolver(); mActivity = activity; //注册reboot广播接收者【见小节2.3.1】 context.registerReceiver(new RebootRequestReceiver(), new IntentFilter(Intent.ACTION_REBOOT), android.Manifest.permission.REBOOT, null); }
[-> Watchdog.java]
final class RebootRequestReceiver extends BroadcastReceiver { @Override public void onReceive(Context c, Intent intent) { if (intent.getIntExtra("nowait", 0) != 0) { //【见小节2.3.2】 rebootSystem("Received ACTION_REBOOT broadcast"); return; } Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent); } }
[-> Watchdog.java]
void rebootSystem(String reason) { Slog.i(TAG, "Rebooting system because: " + reason); IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE); try { //通过PowerManager执行reboot操作 pms.reboot(false, reason, false); } catch (RemoteException ex) { } }
最终是通过PowerManagerService来完成重启操作,具体的重启流程后续会单独讲述。
获取watchdog实例对象,并注册reboot广播
mHandlerCheckers
记录所有的HandlerChecker对象的列表,包括foreground, main, ui, i/o, display线程的handler; mMonitors
记录所有Watchdog目前正在监控Monitor,此处为BinderThreadMonitor; run(), 当系统hang时间超过1min,
public void run() { boolean waitedHalf = false; while (true) { final ArrayList<HandlerChecker> blockedCheckers; final String subject; final boolean allowRestart; int debuggerWasConnected = 0; synchronized (this) { //timeout=30s long timeout = CHECK_INTERVAL; for (int i=0; i<mHandlerCheckers.size(); i++) { HandlerChecker hc = mHandlerCheckers.get(i); //【见小节3.2】 hc.scheduleCheckLocked(); } if (debuggerWasConnected > 0) { debuggerWasConnected--; } long start = SystemClock.uptimeMillis(); //等待30s while (timeout > 0) { if (Debug.isDebuggerConnected()) { debuggerWasConnected = 2; } try { wait(timeout); } catch (InterruptedException e) { Log.wtf(TAG, e); } if (Debug.isDebuggerConnected()) { debuggerWasConnected = 2; } timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start); } //获取等待状态【见小节3.3】 final int waitState = evaluateCheckerCompletionLocked(); if (waitState == COMPLETED) { waitedHalf = false; continue; } else if (waitState == WAITING) { continue; } else if (waitState == WAITED_HALF) { if (!waitedHalf) { //第一次进入等待时间过半的状态 ArrayList<Integer> pids = new ArrayList<Integer>(); pids.add(Process.myPid()); //则输出栈信息【见小节3.4】 ActivityManagerService.dumpStackTraces(true, pids, null, null, NATIVE_STACKS_OF_INTEREST); waitedHalf = true; } continue; } //获取被阻塞的checkers blockedCheckers = getBlockedCheckersLocked(); subject = describeCheckersLocked(blockedCheckers); allowRestart = mAllowRestart; } EventLog.writeEvent(EventLogTags.WATCHDOG, subject); ArrayList<Integer> pids = new ArrayList<Integer>(); pids.add(Process.myPid()); if (mPhonePid > 0) pids.add(mPhonePid); //waitedHalf=true,则追加输出栈信息【见小节3.4】 final File stack = ActivityManagerService.dumpStackTraces( !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST); //系统已被阻塞1分钟,也不在乎多等待2s来确保stack trace信息输出 SystemClock.sleep(2000); if (RECORD_KERNEL_THREADS) { //输出kernel栈信息【见小节3.5】 dumpKernelStackTraces(); } //触发kernel来dump所有阻塞线程【见小节3.6】 doSysRq('l'); //输出dropbox信息【见小节3.7】 Thread dropboxThread = new Thread("watchdogWriteToDropbox") { public void run() { mActivity.addErrorToDropBox( "watchdog", null, "system_server", null, null, subject, null, stack, null); } }; dropboxThread.start(); try { //等待dropbox线程工作2s dropboxThread.join(2000); } catch (InterruptedException ignored) {} IActivityController controller; synchronized (this) { controller = mController; } if (controller != null) { //将阻塞状态报告给activity controller, try { Binder.setDumpDisabled("Service dumps disabled due to hung system process."); //返回值为1表示继续等待,-1表示杀死系统 int res = controller.systemNotResponding(subject); if (res >= 0) { waitedHalf = false; //继续等待 continue; } } catch (RemoteException e) { } } //当debugger没有attach时,才杀死进程 if (Debug.isDebuggerConnected()) { debuggerWasConnected = 2; } if (debuggerWasConnected >= 2) { Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process"); } else if (debuggerWasConnected > 0) { Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process"); } else if (!allowRestart) { Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process"); } else { Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject); //遍历输出阻塞线程的栈信息 for (int i=0; i<blockedCheckers.size(); i++) { Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:"); StackTraceElement[] stackTrace = blockedCheckers.get(i).getThread().getStackTrace(); for (StackTraceElement element: stackTrace) { Slog.w(TAG, " at " + element); } } Slog.w(TAG, "*** GOODBYE!"); //杀死进程system_server【见小节3.8】 Process.killProcess(Process.myPid()); System.exit(10); } waitedHalf = false; } }
public final class HandlerChecker implements Runnable { ... public void scheduleCheckLocked() { if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) { mCompleted = true; return; } if (!mCompleted) { return; //有一个check正在处理中,则无需重复发送 } mCompleted = false; mCurrentMonitor = null; mStartTime = SystemClock.uptimeMillis(); //发送消息,插入消息队列最开头【见3.2.1】 mHandler.postAtFrontOfQueue(this); } }
postAtFrontOfQueue(this),该方法输入参数为Runnable对象,根据消息机制,回调HandlerChecker中的run方法。
public final class HandlerChecker implements Runnable { public void run() { final int size = mMonitors.size(); for (int i = 0 ; i < size ; i++) { synchronized (Watchdog.this) { mCurrentMonitor = mMonitors.get(i); } //回调具体服务的monitor方法 mCurrentMonitor.monitor(); } synchronized (Watchdog.this) { mCompleted = true; mCurrentMonitor = null; } } }
回调的方法,例如BinderThreadMonitor.monitor
private int evaluateCheckerCompletionLocked() { int state = COMPLETED; for (int i=0; i<mHandlerCheckers.size(); i++) { HandlerChecker hc = mHandlerCheckers.get(i); //【见小节3.3.1】 state = Math.max(state, hc.getCompletionStateLocked()); } return state; }
获取mHandlerCheckers列表中等待状态值最大的state.
public int getCompletionStateLocked() { if (mCompleted) { return COMPLETED; } else { long latency = SystemClock.uptimeMillis() - mStartTime; if (latency < mWaitMax/2) { return WAITING; } else if (latency < mWaitMax) { return WAITED_HALF; } } return OVERDUE; }
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids, ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) { //默认为 data/anr/traces.txt String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null); if (tracesPath == null || tracesPath.length() == 0) { return null; } File tracesFile = new File(tracesPath); try { //当clearTraces,则删除已存在的traces文件 if (clearTraces && tracesFile.exists()) tracesFile.delete(); //创建traces文件 tracesFile.createNewFile(); // -rw-rw-rw- FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1); } catch (IOException e) { return null; } //输出trace内容 dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs); return tracesFile; }
关于trace内容,这里就不细说,直接说说结论:
/system/bin/mediaserver
, /system/bin/sdcard
, /system/bin/surfaceflinger
这3个进程的backtrace; private File dumpKernelStackTraces() { // 路径为data/anr/traces.txt String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null); if (tracesPath == null || tracesPath.length() == 0) { return null; } native_dumpKernelStacks(tracesPath); return new File(tracesPath); }
native_dumpKernelStacks调用到android_server_Watchdog.dumpKernelStacks
private void doSysRq(char c) { try { FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger"); sysrq_trigger.write(c); sysrq_trigger.close(); } catch (IOException e) { Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e); } }
通过向节点 /proc/sysrq-trigger
写入字符,触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log。
关于dropbox已在dropBox源码篇详细讲解过,输出文件到/data/system/dropbox,比如system_app_crash。
Process.killProcess已经在文章理解杀进程的实现原理已详细讲解,通过发送信号9给目标进程来完成杀进程的过程。
当杀死system_server进程,从而导致zygote进程自杀,进而触发init执行重启Zygote进程,这便出现了手机framework重启的现象。
watchdog在check过程中出现阻塞1分钟的情况,则会输出:
,或者点击下方分享给更多的朋友。您的支持将激励我创作更多技术干货!