System Health Manager Concepts

This section explains the policies involved in process monitoring.

Relaunch policy

Once a monitored process has failed, SHMA attempts to re-launch the process. If all re-launch attempts fail, then SHMA either ignores the failure of the process, or restarts the device (for a critical component).

If a monitored component is successfully re-launched and then subsequently fails, the next re-launch attempt occurs only after at least KWaitTime seconds have elapsed since the last re-launch attempt. This mechanism is referred to as ‘re-launch throttling’.
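The throttling rule above can be sketched as a small calculation. This is only an illustration in standard C++, not SHMA code: the function name RelaunchDelay and its parameters are hypothetical, and waitTime merely stands in for the KWaitTime patchable constant.

```cpp
#include <algorithm>
#include <chrono>

using Seconds = std::chrono::seconds;

// Hypothetical helper (not an SHMA API). Given the time that has passed
// since the last re-launch attempt, returns how long the next attempt
// must still be deferred. waitTime stands in for KWaitTime.
Seconds RelaunchDelay(Seconds sinceLastAttempt, Seconds waitTime) {
    // No deferral is needed once at least waitTime has elapsed.
    return std::max(Seconds(0), waitTime - sinceLastAttempt);
}
```

With a wait time of 5 seconds, a failure 3 seconds after the last attempt would be deferred by a further 2 seconds, while a failure 7 seconds after it could be handled immediately.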

SHMA provides a method to control process restarts by specifying the ‘rate of failure’ (that is, of restart) of a monitored process. If the process's failure rate exceeds the value set for the component, no new restart attempts are made on the component. The restart count of each process is decremented (if it is greater than zero) at regular intervals defined by the KWaitTime patchable constant. At a restart attempt, if the count exceeds the specified limit, the process is not restarted and the retry failure policy is enacted.
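The rate-of-failure bookkeeping described above might be modelled as follows. This is a minimal sketch in standard C++ under stated assumptions: the class RestartCounter and its member names are hypothetical, and it assumes that the count is incremented on every failure (including a refused one) before being compared with the limit. The real SHMA implementation may differ.

```cpp
// Hypothetical model of the restart count; not part of the SHMA API.
class RestartCounter {
public:
    explicit RestartCounter(int limit) : limit_(limit), count_(0) {}

    // Called once per elapsed KWaitTime interval: the count decays
    // towards zero but never goes below it.
    void OnIntervalElapsed() {
        if (count_ > 0) {
            --count_;
        }
    }

    // Called when the monitored process fails. Returns true if a restart
    // should be attempted, or false if the count now exceeds the limit
    // and the retry failure policy should be enacted instead.
    bool OnFailure() {
        ++count_;  // assumption: every failure counts towards the rate
        return count_ <= limit_;
    }

    int Count() const { return count_; }

private:
    int limit_;  // maximum tolerated restart count
    int count_;  // current restart count
};
```

In this model a process that fails in quick succession soon exceeds its limit, whereas one that fails only occasionally has its count decay between failures and continues to be restarted.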

Retry failure policy

The purpose of the SHMA is to restart processes that fail unexpectedly. The retry failure policy specifies the action that the System Health Manager takes if a process fails to restart within the specified limit (set in the patchable constant KWaitTime).

The possible courses of action are:

  • ignore the failure

  • restart the OS

  • restart the OS in a different start-up mode (for example, normal mode, textual mode)

The client must specify the retry failure policy for each monitoring request. The client requires certain capabilities to request each course of action.

NOTE: System Health Manager may be requested to restart the OS as soon as a critical process fails, without first attempting to restart the failed process.
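The way a retry failure policy selects a course of action can be sketched as a simple decision function. The policy names are those listed in the capabilities table in this section; everything else (the enum type names, the RecoveryAction values and the DecideAction function) is hypothetical and shown in standard C++ purely for illustration.

```cpp
// Policy names come from this section; the type name is an assumption.
enum RetryFailurePolicy {
    ESsmIgnoreOnFailure,
    ESsmRestartOS,
    ESsmRestartOSWithMode,
    ESsmCriticalNoRetries
};

// Hypothetical set of outcomes, used only by this sketch.
enum RecoveryAction {
    EAttemptRelaunch,
    ENoAction,
    ERestartOs,
    ERestartOsInStartupMode
};

// Hypothetical helper: chooses what to do after a monitored process fails.
RecoveryAction DecideAction(RetryFailurePolicy policy,
                            bool relaunchAttemptsExhausted) {
    // A critical process with no retries restarts the OS immediately.
    if (policy == ESsmCriticalNoRetries) {
        return ERestartOs;
    }
    // Otherwise keep re-launching until the attempts are used up.
    if (!relaunchAttemptsExhausted) {
        return EAttemptRelaunch;
    }
    switch (policy) {
        case ESsmIgnoreOnFailure:   return ENoAction;
        case ESsmRestartOS:         return ERestartOs;
        case ESsmRestartOSWithMode: return ERestartOsInStartupMode;
        default:                    return ERestartOs;  // handled above
    }
}
```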

Capabilities required

When requesting monitoring, a client requires the following capabilities:

Retry Failure Policy    Description                             Capability Required
--------------------    ------------------------------------    --------------------------------
ESsmIgnoreOnFailure     Failure to restart is ignored; no       None (for self monitoring)
                        action is taken.                        ProtServ (for monitoring another
                                                                process)

ESsmRestartOS           System is restarted in normal mode.     ProtServ

ESsmRestartOSWithMode   System is restarted in a specific       ProtServ and PowerMgmt
                        start-up mode based on a value set
                        in the ROM.

ESsmCriticalNoRetries   System is restarted without any         ProtServ
                        attempt to restart the failed
                        component.
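The capability requirements listed above can be expressed as a lookup. This is a sketch only, in standard C++: the bit-flag constants, the enum type name and both functions are hypothetical stand-ins and do not represent the platform security API.

```cpp
// Policy names come from this section; the type name is an assumption.
enum RetryFailurePolicy {
    ESsmIgnoreOnFailure,
    ESsmRestartOS,
    ESsmRestartOSWithMode,
    ESsmCriticalNoRetries
};

// Hypothetical bit flags standing in for platform security capabilities.
const unsigned kCapNone      = 0u;
const unsigned kCapProtServ  = 1u << 0;
const unsigned kCapPowerMgmt = 1u << 1;

// Returns the capabilities a client needs to request the given policy.
unsigned RequiredCapabilities(RetryFailurePolicy policy, bool selfMonitoring) {
    switch (policy) {
        case ESsmIgnoreOnFailure:
            return selfMonitoring ? kCapNone : kCapProtServ;
        case ESsmRestartOS:
            return kCapProtServ;
        case ESsmRestartOSWithMode:
            return kCapProtServ | kCapPowerMgmt;
        case ESsmCriticalNoRetries:
            return kCapProtServ;
    }
    // Defensive default for an unknown value: require everything.
    return kCapProtServ | kCapPowerMgmt;
}

// True if the capabilities held by the client cover those required.
bool MayRequestMonitoring(unsigned heldCapabilities,
                          RetryFailurePolicy policy,
                          bool selfMonitoring) {
    const unsigned required = RequiredCapabilities(policy, selfMonitoring);
    return (heldCapabilities & required) == required;
}
```

For example, a client holding only the ProtServ stand-in would be able to request ESsmRestartOS but not ESsmRestartOSWithMode, which also needs PowerMgmt.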