Operations Manager Management Pack Authoring - Unit Monitors

This document is part of the Operations Manager Management Pack Authoring Guide.  The Microsoft System Center team has validated this procedure as of the original version.  We will continue to review any changes and periodically provide validations on later revisions as they are made.  Please feel free to make any corrections or additions to this procedure that you think would assist other users.


Event Monitors

Event monitors use one of the event data sources to identify a particular event that indicates an issue. After the data source that holds the required information is identified, the logic that determines the different health states must be defined. In addition to the logic that indicates whether an error condition has occurred, additional logic must be defined to determine when the state should be changed back to a healthy condition.

Detection Logic

The different kinds of logic that can be used to detect an error condition by using events are listed in the following table. As noted in the table, some logic can only be used with Windows events.

| Logic | Data Sources | Description |
| --- | --- | --- |
| Simple Event | All | Detects an error state from the occurrence of a single event. |
| Repeated Events | All | Detects an error state from one or more occurrences of a particular event in a specified time window. |
| Correlated Events | Windows Events | Detects an error state from the occurrence of two events in a specified time window. |
| Correlated Missing Events | Windows Events | Detects an error state from an expected event not being detected in a particular time window after the occurrence of another event. |
| Missing Event | Windows Events | Detects an error state from an expected event not being detected in a particular time window. |

Simple Event

Simple detection refers to a state change being triggered immediately after a single occurrence of the specified event. This is the most basic kind of detection and will apply to most scenarios.

Repeated Events

Repeated event detection uses one or more occurrences of a particular event in a time window to indicate an error condition. This typically applies to conditions in an application where a single event on its own can be ignored, but multiple occurrences of that event in a particular time window indicate a potential error. There are different algorithms that can be used for this detection, depending on the logic that best identifies the specific application issue. The following are details of the different algorithms:

Trigger on Timer

Trigger on timer consolidation of events uses a specified time window and is not dependent on the number of events received. As in simple detection, a single event can trigger an error in the health state. However, unlike simple detection, which sets the health state immediately upon detection of the specified event, trigger on timer consolidation waits until the end of the specified time window to set the health state of the monitor. The time window can be a rotating time duration of specified length or a specific window based on day of the week.

Trigger on timer consolidation is useful for errors that should only be detected in a certain time window. Used with a time window based on a specific time of day, this disables the monitor outside that time period. It can also have the effect of delaying the change of state for a particular time during which an event that indicates a healthy state could be received. In this case, the health state would never be changed.

Trigger on Count

Trigger on count consolidation of events lets a monitor require multiple occurrences of the same event in a specified time window before it changes the health state to an error. The time window can be a rotating time duration of specified length or a specific window based on day of the week.

Trigger on count consolidation resembles trigger on timer consolidation except that multiple occurrences of the event are required instead of just one. When the time window is reached, the event count is returned to zero, and the specified number of events must be detected before the time window expires again for the health state to be changed.

Trigger on Count, Sliding

Trigger on count, sliding consolidation of events is similar to trigger on count consolidation except that the time window is reset every time that the specified event is received. The time window only expires if the time is reached after the occurrence of the last event.

Trigger on count, sliding consolidation is useful for error conditions that are detected by a certain number of events in a particular length of time. With trigger on count consolidation, some events could be received in one time window and the remaining events in the next time window, with the result that the health state is never changed. With trigger on count, sliding consolidation, the time window depends on when the events occur, which prevents this condition.

Repeated Events Example

To help with understanding the different algorithms used for repeated event detection, the following table shows the effect on health state for monitors based on the different kinds of consolidation. This is based on a repeated event monitor that uses the following details:

  • Consolidation interval: 2 minutes
  • Compare count: 3 (ignored by Trigger on Timer)
  • Health state on repeated event: Critical
  • Reset Logic: Event reset using Event 3

| Time | Event | Trigger on Timer | Trigger on Count | Trigger on Count, Sliding |
| --- | --- | --- | --- | --- |
| 00:00:00 | - | Healthy | Healthy | Healthy |
| 00:01:00 | Event 1 | Healthy | Healthy | Healthy |
| 00:02:00 | - | Healthy | Healthy | Healthy |
| 00:02:30 | - | Healthy | Healthy | Healthy |
| 00:03:00 | - | Critical | Healthy | Healthy |
| 00:03:30 | Event 3 | Healthy | Healthy | Healthy |
| 00:04:00 | Event 1 | Healthy | Healthy | Healthy |
| 00:04:30 | - | Healthy | Healthy | Healthy |
| 00:05:00 | Event 1 | Critical | Healthy | Healthy |
| 00:05:30 | - | Critical | Healthy | Healthy |
| 00:06:00 | - | Critical | Healthy | Healthy |
| 00:06:30 | Event 1 | Critical | Healthy | Healthy |
| 00:07:00 | Event 1 | Critical | Healthy | Critical |
| 00:07:30 | - | Critical | Healthy | Critical |
| 00:08:00 | Event 1 | Critical | Healthy | Critical |
| 00:08:30 | - | Critical | Critical | Critical |
| 00:09:00 | Event 3 | Critical | Healthy | Healthy |

  • Using trigger on timer, a critical state is set at 00:03:00 even though the event is received at 00:01:00 because the time window starts when the monitor is loaded. The state is reset to healthy at 00:03:30, but the critical state is again triggered at 00:05:00 from the time window started at 00:03:00.
  • Using trigger on count, the event at 00:05:00 does not trigger a critical state because the time window started by the event at 00:01:00 would have expired at 00:03:00. This event is instead part of the time window started by the event at 00:04:00 which expires at 00:06:00. The monitor triggers a critical state at 00:08:30 because of the 3 events detected in the time window started with the event at 00:06:30.
  • Using trigger on count, sliding, each occurrence of Event 1 starts its own window. The critical state is triggered at 00:07:00 from the 3 events detected in the time window started with the event at 00:05:00.
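
The difference between the two count-based algorithms can be sketched in a few lines of Python. This is only an illustration of the behavior shown in the table above, not the product's implementation. The function names are hypothetical, times are seconds from 00:00:00, and the sketch assumes that a fixed window opens at the first event received after the previous window closed and that the state is set when that window expires.

```python
from collections import deque

# Event 1 occurrences from the example table, in seconds since 00:00:00.
EVENT_1_TIMES = [60, 240, 300, 390, 420, 480]


def trigger_on_count(event_times, interval, compare_count):
    """Fixed windows: return the expiry times of windows that reached the count."""
    fired = []
    window_start = None
    count = 0
    for t in sorted(event_times):
        # Close the current window if this event arrives at or after its expiry.
        if window_start is not None and t >= window_start + interval:
            if count >= compare_count:
                fired.append(window_start + interval)
            window_start = None
        # The first event after a closed window opens a new one (assumption).
        if window_start is None:
            window_start, count = t, 0
        count += 1
    # Evaluate the last window, assuming no further events arrive.
    if window_start is not None and count >= compare_count:
        fired.append(window_start + interval)
    return fired


def trigger_on_count_sliding(event_times, interval, compare_count):
    """Sliding window: return the event times at which the count is reached."""
    recent = deque()
    fired = []
    for t in sorted(event_times):
        recent.append(t)
        # Forget events that are older than one interval before this event.
        while recent[0] < t - interval:
            recent.popleft()
        if len(recent) >= compare_count:
            fired.append(t)
    return fired


print(trigger_on_count(EVENT_1_TIMES, 120, 3))
print(trigger_on_count_sliding(EVENT_1_TIMES, 120, 3))
```

With the example data, the fixed window reports 510 seconds (00:08:30) and the sliding window first reports 420 seconds (00:07:00), which matches the table.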

Correlated Events

A correlated event monitor uses two separate events in a particular time period to detect a single issue. This kind of monitor supports conditions where an issue cannot be identified by a single event alone.

When the first event is detected, a timer is triggered. If the second event is received within that period, the state change is triggered. If the second event is not received in the period, the timer is reset until the first event is received again. The monitor may be configured to better tune the specific conditions that must be met in order to perform correlation. These options include the following:

  • Whether the events must be in chronological order. One of the events may always be expected before the other one, or they may be expected in either order.
  • Whether the first or last occurrence of the first event should be used. If the first occurrence is specified, then each occurrence of the first event will have its own time window and search for corresponding occurrences of the second event. With the last occurrence specified, if the first event reoccurs within the time window, then the time window is extended based on the last event. The monitor can also be configured to reset the time window every time that the first event occurs. When the time window is reset, all previous occurrences of both events are ignored.
  • The number of occurrences of the second event that must be received to trigger the state change. Instead of changing the health state after receiving a single instance of the two events, multiple instances of the second event may be required.
  • Properties between the first and second event that must match for correlation to be performed. Instead of detecting two occurrences of each event, additional comparison may be required to determine whether the events are related. The monitor can, for example, confirm that a particular parameter matches between the two events to make sure that they match.

Correlated Events Example

The following table provides an example of a correlated event monitor by using the first and the last occurrence of the first event. The monitor uses the following details:

  • Event Log A: Event 1
  • Event Log B: Event 2
  • Correlation interval: 2 minutes
  • Number of occurrences of Event 2: 3
  • Health state on correlation: Critical
  • Reset Logic: Event reset using Event 3

| Time | Event | First Occurrence | Last Occurrence |
| --- | --- | --- | --- |
| 00:00:00 | - | Healthy | Healthy |
| 00:01:00 | Event 1 | Healthy | Healthy |
| 00:01:30 | Event 2 | Healthy | Healthy |
| 00:02:00 | Event 2 | Healthy | Healthy |
| 00:02:30 | - | Healthy | Healthy |
| 00:03:00 | Event 1 | Healthy | Healthy |
| 00:03:30 | Event 2 | Healthy | Healthy |
| 00:04:00 | Event 2 | Healthy | Healthy |
| 00:04:30 | Event 1 | Healthy | Healthy |
| 00:05:00 | Event 2 | Critical | Healthy |
| 00:05:30 | Event 3 | Healthy | Healthy |
| 00:06:00 | Event 1 | Healthy | Healthy |
| 00:06:30 | Event 2 | Healthy | Healthy |
| 00:07:00 | Event 1 | Healthy | Healthy |
| 00:07:30 | Event 2 | Healthy | Healthy |
| 00:08:00 | Event 2 | Critical | Healthy |
| 00:08:30 | Event 2 | Critical | Critical |
| 00:09:00 | Event 3 | Healthy | Healthy |

  • The First Occurrence does not trigger a critical state when Event 2 is detected at 00:03:30 because the timer was reset at 00:03:00, which is 2 minutes after the first occurrence of Event 1 at 00:01:00.
  • The First Occurrence triggers a critical state at 00:05:00 because Event 2 is detected 3 times within the 2 minutes since the first occurrence of Event 1 at 00:03:00. Event 1 starts a new time window at 00:03:00 because the time window from Event 1 at 00:01:00 would have expired.
  • The First Occurrence triggers a critical state at 00:08:00 because Event 2 is detected 3 times within 2 minutes from Event 1 at 00:06:00.
  • The First Occurrence resets its state to healthy at 00:05:30 and 00:09:00 because Event 3 is detected.
  • The Last Occurrence does not trigger a critical state until 00:08:30 because each occurrence of Event 1 restarts the time window and earlier occurrences of Event 2 are ignored. Event 2 is detected 3 times within 2 minutes only after the last occurrence of Event 1 at 00:07:00.
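
The following Python sketch models the two correlation modes against the events in the table above. It is an illustration under stated assumptions rather than the product's logic: the function name is hypothetical, times are seconds from 00:00:00, each occurrence of Event 1 keeps its own window in first-occurrence mode, and in last-occurrence mode a new Event 1 discards everything counted so far.

```python
# Events from the example table as (seconds since 00:00:00, event name).
EXAMPLE_EVENTS = [
    (60, "Event 1"), (90, "Event 2"), (120, "Event 2"),
    (180, "Event 1"), (210, "Event 2"), (240, "Event 2"),
    (270, "Event 1"), (300, "Event 2"), (330, "Event 3"),
    (360, "Event 1"), (390, "Event 2"), (420, "Event 1"),
    (450, "Event 2"), (480, "Event 2"), (510, "Event 2"),
    (540, "Event 3"),
]


def correlated_critical_times(events, interval, required, use_last_occurrence):
    """Return the times at which the monitor changes to a critical state."""
    windows = {}      # Event 1 timestamp -> count of Event 2 seen in its window
    critical = False
    changes = []
    for t, name in events:  # events must be in chronological order
        # Drop windows that expired before this event.
        windows = {s: n for s, n in windows.items() if t <= s + interval}
        if name == "Event 1":
            if use_last_occurrence:
                windows = {}  # restart: earlier occurrences are ignored
            windows[t] = 0
        elif name == "Event 2":
            for start in windows:
                windows[start] += 1
            if not critical and any(n >= required for n in windows.values()):
                critical = True
                changes.append(t)
        elif name == "Event 3":
            # Event reset: return to healthy and start over.
            windows = {}
            critical = False
    return changes


print(correlated_critical_times(EXAMPLE_EVENTS, 120, 3, use_last_occurrence=False))
print(correlated_critical_times(EXAMPLE_EVENTS, 120, 3, use_last_occurrence=True))
```

For the example data, the first-occurrence mode changes to critical at 300 and 480 seconds (00:05:00 and 00:08:00), and the last-occurrence mode only at 510 seconds (00:08:30).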

Correlated Missing Events

A correlated missing event monitor determines an error by the absence of a particular event after the occurrence of another. This resembles the missing event monitor except that instead of searching for the missing event in a particular time window, the monitor searches for the event in a particular time after another event is first detected.

For example, consider an application that performs a backup each evening and creates an event when it starts and a second event when it has completed successfully. A correlated missing event monitor could be created that searches for the second event within a particular time after the first event is detected. When the first event is found, the timer starts. If the second event is detected before the time is reached, then the monitor remains in a healthy state. If the time is reached before the second event is detected, then the state change is triggered to indicate that the last backup did not occur successfully.

Correlated Missing Events Example

The following table provides an example of a correlated missing event monitor by using the first and the last occurrence of the first event. The monitor uses the following details:

  • Missing Event Log A: Event 1
  • Missing Event Log B: Event 2
  • Correlation interval: 2 minutes
  • Number of occurrences of Event 2: 3
  • Health state on correlation: Critical
  • Reset Logic: Event reset using Event 3

| Time | Event | First Occurrence | Last Occurrence |
| --- | --- | --- | --- |
| 00:00:00 | - | Healthy | Healthy |
| 00:01:00 | Event 1 | Healthy | Healthy |
| 00:01:30 | Event 2 | Healthy | Healthy |
| 00:02:00 | Event 2 | Healthy | Healthy |
| 00:02:30 | Event 1 | Healthy | Healthy |
| 00:03:00 | - | Critical | Healthy |
| 00:03:30 | Event 2 | Critical | Healthy |
| 00:04:00 | Event 2 | Critical | Healthy |
| 00:04:30 | - | Critical | Critical |
| 00:05:00 | Event 3 | Healthy | Healthy |

  • The First Occurrence triggers a critical state at 00:03:00 because Event 2 has not been detected 3 times in the 2 minute interval since the first occurrence of Event 1 at 00:01:00.
  • The Last Occurrence does not trigger a critical state at 00:03:00 because Event 1 occurs at 00:02:30, resetting the timer. The critical state is not triggered until 00:04:30, when Event 2 has not been detected 3 times in the 2 minute interval since the last occurrence of Event 1 at 00:02:30.
  • The single occurrence of Event 3 at 00:05:00 resets both monitors to healthy.
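
As a rough illustration, the deadline logic in this example can be written as a short Python function. The function name is hypothetical, times are seconds from 00:00:00, and the sketch assumes that the monitor turns critical exactly when the interval expires and that no further events arrive after the end of the list. It returns only the first time the monitor turns critical and does not model the Event 3 reset.

```python
# Events from the example table as (seconds since 00:00:00, event name).
EXAMPLE_EVENTS = [
    (60, "Event 1"), (90, "Event 2"), (120, "Event 2"),
    (150, "Event 1"), (210, "Event 2"), (240, "Event 2"),
    (300, "Event 3"),
]


def missing_correlated_deadline(events, interval, required, use_last_occurrence):
    """Return the time the monitor first turns critical, or None if it never does."""
    start = None   # time of the Event 1 that started the current timer
    count = 0      # occurrences of Event 2 seen since the timer started
    for t, name in events:  # events must be in chronological order
        # Has the timer expired before this event arrived?
        if start is not None and t > start + interval:
            if count < required:
                return start + interval
            start = None  # requirement met; wait for the next Event 1
        if name == "Event 1":
            # In last-occurrence mode, every Event 1 restarts the timer.
            if start is None or use_last_occurrence:
                start, count = t, 0
        elif name == "Event 2" and start is not None:
            count += 1
    # Evaluate a timer that is still running when the list ends.
    if start is not None and count < required:
        return start + interval
    return None


print(missing_correlated_deadline(EXAMPLE_EVENTS, 120, 3, use_last_occurrence=False))
print(missing_correlated_deadline(EXAMPLE_EVENTS, 120, 3, use_last_occurrence=True))
```

With the example data, the first-occurrence mode turns critical at 180 seconds (00:03:00) and the last-occurrence mode at 270 seconds (00:04:30).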

Missing Event

Instead of detecting a particular event to identify an error condition, a missing event monitor uses the absence of a particular event in a particular time window to determine an error. This supports applications that are expected to generate an informational event that indicates a successful operation or the success of a particular action.

For example, consider an application that performs a scheduled data transfer each evening and creates an event when it has completed successfully. A missing event monitor could be created that searches for the event in a particular time window each evening. If the event is detected, then the monitor remains in a healthy state. If it is not found, then it enters error state that indicates that the last transfer did not occur successfully.
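
A minimal Python sketch of this check follows. It assumes the evaluation runs once, after the daily window has closed, over the timestamps of the expected event collected that day. The function name and the sample dates are hypothetical.

```python
from datetime import datetime, time


def missing_event_state(event_times, window_start, window_end):
    """Return 'Healthy' if any expected event falls inside the window, else 'Critical'."""
    seen = any(window_start <= ts.time() <= window_end for ts in event_times)
    return "Healthy" if seen else "Critical"


window_start, window_end = time(2, 0), time(3, 0)

# The transfer completed at 2:30 AM, inside the window.
print(missing_event_state([datetime(2024, 1, 1, 2, 30)], window_start, window_end))

# The only event was logged at 1:00 AM, before the window opened.
print(missing_event_state([datetime(2024, 1, 1, 1, 0)], window_start, window_end))
```

An event that arrives outside the window does not count, so the second call reports a critical state even though the event was logged.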

Missing Event Example

The following table provides an example of a missing event monitor by using the following details:

  • Event: Event 1
  • Fixed Schedule: Su-Sa 2:00 AM – 3:00 AM
  • Health state on missing event: Critical
  • Reset Logic: Event reset using Event 3

| Time | Event | Health State |
| --- | --- | --- |
| 00:00:00 | - | Healthy |
| 00:01:00 | Event 1 | Healthy |
| 00:02:00 | - | Healthy |
| 00:03:00 | - | Healthy |
| 00:04:00 | - | Critical |
| 00:05:00 | Event 3 | Healthy |

  • The critical state is triggered at 00:03:00 when Event 1 is not detected within the specified window.

Health Reset Logic

The previous detection criteria describe the conditions under which a monitor changes to a warning or critical state. In addition to detecting an error state, each monitor must have logic defined to determine when the state should be returned to healthy. The different methods for resetting state are shown in the following table:

| Reset Logic | Description |
| --- | --- |
| Event Reset | A single specific event indicates that the monitor should be reset. |
| Manual Reset | The monitor is never automatically reset. The user must manually reset the monitor. |
| Timer Reset | The monitor is automatically reset after a specified time. |

Each of these methods is discussed at length in the following sections:

Event Reset

With event reset, the monitor is reset when a single occurrence of a specific event is detected. The event must be the same type as the event used for detecting the error condition. For example, a Windows event monitor might specify an event with a particular event source and number to indicate an error condition. Another Windows event with the same event source but a different number might indicate that the error in the application was corrected.

Event reset can only be used if the application provides an event indicating the particular error was corrected. Many applications create an event when an error occurs but may not create a corresponding event that indicates that the error was corrected. Event reset cannot be used in this case.

Manual Reset

With manual reset, the monitor never returns to a healthy state automatically. The user must determine whether the problem was corrected and then select the monitor in the Health Explorer and select Reset Health.

The advantage to this strategy is that a monitor can be used for issues that do not create an event that indicates a healthy state. The monitor can affect the health state of the managed object instead of creating a simple alert from a rule. The downtime will be recorded for the object in the State Change Events in the Operations Console and in any availability reports.

There are multiple implications of this strategy that should be considered. The first is the additional work required from the user because the monitor will never automatically reset. It can also result in too much downtime being recorded if the user waits a long time before performing the reset. The problem may have been corrected fairly quickly, but the healthy state will not be recorded until the user performs the reset.

Be especially cautious about using manual reset for monitors where a single problem could affect multiple instances of the target class. Because users cannot reset the monitor for multiple instances in the Operations Console, the user would be required to manually open the Health Explorer for each instance to perform this action. Depending on the number of instances, this could result in significant effort for the user.

Timer Reset

A timer reset acts the same as a manual reset except that if the user does not manually reset the monitor after a specified time, it will reset automatically. One use of this kind of reset is for issues that continuously log error events until the problem is corrected. Instead of using another event to indicate that the problem was corrected, the absence of the previously detected error event for a specified period can be used as the success criterion.

The timer reset can be used in the place of a manual reset providing the advantage of automatically resetting after a while if the user does not perform a manual reset.

Performance Monitors

Monitors in a System Center Operations Manager 2007 management pack based on performance counters collect numeric data at set intervals and compare it to one or more threshold values. This may be a simple comparison that compares each sample to a single threshold or more complex logic, depending on the requirements of the application.

Threshold Types

Multiple kinds of calculations may be performed to determine the threshold for a performance monitor. These threshold types are listed in the following table:

| Threshold Type | Number of States | Description |
| --- | --- | --- |
| Average Threshold | 2 | Compare the average of multiple collected values to a threshold. |
| Consecutive Samples | 2 | Compare several consecutive values to a threshold. All collected values must match the threshold criteria. |
| Delta Threshold | 2 | Compare the change between two consecutive values to a threshold. |
| Double Threshold | 3 | Compare a single collected value to two thresholds with one that indicates a Warning state and the other that indicates a Critical state. |
| Simple Threshold | 2 | Compare a single collected value to a threshold. |

Each kind of logic is described in detail in the following sections:

Simple Threshold

The simple threshold type is the most basic kind of performance threshold. A single numeric value is provided for the threshold. This threshold is compared to the measured value of the performance data.

Simple threshold supports a two state monitor. One state is set by a performance value equal to or less than the threshold. The other state is set by a performance value greater than the threshold.

Double Threshold

The double threshold type is similar to the simple threshold type but allows for two thresholds to be specified. Each threshold is compared to the measured value of the performance data.

Double threshold supports a three state monitor. One state is set by a performance value less than the low threshold. Another state is set by a performance value that is greater than or equal to the low threshold and less than or equal to the high threshold. The third state is set by a value that is greater than the high threshold.
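
The comparison can be sketched as a small Python function. The function name is hypothetical, and the assignment of Healthy, Warning, and Critical to the three ranges is just one possible configuration. The boundary handling follows the paragraph above and should be checked against the product before relying on it.

```python
def double_threshold_state(value, low, high):
    """Map a sampled value to one of three states using two thresholds."""
    if value < low:
        return "Healthy"   # under the low threshold
    if value <= high:
        return "Warning"   # between the thresholds, boundaries included
    return "Critical"      # over the high threshold


for sample in (5, 10, 12, 16):
    print(sample, double_threshold_state(sample, low=10, high=15))
```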

The following table provides an example of a double threshold monitor by using the following details:

  • Sample rate: 5 minutes
  • Low threshold value: 10
  • High threshold value: 15
  • Over Upper Threshold State: Critical
  • Between Thresholds State: Warning
  • Under Lower Threshold State: Healthy

| Time | Value | State |
| --- | --- | --- |
| 00:00 | 5 | Healthy |
| 05:00 | 10 | Warning |
| 10:00 | 12 | Warning |
| 15:00 | 9 | Healthy |
| 20:00 | 12 | Warning |
| 25:00 | 16 | Critical |
| 30:00 | 15 | Critical |
| 35:00 | 8 | Healthy |

  • The warning threshold is first reached at 00:05:00, but the value does not exceed the critical threshold.
  • The critical threshold is first exceeded at 00:25:00 when the state is changed from warning to critical.
  • The state is returned to a healthy state at 00:15:00 and 00:35:00 when the sampled value is less than the warning threshold.

Average Threshold

The average threshold type calculates the average of a specified number of consecutive samples and compares it to the specified threshold.

Average threshold supports a two state monitor. One state is set by an average performance value equal to or less than the threshold. The other state is set by an average performance value greater than the threshold.

The following table provides an example of an average threshold monitor by using the following details:

  • Sample rate: 5 minutes
  • Threshold value: 10
  • Number of samples: 3
  • Over Threshold State: Critical
  • Under Threshold State: Healthy

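| Time | Value | Average | State |
| --- | --- | --- | --- |
| 00:00 | 5 | - | Healthy |
| 05:00 | 10 | - | Healthy |
| 10:00 | 12 | 9.0 | Healthy |
| 15:00 | 9 | 10.3 | Critical |
| 20:00 | 12 | 11.0 | Critical |
| 25:00 | 14 | 11.7 | Critical |
| 30:00 | 11 | 12.3 | Critical |
| 35:00 | 4 | 9.7 | Healthy |
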
  • Because the specified number of samples for the average calculation is 3, no value is evaluated until the third sample.
  • The value of 12 sampled at 00:10:00 exceeds the threshold value, but the calculated average from the last 3 samples is 9.0, which is under the threshold. The state is not changed.
  • The value of 9 sampled at 00:15:00 does not exceed the threshold. But the calculated average from the last 3 samples is 10.3 which does exceed the threshold. The state is changed.
  • The monitor does not return to a healthy state until 00:35:00 when the average from the last 3 samples drops under the threshold value.
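
The rolling average in this example can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes the monitor stays healthy until enough samples have been collected.

```python
from collections import deque

# Sampled values from the example table, one every 5 minutes.
SAMPLES = [5, 10, 12, 9, 12, 14, 11, 4]


def average_threshold_states(values, threshold, num_samples):
    """Return the state after each sample, based on the average of the last N samples."""
    window = deque(maxlen=num_samples)  # keeps only the most recent N samples
    states = []
    for value in values:
        window.append(value)
        if len(window) < num_samples:
            states.append("Healthy")  # not enough samples to evaluate yet
        else:
            average = sum(window) / num_samples
            states.append("Critical" if average > threshold else "Healthy")
    return states


print(average_threshold_states(SAMPLES, threshold=10, num_samples=3))
```

For the example values, the result is three healthy samples, four critical samples, and a final healthy sample, matching the State column.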

Consecutive Samples

The consecutive threshold type compares the threshold value to the performance counter for several consecutive samples. This supports monitors that should not be triggered by only a single value exceeding a threshold. The threshold must be exceeded multiple consecutive times to trigger a change in state.

Consecutive threshold supports a two state monitor. One state is set by the value being either greater than or less than the threshold value for each consecutive sample. The other state is set by a single sample not matching the other criteria.

The following table provides an example of a consecutive sample monitor by using the following details:

  • Sample rate: 5 minutes
  • Threshold value: greater than or equal to 10
  • Number of samples: 3
  • Over Threshold State: Critical
  • Under Threshold State: Healthy

| Time | Value | State |
| --- | --- | --- |
| 00:00 | 5 | Healthy |
| 05:00 | 10 | Healthy |
| 10:00 | 12 | Healthy |
| 15:00 | 9 | Healthy |
| 20:00 | 12 | Healthy |
| 25:00 | 14 | Healthy |
| 30:00 | 11 | Critical |
| 35:00 | 8 | Healthy |

  • The threshold is exceeded by the values sampled at 00:05:00 and 00:10:00, but the value at 00:15:00 is under threshold and resets the count.
  • The value at 00:30:00 is the first time that 3 consecutive values have been sampled that exceed the threshold, so the state is changed.
  • The single value at 00:35:00 is under the threshold and resets the monitor to a healthy state.
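
A run counter is enough to model this behavior. The following Python sketch uses a hypothetical function name and the "greater than or equal to" comparison from the example configuration.

```python
# Sampled values from the example table, one every 5 minutes.
SAMPLES = [5, 10, 12, 9, 12, 14, 11, 8]


def consecutive_sample_states(values, threshold, num_samples):
    """Return the state after each sample; N matching samples in a row are required."""
    run = 0  # number of consecutive samples at or over the threshold
    states = []
    for value in values:
        # A single sample under the threshold resets the run to zero.
        run = run + 1 if value >= threshold else 0
        states.append("Critical" if run >= num_samples else "Healthy")
    return states


print(consecutive_sample_states(SAMPLES, threshold=10, num_samples=3))
```

With the example values, only the seventh sample (00:30:00) produces a critical state.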

Delta Threshold

The delta threshold type compares the threshold value to the difference between two performance values. This might be two consecutive values or two values separated by a specified number of samples.

Delta threshold supports a two state monitor. One state is set by the difference of two values being greater than the threshold value. The other state is set by the difference of two samples being equal to or less than the threshold value.

The following table provides an example of a delta threshold monitor by using the following details:

  • Sample rate: 5 minutes
  • Threshold value: 10
  • Number of samples: 3
  • Over Threshold State: Critical
  • Under Threshold State: Healthy

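| Time | Value | Delta | State |
| --- | --- | --- | --- |
| 00:00 | 7 | - | Healthy |
| 05:00 | 8 | - | Healthy |
| 10:00 | 13 | - | Healthy |
| 15:00 | 16 | 9 | Healthy |
| 20:00 | 21 | 13 | Critical |
| 25:00 | 24 | 11 | Critical |
| 30:00 | 25 | 9 | Healthy |
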
  • Because the delta is calculated from the current sampled value to the value 3 samples behind, no value is evaluated until the fourth sample.
  • The delta calculation exceeds the threshold value at 00:20:00, and the state is changed.
  • The monitor is reset at 00:30:00 when the delta calculation falls under the threshold.
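
The delta calculation in this example can be sketched as follows. The function name is hypothetical. The sketch uses the signed difference between the current value and the value the specified number of samples behind it, as the table does, and assumes the monitor stays healthy until a delta can be calculated.

```python
# Sampled values from the example table, one every 5 minutes.
SAMPLES = [7, 8, 13, 16, 21, 24, 25]


def delta_threshold_states(values, threshold, num_samples):
    """Return (delta, state) per sample; delta compares against N samples back."""
    results = []
    for index, value in enumerate(values):
        if index < num_samples:
            results.append((None, "Healthy"))  # no earlier sample to compare with
        else:
            delta = value - values[index - num_samples]
            state = "Critical" if delta > threshold else "Healthy"
            results.append((delta, state))
    return results


for delta, state in delta_threshold_states(SAMPLES, threshold=10, num_samples=3):
    print(delta, state)
```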

Self-tuning Threshold Monitors

A self-tuning threshold monitor uses a learning process to determine the typical values for a specified performance counter object and automatically sets the threshold levels based on the learned values. Avoid self-tuning threshold monitors because they may not work well in most customer environments.

Script Monitors

Script monitors run a monitoring script regularly and evaluate the results to determine the state of the monitor. The script could perform such actions as running a synthetic transaction against an application, gathering performance data to be evaluated against a threshold, or retrieving a status of some aspect of the application. Script monitors incur more overhead than the other types of monitors and should be used only when one of those monitors does not provide the required functionality.

Script monitors can use either two states or three states. Criteria must be defined for each state using values from the property bag created by the script. The kinds of values in the property bag will vary depending on the particular script. A numeric value might be compared to a threshold value as in a performance monitor. In that case, the healthy state might be defined by the value being under the threshold value while the critical state is defined by the value being over the same threshold. A synthetic transaction might return a text result indicating whether the test was successful or not. In that case, the criteria for each state would be the string indicating that particular health.
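
To make the idea of per-state criteria concrete, the following Python sketch treats the property bag as a plain dictionary. The property names `Result` and `ResponseMs`, the 2000 millisecond limit, and the function name are all hypothetical. A real monitor defines its criteria in the management pack rather than in code like this.

```python
def evaluate_property_bag(bag):
    """Return the health state for a property bag from a hypothetical test script."""
    # Critical: the synthetic transaction reported a failure.
    if bag["Result"] != "Success":
        return "Critical"
    # Warning: the transaction succeeded but was slower than the limit.
    if bag["ResponseMs"] > 2000:
        return "Warning"
    # Healthy: the transaction succeeded within the limit.
    return "Healthy"


print(evaluate_property_bag({"Result": "Success", "ResponseMs": 150}))
print(evaluate_property_bag({"Result": "Success", "ResponseMs": 3500}))
print(evaluate_property_bag({"Result": "Failure", "ResponseMs": 0}))
```

The sketch combines a text criterion and a numeric criterion in one three state monitor. A two state monitor would keep only the first and last checks.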

Service Monitors

Service monitors measure the running state of a Windows service. There is no configuration required other than the name of the service. This is a two state monitor that is set to a healthy state if the service is running and a critical state if the service is not running. The monitor can be configured to check the startup type of the service. This ensures that the service is only monitored if its startup type is set to automatic.

