TA9 - Technical Guide: Setting up an On-Call Schedule using Shared Notification Profiles
Summary
In today’s Enterprise IT environments, 24x7 uptime is becoming an increasingly common requirement. Supporting global markets and a constant web presence has expanded the need for business continuity.
To keep up with ‘Always On’ architectures, IT maintenance staff are required to be available outside of business hours should problems arise. While these schedules are very commonly in place, they are often difficult to implement. Opsview Monitor’s Shared Notification Profile enables IT teams to construct a reliable process for out-of-hours maintenance.
The first requirement when developing an out-of-hours maintenance process for unplanned outages is that the process must be collectively exhaustive, with minimal management overhead when changes need to be made. This technical guide details a strategy that does all filtering in one step, so that it is easy to make sure that all hosts and services can send alerts to the relevant people outside of normal operating hours.
The second requirement of this on-call schedule is to limit notifications to the most appropriate audience so that the team can work effectively while maintaining a desirable work-life balance. The strategy detailed in this guide calls for the first line of alerts to follow the MECE principle, meaning that they are mutually exclusive while remaining collectively exhaustive. Notifications may then be expanded to broader audiences as an issue escalates.
The third requirement is that changes can be easily made to the plan when hosts and services are added or removed from the system, or when staff or schedule changes need to be made.
This guide details a system that will automatically incorporate new hosts and services into the business unit’s notifications, which users can easily subscribe to or be assigned to.
Strategy
Opsview Monitor can filter alerts at many different levels. The strategy in this guide is to accomplish all filtering in one place so that change management can be kept under control. The first level at which an alert can be filtered is the service check. The screenshot below shows that all of the status types must be checked or else alerts will never be raised past that level. There are also options for entering the time window in which alerts will be sent and the time between escalations, which can optionally be inherited from the host level.
The second filter is at the host level. Similar to the screenshot above, there is a ‘Notifications’ tab where statuses must be checked and a time window needs to be selected or alerts will not be raised beyond that point.
According to our desired strategy, it is most appropriate to allow all alerts to be raised through the service check and host level by always checking all boxes and setting the ‘Notification Period’ to 24x7. When this is the case, should any problem status occur, Opsview Monitor will raise an internal alert and log an event. If and only if there is a relevant Notification Profile for this alert will an email or text message be sent. By leveraging these Notification Profiles, a weekly on-call schedule can be created and the shifts can be divided by business unit or team, notification method, escalation level, and time window. These shifts can be subscribed to or assigned to users so that the business can ensure coverage outside of business hours without entire teams being constantly bombarded by emails.
The first step for creating an on-call schedule in Opsview Monitor is to create time period entries for all possible shifts. There are some standard entries for time periods including 24x7, working hours and non-working hours. Verify that the working hours time period exists in your environment and note the 24-hour time format. This is the time period that will be used most often in private Notification Profiles.
Time periods for the on-call shifts need to be created. These shifts can be as granular and specific as required by the business. Some sample time periods that could be applied would be the after work and next morning shift and the weekend shift examples below. Time notation can be comma separated to represent multiple time windows within each day.
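As a rough illustration of the comma-separated notation, the Python sketch below checks whether a given time of day falls inside a shift. The `HH:MM-HH:MM` string format, the sample shift value, and the function names are assumptions made for this example. It is not Opsview Monitor's own parser.

```python
def to_minutes(text: str) -> int:
    """Convert 'HH:MM' (24-hour clock) to minutes since midnight."""
    hours, minutes = text.strip().split(":")
    return int(hours) * 60 + int(minutes)


def parse_windows(spec: str) -> list[tuple[int, int]]:
    """Split a comma-separated spec into (start, end) minute pairs."""
    windows = []
    for part in spec.split(","):
        start_text, end_text = part.strip().split("-")
        windows.append((to_minutes(start_text), to_minutes(end_text)))
    return windows


def in_period(spec: str, hour: int, minute: int) -> bool:
    """Return True if hour:minute falls inside any window of the spec."""
    now = hour * 60 + minute
    return any(start <= now < end for start, end in parse_windows(spec))


# Hypothetical "after work and next morning" shift for a single weekday.
AFTER_WORK = "00:00-09:00,17:00-24:00"

print(in_period(AFTER_WORK, 18, 30))  # evening, inside the shift
print(in_period(AFTER_WORK, 12, 0))   # midday, outside the shift
```

Each window is treated as half-open (start included, end excluded), so two adjacent windows never both claim the same minute.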
One of these time periods will be selected when a Notification Profile is created. These time periods can be reused across many profiles and may have many other applications in Opsview Monitor such as timed exceptions and check periods for hosts and service checks.
Roles
Roles are an important tool in Opsview Monitor, used to restrict access to specific objects and actions per user or group. A suggested configuration of the ‘Status Access’ tab of a Role for this tutorial is shown below. The idea is that anyone who is on-call should be able to view objects, acknowledge events, schedule downtime, receive alerts and test service checks on objects they have access to.
The next step is to choose which objects a user can interact with. In this menu, host groups, service groups and Hashtags can be selected. The objects that a user under this role will have access to are those that sit within both the selected host groups and the selected service groups, plus any objects within selected Hashtags.
For example, if the host group Linux Servers was selected and the service group Application – Apache Server was selected, the set of objects accessible to the user is all Apache services running on Linux servers. Only the Linux hosts running Apache will be accessible and only the Apache services will be visible. The next piece of logic, after the intersection of host groups and service groups, is the union with all objects in selected Hashtags. Role access by Hashtag is the easiest way to make flexible permissions, since a Hashtag can be made up of any combination of individual hosts and services that the user wants.
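The access rule described above (host group AND service group, OR any selected Hashtag) can be modelled in a few lines of Python. This is only a sketch of the logic as this guide describes it. The `MonitoredService` class, the function name, and the sample inventory are hypothetical and are not part of Opsview Monitor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoredService:
    """Hypothetical record for one service check on one host."""
    host: str
    host_group: str
    service: str
    service_group: str
    hashtags: frozenset = frozenset()


def role_can_access(obj, host_groups, service_groups, hashtags):
    """(selected host group AND selected service group) OR any selected Hashtag."""
    in_groups = obj.host_group in host_groups and obj.service_group in service_groups
    tagged = bool(obj.hashtags & set(hashtags))
    return in_groups or tagged


# Invented sample inventory.
inventory = [
    MonitoredService("web01", "Linux Servers", "Apache Processes", "Application - Apache Server"),
    MonitoredService("web01", "Linux Servers", "Disk", "OS - Base"),
    MonitoredService("win01", "Windows Servers", "Apache Processes", "Application - Apache Server"),
    MonitoredService("db01", "Linux Servers", "MySQL Connections", "Database - MySQL",
                     frozenset({"payments"})),
]

visible = [
    (obj.host, obj.service)
    for obj in inventory
    if role_can_access(obj, {"Linux Servers"}, {"Application - Apache Server"}, {"payments"})
]
print(visible)
```

In this model the role sees the Apache check on the Linux host through the group intersection and the MySQL check through the Hashtag, but not the disk check on `web01` or the Apache check on the Windows host.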
This is the step where the MECE principle must be followed, as all object-level filtering occurs here. It is therefore important to make sure that at least one role covers every service that needs to be alerted on outside of regular hours, thereby making it collectively exhaustive. If possible, avoid any overlap within these roles. If there is overlap, multiple people will get the same email. This can cause confusion and unnecessarily disturb an employee during their free time.
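One way to audit a set of roles against the MECE principle is to compare the objects each role can access with the list of services that must be covered. The sketch below assumes that the per-role object sets have already been worked out by hand or exported by some other means. The role names, the object labels, and the `mece_report` function are hypothetical.

```python
from itertools import combinations


def mece_report(role_objects, required):
    """Return (uncovered services, overlaps between pairs of roles)."""
    covered = set().union(*role_objects.values())
    uncovered = set(required) - covered          # gaps: not collectively exhaustive
    overlaps = {}
    for a, b in combinations(sorted(role_objects), 2):
        shared = role_objects[a] & role_objects[b]
        if shared:                               # duplicates: not mutually exclusive
            overlaps[(a, b)] = shared
    return uncovered, overlaps


# Invented example data.
roles = {
    "Web On-Call": {"web01:Apache", "web01:Disk"},
    "DB On-Call": {"db01:MySQL", "web01:Disk"},
}
required = {"web01:Apache", "web01:Disk", "db01:MySQL", "db01:Backup"}

uncovered, overlaps = mece_report(roles, required)
print("No role covers:", sorted(uncovered))
print("Covered by more than one role:", overlaps)
```

An empty `uncovered` set means the roles are collectively exhaustive, and an empty `overlaps` dictionary means they are mutually exclusive.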
Shared Notification Profiles are specific to a single role, so a role should be created for every business unit or technology area that will require a specialist to be on-call.
Notification Profiles
Notification Profiles are a collection of preferences indicating how a user wishes to receive alerts. Notification Profiles can be private or shared, and users can subscribe to and unsubscribe from shared profiles as necessary. By creating a set of Notification Profiles that are separated by business unit, shift, and alert type, a business can create a custom schedule to meet its unique needs.
Notification Method
The first step in creating a Notification Profile is to select the method by which alerts will be received. All available notification methods are listed with check boxes, and one or many can be selected.
Select Objects
Select which objects the profile should send alerts for. This is selected the same way as when a role is created. It is important to note that the user’s role will further filter these objects after the fact. It is therefore important to make sure that the user’s role includes all objects that the user would want to receive notifications for.
A suggested way to approach this is to select the checkboxes for all objects and allow the role to do all of the filtering rather than the profile. This allows for new hosts and services to be automatically picked up by the Notification Profile so long as the new objects are available to the role. This reduces the management cost associated with making changes to hosts, services, and Hashtags. As long as the role has been modified, the Notification Profile will pick up the change as well.
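The benefit of selecting all objects in the profile can be shown with a small model in which the objects a profile alerts on are the profile's selection limited by the role. This is a simplified sketch of the behaviour described in this guide. The function name, the host names, and the use of `None` to stand for "every box ticked" are assumptions of the example.

```python
def notified_objects(role_objects, profile_objects=None):
    """Objects a profile alerts on: the profile's selection, limited by the role.

    profile_objects=None stands for a profile with every object box ticked.
    """
    if profile_objects is None:
        return set(role_objects)
    return set(role_objects) & set(profile_objects)


role = {"web01", "web02"}
hand_picked_profile = {"web01", "web02"}

role.add("web03")  # a new host is made available to the role

print(sorted(notified_objects(role)))                       # select-all profile follows the role
print(sorted(notified_objects(role, hand_picked_profile)))  # hand-picked profile misses web03
```

With the select-all profile only the role has to be maintained. With the hand-picked profile, both the role and the profile must be edited for every new host.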
Select Status
A Notification Profile can filter alerts based on the status of the host or service. For instance, a profile can be created that only sends alerts for critical errors while another may be created for all alert statuses, including recovery and flapping. This is why the recommended strategy of this guide is to select all statuses at the service and host level. If any filtering is required, it can be done at this step without worrying about whether the alert was lost along the way.
In the case of an on-call schedule, it is encouraged to enable all options in this step as well. On-call teams are likely to be much smaller than peak hours teams. This limited team is more likely to require all of the notifications. If the on-call staff is more than one person, it is important to include the ‘Recovery’ status so that everyone on that shift is alerted when problems are resolved. This will further ensure that employees aren’t needlessly called into the office.
Select Time Period
This is where the on-call shift is defined. Any time period that was created in the previous steps can be selected. These are public to all users. A profile should be created for each specific shift, for each group of objects and each alert method. This is why this white paper recommends using the 24x7 time period at the host and service level: the time filtering can then be done when defining the on-call shifts in the Notification Profile.
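Because one profile is needed per combination of role, shift, and alert method, it can help to enumerate the combinations before creating them by hand. The sketch below does this with `itertools.product`. The role, shift, and method names are invented for the example, and the dictionaries are only a planning aid, not an Opsview Monitor data format.

```python
from itertools import product


def required_profiles(roles, shifts, methods):
    """List one planned profile per (role, shift, method) combination."""
    return [
        {"role": role, "time_period": shift, "method": method}
        for role, shift, method in product(roles, shifts, methods)
    ]


# Hypothetical names.
plan = required_profiles(
    roles=["Web On-Call", "DB On-Call"],
    shifts=["Weeknight", "Weekend"],
    methods=["Email"],
)

for profile in plan:
    print(profile)
print(len(plan), "profiles to create")
```

The count grows multiplicatively, which is one reason to keep the number of distinct shifts and methods small.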
Select Escalation Level
Opsview Monitor allows users to define which step they occupy in an escalation path. Service checks and host checks have a parameter called re-notification time. This is the span of time between the initial alert and each subsequent alert. Each time a notification goes unacknowledged for the re-notification time, the notification number increments, starting from 1. For the purposes of this tutorial, the recommendation is to clone each profile associated with a primary on-call shift and set ‘Send from Alert’ to 2 on the new clone. This provides two levels that follow the MECE principle before more people get involved. Recipients of notifications that are escalated past the first backup are more likely to oversee multiple teams, so cloning profiles would not be appropriate for those cases, and roles can and should expand.
It is important to note that the notification number only increments if a notification is triggered. If nobody is subscribed to the first alert and the notification never gets sent, Opsview Monitor will never increment to the second notification, no matter how much time has passed. Make sure that all primary shifts are properly assigned before creating or assigning backups.
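The escalation behaviour described above, including the case where a backup is never reached, can be illustrated with a small simulation. This is a simplified model built only from the description in this guide. The `Profile` class, the `simulate` function, and the assumption that a profile keeps receiving alerts once its ‘Send from Alert’ number is reached are all choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    send_from: int  # models the 'Send from Alert' setting


def simulate(profiles, intervals):
    """Return the recipients at each re-notification interval of an unacknowledged problem."""
    notification_number = 1
    history = []
    for _ in range(intervals):
        recipients = [p.name for p in profiles if p.send_from <= notification_number]
        history.append(recipients)
        if recipients:  # the counter only advances when something was actually sent
            notification_number += 1
    return history


primary = Profile("primary", send_from=1)
backup = Profile("backup", send_from=2)

print(simulate([primary, backup], 3))  # backup joins from the second notification
print(simulate([backup], 3))           # no primary assigned: backup is never reached
```

The second call shows why primary shifts should be assigned first: with nobody subscribed at level 1, the counter stays at 1 and the level 2 profile never fires.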
Shared Notification Profiles
Notification Profiles can be private or shared. Private Notification Profiles are created while editing a user. Shared Notification Profiles are created for a specific role and will only be visible to that role. Shared Notification Profiles minimize overall management of notifications and allow the on-call process to be configured centrally by Opsview Monitor administrators.
Subscribe to Appropriate Profiles
There are two different ways that a Shared Notification Profile can be applied to a user. The first is to edit the user under the “Users and Roles” menu and navigate to the ‘Notifications’ tab. This will bring up the menu shown below, where zero, one, or many profiles can be assigned to the user. This is the best method for Opsview Monitor administrators to assign profiles to others.
The second is for users to subscribe themselves to additional Notification Profiles, which is possible without having the ‘Configure Profiles’ option in their role. This method does not give the user the ability to make changes to a profile but allows them to subscribe to any shared profiles that are under the domain of their role. This is achieved through the user menu in the top right of Opsview Monitor. Rather than logging out, a user can select the ‘Access Profile’ menu and arrive on a screen like the one below where user information can be changed and Notification Profiles can be selected.
Conclusion
The reader of this guide should now be able to set up a schedule for maintenance outside of business hours. To summarize the process, time periods must first be created for all of the shifts that may be required. Roles need to be created for the first line of response and their backups. These roles should ideally follow the MECE principle (Mutually Exclusive, Collectively Exhaustive). This will ensure that every host and service will be able to send alerts when appropriate while limiting the audience to those who need to be alerted. Roles can be expanded as the escalation level increases for supervisors and management.
Shared Notification Profiles then need to be created that cover the time period, role, escalation level, and notification method desired. To make the management of this stage easier, it is a good idea to select all objects and all statuses. By selecting all Host Groups, all Service Groups, and all Hashtags, the user’s role can be edited to include new hosts, services, and Hashtags and the notifications will be automatically updated.
By leveraging this feature in Opsview Monitor, an Enterprise can have centralized control over processes that are put in place to maximize uptime in an efficient way. On-call schedules are a necessity in today’s IT climate, but it is important to allow employees to have an attractive work-life balance if the business wants to continue attracting and retaining top talent. Through the use of Shared Notification Profiles and the MECE principle, maximum uptime and a happy staff can be achieved.