Opsview Knowledge Center

Overview

An overview of BSM (Business Service Monitoring)

The purpose of BSM is to give real-world views along the lines of 'we can allow one host to fail, and our service is still operational'. It achieves a 'resilient hierarchy' in which Hosts are grouped together into Components (e.g. h-scaled clusters), and Components are then grouped together to form the overall Business Service (e.g. www.website.com, which uses the database cluster, apache cluster, etc.).

Using BSM, you can group multiple Hosts of the same type together in Components. If you have configured the resiliency level correctly, one Host can fail while the Component remains 'operational'.

A Business Service can then comprise multiple Components, each with its own resiliency level - giving a true end-to-end view of the Business Service, as shown below.

You can then undertake operations at a Component and a BSM level, such as:

  • 'Who/which team is responsible for this?' - Add notes to the BSM/Component.
  • Schedule downtime against an entire Component.
  • Acknowledge all issues in a Component.
  • View events related to this BSM/Component.
  • Send alerts at a Component/BSM level - e.g. 'I only want to know when a Component is critical, not when a Host in it has failed, as we have resiliency in place'.
  • Be alerted that a BSM's availability has gone below a certain level, e.g. below your agreed SLA/OLA.
  • Run historical reports against BSMs and Components, automated or manual, emailed in your company's branding to yourself or your customer at a pre-determined time/date.

What does this look like?

In an Opsview Monitor system, you can have a website called 'Website.com' with six Components, including an 'Apache Servers' cluster, 'Linux cluster', etc. (as shown below in an image taken from a live system).

With BSM, you can now monitor and display your entire stack in a single view, so you can see 'one Host has failed in the Linux cluster; it hasn't affected my website yet but I will need to fix that soon'.

Business Service Monitoring is a terrific tool that will take existing Hosts, Services and Host templates and allow the creation of a hierarchy of Components and Business Services - showing the relationship between Hosts and the Business Services they support, availability (SLA/OLA) at each layer, reporting, notifications, access control and more.

Logic

Component Hosts

Note that for the purposes of BSM, the Host only consists of the Service Checks related to the Host template used by the Component - the Host state is not taken into account.

The Host can be one of three calculated states:

  • OPERATIONAL: if no Service Checks are in a CRITICAL state
  • FAILED: at least one Service Check is in a CRITICAL state. Service Checks in downtime are ignored
  • DOWNTIME: all CRITICAL Service Checks are in a DOWNTIME state

Additionally, there is one calculated flag:

  • ACKNOWLEDGED: if all the CRITICAL service checks are acknowledged, then the Component Host is acknowledged

Note: The soft or hard state of the Service Check is not considered - the latest state is always used.
Note: If you set DOWNTIME and there are no failed services, then an operational state is used. This is to cover scenarios where downtime of two hours is recorded, but only 15 minutes is used. This allows the Host to be marked as DOWNTIME only during the time there were actual failures.
Note: It is possible that for a Host, the Service Checks are UNKNOWN yet the Host is DOWN. From a BSM perspective, the Host is considered to be OPERATIONAL because there are no CRITICAL Service Checks. This would be an error in configuration as the Service Check should be CRITICAL to show a severe error.
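
The Component Host rules above can be sketched in Python as follows. This is a minimal illustration, not Opsview's implementation: the `ServiceCheck` class and the `component_host_state` function are hypothetical names, and the sketch assumes the ACKNOWLEDGED flag is evaluated over every CRITICAL Service Check, whether or not it is in downtime.

```python
from dataclasses import dataclass


@dataclass
class ServiceCheck:
    """Hypothetical stand-in for one Service Check on a Component Host."""
    state: str                  # latest state: "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    in_downtime: bool = False
    acknowledged: bool = False


def component_host_state(checks):
    """Return (state, acknowledged) for one Component Host."""
    critical = [c for c in checks if c.state == "CRITICAL"]
    if not critical:
        # Also covers downtime with no failures, and UNKNOWN checks on a DOWN Host.
        return "OPERATIONAL", False
    acknowledged = all(c.acknowledged for c in critical)
    if all(c.in_downtime for c in critical):
        return "DOWNTIME", acknowledged
    # At least one CRITICAL check is outside downtime.
    return "FAILED", acknowledged


print(component_host_state([ServiceCheck("OK"), ServiceCheck("UNKNOWN")]))
# ('OPERATIONAL', False)
print(component_host_state([ServiceCheck("CRITICAL", in_downtime=True)]))
# ('DOWNTIME', False)
print(component_host_state([ServiceCheck("CRITICAL", acknowledged=True)]))
# ('FAILED', True)
```

Only the latest state of each Service Check is passed in, which mirrors the note that soft and hard states are not distinguished.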

Components

The Component state is calculated from the Component Host states and can be one of three calculated states:

  • OPERATIONAL: if no Hosts are failed or there are enough Hosts to satisfy the operational level, then the Component is operational
  • DOWNTIME: means all Hosts are in a DOWNTIME state
  • FAILED: otherwise Component is failed

Additionally, there are two calculated flags:

  • ACKNOWLEDGED: if all the failed Hosts are acknowledged, then the Component is acknowledged
  • IMPACTED: if the state is operational and there is at least one host failed, then the Component is impacted

The operational zone percent is calculated as hosts_required_online / hosts_total * 100. If there are not enough operational Hosts, then the Component is failed. Hosts in DOWNTIME are not counted, but have the effect of making failed Hosts more important.
Note: Due to the operational zone percent, it is possible for a Component to be in an operational state with failed Hosts. If those failed Hosts are acknowledged, then the Component will also be acknowledged, so you could have an acknowledged icon on an operational Component.
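
As a rough sketch of the Component calculation, the Python below applies the operational zone percent to a list of Component Host states. The function names are hypothetical, and two points are assumptions rather than documented behaviour: the DOWNTIME rule is checked before the others, and "Hosts in DOWNTIME are not counted" is read as leaving those Hosts out of the availability ratio, so that each failed Host weighs more.

```python
def operational_zone_percent(hosts_required_online, hosts_total):
    return hosts_required_online / hosts_total * 100


def component_state(hosts, hosts_required_online):
    """hosts: non-empty list of (state, acknowledged) pairs, one per Component Host."""
    if not hosts:
        raise ValueError("a Component needs at least one Host")
    states = [state for state, _ in hosts]
    failed_acks = [ack for state, ack in hosts if state == "FAILED"]
    flags = set()
    if failed_acks and all(failed_acks):
        flags.add("ACKNOWLEDGED")
    if all(state == "DOWNTIME" for state in states):
        return "DOWNTIME", flags
    # Assumption: Hosts in DOWNTIME are left out of the availability ratio.
    counted = [state for state in states if state != "DOWNTIME"]
    available = counted.count("OPERATIONAL") / len(counted) * 100
    zone = operational_zone_percent(hosts_required_online, len(hosts))
    if not failed_acks or available >= zone:
        if failed_acks:
            flags.add("IMPACTED")  # operational, but with at least one failed Host
        return "OPERATIONAL", flags
    return "FAILED", flags


ok, failed = ("OPERATIONAL", False), ("FAILED", False)
print(operational_zone_percent(3, 4))                 # 75.0
print(component_state([ok, ok, ok, failed], 3)[0])    # OPERATIONAL
print(component_state([ok, ok, failed, failed], 3)[0])  # FAILED
```

With four Hosts and three required online, one failure leaves 75% available and the Component stays operational but impacted. If a second Host is in DOWNTIME at the same time, only two of the three counted Hosts are operational, which falls below 75% under the assumption above.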

Business Services

A Business Service can be in one of three calculated states:

  • OFFLINE: means at least ONE Component has failed
  • DOWNTIME: means at least ONE Component is in a DOWNTIME state
  • OPERATIONAL: otherwise, it means everything is working fine

Additionally, there are two calculated flags:

  • ACKNOWLEDGED: if all the impacted and failed Components are acknowledged, then the Business Service is acknowledged
  • IMPACTED: the Business Service is impacted if any Component is impacted. This means it is possible to have a Business Service in an OFFLINE state and be impacted.
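
The Business Service rules can be sketched in the same way. This is an illustrative Python function with a hypothetical name, not Opsview's implementation. It assumes the states are evaluated in the order listed above, so a failed Component takes precedence over one in DOWNTIME.

```python
def business_service_state(components):
    """components: list of (state, flags) pairs, where flags is a set of flag names."""
    states = [state for state, _ in components]
    if "FAILED" in states:
        state = "OFFLINE"
    elif "DOWNTIME" in states:
        state = "DOWNTIME"
    else:
        state = "OPERATIONAL"

    flags = set()
    if any("IMPACTED" in f for _, f in components):
        flags.add("IMPACTED")
    # Components that need attention: failed ones and impacted ones.
    troubled = [f for s, f in components if s == "FAILED" or "IMPACTED" in f]
    if troubled and all("ACKNOWLEDGED" in f for f in troubled):
        flags.add("ACKNOWLEDGED")
    return state, flags


print(business_service_state([("FAILED", set()), ("OPERATIONAL", {"IMPACTED"})]))
# ('OFFLINE', {'IMPACTED'})
```

The example shows the case described above, where a Business Service is OFFLINE and impacted at the same time.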
