The purpose of BSM is to give real-world views in terms of “we can allow one host to fail, and our ‘service’ is still operational”.
It does this through a “resilient hierarchy”, where Hosts are grouped together into Components (e.g. h-scaled clusters), and Components are then grouped together to form the overall Business Service (e.g. www.website.com,
which uses the database cluster, apache cluster, etc).
Using BSM, you can have multiple Hosts of
the same type grouped together in Components, and if you have configured the
resiliency level correctly, you can allow one Host to fail but the Component to
still be “operational”.
A Business Service can then be composed of
multiple Components, each with its own resiliency level, giving a true
end-to-end view of the Business Service, as shown below:
You can then undertake
operations at a Component and a BSM level, such as:
“Who/which team is responsible
for this?” – Add notes to the BSM/Component.
“Schedule downtime against
“Acknowledge all issues in a Component”.
“View events related to this
Send alerts at a Component/BSM level, e.g. “I only want to know when a Component is critical, not when a Host in it has failed, as we have resiliency in place”.
Be alerted that a BSM’s
availability has gone below a certain level, i.e. it is below your agreed
Run historical reports against
BSMs and Components, automated or manual, and emailed in your company's brand
to yourself / customer at a pre-determined time/date.
What does this look like?
In an Opsview Monitor system, you can have
a website called “Website.com” – with six Components – including an ‘Apache Servers’
cluster, ‘Linux cluster’, etc. (As shown below in an image taken from a live system). With BSM, you can now monitor and display your
entire stack in a single view, so you can see “one Host has failed in the Linux
cluster; it hasn’t affected my website yet but I will need to fix that soon”.
Business Service monitoring is a terrific
tool that will take existing Hosts, Services and Host templates and allow the
creation of a hierarchy of Components and Business Services – showing the
relationship between Hosts and the Business Services they support, availability
(SLA/OLA) at each layer, reporting, notifications, access control and more.
Note that for the purposes of BSM, the Host only consists of the Service Checks related to the Host template used by the Component - the Host
state is not taken into account.
The Host can be one of three calculated states:
OPERATIONAL: if no Service Checks are in a CRITICAL state
FAILED: at least one Service Check is in a CRITICAL state. Service Checks in downtime are ignored
DOWNTIME: all CRITICAL Service Checks are in a DOWNTIME state
Additionally, there is one calculated flag:
ACKNOWLEDGED: if all the CRITICAL service checks are acknowledged, then the Component Host is acknowledged
Note: The soft or hard state of the Service Check is not considered - the latest state is always used.
Note: If you set DOWNTIME and there are no failed services, then an
operational state is used. This is to cover scenarios where downtime of two hours is recorded, but only 15 minutes is used. This allows the Host to
be marked as DOWNTIME only during the time there were actual failures.
Note: It is possible that for a Host, the Service Checks are UNKNOWN yet
the Host is DOWN. From a BSM perspective, the Host is considered to be
OPERATIONAL because there are no CRITICAL Service Checks. This would be
an error in configuration as the Service Check should be CRITICAL to
show a severe error.
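
The Host rules above can be sketched in a few lines of Python. This is an illustration only, not Opsview's implementation: the ServiceCheck class, its field names and the component_host_state function are hypothetical. It also assumes the ACKNOWLEDGED flag is evaluated over every CRITICAL Service Check, including those in downtime, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class ServiceCheck:
    """Hypothetical stand-in for one Service Check on a Component Host."""
    state: str                  # latest state: "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    in_downtime: bool = False
    acknowledged: bool = False


def component_host_state(checks):
    """Return (state, acknowledged) for a Component Host.

    Only the latest Service Check states are used; the Host's own
    UP/DOWN state is deliberately not an input.
    """
    critical = [c for c in checks if c.state == "CRITICAL"]

    # No CRITICAL checks: OPERATIONAL, even if downtime is scheduled
    # or some checks are UNKNOWN.
    if not critical:
        return "OPERATIONAL", False

    # Every CRITICAL check is covered by downtime: DOWNTIME.
    # Otherwise at least one CRITICAL check is outside downtime: FAILED.
    if all(c.in_downtime for c in critical):
        state = "DOWNTIME"
    else:
        state = "FAILED"

    # Assumption: the flag looks at all CRITICAL checks.
    acknowledged = all(c.acknowledged for c in critical)
    return state, acknowledged
```

For example, a Host whose only CRITICAL Service Check is in downtime evaluates to DOWNTIME, while a Host with downtime scheduled but no CRITICAL Service Checks stays OPERATIONAL, matching the note about unused downtime.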
The Component state is calculated from the Component Host states and can be one of three calculated states:
OPERATIONAL: if no Hosts are failed, or there are enough Hosts to satisfy the operational level, then the Component is operational
DOWNTIME: means all Hosts are in a DOWNTIME state
FAILED: otherwise Component is failed
Additionally, there are two calculated flags:
ACKNOWLEDGED: if all the failed Hosts are acknowledged, then the Component is acknowledged
IMPACTED: if the state is operational and there is at least one host failed, then the Component is impacted
The operational zone percent is calculated as hosts_required_online /
hosts_total * 100. If there are not enough operational Hosts, then the Component is failed. Hosts in DOWNTIME are not counted, but have the
effect of making failed Hosts more important.
Note: Due to the operational zone percent, it is possible that a Component is in an operational state with failed Hosts. If those failed Hosts are acknowledged, then the Component will also be acknowledged, so
you could have an acknowledged icon on an operational Component.
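
A minimal Python sketch of the Component calculation follows. It is not Opsview's code, and it rests on two labeled assumptions. First, "Hosts in DOWNTIME are not counted" is read as removing those Hosts from both the operational count and the total being compared against the operational zone percent. Second, the DOWNTIME rule is checked before the others. The function name and the (state, acknowledged) tuple layout are hypothetical.

```python
def component_state(hosts, hosts_required_online):
    """Return (state, acknowledged, impacted) for a Component.

    hosts: list of (host_state, host_acknowledged) pairs, where host_state
           is "OPERATIONAL", "FAILED" or "DOWNTIME".
    hosts_required_online: configured number of Hosts that must be online.
    """
    hosts_total = len(hosts)
    if hosts_total == 0:
        raise ValueError("a Component needs at least one Host")

    # Formula from the text: hosts_required_online / hosts_total * 100
    operational_zone_percent = hosts_required_online / hosts_total * 100

    failed = [h for h in hosts if h[0] == "FAILED"]
    # Assumption: Hosts in DOWNTIME drop out of the comparison entirely.
    counted = [h for h in hosts if h[0] != "DOWNTIME"]

    if not counted:
        # All Hosts are in DOWNTIME.
        return "DOWNTIME", False, False

    operational_percent = (len(counted) - len(failed)) / len(counted) * 100
    if not failed or operational_percent >= operational_zone_percent:
        state = "OPERATIONAL"
    else:
        state = "FAILED"

    # ACKNOWLEDGED: every failed Host is acknowledged (and there is at least one).
    acknowledged = bool(failed) and all(ack for _, ack in failed)
    # IMPACTED: operational overall, but with at least one failed Host.
    impacted = state == "OPERATIONAL" and bool(failed)
    return state, acknowledged, impacted
```

Under these assumptions, a four-Host Component requiring three Hosts online (75%) stays OPERATIONAL and IMPACTED with one failed Host. If one of the remaining Hosts is then placed in DOWNTIME, only two of three counted Hosts are operational (about 67%), so the Component becomes FAILED. This is one way to read "making failed Hosts more important".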
The Business Service can be one of three calculated states:
OFFLINE: means at least ONE Component has failed
DOWNTIME: means at least ONE Component is in a DOWNTIME state
OPERATIONAL: otherwise, it means everything is working fine
Additionally, there are two calculated flags:
ACKNOWLEDGED: if all the impacted and failed Components are acknowledged, then the Business Service is acknowledged
IMPACTED: the Business Service is impacted if any Component is impacted. This means it is possible to have a Business Service in an OFFLINE state and be impacted.
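
The Business Service rules can be sketched the same way. This is a hedged illustration rather than Opsview's implementation: the function name and tuple layout are hypothetical, and it assumes the states are evaluated in the order listed, so OFFLINE takes precedence over DOWNTIME when both apply.

```python
def business_service_state(components):
    """Return (state, acknowledged, impacted) for a Business Service.

    components: list of (component_state, acknowledged, impacted) tuples,
                where component_state is "OPERATIONAL", "FAILED" or "DOWNTIME".
    """
    states = [c[0] for c in components]

    # Assumption: rules are applied in the listed order.
    if "FAILED" in states:
        state = "OFFLINE"
    elif "DOWNTIME" in states:
        state = "DOWNTIME"
    else:
        state = "OPERATIONAL"

    # ACKNOWLEDGED: every failed or impacted Component is acknowledged
    # (and there is at least one such Component).
    needs_ack = [c for c in components if c[0] == "FAILED" or c[2]]
    acknowledged = bool(needs_ack) and all(c[1] for c in needs_ack)

    # IMPACTED: any Component is impacted, regardless of the overall state.
    impacted = any(c[2] for c in components)
    return state, acknowledged, impacted
```

With one failed Component and one operational but impacted Component, the result is OFFLINE with the IMPACTED flag set, which is the combination described above.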