The “Fog of War” between business and IT services

The field of monitoring has become very good at collecting status information and performance metrics from endpoints of every kind. Executive stakeholders, however, are far more interested in the performance of people and processes than in CPU, memory, and throughput. It is no longer enough to test the health of endpoints without providing insight into the true impact and cost an outage could cause. There are services that customers and employees need to be able to consume, and the health of those services is what ultimately needs to be measured. In today’s world, however, modeling a business service is not a simple hierarchy. Concepts like resiliency, flexibility, and reusability must be applied to translate host health into business continuity.
So how do we cut through the fog and determine if the lights are still on?
How has the market responded?

The market’s response to evaluating overall service health has been to invest in APM techniques such as synthetic transaction profiling and end-user experience monitoring. But is this truly an approach for evaluating application uptime?
The truth is that this approach forgoes the task of decoding the relationships between the consumable service and the components that make it work. It instead skips to the end and identifies more symptoms. This adds to the confusion, and before any root causes or trends can be evaluated effectively, complex and expensive event correlation and analysis needs to be added for good measure. Yet the idea has always been to detect problems before users do.
So why are we attempting to save minutes and hours instead of days? The mindset must move away from detecting as many symptoms as possible so that correlations can be made over time. Instead, the market needs a monitoring tool that understands the anatomy of the application it seeks to protect. That way, underlying conditions can be associated with risk, rather than symptoms being correlated with causes after the fact.
The individual metrics that have been collected need to be grouped together by function. All machines and services working toward a common goal should be defined as a single, reusable component such as a cluster or stack. At this stage, SLA requirements can be set at the component level; this is how resiliency and priority are assigned appropriately. The application is now defined as a collection of flexible, reusable components, each of which understands the concepts of fault tolerance and priority. An individual host outage may or may not impact the larger service, and may not actually cause an outage at all. If the availability SLA of a computer cluster is violated, however, a far more expensive outage is sure to follow. This is a much better way of determining whether a business service is at risk or in distress, so that problems can be addressed proactively.
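The component model described above can be sketched in a few lines. This is a minimal illustration, not any vendor’s actual API: the class names, SLA percentages, and host counts are all hypothetical, chosen only to show how a host outage is evaluated against a component-level availability requirement rather than reported as a service outage on its own.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A reusable group of hosts working toward a common goal (e.g. a cluster)."""
    name: str
    sla_uptime_pct: float  # required availability for this component, e.g. 99.9
    hosts_up: int
    hosts_total: int

    def availability(self) -> float:
        return 100.0 * self.hosts_up / self.hosts_total

    def sla_violated(self) -> bool:
        return self.availability() < self.sla_uptime_pct


@dataclass
class BusinessService:
    """A business service defined as a collection of reusable components."""
    name: str
    components: list

    def at_risk(self) -> list:
        """Names of components whose availability SLA is currently violated."""
        return [c.name for c in self.components if c.sla_violated()]


# Hypothetical example: one web host of four is down, but the cluster's
# 75% requirement still holds, so fault tolerance absorbs the outage.
web = Component("web-cluster", sla_uptime_pct=75.0, hosts_up=3, hosts_total=4)
# The database cluster requires full availability and has lost a node,
# so the component-level SLA is violated and the service is at risk.
db = Component("db-cluster", sla_uptime_pct=100.0, hosts_up=1, hosts_total=2)

store = BusinessService("online-store", [web, db])
print(store.at_risk())  # → ['db-cluster']
```

Note how the single failed web host never surfaces at the service level: risk is raised only when a component can no longer meet the resiliency requirement assigned to it.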
So is the other approach useless? Absolutely not. It is simply addressing a completely different need. End-user experience monitoring and transaction profiling are great techniques for performance benchmarking and baselining. These metrics can be used to compare a company to its competition and to ensure that each new release performs better than the previous version. As they are collected, these metrics are also perfectly suited to be grouped together and defined as a component inside a business service, preferably one with a 0% uptime requirement, so that the symptom is tied to the business service but is never falsely considered a root cause.

Conclusion

A comprehensive monitoring strategy and implementation is no longer a “Nice to Have” in an enterprise; it is mandatory. With the new ability to model applications and business services inside the monitoring tool, this business continuity measure becomes a key artifact contributing to the completeness of an Enterprise Architecture.
This allows the business to remain focused on continuing operations by detecting risks instead of outages, so that problems never reach the point where an end user would be impacted. That can only be done with a tool that has a comprehensive understanding of the way an application, website, or service functions at all levels.