TA5 - Scalability: Ensuring Your Monitoring Keeps Up With Your Infrastructure
Intro
When choosing an enterprise monitoring tool there are
many considerations, but one that is almost always near the top of the list
is scalability. Picking a tool that does everything you need it to do is
critical, but so is ensuring that it will not grind to a halt as you expand
your monitoring environment.
You can keep deploying more and more standalone monitoring
systems to ensure the system limits are not reached, but this quickly becomes
very hard to administer and can add a lot of extra cost in terms of both
licensing and hardware. Distributed monitoring offers a solution that keeps
your systems easy to administer while still allowing the system to scale
to meet your needs.
As well as allowing your monitoring to scale, distributed monitoring
also ensures that you can deliver consistent monitoring across multiple sites.
It is common to have monitoring responsibility for multiple locations, and
distributed monitoring avoids the problems of sending all your
monitoring traffic over the wide area links between sites.
What is distributed monitoring?
Distributed monitoring in Opsview Monitor means having a single “Master” system that collates all of the monitoring data. This Master system is then supported by a network of monitoring collectors known as “Slaves”.
Distributed monitoring has two key advantages. First, load is taken off your Master system, which you use for day-to-day administration and rely on for critical alerts and reports. Second, you can place your Slave systems in the locations that best suit your monitoring needs.
Single point of management
Managing enterprise-scale infrastructure can be challenging, and for monitoring, having a single point of management is a real advantage. Opsview Monitor’s distributed architecture provides all the advantages of running the checks at the right places within your infrastructure, while ensuring that all of these results are handed back to a central system for management. This means that all notifications can be administered from a single location, and all reports can be generated on the same system with all of the data they require. When distributed monitoring is used in conjunction with the Business Service Module, services that comprise a variety of systems spanning multiple physical locations are still handled correctly.
Distributing the monitoring system load
Each individual check puts some load on the monitoring system. Optimizing these checks will reduce this load, but there will always be a limit to how many checks a single system can handle. Distributing the load across multiple systems is a proven method of ensuring that those limits are not reached.
Most monitoring systems are designed to be as efficient as possible, and Opsview Monitor provides detailed guidance on the load considerations for deployments. However, when monitoring large-scale enterprise infrastructure, even with modern hardware, a single system will not be able to cope. Through calculation or measurement we can evaluate the load on a system and, before it becomes a problem, add an extra Slave system to take on future monitoring load.
The key factor when assessing performance is the total number of Service Checks Per Second, and we recommend a maximum figure of around 20-25 checks per second. Let us look at an example. Say we have 2000 hosts, with 10 checks per host, using a 5 minute interval:
2000 (hosts) * 10 (service checks) / 300 (seconds) = about 66 service checks per second
This is over our figure of 20-25 checks per second, so we would need to attach 3 Slaves to the Master to bring the rate down to 22 checks per second on each Slave.
Remembering that each core can handle a separate worker thread, we can also divide our figure of 66 by the number of cores our Slave servers will have. For example, if we have 3 dual-core CPUs in our Slave servers, this brings the number of service checks per second on each core to 11:
66 service checks per second / 6 (number of cores in 3 CPUs) = 11
For more information about calculating the load on your system see the Opsview Knowledge Center.
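The arithmetic in the example above can be reproduced with a short script. This is only a sketch: the function names are illustrative and not part of Opsview Monitor, and the limit of 25 checks per second is taken from the upper end of the recommended 20-25 range.

```python
import math

def checks_per_second(hosts, checks_per_host, interval_seconds):
    """Total service checks per second generated by a deployment."""
    return hosts * checks_per_host / interval_seconds

def slaves_needed(total_rate, max_rate_per_slave):
    """Smallest number of Slaves that keeps each one at or under the limit."""
    return math.ceil(total_rate / max_rate_per_slave)

# Figures from the worked example: 2000 hosts, 10 checks each, 5 minute interval.
total = checks_per_second(hosts=2000, checks_per_host=10, interval_seconds=300)

# Assumption: use 25 checks per second as the per-Slave ceiling.
slaves = slaves_needed(total, max_rate_per_slave=25)

rate_per_slave = total / slaves   # load on each Slave
rate_per_core = total / (3 * 2)   # 3 dual-core CPUs = 6 cores

print(f"total: {total:.1f} checks/s")
print(f"slaves needed: {slaves}")
print(f"per slave: {rate_per_slave:.1f} checks/s")
print(f"per core: {rate_per_core:.1f} checks/s")
```

Choosing the stricter limit of 20 checks per second instead would call for a fourth Slave, so it is worth deciding which end of the range you are sizing for.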
Distributing monitoring load on your infrastructure
It is an all too common story: a badly designed monitoring system, instead of providing valuable insight into the status of systems, puts an unnecessary load on the infrastructure it is designed to monitor and ends up causing more problems than it solves. The traffic that monitoring sends over the network can quickly add up, and as problems occur more data can be generated, so one issue can end up causing many. Distributed Slaves give you the flexibility to design your monitoring system so that it fits the needs of your infrastructure.
A common use of Slaves to manage traffic is optimizing the flow of traffic between datacenters. Networking within a datacenter is likely to have an abundance of spare bandwidth, whereas a tunnel from one datacenter to another is much more likely to be a bottleneck. Placing a Slave in each datacenter means that checks are only run over the local network and just the results are passed over the tunnels between datacenters, greatly reducing the bandwidth used for monitoring.
Handling network outages
A networking problem between datacenters can present a problem for monitoring systems, commonly resulting in a system being incorrectly reported as down rather than in detection of the real fault, which lies in the network that carries the monitoring traffic. A distributed architecture has the advantage of providing resilience to issues in the interconnectivity between sites. Should the network between the Opsview Monitor Master and an Opsview Monitor Slave go down, the Slave system will begin to buffer the data for collection by the Master system when normal networking returns.
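The buffering behaviour described above can be illustrated with a small sketch. It is not Opsview Monitor's implementation: the `ResultBuffer` class, the `deliver` callback and the result dictionaries are all hypothetical, and the sketch does not model whether the Master pulls results or the Slave pushes them. It only shows the general idea of holding results in order while the link is down and handing them over once it returns.

```python
from collections import deque

class ResultBuffer:
    """Holds check results until they can be delivered (hypothetical sketch)."""

    def __init__(self, deliver, max_results=10000):
        self._deliver = deliver
        # Bounded queue: if it fills up, the oldest results are dropped first.
        self._pending = deque(maxlen=max_results)

    def submit(self, result):
        """Queue a new result, then try to deliver everything pending."""
        self._pending.append(result)
        return self.flush()

    def flush(self):
        """Deliver pending results in order; stop at the first failure."""
        while self._pending:
            try:
                self._deliver(self._pending[0])
            except ConnectionError:
                return False  # link still down, keep results buffered
            self._pending.popleft()
        return True

    def __len__(self):
        return len(self._pending)


# Simulated link between Slave and Master.
link = {"up": False}
delivered = []

def deliver_to_master(result):
    if not link["up"]:
        raise ConnectionError("link to Master is down")
    delivered.append(result)

buffer = ResultBuffer(deliver_to_master)
buffer.submit({"host": "web01", "service": "HTTP", "state": "OK"})
buffer.submit({"host": "db01", "service": "Disk", "state": "WARNING"})
buffered_during_outage = len(buffer)  # results held while the link is down

link["up"] = True   # normal networking returns
buffer.flush()      # buffered results are handed over in their original order
```

Bounding the queue is a design choice made here to keep memory use predictable during a long outage; an unbounded queue would keep every result at the cost of unbounded growth.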
Configuration for security
Being able to choose by design where you locate your monitoring systems can also significantly reduce the effort required to work around infrastructure security. Firewalls within infrastructure are a common hurdle when deploying monitoring, because different checks use different ports and transports. With centralized monitoring, each of these needs to be taken into consideration and the firewall opened up accordingly. In some circumstances opening the firewall may not even be an option. With distributed monitoring you can carefully design where checks are run from: the Slave monitoring collector can be located behind the firewall and then pass results back to the Master.
In Opsview Monitor, communication between the Master and the Slave is secured using a Secure Shell (SSH) tunnel. The connection can be configured to be initiated from the Master to the Slave or from the Slave to the Master, making it easy to manage your network security.
Clustering Slaves
It is common to implement redundancy in any monitoring system. Opsview Monitor has extensive support for both high availability and disaster recovery for the Master system, so why not apply this approach to the distributed Slave systems as well?
Opsview Monitor Slave monitoring systems fully support clustering, allowing you to deploy two or more Slave systems. Should one go down for whatever reason, the other will automatically take over.
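As a rough illustration of the takeover idea, the sketch below spreads hosts across the healthy nodes of a Slave cluster and reassigns them when a node is lost. The function name, node names and round-robin policy are assumptions made for the example; how Opsview Monitor actually redistributes hosts within a cluster is not described here.

```python
def assign_hosts(hosts, nodes, healthy):
    """Spread hosts round-robin across the healthy nodes of a cluster.

    Hypothetical helper: `nodes` is the ordered list of cluster members and
    `healthy` is the set of members currently responding.
    """
    active = [node for node in nodes if node in healthy]
    if not active:
        raise RuntimeError("no healthy Slave nodes left in the cluster")
    return {host: active[i % len(active)] for i, host in enumerate(hosts)}


hosts = ["web01", "web02", "db01", "db02"]
nodes = ["slave-a", "slave-b"]

# Both nodes healthy: the hosts are shared between them.
normal = assign_hosts(hosts, nodes, healthy={"slave-a", "slave-b"})

# slave-a goes down: slave-b takes over every host.
failover = assign_hosts(hosts, nodes, healthy={"slave-b"})
```

When sizing a cluster, remember that the surviving node has to carry the whole cluster's check rate during a failure, so the checks-per-second limit should be judged against that worst case.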
Conclusion
Opsview Monitor’s distributed architecture is a flexible way to ensure that your monitoring system keeps up with your infrastructure. So take the headache out of monitoring security and deployment across sites and try Opsview Monitor Slaves. You can find information on how to set up an Opsview Monitor Slave in the Opsview Knowledge Center.