Opsview Knowledge Center

Installing Opsview Monitor Slave Clusters

Learn best practices when installing Opsview Monitor Slave servers

When configuring a slave, you can set more than one host to be in the slave cluster. This provides failover and load balancing. Each host in this cluster is called a node.

Note: Within a slave cluster, there is support for a single node failure. A double node failure in the same cluster will result in active checks being missed.

Note: When a host is marked against a slave that is set up as a cluster, it could be monitored by any node in that cluster. The host could move to a different node after a reload. This is required for failover to work.

Note: Acknowledgements and scheduled downtime are synchronised at certain times. See this section for when the synchronisation occurs.

Note: As slave clusters split the hosts between all the nodes in a cluster, any single node may not know the current state of the parents of a given host. This means parent/child notification suppression may not apply, so you may receive more notifications than expected. This can be overcome by notifying only from the master, which knows the state of all hosts.

The failover process relies on the slaves in the cluster being able to ssh to each other to check the health of the nagios process. If a node fails, each of the other nodes will notice the failure independently and will then take over running the affected active checks itself.

Note: If you send passive results to a slave cluster (including SNMP traps), you will have to send to all nodes in the cluster. This is because you will not know which node in the cluster is the active one for the host. Passive results will be suppressed on the non-active cluster nodes so only a single result will reach the Opsview master. Note, there is currently a limitation where passive results could be missed when a node has failed.
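
For example, if you submit passive results with send_nsca, you would loop over every node in the cluster rather than targeting a single address. This is a minimal sketch only: the node addresses, host name, service name, and send_nsca paths are placeholders to adapt to your installation.

# Send the same passive result to each node; only the node currently responsible
# for the host forwards it on to the Opsview master, the others suppress it.
for node in {nodeA-address} {nodeB-address} {nodeC-address}; do
    printf 'myhost\tMy Passive Service\t0\tOK - result received\n' | \
        /usr/local/nagios/bin/send_nsca -H $node -c /usr/local/nagios/etc/send_nsca.cfg
done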

Note: During the recovery of a slave cluster node, it is possible that multiple results for the same host or service will be sent back to the Opsview master, because the original slave cluster node and the secondary slave cluster node could both be actively monitoring hosts and services until each knows that the other is working correctly.

Design

Each Opsview slave should have the following service checks associated with it:

  • Running on Opsview Master: Slave-node: {nodename}
  • Running on each slave based on other nodes in the cluster: Cluster-node: {nodename}

These services are associated via the Application - Opsview Common and the Application - Opsview Master host templates. The Application - Opsview Common host template should be used for every Opsview slave.

The Cluster-node check is used to check every other node in the cluster.

For instance, if you have a slave with 3 nodes, nodeA, nodeB, nodeC, then the following services would automatically be generated for each node:

  • nodeA

    • Cluster-node: nodeB
    • Cluster-node: nodeC
  • nodeB

    • Cluster-node: nodeA
    • Cluster-node: nodeC
  • nodeC

    • Cluster-node: nodeA
    • Cluster-node: nodeB

If any of these services fail, the failed node's active services will be taken over by the remaining nodes.

Setup

It is assumed you have done all the pre-install tasks to make this host into a slave.

As you have to have ssh communication between all nodes in a slave cluster, set up SSH public keys on all nodes in the cluster so that the nagios user can ssh to the remote node.

su - nagios # On one node
ssh-keygen -t rsa  # Creates your SSH public/private keys if they do not currently exist
ssh-copy-id -i .ssh/id_rsa.pub {othernode}  # Copies the public key to the other node
/usr/local/nagios/libexec/check_opsview_slave_cluster {othernodeaddress}  # Tests connectivity to the other node

If you get the message:

The authenticity of host '192.168.101.47 (192.168.101.47)' can't be established.
RSA key fingerprint is 91:14:59:47:e5:aa:d5:8d:d7:67:4f:84:ee:5a:e4:dc.
Are you sure you want to continue connecting (yes/no)? yes
CLUSTER CRITICAL - Error: Remote command execution failed: Warning: Permanently added '192.168.101.47' (RSA) to the list of known hosts.

Enter yes and try again. You should get an OK response when the SSH keys have been set up correctly.

Note: The {othernodeaddress} needs to be the host address defined for this host as this is how the service check will be configured.

You need to repeat this process on every node in the cluster, to each of the other nodes in the cluster.
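
For example, on a three-node cluster you could run the following as the nagios user on nodeA, and then run the equivalent on nodeB and nodeC, substituting the addresses of the other two nodes each time (the addresses shown are placeholders):

# On nodeA, as the nagios user: exchange keys with, and test, each of the other nodes
for node in {nodeB-address} {nodeC-address}; do
    ssh-copy-id -i .ssh/id_rsa.pub $node
    /usr/local/nagios/libexec/check_opsview_slave_cluster $node
done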

When Opsview creates the node information, automatic checks prefixed with 'Cluster-node:' will be created; these check the SSH connectivity between the nodes.

Logs

You can see the effects of the takeover in the log file: /usr/local/nagios/var/log/opsview-slave.log.

You can see the effects of the NMIS rsync in the log file: /usr/local/nagios/var/log/takeover_hosts.log.
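
For example, to watch a takeover as it happens you can follow both files at once:

tail -f /usr/local/nagios/var/log/opsview-slave.log /usr/local/nagios/var/log/takeover_hosts.log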

Node failure

Definition

All nodes in a cluster need to communicate with every other node in the cluster. Opsview uses the check_opsview_slave_cluster plugin to determine the state of the cluster nodes. A node failure occurs if, from the point of view of one node, any of the following is true:

  • it cannot communicate with another node via ssh
  • the other node does not have nagios running (as determined by the check_nagios plugin)
  • the other node has not updated the /usr/local/nagios/var/status.dat file within 60 minutes
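
As a manual spot check of the last criterion, you can test whether status.dat has gone stale on the suspect node (a quick sketch; the check_opsview_slave_cluster plugin performs this test for you):

# Prints the file name only if status.dat has not been modified in the last 60 minutes
find /usr/local/nagios/var/status.dat -mmin +60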

Although other scenarios might prevent a node from running its checks correctly (e.g. firewall rules or a router failure blocking access to monitored devices), an automatic node failover would effectively hide the problem. In these cases the failover can be forced manually while the root cause is fixed.

Forcing a Node Failure

You can force a node failure by stopping Nagios(R) Core on the node:

/etc/init.d/rc.opsview stop

Sequence of Events

When a node fails, this is the sequence of actions:

  • All nodes are working correctly
  • One of the nodes fails. A Slave-node service on the Opsview master server will recognise the failure. This will write a message like:
SLAVE CRITICAL - Error retrieving slave information - slave is likely to be down
  • The other nodes in the slave cluster will see the failure and raise an error like:
CLUSTER CRITICAL - Error: Remote command execution failed: ssh: connect to host ...
  • The other nodes will read the latest synchronisation data from the failed node and, for each host they are responsible for taking over, will:

    • Start actively checking services
    • Enable notifications
  • The NMIS configuration will be changed appropriately too

  • After 30 minutes, the services on the failed host will go into an UNKNOWN stale state - this is because no results about the node itself are being sent to the master

Note: The takeover nodes will mark the failed host as being in a down state, which will get pushed to the master. You may get duplicate notifications as all other nodes in that cluster will each raise an alert. The takeover script will also send a passive result about the failed node to the appropriate Cluster-node service to note that it has been taken over. The message will read: 'Take over by slave cluster node: {nodename}'. This will be changed to an UNKNOWN stale result if the slave node remains offline.

Note: While a node is down, a reload will fail. This is because the reload expects to be able to communicate with all slave nodes.

If the node is expected to be offline for a long period, you can remove the node from the slave list and then reload to take it out of the system. When the node is restarted, Nagios Core will continue to run checks but results will not be sent to the master. You may still get notifications from this server - remember to stop Nagios!

Note: It is possible during this failover period that results are either missed (due to time taken to notice a node failure), or run additional times within a frequency period (due to a node taking over before the usual regular check cycle). This is unavoidable.

Note: If you have a passive service with a small freshness threshold configured on the slave node, it is possible to receive a result such as 'No results received within 10 minutes' because the take over node has not synchronised with the latest data from the failed node. This will correct itself when the next result is received.

Node recovery

When a node recovers, this is the sequence of actions:

  • A node has recovered
  • The master will notice that the slave node is okay. An event handler will be invoked to push the latest host/service status information to the slave node
  • Other nodes in the cluster will notice that the node has recovered and will stop monitoring the hosts they took over, allowing the recovered node to take them back
  • It is possible that the SSH tunnel has failed. The master may show the message:
SLAVE CRITICAL - NSCA problem - status code: 2 - restarting tunnel
  • After a short while, the node check should return to an OK state

Note: Some results from the node may be lost when the tunnel has failed.


Note: If you use Network Analyzer, then you will need to do some manual steps to synchronise historical data, see troubleshooting section.
