Opsview Knowledge Center

Slave State Synchronization

Learn the importance of keeping Opsview's Slaves synchronized

A major feature of Opsview Monitor is the simple way in which you can set up distributed monitoring. This works by having separate Nagios(R) Core instances on slave systems.
Slaves can receive synchronized Host and service status from the master or from other cluster nodes. This applies to users that have a distributed environment.

When Synchronized Status Matters

Each Nagios Core instance keeps retention information that records the last known state, output, acknowledgements, comments and downtimes of each Host and service.
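
To make this concrete, the retention data for a single object can be pictured as a record like the following (a minimal Python sketch; the field names are illustrative and not Opsview's internal format):

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class RetentionRecord:
        # Last-known state for one Host or service, as kept in a Nagios Core
        # instance's retention data. Field names are illustrative only.
        host_name: str
        service_description: Optional[str]   # None for a Host record
        state: int                           # e.g. 0 = OK/UP, 2 = CRITICAL/DOWN
        plugin_output: str                   # output of the last check
        last_check: int                      # Unix timestamp of the last check
        acknowledged: bool = False
        comments: List[str] = field(default_factory=list)
        downtimes: List[Tuple[int, int, str]] = field(default_factory=list)  # (start, end, comment)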

Opsview Reload Time

If a Host is moved from being monitored by the master to being monitored by a slave, or from one slave to another, the new slave does not know the last status of that Host and its services. This means that if Notifications are configured to be sent from a slave, you could receive Notifications for states that the Opsview Monitor master already knows about.
Alternatively, if a Host is assigned to a slave system with multiple nodes, Opsview Monitor will decide which node actively monitors the Host and its services. This could be a different cluster node from the one that monitored it last time.

Cluster Node Takeover Time

When a cluster node takes over from another cluster node, the new node will not know the current state and could send out Notifications if it considers state changes to have occurred.

Cluster Node Recovery

When a cluster node recovers, it will start monitoring its own Hosts and services as usual. However, it may have older state information, as the state of those Hosts and services may have changed while the node was down.
The Opsview Monitor master's Slave-node: {name} check will notice that a slave node has recovered and will send the latest state information to that slave node for synchronization.
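
The flow could be sketched roughly as follows (all function names here are hypothetical stand-ins; the real push is handled internally by Opsview):

    # Hypothetical sketch: how the master might react to a slave-node recovery.
    def build_sync_payload(node: str) -> str:
        # Stub: in reality the payload is derived from the master's status.dat.
        return f"# latest state for {node}\n"

    def send_sync(node: str, payload: str) -> None:
        # Stub: the master would transfer the sync file to the slave here.
        print(f"sending {len(payload)} bytes of state to {node}")

    def on_slave_node_check(node: str, previous_state: str, current_state: str) -> None:
        # A DOWN -> UP transition means the node has just recovered, so the
        # latest state is pushed before it resumes monitoring with stale data.
        if previous_state == "DOWN" and current_state == "UP":
            send_sync(node, build_sync_payload(node))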

How Opsview Synchronizes States

At reload time, Opsview Monitor takes the current state information from the master Nagios Core instance (from the status.dat file) and constructs a sync.dat file for each slave system. This file is sent to each slave and loaded when Nagios Core reloads. Because the master knows about all states, acknowledgements and downtimes, the slave will also have the latest information before it starts doing its monitoring.
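
As a rough illustration of the reload-time step, the sketch below parses a simplified status.dat and groups the state blocks by the slave responsible for each Host. The file path, the host-to-slave mapping and the output structure are assumptions for illustration; the real sync.dat format is internal to Opsview Monitor:

    from collections import defaultdict

    def parse_status_dat(path):
        # Parse Nagios-style status.dat blocks, e.g.
        #   hoststatus {
        #       host_name=web1
        #       current_state=0
        #       }
        entries = []
        block_type, attrs = None, {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.endswith("{"):
                    block_type, attrs = line[:-1].strip(), {}
                elif line == "}" and block_type is not None:
                    entries.append((block_type, attrs))
                    block_type = None
                elif block_type is not None and "=" in line:
                    key, _, value = line.partition("=")
                    attrs[key] = value
        return entries

    def group_by_slave(entries, host_to_slave):
        # Collect the state blocks destined for each slave's sync.dat.
        per_slave = defaultdict(list)
        for block_type, attrs in entries:
            slave = host_to_slave.get(attrs.get("host_name"))
            if slave is not None:
                per_slave[slave].append((block_type, attrs))
        return per_slave

    # Example: web1 and web2 are monitored by the slave system "slaveA".
    entries = parse_status_dat("/usr/local/nagios/var/status.dat")
    per_slave = group_by_slave(entries, {"web1": "slaveA", "web2": "slaveA"})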

Additionally, every 15 minutes a cronjob running as the nagios user on each slave node creates a sync.dat.node.{nodename} file, which is sent to all other nodes in the slave cluster. When a takeover occurs, this state information is read into Nagios Core before the Hosts are set to be actively monitored.
In all cases, the slave will use the information in the synchronization file. However, if the slave's last check time for an object is newer than the data from the master, the slave will ignore that object's synced state. Downtimes and comments are added only if no similar downtime or comment already exists.
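
A sketch of that merge rule (illustrative only, not Opsview's actual code), using plain dictionaries for the local and incoming object state:

    def merge_sync_entry(local, incoming):
        # Accept synced state only if it is newer than the slave's own last check.
        if int(incoming.get("last_check", 0)) > int(local.get("last_check", 0)):
            for key in ("current_state", "plugin_output", "last_check",
                        "problem_has_been_acknowledged"):
                if key in incoming:
                    local[key] = incoming[key]

        # Downtimes and comments are added only when no similar entry exists.
        for downtime in incoming.get("downtimes", []):
            if downtime not in local.setdefault("downtimes", []):
                local["downtimes"].append(downtime)
        for comment in incoming.get("comments", []):
            if comment not in local.setdefault("comments", []):
                local["comments"].append(comment)
        return local

    # Example: the slave's own check (last_check=200) is newer, so the synced
    # state (last_check=100) is ignored, but the downtime is still added.
    local = {"current_state": "0", "last_check": "200", "downtimes": []}
    incoming = {"current_state": "2", "last_check": "100",
                "downtimes": [(1700000000, 1700003600, "maintenance")]}
    merge_sync_entry(local, incoming)
    assert local["current_state"] == "0"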

Limitations

It is critical that time is synchronized between the Opsview Monitor master and its slaves, as the state information is only processed if it is newer than the current information on the slave for each Host/service.
There are windows where the state information may not be completely up to date:

  • if a state change occurs, or an acknowledgement or downtime is set, between Opsview Monitor reload time and the slave starting to monitor, it will be missed
  • the state information from the failing slave could be up to 15 minutes stale at takeover time
  • there is a latency before the master notices that a slave has recovered; if the slave checks a Host or service before the master sends the latest state information, the slave will have more recent information and will ignore the sync request for that object
