Opsview Knowledge Center

Distributed Monitoring

Generic architecture and principles

Overview

Distributed Monitoring is a feature of Opsview Monitor that allows checks and notifications to be executed from remote servers, letting you scale your Opsview Monitor system by spreading the load and reducing latency. This is useful when:

  • You have a large number of Hosts
  • You have several datacenters spanning different geographic locations
  • You have networks with secured zones or firewall rules that segregate Hosts

Opsview Monitor uses Collectors to handle the execution and collection of results. You can group multiple Collectors together to form a Cluster, which gives additional failover and load balancing capabilities.

Each Host in Opsview is assigned to a Cluster. The Host will be actively checked by any Collector in that Cluster.

Note: The first Cluster containing a Collector registered into a new Opsview Monitor system will assume the role of 'Master Cluster'. When using the Advanced Automated Installation, the 'Master Cluster' forms part of the 'Opsview Monitor Primary Server'.

To set up additional Collectors, you need to:

  • install the software
  • register it in Opsview Monitor
  • assign it to a Cluster.

These steps are detailed below in Managing Collector Servers.

Opsview Monitor Primary Server

The host where Opsview is first installed is called the Opsview Monitor Primary Server. This host has all the necessary software packages installed so that it can function as a single Opsview system, but you can separate out the functional components onto other hosts to spread the load and increase redundancy.

The Opsview Monitor Primary Server will have the following Host Templates assigned to it:

  • Application - Opsview
  • Opsview - Component - Agent
  • Opsview - Component - Autodiscovery Manager
  • Opsview - Component - BSM
  • Opsview - Component - Datastore
  • Opsview - Component - Downtime Manager
  • Opsview - Component - Executor
  • Opsview - Component - Flow Collector
  • Opsview - Component - Freshness Checker
  • Opsview - Component - License Manager
  • Opsview - Component - Load Balancer
  • Opsview - Component - Machine Stats
  • Opsview - Component - MessageQueue
  • Opsview - Component - Notification Center
  • Opsview - Component - Orchestrator
  • Opsview - Component - Registry
  • Opsview - Component - Results Dispatcher
  • Opsview - Component - Results Flow
  • Opsview - Component - Results Forwarder
  • Opsview - Component - Results Live
  • Opsview - Component - Results Performance
  • Opsview - Component - Results Recent
  • Opsview - Component - Results Sender
  • Opsview - Component - Results SNMP
  • Opsview - Component - Scheduler
  • Opsview - Component - SNMP Traps Collector
  • Opsview - Component - SNMP Traps
  • Opsview - Component - SSH Tunnels
  • Opsview - Component - State Changes
  • Opsview - Component - TimeSeries Enqueuer
  • Opsview - Component - TimeSeries
  • Opsview - Component - TimeSeries RRD
  • Opsview - Component - TimeSeries InfluxDB
  • Opsview - Component - Watchdog
  • Opsview - Component - Web
  • OS - Unix Base
  • Network - Base

This host will be automatically assigned to the Master Cluster, and would therefore normally monitor itself.

To add, remove, or register Clusters and Collectors, see Managing Collector Servers.

Troubleshooting

The most common problem is misconfiguration of the Components that require access to the Master MessageQueue Server, namely the Scheduler and Results-Sender. Check /var/log/opsview/opsview.log for detailed errors.
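
For example, a quick way to look for MessageQueue connection problems is to search that log on the affected server; the exact error strings vary between versions, so the patterns below are only illustrative:

# Search for recent MessageQueue-related errors (patterns are illustrative only)
grep -iE 'messagequeue|amqp|connection refused' /var/log/opsview/opsview.log | tail -n 20

# Follow the log live while reproducing the problem
tail -f /var/log/opsview/opsview.log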

Architecture

The Opsview Scheduler is the main component of a Collector: it receives commands and configuration from the Orchestrator and schedules the execution of monitoring plugins, event handlers and notification scripts.

The execution of plugins is performed by the Opsview Executor, whose only job is to execute commands requested by a Scheduler. Results are sent back to the Opsview Scheduler that requested those commands.

This approach allows multiple Opsview Executors to be shared among all Collectors of a given Cluster: point all Components at the same Cluster MessageQueue and automatic load balancing becomes available.

The Opsview Scheduler sends results to the Opsview Results-Sender, which forwards them to the Results Processors. In case of a network outage, the Results-Sender will hold on to the results for a configurable amount of time.
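
As a quick sanity check that these components are running on a Collector, you can look for their processes; the process names below are assumptions based on a default installation and may differ on your system:

# Check for Collector component processes (names are assumed - adjust as needed)
ps -eo comm,args | grep -E 'opsview-(scheduler|executor|results-sender)' | grep -v grep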

Scalability

For high availability, we recommend a single monitoring Cluster per monitored location (e.g. a datacenter), with as many Collector nodes as required. All Collectors should point to a single Cluster MessageQueue Server. For more information and assistance, contact our Customer Success Team.
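
A simple way to confirm that every Collector in a Cluster points at the same Cluster MessageQueue is to inspect the Scheduler configuration on each node; the configuration path below assumes a default installation and may differ on your system:

# On each Collector - check which MessageQueue the Scheduler connects to
# (the configuration path is an assumption for a default install)
grep -iA3 'messagequeue' /opt/opsview/scheduler/etc/scheduler.yaml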

Security

To secure communication over the network, please refer to the Securing Network Connection documentation.

Failure Scenarios

Opsview 6 can handle n-1 Collector failures within a Monitoring Cluster, and since there is no upper limit on the number of Collectors in a Cluster, we recommend at least three nodes per Cluster.
If a Collector fails, the Orchestrator will detect this within 60 seconds and automatically re-assign the hosts monitored by the failed Collector to the remaining Collectors in the Cluster. The re-assignment uses the currently known state of the objects and the configuration from the last time you performed an Apply Change from the Configuration menu. Re-assigned hosts and their services are re-checked immediately.

When the Collector recovers, the Orchestrator will automatically re-assign the hosts back to it.

Limitations

Currently, to synchronize monitoring scripts between the Master Server and Collector Clusters, we recommend using Configuration Management tools such as Chef, Puppet or Ansible.

If necessary, you can manually sync plugins from the Orchestrator to all Collectors by following these steps (assuming all steps are run as the root user):

1 Create an SSH key for the root user on the Master Server

# Master Server - generate an SSH key pair for root (accept the defaults when prompted)
ssh-keygen

2 Copy the SSH key from the Master Server to all Collectors

# for each Collector - this assumes that each Collector allows root login
ssh-copy-id $IP_OF_COLLECTOR

Alternatively, you can add the public key in place on each Collector server

# Master Server - display the public key contents
cat /root/.ssh/id_rsa.pub
.....
# On each Collector, create the key directory and authorized_keys file with appropriate permissions
install -o root -g root -m 0700 -d /root/.ssh
test ! -e /root/.ssh/authorized_keys && install -o root -g root -m 0600 -b /dev/null /root/.ssh/authorized_keys
# Run this command and paste the content of id_rsa.pub seen in the previous step and then hit CTRL-D to finish the process
cat - | tee -a /root/.ssh/authorized_keys
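
Whichever method you use, it is worth confirming that key-based login works before running the synchronization script; a quick check from the Master Server (using the same $IP_OF_COLLECTOR placeholder as above):

# Master Server - verify passwordless root SSH to each Collector
ssh -o BatchMode=yes root@$IP_OF_COLLECTOR 'hostname'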

3 Run the Monitoring Scripts synchronization script on the Master Server now, and again whenever adding or modifying plugins/opspacks

/opt/opsview/coreutils/utils/rsync_monitoringscripts
# Usage:
# -c hostname|ip         host address to use when connecting to Collector - either hostname or ip (default: ip)
# -q                     quiet
# -t                     test run - rsync's dry-run
# -d /path/to/monitoringscripts/
#                        path to monitoringscripts directory (default: /opt/opsview/monitoringscripts/)
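
For example, you might first do a dry run to see what would be copied, and then run the real synchronization (flags as listed in the usage above):

# Dry run - show what would be synchronized without changing anything
/opt/opsview/coreutils/utils/rsync_monitoringscripts -t

# Synchronize the monitoring scripts to all Collectors
/opt/opsview/coreutils/utils/rsync_monitoringscripts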