Managing Clusters and Collectors

Learn about adding, registering, and removing Clusters and Collectors

Overview

This page gives detailed steps for adding new collector servers to a single-server system, and for adding new collector servers to an existing Opsview Monitor system with multiple servers and existing collectors.

Prerequisites before adding collectors:

  • A deployment host running an operating system supported by your version of Opsview
  • Root access to the deployment host
  • SSH access from the deployment host to all Opsview hosts (including new servers to be added as collector hosts)
    • Authentication must use SSH public keys
    • The remote user must be 'root' or have 'sudo' access without a password and without TTY
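The SSH prerequisites above can be checked before deploying with a sketch like the following. It assumes OpenSSH on the deployment host; the function name check_collector_access and the SSH_BIN override are illustrative and are not part of Opsview.

```shell
#!/bin/sh
# Hypothetical helper (not part of Opsview): verify key-based SSH login and
# passwordless sudo on one host. SSH_BIN can be overridden for a dry run.
SSH_BIN="${SSH_BIN:-ssh}"

check_collector_access() {
    host="$1"
    # BatchMode=yes: never prompt for a password, so only public keys can succeed.
    # sudo -n: fail instead of prompting if a password would be required.
    # No -t option is passed, so the remote command runs without a TTY.
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$host" 'sudo -n true'
}
```

For example, check_collector_access root@10.12.0.9 returns a zero exit status only if the login and the sudo call both succeed without prompting. If the remote user is 'root', the sudo part of the check may be unnecessary.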

Adding Collector Servers

To a Single Server System

To add new collector servers to an existing single-server Opsview Monitor system, open the /opt/opsview/deploy/etc/opsview_deploy.yml file, and add the following lines.
Note: Do not change the existing lines in opsview_deploy.yml:

collector_clusters:
  collectors-de:
    collector_hosts:
      opsview-de-1: { ip: 10.12.0.9 }

Change "opsview-de-1" and "10.12.0.9" to the hostname and IP address of your new collector, and give your collector cluster a name by changing "collectors-de".

You may add multiple collector clusters, and multiple collectors under each cluster, such as:

collector_clusters:
  collectors-de:
    collector_hosts:
      opsview-de-1: { ip: 10.12.0.9 }
      opsview-de-2: { ip: 10.12.0.19 }
      opsview-de-3: { ip: 10.12.0.29 }
 
  collectors-fr:
    collector_hosts:
      opsview-fr-1: { ip: 10.7.0.9 }
      opsview-fr-2: { ip: 10.7.0.19 }
      opsview-fr-3: { ip: 10.7.0.10 }
      opsview-fr-4: { ip: 10.7.0.20 }
      opsview-fr-5: { ip: 10.7.0.30 }

Cluster Size

There should always be an odd number of nodes within a collector cluster: 1, 3, 5, etc. This helps with resiliency and avoids split-brain issues.
In a cluster with an even number of nodes, if half the nodes go down, the other half will stop functioning: opsview-datastore and opsview-messagequeue within the cluster will have no quorum, and so will not accept updates until the other cluster members are restored.
For this reason, we do not support clusters with only two collectors.
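The arithmetic behind the odd-number rule can be sketched as follows. This assumes a simple majority quorum (more than half the nodes), which matches the behaviour described above but is not taken from the Opsview implementation; the function names are illustrative.

```shell
#!/bin/sh
# Assumed model: a cluster has quorum while more than half of its nodes are up.

# Smallest number of nodes that forms a majority of N nodes: floor(N/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while the remaining nodes still hold a quorum.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    printf '%s nodes: quorum %s, tolerates %s failure(s)\n' \
        "$n" "$(quorum_size "$n")" "$(tolerated_failures "$n")"
done
```

Under this model a 4-node cluster tolerates no more failures than a 3-node cluster, and a 2-node cluster tolerates none, which is why even-sized clusters add cost without adding resilience.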

In the example configuration above, two new collector clusters called "collectors-de" and "collectors-fr" are created.

"collectors-de" has 3 collector nodes (the minimum for a cluster with more than one node), while "collectors-fr" has 5 collector nodes. Hostnames and IP addresses are provided for each.

After modifying opsview_deploy.yml, run opsview-deploy as follows:

cd /opt/opsview/deploy
./bin/opsview-deploy lib/playbooks/check-deploy.yml
./bin/opsview-deploy lib/playbooks/setup-hosts.yml
./bin/opsview-deploy lib/playbooks/setup-infrastructure.yml
./bin/opsview-deploy lib/playbooks/collector-install.yml

After running opsview-deploy, follow the steps in the "Registering New Collector Servers in Opsview Web" section.

To a Multiple Server System

If you already have some collectors and want to add more, open /opt/opsview/deploy/etc/opsview_deploy.yml on your deployment server (typically the Opsview host running the orchestrator and opsview-web) and add the new collector clusters or collector hosts after the existing ones, such as:

collector_clusters:
  existing-collector1:
    collector_hosts:
      existing-host1: { ip: 10.12.0.9 }
      new-host1: { ip: 10.12.0.19 }
      new-host2: { ip: 10.12.0.29 }
 
  new-collector-cluster1:
    collector_hosts:
      new-host3: { ip: 10.7.0.9 }
      new-host4: { ip: 10.7.0.19 }
      new-host5: { ip: 10.7.0.29 }

In the example above, 5 new collector hosts (new-host1, new-host2, new-host3, new-host4 and new-host5) and 1 new collector cluster (new-collector-cluster1) have been added.

  • new-host1 and new-host2 are added to the existing collector cluster (existing-collector1)
  • new-host3, new-host4 and new-host5 are added to the new collector cluster (new-collector-cluster1)

After modifying opsview_deploy.yml, run opsview-deploy as follows:

cd /opt/opsview/deploy
./bin/opsview-deploy lib/playbooks/check-deploy.yml
./bin/opsview-deploy lib/playbooks/setup-hosts.yml
./bin/opsview-deploy lib/playbooks/setup-infrastructure.yml
./bin/opsview-deploy lib/playbooks/datastore-reshard-data.yml
./bin/opsview-deploy lib/playbooks/collector-install.yml

To speed up this process, you may limit the run to the collector cluster you are updating or creating.
The best way to do this is to specify the collector cluster using the lowercase "-l" option (l for Lima).

  • This approach is mainly relevant when updating a collector cluster, as it ensures the opsview-messagequeue configuration is correct
  • The commands below use the cluster name "existing-collector1" from the example above, which is written as "existing_collector1" (with an underscore) after the "opsview_cluster_" prefix

cd /opt/opsview/deploy
./bin/opsview-deploy -l opsview_cluster_existing_collector1 lib/playbooks/check-deploy.yml
./bin/opsview-deploy -l opsview_cluster_existing_collector1 lib/playbooks/setup-hosts.yml
./bin/opsview-deploy -l opsview_cluster_existing_collector1 lib/playbooks/setup-infrastructure.yml
./bin/opsview-deploy -l opsview_cluster_existing_collector1 lib/playbooks/datastore-reshard-data.yml
./bin/opsview-deploy -l opsview_cluster_existing_collector1 lib/playbooks/collector-install.yml
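The name passed to "-l" above can be derived from the cluster name in opsview_deploy.yml. The helper below is a minimal sketch of that mapping; the function name cluster_limit_name is hypothetical, and only the hyphen-to-underscore replacement shown in the example above is assumed (other characters are passed through unchanged).

```shell
#!/bin/sh
# Hypothetical helper: build the "-l" limit name for a collector cluster by
# prefixing "opsview_cluster_" and replacing hyphens with underscores.
cluster_limit_name() {
    printf 'opsview_cluster_%s\n' "$(printf '%s' "$1" | sed 's/-/_/g')"
}
```

It could then be used as: ./bin/opsview-deploy -l "$(cluster_limit_name existing-collector1)" lib/playbooks/collector-install.yml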

If these are new collector clusters, you may also pass the collector names within double quotes instead of the cluster name.
For a single new collector cluster (including a cluster of one), you may use the name of the collector or the names of all of its collectors.

This is also best practice when removing a collector from a cluster.

Collector variables

You may set specific component configuration against any Collector. Settings may be rolled out to individual Collectors or to all Collectors by using /opt/opsview/deploy/etc/user_vars.yml and /opt/opsview/deploy/etc/opsview_deploy.yml. In this example we look at setting specific values in the opsview-executor configuration, first for all collectors and then for the existing-collector1 server only.

To push out the configuration to all collectors on a deployment, you need an "ov_component_overrides" section containing the applicable component section, such as "opsview_executor_config". This is set within /opt/opsview/deploy/etc/user_vars.yml. These changes are applied to the component's <opsview-component>.yaml configuration file, so for the executor this is /opt/opsview/executor/etc/executor.yaml. The example below changes initial_worker_count to 4 (the system default is 2) and max_concurrent_processes to 10 (the system default is 25).

ov_component_overrides:
  opsview_executor_config:
    initial_worker_count: 4
    max_concurrent_processes: 10

Then run a deployment using the setup-everything.yml playbook to push out this configuration to all Collectors.

If the configuration is only required on one collector, modify /opt/opsview/deploy/etc/opsview_deploy.yml to add the overrides into the vars: section for that specific collector, as follows:

collector_clusters:
  collectorcluster:
    collector_hosts:
      existing-collector1:
        ip: 10.12.0.9
        vars:
          ov_component_overrides:
            opsview_executor_config:
              initial_worker_count: 4
              max_concurrent_processes: 10

Instead of running the whole Deploy process, use the collector-install.yml playbook against the specific collector (as detailed in an above section). If multiple collectors within the same Cluster are modified, ensure you run the playbook against all of them at the same time by using the option -l collector1,collector2,collector3.
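The comma-separated value for "-l" can be built from a list of collector names with a small helper such as the one below. This is an illustrative sketch; the function name join_limit is hypothetical and collector1, collector2 and collector3 are the placeholder names used above.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with commas (a b c -> a,b,c).
# The subshell keeps the IFS change from leaking into the calling shell.
join_limit() {
    ( IFS=,; printf '%s\n' "$*" )
}
```

For example: ./bin/opsview-deploy -l "$(join_limit collector1 collector2 collector3)" lib/playbooks/collector-install.yml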

Registering New Collector Servers in Opsview Web

Log into your Opsview Monitor user interface and go to the Configuration > Monitoring Collectors page.
You should see a yellow "Pending Registration" message on the right.

Click the menu icon to the right of your collector's hostname, then click Register.

A window to register the collector will appear.

Click "Submit Changes and Next". A new window will appear to create a "New Monitoring Cluster".

Give the new monitoring cluster the same name that you added to opsview_deploy.yml, such as "collectors-de". Select the collectors that should be in this monitoring cluster from the list of collectors, then click Submit Changes.

After adding the first monitoring cluster, you may register a collector in an existing monitoring cluster by selecting "Existing Cluster" and choosing the monitoring cluster from the drop-down list.

After registering your new collectors, you should see your clusters and the number of collectors under each cluster in the "Clusters" tab.

You can also click the numbers in the "COLLECTORS" column to see the collector hostnames.

Once the new collectors are registered, go to Configuration > Apply Changes to place the collectors into production.

Confirm the Collectors are running correctly by checking the System Overview tab in Configuration > My System.

Cluster Health

The Configuration > Monitoring Collectors page shows details on the health of both individual collector nodes and each Cluster.

Clusters Tab

The Status column shows the current state of the cluster. Possible values are:

  • ONLINE - Cluster is running normally
  • DEGRADED - Cluster has some issues. Hover over the status to get a list of alarms
  • OFFLINE - Cluster has not responded within a set period, so is assumed to be offline

Cluster Health Alarms

The table below describes the possible alarms that will be shown when users hover over the status of a DEGRADED cluster. These alarms refer to conditions of the following Opsview components:

  • opsview-schedulers
  • opsview-executors
  • opsview-results-sender

Each alarm is listed below with its description and suggested actions.

Alarm: All [Components Name] components are unavailable
e.g. All opsview-executor components are unavailable

Description: The Master/Orchestrator server cannot communicate with any [Components Name] components on the collector cluster. This may be because of a network/communications issue, or because no [Components Name] components are running on the cluster.

Note: this alarm only triggers when all [Components Name] components on the collector cluster are unavailable, since a cluster may be configured to only have these components running on a subset of the collectors. Furthermore, the cluster may be able to continue monitoring with some (though not all) of the [Components Name] components stopped.

Suggestions / Actions: Ensure that the Master/Orchestrator server can communicate with the collector cluster (i.e. resolve any network issues) and that at least one [Components Name] component (for example, a scheduler) is running, e.g. SSH to the collector and run:
/opt/opsview/watchdog/bin/opsview-monit start [Component Name]

Alarm: A [Components Name] component started monitoring < 15 minutes ago ([Collector Ref])
e.g. A opsview-results-sender component started monitoring < 15 minutes ago (930d8e58-28bd-11eb-aec9-93fc7dd5d47d)

Description: A [Components Name] component has just started up. This may be due to rebooting a collector, a system upgrade, adding a new collector, or the component restarting for some other reason. This alarm will show for 15 minutes after any [Components Name] component has started, to alert the user that there is not yet sufficient data about its performance to accurately report the overall health status.

Note: unlike the "All components are unavailable" alarm, which triggers only when all components are impacted, this alarm triggers when any of the [Components Name] components on the cluster restarts.

Suggestions / Actions: The presence of this alarm does not necessarily mean that a cluster's ability to monitor has been seriously impacted. However, if it occurs frequently, and is not associated with a scheduled reboot, upgrade, or addition of a collector, then the cause should be investigated. To do this, inspect the /var/log/opsview/opsview.log log on the collectors and look at the context around any restarts of this component. If the log reports an error that is causing the component to restart, please report this to the Opsview customer success team.

Alarm: Not enough messages received ([Components Name 1] → [Components Name 2]): [Time Period] [Percentage Messages Received]%.
e.g. Not enough messages received (opsview-scheduler → opsview-executor): [15m] 0%.

Description: Less than 70% of the messages sent by [Components Name 1] have been received by [Components Name 2] within the time period. This could indicate a communication problem between the components on the collector cluster, or that [Components Name 2] is overloaded and is struggling to process the messages it is receiving in a timely fashion.

e.g. 0% of the messages sent by the scheduler have been received by the executor within a 15-minute period.

Suggestions / Actions: If 0% of the messages sent have been received by [Components Name 2] and no other alarms are present, then this may imply a communications failure on the cluster. To resolve this, ensure that the collectors in the cluster can all communicate on all ports (see https://knowledge.opsview.com/docs/ports#collector-clusters) and that opsview-messagequeue is running on all the collectors without errors.

Alternatively, this may indicate that not all the required components are running on the collectors in the cluster. Run /opt/opsview/watchdog/bin/opsview-monit summary on each collector to check that all the components are in a running state. If any are stopped, run /opt/opsview/watchdog/bin/opsview-monit start [component name] to start them.

If more than 0% of the messages sent have been received by [Components Name 2], then this likely implies a performance issue in the cluster. To address this you can:

  • Reduce the load on the cluster, e.g.
    • Reduce the number of objects monitored by that cluster
    • Reduce the number of checks being performed on each object in the cluster (i.e. remove host templates/service checks)
    • Increase the check interval for monitored hosts
  • Increase the resources on the cluster
    • Add additional collectors to the cluster
    • Improve the hardware/resources of each collector in the cluster (i.e. investigate bottlenecks by inspecting self-monitoring statistics and allocate additional CPU/memory resources as needed)
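The decision logic for the "Not enough messages received" alarm can be summarised with the sketch below. It only illustrates the 70% threshold and the 0% versus greater-than-0% distinction described above; the function name classify_delivery, its message-count inputs and its output labels are hypothetical and do not reflect how Opsview computes the alarm internally.

```shell
#!/bin/sh
# Hypothetical helper: classify message delivery between two components.
# Usage: classify_delivery SENT RECEIVED   (whole-number message counts)
classify_delivery() {
    sent="$1"
    received="$2"
    if [ "$sent" -le 0 ]; then
        echo "no messages sent"
        return 0
    fi
    pct=$(( received * 100 / sent ))   # integer percentage, rounded down
    if [ "$received" -eq 0 ]; then
        # Nothing received: suspect communications or stopped components.
        echo "${pct}% communication"
    elif [ "$pct" -lt 70 ]; then
        # Some messages received, but under the 70% threshold: suspect load.
        echo "${pct}% performance"
    else
        echo "${pct}% ok"
    fi
}
```

For example, classify_delivery 200 100 prints "50% performance", pointing towards the load and resource suggestions above rather than a network fault.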

Collectors Tab

The Status column shows the current state of the collector. Possible values are:

  • ONLINE - Collector is running normally, based on the status of opsview-scheduler
  • OFFLINE - Collector has not responded within a set period, so is assumed to be offline

Removing a Collector from a Cluster

To remove a Collector from a Cluster, go to Configuration > Monitoring Collectors from the top menu and open the Clusters tab. Then click the menu icon and select "Edit".

Then deselect the Collector that you want to remove and click the "Submit Changes" button. You can now go to Configuration > Apply Changes to confirm the change and shut down the Collector.

Adding a Collector to a Cluster

To add a Collector to a Cluster, edit the Cluster and then select the Collector (hold Ctrl on Windows or Cmd on macOS to select it in addition to the existing selections). Go to Configuration > Apply Changes to confirm the change.

Deleting a Cluster

You may only delete clusters that are not monitoring any hosts. If you need to delete a cluster that has hosts assigned to be monitored, you must manually change the "monitored by" field for those hosts to another monitoring cluster. This can be done easily using the Bulk Edit tool within Configuration > Hosts.

You will need to go to Configuration > Apply Changes for this to take effect.

Deleting a Collector

If you need to decommission a Collector, you must do the following:

  • Remove the Collector from any Clusters before attempting to delete it. You can remove Collectors from Clusters in the Clusters tab. If the Cluster only contains a single Collector, disable the Collector, then delete the cluster.
  • Delete the Collector record. Deleting a collector removes it from the list of known Collectors. You can delete collectors from the Collectors tab on the Configuration > Monitoring Collectors page.
  • Delete the associated Host record in Host settings in order to completely remove it from Opsview.

Note: If you have deleted a Collector but then you want to register it again, you will not see it become available in the Unregistered Collectors grid until you stop the Scheduler on that collector for at least a whole minute and then restart it.
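The stop, wait, and restart sequence from the note above could be scripted on the collector roughly as follows. This is a sketch under assumptions: the component name opsview-scheduler is taken from elsewhere on this page, and the "stop" subcommand of opsview-monit is assumed to mirror the documented "start" subcommand. The function name and the MONIT and WAIT_SECONDS variables are hypothetical.

```shell
#!/bin/sh
# Assumed path from this page; override MONIT (e.g. MONIT=echo) for a dry run.
MONIT="${MONIT:-/opt/opsview/watchdog/bin/opsview-monit}"
# The note asks for at least a whole minute with the Scheduler stopped.
WAIT_SECONDS="${WAIT_SECONDS:-60}"

# Hypothetical helper: stop the Scheduler, wait, then start it again.
restart_scheduler_for_reregistration() {
    "$MONIT" stop opsview-scheduler || return 1   # "stop" is an assumption
    sleep "$WAIT_SECONDS"
    "$MONIT" start opsview-scheduler
}
```

Run restart_scheduler_for_reregistration as root on the deleted collector, then check the Unregistered Collectors grid again.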

Upgrading a Collector

Upgrading a Collector is as simple as upgrading all Opsview packages on the Collector server. To avoid any downtime, shut down the connection from the Collector to the Master MessageQueue server, upgrade all packages, and reset the system. Once the connection is restored, the Collector will automatically rejoin the Cluster and you can then upgrade the other Collectors.

Managing Collector Plugins

In a distributed Opsview Monitor system, monitoring scripts on the Collectors may become out of sync with the ones on the Orchestrator when:

  • new Opspacks, monitoring scripts or plugins have been imported to Orchestrator.
  • monitoring scripts have been updated directly on Orchestrator.

In such cases, the monitoring scripts folder (/opt/opsview/monitoringscripts) on the Orchestrator needs to be synced to all of the Collectors by using an Ansible playbook called sync_monitoringscripts.yml.

Overview

The sync_monitoringscripts.yml playbook uses rsync (which will be installed automatically if required) to send the appropriate updates to each Collector, while excluding specific sets of files.

The following directories and files (relative to /opt/opsview/monitoringscripts) are not synced:

.../lib/*
.../tmp/*
.../share/*
.../perl/*
.../plugins/utils.pm
.../var/*
.../opspacks/*
.../etc/notificationmethodvariables.cfg
.../etc/plugins/check_snmp_interfaces_cascade

For example, using the above exclude list, files within the /opt/opsview/monitoringscripts/lib/ directory and specific files such as /opt/opsview/monitoringscripts/etc/notificationmethodvariables.cfg won't be synced.
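To check how a particular file is affected by the exclude list, the patterns can be expressed as a shell case statement, as in the sketch below. This only mirrors the list above for illustration; it is not the playbook's actual rsync invocation, it ignores the additional OS-version filtering, and the function name sync_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path (relative to
# /opt/opsview/monitoringscripts) matches the exclude list above.
sync_status() {
    case "$1" in
        lib/*|tmp/*|share/*|perl/*|var/*|opspacks/*)
            echo "skipped" ;;
        plugins/utils.pm)
            echo "skipped" ;;
        etc/notificationmethodvariables.cfg)
            echo "skipped" ;;
        etc/plugins/check_snmp_interfaces_cascade)
            echo "skipped" ;;
        *)
            echo "synced" ;;
    esac
}
```

For example, sync_status plugins/utils.pm prints "skipped", while a path under plugins/ that is not on the list prints "synced".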

Additionally, if the Collector does not have the same OS version as the Orchestrator, only statically linked executable files and text-based files will be synced. This is to ensure binaries used on the Orchestrator are not synced with an incompatible Collector. For example, an AMD64 binary will not be sent to an ARM32 based Collector.

  • Interpreted script files such as Python, Perl and Bash scripts, and configuration files, are all text-based files and will be synced.
  • Dynamically linked executable files will not be synced because they may not run properly due to runtime dependencies. Such dynamically linked executable files need to be installed on collectors manually if collectors have a different OS version than the Orchestrator.

Prerequisites

SSH keys must be set up between the Orchestrator and the collectors (this should already be in place if Opsview Deploy was previously used to install or update the system).

How to Sync

Run the following commands as root on the Orchestrator:

cd /opt/opsview/deploy/
bin/opsview-deploy lib/playbooks/sync_monitoringscripts.yml

Limitations

If your deploy server is not the Orchestrator, you can run the same commands on your deploy server, but SSH keys must have been set up between the Orchestrator and the collectors for the SSH users defined for your collectors in your opsview_deploy.yml file.