Known Issues

An overview of the known issues in this release of Opsview Monitor

Overview

The following issues have been identified in this release of Opsview Monitor:

Upgrade/Installation

  • The opsview-deploy package must be upgraded before running opsview-deploy to upgrade an Opsview Monitor system.
  • Changing the flow collector configuration in Opsview Monitor currently requires a manual restart of the flow-collector component before it starts working again.
  • At upgrade, the following are not preserved:
    • Downtime: we recommend that you cancel any downtime (either active or scheduled) before you upgrade/migrate. Scheduling new downtime will work fine.
    • Flapping status: the state from pre-upgrade/migration is not retained but if the host/service is still flapping, the next checks will set the status to a flapping status again.
    • Acknowledgements: at the end of an upgrade/migration, the first reload removes the acknowledgement state from hosts and services. Any further acknowledgement will work as usual.
  • If you use an HTTP proxy in your environment, the TimeSeries daemons may not be able to communicate. You can work around this by adding the line export NO_PROXY=localhost,127.0.0.1 to the opsview user's .bashrc file (note: the variable name must be in upper case, not lower case).
  • Hosts and services in downtime will appear to stay in downtime even after the downtime is cancelled. You can work around this issue by creating a new downtime, waiting until it starts, and then cancelling it.
  • The sync_monitoringscripts.yml playbook fails when the SSH connection between the host running opsview-deploy and the other instances relies on a user other than root and the private SSH key is defined only through the ansible_ssh_private_key_file property in opsview_deploy.yml. This happens because the underlying rsync command is not passed the private SSH key and therefore cannot connect to the instances. To work around this issue, add the user and private key for each instance to the root user's SSH configuration. Consider the following example:
# If you use ansible_ssh_private_key_file in the opsview_deploy.yml file

(...)
collector_clusters:
  cluster-A:
    collector_hosts:
      ip-172-31-9-216:
        ip: 172.31.9.216
        user: ec2-user  
        vars:
          ansible_ssh_private_key_file: /home/ec2-user/.ssh/ec2_key
      ip-172-31-5-98:
        ip: 172.31.5.98
        user: ec2-user  
        vars:
          ansible_ssh_private_key_file: /home/ec2-user/.ssh/ec2_key
(...)

# You need to add the following entries to /root/.ssh/config

Host ip-172-31-9-216 172.31.9.216
    User ec2-user
    IdentityFile /home/ec2-user/.ssh/ec2_key
Host ip-172-31-5-98 172.31.5.98
    User ec2-user
    IdentityFile /home/ec2-user/.ssh/ec2_key
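
For the HTTP proxy workaround listed above, the following is a minimal sketch of adding the NO_PROXY export to a profile file without duplicating it on repeated runs. The helper name ensure_no_proxy is hypothetical and not part of Opsview, and the location of the opsview user's .bashrc depends on your installation.

```shell
# ensure_no_proxy FILE: append the NO_PROXY export to FILE unless it is already present.
# ensure_no_proxy is a hypothetical helper name, not part of Opsview.
ensure_no_proxy() {
    profile="$1"
    line='export NO_PROXY=localhost,127.0.0.1'
    touch "$profile"
    # -x matches the whole line, -F treats the pattern as a fixed string
    if ! grep -qxF "$line" "$profile"; then
        printf '%s\n' "$line" >> "$profile"
    fi
}

# Example (run as the opsview user):
# ensure_no_proxy "$HOME/.bashrc"
```

Because the check matches the whole line, running the helper repeatedly leaves a single entry in the file.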

Plugins

  • There is no automated mechanism in this release to synchronize scripts between the Opsview Monitor Primary Server and Collector Clusters. A sync_monitoringscripts.yml deploy playbook is provided to fulfil this purpose but it must be run manually or from cron on a regular basis.
  • check_wmi_plus.pl may report errors relating to files in your /tmp directory because the ownership of these files needs to be changed to the Opsview user. This is seen when upgrading from an earlier version of Opsview, where the nagios user ran this plugin.
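
For the check_wmi_plus.pl ownership issue above, a minimal sketch of locating the affected files follows. It assumes the old files sit directly under /tmp and are owned by a user named nagios, and that the new owner is a user named opsview; the helper name list_owned_by is hypothetical. Review the list before changing any ownership.

```shell
# list_owned_by DIR USER: print regular files directly under DIR owned by USER.
# USER may be a user name or a numeric user ID. Hypothetical helper name.
list_owned_by() {
    find "$1" -maxdepth 1 -type f -user "$2" -print
}

# Review first, then hand the files over (as root).
# The user names nagios and opsview are assumptions:
# list_owned_by /tmp nagios
# find /tmp -maxdepth 1 -type f -user nagios -exec chown opsview {} +
```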

Modules support

  • SMS Gateway is not available in this release. If you rely on this method, please contact Support.

Collectors and clusters

  • Despite the UI/API currently allowing it, you should not set parent/child relationships between the collectors themselves in any monitoring cluster, as collectors do not have a dependency between each other and are considered equals.
  • When trying to Investigate a host, if you get an Opsview Web Exception error with a "Caught exception in Opsview" message, this may indicate that the Cluster monitoring for that host has failed and needs to be addressed.

Other Issues

  • There is no option to set a new Home Page yet. For new installations, the Home Page is set to the Configuration > Navigator page.
  • Start and End Notifications for flapping states are not implemented in this release.
  • Deploy cannot be used to update the database root password. Root user password changes should be made manually and the /opt/opsview/deploy/etc/user_secrets.yml file updated with the correct password.
  • When a Host has been configured with 2 or more Parents and all of them are DOWN, the Status of the Service Checks on the host is set to CRITICAL instead of UNKNOWN. Consequently, the Status Information is not accurate either.
  • If an Opsview Monitor system is configured to have UDP logging enabled in rsyslog, RabbitMQ will log INFO level messages to opsview.log and syslog at a high frequency, approximately 1 message every 20 seconds.
  • Some components such as opsview-web and opsview-executor can log credential information when in Debug mode.
  • When running an Autodiscovery Scan via a cluster for the first time there must be at least one host already being monitored by that cluster. If the cluster does not monitor at least one host, the scan may fail with this message: "Cannot start scan because monitoring server is deactivated".
  • When running an Autodiscovery Scan for the first time after an upgrade, it may fail to begin and remain in the Pending state. To resolve this, simply restart the opsview-autodiscoverymanager component on the Opsview Master Server (orchestrator). After the component has restarted successfully, the scan will start.
  • You may get occasional errors appearing in syslog, such as:
Nov 28 16:31:50 production.opsview.com opsview-datastore[<0.6301.0>] req_err(2525593956) unknown_error : normal#012 
   [<<"chttpd:catch_error/3 L353">>,<<"chttpd:handle_req_after_auth/2 L319">>,<<"chttpd:process_request/1 L300">>,
   <<"chttpd:handle_request_int/1 L240">>,<<"mochiweb_http:headers/6 L124">>,<<"proc_lib:init_p_do_apply/3 L247">>]
  
# You can ignore them as there is no operational impact.
  • In order to get the SNMP Traps working on a hardened environment the following settings need to be changed:
# Add the following lines to /etc/hosts.allow
 
snmpd:ALL
snmptrapd:ALL
 
# Add the following lines to /etc/hosts.deny
 
snmpd: ALL: allow
snmptrapd: ALL: allow
  • Using Delete All on the SNMP Traps Exceptions page may sometimes hide new exceptions as they come in. They can be viewed again by changing the 'Page Size' at the bottom of the window to a different number.
  • CPU utilization is sometimes high due to the datastore.

Apply Changes

  • After upgrading you may see some strange text in the Apply Changes UI window. Resolve this by clearing the cache of your browser.

AutoMonitor

  • When an AutoMonitor Windows Express Scan is set with a wrong, but still reachable, Active Directory Server IP or FQDN, the scan could remain in a "pending" state until it times out (1 Hour default value). This means that no other scans can run on the same cluster for that period of time. This is due to PowerShell not timing out correctly.
  • AutoMonitor automatically creates the Host Groups used for the scan: Opsview > Automonitor > Windows Express Scan > Domain. If any of these Host Groups already exist elsewhere in Opsview Monitor, then the scan will fail. If one of the Host Groups is moved, it should be renamed to avoid this problem.
  • Also, if you have renamed your Opsview Host Group, the AutoMonitor scan will currently fail. You will need to rename it back or create a new Opsview Host Group for the scan to be successful.
  • The AutoMonitor application clears local storage on logout. This means that if a user logs out while a scan is in progress, they will not see that scan's progress when they log back in, even if it is still running in the background.

Opspacks

  • Hosts using deprecated "Cloud - Azure" Host Templates will be transitioned automatically to new Host Templates during an upgrade to Opsview 6.4. As part of this transition, the value of the Primary Address/IP field will be used to populate the second Argument of the AZURE_RESOURCE_DETAILS Variable (labelled as Resource Name). If this is not equal to the name of the Azure resource in question, the check may not run correctly. To fix, ensure that the Argument matches the resource name in Azure. Affected Host Templates:
    • Cloud - Azure - PostgreSQL Server
    • Cloud - Azure - MySQL Server
    • Cloud - Azure - SQL
    • Cloud - Azure - Redis
    • Cloud - Azure - Elastic Pool
  • Windows WMI - Base Agentless - LAN Status Servicecheck: utilization values for network adaptor bytes sent/bytes received rates are around 8 times lower than expected. As a workaround, adjust the Warning and Critical thresholds accordingly. See the Plugin Change Log.
  • Cloud - AWS related Opspacks: The directory /opt/opsview/monitoringscripts/etc/plugins/cloud-aws, which is the default location for aws_credentials.cfg file, is not created automatically by Opsview. Therefore, it needs to be created manually.
  • If opsview_tls_enabled is set to false, the Cache Manager component used by the Application - Kubernetes and OS - VMware vSphere Opspacks will not work correctly in distributed environments.
  • 'Hardware - Cisco UCS': if this Opspack is migrated from an Opsview v5.x system, it may produce the error "Error while trying to read configuration file" or "File "./check_cisco_ucs_nagios", line 25, in <module> from UcsSdk import * ImportError: No module named UcsSdk".
    If this is seen, running the following will resolve the issue:
# as root
wget https://community.cisco.com/kxiwq67737/attachments/kxiwq67737/4354j-docs-cisco-dev-ucs-integ/862/1/UcsSdk-0.8.3.tar.gz
 
tar zxfv UcsSdk-0.8.3.tar.gz
cd UcsSdk-0.8.3
sudo python setup.py install

Place config file 'cisco_ucs_nagios.cfg' into the plugins path /opt/opsview/monitoringscripts/plugins/.

  • Opsview - Login is critical on a rehomed system. Resolve this by adding an exception to the Servicecheck on the Host specifying /opsview/login as the destination rather than /login.
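
For the Cloud - AWS item above, the missing directory can be created along these lines. This is a sketch: the helper name ensure_dir is hypothetical, and the ownership shown in the commented example (user and group opsview) is an assumption that should be checked against the other directories under /opt/opsview/monitoringscripts/etc/plugins.

```shell
# ensure_dir DIR: create DIR (and any missing parents) if it does not exist yet.
# Hypothetical helper name, not part of Opsview.
ensure_dir() {
    [ -d "$1" ] || mkdir -p "$1"
}

# Example (as root); the opsview:opsview ownership is an assumption:
# ensure_dir /opt/opsview/monitoringscripts/etc/plugins/cloud-aws
# chown opsview:opsview /opt/opsview/monitoringscripts/etc/plugins/cloud-aws
```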

Unicode Support

  • While inputting non-UTF-8 characters into Opsview Monitor will not generate any problem, the rendering of those characters in the user interface may be altered in places such as free text comments.

Service Check UNKNOWNS

  • "Opsview - Datastore" service checks may return an error such as "UNKNOWN: Error: Name or password is incorrect."
  • Related to the Host Template "Opsview - Component - Datastore"
  • This will be resolved in a later release
  • To fix this at present, add the OPSVIEW_DATASTORE_SETTINGS variable to your Opsview host and set arguments 2 and 4
  • An "Apply Changes" is needed after adding these variables

'UNKNOWN: Error decoding CSV' status

Remove the files under /opt/opsview/agent/tmp/* on the Opsview host that reports the 'UNKNOWN' status.

Argument 2: Obtaining the Opsview Datastore password

grep opsview_datastore_password /opt/opsview/deploy/etc/user_secrets.yml

Argument 4: Obtaining the Opsview Datastore node name/information

grep couchdb /opt/opsview/datastore/etc/vm.args
  • UNKNOWNs may also be returned by "Opsview - Messagequeue" service checks
  • The related variable your Opsview Host may need is OPSVIEW_MESSAGEQUEUE_CREDENTIALS
  • Arguments one and four of this variable are populated by default, being "opsview" and "15672"
  • Arguments two and three may be obtained (if not already populated) as shown below:

Argument 2: Obtaining the Opsview Messagequeue password

grep opsview_messagequeue_password /opt/opsview/deploy/etc/user_secrets.yml

Argument 3: Obtaining the NODENAME, which takes the form name@hostname, where hostname is the full hostname as returned by hostname -f.

grep NODENAME /opt/opsview/messagequeue/etc/rabbitmq-env.conf
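
The grep commands above print the whole matching line. Where only the value is wanted, for example to paste into a variable argument, something like the following sketch may help. It assumes each secret is stored on a single line as key: value, optionally quoted; the helper name secret_value is hypothetical.

```shell
# secret_value FILE KEY: print the value of the first "KEY: value" line in FILE,
# with surrounding single or double quotes removed. Hypothetical helper name.
secret_value() {
    sed -n "s/^$2:[[:space:]]*//p" "$1" \
        | head -n 1 \
        | sed -e "s/^[\"']//" -e "s/[\"'][[:space:]]*\$//"
}

# Example:
# secret_value /opt/opsview/deploy/etc/user_secrets.yml opsview_messagequeue_password
```

If the key is missing, the helper prints nothing, so an empty result is worth checking before the value is used.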

'METRIC UNKNOWN - Message Queue API call failed. Status 404: Not Found'

The above may display if your server times are out of sync. Please ensure they are in sync or use a utility such as NTP to do this for you.
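
To see whether clock drift is a plausible cause, the clocks of two servers can be compared roughly as sketched below. The helper name clock_drift and the host name collector1 are hypothetical. The comparison over SSH includes network latency, so differences of a second or two are best treated as noise.

```shell
# clock_drift A B: print the absolute difference between two epoch timestamps.
# Hypothetical helper name, not part of Opsview.
clock_drift() {
    if [ "$1" -ge "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}

# Example: compare this server with a collector (collector1 is a placeholder):
# clock_drift "$(date +%s)" "$(ssh collector1 date +%s)"
```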

SNMP Traps

SNMPTraps daemons are started on all nodes within a cluster. At startup a 'master SNMP trap node' is selected and is the only one in a cluster to receive and process traps. Other nodes silently drop traps.

The majority of SNMP trap sending devices can send to at most 2 different destinations.

The current (6.3) fix is to manually pick two nodes in a given cluster to act as the SNMP trap node and the standby node, and then mark all other nodes within the cluster so that the trap daemons are not installed on them, for example:

collector_clusters:
  Trap Cluster:
    collector_hosts:
      traptest-col01: { ip: 192.168.18.53,  ssh_user: centos }
      traptest-col02: { ip: 192.168.18.157, ssh_user: centos }
      traptest-col03: { ip: 192.168.18.155, ssh_user: centos, vars: { opsview_collector_enable_snmp: False } }
      traptest-col04: { ip: 192.168.18.61,  ssh_user: centos, vars: { opsview_collector_enable_snmp: False } }
      traptest-col05:
        ip: 192.168.18.61
        ssh_user: centos
        vars:
          opsview_collector_enable_snmp: False

On a fresh installation the daemons will not be installed.

On an existing installation, the trap packages must be removed from the inactive nodes and the trap daemons on the 2 active nodes restarted to re-elect the master trap node:

# INACTIVE NODES:
CentOS/RHEL: yum remove opsview-snmptraps-base opsview-snmptraps-collector
Ubuntu/Debian: apt-get remove opsview-snmptraps-base opsview-snmptraps-collector

# ACTIVE NODES:
/opt/opsview/watchdog/bin/opsview-monit restart opsview-snmptrapscollector
/opt/opsview/watchdog/bin/opsview-monit restart opsview-snmptraps

Undefined subroutine

Undefined subroutine &Opsview::Utils::SnmpInterfaces::Helper::pp called at 
/opt/opsview/monitoringscripts/lib/Opsview/Utils/SnmpInterfaces/Helper.pm line 270.

This is seen if a large number of devices are being polled.
The current fix is to add the line of code below to the Helper.pm file on all your Opsview servers:

  • the quickest approach may be to make the change on your Orchestrator and then use scp or rsync to copy the file to the rest of your servers
  • add the line at the beginning of the file, with the other "use" statements
use Data::Dump qw( pp );

Then recheck the service checks in question.
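
The manual edit described above could be scripted roughly as follows. This is a sketch only: the helper name add_pp_import and the host name collector1 are hypothetical, and it assumes Helper.pm already contains at least one line beginning with "use ". A backup copy is written next to the file before it is changed.

```shell
# add_pp_import FILE: insert the Data::Dump import after the first "use" line in FILE.
# Does nothing if the import is already present. Hypothetical helper name.
add_pp_import() {
    file="$1"
    line='use Data::Dump qw( pp );'
    # Already patched: leave the file untouched
    grep -qF "$line" "$file" && return 0
    # Line number of the first existing "use" statement
    n=$(grep -n '^use ' "$file" | head -n 1 | cut -d: -f1)
    [ -n "$n" ] || return 1
    cp "$file" "$file.bak"
    # Re-emit every line, adding the import directly after line n
    awk -v n="$n" -v line="$line" '{ print } NR == n { print line }' "$file.bak" > "$file"
}

# Example, on the Orchestrator, then copy to another server (collector1 is a placeholder):
# add_pp_import /opt/opsview/monitoringscripts/lib/Opsview/Utils/SnmpInterfaces/Helper.pm
# scp /opt/opsview/monitoringscripts/lib/Opsview/Utils/SnmpInterfaces/Helper.pm \
#     collector1:/opt/opsview/monitoringscripts/lib/Opsview/Utils/SnmpInterfaces/Helper.pm
```

Writing the result back into the existing file, rather than replacing it, keeps its ownership and permissions as they were.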