Opsview Knowledge Center

Event Handlers

An overview of Event Handlers and how to use them in Opsview Monitor

This section provides an overview of Event Handlers and how they are configured and used for both Hosts and Service Checks within Opsview Monitor.

Introduction to Event Handlers

This document explains the concept of Event Handlers: what they are, how to create one, and how to apply them to both hosts and service checks. After reading the User Guide, users should be able to create their own Event Handlers and apply them to hosts and service checks within their Opsview Monitor system.

Overview

Event Handlers are a feature of Opsview Monitor that moves your monitoring solution from a 'detect and alert' system to a more proactive monitoring tool. For example, if Opsview Monitor detects that the web service is not running on a monitored host, it can not only alert you but also restart the web service automatically. You still learn that a problem occurred, so you can diagnose it and ensure it doesn't happen again, but your users are not impacted because the web server is back online within seconds of the outage. This is done via an Event Handler.

Event Handlers are scripts (Perl, Python, etc.) that Opsview Monitor can run automatically when it detects that a host or service check has failed (i.e. gone into a non-'UP' or non-'OK' state). For reference, the Event Handler commands are executed when a host or service check:

  • Goes into a 'soft' error state (the handler is invoked for each soft error state, not only the first)
  • Goes into a 'hard' error state
  • Recovers from a 'soft' or 'hard' error state.

Note: Event Handlers are still run for Service Checks that are in downtime or have been acknowledged.

Event Handlers sit on the monitoring server and are invoked by Opsview Monitor. To run successfully, an Event Handler must be stored in /usr/local/nagios/libexec/eventhandlers/ on the Master/Slave, be owned by 'nagios:nagios', and be executable by that user; the directory listing in Debugging Event Handlers shows the scripts with permissions 0755 (rwxr-xr-x).
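The installation step can be sketched as follows. This is a minimal sketch that assumes GNU coreutils `install` is available; the function name `install_handler` and its arguments are hypothetical and not part of Opsview Monitor. The mode 0755 matches the directory listing in Debugging Event Handlers. Changing ownership needs root, so that step is shown only as a comment.

```shell
#!/bin/bash
# Hypothetical helper (not part of Opsview Monitor): copy a handler script
# into the handlers directory and make it executable.
install_handler() {
    local src="$1"
    local dest_dir="${2:-/usr/local/nagios/libexec/eventhandlers}"

    # install(1) copies the file and sets its mode in one step
    install -m 0755 "$src" "$dest_dir/" || return 1

    # Ownership should be nagios:nagios; changing it needs root, for example:
    #   chown nagios:nagios "$dest_dir/$(basename "$src")"
}
```

On a real system this would typically be run as root, followed by the `chown` shown in the comment.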

The graphic above shows the relationship between the Opsview Monitor software, the Opsview Agent and the Event Handler. The Master/Slave runs the Event Handler when the service changes to a non-OK state. At the same time, the 'retry interval' will be running, meaning the Master/Slave is likely monitoring the server at a one-minute interval (if the default value is unmodified). This means that once the Event Handler has been run, the Opsview Monitor server should detect that the service is now back 'up' and running, and thus the service check state should return to an 'OK' state (unless there is a problem stopping the service from restarting, such as misconfiguration, etc).

In the example above, we have chosen to run an Event Handler on the 'Apache service status' service check; however, Event Handlers can be run on any host or service check. For example, you may create an Event Handler that clears /tmp or the 'Recycle Bin' when the 'Disk capacity' check changes to WARNING or CRITICAL. Alternatively, you may wish to create an Event Handler that flashes a series of lights red when a service check monitoring the number of 'Severity 1 tickets' changes from zero to one or more, in order to alert your support team quickly.
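As an illustration of the disk cleanup idea, the sketch below removes old files when the service state is WARNING or CRITICAL. It is a sketch under stated assumptions: `clean_old_files`, `CLEAN_DIR` and `MAX_AGE_DAYS` are hypothetical names, not Opsview settings, and the code acts on the filesystem of the machine where it runs. To clean a monitored host, it would have to be triggered there through the agent.

```shell
#!/bin/bash
# Hypothetical sketch: remove old files when a disk check is WARNING or CRITICAL.
# CLEAN_DIR and MAX_AGE_DAYS are illustrative settings, not Opsview options.
CLEAN_DIR="${CLEAN_DIR:-/tmp}"
MAX_AGE_DAYS="${MAX_AGE_DAYS:-7}"

clean_old_files() {
    case "$NAGIOS_SERVICESTATE" in
        WARNING|CRITICAL)
            # Delete regular files last modified more than MAX_AGE_DAYS days ago
            find "$CLEAN_DIR" -type f -mtime +"$MAX_AGE_DAYS" -delete
            ;;
        *)
            # OK, UNKNOWN or unset: do nothing
            return 0
            ;;
    esac
}
```

Because the function deletes files, it is worth trying it against a scratch directory before pointing it at a real one.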

Creating a New Event Handler

Event Handlers, as explained in Introduction to Event Handlers, are scripts that can be written in any language understandable by the host operating system; e.g. Perl or Python. In most cases, however, Event Handlers tend to be either shell or Perl scripts.

Event Handlers should use the available macros within the environment to ensure that they only run when required. The main macros are:

  • $NAGIOS_HOSTSTATE (UP, DOWN or UNREACHABLE)
  • $NAGIOS_HOSTSTATETYPE (SOFT or HARD)
  • $NAGIOS_HOSTATTEMPT (number, starts from 1)
  • $NAGIOS_SERVICESTATE (OK, WARNING, CRITICAL or UNKNOWN)
  • $NAGIOS_SERVICESTATETYPE (SOFT or HARD)
  • $NAGIOS_SERVICEATTEMPT (number, starts from 1)
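The state macros above can be combined to decide whether a handler should act. The sketch below is illustrative only: `decide_action` is a hypothetical function, and the rule it applies (act on a HARD CRITICAL state, or on the third SOFT attempt) is an assumption chosen for the example rather than Opsview's required behavior. It prints a decision instead of performing an action.

```shell
#!/bin/bash
# Hypothetical sketch: decide what a service Event Handler should do, based on
# the state macros exported into its environment.
decide_action() {
    case "$NAGIOS_SERVICESTATE" in
        OK)
            echo "none: service recovered or is healthy"
            ;;
        WARNING|UNKNOWN)
            echo "none: state is not severe enough to act on"
            ;;
        CRITICAL)
            if [ "$NAGIOS_SERVICESTATETYPE" = "HARD" ]; then
                echo "restart: hard failure"
            elif [ "${NAGIOS_SERVICEATTEMPT:-0}" -ge 3 ]; then
                echo "restart: soft failure, attempt $NAGIOS_SERVICEATTEMPT"
            else
                echo "wait: soft failure, attempt ${NAGIOS_SERVICEATTEMPT:-unknown}"
            fi
            ;;
        *)
            echo "none: no service state macro set"
            ;;
    esac
}
```

A handler would call `decide_action` and branch on the word before the colon.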

Other macros available within Opsview Monitor are:

  • $NAGIOS_CONTACTALIAS
  • $NAGIOS_CONTACTEMAIL
  • $NAGIOS_CONTACTGROUPLIST
  • $NAGIOS_CONTACTNAME
  • $NAGIOS_CONTACTPAGER
  • $NAGIOS_HOSTACKAUTHOR
  • $NAGIOS_HOSTACKCOMMENT
  • $NAGIOS_HOSTADDRESS
  • $NAGIOS_HOSTALIAS
  • $NAGIOS_HOSTDOWNTIME
  • $NAGIOS_HOSTDURATION
  • $NAGIOS_HOSTGROUPALIAS
  • $NAGIOS_HOSTGROUPNAME
  • $NAGIOS_HOSTNAME
  • $NAGIOS_HOSTNOTIFICATIONNUMBER
  • $NAGIOS_HOSTOUTPUT
  • $NAGIOS_HOSTPROBLEMID
  • $NAGIOS_HOSTSTATEID
  • $NAGIOS_LASTHOSTCHECK
  • $NAGIOS_LASTHOSTDOWN
  • $NAGIOS_LASTHOSTPROBLEMID
  • $NAGIOS_LASTHOSTSTATE
  • $NAGIOS_LASTHOSTSTATECHANGE
  • $NAGIOS_LASTHOSTUNREACHABLE
  • $NAGIOS_LASTHOSTUP
  • $NAGIOS_LASTSERVICECHECK
  • $NAGIOS_LASTSERVICECRITICAL
  • $NAGIOS_LASTSERVICEOK
  • $NAGIOS_LASTSERVICEPROBLEMID
  • $NAGIOS_LASTSERVICESTATE
  • $NAGIOS_LASTSERVICESTATECHANGE
  • $NAGIOS_LASTSERVICEWARNING
  • $NAGIOS_LASTSTATECHANGE
  • $NAGIOS_LONGDATETIME
  • $NAGIOS_LONGHOSTOUTPUT
  • $NAGIOS_LONGSERVICEOUTPUT
  • $NAGIOS_NOTIFICATIONAUTHOR
  • $NAGIOS_NOTIFICATIONCOMMENT
  • $NAGIOS_NOTIFICATIONNUMBER
  • $NAGIOS_NOTIFICATIONTYPE
  • $NAGIOS_SERVICEACKAUTHOR
  • $NAGIOS_SERVICEACKCOMMENT
  • $NAGIOS_SERVICEDESC
  • $NAGIOS_SERVICEDOWNTIME
  • $NAGIOS_SERVICEDURATION
  • $NAGIOS_SERVICENOTES
  • $NAGIOS_SERVICENOTIFICATIONNUMBER
  • $NAGIOS_SERVICEOUTPUT
  • $NAGIOS_SERVICEPROBLEMID
  • $NAGIOS_SERVICESTATEID
  • $NAGIOS_SHORTDATETIME
  • $NAGIOS_TIMET
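To see which of these macros are actually present when a handler runs, the environment can be written to a log file. The helper below is a minimal sketch: `dump_macros` is a hypothetical name, and the default log path is only an example.

```shell
#!/bin/bash
# Hypothetical debugging helper: append the NAGIOS_* variables currently in the
# environment to a log file. The default log path is illustrative only.
dump_macros() {
    local log_file="${1:-/tmp/handler.log}"
    {
        date
        # Only exported variables appear in env output
        env | grep '^NAGIOS_' | sort
        echo
    } >> "$log_file"
}
```

Calling `dump_macros` at the top of a handler records one timestamped block per invocation, which can be removed once the handler works.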

In the example script below, we restart the Apache service, following the scenario described in the Overview:

#!/bin/bash
# Uncomment below to get debug information about the environment variables set by Nagios Core
# { date; env | sort; echo; } >> /tmp/handler.log
# If Service State is CRITICAL (options are OK, WARNING, CRITICAL and UNKNOWN)
# and Service State Type is HARD (options are HARD and SOFT)
# then execute Event Handler action
if [[ "$NAGIOS_SERVICESTATE" = "CRITICAL" && "$NAGIOS_SERVICESTATETYPE" = "HARD" ]]
then
        echo "restarting apache"
        # insert Event Handler action here...
        # run the eh_apache_restart command on the monitored host and discard its output
        /usr/local/nagios/libexec/check_nrpe -H "$NAGIOS_HOSTADDRESS" -c eh_apache_restart >/dev/null 2>&1
        # record event to syslog
        logger "Apache 2 restarted by Opsview on $NAGIOS_HOSTADDRESS"
fi

This Event Handler is located in /usr/local/nagios/libexec/eventhandlers on the Opsview Monitor master server. The first part of the Event Handler checks that the service state is both CRITICAL and HARD, so that it does not act on a SOFT state where the service may have stopped only temporarily; this condition can easily be changed.

Once these criteria are met, the Event Handler echoes 'restarting apache' and then runs the command 'eh_apache_restart' on the host in question, redirecting the output of the command to /dev/null (i.e. hiding the output). Finally, it logs to syslog that Apache has been restarted.

In the Opsview Monitor user interface, the Event Handler can be configured in one of two ways. On a global basis, it applies to the service check on every host: if this service check changes to a CRITICAL state on any host, run this Event Handler. On an individual basis, it applies to a single host: if this service check changes to a CRITICAL state on this host only, run this Event Handler. The individual option allows for bespoke Event Handlers that are custom to individual hosts.

To set the Event Handler on a global basis, navigate to 'Settings > Service Checks' and edit the service check in question; in our example it will be 'Apache active sessions'.

Once opened, click on the 'Advanced' drawer within the 'Details' tab and populate the 'Event Handler:' field as shown below:

Once saved, any host that has the 'Apache active sessions' service check applied will have the Event Handler enabled for its service check.

To set the Event Handler on a host by host basis, navigate to 'Settings > Host Settings' and edit the host in question; in our example 'Opsview'.

Once opened, click on the 'Service Checks' tab, and navigate to the service check in question using the tree panel on the left hand side. Note: Ensure the service check is checked in the left hand panel; if the service check is not checked then the 'Exceptions' drawer will not be enabled.

Once 'within' the service check, click on the 'Exceptions' drawer and check the 'Event Handler' checkbox as shown above. Finally, enter the name of the Event Handler and click 'submit changes'; this enables the 'restart_apache' Event Handler for this service check on the host 'Opsview' only.

Debugging Event Handlers

There are a few helpful tips that can assist you in debugging Event Handlers that are not working. First, ensure that 'Log Event Handlers:' is set to 'ENABLED' within the 'Nagios Core' tab of 'My System'.

Secondly, ensure that the scripts are placed in '/usr/local/nagios/libexec/eventhandlers', that they are owned by 'nagios:nagios', and that they are executable, with the file permissions '755' (rwxr-xr-x) as below:

root@system:/usr/local/nagios/libexec/eventhandlers# ls -la
total 64
drwxr-xr-x 2 nagios nagios  4096 Jun 29 14:17 .
drwxrwxr-x 6 nagios nagios 36864 Aug 11 11:15 ..
-rwxr-xr-x 1 nagios nagios  1610 Feb  2  2012 apache_restart
-rwxr-xr-x 1 nagios nagios  9713 Jun 22 16:01 cluster_node_takeover_hosts
-rwxr-xr-x 1 nagios nagios  3743 Jun 22 16:01 slave_node_resync
-rwxr-xr-x 1 nagios nagios   217 Jan 29  2013 windows_service_restart

Thirdly, you can test the Event Handler by passing through the environment variables (macros) to the script to simulate a check execution using a command such as:

NAGIOS_SERVICESTATE=CRITICAL NAGIOS_SERVICESTATETYPE=HARD NAGIOS_SERVICEATTEMPT=3 \
    /usr/local/nagios/libexec/eventhandlers/apache_restart

Running this against our apache_restart script we can see:

nagios@system:/usr/local/nagios/libexec/eventhandlers$ NAGIOS_SERVICESTATE=CRITICAL NAGIOS_SERVICESTATETYPE=HARD NAGIOS_SERVICEATTEMPT=3 \
    /usr/local/nagios/libexec/eventhandlers/apache_restart
restarting apache
nagios@system:/usr/local/nagios/libexec/eventhandlers$
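When testing several state combinations, the simulated macros can be wrapped in a small helper. This is a minimal sketch: `simulate_service_event` is a hypothetical function, the default address 127.0.0.1 is a placeholder, and the script is run through `bash`, which assumes the handler is a shell script.

```shell
#!/bin/bash
# Hypothetical test helper: run a handler script with simulated service macros.
# Usage: simulate_service_event SCRIPT STATE STATETYPE ATTEMPT [HOSTADDRESS]
simulate_service_event() {
    local script="$1" state="$2" state_type="$3" attempt="$4"
    local address="${5:-127.0.0.1}"

    # env sets the variables only for this one invocation of the script.
    # The script is run through bash, so this sketch assumes a shell-script
    # handler and does not exercise the file's execute permission.
    env NAGIOS_SERVICESTATE="$state" \
        NAGIOS_SERVICESTATETYPE="$state_type" \
        NAGIOS_SERVICEATTEMPT="$attempt" \
        NAGIOS_HOSTADDRESS="$address" \
        bash "$script"
}
```

For example, calling it with the states `OK HARD 1` against the apache_restart script shown earlier should print nothing, since that script only acts on a HARD CRITICAL state.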

Fourthly, check the 'nagios.log' file on the appropriate server (Slave or Master) for the message:

[1224616751] SERVICE EVENT HANDLER: opsview:Pipe check;CRITICAL;SOFT;1;host1_service83_eh-hander_cmdpipe.sh

This means the Event Handler was called successfully. You can look for these messages using a command similar to the one shown below:

nagios@system:/usr/local/nagios/var$ tail -n2000 /usr/local/nagios/var/nagios.log | grep EVENT
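The matching log lines can also be reduced to the fields of interest. The sketch below assumes the semicolon-separated layout of the sample message above, where the last four fields are state, state type, attempt and command; `list_event_handler_runs` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: summarise Event Handler entries from a Nagios log.
# Prints: timestamp state state-type attempt command
list_event_handler_runs() {
    local log_file="${1:-/usr/local/nagios/var/nagios.log}"
    # Fields are separated by ';'. Lines with fewer than five fields are skipped.
    awk -F';' '/EVENT HANDLER:/ && NF >= 5 {
        split($1, head, " ")            # head[1] is the "[epoch]" timestamp
        print head[1], $(NF-3), $(NF-2), $(NF-1), $NF
    }' "$log_file"
}
```

Each output line shows when a handler was invoked, the state that triggered it and the command that was run.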

Finally, if you see a log message similar to the one shown below:

[1224616753] Warning: Attempting to execute the command "/usr/local/nagios/libexec/eventhandlers/event_handler_for_tcpip" resulted in a return code of 127.  Make sure the script or binary you are trying to execute actually exists...

then check that the script exists at the path shown in the message and that its permissions and ownership are correct (see above for setting permissions and ownership).
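In POSIX shells, a return code of 127 conventionally means the command was not found, and 126 means it was found but could not be executed. The checks can be sketched as below. `check_handler` is a hypothetical helper, `stat -c` is the GNU coreutils form, and the function should be run as the 'nagios' user so that the execute test reflects what the monitoring process sees.

```shell
#!/bin/bash
# Hypothetical helper: report the most likely reason a handler cannot be run.
check_handler() {
    local script="$1"
    if [ ! -e "$script" ]; then
        echo "missing: $script does not exist"
        return 1
    elif [ ! -f "$script" ]; then
        echo "not a regular file: $script"
        return 1
    elif [ ! -x "$script" ]; then
        echo "not executable: $script"
        return 1
    fi
    # stat -c is the GNU coreutils form; it prints owner:group and octal mode
    echo "ok: $script ($(stat -c '%U:%G %a' "$script"))"
}
```

A result of "ok" with an owner other than nagios:nagios still calls for the ownership to be corrected.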
