Opsview Knowledge Center

SNMP Traps

Brief Overview of SNMP Traps in Opsview Monitor

What are SNMP Traps?

With SNMP Polling, the Network Management Station (NMS), in this case Opsview Monitor, must poll various objects on various devices for information. This can take a long time to configure and fine-tune and, in very large environments, can consume a large amount of computing power.

The alternative is called 'SNMP Traps'. With SNMP Traps, instead of the router (for example) being polled for information on a regular basis by Opsview Monitor, the router itself will let Opsview Monitor know of any problems or issues via a 'trap'.

SNMP Traps vs SNMP Polling

The above illustration shows how an Opsview Monitor server will regularly poll a Host for information, whether or not there is a problem on that Host. In the example below it, Opsview Monitor sits 'listening' for SNMP Traps; when the Host encounters an issue, it sends a trap to tell Opsview Monitor, which in turn changes a Service Check to the 'CRITICAL' or 'WARNING' state based on the rules you define.

Hosts can usually be configured to send specific types of trap such as link status changes, BGP, HSRP and many others, making this a flexible monitoring option.

From a host perspective, traps can be configured to be sent to either a master server or one of any number of slaves. If a slave receives an SNMP Trap, it will be sent back to the Opsview Monitor master server via a passive check result. In a slave cluster, you should configure all your hosts to send traps to all the slave nodes in the cluster; Opsview will only forward the trap to the master from the current slave node for this host.

Once the results are received on the master server, they are processed using a Perl-based rules engine, allowing you to match specific traps from devices on your network and generate appropriate alerts. In order to do this, SNMP traps must be passed from the operating system to Opsview.

SNMP Trap Initial Setup

Debian/Ubuntu Operating Systems

First, install snmpd and snmptrapd using:

apt-get install snmpd

Once installed, edit /etc/snmp/snmpd.conf and ensure that the following line is uncommented:

master agentx

Next, edit /etc/default/snmpd and set:

SNMPDRUN=yes
TRAPDRUN=yes
TRAPDOPTS='-t -m ALL -Oa -M +/usr/local/nagios/snmp/load -p /var/run/snmptrapd.pid'
SNMPDOPTS='-u nagios -Lsd -p/var/run/snmpd.pid'
SNMPDCOMPAT=yes

Next, edit /etc/snmp/snmptrapd.conf and ensure it contains only the following uncommented lines:

traphandle default /usr/local/nagios/bin/snmptrap2nagios
disableAuthorization yes
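
As a rough aid for the "only the following uncommented lines" requirement, the sketch below lists the active (non-blank, non-comment) lines of a snmptrapd.conf and compares them with the two lines above. It is not part of Opsview; the function names and the sample text are illustrative assumptions, and it works on a string rather than reading /etc/snmp/snmptrapd.conf directly.

```python
# Minimal sketch: check that a snmptrapd.conf has exactly the expected active lines.
# Function names and sample text are hypothetical, for illustration only.
EXPECTED = [
    "traphandle default /usr/local/nagios/bin/snmptrap2nagios",
    "disableAuthorization yes",
]


def active_lines(config_text):
    """Return non-blank, non-comment lines with runs of whitespace collapsed."""
    lines = []
    for raw in config_text.splitlines():
        stripped = raw.strip()
        if stripped and not stripped.startswith("#"):
            lines.append(" ".join(stripped.split()))
    return lines


def config_matches(config_text):
    """True when the active lines are exactly the expected ones (order ignored)."""
    return sorted(active_lines(config_text)) == sorted(EXPECTED)


sample = """# snmptrapd.conf
traphandle default /usr/local/nagios/bin/snmptrap2nagios
disableAuthorization   yes
"""
print(config_matches(sample))
```

To check a real file, read its contents into a string first and pass that to config_matches.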

Once completed, restart the snmpd daemon as below:

service snmpd restart

Confirm that the snmptrapd process is running:

ps -ef | grep snmptrapd

RHEL and CentOS Operating Systems

First, install the following RPMs:

  • lm_sensors
  • net-snmp-libs
  • net-snmp
  • net-snmp-devel
  • net-snmp-perl

Once installed, edit /etc/snmp/snmpd.conf and uncomment the line 'master agentx'. Next, edit /etc/sysconfig/snmptrapd and ensure the following line is set:

  OPTIONS="-t -m ALL -Oa -M +/usr/local/nagios/snmp/load -p /var/run/snmptrapd.pid"

Next, edit /etc/sysconfig/snmpd and add / modify the following line:

  OPTIONS="-u nagios -Lsd -p /var/run/snmpd.pid"

Finally, edit /etc/snmp/snmptrapd.conf to contain the following configuration:

  traphandle default /usr/local/nagios/bin/snmptrap2nagios
  disableAuthorization yes

Once completed, start the snmpd and snmptrapd daemons:

  sudo chkconfig snmpd on
  sudo chkconfig snmptrapd on
  service snmpd start
  service snmptrapd start

Testing

Once the operating system is configured, test the setup to confirm that traps can be received. To do this, run the command below:

snmptrap -v 2c -c public localhost "" SNMPv2-MIB::coldStart sysContact.0 s "Hello World"

Note that since we have disabled authorization to snmptrapd, the SNMP community string is not validated. If successful, there should be an entry written to /usr/local/nagios/var/snmptrap2nagios.exception.log. This file will be read by Opsview Monitor every five minutes and imported into the database.

cat /usr/local/nagios/var/snmptrap2nagios.exception.log
14418132785
localhost
UDP: [127.0.0.1]:43294->[127.0.0.1]
DISMAN-EXPRESSION-MIB::sysUpTimeInstance 1:2:49:52.60
SNMPv2-MIB::snmpTrapOID.0 SNMPv2-MIB::coldStart
SNMPv2-MIB::sysContact.0 Hello World#---next trap---#
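
The log entry above can be split into its parts with a few lines of Python. This is a sketch only: it assumes, based solely on the sample shown, that each record consists of a timestamp line, a hostname line and a transport line, followed by one "name value" pair per line, and that records end with the "#---next trap---#" marker. The function name is hypothetical and the real file format may differ.

```python
# Minimal sketch: parse records shaped like the sample exception log entry.
# The record layout is inferred from the sample only and is an assumption.
SEPARATOR = "#---next trap---#"


def parse_exception_log(text):
    """Return a list of dicts, one per trap record found in the text."""
    traps = []
    for chunk in text.split(SEPARATOR):
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue  # empty tail after the last separator
        varbinds = []
        for ln in lines[3:]:
            # Split on the first space: variable name, then its value
            name, _, value = ln.partition(" ")
            varbinds.append((name, value))
        traps.append({
            "timestamp": lines[0],
            "host": lines[1],
            "transport": lines[2],
            "varbinds": varbinds,
        })
    return traps


sample = (
    "14418132785\n"
    "localhost\n"
    "UDP: [127.0.0.1]:43294->[127.0.0.1]\n"
    "DISMAN-EXPRESSION-MIB::sysUpTimeInstance 1:2:49:52.60\n"
    "SNMPv2-MIB::snmpTrapOID.0 SNMPv2-MIB::coldStart\n"
    "SNMPv2-MIB::sysContact.0 Hello World#---next trap---#\n"
)
for trap in parse_exception_log(sample):
    print(trap["host"], trap["varbinds"])
```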

SNMP Traps Within Opsview Monitor

Each Service Check configured to accept SNMP traps has an ordered list of rules. Each rule is evaluated in turn. If a rule is false, then the next rule is evaluated. If a rule matches as true, the specified action is taken and no more rules for that Service Check are evaluated.

An action could be either:

  • submit a passive check result to Opsview Monitor with an appropriate message, or
  • do nothing, and thus stop processing of any further rules

If the incoming trap does not evaluate to true for any rules, then it becomes an exception and will appear on the SNMP Trap Exceptions page. This is required so that an administrator is aware that the rules need tuning to cater for this particular trap.
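
The first-match behaviour described above can be sketched in Python as follows. This only illustrates the evaluation order and is not Opsview's implementation (the real rules are Perl snippets). The rule names, the dictionary-based trap and the "alert", "stop" and "exception" labels are assumptions made for this example.

```python
# Minimal sketch of ordered, first-match rule evaluation.
# Rule names, trap structure and action labels are hypothetical.
def evaluate_rules(rules, trap):
    """Return (rule_name, action) for the first matching rule.

    If no rule matches, the trap is treated as an exception.
    """
    for name, matches, action in rules:
        if matches(trap):
            return name, action  # later rules are not evaluated
    return None, "exception"


rules = [
    ("Ignore cold starts",
     lambda t: t["TRAPNAME"] == "SNMPv2-MIB::coldStart",
     "stop"),
    ("Link state change",
     lambda t: t["TRAPNAME"] in ("IF-MIB::linkUp", "IF-MIB::linkDown"),
     "alert"),
]

print(evaluate_rules(rules, {"TRAPNAME": "IF-MIB::linkDown"}))
print(evaluate_rules(rules, {"TRAPNAME": "SNMPv2-MIB::warmStart"}))
```

Because evaluation stops at the first match, a broad rule placed above a narrow one will prevent the narrow rule from ever firing, which is why rule order matters.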

When a trap is received, it contains information about the source IP. This is associated with a Host. A Host can have more than one SNMP trap Service Check defined. In this case, each Service Check is evaluated independently of the others. The illustration below shows a trap being evaluated in four Service Checks, represented as columns.

These columns are not ordered, so there is no guarantee which Service Check column will be evaluated first. A consequence of multiple Service Checks is that a single trap could raise multiple alerts to Opsview Monitor. However, there will only ever be one SNMP Trap Exception.

One example of using multiple Service Checks is if you wanted a Service Check to show interface status, with another Service Check alerting on error log messages.

Note: Traps received will have the SNMP community value hidden so that passwords are not stored on the file system.

SNMPv3 Traps

Abstract

SNMP messages come in two major flavors: GETs and TRAPs. From the Opsview Monitor point of view, an SNMP GET is when the monitoring server requests a piece of information from a Host. An SNMP Trap is when the Host tells Opsview Monitor that an event has happened. For example, network devices can send messages about ports going offline or online, or about bandwidth on a particular link reaching a specified level; servers can send TRAPs about someone logging onto or off a server, or when a new connection is made to a service (see the manuals for your particular Host). All devices should be able to send a message about a power-on event (i.e. when the system is booting up).

There are a number of versions of SNMP. SNMPv1 was first released in 1988 and has been improved over time: SNMPv2 (and v2c) improved security and amended the message format, and SNMPv3 improved security and authentication even further.

It is relatively easy to configure Opsview Monitor to handle SNMPv1 and SNMPv2c, but SNMPv3 is more complicated to set up due to the extra security involved. This article will concentrate on SNMPv3 as previous versions are covered elsewhere in our documentation.

Background

SNMPv3 Message Types:

SNMPv1 had one type of message: the TRAP. This is a message sent from a device with no response expected. SNMPv2c and SNMPv3 use TRAPs but also introduce a new message type: the INFORM (which was reworked further in v3). The basic difference between the two is that INFORMs must be acknowledged by the receiving device. If the message is not acknowledged, it is resent after a period of time.
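
The practical difference between the two message types can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the receiver, the function names and the retry limit are hypothetical, and timers are left out (a real sender waits for a timeout before resending).

```python
# Minimal simulation of TRAP (fire and forget) versus INFORM (resend until acknowledged).
# All names and the retry limit are hypothetical; timeouts are not modelled.
def make_lossy_receiver(drop_first):
    """Return (deliver, received) for a receiver that drops the first N messages."""
    received = []
    seen = {"count": 0}

    def deliver(message):
        seen["count"] += 1
        if seen["count"] <= drop_first:
            return False  # message lost, so no acknowledgement
        received.append(message)
        return True  # acknowledgement returned to the sender

    return deliver, received


def send_trap(deliver, message):
    """TRAP: send once and ignore whether it arrived."""
    deliver(message)


def send_inform(deliver, message, max_attempts=3):
    """INFORM: resend until acknowledged; return attempts used, or None."""
    for attempt in range(1, max_attempts + 1):
        if deliver(message):
            return attempt
    return None


deliver, received = make_lossy_receiver(drop_first=1)
send_trap(deliver, "linkDown")
print(received)  # the single TRAP was dropped and is never resent

deliver, received_informs = make_lossy_receiver(drop_first=1)
print(send_inform(deliver, "linkDown"), received_informs)
```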

SNMP Trap Processing:

A daemon must be set up and running on the Master and Slave Nodes to receive the TRAPs and INFORMs. After receiving one and doing some initial processing, the daemon passes the messages into Opsview Monitor. Opsview Monitor will handle both TRAPs and INFORMs in the same way; it does not hold information to differentiate between them.

SNMPv3 Security:

Security in SNMPv3 is handled by creating users. Each user may have:

  • A name (securityName)
  • An authentication protocol (authProtocol)
  • An authentication key (authKey)
  • A privacy type (privProtocol)
  • A privacy key (privKey)

Authentication uses the user's authKey to sign the message being sent with the authProtocol (MD5 or SHA).

Messages are then encrypted using the user's privKey with the privProtocol (AES or DES).

Messages are sent using one of the following securityLevel levels:

  • Unauthenticated (noAuthNoPriv)
  • Authenticated (authNoPriv)
  • Authenticated and Encrypted (authPriv)
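
The relationship between a user's keys and the three security levels can be expressed as a small helper. This is an illustrative sketch, not an Opsview or Net-SNMP function; the function name and arguments are assumptions. It also encodes the SNMPv3 restriction that privacy cannot be used without authentication.

```python
# Minimal sketch: derive the securityLevel from which keys a user has.
# The function name and arguments are hypothetical.
def security_level(auth_key=None, priv_key=None):
    """Return noAuthNoPriv, authNoPriv or authPriv for the given keys."""
    if priv_key and not auth_key:
        # SNMPv3 has no "privacy without authentication" level
        raise ValueError("a privacy key requires an authentication key")
    if auth_key and priv_key:
        return "authPriv"
    if auth_key:
        return "authNoPriv"
    return "noAuthNoPriv"


print(security_level())
print(security_level(auth_key="myPassword"))
print(security_level(auth_key="myPassword", priv_key="myPassphrase"))
```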

SNMP Daemon Configuration

SNMP Daemon configuration for all versions of TRAPs is normally held in /etc/snmp/snmptrapd.conf, but this location may differ between some platforms. In order to receive SNMPv3 TRAPs, a User must be created (for authentication) with the appropriate role (authorization).

Creating Users to receive TRAPs/INFORMs is done in the format:

createUser -e <engineID> <securityName> <authProtocol> <authKey> <privProtocol> <privKey>

The "-e <engineID>" may be omitted if only INFORMs are being received. For TRAPs it must match the configuration on the device sending them.

Example:

createUser myInformUser SHA myPassword AES myPassPhrase
createUser -e 0x0011223344 myTrapUser MD5 myPassword DES myPassPhrase
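
If many users need to be defined, the createUser lines can be generated rather than typed by hand. The sketch below is a hypothetical helper, not part of Opsview or Net-SNMP. It follows the field order of the examples above (name, authentication protocol, authentication key, privacy protocol, privacy key) and only accepts the protocols named in this article (MD5 or SHA, AES or DES).

```python
# Minimal sketch: build a createUser line for snmptrapd.conf.
# The helper name and its arguments are hypothetical.
AUTH_PROTOCOLS = {"MD5", "SHA"}
PRIV_PROTOCOLS = {"AES", "DES"}


def create_user_line(name, auth_protocol, auth_key,
                     priv_protocol, priv_key, engine_id=None):
    """Return a createUser line, with -e <engineID> first when one is given."""
    if auth_protocol not in AUTH_PROTOCOLS:
        raise ValueError("unsupported authProtocol: " + auth_protocol)
    if priv_protocol not in PRIV_PROTOCOLS:
        raise ValueError("unsupported privProtocol: " + priv_protocol)
    parts = ["createUser"]
    if engine_id:
        parts += ["-e", engine_id]
    parts += [name, auth_protocol, auth_key, priv_protocol, priv_key]
    return " ".join(parts)


print(create_user_line("myInformUser", "SHA", "myPassword", "AES", "myPassPhrase"))
print(create_user_line("myTrapUser", "MD5", "myPassword", "DES", "myPassPhrase",
                       engine_id="0x0011223344"))
```

Rejecting unknown protocol names early catches typos before snmptrapd is restarted with a broken configuration.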

Note: The engine ID can be retrieved from the device sending the traps. On Cisco IOS devices this is usually:

# show snmp engineid

Authorization is handled with authUser tokens.

Example:

authUser log,exec myInformUser authPriv

To configure Opsview Monitor for receiving SNMPv3 TRAPs and INFORMs sent as the user 'opsview', the following configuration may be applied to snmptrapd.conf:

createUser -e 0x80706050 opsview SHA myPassword AES myPassphrase
authUser log,exec opsview authPriv
traphandle default /usr/local/nagios/bin/snmptrap2nagios

Be aware that the following line will allow TRAPs to be received without any authorization:

disableAuthorization yes

Configuration Testing

After the SNMP trap daemon snmptrapd has been restarted, the configuration can be tested on the Opsview Master or Slave Node as follows:

snmpinform -v3 -u opsview -a SHA -A myPassword -x AES -X myPassphrase -l authPriv localhost 1 0

and

snmptrap -e 0x80706050 -v3 -u opsview -a SHA -A myPassword -x AES -X myPassphrase -l authPriv localhost 1 0

To test from a different server the reference to 'localhost' should be changed to the Opsview Master or Slave Node hostname or IP address.
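
When the same test has to be repeated against several Master or Slave Nodes, the command line can be assembled programmatically. The sketch below only builds and prints the command and does not run it. The helper name is hypothetical, and the flags are exactly those used in the two test commands above.

```python
# Minimal sketch: assemble the SNMPv3 test commands shown above for a given target.
# The helper name is hypothetical; nothing is executed.
import shlex


def v3_test_command(tool, target, user, auth_key, priv_key, engine_id=None):
    """Return the argument list for snmptrap or snmpinform with authPriv."""
    argv = [tool]
    if engine_id:
        argv += ["-e", engine_id]
    argv += ["-v3", "-u", user,
             "-a", "SHA", "-A", auth_key,
             "-x", "AES", "-X", priv_key,
             "-l", "authPriv", target, "1", "0"]
    return argv


inform = v3_test_command("snmpinform", "localhost", "opsview",
                         "myPassword", "myPassphrase")
trap = v3_test_command("snmptrap", "localhost", "opsview",
                       "myPassword", "myPassphrase", engine_id="0x80706050")
print(shlex.join(inform))
print(shlex.join(trap))
```

The returned list can be passed to subprocess.run on a machine where the Net-SNMP tools are installed.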

After a few moments the message will be passed to Opsview Monitor and (if no rules are yet set up) recorded in the 'SNMP TRAP exceptions list'. The message may also be logged to syslog if snmptrapd is configured to do so.

References

There is more information about SNMP in general at https://en.wikipedia.org/wiki/Simple_Network_Mana...

Configuring a New SNMP Trap

To configure a new SNMP Trap, navigate to Service Checks; this is located within the 'Settings' tab in the overlay menu, as shown below:

Menu with 'Service Checks' highlighted

Once within the Service Checks window, click on the 'Add New' button in the top level, and then click on 'SNMP Trap':

'Add New > SNMP Trap' within Service Checks window

Once 'SNMP Trap' has been clicked a window similar to the one below will load:

New SNMP Polling Service Check

The window is split into two tabs:

  • Details: This is where you can configure various Service Check related fields, such as the name, description, its Service Group, its Host templates and more
  • Trap rules: The SNMP trap-specific tab, where trap rules can be added and ordered.

Details Tab: 'Basic'

The Details tab is split into two drawers, 'Basic' and 'Advanced'.

The items within 'Basic' are the most commonly-used fields for Service Check configuration:

  • Name: The name of the Service Check, e.g. 'Cisco 3750 Stack configuration status'.
  • Description: A friendly description of the Service Check, e.g. 'A custom SNMP check that returns the status of the switch in the context of its stack configuration. Apply this to all stacked Cisco 3750s.'
  • Service group: Covered in Section 4.4.2, a Service Group is a container for one or more Service Checks and is used for alerting and access control, among other things.
  • Host templates: Covered in Section 4.4.3, a Host template can contain one or more Service Checks from any Service Group. While a Service Check can only ever belong to one Service Group, it can belong to as many Host templates as you desire.

Details Tab: 'Advanced'

The items within 'Advanced' are the less used, more 'advanced' Service Check options:

  • Hashtags: The Hashtags which this Service Check will belong to, when applied to one or more Hosts.
  • Globally applied hashtags: If the Service Check has been added to a Hashtag via the 'Settings > Hashtags' section instead of the selection box above, then the Hashtags will be listed here. To remove the Service Check from the Hashtag listed here, you should edit the Hashtag within 'Settings > Hashtags'.
  • Dependencies: Dependencies allow you to set a parent/child relationship for the Service Check, e.g. for this SNMP polling check, we may choose to have a parent Service Check of 'TCP Port 161'. This means that if the Service Check 'TCP Port 161' changes to a CRITICAL state (i.e. SNMP is down), then this Service Check and all other Service Checks that are children of that parent will change to an UNKNOWN state and will not resume normal running until the parent Service Check returns to an 'OK' state. This not only reduces the workload of the Opsview Monitor server but also reduces alerts; Opsview Monitor will only alert for the 'TCP Port 161' failure and not for all of its dependent children.
  • Notify for service on: This section determines which states the Service Check should notify on, e.g. only on 'CRITICAL' or 'UNKNOWN'. Note: If a Host does not notify on any states, then the Service Checks on that Host will also not send any Notifications.
  • Notification period: This field uses the 'Time Periods' already defined within the Opsview Monitor system, and determines when Notifications are allowed to be sent to Users.
  • Re-notification interval: This field determines the period of time (in hours, minutes or seconds) after which a Notification is re-sent if the Host is still unhandled (i.e. the problem has not been ACKNOWLEDGED). If this is set to '0', only the first Notification is sent (when the Host changes to the 'HARD' state).
  • Create Multiple Services: If a Variable is selected within this drop-down, a new Service Check will be added for each Variable of the selected type, with the Variable's value appended to the Service Check name. For example, if we have 'Disk Capacity' as a Service Check with '%DISK%' selected in the 'Create Multiple Services' drop-down, and four Variables are added via the 'Variables' tab, then four Service Checks will be added: 'Disk Capacity: Value1', 'Disk Capacity: Value2', and so forth.
  • Flap Detection: A service is considered flapping if its state changes too much. If this option is set, any services will be checked for this flapping condition and an icon will appear for the service and Notifications will be temporarily disabled until the service comes out of a flapping state. We recommend that flap detection is enabled for active checks. However if you find a service is flapping frequently, there is probably another issue that needs investigating. We recommend that flap detection is disabled for passive checks.
  • Sensitive arguments: If the Service Check is a plugin-based one, then the Sensitive Arguments checkbox allows you to determine if the arguments for the Service Check are displayed within the 'Test Service Check' tab within the investigate mode. If the flag is checked, the arguments will be hidden; if unchecked, the arguments will be shown. If you have TESTCHANGE set within your Role, you will be able to modify the arguments before testing the Service Check.
  • Record Output Changes: Normally, the output of a Service Check is only recorded when the state of that service changes. For example, assuming a new check has been set up:

    State      Output                  Output Recorded
    OK         Service OK: 10%         Yes
    OK         Service OK: 15%         No
    OK         Service OK: 15%         No
    OK         Service OK: 20%         No
    CRITICAL   Service warning: 80%    Yes
    CRITICAL   Service warning: 75%    No
    WARNING    Service warning: 70%    Yes
    WARNING    Service warning: 40%    No
    WARNING    Service warning: 40%    No
    OK         Service OK: 20%         Yes
    OK         Service OK: 18%         No

This option instead causes every change of output to be logged regardless of change of state (for the selected state changes). For example, for the same sequence above with OK and WARNING selected:

    State      Output                  Output Recorded
    OK         Service OK: 10%         Yes
    OK         Service OK: 15%         Yes
    OK         Service OK: 15%         No
    OK         Service OK: 20%         Yes
    CRITICAL   Service warning: 80%    Yes
    CRITICAL   Service warning: 75%    No (CRITICAL option was not selected)
    WARNING    Service warning: 70%    Yes
    WARNING    Service warning: 40%    Yes
    WARNING    Service warning: 40%    No
    OK         Service OK: 20%         Yes
    OK         Service OK: 18%         Yes

  • Alert every failure: This option forces a Notification to be sent on every check in a non-OK state. This is useful if you have a passive Service Check which receives results.
    There are three states for this option:
    • Disabled: only get alerts on state changes
    • Enabled: get alerts for every failed state. This overrides the re-notification interval option
    • Enabled with re-notification interval: get alerts for every failed state as long as the re-notification interval has passed. This is useful if you get a lot of results in quick succession.

Note: The Notification number will increase for every non-OK result and only gets reset to zero when an OK state is received.

  • Event handler: Covered in greater detail in the 'Event handler' section of the User Guide, Event handlers are scripts that can be triggered when a Service Check goes into or recovers from a problem state, such as 'WARNING' or 'CRITICAL'. The script can do anything you like, but a common usage includes restarting a service or server (virtual machine, for example) via an API.
  • Markdown filter: If this option is chosen, then the service output will be filtered through the Markdown plugin. This allows you to mark up the output with bold, italics and URL links. For instance, if the output is:
    Disk failure on sd1 - see [internal wiki](http://opsview.org)
    
    This will be displayed as: Disk failure on sd1 - see internal wiki

Use http://daringfireball.net/projects/markdown/dingus to test your plugin output. Bear in mind that you cannot use the pipe symbol as Nagios Core interprets this as the start of performance data.
Also, < and > characters are converted to the HTML entities so you cannot embed other HTML tags, and finally you should keep to only one line due to NSCA limitations in a distributed environment.
Therefore, you should stick to using just bold, italics and links in your output.
Note: If your plugin returns HTML output, this will be displayed as the text. You must use markdown format if you want to use links.

  • Check Freshness: If you are receiving passive results, you may want to check that you are getting results within a certain timeframe. From Opsview Monitor 3.5.1, you can configure this to take an action. You can enable freshness checking, which means that if this service has not been updated for this amount of time, then Nagios Core will force a stale result for the service based on the configuration. There are two actions that can be taken:
    • Resend Notifications
    • Submit Result

Note: Due to a limitation in Nagios Core, only one of these actions can be chosen.

  • Resend Notifications
    When a passive check is received, there are normally no others that follow. The status on screen will show this last state.

    An alert will also be raised through the usual mechanism. However, you do not get a re-notification unless the service fails again.

    If this option is selected, then we implement this Service Check with a freshness check that just submits the same result back to Nagios Core. The freshness threshold is set to the Notification interval so it looks like the service has received the same result again. However, a side effect of this is that if the passive check is run on a slave, the master will not get a stale result for this service. Do not enable this option if you expect regular passive results to arrive.

Note: This feature is available in Opsview Monitor 3.5.0, but the user interface options will change in Opsview Monitor 3.5.1 onwards.

  • Submit Result
    This submits a result back into Nagios Core if the freshness timeout value has been reached, so you can either change the state to display an error or perhaps reset the state of a service back to OK.

  • Freshness Timeout
    This is the amount of time before Nagios Core considers a service to be not fresh. You can enter this value in a duration format, such as 10m for 10 minutes or 48h 15m for 48 hours and 15 minutes.

Note that due to the way that Nagios Core calculates this value, the stale action will run a few minutes after this timeout value.

  • Stale State
    Choose the appropriate state that you want the service to change to when it passes the freshness threshold. You may want to automatically set a service back to OK after 1 hour for certain types of checks.
  • Stale Text
    Choose what text you want to set as the output. The text will be added to the end of the state phrase (OK, WARNING, CRITICAL, UNKNOWN).
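
To illustrate the duration format accepted by the Freshness Timeout field, the sketch below converts values such as '10m' or '48h 15m' into seconds. It is not Opsview's parser: the function name is hypothetical, and only the 'h' and 'm' units appear in the text above, so support for 's' (seconds) is an assumption.

```python
# Minimal sketch: convert a duration such as "48h 15m" into seconds.
# Only "h" and "m" are shown in the text above; "s" is an assumption.
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Return the total number of seconds for a space-separated duration."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty duration")
    total = 0
    for token in tokens:
        match = re.fullmatch(r"(\d+)([hms])", token)
        if match is None:
            raise ValueError("unrecognised duration part: " + token)
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
    return total


print(parse_duration("10m"))
print(parse_duration("48h 15m"))
```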

Trap Rules Tab

Once you have configured the relevant options within the 'Details' tab, you can click on the trap rules tab:

The main user journey for the Trap rules tab is:

  • Add the new rule
  • Configure the new rule
  • Re-order as appropriate (See the image in Section SNMP Traps Within Opsview Monitor to understand why the order is important).

In the example below, we will create a Service Check that generates alerts for a Cisco router when a link state change is detected (e.g. a link goes down).

To do this, we need to add three rules to the 'Trap rules' tab as below:

The fields within a Trap rule are:

  • Name: Functionally insignificant, this field should be set so that the trap rule's purpose is clear if the list of rules is reviewed at a later stage.
  • Rule: Accepts snippets of Perl code which are used to match the contents of a trap and hence trigger the action. The image above shows an example rule which is used to match the link status change to 'DOWN'. The rule will match when the name of the trap (in the variable ${TRAPNAME}) is set to either 'IF-MIB::linkUp' or 'IF-MIB::linkDown', since the code must return 'true' in order for the trap to be considered a match.
  • Action: What to do if the rule is met. 'Stop processing' means do not process any further rules, i.e. ones 'below'. 'Send Alert' enables the 'Alert level:' and 'Message' boxes.
  • Message: This determines the text of the alert that will be generated if the rule matches. The macros listed below the field allow us to create a message which contains useful and specific information about the event that has occurred. The syntax ${Px} and ${Vx} allows you to refer to specific lines in the trap contents, and we can use the information obtained from the SNMP Trap Exceptions page to find out what is contained on each line of the trap (image 1). Here we have selected line 6 (the name of the interface) and line 8 (the new state of that interface) in order to create a useful message.
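
The way the Message macros are filled in can be sketched in Python. This is an illustration, not Opsview's implementation. It assumes that ${Px} refers to the variable name on line x of the trap and ${Vx} to the value on that line; the text above only says that both refer to specific lines. The trap contents below are hypothetical placeholders.

```python
# Minimal sketch: expand ${Px} and ${Vx} macros in a trap rule message.
# Assumption: P = variable name on line x, V = value on line x (1-based).
import re


def expand_message(template, varbinds):
    """Replace ${Pn} and ${Vn} using a list of (name, value) pairs."""
    def substitute(match):
        kind, index = match.group(1), int(match.group(2))
        if not 1 <= index <= len(varbinds):
            return match.group(0)  # leave unknown lines untouched
        name, value = varbinds[index - 1]
        return name if kind == "P" else value

    return re.sub(r"\$\{([PV])(\d+)\}", substitute, template)


# Hypothetical eight-line trap: line 6 holds the interface, line 8 its new state
varbinds = [("EXAMPLE-MIB::item%d" % i, "value%d" % i) for i in range(1, 9)]
varbinds[5] = ("EXAMPLE-MIB::ifName", "Serial0/0")
varbinds[7] = ("EXAMPLE-MIB::ifState", "administratively down")

print(expand_message("Interface ${V6} has changed state to ${V8}", varbinds))
```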

Once the rule has been configured and the changes submitted, the Service Check can then be applied to a Host so that alerts will be generated if this Service Check's rules match a trap received from that Host. See Service Checks Tab for guides on how to add the newly-created Service Check to a Host.

After applying the new Service Check to a router called TestRouter, and setting the interface Serial0/0 to 'down', the following service alert is generated:

[13-11-2006 16:44:47] SERVICE ALERT: TestRouter;Cisco - Link State Change;WARNING;HARD;1;Interface Serial0/0 has changed state to "administratively down"

More examples of rule variables

Suppose you have the following trap information:

SNMPv2-MIB::sysUpTime.0 4:20:49:47.73
SNMPv2-MIB::snmpTrapOID.0 CISCO-CONFIG-MAN-MIB::ciscoConfigManEvent
CISCO-CONFIG-MAN-MIB::ccmHistoryEventCommandSource.45 1
CISCO-CONFIG-MAN-MIB::ccmHistoryEventConfigSource.45 2
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.10.20
SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "public"

In this example, the name of the trap is "CISCO-CONFIG-MAN-MIB::ciscoConfigManEvent" and can be referenced in trap rules either by ${TRAPNAME} or by another type of syntax that uses the ${NAME_OF_TRAP_ITEM_WITHOUT_INDEX} format. In this case, this would be ${SNMPv2-MIB::snmpTrapOID}.

If you wanted to reference the snmpTrapAddress or snmpTrapCommunity fields, you could use ${SNMP-COMMUNITY-MIB::snmpTrapAddress} or ${SNMP-COMMUNITY-MIB::snmpTrapCommunity} in a similar fashion.

Note that the index and its preceding dot are removed from the variable name.
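
The naming convention can be expressed as a small helper that turns a trap item name into its rule variable. This is a sketch with a hypothetical function name. It only strips a single trailing numeric index (such as '.0' or '.45'), which covers the items shown above; how longer, multi-part indexes are handled is not covered here.

```python
# Minimal sketch: derive the rule variable for a trap item by dropping its index.
# Only a single trailing numeric index is handled; the function name is hypothetical.
def rule_variable(item_name):
    """Return the ${...} form of a trap item name without its index."""
    base, dot, index = item_name.rpartition(".")
    if dot and index.isdigit():
        return "${" + base + "}"
    return "${" + item_name + "}"


print(rule_variable("SNMPv2-MIB::snmpTrapOID.0"))
print(rule_variable("CISCO-CONFIG-MAN-MIB::ccmHistoryEventCommandSource.45"))
print(rule_variable("SNMP-COMMUNITY-MIB::snmpTrapAddress.0"))
```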
