These docs are for version 6.3, which is no longer officially supported. The latest version is 6.7.

## Overview

This page lists important concepts and ideas you should understand to make full use of Opsview.

## Hosts and Services

A service is something important to you whose status you want to know. Services can be "active" (checked on a regular basis) or "passive" (waiting to be given data). This document focuses only on active Service Checks.

Every Service (also called a _Check_ or a _Service Check_) has a status, one line of output and (optionally) some performance data.

Hosts are a logical grouping for a set of services (normally associated with a single device on a network). Services have to be associated with a host - they cannot exist without a host.

Services are regularly checked based upon their check interval. Each Service is checked independently of other Services and all have their own timing schedules.

Hosts are also checked. A Host is checked on a regular basis if its host check interval is defined. The monitoring engine also checks the Host "on-demand" whenever one of its Services changes state.

**Note**: Hosts without any services will **not** be shown in the monitoring status pages.

See more details within the [Hosts, Host Groups and Host Check Commands](🔗) section.

## States

Services have one of 4 possible States:

  • OK - Everything is fine

  • CRITICAL - Something is wrong

  • WARNING - Something may be wrong

  • UNKNOWN - There is some internal error with the check such as incorrect parameters, or there is a dependency failure

The last 3 states are collectively called _Problem States_.

Hosts have one of 3 possible States:

  • UP - Host is okay

  • DOWN - Host has a problem

  • UNREACHABLE - All parents of this host are in a failure state. This is a calculated state based on the parent/child relationship dependency of a Host

If a Host is DOWN, then the Services on the Host will be marked as CRITICAL with the summary text of "Dependency failure: Host X is DOWN". The Services will no longer be checked until the Host has returned to an UP state.

If a Host is UNREACHABLE, then it will be marked with the summary text of "Dependency failure: Host X is DOWN". The Host will not be checked again until at least one of its parents is UP. None of the Services on the Host will be checked until the Host returns to an UP state.
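The host state rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not how the monitoring engine is implemented; `derive_host_state`, `service_summary` and their parameters are hypothetical names.

```python
from enum import Enum


class HostState(Enum):
    UP = "UP"
    DOWN = "DOWN"
    UNREACHABLE = "UNREACHABLE"


def derive_host_state(check_ok, parent_states):
    """Return a host's state from its own check result and its parents' states."""
    if check_ok:
        return HostState.UP
    # A failing host is UNREACHABLE only if it has parents and all of them are failing.
    if parent_states and all(p is not HostState.UP for p in parent_states):
        return HostState.UNREACHABLE
    return HostState.DOWN


def service_summary(host_name, host_state):
    """Return the summary text applied to services on a DOWN host, else None."""
    if host_state is HostState.DOWN:
        return f"Dependency failure: Host {host_name} is DOWN"
    return None
```

For example, a failing host whose only parent is DOWN is reported as UNREACHABLE, while a failing host with at least one UP parent is reported as DOWN.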

## Plugins

All active checks use a plugin. The plugin contains the actual logic for checking something and determining its status. For example, a plugin will know how to communicate with a DNS server, how to query free filesystem space, or how to fetch a web page.

The same plugin can be used many times for different services. It takes parameters to determine what to check or what the threshold levels are.

The parameters available are dependent on the plugin used.

After a plugin has run, it must return a status code to Opsview, which maps to one of the OK, WARNING, CRITICAL or UNKNOWN statuses. It must also return some summary text.

The plugin may also return some optional performance data which Opsview will record and can later be used in performance graphing.

Opsview supports Nagios-compatible plugins.
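As a rough illustration, a Nagios-compatible plugin reports its status through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and prints one line of summary text, optionally followed by a pipe character and performance data. The sketch below checks used disk space in Python; the function names, the `DISK` prefix, the `used_pct` label and the default thresholds are assumptions made for this example, not part of Opsview.

```python
import shutil

# Exit codes used by Nagios-compatible plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def render(path, used_pct, warn_pct, crit_pct):
    """Map a measurement to (exit_code, output_line) using the given thresholds."""
    if used_pct >= crit_pct:
        code = CRITICAL
    elif used_pct >= warn_pct:
        code = WARNING
    else:
        code = OK
    summary = f"DISK {STATE_NAMES[code]} - {used_pct:.1f}% used on {path}"
    # Optional performance data follows a pipe: label=value[unit];warn;crit;min;max
    perfdata = f"used_pct={used_pct:.1f}%;{warn_pct};{crit_pct};0;100"
    return code, f"{summary} | {perfdata}"


def check_disk(path="/", warn_pct=80, crit_pct=90):
    """Measure used space on the filesystem holding `path` and render the result."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Problems with the check itself map to UNKNOWN, not CRITICAL.
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    return render(path, 100.0 * usage.used / usage.total, warn_pct, crit_pct)


def main():
    code, line = check_disk()
    print(line)   # the single line of output read by the monitoring engine
    return code   # a real plugin would use this value as its exit code
```

To run this as a standalone plugin, the script would end with `raise SystemExit(main())` so that the returned code becomes the process exit status. The same plugin can then serve several services simply by passing different paths and thresholds.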

See more details within the [Active Checks](🔗) section.

## Check Intervals and State Types

The active check for a service is executed at a set frequency (by default, every 5 minutes). This is called the check interval.

Usually, services are in an OK state, showing that the service is stable. However, if a problem occurs and the service changes to a different state, we need to have confidence that this is the correct state. We use _state types_ to express this confidence.

Services can have one of two state types:

  • Hard - when a service has been in a specific state for a number of checks

  • Soft - when a service has just switched to a different state

There are two important parameters to determine the soft and hard state types:

  • retry interval - during a soft state, the next scheduled check will be after this interval, rather than the check interval

  • maximum check attempts - this is the number of times a check has to be in the same state before it becomes a hard state type

Check attempts are displayed in the form 3/5, which means the third check out of a maximum of five before the state becomes hard.

When a service has gone into a hard state type, then the check attempts will revert to 1.

**Note**: If a service changes from one problem state to another, the check attempts are reset.

This same logic also applies for an OK state.

The main reason for state types is that notifications are sent on hard states only. This avoids sending notifications for temporary problems.
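The soft and hard state rules described in this section can be modelled as a small state tracker. The Python sketch below is a simplified reading of those rules and not the monitoring engine's actual code: the `StateTracker` class is hypothetical, it treats OK like any other state (following the sentence above), and it assumes the attempt counter shows 1 again on the first check after a state has become hard.

```python
class StateTracker:
    """Track state, state type and check attempts for one service."""

    def __init__(self, max_check_attempts, initial_state="OK"):
        self.max_check_attempts = max_check_attempts
        self.state = initial_state
        self.state_type = "HARD"
        self.attempt = 1

    def record(self, result):
        """Record one check result.

        Returns (attempts_display, state_type, became_hard).
        """
        became_hard = False
        if result != self.state:
            # Any change of state, including problem to problem, restarts the count.
            self.state = result
            self.attempt = 1
            self.state_type = "SOFT"
        elif self.state_type == "SOFT":
            self.attempt += 1
        else:
            # Already in a hard state: attempts revert to 1.
            self.attempt = 1
        if self.state_type == "SOFT" and self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            became_hard = True  # the point at which a notification may be sent
        display = f"{self.attempt}/{self.max_check_attempts}"
        return display, self.state_type, became_hard
```

With a maximum of 3 check attempts, feeding the results WARNING, WARNING, CRITICAL, CRITICAL, CRITICAL gives `1/3 SOFT`, `2/3 SOFT`, `1/3 SOFT` (the count restarts on the change to CRITICAL), `2/3 SOFT` and finally `3/3 HARD`.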

## Notifications

Notifications are sent on hard state changes only. This means notifications will be sent for hosts or services only when they have been in a particular state for roughly "maximum check attempts * retry interval" amount of time.

Notifications are also sent when a host/service returns to an OK hard state. This is called a hard recovery notification.

Notifications are executed in parallel.

Notifications are suppressed if:

  • the host/service is in downtime (for a planned outage)

  • the host/service is in an acknowledged state (for an unplanned outage)
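Taken together, the notification rules above amount to a simple decision. The following Python sketch is illustrative only; `should_notify` and its parameters are hypothetical names, and real notification handling involves more settings than are covered here.

```python
def should_notify(state_type, hard_state_changed, in_downtime, acknowledged):
    """Decide whether a notification would be sent for a check result."""
    # Only hard state changes (including hard recoveries) trigger notifications.
    if state_type != "HARD" or not hard_state_changed:
        return False
    # Planned outages (downtime) and acknowledged problems suppress notifications.
    if in_downtime or acknowledged:
        return False
    return True
```

For instance, `should_notify("HARD", True, in_downtime=True, acknowledged=False)` returns `False`, because the host or service is in a planned outage.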

See more details within the Notifications section.

## Event Handlers

An event handler is an external script that is executed when a check result is returned. There are three possible options:

  • No event handler defined

  • Event handler, with "Always execute" off - the event handler will execute after every check in a problem state, including the first state change back to OK/UP

  • Event handler with "Always execute" on - the event handler will execute for every check, regardless of state

Event handlers are executed in parallel.
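The three options can be expressed as a small decision function. This Python sketch only mirrors the rules listed above; `should_run_event_handler`, its parameters and the `PROBLEM_STATES` set are hypothetical names introduced for illustration.

```python
# Problem states for services and failure states for hosts.
PROBLEM_STATES = {"WARNING", "CRITICAL", "UNKNOWN", "DOWN", "UNREACHABLE"}


def should_run_event_handler(handler_defined, always_execute, state, previous_state):
    """Decide whether the event handler runs for the latest check result."""
    if not handler_defined:
        return False
    if always_execute:
        # "Always execute" on: run for every check, regardless of state.
        return True
    if state in PROBLEM_STATES:
        # "Always execute" off: run after every check in a problem state.
        return True
    # Also run on the first change back to OK/UP after a problem.
    return previous_state in PROBLEM_STATES
```

With "Always execute" off, a service that stays OK between two checks does not trigger the handler, but the first OK result after a CRITICAL one does.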

See more details within the [Event Handlers](🔗) section.

## Lifecycle of a Service

This shows the lifecycle of a service that transitions from WARNING to CRITICAL and back to OK. It assumes the service is checked every 31s, with a retry interval of 20s and a maximum of 3 check attempts:

| Time | State | Check Attempt | State type | Notification Executed | Event Handler Executed |
| --- | --- | --- | --- | --- | --- |