
1. States and events

Until now we have been concerned with the installation and implementation of Checkmk. Now it is time to begin explaining the basic concepts and definitions of monitoring (with Checkmk).

It is important to understand the basic difference between states and events, because it has very practical consequences. Most classic IT monitoring systems revolve around events. An event is something that occurs once at a particular time. A good example would be: error when accessing drive X. Typical sources of events are syslog messages, SNMP traps, the Windows Event Log, and log file entries. Events occur more or less spontaneously (self-generated, asynchronous).

In contrast, a state describes a sustained situation, e.g. drive X is online. In order to observe the state of something, the monitoring system must poll it regularly. As the example shows, in monitoring it is often possible to choose between working with events and working with states.

Checkmk can accommodate both states and events but, where the choice is available, always prioritizes state-based monitoring. The reason lies in the numerous advantages of this method, including:

  • An error in the monitoring itself is detected immediately, because it is obviously noticeable when the status query no longer works. The non-occurrence of a message, on the other hand, does not give any certainty whether the monitoring is still working.

  • Regular checking at a fixed interval makes it possible to capture performance data and record their history over time.

  • Checkmk itself can control the rate at which states are polled. There is no risk of an event storm in global error situations.

  • Even in chaotic situations - e.g. a power failure in a data center - you always have a reliable overall status.

It is fair to say that state-based monitoring is the norm in Checkmk. For the processing of events, the Checkmk Event Console is also available. It is specialized for the correlation and evaluation of large numbers of events and is seamlessly integrated into the Checkmk platform.

2. Hosts and services

2.1. Hosts

Everything in Checkmk revolves around hosts and services. A host can be many things, e.g.:

  • A server

  • A network device (switch, router, load balancer)

  • A measuring device with an IP connection (thermometer, hygrometer)

  • Anything else with an IP address

  • A cluster of several hosts

  • A virtual machine

  • A Docker container

In monitoring a host always has one of the following states:

State    Color   Meaning
UP       green   The host is accessible via the network (this generally means that it answers a ping).
DOWN     red     The host does not answer network inquiries and is not accessible.
UNREACH  orange  The path to the host is currently blocked to monitoring, because a router or switch in the path has failed.
PEND     gray    The host has been newly included in the monitoring, but has never been polled. Strictly speaking, this is not really a state.

Alongside the state, a host has a number of other attributes that can be configured by the user, e.g.:

  • A unique name

  • An IP address

  • Optional - an alias, which does not have to be unique

  • Optional - one or more parents

2.2. Parents

In order for the monitoring to be able to assess the UNREACH state, it must know via which path every individual host can be reached. For this purpose, one or more so-called parent hosts may be specified for every host. If, for example, server A, as seen from the monitoring, is only accessible via router B, then B is a parent of A. Only direct parents are ever configured. The result is a tree-like structure with the Checkmk instance at its center (shown here as parent map root):

[Figure: parent relationships in monitoring]

In this example, if the host myhost is in the DOWN state, the monitoring automatically assumes that a possible failure of myhost4 can be explained simply by its no longer being accessible to monitoring. Whether it really has failed cannot be determined. It will therefore be classified as UNREACH in monitoring. An important rule applies here: by default, no notifications are sent for hosts in the UNREACH state.

This is the most important task of the parent concept: avoiding mass false alarms when an entire network segment is no longer accessible to monitoring.

By the way: The monitoring of myhost4 still takes place! If this host responds, it will be displayed as UP in any case.
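The parent rule described above can be sketched in a few lines of Python. This is an illustration only, not Checkmk's actual implementation: the function name `host_state`, the data layout, and the assumption that myhost is the direct parent of myhost4 are all hypothetical, and a host is treated as UNREACH only when all of its parents are DOWN or UNREACH.

```python
from typing import Dict, List

# Hypothetical topology: which direct parents each host has.
PARENTS: Dict[str, List[str]] = {
    "myhost": [],           # reachable directly from the monitoring instance
    "myhost4": ["myhost"],  # assumed to be reachable only via myhost
}


def host_state(host: str, responds: Dict[str, bool],
               parents: Dict[str, List[str]]) -> str:
    """Return UP, DOWN or UNREACH for a host (illustrative logic only)."""
    if responds[host]:
        # A host that answers is UP, whatever its parents are doing.
        return "UP"
    host_parents = parents.get(host, [])
    # Without an answer, the host is UNREACH if every path to it is broken.
    if host_parents and all(
        host_state(p, responds, parents) in ("DOWN", "UNREACH")
        for p in host_parents
    ):
        return "UNREACH"
    return "DOWN"
```

With `responds = {"myhost": False, "myhost4": False}`, myhost is reported as DOWN and myhost4 as UNREACH. If myhost4 does answer, it is reported as UP regardless of its parent.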

2.3. Services

A host has a number of services. A service can be anything - please don’t confuse this with services in Windows. A service is any part or aspect of the host that can be OK or not OK. Naturally, the state of a service can only be determined if the host is UP.

A service being monitored can have the following states:

State    Color   Meaning
OK       green   The service is fully in order. All values are in their allowed range.
WARN     yellow  The service is functioning normally, but its parameters are outside their optimal range.
CRIT     red     The service has failed.
UNKNOWN  orange  The service’s state cannot be correctly determined. The monitoring agent has delivered defective data or the element being monitored has disappeared.
PEND     gray    The service has been newly included and has so far not provided monitoring data.

When determining which state is 'worse', Checkmk uses the following sequence:

OK → WARN → UNKNOWN → CRIT
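As a minimal sketch, this ordering (OK, WARN, UNKNOWN, CRIT, from least to most severe) can be expressed as a ranking table. The names `SEVERITY` and `worst_state` are made up for this example and are not part of Checkmk.

```python
from typing import Iterable

# Ranking from least to most severe: OK < WARN < UNKNOWN < CRIT.
SEVERITY = {"OK": 0, "WARN": 1, "UNKNOWN": 2, "CRIT": 3}


def worst_state(states: Iterable[str]) -> str:
    """Return the most severe state; raises ValueError for an empty input."""
    return max(states, key=SEVERITY.__getitem__)
```

Note that UNKNOWN ranks above WARN but below CRIT, so `worst_state(["OK", "UNKNOWN", "WARN"])` yields UNKNOWN, while any CRIT in the input wins.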

3. Host and service groups

Hosts and services can be grouped to provide a better overview. A host or service can be in more than one group. These groups are purely optional and not required for the configuration. Host groups can be useful when, alongside the folder structure in which the hosts are managed, an additional grouping is desired. If, for example, you have built a folder structure along geographic lines, it could be useful to have a host group such as Linux-Server that lists all Linux servers regardless of their geographic location.

4. Contacts and contact groups

Contacts and contact groups make it possible to assign persons to hosts and services. A contact corresponds to a user name of the web interface. The assignment to hosts and services does not occur directly, however, but via contact groups. First, a contact (e.g. harri) is assigned to a contact group (e.g. linux-admins). Then hosts - or, as required, individual services - are assigned to the contact group. Users, and likewise hosts and services, can be assigned to multiple contact groups.

These assignments answer several questions:

  1. Who is permitted to view something?

  2. Who is authorized to configure and control which hosts and services?

  3. Who receives notifications for which problems?

By the way - the user cmkadmin, who is automatically defined by the creation of an instance, is always permitted to view all hosts and services even when they are not a contact. This is determined through their role as administrator.
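The indirection via contact groups described in this section can be sketched as a simple set intersection. The dictionaries and host names below are invented for illustration (only harri and linux-admins come from the text), and real permissions additionally depend on roles.

```python
from typing import Dict, Set

# Hypothetical assignments: users and hosts are both linked to contact groups.
GROUPS_OF_USER: Dict[str, Set[str]] = {"harri": {"linux-admins"}}
GROUPS_OF_HOST: Dict[str, Set[str]] = {
    "srv-linux-01": {"linux-admins"},
    "srv-win-01": {"windows-admins"},
}


def is_contact(user: str, host: str) -> bool:
    """A user is a contact for a host if they share at least one contact group."""
    user_groups = GROUPS_OF_USER.get(user, set())
    host_groups = GROUPS_OF_HOST.get(host, set())
    return bool(user_groups & host_groups)
```

Here harri is a contact for srv-linux-01 but not for srv-win-01, because only the former shares the contact group linux-admins with him.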

5. Users and roles

Whereas the persons who are responsible or authorized for a particular host or service are defined through contacts and contact groups, their privileges are controlled via roles. Checkmk is supplied with three roles from which further roles can be later derived. Each role defines a series of rights which may be customized. The standard roles have the following meanings:

Role   Meaning
admin  May view all, has all privileges.
user   May only view that for which he/she is a contact. May manage hosts in folders assigned to him/her. Is not permitted to make global settings.
guest  May view all, but may not configure and may not influence monitoring.

6. Problems, events and notifications

6.1. Handled and unhandled problems

Checkmk identifies every host that is not UP and every service that is not OK as a problem. A problem can have two states: unhandled and handled. A new problem is first treated as unhandled. As soon as someone acknowledges the problem, it is flagged as handled. In other words, unhandled problems are those which nobody has attended to yet. The Overview in the sidebar therefore differentiates between the two types of problems:

[Figure: Overview snap-in in Show more mode]

By the way: service problems from hosts that are currently not UP are not identified as problems.

Further details about acknowledgments can be found in a separate article.

6.2. Notifications

When the state of a host or service changes (e.g. from OK to CRIT), Checkmk registers a monitoring event. These events may or may not generate a notification. By default, Checkmk is set up so that whenever a host or service has a problem, an email is sent to the object’s contacts (please note that cmkadmin, by default, is not a contact for any objects). This can be customized very flexibly, however. Notifications also depend on a number of parameters. It is simplest to look at the cases in which notifications are not sent. Notifications are suppressed …

  • …when notifications have been globally deactivated in the Master control

  • …when notifications have been deactivated for the host/service

  • …when notifications have been deactivated for a particular state of the host/service (e.g. no notifications for WARN)

  • …​when the problem affects a service whose host is DOWN or UNREACH

  • …​when the problem affects a host, whose parents are all DOWN or UNREACH

  • …​when for the host/service a notification period has been set that is not currently active (see below)

  • …​when the host/service is currently flapping (see below)

  • …​when the host/service is currently in a scheduled downtime (see below)

If none of these conditions for suppressing notifications is met, the monitoring core creates a notification, which in a second step passes through a chain of rules. In these rules you can define further exclusion criteria and decide who should be notified and in what form (email, SMS, etc.).
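The suppression conditions listed above amount to a series of checks that must all pass before the core creates a notification. The following sketch models them with a small data class. All field and function names are hypothetical and do not mirror Checkmk's internal data structures, and the subsequent rule chain is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NotificationContext:
    """Hypothetical snapshot of everything the suppression checks look at."""
    globally_enabled: bool = True
    enabled_for_object: bool = True
    enabled_for_state: bool = True
    host_down_or_unreach: bool = False         # relevant for service problems
    all_parents_down_or_unreach: bool = False  # relevant for host problems
    in_notification_period: bool = True
    flapping: bool = False
    in_scheduled_downtime: bool = False


def suppression_reasons(ctx: NotificationContext) -> List[str]:
    """Return every reason why a notification would be suppressed."""
    checks = [
        (not ctx.globally_enabled, "notifications globally deactivated"),
        (not ctx.enabled_for_object, "notifications deactivated for host/service"),
        (not ctx.enabled_for_state, "notifications deactivated for this state"),
        (ctx.host_down_or_unreach, "host of the service is DOWN or UNREACH"),
        (ctx.all_parents_down_or_unreach, "all parents are DOWN or UNREACH"),
        (not ctx.in_notification_period, "outside the notification period"),
        (ctx.flapping, "host/service is flapping"),
        (ctx.in_scheduled_downtime, "host/service is in a scheduled downtime"),
    ]
    return [reason for suppressed, reason in checks if suppressed]


def core_creates_notification(ctx: NotificationContext) -> bool:
    """A notification is created only if no suppression condition applies."""
    return not suppression_reasons(ctx)
```

Returning the list of reasons rather than a bare boolean makes it easier to explain afterwards why a particular notification was not sent.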

All particulars concerning notifications can be found in a separate article.

6.3. Flapping hosts and services

It sometimes happens that a service changes its state continuously and quickly. In order to avoid continuous notifications, Checkmk switches such a service into the flapping state. This is indicated by the flapping icon. When a service enters the flapping state, a notification is generated that informs the user of the change, and further notifications are silenced. If, after a suitable time, no further rapid changes occur and a final (good or bad) state is evident, the flapping state ends and normal notifications resume.

6.4. Scheduled downtimes

If you perform maintenance work on a server, device or software, you will normally want to avoid potential problem notifications during this time. In addition, you will probably want to advise your colleagues that problems appearing in monitoring during this time may be temporarily ignored.

For this purpose you can set a scheduled downtime for a host or service. This can be done directly before starting the work, or in advance. Scheduled downtimes are indicated by the following icons:

Icon                          Meaning
Scheduled downtime (service)  The service is in a scheduled downtime.
Scheduled downtime (host)     The host is in a scheduled downtime. Services whose host is in a downtime are also marked with this icon.

While a host or service has a scheduled downtime:

  • No notifications will be sent.

  • Problems will not be shown in the Overview.

Scheduled downtimes are also worth entering if you later want to document statistics on the availability of hosts and services, since they can be factored into the availability evaluations.

7. Time periods

Time periods define regular, weekly recurring time ranges that are used in various places in the monitoring configuration. A typical time period could be called work hours and could cover 8:00 to 17:00 on every day from Monday to Friday. The predefined time period 24X7 simply includes all times. Time periods can also include exceptions for particular calendar days - e.g. Bavarian public holidays.
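To make the idea concrete, here is a minimal sketch of how the work hours example could be evaluated. It assumes a weekly schedule from 8:00 to 17:00 on Monday to Friday, with the end time exclusive, and a single illustrative calendar exception. The names `WORK_HOURS`, `EXCLUDED_DAYS` and `is_active` are made up for this example and say nothing about how Checkmk stores time periods.

```python
from datetime import date, datetime, time

# Weekly schedule: weekday number (Monday = 0) -> (start, end) of the active range.
WORK_HOURS = {weekday: (time(8, 0), time(17, 0)) for weekday in range(5)}

# Illustrative exception: a public holiday on which the period is not active.
EXCLUDED_DAYS = {date(2024, 12, 25)}


def is_active(moment: datetime) -> bool:
    """Return True if the moment lies inside the 'work hours' time period."""
    if moment.date() in EXCLUDED_DAYS:
        return False  # calendar exceptions override the weekly schedule
    window = WORK_HOURS.get(moment.weekday())
    if window is None:
        return False  # Saturday and Sunday have no active range
    start, end = window
    return start <= moment.time() < end
```

A notification period based on this sketch would be active on a Monday at 9:00, but not on a Saturday, not on the excluded holiday, and not at or after 17:00.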

Some important situations which use time periods are:

  • Limiting the time during which notifications will be made (notification period)

  • Limiting the time during which checks are to be performed (check period)

  • Service times for the evaluation of availability (service period)

  • Times during which the Event Console applies defined rules

8. Check interval, check attempts and check period

In state-based monitoring, checks are executed at fixed intervals. Checkmk uses one minute as its default, so every check is performed once per minute. This can be altered in the configuration:

  • To a longer interval in order to save CPU resources on the server and target systems

  • To a shorter interval in order to receive notifications more quickly and to collect performance data at a higher resolution.

By defining a check period other than 24X7, the execution of active checks can be suspended during specified time frames. The service’s state will then no longer be updated, and the service will be flagged as stale, indicated by the stale icon.

In combination with a long check interval one can ensure that an active check is performed once per day at a specified time. If you set an interval of e.g. 24 hours and the check period at 02:00 - 02:01 on every day (only one minute per day), then Checkmk will ensure that the check really will be executed in this short time frame.

With the aid of max check attempts you can avoid notifications in the case of sporadic errors. In this way you are effectively making a check less sensitive. If the check attempts are set to e.g. 3, and the corresponding service becomes CRIT, then initially no notification will be generated. If the next two checks produce a result other than OK, the number of current attempts will increase to 3 and a notification will be sent.

A service that is in this intermediate state, i.e. is not OK but has not yet reached its maximum number of check attempts, has a soft state.
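The interplay of check attempts and soft states can be illustrated with a small simulation. This is a simplified sketch, not the monitoring core's real logic: it only models the attempt counter and the first notification once the maximum is reached, and it ignores recovery notifications and changes between different problem states. The function name `evaluate` is hypothetical.

```python
from typing import List, Tuple


def evaluate(results: List[str],
             max_attempts: int = 3) -> List[Tuple[str, int, bool, bool]]:
    """Return (state, attempt, is_soft, notify) for each check result in order."""
    timeline = []
    attempt = 0
    for state in results:
        if state == "OK":
            attempt = 0  # an OK result resets the counter
            timeline.append((state, attempt, False, False))
            continue
        was_hard = attempt >= max_attempts
        attempt = min(attempt + 1, max_attempts)
        is_soft = attempt < max_attempts
        # Notify exactly once: at the moment the attempt limit is reached.
        notify = not is_soft and not was_hard
        timeline.append((state, attempt, is_soft, notify))
    return timeline
```

For the sequence OK, CRIT, CRIT, CRIT with three check attempts, the first two CRIT results are soft and only the third triggers a notification. A sporadic error such as CRIT, OK, CRIT, OK never triggers one.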

9. Active and passive checks

If you look at the Checkmk interface, you can see that the menu of some services shows a yellow arrow, while for most others it shows a gray arrow. The services with the yellow arrow are active checks, which Checkmk executes directly. Services with a gray arrow are passive checks, whose results are determined by the active check Check_MK. This is done for performance reasons and illustrates a special feature of Checkmk:

[Figure: the Check_MK service]

So that the target system (server, network device, etc.) does not have to be contacted anew for every single service, Checkmk collects all important data in one pass, once per interval. From this data it calculates new results for all passive checks in a single action. This conserves CPU resources on both systems and is an important factor in Checkmk’s high performance and scalability.
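The principle of fetching once and evaluating many times can be sketched as follows. Everything in this example is hypothetical: the function names, the sample data, the service names and the thresholds are invented to illustrate the idea and do not reflect Checkmk's agent format or check API.

```python
from typing import Callable, Dict, List

FETCH_LOG: List[str] = []  # records every contact with a target system


def fetch_host_data(host: str) -> Dict[str, float]:
    """Stand-in for the single data collection pass per check interval."""
    FETCH_LOG.append(host)
    return {"cpu_load": 0.7, "disk_used_percent": 93.0}


def check_cpu(data: Dict[str, float]) -> str:
    # Invented threshold, for illustration only.
    return "OK" if data["cpu_load"] < 5.0 else "WARN"


def check_disk(data: Dict[str, float]) -> str:
    # Invented thresholds, for illustration only.
    used = data["disk_used_percent"]
    if used >= 90.0:
        return "CRIT"
    if used >= 80.0:
        return "WARN"
    return "OK"


PASSIVE_CHECKS: Dict[str, Callable[[Dict[str, float]], str]] = {
    "CPU load": check_cpu,
    "Filesystem /": check_disk,
}


def run_check_cycle(host: str) -> Dict[str, str]:
    """Contact the host once, then derive all passive check results from the data."""
    data = fetch_host_data(host)
    return {name: check(data) for name, check in PASSIVE_CHECKS.items()}
```

However many entries `PASSIVE_CHECKS` contains, one call to `run_check_cycle` contacts the target system only once.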

10. Overview of the most important host and service icons

The following table provides a short overview of the most important status icons appearing beside hosts and services:

Icon                          Meaning
scheduled downtime (service)  This service is in a scheduled downtime.
scheduled downtime (host)     This host is in a scheduled downtime. Services whose host is in a downtime are also marked with this symbol.
outofnot                      This host/service is currently outside its notification period.
notif man disabled            Notifications for this host/service are currently deactivated.
disabled                      Checks for this service are currently deactivated.
stale                         This host/service has a status of stale.
flapping                      This host/service has a status of flapping.
ack                           This host/service has an acknowledged problem.
comment                       There is a comment for this host/service.
aggr                          This host/service is part of a BI aggregation.
check parameters              Here you can directly access the settings for the check parameters.
logwatch                      Only for logwatch services: here you can access stored log files.
pnp                           Here you can access a time graph of the performance data.
inventory                     This host/service has inventory data. A click on it shows the related view.
crash                         This check crashed. Click on it to view and submit a crash/bug report.
