Until now we have been concerned with the installation and implementation of Checkmk. Now it is time to begin explaining the basic concepts and definitions of monitoring (with Checkmk).
It is important to understand the basic difference between states and events, because it has very practical consequences. Most classic IT monitoring systems revolve around events. An event is something that occurs once at a particular time. A good example would be an error when accessing drive X. Typical sources of events are syslog messages, SNMP traps, the Windows Event Log, and log file entries. Events are quasi-spontaneous (self-generating, asynchronous) occurrences.
In contrast, a state describes a sustained situation, e.g. drive X is online. In order to observe the state of something, the monitoring system must poll it regularly. As the example shows, in monitoring it is often possible to choose whether to work with events or with states.
Checkmk can handle both states and events but, where the choice is available, always prefers state-based monitoring. The reason lies in the numerous advantages of this method, among them:
An error in the monitoring itself is detected immediately, because it is obviously noticeable when the status query no longer works. The non-occurrence of a message, on the other hand, does not give any certainty whether the monitoring is still working.
Regular checking in a fixed time-frame enables the capturing of performance data to record their time history.
Checkmk itself can control the rate at which states are polled. There is no risk of an event storm in global error situations.
Even in chaotic situations - e.g. a power failure in a computer center - one always has a reliable overall status.
It is fair to say that state-based monitoring is the norm in Checkmk. For the processing of events, the Checkmk Event Console is also available. It is specialized for the correlation and evaluation of large numbers of events and is seamlessly integrated into the Checkmk platform.
Everything in Checkmk revolves around hosts and services. A host can be many things, e.g.:
A network device (switch, router, load balancer)
A measuring device with an IP connection (thermometer, hygrometer)
Anything else with an IP address
A cluster of several hosts
A virtual machine
A Docker container
In monitoring, a host always has one of the following states:
UP: The host is accessible via the network (this generally means that it answers a ping).
DOWN: The host does not answer network inquiries and is not accessible.
UNREACH: The path to the host is currently blocked to the monitoring because a router or switch in the path has failed.
PEND: The host has been newly included in the monitoring but has never been polled. Strictly speaking, this is not really a state.
Alongside the state, a host has a number of other attributes that can be configured by the user, e.g.:
A unique name
An IP address
Optional - an alias, which need not be unique
Optional - one or more parents
In order for the monitoring to be able to determine the UNREACH state, it must know via which path each individual host can be reached. For this purpose, one or more so-called parent hosts can be specified for every host. If, e.g., server A as seen from the monitoring is only accessible via router B, then B is a parent of A. Only direct parents are configured. This results in a tree-like structure with the Checkmk instance at its center.
In this example, if the host sw-ks-01.lan.tribe29.com is DOWN, the monitoring automatically assumes that a possible failure of sw-ks-02.lan.tribe29.com may simply be due to it no longer being accessible to the monitoring. Whether it really has failed cannot be determined. It is therefore classified as UNREACH. An important rule applies here: by default, no notifications are sent for hosts in the UNREACH state.
This is the most important task of the parents concept: avoiding mass false alarms when an entire network segment is no longer accessible to the monitoring.
By the way: The monitoring of sw-ks-02.lan.tribe29.com still takes place! If this host responds, it will be displayed as UP in any case.
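The parent rule described above can be sketched in a few lines. This is only an illustration, not Checkmk's actual implementation: the function name `classify_host`, the two dictionaries, and the shortened host names are assumptions made for this example, and the parent relationships are assumed to be free of cycles.

```python
def classify_host(name, responds, parents, cache=None):
    """Return 'UP', 'DOWN' or 'UNREACH' for a host (illustrative sketch).

    responds: dict mapping host name -> True if the host answered its check
    parents:  dict mapping host name -> list of direct parent host names
    """
    if cache is None:
        cache = {}
    if name in cache:
        return cache[name]

    if responds[name]:
        # A host that answers is UP, whatever its parents are doing.
        state = "UP"
    else:
        host_parents = parents.get(name, [])
        all_parents_failed = bool(host_parents) and all(
            classify_host(p, responds, parents, cache) != "UP"
            for p in host_parents
        )
        # Without a working path, a real failure cannot be told apart
        # from a blocked path, so the host is only UNREACH.
        state = "UNREACH" if all_parents_failed else "DOWN"

    cache[name] = state
    return state


# Hypothetical topology: sw-ks-02 is only reachable via sw-ks-01.
PARENTS = {"sw-ks-02": ["sw-ks-01"]}
RESPONDS = {"sw-ks-01": False, "sw-ks-02": False}

print(classify_host("sw-ks-01", RESPONDS, PARENTS))
print(classify_host("sw-ks-02", RESPONDS, PARENTS))
```

With both switches silent, the first call yields DOWN and the second UNREACH. If sw-ks-02 did respond, it would be reported as UP regardless of its parent.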
A host has a number of services. A service can be anything - please don’t confuse this with services in Windows. A service is any part or aspect of the host that can be OK or not OK. Naturally, the state of a service can only be determined if its host is UP.
A service being monitored can have the following states:
OK: The service is fully in order. All values are in their allowed range.
WARN: The service is functioning normally, but its parameters are outside their optimal range.
CRIT: The service has failed.
UNKNOWN: The service’s state cannot be correctly determined. The monitoring agent has delivered defective data or the element being monitored has disappeared.
PEND: The service has been newly included and has so far not provided monitoring data.
When determining which condition is 'worse', Checkmk utilizes the following sequence:
OK → WARN → UNKNOWN → CRIT
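As a small illustration of this ordering, the following sketch picks the worst state from a list of service states. The names `SEVERITY` and `worst_state` are invented for this example and are not part of any Checkmk API.

```python
# Rank of each state in the "worse than" ordering given above:
# OK -> WARN -> UNKNOWN -> CRIT
SEVERITY = {"OK": 0, "WARN": 1, "UNKNOWN": 2, "CRIT": 3}


def worst_state(states):
    """Return the worst state in `states`, or 'OK' if the list is empty."""
    return max(states, key=SEVERITY.__getitem__, default="OK")


print(worst_state(["OK", "UNKNOWN", "WARN"]))  # UNKNOWN
print(worst_state(["UNKNOWN", "CRIT", "OK"]))  # CRIT
```

Note that UNKNOWN ranks above WARN but below CRIT, which is easy to get wrong when sorting by the numeric exit codes commonly used by check plug-ins.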
Hosts and services can be grouped for a better overview. A host or service can be in more than one group. These groups are purely optional and not required for the configuration. Host groups can be useful when an additional grouping is desired alongside the folder structure in which the hosts are managed. If, for example, you have built a folder structure according to geographic locations, it could be useful to have a Linux-Server host group that lists all Linux servers regardless of where they are located.
Contacts and contact groups offer the possibility of assigning persons to hosts and services. A contact corresponds to a user of the web interface. The assignment to hosts and services does not occur directly, however, but via contact groups. First, a contact (e.g. harri) is assigned to a contact group (e.g. linux-admins). Then hosts - or, as required, individual services - can be assigned to the contact group. In this way users, and likewise hosts and services, can be assigned to multiple contact groups.
These assignments are useful for a number of reasons:
Who is permitted to view something?
Who is authorized to configure and control which hosts and services?
Who receives notifications for which problems?
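The indirection via contact groups can be pictured as two mappings that are joined on the group name. The sketch below is only a mental model: the host names, the second user, and the function `contacts_of` are hypothetical, while harri and linux-admins are the examples used above.

```python
# Which contact groups each user belongs to (users other than harri are made up).
GROUPS_OF_USER = {
    "harri": {"linux-admins"},
    "wilma": {"network-admins", "linux-admins"},
}

# Which contact groups each host is assigned to (host names are made up).
GROUPS_OF_HOST = {
    "srv-linux-01": {"linux-admins"},
    "sw-core-01": {"network-admins"},
}


def contacts_of(host):
    """Return all users sharing at least one contact group with the host."""
    host_groups = GROUPS_OF_HOST.get(host, set())
    return sorted(
        user
        for user, user_groups in GROUPS_OF_USER.items()
        if user_groups & host_groups
    )


print(contacts_of("srv-linux-01"))  # ['harri', 'wilma']
print(contacts_of("sw-core-01"))    # ['wilma']
```

Because users and hosts never reference each other directly, responsibility for a whole set of hosts can be changed by editing group membership alone.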
By the way - the user cmkadmin, who is created automatically along with an instance, is always permitted to view all hosts and services, even when not a contact for them. This is determined by the user’s role.
Whereas the persons who are responsible or authorized for a particular host or service are defined through contacts and contact groups, their privileges are controlled via roles. Checkmk is supplied with three roles from which further roles can be later derived. Each role defines a series of rights which may be customized. The standard roles have the following meanings:
admin: May view everything and has all privileges.
user: May only view that for which he/she is a contact. May manage hosts in folders assigned to him/her. Is not permitted to make global settings.
guest: May view everything, but may not configure anything and may not influence the monitoring.
Checkmk identifies every host that is not UP, and every service that is not OK, as a problem. A problem can have two states: unhandled and handled. A new problem is first treated as unhandled. As soon as someone confirms (acknowledges) the problem, it is flagged as handled. In other words, unhandled problems are those which nobody has attended to yet. The tactical overview in the sidebar therefore differentiates between the two types of problems.
By the way: service problems from hosts that are currently not UP are not identified as problems.
Further details about acknowledgments can be found in their own article.
When the state of a host or service changes (e.g. from OK to CRIT), Checkmk registers an event. These events may or may not generate a notification. By default, Checkmk is set up so that whenever a host or service has a problem, an email is sent to the object’s contacts (please note that cmkadmin is, by default, not a contact for any object). This can be customized very flexibly, however. Whether a notification is sent also depends on a number of parameters. It is simplest to look at the cases in which notifications are not sent.
Notifications are suppressed …
…when notifications have been globally deactivated in the master control
…when notifications have been deactivated for the host/service
…when notifications are deactivated for a particular state of the host/service (e.g. no notification for WARN)
…when the problem affects a service whose host is DOWN or UNREACH
…when the problem affects a host, whose parents are all DOWN or UNREACH
…when for the host/service a notification period has been set that is not currently active (see below)
…when the host/service is currently flapping (see below)
…when the host/service is currently in a scheduled downtime (see below)
If none of these conditions for suppressing notifications is met, the monitoring core creates a notification, which in a second step passes through a chain of rules. In these rules you can define further exclusion criteria, and decide who should be notified and in what form (email, SMS, etc.).
All particulars concerning notifications can be found in their own article.
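The suppression conditions listed above can be read as a sequence of early exits. The following sketch restates them as code to make the logic explicit. It is not how the Checkmk core is implemented; the class `NotificationContext`, its field names, and the function `suppression_reason` are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

UNAVAILABLE = {"DOWN", "UNREACH"}


@dataclass
class NotificationContext:
    """Everything the sketch needs to know about one problem (hypothetical)."""
    state: str                              # e.g. "WARN", "CRIT", "DOWN"
    is_service: bool = True                 # False for a host problem
    globally_enabled: bool = True           # master control switch
    object_enabled: bool = True             # switch on the host/service
    notified_states: frozenset = frozenset({"WARN", "UNKNOWN", "CRIT", "DOWN"})
    host_state: str = "UP"                  # state of the service's host
    parent_states: tuple = ()               # states of a host's parents
    in_notification_period: bool = True
    flapping: bool = False
    in_downtime: bool = False


def suppression_reason(ctx: NotificationContext) -> Optional[str]:
    """Return why a notification is suppressed, or None if it is created."""
    if not ctx.globally_enabled:
        return "globally deactivated"
    if not ctx.object_enabled:
        return "deactivated for this host/service"
    if ctx.state not in ctx.notified_states:
        return "deactivated for this state"
    if ctx.is_service and ctx.host_state in UNAVAILABLE:
        return "host is DOWN or UNREACH"
    if (not ctx.is_service and ctx.parent_states
            and all(s in UNAVAILABLE for s in ctx.parent_states)):
        return "all parents are DOWN or UNREACH"
    if not ctx.in_notification_period:
        return "outside notification period"
    if ctx.flapping:
        return "flapping"
    if ctx.in_downtime:
        return "scheduled downtime"
    return None


print(suppression_reason(NotificationContext(state="CRIT")))
print(suppression_reason(NotificationContext(state="CRIT", host_state="DOWN")))
```

The first call returns `None`, meaning a notification would be created and handed on to the rule chain. The second is suppressed because the service's host is DOWN.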
It sometimes happens that a service changes its state quickly and continuously. In order to avoid continuous notifications, Checkmk switches such a service into the flapping state, which is indicated by a dedicated icon. When a service enters the flapping state, a notification is generated which informs the user of the change, and further notifications are silenced. After a suitable time, if no further rapid changes occur and a final (good or bad) state is evident, the flapping state disappears and normal notification resumes.
If you perform maintenance work on a server, device or software, you will normally want to avoid potential problem notifications during this time. In addition, you will probably want to advise your colleagues that problems appearing in monitoring during this time may be temporarily ignored.
For this purpose you can set a scheduled downtime on a host or service. This can be done directly before starting the work, or in advance. Scheduled downtimes are indicated by the following symbols:
The host/service is in a scheduled downtime
The host on which the service is located has a scheduled downtime
While a host or service has a scheduled downtime:
No notifications will be sent.
Problems will not be shown in the tactical overview.
Additionally, when you wish to later document statistics on the availability of hosts and services it is a good idea to include scheduled downtimes. These can be factored into later availability evaluations.
Time periods define regular, weekly recurring time ranges that are used in various places in the monitoring configuration. A typical time period could be called work hours and could contain the time from 8:00 to 17:00 on every day except Saturday and Sunday. The period 24X7 simply includes all times and is predefined. Time periods can also include exceptions for particular calendar days - e.g. Bavarian public holidays.
Some important situations which use time periods are:
Limiting the time during which notifications will be made (notification period)
Limiting the time during which checks are to be performed (check period)
Service times for the evaluation of availability (service period)
Times during which the Event Console applies defined rules
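To make the idea of a weekly recurring period with calendar exceptions concrete, here is a minimal sketch. The data layout, the name `is_active`, and the chosen exception date are assumptions for this example and do not reflect Checkmk's configuration format.

```python
from datetime import date, datetime, time

# "work hours": Monday (0) to Friday (4), 08:00 to 17:00.
WORK_HOURS = {weekday: [(time(8, 0), time(17, 0))] for weekday in range(5)}

# Exceptions override the weekly pattern for single calendar days.
# Hypothetical holiday with no active ranges at all:
EXCEPTIONS = {date(2024, 12, 25): []}


def is_active(period, exceptions, moment):
    """Return True if `moment` lies inside the time period."""
    ranges = exceptions.get(moment.date(), period.get(moment.weekday(), []))
    # The end of each range is treated as exclusive.
    return any(start <= moment.time() < end for start, end in ranges)


print(is_active(WORK_HOURS, EXCEPTIONS, datetime(2024, 6, 3, 9, 0)))    # Monday morning
print(is_active(WORK_HOURS, EXCEPTIONS, datetime(2024, 6, 8, 9, 0)))    # Saturday
print(is_active(WORK_HOURS, EXCEPTIONS, datetime(2024, 12, 25, 9, 0)))  # exception day
```

The same test can serve any of the uses listed above, for example deciding whether a notification falls inside its notification period.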
In state-based monitoring, checks are executed at fixed intervals. Checkmk uses one minute as its default, so every check is performed once per minute. This can be altered in the configuration:
To a longer interval in order to save CPU resources on the server and target systems
To a shorter interval in order to receive alerts more quickly and to collect performance data at a higher resolution.
By defining a check period other than 24X7, the execution of active checks can be suspended during specified time frames. During such times the service’s state will no longer be updated, and the service will be flagged as stale, which is indicated by a dedicated icon.
In combination with a long check interval one can ensure that an active check is performed once per day at a specified time. If you set an interval of e.g. 24 hours and the check period at 02:00 - 02:01 on every day (only one minute per day), then Checkmk will ensure that the check really will be executed in this short time frame.
With the aid of max check attempts you can avoid notifications in the case of sporadic errors. In this way you effectively make a check less sensitive. If the check attempts are set to e.g. 3 and the corresponding service becomes CRIT, then initially no notification will be generated. Only if the next two checks also produce a result other than OK will the number of current attempts increase to 3 and a notification be sent.
A service that is in this intermediate state - i.e. not OK, but not yet at its maximum number of attempts - has a soft state.
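The counting behind max check attempts can be sketched as a small state machine. This is a simplified model for illustration only: the class `AttemptCounter` and its method are invented names, and recovery notifications as well as everything else the real core does are deliberately left out.

```python
class AttemptCounter:
    """Tracks check attempts for one service (illustrative sketch)."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.attempt = 0

    def process(self, state):
        """Feed one check result; return (state_type, notify)."""
        if state == "OK":
            # Any OK result resets the counter.
            self.attempt = 0
            return "hard", False
        if self.attempt < self.max_attempts:
            self.attempt += 1
            reached_now = self.attempt == self.max_attempts
            # Notify exactly once, when the last attempt also fails.
            return ("hard" if reached_now else "soft"), reached_now
        # Problem persists after the notification was already triggered.
        return "hard", False


counter = AttemptCounter(max_attempts=3)
for result in ["CRIT", "CRIT", "CRIT", "CRIT", "OK"]:
    print(result, counter.process(result))
```

With three attempts configured, the first two CRIT results are soft and silent, the third turns the state hard and triggers the notification, and a single sporadic failure followed by OK never notifies anyone.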
If you look at the Checkmk interface you can see that in the menu of some services a green double arrow is shown, while most others show a gray four-way arrow. The services with the green arrow are active checks. These are executed by Checkmk directly. Services with a gray arrow are passive checks, whose results are determined by the active check Check_MK. This is done for performance reasons and illustrates a special feature of Checkmk:
So that the target system (server, network device, etc.) does not have to be contacted anew for every single service, Checkmk collects all important data in one pass once per interval. From this data it calculates new results for all passive checks in a single step. This conserves CPU resources on both systems and is an important factor in Checkmk’s high performance and scalability.
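The principle of one data collection feeding many passive checks can be sketched as follows. The agent data, thresholds, service names, and function names here are all invented for the example; the real agent output and check plug-ins look different.

```python
def fetch_agent_data(host):
    """Stand-in for the single network round trip to the host (made-up data)."""
    return {"cpu_load": 0.7, "disk_used_percent": 93.0, "mem_used_percent": 41.0}


def check_threshold(value, warn, crit):
    """Map a measured value to a state using upper thresholds."""
    if value >= crit:
        return "CRIT"
    if value >= warn:
        return "WARN"
    return "OK"


# service name -> (key in the collected data, WARN level, CRIT level)
CHECKS = {
    "CPU load": ("cpu_load", 5.0, 10.0),
    "Filesystem /": ("disk_used_percent", 80.0, 90.0),
    "Memory": ("mem_used_percent", 80.0, 90.0),
}


def run_checks(host, fetch=fetch_agent_data):
    """Contact the host once, then derive every service state from that data."""
    data = fetch(host)  # the only contact with the target system
    return {
        service: check_threshold(data[key], warn, crit)
        for service, (key, warn, crit) in CHECKS.items()
    }


print(run_checks("myhost"))
```

However many services are configured, the host is contacted only once per interval. Adding a further service costs a little computation on the monitoring server but no additional network round trip.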
The following table provides a short overview of the most important status icons appearing beside hosts and services:
This host/service currently has a scheduled downtime
This service’s host currently has a scheduled downtime
This host/service is currently outside its notification period
Notifications for this host/service are currently deactivated
Checks for this service are currently deactivated
This host/service has a status of stale
This host/service has a status of flapping
This host/service has a confirmed problem
There is a comment for this host/service
This host/service is a part of a BI aggregation
Here you can directly access the settings for the check parameters
Only for logwatch services: here you can access stored log files
Here you can access a timegraph of the performance data
This host/service has inventory data. A click on it shows the related view
This check crashed. Click on it to view and submit a crash/bug report