Best practices, tips & tricks

1. Monitoring the individual CPU utilization in all cores

Checkmk automatically sets up a service under both Linux and Windows that monitors the average CPU usage over the last minute. This is a sensible default, but it misses some problems, for example a single runaway process that continuously keeps one CPU at 100 %. In a system with 16 CPUs, a single CPU contributes only 6.25 % to the overall capacity, so even in this extreme case the recorded total utilization is only 6.25 %, which does not trigger a notification.

For this reason, Checkmk provides the option (for Linux and for Windows) of monitoring all available CPU cores individually and detecting whether any one of them is constantly busy over a longer period of time. In practice, setting up this check has proven worthwhile.
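The following minimal Python sketch illustrates why the average hides a single busy core and what a per-core check looks for. It is not Checkmk code: the function names, the 95 % threshold and the 15 samples are assumptions chosen only for illustration.

```python
from statistics import mean


def average_utilization(per_core):
    """Average utilization over all cores, in percent."""
    return mean(per_core)


def cores_busy_throughout(samples, threshold=95.0):
    """Return the indices of cores at or above `threshold` in every sample.

    `samples` is a list of measurements; each measurement is a list with
    one utilization value (in percent) per core.
    """
    if not samples:
        return []
    core_count = len(samples[0])
    return [
        core
        for core in range(core_count)
        if all(sample[core] >= threshold for sample in samples)
    ]


# Hypothetical host: 16 cores, core 0 pegged at 100 %, all others idle.
snapshot = [100.0] + [0.0] * 15
samples = [snapshot] * 15  # for example 15 consecutive measurements

print(average_utilization(snapshot))   # 6.25
print(cores_busy_throughout(samples))  # [0]
```

The average stays at 6.25 %, far below any usual threshold, while the per-core view immediately identifies core 0.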

To set up this check for your Windows servers, use the CPU utilization for simple devices rule set, which applies to the CPU utilization service and which you can find under the Service monitoring rules. This rule set is responsible for monitoring the total utilization of all CPUs, but it also offers this option: Levels over an extended time period on a single core CPU utilization.

Create a new rule and activate only this option in it:

Dialog for defining the monitoring of CPU utilization of individual cores for Windows servers.

Define the condition so that it only applies to the Windows servers, for instance by using a suitable folder or host tag. This rule will not affect other rules of the same rule set if they set other options, such as the thresholds for the total CPU utilization.

For Linux servers, this is the responsibility of the CPU utilization on Linux/UNIX rule set, in which you can set the same option.

2. Monitoring Windows services

By default, Checkmk does not monitor any services on your Windows servers. Why not? This is simply because Checkmk does not know which services are important to you.

If you do not want to go to the trouble of manually determining for each server which services are important, you can also set up a check that simply checks whether all services with the start type "automatic" are in fact running. In addition, you can be informed whether services are running that were started manually — out of order, so to speak. These will no longer run after a reboot — which could be a problem.

To implement this, you first need the Windows Services rule set which you can find under the Service monitoring rules, by using the search function Setup > General > Rule search, for example. The crucial option in the new rule is Services states. Activate this and add three new elements for the states of the services:

Dialog for defining the Windows server services to be monitored depending on their status.

This allows you to implement the following monitoring:

- A service with the start type auto that is running is considered to be OK.
- A service with the start type auto that is not running is considered to be CRIT.
- A service with the start type demand that is running is considered to be WARN.
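The three elements above amount to a small lookup table. The following Python sketch shows that mapping under stated assumptions: the names are hypothetical, and combinations that none of the three elements cover simply return None here, which says nothing about how Checkmk itself treats them.

```python
# (start type, is the service running?) -> monitoring state,
# exactly as listed in the three rule elements above.
SERVICE_STATE_RULES = {
    ("auto", True): "OK",
    ("auto", False): "CRIT",
    ("demand", True): "WARN",
}


def evaluate_service(start_type, running):
    """Return the monitoring state for a Windows service.

    Returns None for combinations the three elements do not cover
    (an assumption of this sketch, not documented Checkmk behavior).
    """
    return SERVICE_STATE_RULES.get((start_type, running))


print(evaluate_service("auto", True))     # OK
print(evaluate_service("auto", False))    # CRIT
print(evaluate_service("demand", True))   # WARN
print(evaluate_service("demand", False))  # None
```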

However, this rule only applies to services that are actually being monitored. Therefore, we need a second step and a second rule, this time from the Windows service discovery rule set, with which you define which Windows services Checkmk should monitor as services.

When you create this rule, you can first enter the regular expression .* in the Services (Regular Expressions) option, which will then be applied to all services.

After saving the rule, switch to the service configuration for a suitable host. There you will find a large number of new services — one for each Windows service.

To limit the number of monitored services to those of interest to you, return to the rule and refine the search terms as needed. Note that the search terms are case-sensitive. Here is an example of a customized service selection:

Dialog for defining the names of the Windows services to be monitored.
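To get a feeling for how such expressions select services, you can try them out with Python's re module. This is only a sketch: the service names are example values, and it assumes that an expression has to match from the beginning of the name (re.match), which should be verified against the inline help of the rule.

```python
import re


def select_services(patterns, service_names):
    """Return the service names matched by at least one pattern.

    Matching is case-sensitive and anchored at the start of the name
    (an assumption of this sketch).
    """
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        name
        for name in service_names
        if any(regex.match(name) for regex in compiled)
    ]


# Example service names, chosen only for illustration.
names = ["Spooler", "wuauserv", "W32Time", "Dhcp"]

print(select_services([".*"], names))            # all four names
print(select_services(["W32", "Spool"], names))  # ['Spooler', 'W32Time']
print(select_services(["spooler"], names))       # [] because of case sensitivity
```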

If services that you previously included in the monitoring no longer match the new search expressions, they will appear as vanished in the service configuration. With the Rescan button you can clean up and have the entire service list recreated.

3. Monitoring Internet connections

Your organization’s access to the Internet is certainly very important to everybody. Monitoring the connection to 'the Internet' is a bit difficult to implement, as it involves the billions of computers which could (hopefully) be accessible — or not. Nevertheless, you can still set up an efficient monitoring system, based on the following construction plan:

1. Select several computers on the Internet that should normally be reachable via a ping command and note their IP addresses.
2. Create a new host in Checkmk, for example with the name internet, and configure it as follows:
- For IPv4 address enter one of the noted IP addresses.
- Under Additional IPv4 addresses enter the remaining IP addresses.
- Under Monitoring agents, enable Checkmk agent / API integrations and select No API integrations, no Checkmk agent there.
- Save the host without service discovery.
3. Create a new rule from the Check hosts with PING (ICMP Echo Request) rule set that only applies to the new host internet (for example, via the Explicit hosts condition, or a matching host tag). Configure the rule as follows:
- Enable Service Description and enter Internet connection.
- Enable Alternate address to ping and select Ping all IPv4 addresses there.
- Enable Number of positive responses required for OK state and enter 1.
4. Create another rule that also only applies to the host internet, this time from the Host Check Command rule set. There, select as Host Check Command the Use the status of the service… option and enter Internet connection as its name, the same name you chose as the service name in the previous step.

If you now activate the changes, you will get the new host internet with the single service Internet connection in the monitoring.

If at least one of the ping destinations is reachable, the host will have the state UP and the service will have the state OK. At the same time, the service provides you with performance data for each of the specified IP addresses: the round trip time of a typical packet and the packet loss. This will give you an indication of the quality of your connection over time:

List entry of a service for monitoring the Internet connection to several IP addresses.

The fourth and final step is necessary so that the host does not enter the DOWN state if the first IP address is not reachable via ping. Instead, the host will acquire the state of its only service.
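The logic behind this construction can be sketched in a few lines of Python. The function names are hypothetical, the addresses are placeholders from the documentation ranges, and the states used for the failure case (CRIT and DOWN) are assumptions of this sketch; the text above only specifies the case in which at least one destination answers.

```python
def internet_service_state(reachable_by_address, required_ok=1):
    """OK if at least `required_ok` ping destinations answered."""
    positive = sum(1 for reachable in reachable_by_address.values() if reachable)
    return "OK" if positive >= required_ok else "CRIT"


def host_state_from_service(service_state):
    """The host takes over the state of its only service."""
    return "UP" if service_state == "OK" else "DOWN"


# Placeholder addresses; the first one does not answer.
results = {
    "192.0.2.1": False,
    "198.51.100.1": True,
    "203.0.113.1": False,
}

state = internet_service_state(results)
print(state)                           # OK
print(host_state_from_service(state))  # UP
```

With a plain host check that only pings the first address, the same situation would have marked the host as DOWN although the Internet connection is working.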

Important: Since a service does not send notifications while its host is DOWN, you should control the notifications via the host and not via the service. In addition, in this particular case, you should use a notification method that does not require an Internet connection.

4. Monitoring HTTP/HTTPS services

Let’s say you want to check the accessibility of a website or web service. The Checkmk agent does not offer a solution here, as it does not provide this information. Besides, you may not even have the possibility of installing an agent on the server.

The solution is a so-called active check. This is a check that is not performed by an agent, but by contacting the target host directly via a network protocol, in this case HTTP(S).

The procedure is as follows:

1. First create a new host for the web server in Checkmk, for checkmk.com for example. Under Monitoring agents, enable Checkmk agent / API integrations and select No API integrations, no Checkmk agent there. Save the host without service discovery.
2. Then create a new rule from the Check HTTP service rule set that will only apply to the new host (e.g. via the Explicit hosts condition).
3. In the Check HTTP service box you will find numerous options for carrying out the check. Please note the following points:
- For Service name, give the service a name, e.g. Homepage.
- At Host settings > Virtual host you may need to specify the domain of the server if it hosts more than one domain.
- Mode of the Check > Use SSL/HTTPS for the connection enables HTTPS monitoring.
- With Mode of the Check > Expected response time you can have the service set to WARN or even CRIT if the response is too slow.
- With Mode of the Check > Fixed string to expect in the content you can check whether a certain text occurs in the response, i.e. in the delivered page. This allows you to check a relevant part of the content so that a simple error message from the server is not interpreted as a positive response.
4. Save the rule and activate the changes.

You will now have a new host with a service that checks access via HTTP(S).

List entry of a service for monitoring the HTTP/HTTPS services on a host.

You can of course also perform this check on a host that is already being monitored with Checkmk via an agent. In this case, you will not need to create the host, and you will only need to create this new, additional rule for the host.
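To make the idea of such an active check more tangible, here is a minimal Python sketch that contacts a URL directly and evaluates the response time and the content. It is not the check that Checkmk runs: the function names, the thresholds of 1 and 2 seconds and the treatment of HTTP status codes are assumptions for illustration only.

```python
import time
import urllib.error
import urllib.request


def fetch(url, timeout=10.0):
    """Request `url` and return (status code, elapsed seconds, body text)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        status, body = error.code, ""
    return status, time.monotonic() - start, body


def evaluate_http_check(status, elapsed, body, warn=1.0, crit=2.0, expected=None):
    """Turn a response into a monitoring state (thresholds are hypothetical)."""
    if status >= 400:
        return "CRIT"
    if expected is not None and expected not in body:
        return "CRIT"
    if elapsed >= crit:
        return "CRIT"
    if elapsed >= warn:
        return "WARN"
    return "OK"


# Evaluation with made-up responses, so that no network access is needed:
print(evaluate_http_check(200, 0.2, "<h1>Welcome</h1>", expected="Welcome"))  # OK
print(evaluate_http_check(200, 1.5, "<h1>Welcome</h1>", expected="Welcome"))  # WARN
print(evaluate_http_check(200, 0.2, "Maintenance", expected="Welcome"))       # CRIT
```

Against a real server, the two functions can be combined as evaluate_http_check(*fetch("https://checkmk.com"), expected="Checkmk"). The third example shows why checking for an expected string is useful: the server responds quickly and with status 200, but does not deliver the intended page.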

5. Customizing file system thresholds 'magically'

Finding good thresholds for monitoring file systems can be tedious. After all, a threshold of 90 % is much too low for a very large hard disk and is perhaps already too tight for a small one. We have already introduced the facility for setting thresholds depending on a file system’s size in the chapter on fine-tuning monitoring, and hinted then that Checkmk has another, even cleverer option on offer: the magic factor.

You set up the magic factor like this:

1. In the Filesystems (used space and growth) rule set, you create just a single rule.
2. In this rule, enable Levels for used/free space and leave the default thresholds of 80 % / 90 % unchanged.
3. In addition, activate Magic factor (automatic level adaptation for large filesystems) and confirm the 0.80 default value.
4. Also set Reference size for magic factor to 20 Gbyte. Since 20 Gbyte is the default value, it will take effect even without you explicitly activating the option.

The result will look like this:

Dialog for setting the magic factor for file system thresholds.

If you now save this rule and activate the change, you will have threshold values that vary automatically depending on the size of the file system:

File systems that are exactly 20 Gbyte in size are given the thresholds 80 % / 90 %.
File systems smaller than 20 Gbyte are given lower thresholds.
File systems larger than 20 Gbyte are given higher thresholds.

Exactly how high the threshold values are is, well, magical! The factor (here 0.80) determines how strongly the values are adjusted. A factor of 1.0 changes nothing, and all file systems get the same values. The smaller the factor, the stronger the adjustment. The Checkmk default values used in this chapter have proven themselves in practice in a great many installations.

You can see exactly which thresholds apply for each service in its Summary:

List with two file system services and their thresholds.

The following table shows some examples of the effect of the magic factor (mf) with a reference value of 20 Gbyte / 80 %:

File system size	mf = 1.0	mf = 0.9	mf = 0.8	mf = 0.7	mf = 0.6	mf = 0.5
5 Gbyte	80 %	77 %	74 %	70 %	65 %	60 %
10 Gbyte	80 %	79 %	77 %	75 %	74 %	72 %
20 Gbyte	80 %	80 %	80 %	80 %	80 %	80 %
50 Gbyte	80 %	82 %	83 %	85 %	86 %	87 %
100 Gbyte	80 %	83 %	86 %	88 %	89 %	91 %
300 Gbyte	80 %	85 %	88 %	91 %	93 %	95 %
800 Gbyte	80 %	86 %	90 %	93 %	95 %	97 %
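The values in the table are consistent with a simple scaling rule: the free share that belongs to the threshold (here 100 % - 80 % = 20 %) is multiplied by (size / reference size) raised to the power (mf - 1). The following Python sketch reproduces the table in this way. The formula is inferred from the table values and the function name is hypothetical, so treat it as an illustration rather than as the exact Checkmk implementation.

```python
def adjusted_level(size_gb, level_percent, magic, reference_gb=20.0):
    """Scale a percentage threshold according to the file system size.

    At the reference size the threshold is unchanged; smaller file systems
    get lower thresholds and larger ones get higher thresholds.
    """
    relative_size = size_gb / reference_gb
    scale = relative_size ** (magic - 1.0)
    return 100.0 - (100.0 - level_percent) * scale


magic_factors = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]

# First table row: a 5 Gbyte file system with a base threshold of 80 %.
print([round(adjusted_level(5, 80, mf)) for mf in magic_factors])
# [80, 77, 74, 70, 65, 60]

# Default settings (mf = 0.8) for a 100 Gbyte file system.
print(round(adjusted_level(100, 80, 0.8)))  # 86
```

With mf = 1.0 the exponent is 0, so the scale is always 1 and every file system keeps the 80 % threshold, which matches the first column of the table.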

With this chapter on the magic factor, we conclude our Beginner’s Guide. We hope that you have been able to set up a solid foundation for your Checkmk system, with or without magic. For nearly all the topics we have covered in this Beginner’s Guide, you will find more in-depth information in other articles in the User guide.

We wish you every success with Checkmk in the future!

Last modified: Wed, 24 Jan 2024 19:22:06 GMT via commit 3e606d2c
