1. Monitoring the individual CPU load in all cores
Checkmk automatically sets up a service under both Linux and Windows that monitors the average CPU utilization over the last minute. This makes sense, but it misses some errors, for example a single process that runs amok and continuously loads one CPU at 100%. In a system with 16 CPUs, a single CPU contributes only 6.25% to the overall performance, so even in this extreme case a total load of only 6.25% is recorded, which does not trigger a notification.
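The arithmetic behind this blind spot can be sketched in a few lines of Python. This is only an illustration of the averaging effect, not Checkmk code; the function names and the sample values are assumptions made for the example.

```python
def total_utilization(per_core):
    """Average utilization across all cores, in percent."""
    return sum(per_core) / len(per_core)


def busiest_core(per_core):
    """Utilization of the most heavily loaded core, in percent."""
    return max(per_core)


# Hypothetical 16-core system: one core is pinned at 100%, the rest are idle.
cores = [100.0] + [0.0] * 15

print(f"total: {total_utilization(cores):.2f}%")    # total: 6.25%
print(f"busiest core: {busiest_core(cores):.2f}%")  # busiest core: 100.00%
```

A threshold on the total would stay silent here, while a threshold on the busiest core would fire.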
For this reason, Checkmk provides the option (for Linux and for Windows) of monitoring all available CPUs individually and determining whether any of their cores is being constantly loaded over a longer period of time. Setting up this check has turned out to be a good idea.
To set up this check for your Windows servers, you need the CPU utilization for simple devices rule set for the CPU utilization service, which you can find under the Service monitoring rules. This rule set is responsible for monitoring all CPUs, and it also has the optional parameter Levels over an extended time period on a single core CPU utilization.
Create a new rule and activate only this parameter in it.
Define the condition so that it only applies to the Windows servers, for instance by using a suitable folder or host tag. This rule will not affect other rules in the same rule set if they set other parameters, such as the thresholds for the total load.
For Linux servers, this is the responsibility of the CPU utilization on Linux/UNIX rule set, in which you can set the same parameter.
2. Monitoring Windows services
By default, Checkmk does not monitor any services on your Windows servers. Why not? This is simply because Checkmk does not know which services are important to you.
If you do not want to go to the trouble of manually determining for each server which services are important, you can also set up a check that simply checks whether all services with the start type "automatic" are in fact running. In addition, you can be informed whether services are running that were started manually — out of order, so to speak. These will no longer run after a reboot — which could be a problem.
To implement this, you first need the Windows Services rule set which you can find under the Service monitoring rules, by using the search function Setup > General > Rule search, for example. The crucial parameter in the new rule is Service states. Activate this and add three new elements for the states of the services:
This allows you to implement the following monitoring:
A service with the start type auto that is running is considered to be OK.
A service with the start type auto that is not running is considered to be CRIT.
A service with the start type demand that is running is considered to be WARN.
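The three entries above can be read as a small lookup table that is checked from top to bottom. The following Python sketch illustrates that reading only. The state label `stopped` and the function name are assumptions for the example, and what Checkmk reports when no entry matches is deliberately not modelled.

```python
# (service state, start type) -> monitoring state, in the order of the rule entries.
# "stopped" is an assumed label for a service that is not running.
SERVICE_STATE_RULES = [
    ("running", "auto", "OK"),
    ("stopped", "auto", "CRIT"),
    ("running", "demand", "WARN"),
]


def monitoring_state(state, start_type, rules=SERVICE_STATE_RULES):
    """Return the monitoring state of the first matching entry, or None."""
    for rule_state, rule_start_type, result in rules:
        if state == rule_state and start_type == rule_start_type:
            return result
    return None  # no entry matched; the fallback is outside this sketch


print(monitoring_state("running", "auto"))
print(monitoring_state("stopped", "auto"))
print(monitoring_state("running", "demand"))
```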
However, this rule only applies to services that are actually being monitored. Therefore, we need a second step and a second rule, this time from the Windows service discovery rule set, with which you define which Windows services Checkmk should monitor as services.
When you create this rule, you can first enter a regular expression that matches everything, such as .*, in the Services (Regular Expressions) parameter, so that the rule is applied to all services.
After saving the rule, switch to the service configuration for a suitable host. There you will find a large number of new services — one for each Windows service.
To limit the number of monitored services to those of interest to you, return to the rule and refine the search terms as needed. Note that the search terms are case-sensitive. Here is an example of a customised service selection:
If the monitoring already contains services that no longer match the new search expressions, they will appear as missing in the service configuration. With the Full service scan button you can start afresh and have the entire service list recreated.
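To get a feeling for how case-sensitive search terms select services, you can try them out with Python's `re` module before entering them in the rule. This is a minimal sketch: the service names and patterns are sample values, and the assumption that a pattern is matched against the beginning of the service name (as `re.match` does) should be verified against your Checkmk version.

```python
import re

# Sample search terms, chosen only for illustration.
patterns = ["Spooler", "W32Time", "wuau.*"]


def is_selected(service_name, search_terms):
    """True if any term matches the start of the name (case-sensitive)."""
    return any(re.match(term, service_name) for term in search_terms)


for name in ["Spooler", "spooler", "W32Time", "wuauserv", "Themes"]:
    print(name, is_selected(name, patterns))
```

With these sample terms, `Spooler` is selected but `spooler` is not, because the match is case-sensitive.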
3. Monitoring Internet connections
Your organisation’s access to the Internet is certainly very important to everybody. Monitoring the connection to 'the Internet' is a bit difficult to implement, as it involves the billions of computers which could (hopefully) be accessible — or not. Nevertheless, you can still set up an efficient monitoring system, based on the following construction plan:
Select several computers on the Internet that should normally be reachable via a ping command and note their IP addresses.
Create a new host in Checkmk, for example with the name internet, and configure it as follows: For IPv4 Address enter one of the noted IP addresses. Under Additional IPv4 addresses enter the remaining IP addresses. Enable the Checkmk Agent data source and set it to No agent. Save the host without service detection.
Create a new rule from the Check hosts with PING (ICMP Echo Request) rule set that only applies to the new host internet (for example, via the Explicit hosts condition, or a matching host tag). Configure the rule as follows: Enable Service Description and enter Internet connection. Enable Alternate address to ping and select Ping all IPv4 addresses there. Enable Number of positive responses required for OK state and enter 1.
Create another rule that also only applies to the host internet, this time from the Host check command rule set. There, select Use the status of the service… as the Host check command and enter Internet connection as its name, the same name you chose as the service name in the previous step.
If you now activate the changes, you will get the new host internet with the single service Internet connection in the monitoring.
If at least one of the ping destinations is reachable, the host will have the state UP and the service will have the state OK. At the same time, the service provides you with performance data for a typical packet — its round trip time and packet loss for each of the specified IP addresses. This will give you an indication of the quality of your connection over time:
The fourth and final step is necessary so that the host does not enter the DOWN state if the first IP address is not reachable via ping. Instead, the host will acquire the state of its only service.
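The decision logic of this construction can be summarised in a short Python sketch. It does not send any pings itself. It only shows how the ping results for several addresses are combined into one service state and how the host then follows that service. The function names are hypothetical, and the addresses come from the documentation range 192.0.2.0/24 as placeholders.

```python
def internet_connection_state(ping_results, required_ok=1):
    """Service is OK if at least `required_ok` addresses answered the ping."""
    reachable = sum(1 for answered in ping_results.values() if answered)
    return "OK" if reachable >= required_ok else "CRIT"


def host_state(service_state):
    """The host takes over the state of its only service."""
    return "UP" if service_state == "OK" else "DOWN"


# Placeholder results: only one of three destinations answered.
results = {"192.0.2.1": False, "192.0.2.2": True, "192.0.2.3": False}

service = internet_connection_state(results)
print(service, host_state(service))
```

As long as one destination answers, the service stays OK and the host stays UP, regardless of which address is entered first.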
Important: Since notifications are generally not sent for a service while its host is DOWN, it is important that you control the notifications via the host and not via the service. Also, in this particular case, you should use a notification method that does not require an internet connection.
4. Monitoring HTTP/HTTPS services
Let’s say you want to check the accessibility of a Website or Web service. The normal Checkmk agent does not offer a solution here, as it does not display this information — and besides, you may not even have the possibility of installing an agent on the server.
The solution is a so-called active check. This is a check that is not performed by an agent, but by contacting the target host directly via a network protocol, in this case HTTP(S). The procedure is as follows:
First create a new host for the Web server in Checkmk. Activate the data source Checkmk Agent and set it to No agent. Save the host without service detection.
Then create a new rule from the Check HTTP service rule set that will only apply to the new host (e.g. via the Explicit hosts condition).
In the Check HTTP service dialogue you will find numerous parameters for carrying out the check. Please note the following points:
For Service name, give the service a meaningful name.
At Host settings > Virtual host you may need to specify the domain of the server if the server hosts more than one domain.
Mode of the Check > Use SSL/HTTPS for the connection enables HTTPS monitoring.
With Mode of the Check > Expected response time you can have the service set to WARN or even CRIT if the response time is too slow.
With Fixed string to expect in the content you can check whether a certain text occurs in the response — i.e. in the delivered page. This allows you to check a relevant part of the content so that a simple error message from the server is not interpreted as a positive response.
Save the rule and activate the changes. You will now have a new host with a service that checks access via HTTP(S).
You can of course also perform this check on a host that is already being monitored with Checkmk via an agent. In this case, you will not need to create the host and you will only need to create this new, additional rule for the host.
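To make the idea of an active check more tangible, here is a minimal Python sketch that contacts a web server directly, measures the response time and looks for an expected string in the page. It is not how Checkmk's HTTP check is implemented. The function names, the threshold values and the timeout are assumptions chosen for the illustration.

```python
import time
import urllib.request


def evaluate(status, body, elapsed, expected, warn, crit):
    """Map an HTTP response to a monitoring state."""
    if status >= 400 or expected not in body:
        return "CRIT"  # error status, or the expected text is missing
    if elapsed >= crit:
        return "CRIT"
    if elapsed >= warn:
        return "WARN"
    return "OK"


def check_url(url, expected, warn=1.0, crit=2.0, timeout=10.0):
    """Fetch `url` and evaluate status, content and response time (seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection failures, timeouts and HTTP error responses.
        return "CRIT"
    return evaluate(status, body, time.monotonic() - start, expected, warn, crit)
```

Calling `check_url` with the address of your own web server and a phrase that should appear on the page returns OK, WARN or CRIT in the same spirit as the rule parameters described above.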
5. Customising file system thresholds 'magically'
Finding good thresholds for monitoring file systems can be tedious. After all, a threshold of 90% is much too low for a very large hard disk and may already be too tight for a small one. In the chapter on fine-tuning monitoring, we introduced the option of setting thresholds depending on a file system's size, and hinted that Checkmk has another, even cleverer option on offer: the Magic Factor.
You set up the Magic Factor like this:
In the Filesystems (used space and growth) rule set, you create just a single rule.
In this rule, enable Levels for filesystem and leave the default thresholds of 80.0% and 90.0% unchanged.
In addition, activate Magic factor (automatic level adaptation for large filesystems) and confirm the default value of 0.80.
Also set Reference size for magic factor to 20 GByte. Since 20 GByte is the default value, it will take effect even without you explicitly activating the parameter.
The result will look like this:
If you now save this rule and activate the change, you will have threshold values that vary automatically depending on the size of the file system:
File systems that are exactly 20 GByte in size are given the thresholds 80% / 90%.
File systems smaller than 20 GByte are given lower thresholds.
File systems larger than 20 GByte are given higher thresholds.
Exactly how high the threshold values are is, well — magical! The factor (here 0.8) determines how much the values are adjusted. A factor of 1.0 changes nothing, and all file systems get the same values. Smaller values have a greater effect on the adjustment of the values. The default values for Checkmk used in this chapter have proven themselves in practice with very many installations.
You can see exactly which thresholds apply for each service in its Summary:
The following table shows some examples of the effect of the Magic Factor (mf) with a reference value of 20 GByte / 80%:
| File system size | mf = 1.0 | mf = 0.9 | mf = 0.8 | mf = 0.7 | mf = 0.6 | mf = 0.5 |
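The effect of the Magic Factor can be reproduced with a short calculation. The sketch below assumes the commonly described scaling, in which the free-space margin (100% minus the threshold) is multiplied by (size / reference) raised to the power (mf - 1). Treat this formula as an assumption and compare the results with the thresholds shown in the service summary. Any further limits that Checkmk may apply are ignored, and the file system sizes are arbitrary sample values.

```python
def magic_level(level_percent, size_gb, magic=0.8, reference_gb=20.0):
    """Adjusted threshold in percent for a file system of `size_gb` GByte."""
    relative_size = size_gb / reference_gb
    scale = relative_size ** magic / relative_size  # equals 1.0 at the reference size
    return 100.0 - (100.0 - level_percent) * scale


factors = (1.0, 0.9, 0.8, 0.7, 0.6, 0.5)
print("size [GByte] | " + " | ".join(f"mf = {mf}" for mf in factors))
for size in (5, 10, 20, 50, 100, 300, 800):  # sample sizes
    row = " | ".join(f"{magic_level(80.0, size, mf):7.1f}%" for mf in factors)
    print(f"{size:12} | {row}")
```

With a factor of 1.0 every size keeps the 80% threshold, and at exactly 20 GByte the factor has no effect, which matches the behaviour described above.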
With this chapter on the Magic Factor, we conclude our Beginner’s Guide. We hope that you have been able to set up a solid foundation for your Checkmk system — with or without magic. For nearly all of the topics we have covered in this guide, you will find more in-depth information in the reference section for experts, which comprises the rest of the User guide. We wish you every success with Checkmk in the future!