The smooth operation of IT systems has always been a challenge. The complexity of hardware and software stacks, as well as the demands of users, continues to increase — regardless of whether you work with real hardware or with cloud solutions. These days a detailed and comprehensive IT monitoring solution plays a central role in an efficient organisation.
The requirements that users expect from their monitoring are of course as complex as the IT world itself. From its very beginning Checkmk has been developed for large and heterogeneous IT landscapes. That is why it offers a wealth of features and capabilities in order to meet all of the challenges to be found in an organisation. For entry-level users the comprehensiveness of Checkmk can at first be overwhelming.
So that you can nevertheless get your first Checkmk monitoring system into operation quickly and easily, we have broken the Checkmk User’s Guide into two parts:
A beginner’s guide — this article
A comprehensive reference section
The Beginner’s guide takes you step by step through Checkmk, and it is structured so that you can read it quickly from start to finish, and can then begin working with Checkmk. That is why the guide is also short and concise, and contains no distracting, unnecessary details. At the end of the guide you will have a working Checkmk system. In the last section, some of our experienced consultants will show you a few very useful tips and tricks which have proven themselves in almost every Checkmk installation.
Of course, our beginner’s guide leaves many questions unanswered. Answers to these can be found in the manual’s reference section, where for each topic you will find all of the background and details needed for a deeper knowledge.
Before you can install Checkmk, you first have to consider the question of which Checkmk you want. There are four different editions:
The Checkmk Raw Edition is free and 100% open source, and contains Nagios as its core. It can comprehensively monitor complex environments. You can receive support in our forum from the Checkmk community and in the future also in a community portal.
The Checkmk Enterprise Standard Edition (CEE) is aimed primarily at professional users, and beyond the scope of the Raw Edition it offers a number of interesting features, such as a very high-performance core that replaces Nagios, a reporting function, a sophisticated system for the visualization of measured values, a flexible agent deployment function, and much more. For the Standard Edition you can optionally get professional support from us or from one of our partners. You can find a list of its most important differences compared to the Raw Edition on our homepage.
The Checkmk Enterprise Free Edition (CFE) is the right solution for you if you want to test the Standard Edition first without obligation, or if you want to install Checkmk in small operations with up to two sites with 10 monitored hosts each. The Free Edition contains all of the features of the Standard Edition and is supplied at no cost. Both the Free Edition and the Raw Edition can be upgraded directly and easily to the Standard Edition at a later date.
The Checkmk Enterprise Managed Services Edition (CME) is the right edition for you if you are a managed service provider offering services to your customers. It is a multi-client-capable extension of the Standard Edition.
Whenever we discuss functions in this manual that only apply to one of the Enterprise Editions — i.e. for the CEE, CFE or CME — we mark this accordingly here.
We are of course continuously developing all Checkmk editions, so there are different versions of each edition. For the entry level we recommend the latest stable version of Checkmk. A detailed overview of what types of other versions exist can be found in its own article.
The Checkmk server needs a Linux system on which it can run (of course you can also easily monitor Windows and other operating systems). If you do not wish to set up your own Linux server, you can also operate Checkmk with the help of Docker or an appliance. There are four options in total:
The installation of Checkmk on a Linux server, whether on a ‘real’ or on a virtual machine is – so to speak – the ‘normal’ method. If you have basic Linux knowledge, this method is very simple, and all the software you need is either in your distribution or is included in our package.
We support the following Linux distributions: Red Hat, CentOS, SLES, Debian and Ubuntu. For each edition and version of Checkmk, each of these distributions has its own customized package created by us. You can find these on our download page. You install a package directly with the package manager applicable to your distribution. Please follow the instructions in the Installation on Linux systems article.
With the Checkmk virt1 virtual appliance you get a complete, already set-up virtual machine that you can use in VMware, HyperV or VirtualBox. Alongside Checkmk it also contains a complete operating system based on Debian GNU/Linux. The advantage of the appliance is that with it you can also configure the operating system completely using the graphical interface. Thus administering Checkmk is also possible without an in-depth knowledge of Linux. Updating of Checkmk and many other operations are also made possible without using the command line.
If you prefer a physical hardware appliance, you can choose between several models with different support levels. With the hardware appliance you receive a complete system that you can install directly in your data center; once connected, Checkmk is set up and ready to use. With two hardware appliances, in a few easy steps you can combine these into an HA cluster. The instructions for commissioning the appliances can be found in their own article.
Should you wish to deploy Checkmk using a Docker container you also have this option. We support both the Raw Edition and the Enterprise Editions with finished container images that can be set up in a few simple steps.
Detailed instructions on deploying Checkmk can be found in its own article.
Checkmk has a peculiarity that may appear to be superfluous at first, but one which has proved to be very useful in practice: You can have multiple, independent Checkmk instances (Sites) running in parallel on one server. It is even possible for each instance to run a different version of Checkmk.
Here are two common uses for this feature:
Uncomplicated trial and error testing of a new Checkmk version
Parallel operation of a test instance to monitor hosts that are not yet in live operation
If you have just installed Checkmk, there are no instances yet. We will show you here how to create an instance during a normal installation of Checkmk.
If you are running Checkmk in Docker, an instance will already have been created for you automatically. The Checkmk appliances are managed via a web interface which also covers the creation of instances. This is explained in the article about the appliance.
First, select a name for your instance. This may only consist of letters and numbers; the convention is to use lowercase letters. In this guide we use the name mysite for all examples. Always substitute your own instance name wherever it appears.
The creation itself is very easy. Just enter the omd create command as the root user, followed by the name of the instance:
root@linux# omd create mysite
Adding /opt/omd/sites/mysite/tmp to /etc/fstab.
Creating temporary filesystem /omd/sites/mysite/tmp...OK
Restarting Apache...OK
Created new site mysite with version 1.6.0.cee.

  The site can be started with omd start mysite.
  The default web UI is available at http://linux/mysite/
  The admin user for the web applications is cmkadmin with password: ZBdHdkl2
  (It can be changed with 'htpasswd -m ~/etc/htpasswd cmkadmin' as site user.)
  Please do a su - mysite for administration of this site.
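The naming rule above (letters and digits only, lowercase by convention) can be checked before calling omd create. The following is an illustrative sketch, not part of omd itself; the helper name is invented:

```shell
# Illustrative helper (not part of omd): check a proposed instance
# name against the documented convention of letters and digits only,
# lowercase preferred.
valid_site_name() {
    case "$1" in
        *[!a-z0-9]*) return 1 ;;  # contains something other than a-z or 0-9
        '')          return 1 ;;  # empty names are not allowed
        *)           return 0 ;;
    esac
}

valid_site_name mysite && echo "mysite: ok"
valid_site_name "My Site" || echo "'My Site': rejected"
```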
When creating a new instance, the following actions will take place:
A Linux user and a Linux group with the name of the instance are created in the system. This user is referred to as the instance user.
A data directory for the instance is created under /omd/sites/mysite.
A meaningful default configuration is copied to the new directory.
For the Checkmk web interface a user with the name cmkadmin and a random password will be created.
Note: If you receive the error Group ‘foobar’ already existing., then a Linux group with the desired instance name already exists. In this case simply choose another name.
As soon as you have created the new instance, further administration no longer takes place as root, but as the instance user. The easiest way to switch to this user is with the su - mysite command:
root@linux# su - mysite
The changed prompt shows that you are now ‘logged in’ to the instance. As the pwd command shows, you are then automatically in the data directory of the instance (the instance directory).
As you saw in the output from omd create, creating the instance automatically generates a Checkmk administrator user named cmkadmin. This user is intended for logging in to the Checkmk web interface (GUI), and it receives a random password. As the instance user you can easily change this password:
htpasswd -m etc/htpasswd cmkadmin
New password:
Re-type new password:
Updating password for user cmkadmin
By the way: whenever we specify path names in this guide that do not begin with a slash, these are relative to the instance directory. If you are already in this directory, you can use such paths directly. This also applies, for example, to the file etc/htpasswd, whose absolute path here is /omd/sites/mysite/etc/htpasswd, and which is the file containing the passwords of the Checkmk users. Please do not confuse this file with the Linux system’s own /etc/htpasswd.
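The relative-path convention can be tried out with a throwaway directory tree; the temporary directory below merely stands in for a real instance directory such as /omd/sites/mysite:

```shell
# Simulate an instance directory to show how relative paths resolve.
# The temporary directory stands in for /omd/sites/mysite.
site_dir=$(mktemp -d)
mkdir -p "$site_dir/etc"
touch "$site_dir/etc/htpasswd"

cd "$site_dir"
# From inside the instance directory, the relative path etc/htpasswd
# refers to the same file as its absolute counterpart:
[ "$(realpath etc/htpasswd)" = "$(realpath "$site_dir/etc/htpasswd")" ] \
    && echo "same file"
```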
An instance can be started or stopped. Its startup mode is automatic by default, which means that all instances will start automatically following a system reboot. Freshly-created instances begin their lives stopped, however. You can easily verify this with the omd status command, which shows the status of each of the individual processes required for the operation of the instance:
omd status
mkeventd:       stopped
liveproxyd:     stopped
mknotifyd:      stopped
rrdcached:      stopped
cmc:            stopped
apache:         stopped
dcd:            stopped
crontab:        stopped
-----------------------
Overall state:  stopped
You can start the instance with a simple omd start command:
omd start
Creating temporary filesystem /omd/sites/mysite/tmp...OK
Starting mkeventd...OK
Starting liveproxyd...OK
Starting mknotifyd...OK
Starting rrdcached...OK
Starting cmc...OK
Starting apache...OK
Starting dcd...OK
Initializing Crontab...OK
As expected, the status following this shows all services as running:

omd status
mkeventd:       running
liveproxyd:     running
mknotifyd:      running
rrdcached:      running
cmc:            running
apache:         running
dcd:            running
crontab:        running
-----------------------
Overall state:  running
Because the Raw Edition does not have all the features of the Enterprise Editions, you will see fewer services there. In addition, the cmc service is replaced by nagios:

omd status
mkeventd:       started
rrdcached:      started
npcd:           started
nagios:         started
apache:         started
crontab:        started
-----------------------
Overall state:  started
The omd command has many more options for controlling and configuring instances. All details on these can be found in the corresponding articles covering instances.
There is also a specific article covering more detail on the directory structure of the instance and the options for the command line in Checkmk.
Once the instance is running it can be used. Every instance has its own URL which you can open in your browser. This URL is composed of the IP address or hostname of your monitoring server, a slash, and the name of the instance – for example, http://mycmkserver/mysite/. There you will find the following login window:
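The rule for composing the URL can be expressed as a tiny sketch; site_url is an invented helper, and the names are the examples from the text:

```shell
# Build an instance URL from server name and instance name,
# following the pattern http://<server>/<sitename>/
site_url() {
    printf 'http://%s/%s/\n' "$1" "$2"
}

site_url mycmkserver mysite   # prints http://mycmkserver/mysite/
```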
If your instance has not started, you will see the following error message:
If there is no instance with this name (or you have landed on a server without Checkmk), it will look like this:
Now log in with the user cmkadmin and the initial, randomly-generated password, or respectively your new, updated password. This will land you on the main dashboard.
Important: As soon as you are operating Checkmk in a production environment, we recommend for security reasons that you access the interface exclusively via HTTPS. How to do this is explained in its own article.
You will see quite a number of elements in the interface which we do not need at this time. Many of these elements are empty, or in any case display only zeros because we have not yet included objects in the monitoring configuration.
Nevertheless, you should first familiarize yourself with the basic elements of the interface. Most important is the division into the Sidebar on the left and the main area on the right. Of course, what you see in the main section depends on where you are in Checkmk right now. After logging in you first start in the default dashboard, which shows a rough overview of the current state and the recent events in monitored objects.
More important is the sidebar. Here you will find a number of elements, also referred to as snap-ins. Depending on the size of your screen, not all snap-ins will be visible. But how does one move within the sidebar, given that it has no scroll bars? Here are two options:
Simply roll the mouse wheel up and down while the mouse pointer is over the sidebar. For touchpads, this feature is often possible with the ‘two fingers up and down’ gesture.
With the mouse just ‘grab’ one of the snap-ins outside of its title bar and move it up or down.
In the default setting (of course, the sidebar is customizable!) you will find the following elements:
The Tactical Overview — an overview of all monitored objects
The Quicksearch — Search box
Views — The directory of various status views
Reporting — Create PDF reports
Bookmarks — Your personal bookmarks within Checkmk
WATO – Configuration — the most important element: used for the configuration of the monitoring
The Master Control — various main switches for the monitoring service
At the top of the sidebar you will find the Checkmk edition and version identification, as well as the Checkmk logo. A click on the logo will always bring you to Checkmk’s home dashboard.
At the bottom of the sidebar you will find an icon that brings you to your personal settings, where you can change your password, and finally an icon that logs you out of the interface.
So, Checkmk is now ready. But before we start with the actual monitoring, we should briefly explain some important terms. We will begin with the host. In Checkmk a host is typically a server, a VM, a network device, an appliance, or anything else with an IP address which is being monitored by Checkmk. Every host always has one of the states UP, DOWN or UNREACH. There are also hosts without an IP address, such as Docker containers.
On each host a number of services are monitored. A service can be anything — for example, a file system, a process, a hardware sensor, a switchport — but it can also just be a specific metric like CPU usage or RAM usage. Each service has one of the states OK, WARN, CRIT or UNKNOWN.
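The four service states correspond to the numeric codes 0 (OK), 1 (WARN), 2 (CRIT) and 3 (UNKNOWN) that check plug-ins report. As an illustration, here is a minimal check-style sketch; the service name my_usage_check and the 80/90 thresholds are invented for this example:

```shell
# Sketch of a check that maps a usage percentage to a service state.
# State codes: 0=OK, 1=WARN, 2=CRIT (3 would be UNKNOWN).
# The name "my_usage_check" and the thresholds are made up.
report_usage() {
    usage=$1
    if   [ "$usage" -ge 90 ]; then state=2; word=CRIT
    elif [ "$usage" -ge 80 ]; then state=1; word=WARN
    else                           state=0; word=OK
    fi
    echo "$state my_usage_check usage=$usage $word - usage is at ${usage}%"
}

report_usage 42   # prints: 0 my_usage_check usage=42 OK - usage is at 42%
report_usage 95   # prints: 2 my_usage_check usage=95 CRIT - usage is at 95%
```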
In order for Checkmk to be able to query data from a host, an agent is usually necessary. This is a small program that is installed on the host and provides ‘health’ information about the host on request. The manufacturers of network devices and many appliances usually include a pre-installed agent which Checkmk can easily query using the standardized SNMP protocol. Cloud services like AWS or Azure also have features similar to agents, but these are called ‘APIs’ and are queried by Checkmk via HTTP. Servers running Windows, Linux or Unix can only be monitored by Checkmk if you install one of our Checkmk agents.
Even though Checkmk requires no name resolution of hosts, a well-maintained DNS is of great help with configuration and for avoiding mistakes. Checkmk can then resolve host names autonomously, so that you do not have to manually enter any IP addresses in Checkmk.
The implementation of monitoring is therefore a good opportunity to bring your DNS up to date and to add any missing entries!
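You can check resolution for a host before adding it; a quick sketch from the server’s shell, with localhost standing in for the host name you are about to add:

```shell
# Check that a host name resolves before adding it to the monitoring.
# "localhost" stands in for the host name you are about to add.
if getent hosts localhost >/dev/null; then
    echo "localhost resolves"
else
    echo "localhost does not resolve - fix DNS first"
fi
```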
Checkmk manages your hosts in a hierarchical tree of folders — quite analogously to the way you see files in your operating system. If you only have a handful of hosts to monitor, that may not be that important to you — but remember — Checkmk has been designed for monitoring thousands and tens of thousands of hosts. And then good organisation is half the battle won!
So, before you include your first hosts in Checkmk, it is a good idea to give some thought to the structure of these folders, since this is not only useful for your own overview, but is also basically the same method that you can use to define all of the configuration attributes of the hosts in a folder. These attributes are then automatically inherited by any subfolders and hosts this folder contains.
You can of course change the folder structure at any time. You must, however, proceed very carefully, since moving a host to another folder may alter its attributes without your being aware of it.
The real question when building a folder structure that makes sense to you is the consideration of the criteria you want to use to structure the folders. This can be different in each level of the tree. So you can — for example — in the first level order by location, and below that in the second level order by technology.
The following classification criteria have proven themselves well in practice:
Sorting by location is obviously used mostly by larger companies, especially if the monitoring is distributed over multiple Checkmk servers. Each server then monitors a region or a country, for example. If your folders map such divisions, you can define, for example in the folder ‘Munich’, that all hosts in this folder be monitored by the muc instance in Munich.
Alternatively, the question of organization — that is, who is ‘responsible’ for a host — can be a more meaningful criterion, because location and responsibility are not always the same. It may be that one group of your colleagues is responsible for the administration of Oracle, regardless of where the respective hosts are located. So, if the Oracle folder is provided for the Oracle colleagues’ hosts for example, it is then easy to configure that all the hosts within this folder are visible only to these colleagues and that they can even take care of their own hosts there.
Structuring according to technology could, for example, provide a folder for Windows servers and one for Linux servers. This in turn simplifies configuration according to the formula ‘a certain process must be running on all Linux servers’. Another example is the monitoring of devices such as switches or routers via SNMP. Here no agent is used; instead the devices are queried via the SNMP protocol. When these hosts are gathered in folders, you can make the necessary SNMP settings — such as the community — directly on the folder.
Of course, a tree structure does not reflect the whole complexity of reality — with the host properties (tags) Checkmk provides another structure option that intelligently complements the trees. But more about this later. Further information on structuring the folders can be found in the reference section.
The function for the administration of folders and hosts can be found in the WATO ⇒ Hosts module, which you can reach via the WATO – Configuration sidebar element:
One folder — the root folder — is present in a freshly-installed Checkmk system. This is named Main Directory by default, but if you don’t like this name, you can easily rename it by using the Folder properties button. You can create new hosts directly here, but it is better if you first create some suitable subfolders.
For our beginner’s manual we will use a simple example — the three folders Windows, Linux and Network. Create these three folders by clicking the New folder button and in the first menu titled General properties, enter each folder’s respective name:
Tip: If you are too lazy to scroll to the Save & Finish button, just press enter while the cursor is still in the text input field. That also performs a save, and exits the form.
After that the situation will look like this:
Tip: In many windows (as seen here when creating a new folder) you will see a small icon of a book in the upper right corner. With this you can turn the online help on and off. The help explains the individual input fields.
Now we are ready to add the first host into the system. And what could be more obvious to monitor than the Checkmk server itself? Of course, this will never be able to notify its own total failure, but it is still useful, since you will not just get an overview of the CPU and RAM usage, but also quite a few metrics and checks about the Checkmk system itself.
The procedure for adding a Linux or Windows host is always the same:
Download the Checkmk agent
Install the Checkmk agent on the destination host
With WATO add the host into a suitable folder
Perform a service configuration
Activate the changes
Because the Checkmk server is a Linux machine, you need the Checkmk agent for Linux. You can find this directly in the interface under WATO ⇒ Monitoring Agents.
In the Enterprise Editions, clicking here brings you to the Agent Bakery. This allows the ‘baking’ of individually-configured agent packages — however, a generic agent is always generated without you needing to do anything:
Choose RPM format for Red Hat, CentOS, or SLES, and DEB format for Debian and Ubuntu. Download the file and copy it to the Checkmk server.
The Raw Edition does not have an agent bakery. Clicking WATO ⇒ Monitoring Agents
takes you directly to a download page on which you can find
preconfigured agents and agent plug-ins. (In the Enterprise Editions this
same page can be found under Agent files.)
From the first box, Packaged Agents, select one of the two Linux packages (RPM/DEB) and copy it to the Checkmk server.
In the example below, assume that you have put the file in the home directory of the root user. This file is only needed during installation — you can delete it afterwards.
The installation is done as root on the command line. For RPM packages use the rpm command, preferably with the -U option:
root@linux# rpm -U check-mk-agent-1.6.0-3a83e51d5c12619c.noarch.rpm
… or for DEB respectively with the dpkg -i command:
root@linux# dpkg -i check-mk-agent_1.6.0-3a83e51d5c12619c_all.deb
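If you script the agent rollout, the matching install command can be derived from the package’s file extension. A sketch; install_cmd_for is an invented helper that only echoes the command instead of running it:

```shell
# Pick the matching install command from the package file extension.
# The function only echoes the command, so it is safe to experiment.
install_cmd_for() {
    case "$1" in
        *.rpm) echo "rpm -U $1" ;;
        *.deb) echo "dpkg -i $1" ;;
        *)     echo "unknown package type: $1" >&2; return 1 ;;
    esac
}

install_cmd_for check-mk-agent-1.6.0-3a83e51d5c12619c.noarch.rpm
install_cmd_for check-mk-agent_1.6.0-3a83e51d5c12619c_all.deb
```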
Important: In order to function, the agent requires either systemd — which in newer distributions is the default — or the auxiliary daemon xinetd. Which of these applies in your case can easily be seen in the output when installing the agent:
If the output contains the line Enable Check_MK_Agent in systemd…, the agent is run via systemd. If instead a message referring to xinetd appears, the agent is run via xinetd. If neither of the two messages appears, the agent is not running.
If you have neither systemd nor xinetd, simply install the latter.
That is performed on RedHat/CentOS with:
root@linux# yum install xinetd
On SLES the command is:
root@linux# zypper install xinetd
And on Debian/Ubuntu:
root@linux# apt install xinetd
Incidentally, the Checkmk agent for Linux is an executable program (a shell script), which you can easily test by calling the check_mk_agent command:

root@linux# check_mk_agent
<<<check_mk>>>
Version: 1.6.0
AgentOS: linux
Hostname: linux
AgentDirectory: /etc/check_mk
DataDirectory: /var/lib/check_mk_agent
SpoolDirectory: /var/lib/check_mk_agent/spool
PluginsDirectory: /usr/lib/check_mk_agent/plugins
LocalDirectory: /usr/lib/check_mk_agent/local
...
To test the accessibility of the agent from outside, from an external system you can use telnet to attempt a connection to port 6556. The agent should respond with the same information:

root@linux# telnet mycmkserver 6556
Trying 192.168.56.100...
Connected to mycmkserver.example.net.
Escape character is '^]'.
<<<check_mk>>>
Version: 1.6.0
AgentOS: linux
Hostname: linux
...
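The <<<check_mk>>> header makes the agent output easy to parse. As an illustration, the following runs a small awk filter over a captured sample of that output (the sample is the excerpt above, not a live query; in practice you could pipe check_mk_agent output, or nc mycmkserver 6556, into the same filter):

```shell
# Extract fields from the first section of captured agent output.
sample='<<<check_mk>>>
Version: 1.6.0
AgentOS: linux
Hostname: linux
<<<df>>>
/dev/sda1 ext4 ...'

echo "$sample" | awk '
    /^<<<.*>>>$/ { section = substr($0, 4, length($0) - 6); next }
    section == "check_mk" && $1 == "Version:" { print "version " $2 }
    section == "check_mk" && $1 == "AgentOS:" { print "os " $2 }
'
# prints:
#   version 1.6.0
#   os linux
```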
Note: By default the agent is reachable from the entire network and can be queried without requiring a password. As the agent does not accept commands from the network, however, a potential attacker cannot gain access. Information such as the list of current processes is still visible, though. How to protect the agent can be learned in the article about the linux agent.
After the agent has been installed on the destination host you can start monitoring it. In our example that is the Checkmk server itself, but that does not really make a difference.
Now go back to the WATO ⇒ Hosts module and there switch to the Linux folder by simply clicking on the folder’s graphic. Click on New host.
There you will find a form with several boxes and many input options. As mentioned at the beginning, Checkmk is a complex system which has an answer for every question. That is why a great deal of configuration can be made on a host.
The good news is that you only have to fill in one field, namely the Host name field in Basic Settings. You can choose this name freely. It serves as the host’s unique key, and is used at all points in the monitoring:
If the host is resolvable under its own name in DNS, you are already finished with this form. If not, or if you do not want to use DNS, you can enter the address by hand in the IPv4 address field:
Note: So that Checkmk can always run stably and efficiently, it maintains its own cache for hostname resolution. Thus a failure of the DNS service does not cause a failure of the monitoring system. The DNS query is performed only once – when the host is added to the system.
This cache is automatically renewed every day at 00:05. Clicking on the Update DNS cache button in the Host Properties window of one of your hosts you can rebuild the entire DNS cache manually. Do this if you want a change in your DNS to take effect immediately.
You can find detailed information about name resolution during monitoring in the article covering host administration.
Everything that can go wrong eventually will go wrong — and, of course, especially when you are doing things for the first time! That is why good fault-diagnosis options are so important. One of these options can be found in WATO if you save the host’s properties with Save & Test. Alternatively, in the Host Properties, by using the Diagnostic button you can reach the same diagnostics page at any time — in this case without first needing to save.
Scroll down the diagnostics page and press Test. Now Checkmk will try to reach the host in all possible ways. For Windows and Linux hosts only the two upper boxes are interesting:
The other boxes attempt to make contact via SNMP; these are very useful for the network devices that we will discuss below.
On the diagnostics page in the Host properties box you can, if necessary, try a different IP address, and even use this IP address with Save & Exit directly in the host properties.
Once the host itself has been added, we come to the really interesting part: the configuration of services. This can be achieved in a number of ways:
by saving the host properties with Save & go to Services
by clicking the icon in the folder view of a host
by clicking on the Services button in the host properties, or at the top of any other page for the host
On this page you specify which services you wish to monitor on the host. If the agent is running correctly on the host and is reachable, Checkmk automatically finds a set of services and suggests these to be monitored (abbreviated here):
For each of these services there are in principle three possibilities:
Undecided: You have not yet decided whether you want to monitor this service.
Monitored: The service is being monitored.
Disabled: You have chosen not to monitor the service.
In the beginning all services start as undecided. For starters, it is easiest if you now click Fix all missing/vanished — all services will then be transferred directly to the configuration.
You can call up this view at any time later to configure its services. Sometimes new services are the result of changes to a host, e.g., if you include a LUN as a file system, or configure a new instance of Oracle. These services first appear as undecided, and you can then add them one at a time or all at once into the monitoring configuration.
Conversely, services may disappear, e.g., because a file system has been removed. These services then appear in the monitoring as UNKNOWN, and in the configuration page as vanished. You can remove these from the monitoring here.
The Fix all missing/vanished button performs both of these functions at once — adding missing services, and removing unnecessary ones.
WATO is basically designed so that any changes you make initially only apply to a preliminary ‘configuration environment’, so that current production operations are not yet affected. Only after activating the changes (Activate changes) are these transferred into the production monitoring. Learn more about the background for this in the article about WATO.
Now click on the button to apply the changes. This brings you to a new page that, among other things, in Pending changes lists changes that have not yet been activated:
Now click on the Activate affected button to apply all changes. Shortly afterwards in the Tactical Overview sidebar you will see how the host and its services appear there. Also in the main dashboard that you reach by clicking on the Checkmk logo at the top left corner, you will now be able to see that the monitoring system has been brought to life.
As with Linux, Checkmk also has its own Windows agent. This is provided as an MSI package. You find it at the same location as the Linux agent. Once you have copied the MSI package to your Windows machine you can install it with the usual Windows double-click.
Note: You may need to adjust the firewall settings on Windows, so that Checkmk can access the network.
Once the agent has been installed you can add the host to the monitoring setup. This works in the same way as seen above with the Linux host. Because Windows is structured differently from Linux, the agent, however, finds other services to monitor. More details about monitoring Windows can be found in its own article.
Professional switches, routers, printers and many other devices and appliances already have a built-in interface for monitoring provided by their manufacturer: the Simple Network Management Protocol (SNMP). Such devices are very easy to monitor with Checkmk – and you do not even have to install an agent.
The basic procedure is always the same:
Using the device’s management interface, enable SNMP read access for the Checkmk server’s IP address.
You assign a Community string. This is nothing more than a password for access. Since this is usually transmitted in plain text within the network, it is of limited sense to make the password very complicated. Most users simply use the same community string for all devices within a company. This also greatly simplifies the configuration in Checkmk.
Create the host as usual in Checkmk.
In the host’s properties in the Data sources box, set Check_MK Agent to No agent.
In the same box activate SNMP, and select SNMP v2 or v3.
If the community string is not public, enable SNMP credentials ⇒ SNMP community (SNMP Versions 1 and 2c) and enter the community string there.
If you have all SNMP devices in their own folder, simply configure the Data sources directly on the folder — the settings will then automatically apply for all hosts in the folder!
The rest is as usual. If you want you can have one more look at the diagnostics page — there you will also see immediately if the access via SNMP works, here, e.g., for a CISCO Catalyst 4500 switch:
Then click Save & go to Services again to see the list of all services. Of course, this looks completely different from that of a Windows or Linux host. For all devices Checkmk by default monitors all ports that are currently in use. Of course, you can later adjust this as desired. In addition, it shows the general information for the device, as well as its uptime, each as a service that is always OK.
All details about monitoring SNMP with Checkmk can be found in a separate article in the reference section.
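Independently of Checkmk, you can verify SNMP access from any shell with the net-snmp command-line tools, assuming they are installed; mydevice and mycommunity are placeholders for your own values:

```shell
# Query the system group of a device via SNMP v2c.
# "mydevice" and "mycommunity" are placeholders for your own values;
# requires the net-snmp utilities to be installed.
snmpwalk -v2c -c mycommunity mydevice system
```

If this prints the device's system description and uptime, Checkmk will be able to reach it with the same settings.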
You can also easily monitor cloud and container services with Checkmk, even if you do not have access to the actual server. For this Checkmk uses the providers’ APIs. These APIs use HTTP or HTTPS. The basic principle is always the same:
You set up an account for Checkmk in the provider’s management interface.
In Checkmk you create a host to access the API.
For this host you create a configuration to access the API.
For the monitored objects, such as VMs, EC2 instances, containers, etc., create or automate additional hosts in Checkmk.
There are step-by-step instructions in the manual for all of these:
Now that we finally have something for our monitoring system to do, it makes sense to have a closer look at the interface. Above all we are interested in the things relevant to operations — the everyday life of a monitoring system, so to speak. In Checkmk this component is also sometimes referred to as the status interface, because it is mostly about seeing the current status of all hosts and services.
Let’s take a closer look at the Tactical Overview:
In the left column of this small table you will first see the number of monitored hosts and services. The third line shows Events. These will only become relevant for you once you have configured the monitoring of messages — meaning, for example, messages from syslog, SNMP traps and log files. For this Checkmk has its own very powerful module, the Event Console, which will not be discussed in this beginner’s guide.
The second column shows the problems. These are the monitored objects which have the status WARN/CRIT/UNKNOWN, or DOWN/UNREACH. You can click on the number in the cell and be linked directly to the objects that are counted here.
The third column can never be bigger than the second one, because it shows those problems that are still unacknowledged. An acknowledgment is a kind of ‘recognition’ of problems, a subject which we will discuss later.
The last column shows objects that are currently stale. These are hosts or services that currently have no up-to-date monitoring data available. If a host is currently not available, Checkmk of course can have no information about its services. That does not necessarily mean that there is a problem with them. That is why Checkmk does not just assume a new status for these services, instead it flags them with the pseudostate stale. The Stale column will be missing if all other fields show a 0 (zero).
For pages you visit regularly you can create bookmarks with the Bookmarks snap-in:
But why do you need these bookmarks? After all, there are also bookmarks in the browser! Well, the Checkmk bookmarks have a few advantages:
You only change the content on the right side without reloading the sidebar.
You can share bookmarks with other users.
Setting bookmarks automatically prevents the repetition of actions.
The Checkmk bookmarks are organized in lists. Such a list is a collection of bookmarks that you can manage as a whole. So you can decide, per list, whether the list should be made available to other users or remain private for your own use.
Besides, each bookmark has a topic — this is the folder under which the bookmark is saved in the sidebar.
Important: a list can sort its bookmarks into different topics! And vice versa — a topic can contain bookmarks from different lists.
To start with, the snap-in for the bookmarks is still empty:
If you click Add Bookmark, a new bookmark will be generated from what is currently displayed in the main view, and this new bookmark will be automatically saved in the (Topic) My bookmarks folder.
If you want to look deeper into the subject of bookmarks, you can find more details in the GUI reference.
The Quicksearch element searches for hosts and services in the status interface (not in WATO!). It is very interactive. Once you’ve typed something, you immediately see auto-completion suggestions. Here are a few tips:
The search is not case-sensitive.
You do not have to select an entry from the suggestion list. Just press Enter to find a view of all the hosts or services that match the search expression.
You can save the result of the search in a bookmark.
If you want to combine host and service patterns, you can use the h: and s: prefixes together. A search for h:win s:cpu will show you all the services that contain cpu on all hosts whose names contain win.
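The matching logic behind such a combined search can be illustrated with a small sketch (the host and service names are hypothetical examples; the matching shown is plain case-insensitive substring matching, not Checkmk code):

```python
# Hypothetical monitoring inventory: (host, service) pairs.
services = [
    ("winsrv01", "CPU utilization"),
    ("winsrv02", "Memory"),
    ("linux01",  "CPU load"),
]

def quicksearch(entries, host_pat, svc_pat):
    """Keep entries whose host matches host_pat AND whose service matches svc_pat."""
    return [(h, s) for h, s in entries
            if host_pat.lower() in h.lower() and svc_pat.lower() in s.lower()]

# The equivalent of searching "h:win s:cpu":
print(quicksearch(services, "win", "cpu"))  # → [('winsrv01', 'CPU utilization')]
```

An empty pattern matches everything, which is why "s:cpu" alone would list CPU services on all hosts.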
In the Master control element you can turn various functions of the monitoring system on and off individually — such as the alerting (Notifications), for example. The latter is very useful if you are making major alterations to the system and want to avoid annoying your colleagues with useless messages.
Please make sure that all switches are set back to on during normal operation, otherwise important monitoring functions may remain switched off!
Each of the elements can be removed from the sidebar or collapsed. For this you have two icons in the upper right corner of each element. Clicking on the cross removes the element; a click on the small dash collapses it. When an element is collapsed, the small dash changes to a square, and clicking on the square unfolds the element again.
You will find the icon on the far left at the bottom of the sidebar. With this you can extend the sidebar with additional snap-ins. Clicking on the icon will show you all available elements, which you can then simply click on to add. Note that these appear at the bottom and you may need to scroll down the bar to see them.
The order of snap-ins in the sidebar can be changed easily with the mouse. Click with the left mouse button on the upper edge of the snap-in, hold the mouse button down and move the snap-in to the desired position.
If you want to hide the sidebar in order to enlarge another window, all you have to do is move the mouse pointer to the very left of the sidebar’s frame and click to collapse the sidebar — you will then only see a black vertical line. If you later click on this, you can unfold the sidebar again.
Besides the Tactical Overview, the most important snap-in for operations is the one titled Views. A view is a status display that shows you the current state of hosts or services (and sometimes other objects).
Such a view may have a context — e.g., when it contains all services of the host myhost012. Other views operate globally, e.g., the one that shows you all of the services that currently have a problem.
All of these global views are accessible through the Views snap-in. The views are grouped into Topics (folders) which can be opened and closed individually:
You have numerous options in the status views:
You can navigate to other views by clicking certain cells (here, for example, the host name or the number of its services in the WARN state).
By clicking on a column title you can sort by this column.
Click on to see a whole series of other buttons that will take you to related views.
The button opens a series of search fields which you can use to filter the objects shown.
allows you to change the number of columns displayed (to take full advantage of your wide screen). You can also change this with the mouse wheel when the pointer is over this button.
With you set the number of elapsed seconds after which the view is automatically refreshed (after all, status data can change at any time).
The views have many more options, so that you can customize the views, and even build your own views. You can find out how to do that in a separate article.
The vast majority of services not only provide a status, but also measured values. As an example, take the service which checks the file system C: on a Windows server:
In addition to the status OK, you can see that of the file system’s total capacity of 135.78 GB, 68.67 GB is in use — equivalent to 50.57%. The details are shown in the text section of the status output. The most important of these values — the percentage — is also visualized on the right in the Perf-O-Meter column.
But this is just a rough overview. A detailed table of all measured values for a service can be found in its detail view in the Service Metrics line:
Even more interesting, however, is that Checkmk automatically stores the time line of all such readings for up to four years (this is of course customizable). Within the first 48 hours, the values are stored to the minute. Time lines are displayed in graphs like this, as they are shown in the Checkmk Enterprise Editions:
Here are a few tips on what you can do with these graphs:
If you move your mouse over a reading, a small pop-up opens with the exact values for that time.
Hold down the left mouse button anywhere in the data area and move the mouse left or right to shift the displayed time range.
While still holding down the mouse button, slide up and down to scale the graphs vertically.
With the mouse wheel you can zoom in and out in the timeline.
You can resize the graph with the in the lower right corner.
In the Checkmk Raw Edition there is also a system for displaying graphs. This is based on PNP4Nagios and is not interactive.
The system for recording, evaluating and displaying measured data in Checkmk can do much more — especially in the Checkmk Enterprise Editions. Details can be found in its own article.
We have included hosts in the configuration, and we have looked at the operation of the status interface. Now we can start with the actual monitoring. It’s important to bear in mind that the purpose of Checkmk is not to constantly occupy staff with its own configuration, but to support an IT department.
Now the different status views show you exactly how many and what problems there are. However, for the illustration of workflows, and for ‘working’ properly with the monitoring we need something more:
In this chapter, we will start with only the first two elements. The alerting will be handled separately later — with good reason, as we will see.
In the Tactical Overview we have already seen that problems can be either unhandled or handled. An Acknowledgment is the action that changes an unhandled problem into a handled one. That does not necessarily mean that someone really cares about the problem. Some problems even disappear by themselves. But acknowledging them helps you keep track and to establish workflows.
What exactly happens when you acknowledge a problem?
The host/service will no longer be listed in the third column in the Tactical Overview.
The default dashboard also does not list the problem.
The object is marked with the icon in status views.
By acknowledging, an entry is made in the object history so that you can follow it up later.
Repeating alarms (if configured) are stopped by acknowledgments.
So, how do you acknowledge a problem? Well, first open it in a status view. There are two ways of acknowledging — the first way is the best if you just want to acknowledge a single problem. To do this, click through to the details of the host/service — thus the view titled …
Status of host myhost123 in the case of a host
Service myhost123, FOO Service in the case of a service
Now click on the symbol at the top. This will open a number of input fields through which you can perform various actions on the displayed host/service. The one we are looking for is the field at the top:
Enter a comment here and click on Acknowledge — and after the obligatory “Are you sure?” question…
… the problem will be considered as acknowledged. Here are some hints:
You can also remove an acknowledgment with the Remove acknowledgment button.
Acknowledgments can automatically expire. The Expire Acknowledgment after … option provides for this.
It is not unusual for a number of (related) problems to need acknowledging at the same time. This can be handled almost as easily. Call up a status view that shows all of these problems. Sometimes Quicksearch is enough for this; the Services ⇒ Service Search view is somewhat more flexible.
Once you have got a view of the exact services to be acknowledged, simply proceed as described above. The command will be automatically applied for each of the services shown.
However, if you need a specific selection, with a click on you can open a checkbox for each line. Check the required hosts or services boxes, and then execute the command.
Attention: Never forget that commands are always performed on ALL displayed objects if you have NOT activated ANY checkboxes!
Sometimes things have not been broken accidentally, but on purpose. Or as we prefer to say, the problem is expected. For example, every piece of hardware or software must be serviced occasionally, and while the necessary maintenance work is being performed, the affected host or service will of course go to WARN or CRIT in the monitoring.
For those who need to respond to problems in Checkmk, it is naturally very important to know about planned downtimes, so that valuable time is not wasted on ‘false alarms’. To ensure this, Checkmk uses the concept of maintenance times, called Scheduled Downtimes (you will occasionally see the shortened form Downtimes — which here does not simply mean that a host is DOWN or a service is CRIT, but that this is so deliberately).
So, if maintenance is required on an object, you can put it into maintenance — either immediately or for a selected period in the future. This is the same as for an acknowledgment, but in this case is entered in the Downtimes field:
There are a whole bunch of options for maintenance times. A comment must be entered in every case. By selecting the appropriate button you set the start and end of the maintenance time. For example, with the 2 hours button the object is declared as ‘in maintenance’ for two hours starting from the current time. Unlike acknowledgements, maintenance times always have an end time that is set in advance.
Here are some hints:
When you put a host into maintenance, all of its services are automatically considered to be in maintenance. You therefore save yourself the work of doing it multiple times.
If you use the Checkmk Enterprise Editions, you can also define regular maintenance times (for example, due to a mandatory reboot once a week).
The flexible downtimes start automatically only when the object actually assumes a non-OK state.
Here are the effects of a maintenance time:
The views will display an icon for the affected hosts/services.
Alerting of problems is disabled during maintenance.
Affected hosts/services no longer appear as problems in the Tactical Overview.
Scheduled maintenance times are considered separately in the availability analysis.
At the beginning and at the end of a maintenance period, a special alert is triggered to inform you.
Further information about maintenance times can be found as always in its own article.
Monitoring is only really useful if it is precise. The biggest obstacle to acceptance among colleagues (and probably also yourself) is false positives, or simply false alarms.
With some Checkmk newcomers, we have found that they include many systems in their monitoring within a short time frame — maybe because this is so easy in Checkmk. When, shortly after implementation, the alert functions for all elements are activated, operations staff are flooded with hundreds of emails each day, so that after just a few days their enthusiasm for monitoring is permanently destroyed.
Even though Checkmk really makes an effort to have sensible defaults for everything, it simply cannot know precisely enough what counts as normal in your IT environment. Therefore a bit of manual effort on your part is required to fine-tune your monitoring and to get rid of the last few false positives. Apart from this, Checkmk will identify a lot of real problems that you and your colleagues have not noticed. These must first be dealt with — by resolving the problems, not by adjusting the monitoring!
The following principle has therefore proved successful: first quality, then quantity. Or to put it differently:
Do not include too many hosts in the monitoring system at once.
Make sure that all services that do not really have a problem are flagged reliably as OK.
Activate notifications via email or SMS only once Checkmk has run reliably for a while without any, or with very few, false alarms.
In this chapter we will show you what fine-tuning options you have available (so that everything turns green), and how to get a grip on the occasional misfires.
Before we get to the configuration, we briefly have to address the subject of settings for hosts and services in Checkmk. Because Checkmk has been designed for large and complex environments, its configuration is based on rules. This concept is very powerful and brings many benefits even in smaller environments.
The basic idea is that you do not need to set every parameter for each service explicitly, but can instead express something like: ‘On all Oracle production servers, file systems prefixed /var/ora flag WARN at a 90% fill level, and CRIT at 95%.’
Such a rule can in one fell swoop establish thresholds for thousands of file systems. At the same time it also very clearly documents which monitoring policies apply in your business.
Of course, you can also specify individual cases separately. A suitable rule might look like this: ‘On the server srvora123 the file system /var/ora/db01 receives WARN at a 96% fill level, and CRIT at 98%.’ This example can be called an exception — but it is nevertheless a completely normal rule.
Each rule has the same structure. It always consists of one condition, and one value. In addition you can also include a title and a comment to document the function of the rule.
The rules are organized in rule chains. There is a separate rule chain for every type of parameter in Checkmk. For example there is one named Filesystems (used space and growth) which sets the thresholds for all services that monitor file systems. If Checkmk wants to determine which thresholds a particular file system check receives, it goes through all of the rules in this chain in turn. The first rule that satisfies the condition sets the value — so in this case the exact requirements for when the file system check flags a WARN or CRIT.
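The first-match evaluation of a rule chain can be sketched in a few lines of Python. This is an illustrative model only — the field names (required_tags, item_prefix) are invented for this sketch and are not Checkmk’s internal format:

```python
# Illustrative model of first-match rule evaluation (not Checkmk code).
# Each rule carries a condition (tags, name prefix) and a value.

def evaluate_rule_chain(rules, host_tags, item):
    """Return the value of the first rule whose condition matches."""
    for rule in rules:
        tags_ok = rule["required_tags"].issubset(host_tags)
        item_ok = item.startswith(rule["item_prefix"])  # matched from the left
        if tags_ok and item_ok:
            return rule["value"]          # first match wins -- evaluation stops
    return {"warn": 80.0, "crit": 90.0}   # built-in default levels

# An "exception" is just an ordinary rule placed earlier in the chain:
rules = [
    {"required_tags": {"oracle", "prod"}, "item_prefix": "/var/ora/db01",
     "value": {"warn": 96.0, "crit": 98.0}},   # the specific exception
    {"required_tags": {"oracle", "prod"}, "item_prefix": "/var/ora",
     "value": {"warn": 90.0, "crit": 95.0}},   # the general rule
]

print(evaluate_rule_chain(rules, {"oracle", "prod"}, "/var/ora/db01"))
print(evaluate_rule_chain(rules, {"oracle", "prod"}, "/var/ora/db02"))
```

Because the specific rule sits above the general one, /var/ora/db01 gets the 96%/98% levels while every other /var/ora file system gets 90%/95%.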
How does that look in practice? The normal method is via the Host & Service Parameters WATO module, which provides you with all known rule chains:
The easiest way to get started here is with the search field. For example, enter tablespace to find all rule chains that have this text in their name or in the (not visible here) description:
The number next to each name (here all 0) shows the number of rules in the respective chain. If you click on the name of a rule chain, you get the detailed view:
The rule chain shown here does not yet contain any rules, but with the Create rule in folder button you can create one. Here you can already define the first part of the rule’s condition: namely, in which WATO folder it should apply. If you change the Main directory setting, e.g., to Windows, the new rule applies only to hosts in or below the Windows folder.
Creating (and of course later editing) a rule brings you to an input form with three boxes: general properties, value and conditions. In the Rule Properties box all information is optional. In addition to the informative texts, you also have the possibility to temporarily disable a rule. That is handy because it saves you from having to delete and later recreate a rule that you only want to suspend temporarily.
What exactly you find in the Value of a rule depends on the rule chain. As you can see here in the example, there can be quite a number of parameters. A typical case is as shown here: each single parameter is activated by a checkbox, and the rule then alters only this parameter. You can leave a parameter to be set by another rule if that simplifies your configuration. In the example, only the thresholds for the percentage of free space in the tablespace are defined:
The field with the conditions looks a bit confusing:
The Condition type allows you to use predefined conditions that are managed via the Predef. Conditions button. This is a feature for ‘Power users’ who use a lot of rules which always have the same conditions. Let’s just leave that on Explicit conditions for now.
You have already defined the Folder when you created it, but you can alter it again here.
The Host tags (host properties) are a very important feature of Checkmk: With this you can simply say that a rule should only apply for production systems. Because the host tags are so important, we’ll dedicate a separate section to them right after this. To add a tag condition, first select a Tag Group in the selection list, followed by a click on Add tag condition.
Explicit hosts allows you to limit a rule to a few specific hosts.
Very important are the Explicit Tablespaces which restrict a rule to very specific services. Two points are important to note for this:
The name of this condition conforms to the rule type. If it were, e.g., Explicit Services, you would specify the names of the affected services — e.g., Tablespace DW20, including the word Tablespace. In the example shown, on the other hand, you only specify the name of the tablespace itself, e.g. DW20.
The texts are always matched starting at the left! The example rule thus also applies to the fictitious tablespace DW20A. If you do not want this, put a $ at the end — e.g. DW20$. The conditions are in fact so-called regular expressions.
The labels, which you can also see in the screenshot, are treated in their own chapter in the manual.
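Python’s re.match mirrors this left-anchored behavior, so you can try out a service condition before saving a rule (a quick sketch using the example names above, not Checkmk code):

```python
import re

# Checkmk service conditions behave like left-anchored regular expressions:
# "DW20" also matches "DW20A", while "DW20$" matches exactly "DW20".
print(bool(re.match("DW20", "DW20A")))    # True  -- matched from the left
print(bool(re.match("DW20$", "DW20A")))   # False -- $ anchors the end
print(bool(re.match("DW20$", "DW20")))    # True
```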
After saving, exactly one rule will be found in the rule chain:
Above we saw an example of a rule that should apply only to production systems. More specifically, we usually have a condition that identifies a production system through a host tag. Why do this instead of simply using folders? Well, you can only define a single folder structure, and each host can only be in a single folder. But there are many very different characteristics that a host may have, and folders are simply not flexible enough to capture them all.
Tags, on the other hand, can be assigned to the hosts completely freely and arbitrarily – no matter in which folder the hosts are located. Rules can then later refer to these tags. This not only makes configuration easier, but also easier to understand and less prone to error than if everything was explicitly set for every host.
But how and where to determine which hosts should have which tags? And how can you define your own tags?
Let’s start with the second question: your own tags. First you have to know that tags are organized in groups — so-called tag groups. Let us take location as an example: a tag group could be called Location, and this group could have the tags Munich, Austin and Singapore. Basically, every host has exactly one tag from each group — as soon as you define your own tag group, each host without exception always has one of the tags from that group. Hosts for which you have not selected a tag from the group are simply assigned the group’s first tag by default.
The definitions of the tag groups can be found in the WATO ⇒ Tags module.
As you can see, some tag groups are already predefined. Most of these you cannot change. We also recommend that the two predefined examples Criticality and Networking Segment are left alone. It is preferable to define your own groups — which is very easy.
Click New tag group, which as expected brings you to a form with multiple fields. In the first field you assign an internal ID — which, as so often in Checkmk, serves as the key and cannot be changed later — and a meaningful Title, which you can customize later. The Topic only serves to organize the overview; if you assign a topic here, the tag group will be displayed in a separate box in the host properties.
The actual tags are entered in the second field — the selection choices for the group. Again you assign an internal ID and a title to each tag:
The IDs must be unique across all groups.
Groups with only one selection are allowed and are even useful. These will appear as checkboxes. Each host then either has the feature or not.
It is best to ignore the Auxiliary Tags.
Once you have saved, you can use the new tag group.
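The effect of tag groups can be modeled in a few lines (the group and tag IDs are hypothetical examples, and this is not Checkmk’s internal data format):

```python
# Hypothetical tag groups: the first entry of each group is the default tag.
tag_groups = {
    "location":    ["munich", "austin", "singapore"],
    "criticality": ["prod", "test"],
}

def effective_tags(host_choices):
    """Every host carries exactly one tag per group; unset groups fall back
    to the group's first tag."""
    return {group: host_choices.get(group, values[0])
            for group, values in tag_groups.items()}

# A host where only the location was chosen explicitly:
print(effective_tags({"location": "austin"}))
```

This is why rules can rely on every host having a value for every tag group, even for hosts created before the group existed.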
You have already seen how tags are assigned to a host: in the Host Properties when creating or editing a host. In the Custom attributes field (or in your own, if you have assigned a topic) the new tag group will appear and there you can make a selection for the host:
As always, you can also set the tag to the folder and overwrite it on individual hosts as needed.
There are many rule chains, and when searching it is not always easy to find the right one. But there is another way: If you have a certain service and want to modify its check parameters, click the menu, and select the Parameters for this service entry:
This takes you to a page where you have access to all of the rule chains for this service:
In the first field titled Check origin and parameters, the second entry (here CPU utilization on Linux/UNIX) takes you directly to the rule chain that sets the thresholds for this service.
Now that you have learned the basic principle of configuring services, in the rest of the chapter we will show you some important things that you should configure in a new Checkmk system in order to reduce false alarms.
The first is custom thresholds for monitoring file systems. By default in Checkmk, the thresholds for used disk space are set to 80% for WARN and 90% for CRIT. Now on a 2 TB drive, 80% usage still means 400 GB available — maybe that is a bit too much buffer. So here are a few tips:
Create your own rules in the Filesystems (used space and growth) chain.
The parameters allow thresholds that depend on the size of the file system. Select Levels for filesystems ⇒ Levels for filesystem used space ⇒ Dynamic levels. With the Add new element button you can now define your own threshold values appropriate to each drive’s capacity.
It is even easier with the Magic Factor, which we will introduce in the Best Practices chapter.
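The idea behind size-dependent (dynamic) levels can be sketched as follows — the breakpoints here are invented examples for illustration, not Checkmk defaults:

```python
# Dynamic filesystem levels: larger volumes get tighter percentage
# thresholds, so the absolute free space left at WARN stays reasonable.
# (Invented example breakpoints, not Checkmk defaults.)
LEVELS = [               # (minimum size in GB, warn %, crit %)
    (1000, 90.0, 95.0),  # volumes of 1 TB and above
    (100,  85.0, 92.0),  # volumes of 100 GB and above
    (0,    80.0, 90.0),  # fallback for small volumes
]

def levels_for(size_gb):
    """Pick the first (largest) size class the volume fits into."""
    for min_size, warn, crit in LEVELS:
        if size_gb >= min_size:
            return warn, crit

print(levels_for(2000))  # 2 TB volume → (90.0, 95.0), i.e. 200 GB free at WARN
print(levels_for(50))    # small volume → (80.0, 90.0)
```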
It is not always a problem when a computer is turned off. A classic case is printers. Monitoring printers with Checkmk makes sense — some users even manage the reordering of toner via Checkmk. However, switching off a printer before closing time is not a problem — it is rather positive in fact. It just makes no sense for Checkmk to raise an alert when the corresponding host goes DOWN.
You can tell Checkmk that it is fine for a host to be turned off. You will find the setting in WATO ⇒ Host & Service parameters under the Host check command rule set. Place a rule there for all printers (depending on your structure, for example via a folder or via a matching host tag), and set its value to Always assume host to be up:
Now all printers are basically displayed as UP — no matter what their real status is.
The printers’ services will still be checked, though, and would get a timeout and thus a CRIT. To avoid this, configure a rule in the Access to Agents ⇒ Check_MK Agent ⇒ Status of the Check_MK services ruleset, in which you set timeouts and connection problems to OK:
If you monitor a switch with Checkmk, you will notice that in the service configuration a service will be created automatically for each port that is UP at the time. This is a sensible default setting for core and distribution switches — i.e., where only infrastructure devices or servers are connected. For switches connected to devices such as workstations or printers, this leads to constant alarms when a port goes DOWN, and conversely to new services constantly being found because a previously unmonitored port is now UP.
Two approaches have become recommended practice here. The first is to restrict monitoring to the uplink ports. Do this by creating a rule in the Disabled services rule set that excludes the other ports from the monitoring.
Much more interesting, however, is the second method. With it you monitor all ports, but allow the DOWN state as a valid state. The advantage: for ports where printers or workstations are attached you also have monitoring of transmission errors, and so can very quickly recognize bad patch leads or errors in auto-negotiation.
To use this second method you need two rules. The first belongs in the Parameters for discovered services ⇒ Discovery — automatic service detection ⇒ Network Interface and Switch Port Discovery chain. This rule determines the conditions under which switch ports are discovered for monitoring. Create a rule for the required switches, and in Network interface port states to discover activate 2 - down alongside 1 - up:
In the service configuration of the switches, the ports with the status DOWN are now also available, and you can add these to the service list. Now before you activate everything you of course need the second rule which ensures that this condition is considered OK. The rule chain is called Network interfaces and switch ports. Activate the Operational state option, uncheck Ignore the operational state, and in Allowed states check the states 1 - up and 2 - down (and possibly other states if needed).
Some servers are restarted at regular intervals — whether to patch, or simply because it is intended. You can avoid false alarms at these times in two ways:
In the Checkmk Raw Edition you first define a Timeperiod covering the times of the reboot. You can find out how to do that in the article on timeperiods. Then place a rule in each of the Notification period for hosts and Notification period for services rule chains for the affected hosts, and there select the previously-defined time period. The second rule is necessary so that services which go to CRIT within this time period trigger no alarm. If problems occur (and then disappear) during these times, again no alarm will be triggered.
In the Checkmk Enterprise Editions there are maintenance times which are automatically repeated on a regular basis, and which you can easily specify for the affected hosts.
Tip: as well as the method using commands that we showed under maintenance times, there is also the Recurring downtimes for hosts rule set. This has the big advantage that hosts which are only added to the monitoring at a later date automatically get these maintenance times as well.
For some services that are simply never reliably OK, it is ultimately better not to monitor them at all. In such cases you could just manually remove the services of the affected hosts from the monitoring in WATO, by setting them back to Undecided or simply leaving them there. This is, however, awkward and prone to errors.
It is much better to define rules according to which certain services should systematically NOT be monitored. For this there is the Disabled services rule set, in which you can, e.g., very easily create a rule excluding file systems with a particular mount point from the monitoring.
Tip: if you deactivate a single service in a host’s service configuration by clicking on , a rule for that host is created automatically in this very rule chain. You can edit this rule by hand and, for example, remove the explicit host name. The affected service will then be disabled for all hosts.
For more information about configuring services read its own article in the reference section.
One reason for sporadic alerts is thresholds on workload metrics — such as CPU utilization, for example — which are only exceeded for a short time. As a rule, such brief spikes are not a problem and should not be raised as alarms by the monitoring system.
For this reason a whole range of check plug-ins include a configuration option for averaging the measured values over a longer time frame before the thresholds are applied. An example is the rule chain for CPU usage on non-Unix systems, named CPU utilization for simple devices. Here is the Averaging for total CPU utilization option:
If you activate this option and enter 15, the CPU utilization will first be averaged over a 15-minute period, and the thresholds will then be applied to this averaged value.
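The effect of averaging before thresholding can be sketched like this (a simplified model of the idea, not the plug-in’s actual code; one reading per minute is assumed):

```python
from collections import deque

# Apply WARN/CRIT to a 15-minute moving average instead of single
# readings, so short spikes do not trigger alarms.
window = deque(maxlen=15)  # one reading per minute → 15-minute window

def check_cpu(value, warn=80.0, crit=90.0):
    window.append(value)
    avg = sum(window) / len(window)
    if avg >= crit:
        return "CRIT"
    if avg >= warn:
        return "WARN"
    return "OK"

# A single 100% spike among otherwise low readings stays OK:
for v in [20, 25, 100, 22, 18]:
    state = check_cpu(v)
print(state)  # → OK (average is 37%, well below the 80% WARN level)
```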
If nothing else helps — and some services occasionally just go into a problem status during an individual check (even if only for a minute) — there is one last method of preventing false alarms. The rule chain for this situation is Maximum number of attempts to verify the service.
Create a rule there and set the value, e.g., to
3, so that for example,
when a service goes from OK to WARN, at first no alarm will be triggered,
and thus no problem is displayed in the Tactical overview at this time.
Only when the status is not OK for three consecutive checks (which is a
total elapsed time of just over two minutes), the problem will be considered
‘hard’ and will then be reported.
Admittedly, that is not an ideal solution, and you should always try to solve the problem at its root, but sometimes things are just as they are, and with the check attempts setting you at least have a viable workaround for such cases.
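The soft/hard state logic can be sketched like this (a simplification of what the monitoring core actually does):

```python
def hard_problems(results, max_attempts=3):
    """Return the problem states that become 'hard' (and thus alerted):
    a non-OK result only counts after max_attempts consecutive checks."""
    alerted = []
    consecutive = 0
    for state in results:
        if state == "OK":
            consecutive = 0          # any OK result resets the counter
        else:
            consecutive += 1
            if consecutive == max_attempts:
                alerted.append(state)
    return alerted

print(hard_problems(["OK", "WARN", "OK", "OK"]))        # brief blip -> []
print(hard_problems(["OK", "WARN", "WARN", "WARN"]))    # -> ['WARN']
```

A single bad check result is only a soft state; it takes three consecutive non-OK results for the problem to be considered hard and reported.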
A data center is constantly changing, and thus the list of monitored services will never stay constant. So that you do not miss anything, Checkmk automatically creates a special service on each host: Check_MK Discovery.
By default, every two hours this checks whether new (not yet monitored) services are found or existing ones have been dropped. If this is the case, the service will go to WARN. You can then open the service configuration in WATO and bring it back up to date.
Tip: Some users save a bookmark for a view that contains all of the discovery services on all hosts which are not in the OK state. These you can then work through regularly — e.g., once a day.
Once you have your monitoring up and running, it is time to familiarize yourself with user management in Checkmk so that it can become useful to others as well. If you only operate the system yourself, working with
cmkadmin is quite sufficient, and you can skip ahead to the next chapter.
But let’s say you have colleagues who should also work with Checkmk.
Why not simply have everyone work as cmkadmin? Well, theoretically
that works, but it does create a number of difficulties. If instead you create
one account per person, you gain several advantages:
Individual users can create their own bookmarks, customize their sidebar, and customize other things for themselves.
Different users may have different permissions.
Users can be responsible only for certain hosts and services, and only need to see these in their monitoring display.
You can delete one user’s account when they leave or change jobs, without affecting anyone else’s account name or password.
As always you will find all of the details about users, rights and roles in its own article.
These last two points need special explanation. Let’s start with permissions — the question of which users are permitted to perform which actions. For this purpose Checkmk uses the usual concept of roles. A role is nothing more than a collection of permissions. Each of the permissions allows a very specific action. For example, there is a Permission to be able to change global settings.
Checkmk is supplied with three basic roles as standard. These are:
Administrator
A user with this role is allowed to do everything. Its main task is the general configuration of Checkmk, rather than day-to-day operation. This of course includes creating users and customizing roles.
Normal monitoring user
This role is intended for ‘normal’ users in day-to-day operations. They may only see those hosts and services for which they are responsible. There is also the possibility of giving this role the right to manage its own hosts in WATO.
Guest user
A guest user is allowed to see everything, but not change anything. This role is, e.g., useful if you want to hang a status monitor on a wall to display an overview of the monitoring. Because a guest user cannot change anything, it is also possible for multiple colleagues to use that account at the same time.
How to customize roles is explained in the detailed user management article.
The second important aspect of users is defining Responsibilities.
Who is in charge of the host
mysrv024, or is responsible for the
tablespace FOO on the host
ora012? Who should see
this in the status interface, and possibly be alerted if there is a problem?
This is performed in Checkmk not via roles, but via Contact Groups. The word ‘contact’ is meant in the sense of an alert: Who should the monitoring system contact when there is a problem?
The basic principle is as follows:
Each user can be a member of any number of contact groups, including none.
Each host and service is a member of at least one contact group.
Here is an example of such an association:
As you can see, both a person and a host (or service) can be a member of several groups. Membership in the groups has the following effects:
A user with the
user role sees precisely those objects in the monitoring system which are in one of their contact groups.
If there is a problem with a host or service, then by default all users who are in at least one of its contact groups are alerted.
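The principle can be illustrated with a small sketch (the user names are invented; the object name follows the ora012 tablespace example from above):

```python
user_groups = {
    "hans": {"windows"},
    "erika": {"oracle", "windows"},
    "guest1": set(),                 # member of no contact group
}
object_groups = {"ora012/Tablespace FOO": {"oracle"}}

def contacts(obj):
    """Users sharing at least one contact group with the object are notified."""
    return sorted(user for user, groups in user_groups.items()
                  if groups & object_groups[obj])

print(contacts("ora012/Tablespace FOO"))   # -> ['erika']
```

Only erika shares a contact group with the tablespace service, so only she sees it with the user role and is alerted by default when it has a problem.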
Important: There is no option in Checkmk to assign a host or service directly to a user. This is deliberate because it leads to problems in practice — for example when a colleague leaves your company.
Creating new contact groups is very easy, and is performed in the Contact groups WATO module. A contact group with the name Everything is already predefined, and is assigned automatically to all hosts and services. This is intended for simple system setups in which there is initially no division of tasks among the administrators (or where you take on everything yourself).
Use New contact group to create a new group. Here,
as always, you need an ID that is used internally as a key, as well as a title
that you can change later. Here in the example you will see a contact group
with the ID
servers, and the title
Windows & Linux Servers:
After you have created the contact groups, you must on the one hand assign hosts and services, and of course on the other hand assign users. The latter is what you do in the properties for the users themselves, which we’ll see right after this.
There are two ways to assign hosts to contact groups – you can also choose both methods at the same time:
Assignment using rules with the Assignment of Hosts to Contact Groups rule set
Assignment via the properties of the hosts or folders in WATO
The rule set that you need for the first method is most easily found with
the Rules button in the Contact groups
module. But as always the search function via Host & service parameters
also helps if you simply search for the rule set’s name.
By the way, even with a fresh Checkmk installation the rule set is not empty. You will find a rule here that assigns all hosts to the above-mentioned Everything group. Simply create new rules of your own here, and choose the group you want to assign to the hosts selected by the rule:
Important: If multiple rules apply to a host, all of the rules will be evaluated, and in this way the host will then receive several contact groups.
The second method for assigning is to use the properties of a host in WATO. The procedure is as follows:
Invoke the host properties in WATO.
In the Basic settings box check the Permissions checkbox.
Select one or more groups in the box Available, and move them to the right with the arrow buttons, into the Selected field.
Enable Add these contact groups to the hosts.
The checkbox Always add host contact groups also to its services is not usually required, because services automatically inherit their host’s contact groups. You will learn more about this later.
Of course, as always, you can also define this host property in the folder. The process is similar, except that this time there are a few extra checkboxes that you can simply leave in their default state.
You only have to assign services to contact groups if these groups differ from those of their host. However, there is an important principle: If a service has been explicitly assigned to at least one contact group, it will inherit no contact groups from the host.
This allows you, for example, to separate server operations teams from application teams. If you put the host itself into the windows contact group, but assign all services with the prefix Oracle to the oracle contact group, the Windows admins will not see the Oracle services, and conversely the Oracle admins receive no details of the operating system’s services — often a very useful separation.
If you do not need this separation, then simply create assignments for the hosts — and you’re done!
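The inheritance principle stated above, as a sketch:

```python
def effective_contact_groups(host_groups, service_groups):
    """An explicit service assignment replaces, rather than extends,
    the contact groups inherited from the host."""
    return service_groups if service_groups else host_groups

print(effective_contact_groups({"windows"}, set()))        # -> {'windows'}
print(effective_contact_groups({"windows"}, {"oracle"}))   # -> {'oracle'}
```

A service with no explicit assignment inherits the host's groups; as soon as it has at least one group of its own, the host's groups no longer apply to it.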
If you nevertheless need an explicit assignment, this is done via the Assignment of services to contact groups rule set. The procedure is analogous to that described above, but as usual you give conditions for the service name.
The administration of users can be found in the WATO Users module:
Do not be surprised if, next to the
cmkadmin entry, there is also an
automation user! This user is intended for requests from processes and
scripts which use the HTTP API, and is provided by the
Checkmk system itself. For details see the reference.
If you have discovered the LDAP Connections button – should your company use Active Directory or another LDAP service — you also have the option of including users and groups from these services. This will be described in its own article.
Create a new user with the New user button. This form is of course almost identical to the one you see when you edit an existing user (the icon next to the user), except that it is not possible to change an existing user’s username:
As always, enter an ID (the username) and a full name in the first field. The Email address and Pager address fields are optional and are used for alerting via email or SMS.
Note: Please do not enter an email address here just yet. First read the notes in the chapter on alerting.
The second field concerns security and permissions:
Leave the setting on Normal user login with password and assign an initial password here. At the bottom you can assign roles to the user. If you assign more than one role, the user simply receives the maximum permissions from these roles (although for the three predefined roles this is not very useful).
In the third field you select the contact groups to which the user should belong. If you select the predefined Everything group, the user becomes responsible for everything, since this group contains every host and service:
By the way: The Personal Settings field contains precisely the
settings — except for the password — which the user can change themselves.
Users with the
guest role cannot change their own settings, so here you have
the possibility of, e.g., setting the language or the User interface theme for them.
In Checkmk Notification means that users are actively notified when the
state of a host or service changes. Let’s say, at some point, on the
mywebsrv17 the service
HTTP foo.bar changes from OK to
CRIT. Checkmk recognizes this and, for example, sends an email with the most
important data for this event to all contact persons for this service. Later
the service again changes its state from CRIT to OK, so the contacts
will receive a new email for the event — this time called Recovery.
But this is just the simplest way of alerting, and there are many possibilities for refining it:
You can alert via SMS, pager, Slack or other Internet services.
You can set alerts to certain time windows (standby).
You can define escalations if the responsible contact does not react quickly enough.
Users can autonomously ‘subscribe’ to or unsubscribe from notifications if you want to allow them.
You can generally use complex rules to specify who should be alerted about what, and when.
However, before you start using notifications, you should be aware of the following:
Notifying is an optional feature. Some organisations have a control desk that is staffed around the clock and which works only with the status view.
Initially enable notifications only for yourself, and make yourself responsible for everything. For a few days or weeks observe how big the volume of alarms is. Tune your monitoring.
Do not enable alerts for your colleagues until you have minimized false positives (false alarms).
The simplest and by far the most common procedure is alerting by e-mail. This is easy to set up, and in an email there is enough ‘space’ to include any graphs of the measured data to be sent.
Before you can alert by email, your Checkmk server needs to be set up for sending mail. For all supported Linux distributions this is done in one of the two following ways:
Install an SMTP server service. This usually takes place automatically when installing the distribution.
Specify a smarthost. Again, you are usually asked this when installing the distribution. The smarthost is a mail server in your company that handles the delivery of emails for Checkmk. Very small companies usually do not have their own smart host. In such cases you use the SMTP server provided by your email provider.
If the mail delivery is set up correctly, you should be able to send an email using the command line — with this command for example:
echo 'Testcontent' | mail -s Test email@example.com
The email should be delivered without delay. If this does not work, you will
find information in the
/var/log directory in the SMTP server’s log
file. More details on setting up mail services on Linux can be found in the
reference section of the manual.
If the sending of email works in principle, then the activation of notifications is very easy — you may already have done it without realising it when creating the users. For a user to receive notifications the following two steps are necessary:
An email address must be entered in the user’s properties.
The user must be responsible for hosts or services (via the appropriate contact groups).
It would be a bit cumbersome to test notifications by waiting for a real problem to occur or even by provoking one. Testing is easier using the Fake check results command. These are found in the same way as the acknowledgements or the maintenance times.
Important: This box is only visible if you have the permission to perform such commands.
It is best to choose a service that is currently OK and set it manually to CRIT. This should immediately trigger an alert. After one minute at the latest — when the next regular check is executed — the service then reverts by itself to OK, and a second alarm should be triggered — the Recovery.
If you do not receive an email, it does not necessarily indicate an error, because there are many situations in which Checkmk notifications are deliberately suppressed:
If a host is DOWN, no alerts will be triggered on its services!
If you have turned off notifications in the Master Control snap-in.
When a service or host is in a maintenance time.
If a service has recently been changing between different states too often, and the service has thus been marked as flapping! This can happen quickly if you constantly change the state using Fake check results!
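Put as a sketch, a notification only goes out if none of these suppression conditions applies:

```python
def should_notify(host_up, master_switch_on, in_downtime, flapping):
    """Combine the four suppression checks from the list above."""
    return host_up and master_switch_on and not in_downtime and not flapping

print(should_notify(True, True, False, False))    # -> True (mail is sent)
print(should_notify(False, True, False, False))   # host DOWN -> False
```

This is of course simplified; the real notification system evaluates further conditions, such as notification rules and time periods.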
You can customize notifications in Checkmk in many different ways, and define very complex rules for who, when and how should be notified. All details can be found in the reference section of the manual.
The notification module in Checkmk is very complex — simply because it covers many very different requirements that have proven to be important in over 10 years of field experience. The question “why has Checkmk not notified here” is thus asked more often by beginners than you may have suspected. This is why you will find some troubleshooting tips here.
If a notification from a particular service has not been triggered, the first step is to look at the History of the service. You will find this if you go to the service’s detail page in the status interface, and click on History. There you will find all events for this service listed chronologically from the newest to the oldest. Here is an example of a service that was trying to trigger an alert, but mail delivery did not work (because no SMTP server is installed):
For more information see the
var/log/notify.log file. You can,
for example, view this file with the
less command, or follow it continuously with the
tail -f command. The latter is useful if you
are only interested in new messages, i.e. those which were created
after entering the
tail command. Do not forget to first switch
to your instance user with `su - `:
root@linux# su - mysite
You can now open the file with
less. If you are not yet familiar with
less: press Shift-G to
jump to the bottom of the file (this is always useful in log files), and
quit less with Q.
Here is a snippet from
notify.log for a successfully-triggered alert:
2019-09-05 10:21:48 Got raw notification (server-linux-3;CPU load) context with 71 variables
2019-09-05 10:21:48 Global rule 'Notify all contacts of a host/service via HTML email'...
2019-09-05 10:21:48  -> matches!
2019-09-05 10:21:48  - adding notification of martin via mail
2019-09-05 10:21:48 Executing 1 notifications:
2019-09-05 10:21:48   * notifying martin via mail, parameters: (no parameters), bulk: no
2019-09-05 10:21:48 Creating spoolfile: /omd/sites/mysite/var/check_mk/notify/spool/cbe1592e-a951-4b70-9bac-0141d3d74986
If you want to go deeper into the subject of notifications, you will find all the relevant details in the reference part of the manual.
With the setting up of notifications you have completed the last step, and your Checkmk system is ready! The possibilities within Checkmk are of course not yet exhausted at this point. There are many more ways to continue expanding your monitoring.
Even if monitoring is ‘only watching’, the subject of IT security is also important. In the reference section you will find a security overview article which will give you tips on how to optimise your system’s security.
If your monitoring has reached an order of magnitude where you are monitoring thousands of hosts, or even more, architecture and tuning issues become interesting. The most important topic here is distributed monitoring. With this you work with multiple Checkmk instances that interconnect into a large system — which may even be distributed globally.
With the availability module, CMK can very precisely calculate the availability of hosts or services in specific time periods, how many failures occurred — and their durations, and much more.
With the SLA module included in the Checkmk Enterprise Editions, Checkmk can verify compliance with service level agreements, and even actively monitor these.
The hardware/software inventory does not really belong to the topic of monitoring, but using the already installed agents Checkmk can provide extensive information on the hardware and software of your monitored systems. This is very helpful for maintenance, license management, or the automatic loading of data into Configuration Management Databases.
So far we have only been monitoring the current states of hosts or services. A completely different topic is the evaluation of spontaneous messages which, e.g. appear in log files, or are sent by syslog or SNMP traps. Checkmk has a complete, integrated system called the Event Console.
With the NagVis add-on integrated in Checkmk you can represent any states with maps or diagrams. This is great for creating appealing overviews — for screens in control rooms for example.
With the Business Intelligence module you can derive and clearly present the overall state of business-critical applications, based on the many individual status values provided by Checkmk.
The reporting module included in the Checkmk Enterprise Editions enables the creation of PDF reports for clearly displaying information on past periods, events, availabilities and much more.
If you monitor many Linux and Windows servers, you can keep your monitoring agents and their configurations at the desired level with the agent-updater contained in the Checkmk Enterprise Editions, from a centralised base.
Checkmk automatically sets up a service on both Linux and Windows which monitors the average CPU usage over the last minute. This of course makes sense, but it fails to recognize a number of problems — for example, when a single process runs amok and permanently loads one CPU core at 100%. For a system with 16 CPU cores a single core contributes only 6.25% to the overall performance, and so in extreme cases like this one a load of only 6.25% is measured — which of course does not lead to an alert.
Checkmk therefore offers the possibility (for both Windows and Linux) to monitor all existing CPU cores individually and determine if any is permanently busy for a long time. Setting up this check has turned out to be a good idea.
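The arithmetic from the 16-core example above, in a few lines (one core running amok, the rest nearly idle; the utilization figures are illustrative):

```python
cores = [100.0] + [2.0] * 15        # one pegged core, 15 nearly idle cores

average = sum(cores) / len(cores)
print(round(average, 2))            # low average despite a pegged core
print(max(cores) >= 90.0)           # a per-core check still catches it -> True
```

The averaged total stays far below any sensible total-utilization threshold, while a check on the individual cores detects the problem immediately.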
To set this up for your Windows servers, add a rule to the CPU utilization for simple devices rule set. This rule set is responsible for the monitoring of all CPUs. There is an option here called Levels over extended periods on a single core CPU utilization. Generally you only need to activate this option:
Define the rule condition so that it only applies to the Windows servers, e.g. through a suitable folder or host tag. This rule does not conflict with other rules in the same rule set that set other options, e.g. the thresholds for total utilization.
The additional check will appear within the existing CPU utilization service.
For Linux servers this function is found in the CPU utilization on Linux/UNIX rule set, where you will find exactly the same option.
Checkmk does not by default monitor services on your Windows servers! Why not? Well, because it is not automatically clear which services are important to you.
If you do not want the bother of manually specifying which services are important for each server, you can also set up a check that simply verifies whether all services with an Automatic startup type are actually running. In addition you can be informed whether manually-started services really have started. This is worth watching, since such services will of course not be running automatically after a reboot.
To do this you first need a rule in the Windows Services rule set, which as always you can find with the search function. The crucial option in this rule is Service states. Activate it and add three elements:
This gives you the following definitions:
A service with startup type Auto that is running is considered OK.
A service with startup type Auto that is not running is considered CRIT.
A service with startup type Demand that is running is considered WARN.
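The three elements then correspond to this mapping (a sketch; the startup types and states are as named above):

```python
# (startup type, currently running?) -> resulting monitoring state
service_states = {
    ("auto", True): "OK",
    ("auto", False): "CRIT",     # should have started automatically, but didn't
    ("demand", True): "WARN",    # manually started; won't survive a reboot
}

print(service_states[("auto", False)])   # a stopped auto-start service -> CRIT
```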
However, this rule only applies to services that are actually being monitored! That is why a second step is now needed: create a new rule in the Windows Service Discovery rule set. This controls which Windows services Checkmk automatically proposes for monitoring.
When you create this rule, first in the Services (Regular Expressions)
field you can enter the regular expression
.* that matches all
services. If you save, and then in WATO switch to the service configuration
for a suitable host, you will find a large number of new services — one
for each Windows service.
To limit the number of monitored services, return to the rule and refine the regular expressions as needed. Note that matching is case-sensitive! Here is an example:
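The case-sensitive matching can be tried out directly with Python's re module (the discovery expressions and service names below are hypothetical examples, not recommendations):

```python
import re

patterns = [r"Win.*", r"wuauserv"]   # expressions as entered in the rule
services = ["WinDefend", "windefend", "wuauserv", "W32Time"]

def discovered(name):
    """Patterns match case-sensitively at the start of the service name."""
    return any(re.match(p, name) for p in patterns)

print([s for s in services if discovered(s)])   # -> ['WinDefend', 'wuauserv']
```

Note how "windefend" is not matched: lowercase "win" does not match the pattern "Win.*".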
If you have already included the services in the monitoring configuration, they will now appear as missing. With the Automatic refresh (tabula rasa) button, you can clear the table and regenerate the whole list.
Of course, your company’s access to the Internet is very important to everyone. The supervision of this is somewhat unusual, since there is not ‘the Internet’, but rather billions of hosts. However, you can still set up monitoring very efficiently according to the following blueprint:
Select multiple Internet ping destinations that should normally be reachable and record their IP addresses.
In WATO create one host called Internet.
Enter one of the IP addresses for this host as an IPv4 address.
Enter the other addresses for the same host under the Network address ⇒ Additional IPv4 addresses option.
Also set Data sources ⇒ Check_MK Agent to No agent.
Create a rule under Active checks (HTTP, TCP, etc.) ⇒ Check hosts with PING (ICMP Echo Request) which only applies to this host.
In this rule activate Service description, and enter
Internet connection in the service name field.
Also enable Alternative address to ping, and select Ping all IPv4 addresses.
Activate Number of positive responses required for OK state and enter 1.
Create another rule – this time under Monitoring Configuration ⇒ Host check command — which also applies only to the host Internet.
In the Host check command field, select Use the status of a service …, and enter the service name
Internet connection which you defined in step 7.
If you now activate the changes, you will receive a new host with the name
Internet with only the
Internet connection service. If
at least one of the ping destinations is reachable the host will have the
status UP and the service will have the status OK. Simultaneously,
from the service you will get the data for the typical round trip time from
each of the ping targets, as well as the packet loss, and thus also get an
indication of the quality of your connection over time:
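The effect of the Number of positive responses required for OK state option boils down to this sketch:

```python
def internet_connection_state(ping_ok, required=1):
    """OK as long as at least `required` ping destinations respond."""
    return "OK" if sum(ping_ok) >= required else "CRIT"

print(internet_connection_state([True, False, False]))    # -> OK
print(internet_connection_state([False, False, False]))   # -> CRIT
```

With the value 1, the service only goes critical when every single ping destination is unreachable, which is a far better indicator of a broken Internet connection than the failure of any one target.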
Steps 10 and 11 are necessary so that the host does not go to the DOWN state if only the first of the IP addresses cannot be reached: with this host check command the host always takes the status of its only service.
Important: Because a service is generally not alerted when its host is DOWN, it is important that you make the notification relate to the host, not to the service. In addition you should use a notification method that does not require an Internet connection!
Let’s say you want to check the accessibility of a website or web service. The normal Checkmk agent does not provide a solution because it does not display this information — and you may also not have the possibility of installing the agent on the server.
The solution for this is a so-called active check. This is one that is not performed by an agent, rather by contacting a network protocol directly at the destination host — in this case HTTP(S). The procedure is as follows:
Create the destination server as a host in WATO. Let’s give it the name tribe29.com.
In Data sources ⇒ Check_MK Agent, select No agent and save it without service detection.
Now create a rule in the Active Checks (HTTP, TCP, etc.) ⇒ Check HTTP service rule set for this host (e.g. with Explicit hosts or an appropriate host tag).
In the Check HTTP service box you will find many options for how to perform the check. More on this later.
Save the rule and activate the changes. Now you will get a new host with a service that checks access via HTTP(S).
The options for this rule include the following:
In Virtual host you may be required to specify a domain of the server if it hosts more than one domain.
The Use SSL/HTTPS for the connection option allows monitoring of HTTPS.
Expected response time allows you to set the service to WARN or even CRIT if the response time is too slow.
The Fixed string to expect in the content option allows you to check the response for a specific text in the delivered page. You should always check for a relevant part of the content, so that a simple error message from the server is not mistakenly counted as a positive response.
By the way, you can of course also perform the HTTP check on a host that is already being monitored by a Checkmk agent. In this case creating the host is omitted and you just need the correct rule.
Finding good thresholds for monitoring file systems can be a bit tedious and require a lot of rules. A threshold of 90% is much too low for a very large drive, and it may be too high for a small drive. In addition to the method mentioned in the chapter about tuning, there is another more practical way to define thresholds depending on the size of the drive: the Magic factor. It works like this:
In the Filesystems (used space and growth) rule set, you apply only one rule, with thresholds of 80% and 90% respectively.
In the same rule enable Magic factor (automatic level adaptation for large filesystems), and enter 0.8.
Also enable Reference size for magic factor and enter 100 GB as the size.
If you now enable the rule, you will get thresholds that automatically depend on the size of the file system:
File systems that are exactly 100 GB receive the thresholds 80%/90%.
File systems that are larger than 100 GB get higher thresholds which are closer to 100%.
File systems that are smaller than 100 GB get lower thresholds — i.e. ones below 80%/90%.
How high the thresholds exactly are is — well, magical! The factor (here 0.8) determines how strongly the values can be adjusted. A factor of 1.0 does not change anything, and all drives get the same values. Smaller values bend the thresholds more. Which exact thresholds apply can easily be seen in each service’s status text:
The following table shows some examples of the resulting thresholds for a reference size of 100 GB:
| Disk capacity | mf = 1.0 | mf = 0.9 | mf = 0.8 | mf = 0.7 | mf = 0.6 | mf = 0.5 |
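If you are curious about the calculation itself, the following sketch shows the commonly described formula behind the magic factor. Note that this is an assumption-laden simplification: Checkmk additionally clamps extreme results, which is omitted here.

```python
def magic_level(level, magic, size_gb, reference_gb=100.0):
    """Adjust a percentage threshold depending on file system size
    (simplified sketch; Checkmk's clamping of extreme values is omitted)."""
    relative = size_gb / reference_gb
    scale = relative ** magic / relative   # < 1 for large, > 1 for small file systems
    return 100.0 - (100.0 - level) * scale

print(round(magic_level(80, 0.8, 100), 1))    # reference size: unchanged
print(round(magic_level(80, 0.8, 1000), 1))   # larger file system: closer to 100%
print(round(magic_level(80, 1.0, 1000), 1))   # factor 1.0 changes nothing
```

At exactly the reference size the thresholds stay at 80%/90%; larger file systems are pushed towards 100%, smaller ones below the configured values, just as described above.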