1. Introduction
Probably not everybody has the same understanding of the term ‘Distributed Monitoring’. In fact monitoring is always distributed over multiple computers, unless the monitoring system is only monitoring itself – which wouldn’t be very useful.
In this handbook we therefore speak of distributed monitoring whenever the monitoring system as a whole consists of more than one Checkmk instance. There are a number of good reasons for splitting monitoring over multiple instances:
Performance: The processor load should, or must, be shared over multiple machines.
Organisation: Various different groups should be able to administer their own instances independently.
Availability: The monitoring at one location should function independently of other locations.
Security: Data streams between two security domains should be separately and precisely controlled (DMZ, etc.)
Network: Locations that have only narrow-band or unreliable connections cannot be monitored reliably from a remote site.
Checkmk supports several methods for implementing distributed monitoring. Some of these are available because Checkmk is largely compatible with, or based on, Nagios (if Nagios has been installed as the core). These include the old NSCA method and the somewhat more modern mod_gearman. Compared to Checkmk’s own methods they offer no advantages and are also more cumbersome to implement. For these reasons we don’t recommend them.
The procedure preferred by Checkmk is based on Livestatus and a distribution of the configuration using WATO. For situations with strictly separated networks, or even a strict one-way data transfer from the periphery to the centre, there is a method using Livedump or CMCDump. Both methods can be combined.
2. Distributed monitoring with Livestatus
2.1. Basic principle
Central status
Livestatus is an interface integrated into the monitoring core which enables external programs to query status data and execute commands. Livestatus can be made available over the network so that it can be accessed by a remote Checkmk instance. Checkmk’s user interface uses Livestatus to combine all connected instances into a general overview. This then feels like one ‘large’ monitoring system.
The following diagram schematically shows the structure of a monitoring setup with Livestatus distributed over three locations. The Checkmk instance Central Site is located at the central processing site and directly monitors the central systems. In addition there are the instances Remote Site 1 and Remote Site 2, which are located in other networks and monitor their local systems:

What makes this method special is that the monitoring status of the remote instances is not sent continuously to the central site. The GUI only retrieves data live from the remote instances when a user in the control centre requests it. The data is then compiled into a centralised view. There is thus no central data storage, which is a huge advantage for scaling up!
Here are some of the advantages of this method:
Scalability: The monitoring itself generates no network traffic at all between central and remote site. In this way hundreds of locations, or more, can be connected.
Reliability: If a network connection to a remote instance fails the local monitoring nonetheless continues operating normally. There is no ‘hole’ in the data recording and also no data ‘jam’. A local notification will still function.
Simplicity: Instances can be very easily incorporated or removed.
Flexibility: The remote instances are still self-contained and can be used for operations at their respective locations. This is particularly interesting if the ‘location’ should never be permitted to access the rest of the monitoring.
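The on-demand principle described above can be illustrated with a small sketch. It is not Checkmk’s implementation: the site IDs and the `fetch` callables are hypothetical stand-ins for one Livestatus query per site. The sketch only shows the idea that data is fetched when a view is requested, merged into one result, and that an unreachable site is simply missing from the view.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_status(sites):
    """Query every site on demand and merge the rows into one view.

    `sites` maps a site ID to a callable returning a list of rows.
    Both are hypothetical stand-ins for a Livestatus query per site.
    """
    rows, unreachable = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        # Start all queries at once, then collect them in a stable order.
        futures = {site_id: pool.submit(fetch) for site_id, fetch in sites.items()}
        for site_id, future in futures.items():
            try:
                rows.extend((site_id, row) for row in future.result())
            except OSError:
                # An unreachable site is missing from the view only;
                # its local monitoring keeps running independently.
                unreachable.append(site_id)
    return rows, unreachable
```

Each row is tagged with the site it came from, which mirrors the way views in a distributed setup can show the monitoring instance of a host.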
Centralized configuration
In a system distributed using Livestatus as described above, it is quite possible that the individual instances can be independently maintained by different teams, and the central site only has the task of providing a centralised dashboard.
If multiple or all instances need to be administered by the same team, a central configuration is much easier to handle. Checkmk supports this and refers to such a configuration as a ‘distributed WATO’. Here all hosts and services, users and permissions, time periods, notifications, etc. are maintained on the central instance using WATO and then, depending on their tasks, automatically distributed to the remote instances.
Such a system not only has a common status overview but also a common configuration, and effectively ‘feels like a large system’.
2.2. Installing a distributed monitoring
Installing a distributed monitoring using Livestatus/distributed WATO is achieved in the following steps:
First install the central instance as is usually done for a single instance
Install the remote instances, and enable Livestatus via the network
Integrate the remote instances into the central instance using the Distributed monitoring WATO module
For the hosts and services, specify from which instance they are to be monitored
Execute a service discovery for the migrated hosts, and then activate the changes
Installing a central instance
No special requirements are placed on the central instance. This means that a long-established instance can be expanded into a distributed monitoring without requiring additional modifications.
Installing remote instances and enabling livestatus via the network
The remote instances are then created as new instances in the usual way with omd create. This naturally takes place on the (remote) server intended for the respective remote instance.
Special notes:
For the remote instances, use IDs unique to your distributed monitoring.
The remote’s Checkmk version is permitted to diverge from the central instance’s version by a maximum of one patch level (denoted by the numeral following the ‘p’ for stable versions). Other versions may be compatible, but not necessarily. Information on the Checkmk version-numbering system can be found in its own article.
In the same way as Checkmk supports multiple instances on a server, remote instances can also run on the same server.
Here is an example for creating a remote instance with the name remote1:
root@linux# omd create remote1
Adding /opt/omd/sites/remote1/tmp to /etc/fstab.
Creating temporary filesystem /omd/sites/remote1/tmp...OK
Updating core configuration...
Generating configuration for core (type cmc)...Creating helper config...OK
OK
Restarting Apache...OK
Created new site remote1 with version 1.6.0.
The site can be started with omd start remote1.
The default web UI is available at http://myServer/remote1/
The admin user for the web applications is cmkadmin with password: lEnM8dUV
For command line administration of the site, log in with 'omd su remote1'.
After logging in, you can change the password for cmkadmin with 'htpasswd etc/htpasswd cmkadmin'.
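The patch-level rule from the special notes above can be checked before connecting a site. The following is a minimal sketch, assuming stable version strings of the form shown in this article (such as 1.6.0 or 1.6.0p3) and treating a version without a ‘p’ suffix as patch level 0. The function names are hypothetical and not part of Checkmk.

```python
import re

# Matches stable version strings such as "1.6.0" or "1.6.0p3" only.
_STABLE = re.compile(r"^(\d+\.\d+\.\d+)(?:p(\d+))?$")


def split_version(version):
    """Return (base, patch_level) for a stable version string."""
    match = _STABLE.match(version)
    if match is None:
        raise ValueError(f"not a stable version string: {version!r}")
    # Assumption: no 'p' suffix means patch level 0.
    return match.group(1), int(match.group(2) or 0)


def within_one_patch_level(central, remote):
    """True if both versions share a base and differ by at most one patch level."""
    central_base, central_patch = split_version(central)
    remote_base, remote_patch = split_version(remote)
    return central_base == remote_base and abs(central_patch - remote_patch) <= 1
```

A result of False does not prove that two versions are incompatible. It only means that the combination lies outside the range described above.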
The most important step is now to enable Livestatus via TCP on the network. Please note that Livestatus is not per se a secure protocol and should only be used within a secure network (secured LAN, VPN, etc.). Livestatus is enabled with omd config, run as the instance user while the site is stopped:
root@linux# su - remote1
OMD[remote1]:~$ omd config
Now select Distributed Monitoring:

Set LIVESTATUS_TCP to ‘on’ and enter an available port number for LIVESTATUS_TCP_PORT that is unique on this server. The default is 6557:

After saving, start the instance as normal with omd start:
OMD[remote1]:~$ omd start
Starting mkeventd...OK
Starting Livestatus Proxy-Daemon...OK
Starting rrdcached...OK
Starting Check_MK Micro Core...OK
Starting dedicated Apache for site remote1...OK
Starting xinetd...OK
Initializing Crontab...OK
Make a note of the password for cmkadmin. Once the remote has been subordinated to the central instance, all users will be replaced by those from the central instance.
The remote is now ready. Verify this with netstat, which should show that port 6557 is open. Connections to this port are accepted by an instance of the auxiliary daemon xinetd, which runs directly in the instance:
root@linux# netstat -lnp | grep 6557
tcp 0 0 0.0.0.0:6557 0.0.0.0:* LISTEN 10719/xinetd
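Beyond checking the open port, the connection can also be tested from the central server by sending a Livestatus query. The following is a minimal sketch, assuming an unencrypted Livestatus TCP connection (with TLS enabled the socket would additionally have to be wrapped in TLS). The function names are hypothetical, and 10.1.1.133 is only a placeholder address.

```python
import socket


def build_query(table, columns):
    """Build a Livestatus query; a blank line terminates the request."""
    return f"GET {table}\nColumns: {' '.join(columns)}\n\n"


def parse_rows(payload):
    """Split the default output: one row per line, ';' between columns."""
    return [line.split(";") for line in payload.splitlines() if line]


def query_livestatus(host, port, query, timeout=5.0):
    """Send one query over a plain TCP connection and return the rows."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_rows(b"".join(chunks).decode("utf-8"))


# Example call (placeholder address of the remote server):
# query_livestatus("10.1.1.133", 6557, build_query("status", ["program_version"]))
```

If the call returns a row with a version string, the remote instance answers Livestatus queries over the network.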
Assigning remote instances to the central instance
The configuration of the distributed monitoring takes place exclusively on the central instance. The required WATO module is Distributed monitoring, which serves to manage the connections to the individual instances. For this function the central instance itself counts as an instance and is already present in the list:

Now define the connection to the first remote instance by creating a new connection:

In the Basic settings it is important to use the remote instance’s EXACT name, as defined with omd create, as the Site-ID. As always, the alias can be defined as desired and can also be changed later.

The Livestatus settings determine how the central instance queries the status of the remote instances via Livestatus. The example in the screenshot shows a connection with the Connect via TCP method. This is optimal for stable connections with short latencies (e.g. in a LAN). We will discuss the optimal settings for WAN connections later.
The URL prefix is required for integrating other applications (e.g. PNP4Nagios). We will come to this subject separately later. Enter the HTTP URL of the remote’s web interface here (only the part preceding the check_mk/ component). If you generally access Checkmk via HTTPS, then substitute the http here with https. Further information can be found in the online help or in the corresponding article on HTTPS with Checkmk.

The use of Distributed WATO is, as we discussed in the introduction, optional. Activate this if you wish to configure the remote with and from the central instance. In such a case select the exact settings as shown in the image above.
A correct setting for the Multisite-URL of the remote site is very important. The URL must always end with /check_mk/. A connection with HTTPS is recommended, provided that the remote instance’s Apache supports HTTPS. This must be set up manually on the remote at the Linux level. For the Checkmk Appliance, HTTPS can be set up using the web-based configuration interface. If you use a self-signed certificate, you will need to activate the Ignore SSL certificate errors check box.
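The two URL rules (the Multisite-URL ends with /check_mk/, and the URL prefix is the part preceding check_mk/) can be expressed as a small check. This is only an illustrative sketch with hypothetical function names. The example URL is taken from the omd create output shown earlier.

```python
def check_multisite_url(url):
    """Return the URL unchanged if it has the form required for the Multisite-URL."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("URL must start with http:// or https://")
    if not url.endswith("/check_mk/"):
        raise ValueError("URL must end with /check_mk/")
    return url


def url_prefix(multisite_url):
    """Derive the URL prefix: everything preceding the check_mk/ component."""
    return check_multisite_url(multisite_url)[: -len("check_mk/")]


# Example: url_prefix("http://myServer/remote1/check_mk/")
# gives "http://myServer/remote1/"
```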
Once the form has been saved a second instance will appear in the overview:

The monitoring status of the (so far) empty remote instance is now correctly integrated. A login to the remote’s WATO is still required for the distributed WATO. For this, the central instance exchanges a randomly generated password with the remote instance via HTTP, through which all future communication will take place. The cmkadmin access on the remote instance will subsequently no longer be used. To log in, use the cmkadmin account and the corresponding password of the remote instance:

A successful login is confirmed as follows:

Should an error occur with the login, this could be due to a number of reasons – for example:
The remote instance is currently stopped.
The Multisite-URL of the remote site has not been correctly set up.
The remote is not reachable from the central instance under the host name specified in the URL.
The Checkmk versions of the central and the remote instance are (too) incompatible.
An invalid user ID and/or password have been entered.
The first two points can be easily tested by manually calling the remote’s URL in your browser.
When everything has been successful, run Activate Changes. As always, this will bring you to an overview of the changes that have not yet been activated. It also shows the states of the Livestatus connections and the WATO synchronisation states of the individual instances:

The Version column shows the Livestatus version of the respective site. When using the CMC as Checkmk’s core (Enterprise Editions), the core’s version number (shown in the ‘Core’ column) is identical to that of Livestatus. If you are using Nagios as the core (Checkmk Raw Edition), the Nagios version number will be seen here.
The following symbols show WATO’s replication status:
This instance has outstanding changes. The configuration matches the central instance, but not all changes have been activated. With the Restart button a targeted activation for this instance can be performed.
The WATO configuration for this instance is not synchronous and must be carried over. A restart will then of course be necessary to activate it. Both functions can be performed with the Sync & Restart button.
In the Status column the state of the Livestatus connection for the respective instance can be seen. This is shown purely for information since the configuration is not transmitted via Livestatus, but rather over HTTP. The following values are possible:
The instance is reachable via Livestatus.
The instance is currently not reachable. Livestatus queries run into a timeout. This delays the page loading. Status data for this instance is not visible in the GUI.
The instance is currently not reachable, but this is known through a configured status host or through the Livestatus proxy (see below). The inaccessibility does not lead to timeouts. Status data for this instance is not visible in the GUI.
The Livestatus connection to this instance has been temporarily deactivated by the (central instance’s) administrator. The setting matches the ‘Temporarily disable this connection’ check box in the settings for this connection.
Clicking on the activation button will now synchronise all instances and activate the changes. This is performed in parallel, so that the overall time equates to the time required by the slowest instance. This time includes the creation of a configuration snapshot for the respective instance, the transmission over HTTP, the unpacking of the snapshot on the remote instance, and the activation of the changes.
Important: Do not leave the page before the synchronisation has been completed on all instances – leaving the page will interrupt the synchronisation.
Specifying to the hosts and folders which instance should monitor them
Once your distributed environment has been installed you can begin to use it. You actually only need to tell each host by which instance it should be monitored. The central instance is specified by default.
The required attribute for this is ‘Monitored on site’. You can set this individually for each host. This can naturally also be performed at the folder level:

Executing a fresh service discovery and activating changes for migrated hosts
Adding hosts works as usual. Apart from the fact that the monitoring as well as the service discovery will be run from the respective remote instance, there are no special considerations.
When migrating hosts from one instance to another there are a couple of points to be aware of. Neither current nor historic status data from the host will be carried over. Only the host’s configuration is retained in the WATO. In effect it is as if the host has been removed from one instance and freshly-installed on the other instance:
Automatically discovered services will not be migrated. Run a Service discovery after the migration.
Once restarted, hosts and services will show PEND. Currently existing problems may as a result be newly-notified.
Historic graphing will be lost. This can be avoided by manually moving the relevant RRD-files. The location of the files can be found in Files and directories.
Data for availability and from historic events will be lost. These are unfortunately not easy to migrate as the data consists of single lines in the monitoring log.
If the continuity of the history is important to you, when implementing the monitoring you should carefully plan which host is to be monitored, and from where.
2.3. Connecting Livestatus with encryption
From version 1.6.0 Livestatus connections between the central instance and a remote can be encrypted. For newly created instances nothing further needs to be done, as Checkmk takes care of the necessary steps automatically. As soon as you use omd config to activate Livestatus, encryption with TLS is also activated automatically:

The configuration of distributed monitoring therefore remains as simple as it has been up to now. For new connections to other instances the option Encryption is then automatically enabled.
After you add the remote instance, you will notice two things. Firstly, the connection is marked as encrypted by a new icon. Secondly, Checkmk will tell you that the remote instance’s CA is not trusted. Click on the corresponding icon to get to the details of the certificates used. From there you can conveniently add the CA via the web interface. Both certificates will then be listed as trusted:

Details of the technologies used
To achieve the encryption Checkmk uses the stunnel program along with its own certificate and its own Certificate Authority (CA) to sign the certificate. These are generated individually and automatically with each new instance, so they are not predefined, static CAs or certificates. This is a very important safety factor that prevents attackers from using fake certificates, which they could create if they had access to a publicly available CA.
The generated certificates also have the following properties:
Both certificates are in the PEM format. The signed certificates for the instance also contain the complete certificate chain.
The keys use 2048-bit RSA, and the certificate is signed using SHA512.
The instance’s certificate is valid for 999 years.
The fact that the standard certificate is valid for so long very effectively prevents you from getting connection problems that you cannot classify. At the same time it is of course possible that once a certificate has been compromised it is accordingly long open to abuse. So if you fear that an attacker will gain access to the CA or to the instance certificate signed with it, always replace both certificates (CA and instance)!
Using your own certificates
In larger environments you might in any case want to use your own certificates. To replace the supplied ones, simply substitute the instance certificate with your own, and make sure that the CA which has signed the new certificate is also trusted.
Migrating from older versions
For compatibility reasons the LIVESTATUS_TCP_TLS option is not activated automatically after an update from an older version to 1.6.0, since once it is enabled the connection can only be used with encryption. To make use of the new feature in your monitoring instances after the update, stop the instance and activate the option:
OMD[mysite]:~$ omd config set LIVESTATUS_TCP_TLS on
Since the certificates were generated automatically during the update, the instance then immediately uses the new encryption feature. So that you can still access the instance from the central instance, in the second step activate the Encryption option in the Instance Connection Properties under WATO > Distributed Monitoring:

The last step is as described above – again here you first have to mark the CA of the remote instance as trusted.
2.4. Special features of a distributed setup
A distributed monitoring operates via livestatus much like a single system, but it does have a couple of special characteristics:
Access to the monitored hosts
All accesses of a monitored host are consistently carried out from the instance to which the host is assigned. This applies not only to the actual monitoring, but also to the service discovery, the Diagnostics page, the Notifications, Alert handlers and everything else. This point is very important as it is not assumed that the central instance actually has access to this host.
Specifying the instance in views
Some of the standard views are grouped according to the instance from which the host is monitored. This applies, for example, to All hosts:

The instance will likewise be shown in the host’s or service’s details:

This information is generally available for use in a column when creating your own views. There is also a filter with which a view of hosts on a specific site can be filtered:

Site status element
There is a Site status snap-in which you can add to your sidebar. It displays the status of the individual instances and also provides the option of clicking on the status to temporarily hide or show individual sites. Hidden sites are flagged with a corresponding status. In this way you can also disable an instance that is generating timeouts, thus avoiding superfluous timeouts:

This is not the same as disabling the livestatus connection using the connection configuration in WATO. Here the ‘disabling’ only affects the currently logged-in user and has a purely visual function. Clicking on an instance’s name will display a view of all of its hosts.
The central instance Control element
In a distributed monitoring the central instance control element has a different appearance. Each instance has its own global switch:

Checkmk Cluster hosts
If you monitor with Checkmk HA-Cluster, the cluster’s individual nodes must be assigned to the same instance as the cluster itself. This is because determining the clustered services’ status accesses cache files generated through monitoring the node. This data is located locally on the respective instance.
Piggyback data (e.g., ESX)
Some check plug-ins use ‘Piggyback’ data, for example, for allocating monitoring data retrieved from an ESX-host to the individual virtual machines. For the same reason as with cluster monitoring, in distributed monitoring the ‘piggy’ (carrying) host as well as its dependent hosts must be monitored from the same instance. In the case of ESX this means that the virtual machines must be assigned to the same site in Checkmk as the ESX-System from which the monitoring data is collected. This can mean that it is better to poll the ESX-host system directly rather than to poll a global vCenter. Details for this can be found in the documentation on ESX-monitoring.
Hardware/Software inventory
The Checkmk Hardware/Software inventory also functions in distributed environments. For this, the inventory data from the var/check_mk/inventory directory must be regularly transmitted from the remotes to the central instance. For performance reasons the user interface always accesses this directory locally.
In the Checkmk Enterprise Editions the synchronisation is carried out automatically on all sites that are connected using the Livestatus proxy.
If you run inventories using the Checkmk Raw Edition in distributed systems, the directory must be regularly mirrored to the central instance with your own tools (e.g., with rsync).
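For the Raw Edition case, the mirroring could be scripted roughly as follows. This is only a sketch under several assumptions that do not come from this article: the central server can reach the remote via SSH as the site user, the sites live under /omd/sites/ (as in the omd create output shown earlier), and rsync is installed on both ends. The function names are hypothetical.

```python
import subprocess

INVENTORY_DIR = "var/check_mk/inventory"


def build_rsync_command(remote_host, remote_site, central_site):
    """Build an rsync call that mirrors one remote's inventory to the central instance."""
    source = f"{remote_site}@{remote_host}:/omd/sites/{remote_site}/{INVENTORY_DIR}/"
    target = f"/omd/sites/{central_site}/{INVENTORY_DIR}/"
    # -a preserves timestamps and permissions. No --delete is used, so
    # inventory files from other sites in the target stay untouched.
    return ["rsync", "-a", source, target]


def mirror_inventory(remote_host, remote_site, central_site):
    """Run the rsync call; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(remote_host, remote_site, central_site), check=True)


# Example call, e.g. from a cron job (placeholder address and site IDs):
# mirror_inventory("10.1.1.133", "remote1", "central")
```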
Changing a password
Even when all instances are being centrally monitored, a login on an individual instance’s interface is quite possible and often also appropriate. For this reason WATO ensures that a user’s password is always the same for all sites.
A password change made by the administrator will take effect automatically as soon as it is shared to all instances with Activate Changes.
A change made by users themselves in their personal settings via the sidebar works somewhat differently. It cannot trigger an Activate changes, since the user of course has no general authority for this function. In such a case WATO will automatically share the changed password across all instances directly after it has been saved.

As we all know, networks are never 100% available. If an instance is unreachable at the time of a password change, it will not receive the new password. Until the administrator successfully runs an Activate changes, or respectively, the next successful password change, this instance will retain the old password for the user. A status symbol will inform the user of the status of the password sharing to the individual instances.
2.5. Tethering existing instances
As mentioned above, existing instances can also be retrospectively connected to a distributed monitoring. As long as the preconditions described above have been satisfied (compatible Checkmk versions), this is done exactly as for setting up a new remote instance. Enable Livestatus via TCP, then add the instance in the Distributed monitoring module, and you’re done!
The second stage – the changeover to a centralised configuration – is somewhat trickier. Before integrating the instance into the distributed WATO as described above, you should be aware that in doing so the instance’s entire local configuration will be overwritten!
Should you wish to take over existing hosts, and possibly rules as well, three steps will be required:
Match the host tags’ scheme
Copy the WATO-directories
Edit the characteristics in the parent folder
1. Host tags
It is self-evident that the host tags used in the remote must also be known to the central instance in order that they can be carried over. Check these before the migration and add any missing tags to the central instance manually. Here it is essential that the Tag-IDs match – the tag’s title is irrelevant.
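Comparing the tag IDs amounts to a set difference. A minimal sketch, in which the tag IDs are invented examples and the way the two lists are obtained is left open:

```python
def missing_tag_ids(remote_tag_ids, central_tag_ids):
    """Return the tag IDs used on the remote that the central instance lacks."""
    # Only the IDs are compared; tag titles are irrelevant for the migration.
    return sorted(set(remote_tag_ids) - set(central_tag_ids))


# Example with invented IDs:
# missing_tag_ids(["prod", "dmz", "snmp"], ["prod", "snmp", "test"])
# gives ["dmz"]
```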
2. WATO-directories
Next, move the hosts and rules into the central WATO on the central instance. This only works for hosts and rules in sub-directories (i.e., not in the ‘Main directory’ ). Hosts in the main directory should first simply be moved into a remote instance’s sub-directory using WATO.
The actual migration can then be achieved quite simply by copying the appropriate directories. Each host directory in WATO corresponds to a directory within etc/check_mk/conf.d/wato/. These can be copied using a tool of your choice (e.g. scp) from the connected site to the same location in the central instance. If a directory with the same name already exists there, simply rename it. Please make sure that the copied files are owned by the Linux user and group of the central instance.
After copying, the hosts should appear in the central instance’s WATO, as well as the rules you have created in these folders. The folders’ properties are also included in the copy. These can be found in each folder in the hidden .wato file.
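The copy step, including the renaming on name collisions, could be sketched as follows. The sketch assumes that the remote’s etc/check_mk/conf.d/wato/ directory has already been transferred to a temporary location on the central server (for example with scp). Renaming the incoming folder with a numeric suffix is a choice made for this sketch, and the function names are hypothetical.

```python
import shutil
from pathlib import Path


def free_name(target_root, name):
    """Return `name`, or `name_2`, `name_3`, ... if that folder already exists."""
    candidate, counter = name, 2
    while (Path(target_root) / candidate).exists():
        candidate = f"{name}_{counter}"
        counter += 1
    return candidate


def copy_wato_folders(source_root, target_root):
    """Copy every sub-directory of source_root, renaming on name collisions."""
    copied = {}
    for folder in sorted(p for p in Path(source_root).iterdir() if p.is_dir()):
        new_name = free_name(target_root, folder.name)
        # copytree also copies hidden files such as .wato
        shutil.copytree(folder, Path(target_root) / new_name)
        copied[folder.name] = new_name
    return copied
```

Files that lie directly in the source directory are skipped, which matches the restriction that only sub-directories can be migrated. Running the copy as the central instance’s user gives the new files the ownership of that instance.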
3. One-time editing and saving
So that the attributes of the central instance’s parent folder are correctly inherited, as a final step following the migration the parent folder’s properties must be opened and saved once. The hosts’ attributes will thereby be freshly computed.
2.6. Instance-specific global settings
A centralised configuration over WATO means that first and foremost, all instances have a common and (apart from the hosts) the same configuration. What is the situation however, when individual instances require different global settings? An example could be the CMC setting Maximum concurrent Checkmk checks. It could be that a customised setting is required for a particularly small or a particularly large instance.
For such cases there are instance-specific global settings. These are reached via the corresponding symbol in the Distributed monitoring WATO module:

Via this symbol you will find a selection of all global settings – although anything you define here will only be effective for the chosen instance. A value that diverges from the standard will be visually-highlighted, and it will apply only to this instance:

Note: Site-specific settings for the central instance are only indirectly possible, since it is of course the central instance that predefines the configuration. In a situation where ONLY the central instance’s settings diverge, it will be necessary to make site-specific settings for every other site to ‘RETURN’ them to the ‘default’.
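The relationship between global and site-specific settings can be pictured as a simple overlay. The following sketch is purely illustrative: the dictionaries and the setting name max_concurrent_checks are hypothetical and do not reflect how Checkmk stores its configuration.

```python
def effective_settings(global_settings, site_specific, site_id):
    """Global settings overlaid with the site-specific values of one site."""
    return {**global_settings, **site_specific.get(site_id, {})}


def overrides_for_central_only(sites, central_id, setting, default_value):
    """Site-specific entries that return every non-central site to the default."""
    return {site: {setting: default_value} for site in sites if site != central_id}


# Example: the central instance should use 200, all others the default 20.
# The global value is set to 200, and every other site is reset:
# overrides_for_central_only(["central", "remote1", "remote2"],
#                            "central", "max_concurrent_checks", 20)
```

The second function mirrors the note above: a value meant only for the central instance is set globally, and each remaining site gets an explicit site-specific value.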
2.7. Distributed event console
The Event Console processes syslog-messages, SNMP traps and other types of events of an asynchronous nature.
Up to version 1.2.8, the recommended procedure in a distributed environment was to operate only a single instance of the Event Console, located in the central instance, and to direct the events of all hosts there.
This setup has the disadvantage that the hosts’ events must be sent to another instance, rather than to the instance which is actively monitoring them. A consequence of this is that when generating notifications from the Event Console, the host information is incomplete since the hosts are not known to the local Checkmk. On the one hand, this applies to the detection of the hosts’ contact groups, and on the other hand to events in which the originating host is identified only by its IP address and a real host name is absent. In such a case notification rules containing conditions linked to the host names cannot function.
From version 1.4.0i1 Checkmk also provides the option of running a distributed Event Console. Every instance then runs its own event processing, which captures the events from all of the hosts being monitored from that instance. The events are thus not sent to the central system; they remain on the instances and are only retrieved centrally. This is done via Livestatus, in a similar way to the active states, and works with both the Checkmk Raw Edition and the Checkmk Enterprise Editions.
Converting to a distributed Event Console according to the new scheme requires the following steps:
In the connection settings, for WATO-Replication activate the EC (Replicate Event Console configuration to this site) option
Switch the Syslog location and SNMP-Trap-destinations for the affected hosts to the remote instance. This is the most laborious task.
If you use the Check event state in Event Console rule set, switch this back to Connect to the local Event Console.
If you use the Logwatch Event Console Forwarding rule set, switch this likewise to the local Event Console.
In the Event Console Settings, switch the Access to event status via TCP back to no access via TCP.
2.8. PNP4Nagios
In the Checkmk Raw Edition the open-source project PNP4Nagios is used for displaying performance values graphically. It has its own web interface, which is integrated in Checkmk. In some places single graphs are embedded, and in others a complete page including its own navigation is provided:

In distributed monitoring the performance databases (Round-Robin Databases, or RRDs) are always located locally on the remote sites. This is very important because it avoids a continuous transmission of all performance data to the central instance, along with the resulting network traffic. Furthermore, all of the other advantages of a distributed monitoring through Livestatus are retained, as described at the outset.
PNP4Nagios unfortunately has no compatible interface for accessing the graphs in livestatus. Therefore Checkmk simply retrieves the individual graphs, or respectively, the complete websites from PNP4Nagios via HTTP over its standard-URLs. Two methods are used for this:
The PNP4Nagios-data is retrieved directly from the user’s browser
The PNP4Nagios-data is retrieved from the central instance and then forwarded to the user
1. Retrieval via the user’s browser
The first method is very simple to implement. For the relevant sites, configure the URL prefix in the connection’s attributes and set it to the URL used for accessing this instance, but without the /check_mk/:

Checkmk will embed the graphs in the GUI so that the browser retrieves the graphs’ PNG images, or the web page’s iframes, from PNP4Nagios over this URL. Specify the URL as it works from the user’s browser. Access to the remote from the central instance is not necessary.
The URL method as just described is quick and easy to set up, but it has a few small disadvantages:
Since the browser retrieves the PNP4Nagios data from a different host than the Checkmk GUI, the Checkmk session cookie will not be sent. The user must thus log in again for every remote instance. With the first access to a graph a login screen will appear.
The remote server may not in fact be reachable from the user’s browser – rather only from the central instance. In such a case this method can’t function.
The URL-prefix must be set to either http:// or https://. The user can then no longer choose between the two.
2. Retrieval via the central instance
The best solution to this problem is to retrieve the PNP4Nagios-data from
the central instance, rather than from the user’s browser itself.
To this end, create a proxy rule on the central’s Apache-server. This will route
PNP4Nagios queries via HTTP or HTTPS to the correct remote server.
Important: this must be done on the operating system’s Apache,
not the one running in the instance. For this reason root permissions are required.
The prerequisite for this setup is that all Checkmk instance IDs in your network are unique, since Apache uses the remote’s ID to decide which server to forward to.
Assume the following example:
ID | IP address | Livestatus | Checkmk URL |
---|---|---|---|
central | 10.15.18.223 | local | |
remote1 | 10.1.1.133 | Port 6557 | |
In the connection settings, now simply set /remote1/ as the URL-prefix:

With this, queries to PNP4Nagios initially go to the central instance on the /remote1 URL.
Should the remote1 instance happen to be running on the same
server as the central instance, you are now finished and no proxy rule is required,
since the data can be delivered directly.
In the general case, where the remote instance runs on another host,
you will require root permissions and must create a configuration
file for the system-wide Apache server.
The path for this file will depend on your Linux distribution:
Distribution | Path |
---|---|
RedHat, CentOS | /etc/httpd/conf.d/check_mk_proxy.conf |
SLES, Debian, Ubuntu | /etc/apache2/conf.d/check_mk_proxy.conf |
The file consists of five lines for each tethered remote instance.
In the following example, substitute the instance name (here remote1) and the
instance’s URL (here 10.1.1.133/remote1/).
Please note that for Apache it is relevant whether a URL ends
with a slash (/) or not:
<Location /remote1>
Options +FollowSymLinks
RewriteEngine On
RewriteRule ^/.+/remote1/(.*) http://10.1.1.133/remote1/$1 [P]
</Location>
This rule tells Apache that all URLs beginning with /remote1 are
to be retrieved via reverse-proxy from the URL 10.1.1.133/remote1.
Important: don’t forget to activate the configuration. For SLES, Debian and Ubuntu, perform this with:
root@linux# /etc/init.d/apache2 reload
RedHat and CentOS require:
root@linux# /etc/init.d/httpd reload
If everything has been done correctly, the PNP4Nagios graphs should now be accessible.
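When several remote instances are attached, writing the five-line block by hand for each of them becomes repetitive. The following is a minimal sketch that generates the block shown above from a site ID and the remote's address. The function name `proxy_config` is hypothetical and not part of Checkmk; the template itself is copied from the example above.

```python
def proxy_config(site_id: str, remote_host: str) -> str:
    """Return the five-line reverse-proxy block for one remote site.

    site_id     -- the remote's instance ID, e.g. "remote1"
    remote_host -- host name or IP address of the remote server
    """
    lines = [
        f"<Location /{site_id}>",
        "Options +FollowSymLinks",
        "RewriteEngine On",
        # Note the trailing slash after the site ID: Apache treats
        # URLs with and without it differently.
        f"RewriteRule ^/.+/{site_id}/(.*) http://{remote_host}/{site_id}/$1 [P]",
        "</Location>",
    ]
    return "\n".join(lines) + "\n"
```

Calling `proxy_config("remote1", "10.1.1.133")` yields the block from the example. The result could be appended to check_mk_proxy.conf once per remote instance, followed by an Apache reload.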
2.9. Logwatch
Checkmk includes the mk_logwatch plug-in with which you can monitor text log files
under Linux and Windows, and especially the Windows event log.
This plug-in provides a special webpage in the GUI in which the relevant
detected messages can be viewed and acknowledged:

Up until Checkmk version 1.2.8 this page required local access to the saved log messages. These are stored on the remote instance from which the respective server is monitored. In distributed monitoring, however, the central instance has no direct access to these files. The solution is the same as with PNP4Nagios: the remote server’s Logwatch webpage is embedded and retrieved separately from the remote via HTTP.
The configuration required for this is identical to that used when setting up Checkmk for PNP4Nagios. If this has already been set up the Logwatch interface will automatically function correctly.
From Checkmk version 1.4.0i1 the Logwatch webpage
exclusively uses Livestatus for the transfer and no longer requires HTTP.
Setting up HTTP access or the proxy rule is then only needed by users
of the Checkmk Raw Edition, for PNP4Nagios.
2.10. NagVis

The NagVis open source program visualises status data from monitoring on self-produced maps, diagrams and other charts. NagVis is integrated in Checkmk and can be used immediately. The access is easiest over the NagVis Maps sidebar element. The integration of NagVis in Checkmk is described in its own article.
NagVis supports distributed monitoring via Livestatus in much the same way as Checkmk does. The links to the individual sites are referred to as backends. The backends are automatically set up correctly by Checkmk so that you can immediately begin generating NagVis charts, even in distributed monitoring.
Select the correct backend for each object that you place on a chart – i.e., the Checkmk instance from which the object is to be monitored. NagVis cannot find the host or service automatically, above all for performance reasons. Therefore if you move hosts to a different remote instance you will need to update the NagVis-charts accordingly.
Details on backends can be found in the documentation here: NagVis.
3. Unstable or slow connections
The general status overview in the user interface provides constantly available and reliable access to all of the connected instances. The one snag is that a view can only be displayed once all instances have responded. The process is always the same: first a Livestatus query is sent to all instances (for example, “List all services whose state is not OK.”), and the view is displayed only once the last instance has responded.
It is annoying when an instance doesn’t answer at all. To tolerate brief outages
(e.g., due to restarting a site or a lost TCP packet), the GUI waits for a given
time before an instance is declared to be unreachable,
and then continues processing the responses from the remaining sites.
This results in a ‘hanging’ GUI. The timeout is set to 10 seconds by default.
If this occasionally happens in your network you should set up either Status hosts or (even better) the Livestatus proxy.
3.1. Status hosts
The configuration of Status hosts is the recommended procedure with
the Checkmk Raw Edition in order to recognise defective connections reliably.
The idea is simple: the central instance actively monitors the connection to
each individual remote instance. After all, we have a monitoring system at our disposal!
The GUI will then be aware of unreachable instances and can immediately exclude
them and flag them accordingly. Timeouts are thus minimised.
Here is how to set up a status host for a connection:
Add the host on which the remote instance is running to the central instance in monitoring.
Enter this as the status host in the connection to the remote:

A failed connection to a remote instance can now only lead to a brief hang of the GUI, namely until the monitoring has recognised it. By reducing the status host’s check interval from the default of sixty seconds to, e.g., five seconds, you can minimise the duration of a hang.
If you have set up a status host, there are further possible states for connections:
The computer on which the remote instance is running is currently unreachable to the monitoring because a router is down (the status host has an UNREACH state).
The status host that monitors the connection to the remote system has not yet been checked by the monitoring (it still has a PEND state).
The status host’s state has an invalid value (this should never occur).
In all three cases the connection to the instance will be excluded and timeouts thus avoided.
3.2. Persistent connections
With the Use persistent connections check box you can prompt the GUI
to keep established Livestatus connections to remote instances permanently
open and to reuse them for subsequent queries.
Especially for connections with longer packet round-trip times (e.g. intercontinental),
this can make the GUI noticeably more responsive.
Because the GUI is served by multiple independent Apache processes, a separate connection
is required for each Apache client process running simultaneously.
If you have many simultaneous users, please ensure that a sufficient number
of Livestatus connections is configured in the remote’s Nagios core.
These are configured in the etc/mk-livestatus/nagios.cfg file.
The default is twenty (num_client_threads=20).
By default, Apache in Checkmk is configured to permit up to 128
simultaneous user connections. This is configured in the following section
of the etc/apache/apache.conf file:
<IfModule prefork.c>
StartServers 1
MinSpareServers 1
MaxSpareServers 5
ServerLimit 128
MaxClients 128
MaxRequestsPerChild 4000
</IfModule>
This means that under high load up to 128 Apache processes can start, which then
also open and hold up to 128 Livestatus connections.
Not setting num_client_threads high enough can result in errors or a
very slow response time in the GUI.
In a LAN or over fast WAN connections we advise against using persistent connections.
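The relationship between the two limits can be checked mechanically. The sketch below assumes that you have the contents of the remote's etc/mk-livestatus/nagios.cfg and of the central's etc/apache/apache.conf available as text; the function names are hypothetical and the parsing is deliberately simple.

```python
import re


def livestatus_thread_limit(nagios_cfg_text: str, default: int = 20) -> int:
    """Return num_client_threads from the config text, or the default of 20."""
    match = re.search(r"\bnum_client_threads=(\d+)", nagios_cfg_text)
    return int(match.group(1)) if match else default


def apache_max_clients(apache_conf_text: str) -> int:
    """Return the value of the MaxClients directive."""
    match = re.search(r"^\s*MaxClients\s+(\d+)", apache_conf_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MaxClients directive found")
    return int(match.group(1))


def threads_sufficient(nagios_cfg_text: str, apache_conf_text: str) -> bool:
    """True if every Apache process could hold its own Livestatus connection."""
    return livestatus_thread_limit(nagios_cfg_text) >= apache_max_clients(apache_conf_text)
```

With the defaults described above (20 threads against 128 Apache processes) `threads_sufficient` returns False, which is the situation to avoid when persistent connections are enabled for many simultaneous users.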
3.3. The livestatus proxy
With the Livestatus proxy the Checkmk Enterprise Editions feature
a sophisticated mechanism for detecting dead connections.
Additionally, it especially optimises the performance of connections
with long round-trip times. The Livestatus proxy’s advantages are:
Very fast, proactive detection of unresponsive instances
Local caching of queries that deliver static data
Standing TCP-connections – which require fewer round trips and consequently allow much faster responses from distant instances (e.g. USA ⇄ China)
Precise control of the maximum number of livestatus connections required
Enables Hardware/Software inventory in distributed environments
Installation
Installing the livestatus proxy is very simple. It is activated by default in the CEE – which can be seen when starting a site:
OMD[central]:~$ omd start
Starting mkeventd...OK
Starting Livestatus Proxy-Daemon...OK
Starting rrdcached...OK
Starting Check_MK Micro Core...OK
Starting dedicated Apache for site remote1...OK
Starting xinetd...OK
Initializing Crontab...OK
Select the setting ‘Use Livestatus Proxy-Daemon’ for the connection to the remote instances instead of ‘Connect via TCP’:

The details for host and port are as always. No changes need to be made on the remote instance. In Number of channels to keep open enter the number of parallel TCP-connections the proxy should establish and sustain to the target site.
The TCP-connections pool is shared by all GUI enquiries. The number of connections limits the maximum number of queries that can be processed concurrently. This indirectly limits the number of users. In situations in which all channels are reserved this will not immediately lead to an error. The GUI waits a given time for a free channel. Most queries actually require only a few milliseconds.
If the GUI must wait for a channel longer than the time set in Timeout waiting for a free channel, the request is aborted and the user receives an error message. In such a case the number of connections should be increased. Be aware, however, that sufficient parallel incoming connections must be allowed on the remote instance; this is set to 20 by default. This setting can be found in the global options under Monitoring core > Maximum concurrent Livestatus connections.
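The interplay of a fixed number of channels and a waiting timeout can be illustrated with a small model. This is only a sketch of the behaviour described above, not the proxy's actual implementation; the class `ChannelPool` and its method names are hypothetical.

```python
import queue


class ChannelPool:
    """A fixed set of channels shared by all requests."""

    def __init__(self, channels):
        self._free = queue.Queue()
        for channel in channels:
            self._free.put(channel)

    def acquire(self, timeout: float):
        """Return a free channel, waiting up to `timeout` seconds for one."""
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            # All channels stayed busy for the whole waiting time.
            raise TimeoutError(f"no free channel within {timeout} s") from None

    def release(self, channel) -> None:
        """Hand a channel back so that waiting requests can use it."""
        self._free.put(channel)
```

A request that finds all channels busy does not fail at once: it blocks in `acquire` and only raises an error after the timeout. Since most queries finish within milliseconds, a waiting request normally obtains a channel long before that.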
The Regular heartbeat provides a constantly active monitoring of the connections directly at the protocol level. In the process the proxy regularly sends a simple Livestatus query which must be answered by the remote instance within the predetermined time (default: 2 seconds). With this method a situation where the target server and the TCP-port are actually reachable, but the monitoring core no longer responds, will also be detected.
If a response fails to appear, all connections will be declared ‘dead’, and following a ‘cooldown’ time (default: 4 seconds) will be newly established. All this takes place proactively – i.e. without a user needing to open a GUI-window. In this way outages can be quickly detected, and via a recovery the connections can be immediately reestablished and in the best case be available before a user even notices the outage.
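A heartbeat of this kind can be approximated with a plain Livestatus query over TCP. The following sketch assumes that Livestatus is reachable on the given host and port and uses a query on the status table; which query the proxy really sends is not stated here, so the query text and the function name are assumptions.

```python
import socket


def livestatus_heartbeat(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the core answers a simple query within `timeout` seconds."""
    # A Livestatus query ends with an empty line.
    query = b"GET status\nColumns: program_start\n\n"
    try:
        # The timeout applies to connecting as well as to sending and receiving.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(query)
            sock.shutdown(socket.SHUT_WR)  # signal that the query is complete
            return bool(sock.recv(4096))   # any data counts as a sign of life
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Because the check waits for an actual answer rather than just a successful TCP connect, it also fails when the port is open but the monitoring core no longer responds.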
The Caching ensures that static queries need only be responded-to once by the remote instance, and from that point of time can be responded to directly and locally, without delay. An example of this is the list of monitored hosts required by Quicksearch.
Error diagnosis
The Livestatus proxy has its own log file
which can be found under var/log/liveproxyd.log.
On a correctly-configured remote instance with five channels (standard)
it will look something like this:
2016-09-19 14:08:53.310197 ----------------------------------------------------------
2016-09-19 14:08:53.310206 Livestatus Proxy-Daemon starting...
2016-09-19 14:08:53.310412 Configured 1 sites
2016-09-19 14:08:53.310469 Removing left-over unix socket /omd/sites/central/tmp/run/liveproxy/remote1
2016-09-19 14:08:53.310684 Channel remote1/5 successfully connected
2016-09-19 14:08:53.310874 Channel remote1/6 successfully connected
2016-09-19 14:08:53.310944 Channel remote1/7 successfully connected
2016-09-19 14:08:53.311009 Channel remote1/8 successfully connected
2016-09-19 14:08:53.311071 Channel remote1/9 successfully connected
The Livestatus proxy regularly records its state in the var/log/liveproxyd.state file:
Current state:
[remote1]
State: ready
Last Reset: 2016-09-19 14:08:53 (125 secs ago)
Site's last reload: 2016-09-19 14:08:45 (134 secs ago)
Last failed connect: 1970-01-01 01:00:00 (1474287059 secs ago)
Cached responses: 1
Last inventory update: 1970-01-01 01:00:00 (1474287059 secs ago)
PID of inventory update: None
Channels:
5 - ready - client: none - since: 2016-09-19 14:10:38 ( 20 secs ago)
6 - ready - client: none - since: 2016-09-19 14:10:43 ( 15 secs ago)
7 - ready - client: none - since: 2016-09-19 14:10:48 ( 10 secs ago)
8 - ready - client: none - since: 2016-09-19 14:10:53 ( 5 secs ago)
9 - ready - client: none - since: 2016-09-19 14:10:33 ( 25 secs ago)
Clients:
Heartbeat:
heartbeats received: 24
next in 0.2s
And when an instance is currently stopped the state will look like this:
----------------------------------------------
Current state:
[remote1]
State: starting
Last Reset: 2016-09-19 14:12:54 ( 10 secs ago)
Site's last reload: 2016-09-19 14:12:54 ( 10 secs ago)
Last failed connect: 2016-09-19 14:13:02 ( 2 secs ago)
Cached responses: 0
Last inventory update: 1970-01-01 01:00:00 (1474287184 secs ago)
PID of inventory update: None
Channels:
Clients:
Heartbeat:
heartbeats received: 0
next in -5.2s
Here the state is ‘starting’.
The proxy is thus attempting to establish connections.
There are no channels yet. During this state queries to the site will be answered with an error.
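For a quick overview of several sites, the state file can be evaluated with a few lines of code. The sketch below assumes the layout shown in the two examples above, i.e. a line with the site ID in square brackets followed by a line starting with State:. The function name is hypothetical.

```python
import re


def site_states(state_text: str) -> dict:
    """Map each site ID found in liveproxyd.state to its reported state."""
    states = {}
    current_site = None
    for raw_line in state_text.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            # A line such as "[remote1]" starts the section of a site.
            current_site = header.group(1)
        elif current_site and line.startswith("State:"):
            states[current_site] = line.split(":", 1)[1].strip()
    return states
```

Applied to the two examples above, the function returns `{"remote1": "ready"}` and `{"remote1": "starting"}` respectively.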
4. Livedump and CMCDump
4.1. Motivation
The concept for a distributed monitoring with Checkmk that has been described up until now is a good and simple solution in most cases. It does however require network access from the central to the remote instances. There are situations in which access is either not possible or not desired, because, for example:
the remote instances are in your customer’s network to which you have no access
the remote instances are in a security area to which access is strictly forbidden
the remote instances have no permanent network connection and no fixed IP-addresses
Distributed monitoring with Livedump or CMCDump takes a quite different approach. First, the remote instances are set up so that they operate completely independently of the central instance and are administered decentrally. A distributed WATO is dispensed with.
All of the remote instance’s hosts and services will then be replicated as copies in the central instance. Livedump/CMCDump can help by generating a copy of the remote instances’ configuration which can then be loaded into the central instance.
Then, during monitoring, every remote instance writes a copy of its current status
to a file at predetermined intervals (e.g. every minute).
This file is transmitted to the central instance via a user-defined method and
saved there as a status update. No particular protocol has been provided or specified
for this data transfer.
Any automatable transfer protocol could be used. It is not essential to use scp –
even a transfer by email is conceivable!
Such a setup differs from a ‘normal’ distributed monitoring in the following ways:
Updates of the states and performance data in the central instance will be delayed.
Calculation of availability on the central instance will give minimally different results to a calculation on the remote instance.
State changes that occur more quickly than the update interval will be invisible to the central instance.
If a remote instance is ‘dead’, the states will become obsolete on the central instance – the services will be ‘stale’, but nonetheless still visible. Performance and availability data for this time period will be ‘lost’ (but they will still be available on the remote instance).
Commands on the central instance such as Downtimes and Acknowledgements cannot be transmitted to the remote instance.
The central instance can never access the remote instances.
Access to logfile details by Logwatch is impossible.
The Event Console will not be supported by Livedump/CMCDump.
Since brief state changes – depending on the periodic interval selected on the central instance – may not be visible, a notification through the central instance is not ideal. If however the central instance is utilised as a purely display instance – as a central overview of all customers for example – this method definitely has its advantages.
Incidentally, Livedump/CMCDump can be used simultaneously alongside distributed monitoring over Livestatus without problems. Some instances are simply connected via Livestatus directly – others use Livedump. Livedump can also be added to one of the Livestatus remote instances.
4.2. Installing Livedump
If you are using the Checkmk Raw Edition (or the CEE with a Nagios core),
use the livedump tool. The name is derived from Livestatus and
Status-Dump. From Checkmk version 1.2.8p12 livedump is located
directly in the search path and is thus available as a command.
In older versions you can find it under ~/share/doc/check_mk/treasures/livedump/livedump.
We will make the following assumptions:
The remote instance has been fully set up and is actively monitoring hosts and services.
The central instance has been started and is running.
At least one host is being locally monitored on the central instance (because the central monitors itself).
Transferring the configuration
First, on the remote instance, create a copy of its host and service configuration in
Nagios configuration format. To do this, redirect the output from livedump -TC to a file:
OMD[remote1]:~$ livedump -TC > config.cfg
The start of the file will look something like this:
define host {
name livedump-host
use check_mk_default
register 0
active_checks_enabled 0
passive_checks_enabled 1
}
define service {
name livedump-service
register 0
active_checks_enabled 0
passive_checks_enabled 1
check_period 0x0
}
Transmit the file to the central instance (e.g. with scp) and save it there in
the ~/etc/nagios/conf.d/ directory – here Nagios expects to find the
configuration data for hosts and services. Select a file name that ends with
.cfg, for example ~/etc/nagios/conf.d/config-remote1.cfg.
If an SSH-access from remote to central instance is possible it can be done, for example, as below:
OMD[remote1]:~$ scp config.cfg central@mycentral.mydomain:etc/nagios/conf.d/config-remote1.cfg
central@mycentral.mydomain's password:
config.cfg 100% 8071 7.9KB/s 00:00
Now log in to the central instance and activate the changes:
OMD[central]:~$ cmk -R
Generating configuration for core (type nagios)...OK
Validating Nagios configuration...OK
Precompiling host checks...OK
Restarting monitoring core...OK
Now all of the remote’s hosts and services should appear in the central instance – initially with the PEND state, which they will retain for the time being:

Note:
With the -T option, livedump creates template definitions on which the configuration is based. Without these Nagios cannot be started. Only one set of these may be present, however. If you also import a configuration from another remote instance, it must not use the -T option!
A dump of the configuration is also possible on a CMC core, but importing it requires Nagios. If the CMC is running on your central instance, use CMCDump.
The copying and transferring of the configuration must be repeated for every change to hosts or services on the remote instance.
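When configurations from several remote instances are imported, it is easy to dump more than one of them with -T by mistake. The following sketch scans the contents of the imported .cfg files for template definitions that appear more than once. It assumes that the template names begin with livedump-, as in the sample output above; the function names are hypothetical.

```python
import re

# Matches a "name livedump-..." directive, but not "host_name" or "use".
TEMPLATE_RE = re.compile(r"^\s*name\s+(livedump-\S+)\s*$", re.MULTILINE)


def template_counts(config_texts) -> dict:
    """Count how often each livedump template is defined across all files."""
    counts = {}
    for text in config_texts:
        for name in TEMPLATE_RE.findall(text):
            counts[name] = counts.get(name, 0) + 1
    return counts


def duplicated_templates(config_texts) -> list:
    """Return the sorted names of templates that are defined more than once."""
    return sorted(name for name, n in template_counts(config_texts).items() if n > 1)
```

An empty result means that at most one of the imported files contains the templates. Any name in the result points to a second file that was dumped with -T and should be regenerated without it.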
Transferring the status
Once the hosts are visible in the central instance, we will need to set up a (regular) transmission
of the remotes’ monitoring status. Again create a file with livedump,
but this time without additional options:
OMD[remote1]:~$ livedump > state
This file contains the states of all hosts and services in a format which Nagios can read directly from check results. The start of this file looks something like this:
host_name=myserver666
check_type=1
check_options=0
reschedule_check
latency=0.13
start_time=1475521257.2
finish_time=1475521257.2
return_code=0
output=OK - 10.1.5.44: rta 0.005ms, lost 0%|rta=0.005ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.019ms;;;; rtmin=0.001ms;;;;
Copy this file to the central instance into the ~/tmp/nagios/checkresults directory.
Important: This file’s name must begin with c and be seven characters long.
With scp it will look something like this:
OMD[remote1]:~$ scp state central@mycentral.mydomain:tmp/nagios/checkresults/caabbcc
central@mycentral.mydomain's password:
state 100% 12KB 12.5KB/s 00:00
Finally, create an empty file on the central instance with the same name and the .ok extension.
With this Nagios will know that the status file has been transferred completely and can
now be read in:
OMD[central]:~$ touch tmp/nagios/checkresults/caabbcc.ok
The status of the remotes’ hosts/services will now be immediately updated on the central instance:

From now on, the status must be transmitted regularly.
Livedump unfortunately doesn’t support this task and you will need to script it yourself.
The livedump-ssh-recv script can be found
in ~/share/doc/check_mk/treasures/livedump, which you can employ
in order to receive Livedump updates (including those of the configuration)
on the central instance via SSH. Details about this can be found in the script itself.
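If you script the delivery yourself, the receiving side has to follow the two naming rules described above. The sketch below shows one possible way to do this on the central instance: it stores a received status dump under a seven-character name beginning with c and then creates the empty .ok marker. The function name and the choice of random characters are assumptions for illustration.

```python
import os
import random
import string


def deliver_check_results(state_data: bytes, checkresults_dir: str) -> str:
    """Store a status dump for Nagios and mark it as complete.

    Returns the path of the status file that was written.
    """
    # The name must begin with "c" and be seven characters long.
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
    path = os.path.join(checkresults_dir, "c" + suffix)
    # Mode "xb" refuses to overwrite an existing file of the same name.
    with open(path, "xb") as status_file:
        status_file.write(state_data)
    # The empty .ok file is created last, so that Nagios only reads
    # the status file once it has been written completely.
    with open(path + ".ok", "x"):
        pass
    return path
```

On the central instance this could be called with the received data and the path of the site's tmp/nagios/checkresults directory, for example from a script that is triggered by each incoming transfer.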
The configuration and status dump can also be restricted by using Livestatus filters.
For example, you could limit the hosts to the members of the mygroup host group:
OMD[remote1]:~$ livedump -H "Filter: host_groups >= mygroup" > state
Further information on Livedump – in particular how to transfer the data via
encrypted email – can be found in the README file in the
~/share/doc/check_mk/treasures/livedump directory.
4.3. Implementing CMCDump
CMCDump is for the Checkmk Micro Core what Livedump is for Nagios,
and it is thus the tool of choice for the Checkmk Enterprise Editions.
In contrast to Livedump, CMCDump can replicate the complete status
of hosts and services (Nagios doesn’t have the required interfaces for this task).
To compare: Livedump transfers the following data:
The current states – i.e. PEND, OK, WARN, CRIT, UNKNOWN, UP, DOWN or UNREACH
The output from Check plug-ins
The performance data
CMCDump additionally synchronises:
The long output from the plug-in
Whether the object is currently flapping
The time stamps for the last check execution and the last state change
The duration of the check execution
The latency of the check execution
The sequence number of the current check attempt and whether the current state is ‘hard’ or ‘soft’
Acknowledgements, if present
Whether the object is currently in a planned maintenance period
This provides a much more precise reflection of the monitoring. When importing the status the CMC doesn’t just simulate a check execution, but instead transfers an accurate status using an interface designed for this task. Among other things, this means that at any time the operations centre can see whether problems have been acknowledged or whether maintenance periods have been entered.
The installation is almost identical to that for Livedump, but is somewhat simpler since there is no need to be concerned about possible duplicated templates or similar.
The copy of the configuration is made with cmcdump -C. Store this file on
the central instance in etc/check_mk/conf.d/. The .mk file extension must be used:
OMD[remote1]:~$ cmcdump -C > config.mk
OMD[remote1]:~$ scp config.mk central@mycentral.mydomain:etc/check_mk/conf.d/remote1.mk
Activate the configuration on the central instance:
OMD[central]:~$ cmk -O
As with Livedump the hosts and services will now appear on the central instance in the
PEND state. You will however see by the symbol that we
are dealing with a shadow object. In this way it can be distinguished
from an object being monitored directly on the central instance or on a ‘normal’ remote instance:

The regular generation of the status is achieved with cmcdump without
additional arguments:
OMD[remote1]:~$ cmcdump > state
OMD[remote1]:~$ scp state central@mycentral.mydomain:tmp/state_remote1
To import the status to the central instance the file content must be written into the
tmp/run/live UNIX socket with the help of the unixcat tool.
OMD[central]:~$ unixcat tmp/run/live < tmp/state_remote1
If you have a connection from the remote to the central instance via SSH without a password all three commands can be combined into a single one – and when so doing not even a temporary file is created:
OMD[remote1]:~$ cmcdump | ssh central@mycentral.mydomain "unixcat tmp/run/live"
It really is that simple! But, as already mentioned, ssh/scp
is not the only method for transferring files, and a configuration or status
can be transferred just as well using email or any other desired protocol.
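If the status arrives on the central instance by some other route and is then processed by a script, the import step does not necessarily need unixcat. The following sketch writes a received status file into a UNIX socket directly. It assumes that, as with the unixcat call above, it is sufficient to send the file content and then close the sending side; the function name is hypothetical.

```python
import socket


def send_file_to_socket(state_file: str, socket_path: str) -> bytes:
    """Write the content of `state_file` into the UNIX socket at `socket_path`.

    Returns whatever the other side sends back before closing the connection.
    """
    with open(state_file, "rb") as handle:
        payload = handle.read()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(payload)
        # Closing the sending side tells the receiver that the data is complete.
        sock.shutdown(socket.SHUT_WR)
        response = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            response += chunk
    return response
```

Run as the site user from the site's home directory, a call such as `send_file_to_socket("tmp/state_remote1", "tmp/run/live")` would correspond to the unixcat command shown above.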
5. Notifications in distributed environments
5.1. Centralised or decentralised?
In a distributed environment the question arises – from which instance should the notifications (e.g. emails) be sent: from the individual remotes or from the central instance? There are arguments in favour of both procedures.
Arguments for sending from the remotes:
Simpler to set up
A local notification is still possible if the link to the central instance is not available
Also works with the Checkmk Raw Edition
Arguments for sending from the central instance:
Notifications can be further processed at a central location (e.g. be forwarded to a ticket system)
Remote instances require no setup for email or SMS
Hardware for sending SMS is only required once – on the central instance
5.2. Decentralised notification
No special steps are required for a decentralised notification since this is the standard setting. Every notification that is generated on a remote instance runs through the chain of notification rules there. If you use a distributed WATO these rules are the same on all instances. Notifications resulting from these rules are delivered as usual, with the appropriate notification scripts being run locally.
It must simply be ensured that the appropriate service has been correctly installed on the instances – that a smart host has been defined for emails for example – in other words the same procedure as for setting up an individual Checkmk instance.
5.3. Centralised notifications
Fundamentals
The Checkmk Enterprise Editions provide a built-in mechanism for centralised notifications
which can be individually activated for each remote instance.
Such remotes then route all notifications to the central instance for further processing.
The centralised notification is thereby independent of whether the distributed
monitoring has been set up in the standard way, or with CMCDump,
or by using a blend of these procedures.
Technically speaking, the central notification server does not even need
to be the ‘central’. This task can be taken on by any Checkmk instance.
If a remote instance has been set to ‘forwarding’, all notifications will be forwarded directly to the central instance as they come from the core – effectively in a raw format. Only there are the notification rules evaluated which actually decide who should be notified and how. The required notification scripts are invoked on the central instance.
Activating the notification spooler
The first step for implementing centralised notification is to activate the
notification spooler (mknotifyd) on all participating instances.
This is an auxiliary process that is required on the central as well as
on the remote instances. In newer Checkmk versions the notification spooler is
automatically activated. Please verify this with omd config and activate
it if needed. This point can be found under Distributed Monitoring > MKNOTIFYD.

The output of omd status must show the mknotifyd process:
OMD[central]:~$ omd status
mkeventd: running
liveproxyd: running
mknotifyd: running
rrdcached: running
cmc: running
apache: running
crontab: running
-----------------------
Overall state: running
Only when the notification spooler is active will the point Notifications > Notification spooling be found under the global settings in WATO.
Setting up the TCP-connections
The remote and (notification-)central notification spoolers communicate with each other via TCP. Notifications are sent from remote to central instance. The central acknowledges to the remotes that the notifications have been received, which prevents notifications being lost even if the TCP connection is broken.
There are two alternatives for the construction of a TCP-connection:
A TCP-connection is configured from central to remote instance. Here the remote is the TCP-server.
A TCP-connection is configured from remote to central instance. Here the central is the TCP-server.
Consequently there is nothing standing in the way of forwarding notifications if for network reasons establishing connections is only possible in a specific direction. The TCP-connections are supervised by the spooler with a heartbeat signal and are immediately reestablished as needed – not only in the event of a notification.
Since remote and central instance require different global settings you must make site specific settings for all remotes. Configuring the central instance is performed using the normal global settings. This is due to Checkmk currently not supporting any specific settings for the local instance (= central instance). Please note – these settings will be automatically inherited by all remotes for which no specific settings have been defined.
Let us look first at an example where the central instance establishes the TCP-connections to the remote instances.
Step 1: On the remote instance, edit the instance specific global setting Notifications > Notification Spooler Configuration and activate Accept incoming TCP connections. TCP-Port 6555 will be recommended for incoming connections. If there are no objections, adopt these settings.

Step 2: Now, likewise, in the Notification Spooling submenu only on the remote instance, select the option Forward to remote site by notification spooler.

Step 3: Now, on the central instance – i.e. in the normal global settings – configure the connection to the remote (and then to additional remotes as needed):

Step 4: Set the global setting Notification Spooling to Asynchronous local delivery by notification spooler, so that the central instance’s communications will also be processed over the same central spooler.

Step 5: Activate the changes.
Establishing connections from a remote instance
If the TCP-connection should be established from the remote outwards, the procedure is identical, differing only from the description above by simply exchanging the roles of central and remote instance.
A blend of the two procedures is also possible. In such a case the central instance must be configured so that it listens for incoming connections as well as connecting to remote instances. However, in every central/remote relationship only one of the pair is permitted to establish the connection!
Test and diagnose
The notification spooler logs to the var/log/mknotifyd.log file.
In the spooler configuration the log level can be raised so that more messages are
logged. With the standard log level you should see something like this on the central instance:
2016-10-04 17:19:28 [5] -----------------------------------------------------------------
2016-10-04 17:19:28 [5] Check_MK Notification Spooler version 1.2.8p12 starting
2016-10-04 17:19:28 [5] Log verbosity: 0
2016-10-04 17:19:28 [5] Daemonized with PID 31081.
2016-10-04 17:19:28 [5] Successfully connected to 10.1.8.44:6555
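To check such a log for successful connections without reading it by hand, the lines can be filtered with a small script. This is a sketch that assumes the exact line layout shown above (timestamp, a number in square brackets, message) and the message text `Successfully connected to`. Other spooler versions may log differently, and the function name `connected_peers` is invented for this example.

```python
import re
from typing import List

# Layout assumed from the excerpt above: "<date> <time> [<number>] <message>"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\d+)\] (.*)$")
CONNECTED = re.compile(r"^Successfully connected to (\S+):(\d+)$")


def connected_peers(log_text: str) -> List[str]:
    """Return 'host:port' for every 'Successfully connected to' entry, in log order."""
    peers = []
    for line in log_text.splitlines():
        entry = LOG_LINE.match(line.strip())
        if not entry:
            continue  # skip lines that do not follow the assumed layout
        hit = CONNECTED.match(entry.group(3))
        if hit:
            peers.append(f"{hit.group(1)}:{hit.group(2)}")
    return peers


sample = """\
2016-10-04 17:19:28 [5] Check_MK Notification Spooler version 1.2.8p12 starting
2016-10-04 17:19:28 [5] Daemonized with PID 31081.
2016-10-04 17:19:28 [5] Successfully connected to 10.1.8.44:6555
"""
print(connected_peers(sample))
```

In practice the text would be read from var/log/mknotifyd.log in the site directory instead of the inline sample.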
The var/log/mknotifyd.state file always contains the current status of the spooler and all of its connections:
Connection: 10.1.8.44:6555
Type: outgoing
State: established
Status Message: Successfully connected to 10.1.8.44:6555
Since: 1475594368 (2016-10-04 17:19:28, 140 sec ago)
Connect Time: 0.000 sec
A version of the same file is also present on the remote instance. There the connection will look something like this:
Connection: 10.22.4.12:56546
Type: incoming
State: established
Since: 1475594368 (2016-10-04 17:19:28, 330 sec ago)
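Because the state file is plain `Key: value` text, it is easy to evaluate with a script, for example to list connections that are not established. The sketch below assumes only the layout visible in the two excerpts above, with each connection block starting at a `Connection:` line. The real file may contain further lines, which are ignored here if they precede the first block. The function names are hypothetical.

```python
from typing import Dict, List


def parse_connections(state_text: str) -> List[Dict[str, str]]:
    """Split the state text into one dict per 'Connection:' block."""
    connections: List[Dict[str, str]] = []
    current = None
    for raw in state_text.splitlines():
        # Split at the first ": " so values such as "10.1.8.44:6555" stay intact.
        key, sep, value = raw.strip().partition(": ")
        if not sep:
            continue  # not a "Key: value" line
        if key == "Connection":
            current = {}
            connections.append(current)
        if current is not None:
            current[key] = value
    return connections


def not_established(connections: List[Dict[str, str]]) -> List[str]:
    """Return the 'Connection' value of every block whose State is not 'established'."""
    return [c["Connection"] for c in connections if c.get("State") != "established"]


sample = """\
Connection: 10.1.8.44:6555
Type: outgoing
State: established
Status Message: Successfully connected to 10.1.8.44:6555
Since: 1475594368 (2016-10-04 17:19:28, 140 sec ago)
Connect Time: 0.000 sec
"""
parsed = parse_connections(sample)
print(parsed[0]["Type"], not_established(parsed))
```

An empty list from `not_established` means every connection found in the file is reported as established.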
To test, select any monitored remote service and set it manually to CRIT with the Fake check results command.
An incoming notification should now appear on the central instance in the notifications log file (notify.log):
2016-10-04 17:27:57 ----------------------------------------------------------------------
2016-10-04 17:27:57 Got spool file 68c30b35 (myserver123;Check_MK) from remote host for local delivery.
The same event will look like this on the remote instance:
2016-10-04 17:27:23 ----------------------------------------------------------------------
2016-10-04 17:27:23 Got raw notification (myserver123;Check_MK) context with 71 variables
2016-10-04 17:27:23 Creating spoolfile: /omd/sites/remote1/var/check_mk/notify/spool/f3c7dea9-0e61-4292-a190-785b4aa46a64
In the global settings, you can raise the loglevel of the normal notifications log (notify.log) as well as that of the notification spooler’s log.
Monitoring the spooling
Once you have set up everything as described, you will notice that a new service is found on the central instance and on each of the remotes, and it should definitely be added to the monitoring. This service monitors the notification spooler and its TCP connections. Every connection is thereby monitored twice: once by the central instance and once by the remote instance:

6. Files and directories
6.1. Configuration files
Path | Description |
---|---|
etc/check_mk/multisite.d/sites.mk | Here WATO stores the configuration for the connections to the individual instances. If the interface ‘hangs’ due to an error in the configuration and becomes inoperable, you can edit the disruptive entry directly in the file. If the Livestatus proxy is activated, however, you must subsequently edit and save at least one connection in WATO, since only this action generates a suitable configuration for this daemon. |
etc/check_mk/liveproxyd.mk | Configuration for the Livestatus proxy. This file is regenerated by WATO with every change to the configuration of a distributed monitoring. |
etc/check_mk/mknotifyd.d/wato/global.mk | Configuration for the notification spooler. This file is generated by WATO when saving the global settings. |
etc/check_mk/conf.d/distributed_wato.mk | This file is generated on the remote instances by the distributed WATO and ensures that the remote only monitors its own hosts. |
etc/nagios/conf.d/ | Storage location for user-created Nagios configuration files with hosts and services. These are required for the use of Livedump on the central instance. |
etc/mk-livestatus/nagios.cfg | The configuration of Livestatus for the use of Nagios as the core. Here you can configure the maximum number of simultaneous connections allowed. |
etc/check_mk/conf.d/ | The configuration of hosts and rules for Checkmk. Store configuration files that are generated by CMCDump here. Only the `wato/` subdirectory is managed by, and visible in, WATO. |
var/check_mk/autochecks/ | Services found by the service discovery. These are always stored locally on the remote instance. |
var/check_mk/rrds/ | Location of the Round-Robin-Database for archiving the performance data when using the Checkmk-RRD-format (the default with the Enterprise Editions). |
var/pnp4nagios/perfdata/ | Location of the Round-Robin-Database with the PNP4Nagios-format ( |
var/log/liveproxyd.log | Log file for the Livestatus proxies. |
var/log/liveproxyd.state | The current state of the Livestatus proxies in a readable form. This file is updated every 5 seconds. |
var/log/notify.log | Log file for the Checkmk notification system. |
var/log/mknotifyd.log | Log file for the notification spooler. |
var/log/mknotifyd.state | The current state of the notification spooler in a readable form. This file is updated every 20 seconds. |
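
Since the two state files are rewritten at fixed intervals (every 5 seconds for liveproxyd.state, every 20 seconds for mknotifyd.state), an old modification time is a simple hint that the corresponding daemon is no longer running. The following sketch compares a file's age against its update interval. The helper name `is_stale` and the tolerance factor of 3 are assumptions chosen for illustration, not Checkmk defaults.

```python
import os
import time
from typing import Optional


def is_stale(path: str, update_interval: float, grace_factor: float = 3.0,
             now: Optional[float] = None) -> bool:
    """Return True if `path` is missing or older than update_interval * grace_factor seconds.

    `grace_factor` is an arbitrary tolerance chosen for this sketch.
    """
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable state file counts as stale
    if now is None:
        now = time.time()
    return (now - mtime) > update_interval * grace_factor
```

Run inside the site directory, `is_stale("var/log/mknotifyd.state", 20)` and `is_stale("var/log/liveproxyd.state", 5)` use the intervals from the table above.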