Scaling and distributing Checkmk

1. Introduction

Probably not everybody has the same understanding of the term ‘Distributed Monitoring’. In fact monitoring is always distributed over multiple computers, unless the monitoring system is only monitoring itself – which wouldn’t be very useful.

In this handbook we therefore speak of distributed monitoring whenever the monitoring system as a whole consists of more than a single Checkmk instance. There are a number of good reasons for splitting monitoring over multiple instances:

Performance: The processor load should, or must, be shared over multiple machines.
Organisation: Various different groups should be able to administer their own instances independently.
Availability: The monitoring at one location should function independently of other locations.
Security: Data streams between two security domains should be separately and precisely controlled (DMZ, etc.).
Network: Locations that have only narrow-band or unreliable connections cannot be monitored reliably from a remote site.

Checkmk supports various procedures for implementing a distributed monitoring. Some of these are available because Checkmk is largely compatible with, or based on, Nagios (if Nagios has been installed as the core). These include the old NSCA procedure and the somewhat more modern mod_gearman. Compared to Checkmk’s own system they offer no advantages and are also more cumbersome to implement. For these reasons we don’t recommend them.

The procedure preferred by Checkmk is based on Livestatus and a distribution of the configuration using WATO. For situations with strictly separated networks, or even a strict one-way data transfer from the periphery to the centre, there is a method using Livedump or CMCDump. Both methods can be combined.

2. Distributed monitoring with Livestatus

2.1. Basic principle

Central status

Livestatus is an interface integrated into the monitoring core which enables external programs to query status data and execute commands. Livestatus can be made available over the network so that it can be accessed by a remote Checkmk instance. Checkmk’s user interface uses Livestatus to combine all connected instances into a general overview. This then feels like one ‘large’ monitoring system.

The following diagram schematically shows the structure of a monitoring with Livestatus distributed over three locations. The Checkmk instance Central Site is located at the central processing site, and the central systems are monitored directly from there. In addition there are the instances Remote Site 1 and Remote Site 2, which are located in other networks and monitor the systems local to them:

What makes this method special is that the monitoring status of the remote instances is not sent continuously to the central site. The GUI only ever retrieves data live from the remote instances when a user in the control centre requires it. The data is then compiled into a centralised view. No data is therefore held centrally, which is a huge advantage for scaling up!

Here are some of the advantages of this method:

Scalability: The monitoring itself generates no network traffic at all between central and remote site. In this way hundreds of locations, or more, can be connected.
Reliability: If a network connection to a remote instance fails the local monitoring nonetheless continues operating normally. There is no ‘hole’ in the data recording and also no data ‘jam’. A local notification will still function.
Simplicity: Instances can be very easily incorporated or removed.
Flexibility: The remote instances are still self-contained and can be used for operations at their respective locations. This is particularly interesting if the ‘location’ should never be permitted to access the rest of the monitoring.
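
The live-query principle described above can be illustrated with a minimal Python sketch. It assumes that Livestatus on the remote instance is reachable over plain, unencrypted TCP, and it uses the Livestatus query headers Columns and OutputFormat. The function names build_query and query_livestatus are illustrative and not part of Checkmk.

```python
import json
import socket


def build_query(table, columns):
    """Build a Livestatus query that returns the given columns as JSON."""
    lines = [
        f"GET {table}",
        "Columns: " + " ".join(columns),
        "OutputFormat: json",
    ]
    # An empty line terminates the query.
    return "\n".join(lines) + "\n\n"


def query_livestatus(host, port, table, columns, timeout=5.0):
    """Send one query to a remote instance and return the decoded rows."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_query(table, columns).encode("utf-8"))
        # Close the sending direction so the server knows the request is complete.
        conn.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

A call such as query_livestatus("10.1.1.133", 6557, "hosts", ["name", "state"]) would fetch the current host states at the moment they are needed; nothing is transferred while nobody asks. The address and port here are placeholders for your own remote instance.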

Centralized configuration

In a system distributed using Livestatus as described above, it is quite possible that the individual instances can be independently maintained by different teams, and the central site only has the task of providing a centralised dashboard.

If multiple or all instances are to be administered by the same team, a central configuration is much easier to handle. Checkmk supports this and refers to such a configuration as a ‘distributed WATO’. With this, all hosts and services, users and permissions, time periods, notifications, etc., are maintained on the central instance using WATO and then, depending on their tasks, automatically distributed to the remote instances.

Such a system not only has a common status overview but also a common configuration, and effectively ‘feels like a large system’.

2.2. Installing a distributed monitoring

Installing a distributed monitoring using livestatus/distributed WATO is achieved in the following steps:

First install the central instance as is usually done for a single instance
Install the remote instances, and enable Livestatus via the network
Integrate the remote instances into the central instance using the Distributed monitoring WATO-module
For the hosts and services, specify from which instance they are to be monitored
Execute a service discovery for the migrated hosts, and then activate the fresh changes

Installing a central instance

No special requirements are placed on the central instance. This means that a long-established instance can be expanded into a distributed monitoring without requiring additional modifications.

Installing remote instances and enabling livestatus via the network

The remote instances are then generated as new instances in the usual way with omd create. This will naturally take place on the (remote) server intended for the respective remote instance.

Special notes:

For the remote instances, use IDs unique to your distributed monitoring.
The remote’s Checkmk-version is permitted to diverge from the central instance’s version to a maximum of one patch level (denoted by the numeral following the ‘p’ for stable versions). Other versions may be compatible, but not necessarily. Information on the Checkmk version-numbering system can be found in its own article.
In the same way as Checkmk supports multiple instances on a server, remote instances can also run on the same server. Here is an example for creating a remote instance with the name remote1:

root@linux# omd create remote1
Adding /opt/omd/sites/remote1/tmp to /etc/fstab.
Creating temporary filesystem /omd/sites/remote1/tmp...OK
Updating core configuration...
Generating configuration for core (type cmc)...Creating helper config...OK
OK
Restarting Apache...OK
Created new site remote1 with version 1.6.0.

  The site can be started with omd start remote1.
  The default web UI is available at http://myServer/remote1/

  The admin user for the web applications is cmkadmin with password: lEnM8dUV
  For command line administration of the site, log in with 'omd su remote1'.
  After logging in, you can change the password for cmkadmin with 'htpasswd etc/htpasswd cmkadmin'.

The most important step is now to enable Livestatus via TCP on the network. Please note that Livestatus is not per se a secure protocol and should only be used within a secure network (secured LAN, VPN, etc.). Enable it with omd config as the instance user while the site is stopped:

root@linux# su - remote1
OMD[remote1]:~$ omd config

Now select Distributed Monitoring:

Set LIVESTATUS_TCP to ‘on’ and for LIVESTATUS_TCP_PORT enter an available port number that is unique on this server. The default is 6557:

After saving, start the instance as normal with omd start:

OMD[remote1]:~$ omd start
Starting mkeventd...OK
Starting Livestatus Proxy-Daemon...OK
Starting rrdcached...OK
Starting Check_MK Micro Core...OK
Starting dedicated Apache for site remote1...OK
Starting xinetd...OK
Initializing Crontab...OK

Retain the password for cmkadmin. Once the remote has been subordinated to the central instance, all users will likewise be replaced by those from the central instance.

The remote is now ready. Verify this with netstat, which should show that port 6557 is open. Connections to this port are accepted by an instance of the auxiliary daemon xinetd, which runs directly in the instance:

root@linux# netstat -lnp | grep 6557
tcp        0      0 0.0.0.0:6557            0.0.0.0:*     LISTEN      10719/xinetd
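
The netstat check above only confirms that the port is listening locally. To confirm that the port is also reachable across the network, a short Python sketch like the following could be run on the central server. The function name livestatus_port_open is illustrative and the default port is the one from the example.

```python
import socket


def livestatus_port_open(host, port=6557, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

A result of False points to a firewall, a stopped site or a disabled LIVESTATUS_TCP setting. A result of True only shows that something accepts connections on that port, not that it is Livestatus.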

Assigning remote instances to the central instance

The configuration of the distributed monitoring takes place exclusively on the central instance. The required WATO module is Distributed monitoring, and this serves to manage the connections to the individual instances. For this function the central instance itself counts as an instance and is already present in the list:

Using the New connection button, now define the connection to the first remote instance:

In the Basic settings it is important to use the remote instance’s EXACT name – as defined with omd create – as the Site-ID. As always the alias can be defined as desired and also be later changed.

The Livestatus settings determine how the central instance queries the status of the remote instances via Livestatus. The example in the screenshot shows a connection with the Connect via TCP method. This is optimal for stable connections with low latency (e.g. in a LAN). We will discuss the optimal settings for WAN connections later.

The URL prefix is required for integrating other applications (e.g. PNP4Nagios). We will come to this subject separately later. Enter the HTTP URL to the remote’s web interface here (only the part preceding the check_mk/ component). If you generally access Checkmk via HTTPS, then substitute the http here with https. Further information can be found in the online help or in the corresponding article regarding HTTPS with Checkmk.

The use of Distributed WATO is, as we discussed in the introduction, optional. Activate this if you wish to configure the remote with and from the central instance. In such a case select the exact settings as shown in the image above.

A correct setting for the Multisite-URL of the remote site is very important. The URL must always end with /check_mk/. A connection with HTTPS is recommended, provided that the remote instance’s Apache supports HTTPS. This must be installed manually on the remote at the Linux level. For the Checkmk Appliance, HTTPS can be set up using the web-based configuration interface. If you utilise a self-signed certificate, you will require the Ignore SSL certificate errors check box.

Once the form has been saved a second instance will appear in the overview:

The (so far) empty remote instance’s monitoring status is now correctly integrated. A Login to the remote’s WATO is still required for the distributed WATO. To this end, via HTTP the central instance exchanges a randomly-generated password with the remote instance, through which all future communication will take place. The cmkadmin access on the remote instance will subsequently no longer be used.

To log in, use the cmkadmin account and the corresponding password of the remote instance:

A successful login is acknowledged as follows:

Should an error occur with the login, this could be due to a number of reasons – for example:

The remote instance is currently stopped.
The Multisite-URL of the remote site has not been correctly set up.
The remote is not reachable from the central instance under the host name specified in the URL.
The Checkmk versions of the central and the remote instance are (too) incompatible.
An invalid user ID and/or password have been entered.

Points 1. and 2. can be easily tested by manually calling the remote’s URL in your browser.

When everything has been successful run Activate Changes. This will, as always, bring you to an overview of the not yet activated changes. Simultaneously it will also show the states of the livestatus connections, likewise the WATO-synchronisation states of the individual instances:

The Version column shows the Livestatus version of the respective site. When using the CMC as Checkmk’s core (Enterprise Editions), the core’s version number (shown in the ‘Core’ column) is identical to that of Livestatus. If you are using Nagios as the core (Checkmk Raw Edition), the Nagios version number will be seen here.

The following symbols show WATO’s replication status:

icon need restart

This instance has outstanding changes. The configuration matches the central instance, but not all changes have been activated. With the Restart button a targeted activation for this instance can be performed.

icon need replicate

The WATO-configuration for this instance is not synchronous and must be carried over. A restart will then of course be necessary to activate it. Both functions can be performed with the Sync & Restart button.

In the Status column the state of the livestatus connection for the respective instance can be seen. This is shown purely for information since the configuration is not transmitted via Livestatus, but rather over HTTP. The following values are possible:

button sitestatus online

The instance is reachable via Livestatus.

button sitestatus dead

The instance is currently not reachable. Livestatus queries run into a timeout, which delays the page loading. Status data for this instance is not visible in the GUI.

button sitestatus down

The instance is currently not reachable, but this is known from a configured status host or through the Livestatus proxy (see below). The inaccessibility does not lead to timeouts. Status data for this instance is not visible in the GUI.

button sitestatus disabled

The livestatus connection to this instance has been temporarily deactivated by the (central instance’s) administrator. The setting matches the ‘Temporarily disable this connection’ check box in the settings for this connection.

Clicking on the Activate changes button will now synchronise all instances and activate the changes. This is performed in parallel, so that the overall time equates to the time required by the slowest instance. Included in the time is the creation of a configuration snapshot for the respective instance, the transmission over HTTP, the unpacking of the snapshot on the remote instance, and the activation of the changes.

Important: Do not leave the page before the synchronisation has been completed on all instances – leaving the page will interrupt the synchronisation.

Specifying to the hosts and folders which instance should monitor them

Once your distributed environment has been installed you can begin to use it. You actually only need to tell each host by which instance it should be monitored. The central instance is specified by default.

The required attribute for this is ‘Monitored on site’. You can set this individually for each host. This can naturally also be performed at the folder level:

Executing a fresh service discovery and activating changes for migrated hosts

Adding hosts works as usual. Apart from the fact that the monitoring as well as the service discovery will be run from the respective remote instance, there are no special considerations.

When migrating hosts from one instance to another there are a couple of points to be aware of. Neither current nor historic status data from the host will be carried over. Only the host’s configuration is retained in the WATO. In effect it is as if the host has been removed from one instance and freshly-installed on the other instance:

Automatically discovered services will not be migrated. Run a Service discovery after the migration.
Once restarted, hosts and services will show PEND. As a result, notifications for currently existing problems may be sent again.
Historic graphing will be lost. This can be avoided by manually moving the relevant RRD-files. The location of the files can be found in Files and directories.
Data for availability and from historic events will be lost. These are unfortunately not easy to migrate as the data consists of single lines in the monitoring log.

If the continuity of the history is important to you, when implementing the monitoring you should carefully plan which host is to be monitored, and from where.

2.3. Connecting Livestatus with encryption

From version 1.6.0 Livestatus connections between the central instance and a remote can be encrypted. For newly-created instances nothing further needs to be done, as Checkmk takes care of the necessary steps automatically. As soon as you then use omd config to activate Livestatus, TLS encryption is also automatically activated:

The configuration of distributed monitoring therefore remains as simple as it has been up to now. For new connections to other instances the option Encryption is then automatically enabled.

After you add the remote instance, you will notice two things. Firstly, the connection is marked as encrypted by a new icon. Secondly, Checkmk will tell you that the remote instance’s CA is not yet trusted. Click on the icon to get to the details of the certificates used. A click on the trust icon lets you conveniently add the CA via the web interface. Then both certificates will be listed as trusted:

Details of the technologies used

To achieve the encryption Checkmk uses the stunnel program along with its own certificate and its own Certificate Authority (CA) to sign the certificate. Both are generated individually and automatically for each new instance, so they are not predefined static CAs or certificates. This is a very important security factor: a predefined CA would be publicly available, and attackers with access to it could create fake certificates.

The generated certificates also have the following properties:

Both certificates are in the PEM format. The signed certificates for the instance also contain the complete certificate chain.
The keys use 2048-bit RSA, and the certificate is signed using SHA512.
The instance’s certificate is valid for 999 years.

The fact that the standard certificate is valid for so long very effectively prevents you from getting connection problems that you cannot classify. At the same time it is of course possible that once a certificate has been compromised it is accordingly long open to abuse. So if you fear that an attacker will gain access to the CA or to the instance certificate signed with it, always replace both certificates (CA and instance)!

Using your own certificates

In larger environments you might in any case want to use your own certificates. To replace the supplied ones, simply substitute the instance certificate with your own, and make sure that the CA which has signed the new certificate is also trusted.

Migrating from older versions

For compatibility reasons the LIVESTATUS_TCP_TLS option will not be automatically activated after an update from an older version to 1.6.0, since with this option active the connection can only be used with encryption. After the update, to make use of the new feature in your monitoring instances, stop the instance and activate the option mentioned:

OMD[mysite]:~$ omd config set LIVESTATUS_TCP_TLS on

Since the certificates were generated automatically during the update, the instance then immediately uses the new encryption feature. So that you can still access the instance from the central instance, in the second step activate the Encryption option in the Instance Connection Properties under WATO > Distributed Monitoring:

The last step is as described above – again here you first have to mark the CA of the remote instance as trusted.

2.4. Special features of a distributed setup

A distributed monitoring via Livestatus operates much like a single system, but it does have a couple of special characteristics:

Access to the monitored hosts

All accesses of a monitored host are consistently carried out from the instance to which the host is assigned. This applies not only to the actual monitoring, but also to the service discovery, the Diagnostics page, the Notifications, Alert handlers and everything else. This point is very important as it is not assumed that the central instance actually has access to this host.

Specifying the instance in views

Some of the standard views are grouped according to the instance from which the host is monitored. This applies, for example, to All hosts:

The instance will likewise be shown in the host’s or service’s details:

This information is generally available for use in a column when creating your own views. There is also a filter with which a view of hosts on a specific site can be filtered:

Site status element

There is a Site status snap-in for the sidebar which can be added using the button for adding snap-ins. This displays the status of the individual instances, and it also provides the option of clicking on the status to temporarily hide or show individual sites. Hidden sites will be flagged with the disabled status. In this way you can also disable a dead instance that is generating timeouts, thus avoiding superfluous timeouts:

This is not the same as disabling the livestatus connection using the connection configuration in WATO. Here the ‘disabling’ only affects the currently logged-in user and has a purely visual function. Clicking on an instance’s name will display a view of all of its hosts.

The central instance Control element

In a distributed monitoring the central instance control element has a different appearance. Each instance has its own global switch:

Checkmk Cluster hosts

If you monitor HA clusters with Checkmk, the cluster’s individual nodes must be assigned to the same instance as the cluster itself. This is because determining the clustered services’ status accesses cache files generated through monitoring the nodes. This data is located locally on the respective instance.

Piggyback data (e.g., ESX)

Some check plug-ins use ‘Piggyback’ data, for example, for allocating monitoring data retrieved from an ESX-host to the individual virtual machines. For the same reason as with cluster monitoring, in distributed monitoring the ‘piggy’ (carrying) host as well as its dependent hosts must be monitored from the same instance. In the case of ESX this means that the virtual machines must be assigned to the same site in Checkmk as the ESX-System from which the monitoring data is collected. This can mean that it is better to poll the ESX-host system directly rather than to poll a global vCenter. Details for this can be found in the documentation on ESX-monitoring.

Hardware/Software inventory

The Checkmk Hardware/Software inventory also functions in distributed environments. In doing so the inventory data from the var/check_mk/inventory directory must be regularly transmitted from the remotes to the central instance. For performance reasons the user interface always accesses this directory locally.

In the Checkmk Enterprise Editions the synchronisation is carried out automatically on all sites that are connected using the Livestatus proxy.

If you run inventories using the Checkmk Raw Edition in distributed systems, the directory must be regularly mirrored to the central instance with your own tools (e.g., with rsync).
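
One way to mirror the directory is to call rsync from a small Python script on the central server. The sketch below assumes that the script runs as the central instance’s user, that this user has SSH access to the remote server as the remote instance’s user, and that instances live under /omd/sites/ as in the examples above. The function names and the host name in the usage note are illustrative.

```python
import subprocess


def build_rsync_command(remote_host, remote_site, central_site):
    """Build an rsync call that pulls inventory data from a remote instance."""
    source = (
        f"{remote_site}@{remote_host}:"
        f"/omd/sites/{remote_site}/var/check_mk/inventory/"
    )
    target = f"/omd/sites/{central_site}/var/check_mk/inventory/"
    # -a copies recursively and keeps timestamps and permissions.
    # --delete is deliberately not used: the target directory also holds
    # inventory data of hosts that are monitored by other instances.
    return ["rsync", "-a", source, target]


def mirror_inventory(remote_host, remote_site, central_site):
    """Run the transfer and raise CalledProcessError if rsync fails."""
    subprocess.run(
        build_rsync_command(remote_host, remote_site, central_site),
        check=True,
    )
```

Calling mirror_inventory("remote1-server", "remote1", "central") at regular intervals, for example from a cron job, keeps the central copy reasonably current.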

Changing a password

Even when all instances are being centrally monitored, a login on an individual instance’s interface is quite possible and often also appropriate. For this reason WATO ensures that a user’s password is always the same for all sites.

A password change made by the administrator will take effect automatically as soon as it is shared to all instances with Activate Changes.

A change made by users themselves in their personal settings, reached via the settings button in the sidebar, works somewhat differently. This cannot execute an Activate changes since the user of course has no general authority for this function. In such a case WATO will automatically share the changed password across all instances, in fact directly after it has been saved.

As we all know, networks are never 100% available. If an instance is unreachable at the time of a password change, it will not receive the new password. Until the administrator successfully runs an Activate changes, or respectively, the next successful password change, this instance will retain the old password for the user. A status symbol will inform the user of the status of the password sharing to the individual instances.

2.5. Tethering existing instances

As mentioned above, existing instances can also be retrospectively connected to a distributed monitoring. As long as the preconditions described above have been satisfied (compatible Checkmk versions), this is done exactly as for setting up a new remote instance. Enable Livestatus via TCP, then add the instance in the Distributed monitoring module – and you’re done!

The second stage – the changeover to a centralised configuration – is somewhat trickier. Before integrating the instance into the distributed WATO as described above, you should be aware that in doing so the instance’s entire local configuration will be overwritten!

Should you wish to take over existing hosts, and possibly rules as well, three steps will be required:

Match the host tags’ scheme
Copy the WATO-directories
Edit the characteristics in the parent folder

1. Host tags

It is self-evident that the host tags used in the remote must also be known to the central instance in order that they can be carried over. Check these before the migration and add any missing tags to the central instance manually. Here it is essential that the Tag-IDs match – the tag’s title is irrelevant.
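
The comparison itself is a simple set difference on the tag IDs. The following Python sketch assumes that you have already collected the tag IDs of both instances as plain lists; how you obtain them is not covered here, and the function name and example IDs are illustrative.

```python
def missing_tag_ids(remote_tag_ids, central_tag_ids):
    """Return the tag IDs used on the remote but unknown to the central instance."""
    # Only the IDs are compared; titles play no role.
    return sorted(set(remote_tag_ids) - set(central_tag_ids))
```

For example, missing_tag_ids(["prod", "test", "dmz"], ["prod", "test"]) returns ["dmz"], which is the tag that would have to be added to the central instance before the migration.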

2. WATO-directories

Next, move the hosts and rules into the central WATO on the central instance. This only works for hosts and rules in sub-directories (i.e., not in the ‘Main directory’). Hosts in the main directory should first simply be moved into a sub-directory on the remote instance using WATO.

The actual migration can then be achieved quite simply by copying the appropriate directories. Each host directory in WATO corresponds to a directory within etc/check_mk/conf.d/wato/. These can be copied using a tool of your choice (e.g. scp) from the connected site to the same location in the central instance. If a directory with the same name already exists there, simply rename it. Make sure that the copied files belong to the Linux user and group of the central instance.

Following the copying the hosts should appear in the central’s setup – as well as the rules you have created in these folders. The folders’ characteristics will also be included with the copying. These can be found in the folder in the hidden .wato file.

3. One-time editing and saving

So that the attributes of the central instance’s parent folders are correctly inherited, as a final step following the migration the parent folder’s properties must be opened and saved once. The hosts’ attributes will thereby be freshly computed.

2.6. Instance-specific global settings

A centralised configuration over WATO means that first and foremost, all instances have a common and (apart from the hosts) the same configuration. What is the situation however, when individual instances require different global settings? An example could be the CMC setting Maximum concurrent Checkmk checks. It could be that a customised setting is required for a particularly small or a particularly large instance.

For such cases there are instance-specific global settings. These are reached via the configuration symbol in the Distributed monitoring WATO module:

Via this symbol you will find a selection of all global settings – although anything you define here will only be effective for the chosen instance. A value that diverges from the standard will be visually-highlighted, and it will apply only to this instance:

Note: Site-specific settings for the central instance are only indirectly possible, since it is of course the central instance that predefines the configuration. In a situation where ONLY the central instance’s settings diverge, it will be necessary to make site-specific settings for every other site to ‘RETURN’ them to the ‘default’.

2.7. Distributed event console

The Event Console processes syslog-messages, SNMP traps and other types of events of an asynchronous nature.

Up to version 1.2.8, the recommended procedure in a distributed environment is to operate only a single Event Console, and to run it within the central instance. All hosts’ events are then directed there.

This setup has the disadvantage that the hosts’ events must be sent to another instance, rather than to the instance which is actively monitoring them. A consequence of this is that when generating notifications from the Event Console, the host’s information is incomplete since the local Checkmk doesn’t know about them. On the one hand, this applies to the detection of hosts’ contact groups, and on the other hand also to events in which the originating host is identified only by its IP address and a real host name is absent. In such a case notification rules containing conditions linked to the host names cannot function.

From version 1.4.0i1 Checkmk also provides the option of running a distributed Event Console. Every instance then runs its own event processing, which captures the events from all of the hosts being monitored from that instance. The events are thus not sent to the central system; they remain at the instances and are only retrieved centrally. This is done in a similar way to that for the active states via Livestatus, and functions with both the Checkmk Raw Edition and the Checkmk Enterprise Editions.

Converting to a distributed Event Console according to the new scheme requires the following steps:

In the connection settings, for WATO-Replication activate the EC (Replicate Event Console configuration to this site) option
Switch the Syslog location and SNMP-Trap-destinations for the affected hosts to the remote instance. This is the most laborious task.
If you use the Check event state in Event Console rule set, switch this back to Connect to the local Event Console.
If you use the Logwatch Event Console Forwarding rule set, switch this likewise to the local Event Console.
In the Event Console Settings, switch the Access to event status via TCP back to no access via TCP.

2.8. PNP4Nagios

In the Checkmk Raw Edition the open-source project PNP4Nagios is used for displaying performance values graphically. This has its own web interface which is integrated in Checkmk. Using this, single graphs are embedded in some places, and in other places a complete page including its own navigation is provided:

In distributed monitoring the performance databases (Round-Robin Databases, or RRDs) are always located locally on the remote sites. This is very important because a continuous transmission of all performance data to the central instance, and its resulting network traffic, is thus avoided. Furthermore all of the other advantages of a distributed monitoring through Livestatus are retained, as described at the outset.

PNP4Nagios unfortunately has no compatible interface for accessing the graphs in livestatus. Therefore Checkmk simply retrieves the individual graphs, or respectively, the complete websites from PNP4Nagios via HTTP over its standard-URLs. Two methods are used for this:

The PNP4Nagios-data is retrieved directly from the user’s browser
The PNP4Nagios-data is retrieved from the central instance and then forwarded to the user

1. Retrieval via the user’s browser

The first method is very simple to implement. For the relevant sites, configure the URL-prefix in the connection’s attributes, and set it to the URL used for accessing this instance – though without the /check_mk/:

Checkmk will embed the graphs in the GUI so that the browser can retrieve the graphs’ PNG images, or the website’s iframes, from PNP4Nagios over this URL. Specify the URL as it works from the user’s browser. Access to the remote from the central instance is not necessary.

The URL method as just described is quick and easy to set up, but it has a few small disadvantages:

Since the browser retrieves the PNP4Nagios-data from a different host to the Checkmk-GUI, a Checkmk session cookie will not be sent. The user must thus log in again for every remote instance. On the first access to a graph a login screen will appear.
The remote server may not in fact be reachable from the user’s browser – rather only from the central instance. In such a case this method can’t function.
The URL-prefix must be fixed to either http:// or https://, so the user can no longer choose between the two when accessing the GUI.

2. Retrieval via the central instance

The best solution to these problems is to retrieve the PNP4Nagios-data via the central instance, rather than directly from the user’s browser. To this end, create a proxy rule on the central instance’s Apache server. This routes PNP4Nagios queries via HTTP or HTTPS to the correct remote server. Important: this must be done in the operating system’s Apache, not in the one running inside the instance. For this reason root permissions are required.

The prerequisite for this setup is that all Checkmk instance IDs in your network are unique, since Apache must use the instance ID to decide which server it should forward to.

Assuming the following example:

ID	IP address	Livestatus	Checkmk URL
central	10.15.18.223	local	10.15.18.223/central/check_mk/
remote1	10.1.1.133	Port 6557	10.1.1.133/remote1/check_mk/

In the connection settings, now simply set /remote1/ as the URL-prefix:

With this, queries to PNP4Nagios initially go to the central instance on the /remote1 URL. Should the remote1 instance happen to be running on the same server as the central instance, you are now finished and no proxy rule is required, since the data can be delivered directly.

In the general case that the remote instance runs on another host, you will need root permissions and must create a configuration file for the system-wide Apache server. The path for this file depends on your Linux distribution:

Distribution	Path
RedHat, CentOS	/etc/httpd/conf.d/check_mk_proxy.conf
SLES, Debian, Ubuntu	/etc/apache2/conf.d/check_mk_proxy.conf

The file consists of five lines for each tethered remote instance. In the following example, substitute the instance name (here remote1) and the instance’s URL (here 10.1.1.133/remote1/). Please note that for Apache it is relevant whether a URL ends with a (/) ‘slash’ or not:

/etc/apache2/conf.d/check_mk_proxy.conf

<Location /remote1>
    Options +FollowSymLinks
    RewriteEngine On
    RewriteRule ^/.+/remote1/(.*) http://10.1.1.133/remote1/$1 [P]
</Location>

This rule tells Apache that all URLs beginning with /remote1 are to be retrieved via reverse-proxy from the URL 10.1.1.133/remote1.
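
When several remote instances are connected, the five-line block has to be repeated once per instance. The following Python sketch generates such a file body from a mapping of instance IDs to URLs. The function names proxy_stanza and proxy_config are made up for this illustration and are not part of Checkmk; the layout of the block is copied from the example above.

```python
def proxy_stanza(site_id: str, remote_base: str) -> str:
    """Return the five-line <Location> block for one remote instance.

    remote_base is the remote instance's URL, e.g. "http://10.1.1.133/remote1".
    A trailing slash is stripped so that the rule always ends in ".../$1".
    """
    base = remote_base.rstrip("/")
    return "\n".join([
        f"<Location /{site_id}>",
        "    Options +FollowSymLinks",
        "    RewriteEngine On",
        f"    RewriteRule ^/.+/{site_id}/(.*) {base}/$1 [P]",
        "</Location>",
    ])


def proxy_config(sites: dict) -> str:
    """Return one block per instance, separated by blank lines."""
    blocks = [proxy_stanza(site_id, url) for site_id, url in sorted(sites.items())]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Example values taken from the table above.
    print(proxy_config({"remote1": "http://10.1.1.133/remote1"}))
```

The output can be written to the configuration file named above; Apache still has to be reloaded afterwards.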

Important: don’t forget to activate the configuration. For SLES, Debian and Ubuntu, perform this with:

root@linux# /etc/init.d/apache2 reload

RedHat and CentOS require:

root@linux# /etc/init.d/httpd reload

If everything has been done correctly, you should now be able to access the PNP4Nagios graphs.

2.9. Logwatch

Checkmk includes the mk_logwatch plug-in, with which you can monitor text log files under Linux and Windows and, in particular, the Windows event log. This plug-in provides a special webpage in the GUI in which the relevant detected messages can be viewed and acknowledged:

Up until Checkmk version 1.2.8 this page required local access to the saved log messages. The plug-in stores these on the instance from which the respective server is monitored. In distributed monitoring, however, the central instance has no direct access to these files. The solution is the same as with PNP4Nagios: the remote server’s Logwatch webpage is embedded and retrieved from the remote instance separately via HTTP.

The configuration required for this is identical to that used when setting up Checkmk for PNP4Nagios. If this has already been set up the Logwatch interface will automatically function correctly.

From Checkmk version 1.4.0i1 the Logwatch webpage exclusively uses Livestatus for the transfer and no longer requires HTTP. Setting up the URL prefix or the proxy rule is then only needed for PNP4Nagios, by users of the Checkmk Raw Edition.

2.10. NagVis

The NagVis open source program visualises status data from monitoring on self-produced maps, diagrams and other charts. NagVis is integrated in Checkmk and can be used immediately. The access is easiest over the NagVis Maps sidebar element. The integration of NagVis in Checkmk is described in its own article.

NagVis supports distributed monitoring via Livestatus in pretty much the same way as Checkmk does. The links to the individual sites are referred to as backends. The backends are automatically set up correctly by Checkmk so that you can immediately begin creating NagVis-charts, in distributed monitoring as well.

Select the correct backend for each object that you place on a chart – i.e., the Checkmk instance from which the object is to be monitored. NagVis cannot find the host or service automatically, above all for performance reasons. Therefore if you move hosts to a different remote instance you will need to update the NagVis-charts accordingly.

Details on backends can be found in the NagVis documentation.

3. Unstable or slow connections

The general status overview in the user interface provides constantly available, reliable access to all of the connected instances. The one snag with this is that a view can only be displayed once all instances have responded. The process is always that a Livestatus query is first sent to all instances (for example, “List all services whose state is not OK.”). The view can then only be displayed once the last instance has responded.

It is annoying when an instance doesn’t answer at all. To tolerate brief outages (e.g., due to restarting a site or a lost TCP packet), the GUI waits for a given time before an instance is declared to be dead, and then continues processing the responses from the remaining sites. This results in a ‘hanging’ GUI. The timeout is set to 10 seconds by default.

If this occasionally happens in your network you should set up either Status hosts or (even better) the Livestatus proxy.
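
The waiting behaviour described above can be illustrated with a small Python sketch. It is not Checkmk’s GUI code: the function query_all_sites and the two stand-in query functions are hypothetical, and real queries would go to Livestatus rather than to local functions. The sketch only shows why a single unresponsive instance delays the whole view by the full timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def query_all_sites(site_queries, timeout=10.0):
    """Run one query per site in parallel and wait at most `timeout` seconds.

    site_queries maps a site ID to a callable that returns that site's rows.
    Returns (rows_by_site, dead_sites).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(site_queries)))
    futures = {pool.submit(query): site for site, query in site_queries.items()}
    done, pending = wait(futures, timeout=timeout)

    rows_by_site = {}
    dead_sites = [futures[f] for f in pending]  # no answer within the timeout
    for future in done:
        try:
            rows_by_site[futures[future]] = future.result()
        except Exception:
            dead_sites.append(futures[future])  # the query failed outright
    pool.shutdown(wait=False)  # do not block on sites that are still hanging
    return rows_by_site, sorted(dead_sites)


def fast_site():
    return [("myhost", "CPU load", "CRIT")]


def hanging_site():
    time.sleep(0.5)  # stands in for a site that answers too late
    return []


if __name__ == "__main__":
    rows, dead = query_all_sites(
        {"central": fast_site, "remote1": hanging_site}, timeout=0.1
    )
    print(rows, dead)
```

With a status host or the Livestatus proxy, an instance that is known to be down is excluded before the query is sent, so the timeout is not incurred at all.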

3.1. Status hosts

Configuring status hosts is the recommended procedure with the Checkmk Raw Edition for recognising defective connections reliably. The idea is simple: the central instance actively monitors the connection to each individual remote instance. After all, we do have a monitoring system at hand! The GUI will then be aware of unreachable instances and can immediately exclude them and flag them as down. Timeouts are thus minimised.

Here is how to set up a status host for a connection:

Add the host on which the remote instance is running to the central instance in monitoring.
Enter this as the status host in the connection to the remote:

A failed connection to a remote instance can now only lead to a brief hangup of the GUI – namely until the monitoring has recognised it. By reducing the status host’s check interval from the default of sixty seconds to, e.g., five seconds, you can minimise the duration of a hangup.

If you have set up a status host, there are further possible states for connections:

unreach

The computer on which the remote instance is running is currently unreachable for the monitoring because a router is down (the status host has an UNREACH state).

waiting

The status host that monitors the connection to the remote system has not yet been checked by the monitoring (it still has a PEND state).

unknown

The status host’s state has an invalid value (this should never occur).

In all three cases the connection to the instance will be excluded and timeouts thus avoided.

3.2. Persistent connections

With the Use persistent connections check box you can prompt the GUI to maintain established Livestatus connections to remote instances permanently in an ‘up’ state, and to continue using them for queries. Especially for connections with longer packet turnarounds (e.g. intercontinental), this can make the GUI noticeably more responsive.

Because the GUI is spread over multiple independent Apache processes, a separate connection is required for each Apache client process running simultaneously. If you have many simultaneous users, please ensure that a sufficient number of Livestatus connections is configured in the remote’s Nagios core. These are configured in the etc/mk-livestatus/nagios.cfg file. The default is twenty (num_client_threads=20).

By default, Apache in Checkmk is configured so that it permits up to 128 simultaneous user connections. This is configured in the following section of the etc/apache/apache.conf file:

etc/apache/apache.conf

<IfModule prefork.c>
StartServers         1
MinSpareServers      1
MaxSpareServers      5
ServerLimit          128
MaxClients           128
MaxRequestsPerChild  4000
</IfModule>

This means that under high load up to 128 Apache processes can start which then also generate and sustain up to 128 Livestatus connections. Not setting the num_client_threads high enough can result in errors or a very slow response time in the GUI.
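
The relationship between the two limits can be checked mechanically. The following Python sketch compares MaxClients from the central instance’s Apache configuration with num_client_threads from the remote instance’s Livestatus configuration. The function names are hypothetical, and the sketch assumes that num_client_threads appears in the file as num_client_threads=N, as quoted above; if it is absent, the default of 20 is used.

```python
import re
from typing import Optional


def apache_max_clients(apache_conf: str) -> Optional[int]:
    """Return the MaxClients value from Apache configuration text, if present."""
    match = re.search(r"^\s*MaxClients\s+(\d+)", apache_conf, re.MULTILINE)
    return int(match.group(1)) if match else None


def livestatus_client_threads(nagios_cfg: str, default: int = 20) -> int:
    """Return num_client_threads from the Livestatus settings, else the default."""
    match = re.search(r"\bnum_client_threads=(\d+)", nagios_cfg)
    return int(match.group(1)) if match else default


def enough_threads(apache_conf: str, nagios_cfg: str) -> bool:
    """True if the remote core accepts as many connections as Apache can open."""
    max_clients = apache_max_clients(apache_conf)
    if max_clients is None:
        return True  # nothing to compare against
    return livestatus_client_threads(nagios_cfg) >= max_clients


APACHE_CONF = """\
<IfModule prefork.c>
ServerLimit          128
MaxClients           128
</IfModule>
"""

if __name__ == "__main__":
    # With the defaults quoted above, 20 threads face up to 128 connections.
    print(enough_threads(APACHE_CONF, "num_client_threads=20"))
    print(enough_threads(APACHE_CONF, "num_client_threads=128"))
```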

For connections within a LAN or over fast WAN links we advise against using persistent connections.

3.3. The livestatus proxy

With the Livestatus proxy the Checkmk Enterprise Editions (CEE) feature a sophisticated mechanism for detecting dead connections. In addition, it optimises the performance of connections, especially those with long round-trip times. The Livestatus proxy’s advantages are:

Very fast, proactive detection of unresponsive instances
Local caching of queries that deliver static data
Standing TCP-connections – which require fewer round trips and consequently allow much faster responses from distant instances (e.g. USA ⇄ China)
Precise control of the maximum number of livestatus connections required
Enables Hardware/Software inventory in distributed environments

Installation

Installing the livestatus proxy is very simple. It is activated by default in the CEE – which can be seen when starting a site:

OMD[central]:~$ omd start
Starting mkeventd...OK
Starting Livestatus Proxy-Daemon...OK
Starting rrdcached...OK
Starting Check_MK Micro Core...OK
Starting dedicated Apache for site central...OK
Starting xinetd...OK
Initializing Crontab...OK

Select the setting ‘Use Livestatus Proxy-Daemon’ for the connection to the remote instances instead of ‘Connect via TCP’:

The details for host and port are the same as before. No changes need to be made on the remote instance. In Number of channels to keep open enter the number of parallel TCP-connections the proxy should establish and sustain to the target site.

The TCP-connections pool is shared by all GUI enquiries. The number of connections limits the maximum number of queries that can be processed concurrently. This indirectly limits the number of users. In situations in which all channels are reserved this will not immediately lead to an error. The GUI waits a given time for a free channel. Most queries actually require only a few milliseconds.

If the GUI has to wait for a free channel for longer than the time set in Timeout waiting for a free channel, the request is aborted and the user receives an error message. In such a case the number of connections should be increased. Be aware however that sufficient parallel incoming connections must be allowed on the remote instance – this is set to 20 by default. This setting can be found in the global options under Monitoring core > Maximum concurrent Livestatus connections.

The Regular heartbeat provides a constantly active monitoring of the connections directly at the protocol level. In the process the proxy regularly sends a simple Livestatus query which must be answered by the remote instance within the predetermined time (default: 2 seconds). With this method a situation where the target server and the TCP-port are actually reachable, but the monitoring core no longer responds, will also be detected.

If a response fails to appear, all connections will be declared ‘dead’ and, following a ‘cooldown’ time (default: 4 seconds), will be newly established. All this takes place proactively – i.e. without a user needing to open a GUI-window. In this way outages are detected quickly, and on recovery the connections are immediately reestablished, in the best case before a user even notices the outage.
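
The idea of a heartbeat at protocol level can be sketched in a few lines of Python. This is not the proxy’s actual implementation, which keeps its channels open permanently; the function name heartbeat is hypothetical, and the sketch assumes that Livestatus is reachable via TCP on the remote instance (port 6557 in the examples of this article). It sends a trivial Livestatus query and only reports success if an answer arrives within the timeout.

```python
import socket


def heartbeat(host: str, port: int = 6557, timeout: float = 2.0) -> bool:
    """Return True if the monitoring core answers a simple query in time.

    A reachable TCP port alone is not enough: if the core accepts the
    connection but never answers, recv() times out and False is returned.
    """
    query = b"GET status\nColumns: program_start\n\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(query)
            conn.shutdown(socket.SHUT_WR)  # signal that no further query follows
            conn.settimeout(timeout)
            return bool(conn.recv(1024))
    except OSError:  # refused, unreachable or timed out
        return False


if __name__ == "__main__":
    # Address taken from the example table earlier in this article.
    print(heartbeat("10.1.1.133", 6557))
```

A caller would run this check at a fixed interval and, after a failure, wait for the cooldown time before trying to reconnect.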

The Caching ensures that static queries need only be responded-to once by the remote instance, and from that point of time can be responded to directly and locally, without delay. An example of this is the list of monitored hosts required by Quicksearch.

Error diagnosis

The Livestatus proxy has its own log file which can be found under var/log/liveproxyd.log. On a correctly-configured remote instance with five channels (standard) it will look something like this:

var/log/liveproxyd.log

2016-09-19 14:08:53.310197 ----------------------------------------------------------
2016-09-19 14:08:53.310206 Livestatus Proxy-Daemon starting...
2016-09-19 14:08:53.310412 Configured 1 sites
2016-09-19 14:08:53.310469 Removing left-over unix socket /omd/sites/central/tmp/run/liveproxy/remote1
2016-09-19 14:08:53.310684 Channel remote1/5 successfully connected
2016-09-19 14:08:53.310874 Channel remote1/6 successfully connected
2016-09-19 14:08:53.310944 Channel remote1/7 successfully connected
2016-09-19 14:08:53.311009 Channel remote1/8 successfully connected
2016-09-19 14:08:53.311071 Channel remote1/9 successfully connected

The Livestatus proxy regularly records its state in the var/log/liveproxyd.state file:

var/log/liveproxyd.state

Current state:
[remote1]
  State:                   ready
  Last Reset:              2016-09-19 14:08:53 (125 secs ago)
  Site's last reload:      2016-09-19 14:08:45 (134 secs ago)
  Last failed connect:     1970-01-01 01:00:00 (1474287059 secs ago)
  Cached responses:        1
  Last inventory update:   1970-01-01 01:00:00 (1474287059 secs ago)
  PID of inventory update: None
  Channels:
      5 - ready             -  client: none - since: 2016-09-19 14:10:38 ( 20 secs ago)
      6 - ready             -  client: none - since: 2016-09-19 14:10:43 ( 15 secs ago)
      7 - ready             -  client: none - since: 2016-09-19 14:10:48 ( 10 secs ago)
      8 - ready             -  client: none - since: 2016-09-19 14:10:53 (  5 secs ago)
      9 - ready             -  client: none - since: 2016-09-19 14:10:33 ( 25 secs ago)
  Clients:
  Heartbeat:
    heartbeats received: 24
    next in 0.2s

And when an instance is currently stopped the state will look like this:

var/log/liveproxyd.state

----------------------------------------------
Current state:
[remote1]
  State:                   starting
  Last Reset:              2016-09-19 14:12:54 ( 10 secs ago)
  Site's last reload:      2016-09-19 14:12:54 ( 10 secs ago)
  Last failed connect:     2016-09-19 14:13:02 (  2 secs ago)
  Cached responses:        0
  Last inventory update:   1970-01-01 01:00:00 (1474287184 secs ago)
  PID of inventory update: None
  Channels:
  Clients:
  Heartbeat:
    heartbeats received: 0
    next in -5.2s

Here the state is ‘starting’. The proxy is thus attempting to establish connections. There are no channels yet. During this state queries to the site will be answered with an error.
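
If you want to evaluate the state file automatically, for example in a local check, it can be parsed line by line. The following Python sketch is based solely on the two examples shown above; the file format is not a documented interface and may differ between versions, and the function name parse_proxy_state is hypothetical.

```python
import re

SITE_RE = re.compile(r"^\[(.+)\]$")
STATE_RE = re.compile(r"^\s+State:\s+(\S+)")
CHANNEL_RE = re.compile(r"^\s+(\d+) - (\S+)")


def parse_proxy_state(text: str) -> dict:
    """Return {site_id: {"state": str or None, "channels": {number: state}}}."""
    sites = {}
    current = None
    for line in text.splitlines():
        match = SITE_RE.match(line)
        if match:  # a line such as "[remote1]" starts a new site section
            current = {"state": None, "channels": {}}
            sites[match.group(1)] = current
            continue
        if current is None:
            continue  # header lines before the first site
        match = STATE_RE.match(line)
        if match:
            current["state"] = match.group(1)
            continue
        match = CHANNEL_RE.match(line)
        if match:
            current["channels"][int(match.group(1))] = match.group(2)
    return sites


SAMPLE = """\
Current state:
[remote1]
  State:                   ready
  Last Reset:              2016-09-19 14:08:53 (125 secs ago)
  Channels:
      5 - ready             -  client: none - since: 2016-09-19 14:10:38 ( 20 secs ago)
      6 - ready             -  client: none - since: 2016-09-19 14:10:43 ( 15 secs ago)
  Clients:
  Heartbeat:
    heartbeats received: 24
"""

if __name__ == "__main__":
    for site_id, info in parse_proxy_state(SAMPLE).items():
        print(site_id, info["state"], len(info["channels"]))
```

A site whose state is not ‘ready’, or which has no channels, would then be a candidate for a warning.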

4. Livedump and CMCDump

4.1. Motivation

The concept for a distributed monitoring with Checkmk that has been described up until now is a good and simple solution in most cases. It does however require network access from the central to the remote instances. There are situations in which access is either not possible or not desired, because, for example:

the remote instances are in your customer’s network for which you have no access
the remote instances are in a security area to which access is strictly forbidden
the remote instances have no permanent network connection and no fixed IP-addresses

Distributed monitoring with Livedump or CMCDump takes a quite different approach. Firstly, the remote instances are set up so that they operate completely independently of the central instance and are administered decentrally. Distributed WATO is dispensed with.

All of the remote instance’s hosts and services will then be replicated as copies in the central instance. Livedump/CMCDump can help by generating a copy of the remote instances’ configuration which can then be loaded into the central instance.

Now during the monitoring, on every remote instance a copy of the current status will be written to a file at predetermined intervals (e.g. every minute). This will be transmitted to the central instance via a user-defined method and will be saved there as a status update. No particular protocol has been provided or specified for this data transfer. Any automatable transfer protocol could be used. It is not essential to use scp – even a transfer by email is conceivable!

Such a setup differs from a ‘normal’ distributed monitoring in the following ways:

Updates of the states and performance data in the central instance will be delayed.
Calculation of availability on the central instance will give minimally different results to a calculation on the remote instance.
State changes that occur more quickly than the update interval will be invisible to the central instance.
If a remote instance is ‘dead’, the states will become obsolete on the central instance – the services will be ‘stale’, but nonetheless still visible. Performance and availability data for this time period will be ‘lost’ (but they will still be available on the remote instance).
Commands on the central instance such as Downtimes and Acknowledgements cannot be transmitted to the remote instance.
The central instance can never access the remote instances.
Access to logfile details by Logwatch is impossible.
The Event Console will not be supported by Livedump/CMCDump.

Since brief state changes – depending on the periodic interval selected on the central instance – may not be visible, a notification through the central instance is not ideal. If however the central instance is utilised as a purely display instance – as a central overview of all customers for example – this method definitely has its advantages.

Incidentally, Livedump/CMCDump can be used simultaneously alongside distributed monitoring over Livestatus without problems. Some instances are simply connected via Livestatus directly – others use Livedump. Livedump can also be added to one of the Livestatus remote instances.

4.2. Installing Livedump

If you are using the Checkmk Raw Edition (or the CEE with a Nagios core), use the livedump tool. The name is derived from Livestatus and status dump. From Checkmk version 1.2.8p12 livedump is located directly in the search path and is thus available as a command. In older versions you can find it under ~/share/doc/check_mk/treasures/livedump/livedump.

We will make the following assumptions:

The remote instance has been fully set up and is actively monitoring hosts and services.
The central instance has been started and is running.
At least one host is being locally monitored on the central instance (because the central monitors itself).

Transferring the configuration

First, on the remote instance, create a copy of the configuration of its hosts and services in Nagios configuration format. To do this, redirect the output from livedump -TC to a file:

OMD[remote1]:~$ livedump -TC > config.cfg

The start of the file will look something like this:

config.cfg

define host {
    name                    livedump-host
    use                     check_mk_default
    register                0
    active_checks_enabled   0
    passive_checks_enabled  1

}

define service {
    name                    livedump-service
    register                0
    active_checks_enabled   0
    passive_checks_enabled  1
    check_period            0x0

}

Transmit the file to the central instance (e.g. with scp) and save it there in the ~/etc/nagios/conf.d/ directory – here Nagios expects to find the configuration data for hosts and services. Select a file name that ends with .cfg, for example ~/etc/nagios/conf.d/config-remote1.cfg. If SSH access from the remote to the central instance is possible, this can be done as follows, for example:

OMD[remote1]:~$ scp config.cfg central@mycentral.mydomain:etc/nagios/conf.d/config-remote1.cfg
central@mycentral.mydomain's password:
config.cfg                                             100% 8071     7.9KB/s   00:00

Now log in to the central instance and activate the changes:

OMD[central]:~$ cmk -R
Generating configuration for core (type nagios)...OK
Validating Nagios configuration...OK
Precompiling host checks...OK
Restarting monitoring core...OK

Now all of the remote’s hosts and services should appear in the central instance – initially with the PEND state, which they will retain for the time being:

Note:

With the -T option, livedump additionally generates the template definitions on which the configuration is based. Without these Nagios cannot be started. They may only be present once, however. If you import a configuration from a further remote instance, it must not be created with the -T option!
A dump of the configuration is also possible on a CMC core, but importing it requires Nagios. If the CMC is running on your central instance, use CMCDump.
The copying and transferring of the configuration must be repeated for every change to hosts or services on the remote instance.

Transferring the status

Once the hosts are visible in the central instance, we need to set up a (regular) transmission of the remote instance’s monitoring status. Again create a file with livedump, but this time without additional options:

OMD[remote1]:~$ livedump > state

This file contains the states of all hosts and services in the format that Nagios reads in directly as check results. The start of this file looks something like this:

state

host_name=myserver666
check_type=1
check_options=0
reschedule_check
latency=0.13
start_time=1475521257.2
finish_time=1475521257.2
return_code=0
output=OK - 10.1.5.44: rta 0.005ms, lost 0%|rta=0.005ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.019ms;;;; rtmin=0.001ms;;;;

Copy this file to the central instance into the ~/tmp/nagios/checkresults directory. Important: This file’s name must begin with c and be seven characters long. With scp it will look something like this:

OMD[remote1]:~$ scp state central@mycentral.mydomain:tmp/nagios/checkresults/caabbcc
central@mycentral.mydomain's password:
state                                                  100%   12KB  12.5KB/s   00:00

Finally, create an empty file on the central instance with the same name and the .ok extension. With this Nagios will know that the status file has been transferred completely and can now be read in:

OMD[central]:~$ touch tmp/nagios/checkresults/caabbcc.ok

The status of the remotes’ hosts/services will now be immediately updated on the central instance:

The transmission of the status must from now on be made regularly. Livedump unfortunately doesn’t support this task and you will need to script it yourself. In ~/share/doc/check_mk/treasures/livedump you can find the livedump-ssh-recv script, which you can employ in order to receive Livedump updates (including those of the configuration) on the central instance via SSH. Details about this can be found in the script itself.
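
If you write your own receiving script instead, the two rules from above must be respected: the file name begins with c and is seven characters long, and the .ok file is created only after the status file has been written completely. The following Python sketch shows one way of doing this on the central instance. It is not the livedump-ssh-recv script; the function names are hypothetical and the random file name is merely one possible choice.

```python
import os
import secrets
import string


def checkresult_name() -> str:
    """Return a seven-character file name that begins with 'c'."""
    alphabet = string.ascii_letters + string.digits
    return "c" + "".join(secrets.choice(alphabet) for _ in range(6))


def deliver_status(state: bytes, checkresults_dir: str) -> str:
    """Write one status dump plus its .ok marker and return the dump's path."""
    path = os.path.join(checkresults_dir, checkresult_name())
    with open(path, "wb") as dump:
        dump.write(state)
    # Create the marker only after the dump is complete, so that Nagios
    # never reads a half-written file.
    with open(path + ".ok", "w"):
        pass
    return path


if __name__ == "__main__":
    import sys

    # Reads a livedump status from standard input, e.g. when called via SSH.
    target = os.path.expanduser("~/tmp/nagios/checkresults")
    print(deliver_status(sys.stdin.buffer.read(), target))
```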

The configuration and status dump can also be restricted by using Livestatus filters. For example, you could limit the hosts to the members of the mygroup host group:

OMD[remote1]:~$ livedump -H "Filter: host_groups >= mygroup" > state

Further information on Livedump – in particular how to transfer the data via encrypted email – can be found in the README file in the ~/share/doc/check_mk/treasures/livedump directory.

4.3. Implementing CMCDump

CMCDump is for the Checkmk Micro Core what Livedump is for Nagios – and it is thus the tool of choice for the Checkmk Enterprise Editions. In contrast to Livedump, CMCDump can replicate the complete status of hosts and services (Nagios doesn’t have the required interfaces for this task).

To compare: Livedump transfers the following data:

The current states – i.e. PEND, OK, WARN, CRIT, UNKNOWN, UP, DOWN or UNREACH
The output from Check plug-ins
The performance data

CMCDump additionally synchronises:

The long output from the plug-in
Whether the object is currently flapping
The time stamps for the last check execution and the last state change
The duration of the check execution
The latency of the check execution
The sequence number of the current check attempt and whether the current state is ‘hard’ or ‘soft’
Whether a problem has been acknowledged
Whether the object is currently in a planned maintenance period

This provides a much more precise reflection of the monitoring. When importing the status the CMC doesn’t just simulate a check execution, rather by using an interface designed for this task it transmits an accurate status. Among other things, this means that at any time the operations centre can see whether problems have been acknowledged or if maintenance times have been entered.

The installation is almost identical to that for Livedump, but somewhat simpler, since there is no need to be concerned about possible duplicated templates or similar.

The copy of the configuration is made with cmcdump -C. Store this file on the central instance in etc/check_mk/conf.d/. The .mk file extension must be used:

OMD[remote1]:~$ cmcdump -C > config.mk
OMD[remote1]:~$ scp config.mk central@mycentral.mydomain:etc/check_mk/conf.d/remote1.mk

Activate the configuration on the central instance:

OMD[central]:~$ cmk -O

As with Livedump the hosts and services will now appear on the central instance in the PEND state. You will however see by the shadow symbol that we are dealing with a shadow object. In this way it can be distinguished from an object being monitored directly on the central instance or on a ‘normal’ remote instance:

The regular generation of the status is achieved with cmcdump without additional arguments:

OMD[remote1]:~$ cmcdump > state
OMD[remote1]:~$ scp state central@mycentral.mydomain:tmp/state_remote1

To import the status to the central instance the file content must be written into the tmp/run/live UNIX-Socket with the help of the unixcat tool.

OMD[central]:~$ unixcat tmp/run/live < tmp/state_remote1

If you have a connection from the remote to the central instance via SSH without a password all three commands can be combined into a single one – and when so doing not even a temporary file is created:

OMD[remote1]:~$ cmcdump | ssh central@mycentral.mydomain "unixcat tmp/run/live"

It really is that simple! But, as already mentioned, ssh/scp is not the only method for transferring files, and a configuration or status can be transferred just as well using email or any other protocol.
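
For a regular transfer, for example from a cron job, it is useful to notice when one side of the pipeline fails. The following Python sketch wraps the combined command from above and checks the exit codes of both processes. The function names pipe and push_status are hypothetical; the sketch assumes that cmcdump and ssh are in the search path of the remote instance and that SSH login works without a password.

```python
import subprocess


def pipe(producer, consumer):
    """Run `producer | consumer` without a temporary file.

    Returns the consumer's standard output and raises RuntimeError if
    either of the two processes exits with a non-zero code.
    """
    first = subprocess.Popen(producer, stdout=subprocess.PIPE)
    second = subprocess.Popen(consumer, stdin=first.stdout, stdout=subprocess.PIPE)
    first.stdout.close()  # the consumer now owns the read end of the pipe
    output, _ = second.communicate()
    if first.wait() != 0 or second.returncode != 0:
        raise RuntimeError(
            f"transfer failed: {producer[0]} exited with {first.returncode}, "
            f"{consumer[0]} exited with {second.returncode}"
        )
    return output


def push_status(target="central@mycentral.mydomain"):
    """Equivalent of: cmcdump | ssh TARGET "unixcat tmp/run/live"."""
    return pipe(["cmcdump"], ["ssh", target, "unixcat tmp/run/live"])


if __name__ == "__main__":
    push_status()
```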

5. Notifications in distributed environments

5.1. Centralised or decentralised?

In a distributed environment the question arises – from which instance should the notifications (e.g. emails) be sent: from the individual remotes or from the central instance? There are arguments in favour of both procedures.

Arguments for sending from the remotes:

Simpler to set up
A local notification is still possible if the link to the central instance is not available
Also works with the Checkmk Raw Edition

Arguments for sending from the central instance:

Notifications can be further processed at a central location (e.g. be forwarded to a ticket system)
Remote instances require no setting up for email or SMS
Hardware for sending SMS is only required once – on the central instance

5.2. Decentralised notification

No special steps are required for a decentralised notification since this is the standard setting. Every notification that is generated on a remote instance runs through the chain of notification rules there. If you use distributed WATO these rules are the same on all instances. Notifications resulting from these rules are delivered as usual, with the appropriate notification scripts being run locally.

It must simply be ensured that the appropriate service has been correctly installed on the instances – that a smart host has been defined for emails for example – in other words the same procedure as for setting up an individual Checkmk instance.

5.3. Centralised notifications

Fundamentals

The Checkmk Enterprise Editions provide a built-in mechanism for centralised notifications which can be individually activated for each remote instance. Such remotes then route all notifications to the central instance for further processing. The centralised notification is thereby independent of whether the distributed monitoring has been set up in the standard way, or with CMCDump, or by using a blend of these procedures. Technically speaking, the central notification server does not even need to be the ‘central’. This task can be taken on by any Checkmk instance.

If a remote instance has been set to ‘forwarding’, all notifications will be forwarded directly to the central instance as they come from the core – effectively in a raw format. Only there will the notification rules be evaluated which actually decide who should be notified and how. The required notification scripts will be invoked on the central instance.

Activating the notification spooler

The first step for implementing centralised notification is to activate the notification spooler (mknotifyd) on all participating instances. This is an auxiliary process that is required on the central as well as on the remote instances. In newer Checkmk versions the notification spooler is automatically activated. Please verify this with omd config and activate it if needed. This setting can be found under Distributed Monitoring > MKNOTIFYD.

An omd status must show the mknotifyd process:

OMD[central]:~$ omd status
mkeventd:       running
liveproxyd:     running
mknotifyd:      running
rrdcached:      running
cmc:            running
apache:         running
crontab:        running
-----------------------
Overall state:  running

Only when the notification spooler is active will the item Notifications > Notification spooling be found under the global settings in WATO.

Setting up the TCP-connections

The remote and (notification-)central notification spoolers communicate with each other via TCP. Notifications are sent from remote to central instance. The central acknowledges to the remotes that the notifications have been received, which prevents notifications being lost even if the TCP connection is broken.
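
Why an acknowledgement protects against loss can be shown with a minimal Python sketch. It does not reproduce how mknotifyd works internally; the class ForwardSpool and its methods are hypothetical. The principle is simply that the sending side keeps every notification until the receiving side has confirmed it, so anything still unconfirmed after a broken connection can be sent again.

```python
from collections import OrderedDict


class ForwardSpool:
    """Keeps notifications until the receiving side has acknowledged them."""

    def __init__(self):
        self._pending = OrderedDict()
        self._next_id = 1

    def enqueue(self, notification):
        """Store a notification for forwarding and return its spool ID."""
        spool_id = self._next_id
        self._next_id += 1
        self._pending[spool_id] = notification
        return spool_id

    def unacknowledged(self):
        """Everything that still has to be sent or re-sent, oldest first."""
        return list(self._pending.items())

    def acknowledge(self, spool_id):
        """Drop a notification once the receiver has confirmed it."""
        self._pending.pop(spool_id, None)


if __name__ == "__main__":
    spool = ForwardSpool()
    first = spool.enqueue("myserver666 is DOWN")
    second = spool.enqueue("myserver666 is UP")
    spool.acknowledge(first)  # only the first notification was confirmed
    # After a reconnect the second notification is still waiting to be sent.
    print(spool.unacknowledged())
```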

There are two alternatives for the construction of a TCP-connection:

A TCP-connection is configured from central to remote instance. Here the remote is the TCP-server.
A TCP-connection is configured from remote to central instance. Here the central is the TCP-server.

Consequently there is nothing standing in the way of forwarding notifications if for network reasons establishing connections is only possible in a specific direction. The TCP-connections are supervised by the spooler with a heartbeat signal and are immediately reestablished as needed – not only in the event of a notification.

Since remote and central instance require different global settings you must make site specific settings for all remotes. Configuring the central instance is performed using the normal global settings. This is due to Checkmk currently not supporting any specific settings for the local instance (= central instance). Please note – these settings will be automatically inherited by all remotes for which no specific settings have been defined.

Let us look first at an example where the central instance establishes the TCP-connections to the remote instances.

Step 1: On the remote instance, edit the instance specific global setting Notifications > Notification Spooler Configuration and activate Accept incoming TCP connections. TCP port 6555 is suggested for incoming connections. Unless there is a reason to change it, adopt this setting.

Step 2: Likewise only on the remote instance, in the Notification Spooling submenu, select the option Forward to remote site by notification spooler.

Step 3: Now, on the central instance, i.e. in the normal global settings, configure the connection to the remote (and then to additional remotes as needed).

Step 4: Set the global setting Notification Spooling to Asynchronous local delivery by notification spooler, so that the central instance’s own notifications are also processed by the same central spooler.

Step 5: Activate the changes.
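
Before looking at the logs, it can help to confirm that the spooler port is reachable at all from the connecting side. This is a minimal sketch using only the Python standard library; the helper name can_connect is an assumption, and the port 6555 is the value suggested in Step 1.

```python
import socket


def can_connect(host: str, port: int = 6555, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors
        return False
```

Calling can_connect("10.1.8.44") from the central instance, for example, only shows that something accepts connections on that port; it does not prove that the spooler configuration itself is correct.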

Establishing connections from a remote instance

If the TCP-connection is to be established from the remote instance, the procedure is identical to the one described above, except that the roles of the central and remote instances are exchanged.

A blend of the two procedures is also possible. In such a case the central instance must be configured so that it listens for incoming connections as well as connecting to remote instances. However, in every central/remote relationship only one of the pair is permitted to establish the connection!
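
The rule that only one side of each pair may establish the connection can be checked mechanically when planning a mixed setup. The sketch below assumes a hand-written list of (initiator, listener) tuples; this data model and the function conflicting_pairs are hypothetical and not a Checkmk format.

```python
def conflicting_pairs(connections):
    """Return the site pairs for which both sides are set up to connect.

    `connections` is an iterable of (initiator, listener) tuples.
    """
    configured = set(connections)
    conflicts = set()
    for initiator, listener in configured:
        if (listener, initiator) in configured:
            conflicts.add(frozenset((initiator, listener)))
    return conflicts
```

An empty result means that every central/remote relationship has exactly one initiating side.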

Test and diagnose

The notification spooler logs to the var/log/mknotifyd.log file. In the spooler configuration the loglevel can be raised so that more messages are logged. With the standard loglevel you should see something like this on the central instance:

var/log/mknotifyd.log

2016-10-04 17:19:28 [5] -----------------------------------------------------------------
2016-10-04 17:19:28 [5] Check_MK Notification Spooler version 1.2.8p12 starting
2016-10-04 17:19:28 [5] Log verbosity: 0
2016-10-04 17:19:28 [5] Daemonized with PID 31081.
2016-10-04 17:19:28 [5] Successfully connected to 10.1.8.44:6555

At all times the var/log/mknotifyd.state file contains the current status of the spooler and all of its connections:

central:var/log/mknotifyd.state (excerpt)

Connection:               10.1.8.44:6555
Type:                     outgoing
State:                    established
Status Message:           Successfully connected to 10.1.8.44:6555
Since:                    1475594368 (2016-10-04 17:19:28, 140 sec ago)
Connect Time:             0.000 sec

A version of the same file is also present on the remote instance. There the connection will look something like this:

remote:var/log/mknotifyd.state (excerpt)

Connection:               10.22.4.12:56546
Type:                     incoming
State:                    established
Since:                    1475594368 (2016-10-04 17:19:28, 330 sec ago)
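
For automated checks the state file can be read into a simple data structure. This is a minimal sketch that assumes the file consists of "Key: value" lines and that every connection block begins with a Connection line, as in the two excerpts; the complete file may contain further fields that are not shown here.

```python
def parse_state(text: str) -> list[dict[str, str]]:
    """Split mknotifyd.state content into one dict per connection block."""
    connections = []
    for line in text.splitlines():
        # Split at the first colon only, values such as 10.1.8.44:6555 stay intact
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Connection":
            connections.append({})  # a new connection block starts here
        if connections:
            connections[-1][key] = value
    return connections
```

Each resulting dict can then be tested, for instance for State being established.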

To test, select any monitored remote service and set it manually to CRIT with the Fake check results command.

Now on the central instance an incoming notification should appear in the notifications log file (notify.log):

central:var/log/notify.log

2016-10-04 17:27:57 ----------------------------------------------------------------------
2016-10-04 17:27:57 Got spool file 68c30b35 (myserver123;Check_MK) from remote host for local delivery.

The same event will look like this on the remote instance:

remote:var/log/notify.log

2016-10-04 17:27:23 ----------------------------------------------------------------------
2016-10-04 17:27:23 Got raw notification (myserver123;Check_MK) context with 71 variables
2016-10-04 17:27:23 Creating spoolfile: /omd/sites/remote1/var/check_mk/notify/spool/f3c7dea9-0e61-4292-a190-785b4aa46a64
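
To confirm during a test that the faked result really arrived at the central instance, the log can be searched for the corresponding entry. The sketch below relies on the wording of the log line shown in the central excerpt, which may differ in other Checkmk versions; the function name received_spool_files is illustrative.

```python
import re

# Pattern derived from the example line in central:var/log/notify.log
SPOOL_RE = re.compile(r"Got spool file (\S+) \(([^;]+);(.*)\) from remote host")


def received_spool_files(log_text: str) -> list[tuple[str, str, str]]:
    """Return (spool id, host, service) for every forwarded notification found."""
    results = []
    for line in log_text.splitlines():
        match = SPOOL_RE.search(line)
        if match:
            results.append(match.groups())
    return results
```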

In the global settings you can raise the loglevel of the normal notifications log (notify.log) as well as that of the notification spooler’s log.

Monitoring the spooling

Once you have set up everything as described, a new service will be found on the central instance and on each of the remotes, and it should definitely be added to the monitoring. This service monitors the notification spooler and its TCP-connections. Every connection is thereby monitored twice: once by the central instance and once by the remote instance.

6. Files and directories

6.1. Configuration files

Path	Description
etc/check_mk/multisite.d/sites.mk	Here WATO stores the configuration for the connections to the individual instances. If an error in the configuration causes the interface to ‘hang’ so that it becomes inoperable, you can edit the offending entry directly in this file. If the Livestatus proxy is activated, however, you must subsequently edit and save at least one connection in WATO, since only this action generates a suitable configuration for the daemon.
etc/check_mk/liveproxyd.mk	Configuration for the Livestatus proxy. This file will be freshly-generated by WATO with every alteration in the configuration of a distributed monitoring.
etc/check_mk/mknotifyd.d/wato/global.mk	Configuration for the notification spooler. This file will be generated by WATO when saving the global settings.
etc/check_mk/conf.d/distributed_wato.mk	This is generated on the remote instances by the distributed WATO and it ensures that the remote only monitors its own hosts.
etc/nagios/conf.d/	Storage location for user-created Nagios configuration files with hosts and services. These are required for the use of Livedump on the central instance.
etc/mk-livestatus/nagios.cfg	The configuration of Livestatus for the use of Nagios as the core. Here you can configure the maximum number of simultaneous connections allowed.
etc/check_mk/conf.d/	The configuration of hosts and rules for Checkmk. Store configuration files that are generated by CMCDump here. Only the `wato/` subdirectory is managed by, and visible in, WATO.
var/check_mk/autochecks/	For services found by the service discovery. These are always stored locally on the remote instance.
var/check_mk/rrds/	Location of the Round-Robin-Database for archiving the performance data when using the Checkmk-RRD-format (the default with the Enterprise Editions)
var/pnp4nagios/perfdata/	Location of the Round-Robin-Database with the PNP4Nagios-format (Checkmk Raw Edition)
var/log/liveproxyd.log	Log file for the Livestatus proxies.
var/log/liveproxyd.state	The current state of the Livestatus proxies in a readable form. This file is updated every 5 seconds.
var/log/notify.log	Log file for the Checkmk notification system.
var/log/mknotifyd.log	Log file for the notification spooler.
var/log/mknotifyd.state	The current state of the notification spooler in a readable form. This file is updated every 20 seconds.
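
Because the two state files are rewritten at fixed intervals, their age gives a rough indication of whether the corresponding daemon is still alive. The following sketch assumes a threshold of 60 seconds, i.e. a few missed updates of mknotifyd.state; the threshold and the function name is_stale are assumptions and not Checkmk defaults.

```python
import os
import time


def is_stale(path, max_age=60.0, now=None):
    """Return True if the file at `path` was last modified more than max_age seconds ago."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age
```

Within a site, the path would be relative to the site directory, for example var/log/mknotifyd.state.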

You are viewing the manual for Checkmk version 1.6.0, which has been out of support since September 9th, 2022.
