Alert handlers - Responding to problems automatically

1. Introduction

1.1. Should the monitoring intervene?

One would think it obvious that a monitoring system should never intervene in events - but rather that it should, well, monitor. And it’s probably a good idea to leave it at that.

It is however an admittedly attractive notion that a system which can reliably identify problems could also correct these, provided that it could function automatically.

A few suitable examples can be easily imagined:

Restarting a service that has crashed.
The triggering of a garbage collector if a Java-VM is running out of memory.
The rebuilding of a VPN channel if it is definitely dead.

If one can accept this, then one must think differently about monitoring. From a system that simply observes and is 'not necessary' for operations, a step by step process leads to monitoring becoming a vital organ in the processing center.

But correcting problems is not the only thing that monitoring can do automatically when it identifies a problem. Very useful, but also harmless is the collection of additional diagnostic data at the moment of a failure! You could no doubt offhand think of numerous further issues with which one could use Alert handlers as a starting point.

1.2. Alert handlers in Checkmk

Alert handlers are scripts you write yourself, that are run for you by Checkmk if a problem is detected - or more precisely - if a host or service changes its status.

Alert handlers are very similar to notifications and are similarly configured, but there are a few important differences:

Alert handlers are independent of Downtimes, notifications periods, Acknowledgements and similar controls.
Alert handlers will be activated by the first retry (if multiple check attempts have been configured).
Alert handlers are independent of users and contact groups.
Alert handlers are only available in the Checkmk Enterprise Editions.

One can also say that alert handlers are very 'low level'. As soon as a host or service changes its status, your configured alert handler will be immediately activated. In this manner an alert handler can even perform a repair successfully before an actual alert is generated.

You can naturally - as always in Checkmk - use rules to define the conditions for which a particular handler should be executed.

A further tip for CRE Checkmk Raw Edition users: monitoring can also be enabled to execute actions automatically. Use Nagios’s 'Event Handler' for this. Configure this with manual configuration files in Nagios-Syntax under ~/etc/nagios/conf.d/. The event handlers are well documented. Information can be found simply via Google.

2. Setting up alert handlers

2.1. Saving scripts in the correct directory

Alert handlers are scripts that are run on the Checkmk server. They must be stored in the ~/local/share/check_mk/alert_handlers/ directory, and may be coded in any Linux-supported language, e.g., BASH, Python or Perl. Please don’t forget to make the scripts executable (chmod +x …).

New from 1.4.0: if a comment is inserted into the second line of the script (with a # hash), this will appear as the script’s name in the rule’s selection list:

~/local/share/check_mk/alert_handlers/myhandler

#!/bin/bash
# Foobar handler for repairing stuff
...

2.2. A simple alert handler to try out

As with notifications the script obtains all information for the host or service as environment variables, all of which begin with the ALERT_ prefix.

To test exactly which environment variables appear in the script, you can use the following alert handler for a test:

~/local/share/check_mk/alert_handlers/debug

#!/bin/bash
# Dump all variables to ~/tmp/alert.out

env | grep ^ALERT_ | sort > $OMD_ROOT/tmp/alert.out

env outputs all environment variables
grep ^ALERT_ selects those that begin with ALERT_
sort sort the resulting list alphabetically

2.3. Activating alert handlers

The handler activation is performed using the icon alert handlers Alert handlers Setup module.

Proceed as follows:

Save the script in ~/local/share/check_mk/alert_handlers/debug.
Make it executable with chmod +x debug.
Call up the configuration page via Setup > Events > Alert handlers.
There, define a new rule with Add rule.

The mask for selecting the handler allows direct access and shows the title which is logged in the second line of the script. Additionally you may add arguments, which you enter in the text fields. These will be interpreted as commandline arguments in the script. On your shell you may access them with $1, $2, etc.

Important: After saving the rule the alert handler will be immediately active and will run with every change of status for any host oder service!

2.4. Test and fault diagnoses

To test, manually set a service, for example, Fake check results to CRIT. Now the file should have been created with the variables. Here are its first twenty lines:

OMD[mysite]:~$ head -n 20 ~/tmp/alert.out
ALERT_ALERTTYPE=STATECHANGE
ALERT_CONTACTNAME=check-mk-notify
ALERT_CONTACTS=
ALERT_DATE=2016-07-19
ALERT_HOSTADDRESS=127.0.0.1
ALERT_HOSTALIAS=myserver123
ALERT_HOSTATTEMPT=1
ALERT_HOSTCHECKCOMMAND=check-mk-host-smart
ALERT_HOSTCONTACTGROUPNAMES=all
ALERT_HOSTDOWNTIME=0
ALERT_HOSTFORURL=myserver123
ALERT_HOSTGROUPNAMES=check_mk
ALERT_HOSTNAME=myserver123
ALERT_HOSTNOTESURL=
ALERT_HOSTNOTIFICATIONNUMBER=1
ALERT_HOSTOUTPUT=Packet received via smart PING
ALERT_HOSTPERFDATA=
ALERT_HOSTPROBLEMID=0
ALERT_HOSTSHORTSTATE=UP
ALERT_HOSTSTATE=UP

A log file for the alert handler’s (non-)execution will be found in var/log/alerts.log. The section for the execution of handler debug, for the service Filesystem / on host myserver123 will look something like this:

var/log/alerts.log

2016-07-19 15:17:22 Got raw alert (myserver123;Filesystem /) context with 60 variables
2016-07-19 15:17:22 Rule ''...
2016-07-19 15:17:22  -> matches!
2016-07-19 15:17:22 Executing alert handler debug for myserver123;Filesystem /
2016-07-19 15:17:22 Spawned event handler with PID 6004
2016-07-19 15:17:22 1 running alert handlers:
2016-07-19 15:17:22 PID: 6004, object: myserver123;Filesystem /
2016-07-19 15:17:24 1 running alert handlers:
2016-07-19 15:17:24 PID: 6004, object: myserver123;Filesystem /
2016-07-19 15:17:24 Handler [6004] for myserver123;Filesystem / exited with exit code 0.
2016-07-19 15:17:24 Output:

A couple of further useful tips:

Texts produced by alert handlers on the standard version appear in the log file alongside Output:.
The script’s exit code will also be logged (exited with exit code 0).
Alert handlers really become useful when they execute a command on the target host. Checkmk offers a ready solution for Linux that will be explained later see below.

3. Rules based configuration

As shown in the introductory example, which events should trigger alert handlers are defined via rules. This functions completely analogous to Notifications, only somewhat simplified. In the example we specified no prerequisites, which is naturally unrealistic in practice. The following example shows a prerequisite that an alert handler defines for specific hosts and services:

The alert handler will only be triggered for the hosts myhost123 and myhost124…
… and only for the service JVM CaramKern Memory
… and only if the state changes from OK or WARN to CRIT
… and then only at the second check attempt.

In order for the handler to be triggered, in this example it is necessary to use a rule Maximum number of check attempts for service to set the minimum number of check attempts at 2. In order to suppress a notification in the case of a successful garbage collector the number should be set to 3 - since if the handler can solve the problem directly following the second attempt, the third attempt should detect an OK state and therefore no further notification will be necessary.

Please note: Unlike other places in Checkmk every alert handler rule will be executed if the conditions are matching. Even if two rules calling the same handler these handler will indeed run two times. The alert helper will suppress the second execution with an error message since the same handler must not run multiple times at the same time. Still it is recommended to set up the rules that this case will not appear.

4. How alert handlers are executed

4.1. Asynchronous execution

Alert handlers are very often used to log in to an affected machine remotely using SSH or another protocol, and once there execute a script-controlled action. Since this machine is experiencing a problem, it cannot be excluded that connecting takes a long time or even goes into a timeout.

So that the monitoring is not allowed to stand still, or other alert handlers stalled during this time, as a matter of principle alert handlers are run asynchronously. An auxiliary process - the alert helper - is responsible for this function, and it is started by the CMC. To reduce overhead, this only occurs if a minimum of one alert handler rule has been created. In the cmc.log you will then see the following line:

var/log/cmc.log

2016-07-19 15:17:00 [5] Alert handlers have been switched on

With every change of state for a host or service, the alert helper receives a notification from the CMC containing all information relevant to the event. It then evaluates all alert rules and determines whether a handler should be triggered. If yes, the appropriate script will be started and run in the background as an external process.

4.2. Stopping the core

When you stop the CMC (e.g. via omd stop or by shutting the monitoring server down), all alert helpers still running will be aborted. These will not be repeated later - since who knows when 'later' will be? It’s possible that restarting a service or similar may be more damaging than useful!

4.3. Timeouts

In order to protect itself from too many processes being started in the case of an error situation, when an alert handler is running a 60 second (settable) timeout is in force. At the end of this time the handler will be stopped. From version 1.4.0 it is regulated that at the end of a timeout a Signal 15 (SIGTERM) will be sent to the handler. In this way it has the possibility to stop itself cleanly. After a further 60 seconds (double timeout) it will then be 'terminated' finally with a Signal 9 (SIGKILL).

4.4. Overlaying

From version 1.4.0 Checkmk precludes the simultaneous running of alert helpers if they apply to the same host/service and would run the same script with the same parameters. Such a situation indicates that the first handler is still running and that it would make no sense to start a second copy of the same handler - the second handler would be instantly cancelled and identified as 'failed'.

4.5. Exit status and output from handlers

The alert handler’s outputs and exit codes are reliably evaluated and returned to the core where it is saved in the monitoring history. Furthermore you can trigger a notification below).

4.6. Global settings

There are a number of global settings for executing alert handlers:

The Alert handler log level influences the logging in the alert helper (alerts.log).

4.7. Master control

With a click in the Master control you can deactivate alert handlers globally. Currently running handlers will not be affected and will run to completion.

Please don’t forget to reset the small switch to green as soon as practicable! Otherwise you could be deceived by a false sense of security that the monitoring is fixing everything…

5. Alert handlers in the history

Since version 1.4.0 alert handlers have created entries in the monitoring history. With this you have a better traceability compared to only having the alerts.log log file. An entry is created as soon as an alert handler starts and another when it ends.

The alert handlers are thus regarded in the same way as typical monitoring plug-ins - that means they should produce one line of text and return one of the four exit codes 0 (OK), 1 (WARN), 2 (CRIT) or 3 (UNKNOWN). All errors that from the outset prevent a handler’s execution (abortion due to duplicate execution, script missing, timeout, etc.) are automatically flagged with UNKNOWN.

For example - calling this very simple handler…

local/share/check_mk/alert_handlers/dummy

#!/bin/bash
# Dummy handler for testing

sleep 3
echo "Everything is fine again"
exit 0

... produces a result as above in the relevant service’s history (as always the newest message is at the top):

There is also a generic view Monitor > System Alert handler executions, which provides a global display of all executing alert handlers.

6. Notification via alert handlers

The execution of an alert handler — or more exactly, the completion of an execution — is an event that triggers a notification. This way you can be informed that a handler has completed its task. There are additionally two new types of event that you can filter in a notification rule:

You can thus differentiate between successfully executed handlers (Exitcode 0 - OK), and failures (all other codes). The email notifications in Checkmk have been adapted so that alert handler notifications, instead of displaying results from checks now display results from alert handlers.

7. Handlers for every check execution

Alert handlers are normally only called up when the state of a host or service changes (or during retry attempts when handling problems). Simple check executions without a change of state trigger no alert handlers.

With the Alert handlers > Types of events that are being processed > All check executions! global option you can arrange exactly that. Every execution of a check can potentially trigger an alert handler. You can, for example, use this to transfer data from the active monitoring to other systems.

Please be careful with this setting! Starting processes and calling up scripts consumes a lot of CPU resources. Checkmk can easily run 1000 checks per second - but Linux certainly couldn’t handle 1000 alert handler scripts per second.

In order to make this usefully possible, Checkmk provides the option to write alert handlers as Python functions, which then run inline - without process creation. Such inline handlers can be saved in the same directory as the normal handler scripts. The following functioning example shows the structure of an inline handler:

local/share/check_mk/alert_handlers/foo

#!/usr/bin/python
# Inline: yes

# Do some basic initialization (optional)
def handle_init():
    log("INIT")

# Called at shutdown (optional)
def handle_shutdown():
    log("SHUTDOWN")

# Called at every alert (mandatory)
def handle_alert(context):
    log("ALERT: %s" % context)

This script has no central function, rather it simply defines three functions, although only the handle_alert(…) function is required. This is called up following every check execution and in its argument context receives a Python-Dictionary with variables such as "HOSTNAME", "SERVICEOUTPUT", etc. These represent the environment’s variables that the normal handlers also receive - however here without the ALERT_ prefix. The above example can be used in order to view the contents of context.

All outputs produced by the log() auxiliary function are saved to var/log/alert.log. Both of the global variables omd_root and omd_site are based on the home directory and the Checkmk-Instance’s name respectively.

The handle_init() and handle_shutdown() functions are called up by Checkmk when starting or stopping the monitoring core and enable an intialisation - for example when establishing a connection to a data base.

Additional information:

Please note the # Inline: yes in the second line.
The core must be restarted after every change in the script (omd restart cmc).
Import-commands are permitted.
The Checkmk alert helpers call up their functions synchronously. Make sure that no wait states occur!

8. Remote execution under Linux

8.1. Basic principles

Version 1.4.0 includes a built-in alert handler that enables the reliable execution of scripts on monitored Linux systems. The most important features of this solution are:

The scripts are called up using SSH with Command restriction.
No arbitrary commands may be used, rather only those commands defined by you.
This can all be implemented using the Agent bakery.

The Linux Remote Alert Handler consists of the following individual elements:

The linux_remote alert handler with the title „Linux via SSH“ on the Checkmk-Server.
The mk-remote-alert-handler script on the target system.
The ('Remote handler') scripts written by you on the target system.
Entries in .ssh/authorized_keys for those users on the target system who will execute them.
Rules in Monitoring agents > Rules > Linux Agent > Install remote alert handlers (Linux) that generate SSH keys.
Alert handler rules that call up linux_remote.

8.2. Installation

Assuming one wants to run the /etc/init.d/foo restart script on the myserver123 Linux system whenever the Process FOO service becomes critical (which we have already set up). Proceed as follows:

Coding the remote handler

Next, write the script to be executed on the target system. Since we are working with the Agent bakery, install the script on the Checkmk-Server (not on the target system!). The correct directory for this is local/share/check_mk/agents/linux/alert_handlers. Here as well the comment in the second line provides a title for selection in the user interface:

local/share/check_mk/agents/linux/alert_handlers/restart_foo

#!/bin/bash
# Restart FOO service

/etc/init.d/foo restart || {
    echo "Could not restart FOO."
    exit 2
}

Make the script executable:

OMD[mysite]:~$ cd local/share/check_mk/agents/linux/alert_handlers
OMD[mysite]:~$ chmod +x restart_foo

Our example script is so constructed that in the case of an error it ends with a Code 2 so that the alert handler assesses it as CRIT.

Preparing agent packages with the handler

Here we will describe the procedure with the agent bakery. Tips for installing by hand can be found further below. Define a rule under Setup > Agents > Windows, Linux, Solaris, AIX > Agent rules > Linux Agent > Remote alert handlers (Linux). In the characteristics, the Remotehandler Restart FOO service you have just defined can be seen. Select this for installation:

Once you have saved it you will see the rule in the list: an SSH-key pair for calling up the handler will be automatically generated and the fingerprint of which will appear in the rule. The fingerprint itself has been shortened to fit the width in this screenshot:

The public key is intended for the agent. The private key will later be required by the Checkmk-Server so that a script installed in this way can be called up without needing to enter a password.

One can also utilise another user as the root - naturally only when they have the appropriate rights for the required action. The Checkmk-Agent will only install the SSH-key on systems where this user already exists.

Baking agents

Now bake new agents with button bake agents . In the list of ready agents an entry should now appear in which your remote handler and SSH-key can be seen. The screenshot has also been shortened here. This time by the amount of possible packages that may be downloaded:

Agent installation

Next, install the RPM or DEB package on your target system (the TGZ-Installation cannot load the SSH-key and is thus incomplete). With the installation the following things occur:

Your remote handler script will be installed
The mk-remote-alert-handler auxiliary program will be installed
For selected users (here root) an entry will be made in authorized_keys that will enable the handler’s execution.
The .ssh directory and the authorized_keys file will be created as necessary.

With an installation via DEB it will look something like this:

root@myserver123:~# dpkg -i check-mk-agent_2016.07.19-9d3ab34905da4934_all.deb
Selecting previously unselected package check-mk-agent.
(Reading database ... 515080 files and directories currently installed.)
Preparing to unpack ...check-mk-agent_2016.07.19-9d3ab34905da4934_all.deb ...
Unpacking check-mk-agent (2016.07.19-9d3ab34905da4934) ...
Setting up check-mk-agent (2016.07.19-9d3ab34905da4934) ...
Reloading xinetd...
 * Reloading internet superserver configuration xinetd                            [ OK ]
Package 9d3ab34905da4934: adding SSH keys for Linux remote alert handlers for user root...

A look into the SSH-configuration for root reveals:

root@myserver123:~# cat /root/.ssh/authorized_keys
command="/usr/bin/mk-remote-alert-handler restart_foo",no-port-forwarding,no-x11-forwarding,no-agent-forwarding ssh-rsa  AAAAB3NzaC1yc2EAAAADAQABAAACAQCqoDVNFEbTqYEmhSZhUMvRy5SqGIPp1nE+EJGw1LITV/rej4AAiUUBYwMkeo5aBC6VOXkq78CdRuReSozec3krKkkwVbgYf98Wtc6N3WiljS85PLAVvPadJiJCkXFctbxyI2xeF5TQ1VKDRvzbBjXE9gjTnLWbPy77RC8SVXLoOQgabixpWQquIIdGyccPsWGTRgeI7Ua0lgWZQUJt7OIKQ0X7Syv2VHKJNqtW28IWu8y2hBEY/TERip5EQoNT/VclhHqjDG2y3F45PswcXD5in6y30EnfHGcwk+PD6fgp7jPGbO2+QBUwYgW67GmRpbaVQ97CqXFJvORNF+C6+O8DNweyH3ogspjfKvM7eN+M4NIJzjMRyNBMzqF3VmrMeqpzRjfFj2BS/8UbXGgHzZRapwrK3+GXX1pG49n77cIs+GWos9xb1DxX1pEu2tgQwRBBhYcTkk2eKkH18LKzFUyObxtQmf40C24cdQOp6USbwzsniqehsLIHH2unQ7bW6opF/GiaEjZamGbgsPOe8rmey5Vcd//e8cS+OsmcPZNybsTJpBeHpes+5bw0e1POw9GD9qptylrQLYIO5R467Ov8YlRFgYKyaDFHD40j5/JHPzmtp4vjH8Si7YZZOzvTRgBYEoEgbLS5dgdr/I5ZMRKfDPCpRUbGhp9kUEdGX99o5Q== mk-remote-alert-handler-9d3ab34905da4934

Please be aware that your system could be set up so that an SSH-access as root is not generally possible. In this case you can go via another user and there work with sudo, which is so configured that the desired command can be executed without a password.

Calling alert handlers using rules

We have almost reached our objective. The agent is ready. Now all that’s missing is a rule to actually summon the alert handler. The procedure is as described at the beginning of this article and is achieved through the creation of an appropriate rule. This time choose Linux via SSH as the handler, enter the user for whom the SSH-key should be installed, and select your remote handler:

Please also set a sensible condition in the rule, otherwise an SSH-connection will be attempted with every service alert!

Test

When, e.g., you now set the relevant service manually to CRIT, in the service’s history you will shortly see:

Naturally if no foo service exists, then /etc/init.d/foo restart also can’t work. It can however be seen that this command has been processed and also that the error status has been correctly reported. Likewise that Checkmk has triggered a notification which was stopped by an alert handler.

The Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts. message is harmless by the way and only appears on first contact with the host. To avoid the time-consuming manual exchange of the host key, SSH is called with -o StrictHostKeyChecking=false. At the first connection the key will be stored for future use.

8.3. Installation without the agent bakery

Of course manually preparing an agent also works. In such a case we recommend performing the agent bakery procedure on a test system, then examine the relevant data and replicate it manually on your own system. A list of the file paths can be found here.

Here in this case it is also important that in the agent bakery you create a rule for installing the remote handler - because in this rule the SSH-key will be generated for the access and also for use by the alert handler! The public key for the installation in authorized_keys can be found in the etc/check_mk/conf.d/wato/rules.mk configuration file (or in a subfolder in rules.mk).

9. Relevant files and directories

9.1. Files and directories on the Checkmk-Server

Pfad	Function
var/log/alerts.log	Log file with all events relevant to the alert handler (logged by the alert helper)
var/log/cmc.log	Log file for the core. Some alert handler information is also stored here
local/share/check_mk/alert_handlers/	Save your self-written alert handlers here
var/check_mk/core/history	Here the monitoring history’s log file is stored and also evaluated by the core
local/share/check_mk/agents/linux/alert_handlers/	Remote alert handlers to be run on Linux systems

Pfad

Function

var/log/alerts.log

Log file with all events relevant to the alert handler (logged by the alert helper)

var/log/cmc.log

Log file for the core. Some alert handler information is also stored here

local/share/check_mk/alert_handlers/

Save your self-written alert handlers here

var/check_mk/core/history

Here the monitoring history’s log file is stored and also evaluated by the core

local/share/check_mk/agents/linux/alert_handlers/

Remote alert handlers to be run on Linux systems

9.2. Files and directories on the monitored Linux system

Path Function

Path	Function
/usr/bin/mk-remote-alert-handler	Auxiliary script for executing the remote handlers
/usr/lib/check_mk_agent/alert_handlers/	Remote handlers written by you
/root/.ssh/authorized_keys	SSH-configuration for the user `root`
~harri/.ssh/authorized_keys	SSH-configuration for a user `harri`

/usr/bin/mk-remote-alert-handler

Auxiliary script for executing the remote handlers

/usr/lib/check_mk_agent/alert_handlers/

Remote handlers written by you

/root/.ssh/authorized_keys

SSH-configuration for the user root

~harri/.ssh/authorized_keys

SSH-configuration for a user harri

On this page

1. Introduction
- 1.1. Should the monitoring intervene?
- 1.2. Alert handlers in Checkmk
2. Setting up alert handlers
3. Rules based configuration
4. How alert handlers are executed
5. Alert handlers in the history
6. Notification via alert handlers
7. Handlers for every check execution
8. Remote execution under Linux
9. Relevant files and directories
- 9.1. Files and directories on the Checkmk-Server
- 9.2. Files and directories on the monitored Linux system