
1. Introduction

Checkmk includes nearly 2000 ready-made check plug-ins for all imaginable hardware and software. These are maintained by the Checkmk team, and new plug-ins are added every week. On the Checkmk Exchange there are also more plug-ins contributed by our users.

And yet there are always situations where a device, an application, or just a specific metric that is important to you is not covered by any of these plug-ins — maybe because it is something that was developed within your own company and is therefore not available to anyone else.

1.1. Does it always have to be a real plug-in?

What options do you have for implementing an effective monitoring here? Well, you could of course contact our support team and request that they develop a suitable plug-in for you — but naturally it is quicker if you can do it yourself.

You have four options:

  • Local check
    How to do it: Extend a Checkmk agent with a simple script.
    Advantages: Is very simple, is possible in all programming languages offered by the monitored host’s operating system, and even supports service discovery.
    Disadvantages: Threshold configuration only for the agent itself, SNMP not possible or very cumbersome.

  • Nagios-compatible check plug-in
    How to do it: Run the plug-in via MRPE from the Windows or Linux agent.
    Advantages: Access to all existing Nagios plug-ins, also free choice of the programming language.
    Disadvantages: Threshold configuration only for the agent itself, SNMP not possible or very cumbersome, no service discovery possible.

  • Evaluating log messages
    How to do it: Monitor messages with the Event Console.
    Advantages: No development necessary; you only need to set up rules in the Event Console.
    Disadvantages: Only works if suitable log messages are available, no verified current status, no recording of metrics, no configurable thresholds.

  • Genuine Checkmk plug-in
    How to do it: Will be explained in this article.
    Advantages: Integrates 100% with Checkmk, automatic service detection, central configuration of thresholds via graphical interface, very high performance, supports SNMP, automatic host and service labels are possible, supports HW/SW inventory, supported by standard libraries from Checkmk.
    Disadvantages: Requires more training and knowledge of the Python programming language.

This article will show you how to develop real Checkmk check plug-ins — along with everything associated with them. Here we show you how to use the newly-developed API for programming plug-ins in version 2.0.0 of Checkmk.

1.2. What has changed compared to the old API?

Do you already have experience with developing check plug-ins for Checkmk Version 1.6.0 or earlier? If so, here is a concise overview of all of the changes introduced in the new Check API available from Version 2.0.0:

  • Plug-ins are now Python 3 modules, and the file names must have the .py extension.

  • The custom plug-ins are now located in the local/lib/check_mk/base/plugins/agent_based directory.

  • At the beginning of the file you will now need at least one special import statement.

  • The sections and the actual checks are now stored separately. For this purpose there are the new register.agent_section and register.check_plugin functions.

  • Several function and argument names have been renamed. Among other things, Discovery is now always used consistently (previously Inventory).

  • The Discovery function (formerly Inventory function) and also the Check function must now always work as generators (so use yield).

  • The names for the declared functions' arguments are now fixed.

  • Instead of the SNMP scan function, write a declaration of which OIDs are expected with which values.

  • The functions for representing numbers have been restructured (e.g. get_bytes_human_readable becomes render.bytes).

  • There is now a separate method for checks to exclude others (supersedes). This is no longer done in the SNMP scan function.

  • The auxiliary functions for working with counters, rates and averages have changed.

  • Instead of magic return values such as 2 for CRIT, there are now constants (e.g. State.CRIT).

  • Many possible programming errors in your plug-in are now recognised by Checkmk at a very early stage and can be immediately highlighted for you.

1.3. Will the old API still be supported?

Yes. The API for the development of check plug-ins valid up to version 1.6.0 of Checkmk will be supported with some minor restrictions for a few more years, because a significant number of plug-ins have been developed with it. During this time, Checkmk will offer both APIs in parallel. Details can be found in Werk #10601.

Nevertheless, we do recommend the new API for the development of new plug-ins, as it is more consistent and logical, better documented and is the most future-proof solution in the long term.

1.4. The different types of agents

Check plug-ins evaluate the data from the Checkmk agents. And that is why, before we leap into action here, we should first look at an overview of the types of agents that Checkmk actually recognises:

Checkmk Agent

The ‘normal’ plug-ins evaluate data that the Checkmk agent sends for Linux, Windows or other operating systems. This agent monitors operating system parameters and applications, and sometimes also server hardware. Each new check plug-in requires an extension of the agent to provide the necessary data. Therefore you first develop an agent plug-in, and then one or more check plug-ins that evaluate this data.

Special Agent / API-Integration

You need a special agent if you cannot get the data that is relevant for monitoring from either the normal Checkmk agent or SNMP. The most common use for a special agent is querying HTTP-based APIs. Examples include the monitoring of AWS, Azure, or VMware. In this case you write a script that runs directly on the Checkmk server, connects to the API, and outputs data in the same format as an agent plug-in would. For this you write suitable check plug-ins in the same way as with the ‘agent-based’ monitoring.

SNMP

When monitoring via SNMP you do not need an extension of an agent, but simply evaluate the data retrieved from your device via SNMP, which provides this by default. Checkmk supports you and takes over all of the details and special features of the SNMP protocol. There is in fact an agent here as well — namely the SNMP agent which is pre-installed on the system being monitored.

Active Check

This check type plays a special role. Here you first write a classic Nagios-compatible plug-in which is intended for execution on the Checkmk server, and which from there uses a network protocol to directly query a service on the target device. The most prominent example is the check_http plug-in which allows you to monitor web servers and web pages. You can then integrate this plug-in into Checkmk so that it can be set up as usual via rules.

1.5. Prerequisites

If you feel like programming check plug-ins, you need to satisfy the following prerequisites:

  • Knowledge of the Python programming language

  • Experience with Checkmk, especially with regard to agents and checks

  • Experience with Linux on the command line

As preparation, the following articles are recommended:

2. A first simple check plug-in

After this long introduction, it’s time we programmed our first simple check plug-in. As an example, let’s take a simple monitoring for Linux. Since Checkmk itself runs on Linux, it is very likely that you also have access to a Linux system.

The check plug-in will create a new service that detects whether someone has plugged a USB stick into a Linux server. In this case, this service should then become critical. You might even find something like this useful, but it is really only a simplified example and possibly also not programmed in a completely watertight way. For now, that’s not what this exercise is about.

The whole procedure involves two steps:

  1. We find out which Linux command can be used to determine whether a USB stick has been plugged in, and then extend the Linux agent with a small script that calls this command.

  2. We then write a check plug-in in the Checkmk site that evaluates this data.

Here we go...

2.1. Finding the right command

At the beginning of any check programming activity is the necessary research! This means that we have to find out how we can get the information we need for monitoring. With Linux, this will often involve command line commands. In Windows, PowerShell, VBScript or WMI can help, and with SNMP we have to find the right OIDs (there is a separate chapter for this).

Unfortunately, there is no general procedure for determining the correct command, so we do not want to spend too much time on the subject here. We will, however, briefly explain how it works for a USB stick.

First we log in to the host we want to monitor. Under Linux, the agent runs as the root user by default, which is why we perform all our tests as root. For our task with the USB stick, there are convenient symbolic links in the directory /dev/disk/by-id. These point to all the Linux block devices, and a plugged-in USB stick is one such device. In addition, you can tell from the usb- prefix of the ID when a block device is a USB device.

The following command lists all entries in this directory:

root@linux# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5 -> ../../sda5
lrwxrwxrwx 1 root root  9 May 14 11:21 wwn-0x5002538655584d30 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part5 -> ../../sda5

And now the same listing with the USB stick plugged in:

root@linux# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5 -> ../../sda5
lrwxrwxrwx 1 root root  9 May 14 12:15 usb-SCSI_DISK-0:0 -> ../../sdc
lrwxrwxrwx 1 root root 10 May 14 12:15 usb-SCSI_DISK-0:0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 May 14 12:15 usb-SCSI_DISK-0:0-part2 -> ../../sdc2
lrwxrwxrwx 1 root root  9 May 14 11:21 wwn-0x5002538655584d30 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part5 -> ../../sda5
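The role of the usb- prefix can be illustrated with a few lines of Python. This is only a sketch for experimenting: the function name usb_entries is made up for this example, and the names are a subset copied from the listing rather than read from a real /dev/disk/by-id directory.

```python
# A subset of the directory entries shown in the listing. On a real host
# they could be read with os.listdir("/dev/disk/by-id") instead.
entries = [
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1",
    "usb-SCSI_DISK-0:0",
    "usb-SCSI_DISK-0:0-part1",
    "usb-SCSI_DISK-0:0-part2",
    "wwn-0x5002538655584d30",
]


def usb_entries(names):
    """Return every entry whose ID starts with the usb- prefix.

    Hypothetical helper for illustration, not part of Checkmk.
    """
    return [name for name in names if name.startswith("usb-")]


print(usb_entries(entries))
```

Note that the partitions of the stick carry the same prefix, so a plain prefix test returns the device and its partitions alike.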

2.2. Purging the data

Strictly speaking, we could stop here, transport this whole output via the Checkmk agent to the Checkmk server and have it analysed there, because in Checkmk the following recommendation always applies: let the server do the complex work, and keep the agent plug-in as simple as possible.

But there is still too much hot air in here. Transferring unnecessary data is always undesirable. Avoiding unnecessary transfers saves network traffic, memory, computing time and also makes everything clearer. That is simply a better way of doing things!

First, we can omit the -l. This already makes the output of ls much leaner:

root@linux# ls /dev/disk/by-id/
ata-APPLE_SSD_SM0512F_S1K5NYBF810191        ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5  wwn-0x5002538655584d30-part3
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1  wwn-0x5002538655584d30-part4                ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2
wwn-0x5002538655584d30                      wwn-0x5002538655584d30-part5                ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3
wwn-0x5002538655584d30-part1                ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4  wwn-0x5002538655584d30-part2

Now the multi-column layout gets in the way, but it only appears because the ls command recognises that it is running in an interactive terminal. Later, as part of the agent, it will output the data in a single column. But we can also easily force this here with the -1 option (for output in one column):

root@linux# ls -1 /dev/disk/by-id/
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5
wwn-0x5002538655584d30
wwn-0x5002538655584d30-part1
wwn-0x5002538655584d30-part2
wwn-0x5002538655584d30-part3
wwn-0x5002538655584d30-part4
wwn-0x5002538655584d30-part5

If you look closely, you will see not only the block devices themselves, but also any partitions that exist there. These are the entries that end in -part1, -part2, etc. We do not need these for our check and can get rid of them quite easily with a grep. For this we use the -v option, which inverts the match so that only the non-matching lines are output:

root@linux# ls /dev/disk/by-id/ | grep -v -- -part
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
usb-SCSI_DISK-0:0
wwn-0x5002538655584d30

Here you can now see much more clearly that in our example there are in fact exactly three devices when the USB stick is plugged in.

Perfect! We now have a clear list of all block devices which has been compiled with a simple command. That’s all we need.

We have again omitted the -1 in the last command, because ls now writes into a pipe and outputs a single column by itself. And grep needs the -- because otherwise it would interpret the word -part as the four options -p, -a, -r and -t.
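For readers who think more easily in Python than in shell pipelines, the effect of grep -v -- -part can be sketched as a simple filter. The function name without_partitions is an assumption made for this example, and the input list is typed in by hand from the listings in this chapter.

```python
def without_partitions(names):
    """Mimic `grep -v -- -part`: keep only the names that do NOT contain '-part'.

    Hypothetical helper for illustration, not part of Checkmk.
    """
    return [name for name in names if "-part" not in name]


names = [
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1",
    "usb-SCSI_DISK-0:0",
    "usb-SCSI_DISK-0:0-part1",
    "wwn-0x5002538655584d30",
    "wwn-0x5002538655584d30-part1",
]

# One device name per line, just like the output of the shell pipeline.
print("\n".join(without_partitions(names)))
```

Like grep, the filter looks for the text anywhere in the line, so a device whose own name happened to contain -part would be dropped as well.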

And by the way: Why don’t we simply 'grep' for usb in addition so that only USB devices are output? Well, of course we could do that. But for one thing, our example then becomes increasingly boring, and besides, it is somehow more reassuring to get some content in the section in a normal situation and not simply nothing. In this way one can see immediately on the Checkmk server that the agent plug-in is working correctly.

2.3. Including the command in the agent

In order for us to be able to retrieve this data from the Checkmk server, we need to make the new command part of the Checkmk agent on the system being monitored. We could of course simply edit the /usr/bin/check_mk_agent file there and include that. However, this would have the disadvantage that our command would disappear again when we update the agent’s software because the file will be replaced at that point.

It is therefore better if we make an agent plug-in. This is even simpler. All we need is an executable file with our command in the /usr/lib/check_mk_agent/plugins directory.

And one more point is important: we can’t just output our data like this. What we still need is a section header. This is a specially-formatted line that contains our new check’s name. By means of these section headers, Checkmk can later recognise where this plug-in’s data begins and the previous plug-in’s data ends.

So now we need a meaningful name for our new check. This name is limited to lower case letters (only a-z, no accents, no umlauts), underscores and numbers and must be unique. Avoid name collisions with existing check plug-ins. If you are curious about which names already exist, in a Checkmk site you can list them on the command line with cmk -L:

OMD[mysite]:~$ cmk -L | head -n 20
3par_capacity                agent      HPE 3PAR: Capacity
3par_cpgs                    agent      HPE 3PAR: CPGs
3par_cpgs_usage              agent      HPE 3PAR: CPGs Usage
3par_hosts                   agent      HPE 3PAR: Hosts
3par_ports                   agent      HPE 3PAR: Ports
3par_remotecopy              agent      HPE 3PAR: Remote Copy
3par_system                  agent      HPE 3PAR: System
3par_volumes                 agent      HPE 3PAR: Volumes
3ware_disks                  agent      3ware ATA RAID Controller: State of Disks
3ware_info                   agent      3ware ATA RAID Controller: General Information
3ware_units                  agent      3ware ATA RAID Controller: State of Units
acme_agent_sessions          snmp       ACME Devices: Agent Sessions
acme_certificates            snmp       ACME Devices: Certificates
acme_fan                     snmp       ACME Devices: Fans
acme_powersupply             snmp       ACME Devices: Power Supplies
acme_realm                   snmp       ACME Devices: Realm
acme_sbc                     agent      ACME SBC: Health
acme_sbc_settings            agent      ACME SBC: Health Settings
acme_sbc_snmp                snmp       ACME SBC: Health (via SNMP)
acme_temp                    snmp       ACME Devices: Temperature

The second column shows how the respective check plug-in obtains its data.
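The character rule for names described above can be tried out with a small regular expression. This is a sketch under the assumption that only the character set matters. The helper is_valid_check_name is made up for this example, it does not check for collisions with existing plug-ins, and Checkmk's own validation may be stricter.

```python
import re

# Lower case letters a-z, digits and underscores, as described in the text.
_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_check_name(name):
    """Return True if the name consists only of the permitted characters.

    Hypothetical helper for illustration, not part of Checkmk.
    """
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ("linux_usbstick", "3par_capacity", "Linux-USB", "usb stick"):
    print(candidate, is_valid_check_name(candidate))
```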

For our example, let’s choose the name linux_usbstick. In this example the section header must look like this:

<<<linux_usbstick>>>

We can simply output this with echo. We must also not forget the 'shebang' (this is not a venomous sting from the desert planet but a contraction of sharp and bang, the latter being a nickname for the exclamation mark!), by which Linux recognises that it should execute the script with the shell. Our plug-in then looks like this:

/usr/lib/check_mk_agent/plugins/linux_usbstick
#!/bin/sh
echo '<<<linux_usbstick>>>'
ls /dev/disk/by-id/ | grep -v -- -part

We have now simply used the file name linux_usbstick, even though it doesn’t really matter. But one thing is still very important: Make the file executable!

root@linux# chmod +x /usr/lib/check_mk_agent/plugins/linux_usbstick

Of course, you can easily try out the plug-in manually by entering the complete path as a command:

root@linux# /usr/lib/check_mk_agent/plugins/linux_usbstick
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30

2.4. Testing the agent

As always, the next most important tasks are testing and troubleshooting. It is best to proceed in three steps:

  1. Try out the plug-in on its own. We have just done that.

  2. Test the agent as a whole locally.

  3. Retrieve the agent via the Checkmk server.

Testing the agent locally is very simple — as root user call the command check_mk_agent. The new section should appear somewhere in the output from this:

root@linux# check_mk_agent

Here is an excerpt from that output which contains the new section:

<<<lnx_thermal:sep(124)>>>
thermal_zone0|-|BAT0|35600
thermal_zone1|-|x86_pkg_temp|81000|0|passive|0|passive
<<<local>>>
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463274)>>>
accountsservice|0.6.45-1ubuntu1|amd64|deb|-||install ok installed
acl|2.2.52-3build1|amd64|deb|-||install ok installed
acpi|1.7-1.1|amd64|deb|-||install ok installed

By appending less you can scroll through the output (press the space bar to scroll, / to search and Q to exit):

root@linux# check_mk_agent | less

The third test is then performed directly from the Checkmk site. Include the host in the monitoring (e.g. as myserver01) and then retrieve the agent data with cmk -d. You should get the same output here:

OMD[mysite]:~$ cmk -d myserver01 | less

By the way: grep has the -A option to output a few more lines after each hit. This allows you to conveniently search and output the section:

OMD[mysite]:~$ cmk -d myserver01 | grep -A5 '^<<<linux_usbstick'
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463559)>>>
accountsservice|0.6.45-1ubuntu1|amd64|deb|-||install ok installed
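To make the role of the section headers more concrete, here is a small Python sketch that groups agent output by header. It is a simplification for illustration and not how Checkmk itself processes the data: the function split_sections is made up, and header options such as sep(124) or persist(...) are simply cut off.

```python
def split_sections(agent_output):
    """Group the lines of an agent output by their <<<...>>> section header.

    Hypothetical helper for illustration, not part of Checkmk.
    """
    sections = {}
    current = None
    for line in agent_output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            # Keep only the section name; options after ':' are ignored here.
            current = line[3:-3].split(":", 1)[0]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


# Shortened excerpt of the agent output shown in this chapter.
sample = """<<<local>>>
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463274)>>>
acl|2.2.52-3build1|amd64|deb|-||install ok installed
"""

print(split_sections(sample)["linux_usbstick"])
```

Each header ends the previous section and starts a new one, which is why the plug-in only has to print its own header before its data.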

If this works, your agent is now ready! And what have we done to achieve this? We simply created a three-line script with the path /usr/lib/check_mk_agent/plugins/linux_usbstick and made it executable!

Everything that follows now only takes place on the Checkmk server: There we will write the actual check plug-in.

2.5. Declaring the section

Preparing the agent is the most complicated part, but it is only half the battle. Now we have to teach Checkmk how to handle the information and the new agent section, which services it should generate, when they should go to OK or CRIT, etc. We do all this by programming a check plug-in in Python.

For your own check plug-ins you will find a directory prepared in the local hierarchy of the site directory. This is local/lib/check_mk/base/plugins/agent_based/. Here in the path, base means the part of Checkmk that is responsible for actually monitoring and alerting. The agent_based is for all plug-ins that relate to the Checkmk agent (so not alerting plug-ins, for example). The easiest way to work with this is to switch to it:

OMD[mysite]:~$ cd local/lib/check_mk/base/plugins/agent_based

This directory belongs to the site user and is therefore editable by you. You can edit your plug-in with any text editor installed on the Linux system.

So let’s create our plug-in here. The convention is that the file name reflects the agent section’s name. The file name must end with .py, since from Checkmk version 2.0.0 onwards the plug-ins are always real Python modules.

First, we need to import the functions needed for the plug-ins from other Python modules. The simplest method for this is with a *. As you might guess, there is also a version number of the API for plug-in programming here. This will be version 1 until further notice, and is abbreviated here to v1:

local/lib/check_mk/base/plugins/agent_based/linux_usbstick.py
from .agent_based_api.v1 import *

This versioning allows us to eventually provide future new versions of the API parallel to the previous ones, so that existing check plug-ins continue to work without problems.

In the simplest case, you skip explicitly declaring the section. If you want to implement a parse function (which professional developers would always advise you to do), see the section on parse functions for more information.

2.6. Registering the check

In order for Checkmk to know that the new check exists, it must be registered. This is done by calling the function register.check_plugin. In doing so, you must always specify at least four things:

  1. name: The name of the check plug-in. If you don’t want to get into trouble, take the same name here as for your new agent section. This way the check will automatically know which section to evaluate.

  2. service_name: The name of the service as it should then appear in the monitoring.

  3. discovery_function: The function that discovers services of this type (more on this in a moment).

  4. check_function: The function to perform the actual check (more on this in a moment).

So for our check the registration will look like this:

register.check_plugin(
    name="linux_usbstick",
    service_name="USB stick",
    discovery_function=discover_linux_usbstick,
    check_function=check_linux_usbstick,
)

It’s best not to try this out just yet, because of course we still have to write the discover_linux_usbstick and check_linux_usbstick functions beforehand, and these functions must appear in the source code before the above declaration.

2.7. Writing the discovery function

A special feature of Checkmk is the automatic discovery of services to be monitored. In order for this to work, each check plug-in must define a function that uses the agent’s output to decide whether a service of this type is to be created for the host in question or, if there can be several, which ones.

The discovery function is always called when the service discovery is carried out for a host. This function then decides whether or which services are to be created. In the standard procedure, it receives exactly one argument with the name section. This contains the data of the agent section in a parsed format (more on this later).

We implement the following simple logic: If the agent section linux_usbstick exists, then we also create a matching service. This service will then automatically appear on all hosts where our agent plug-in has been rolled out. We recognise the presence of the section simply by the fact that our discovery has actually been invoked!

The discovery function must return an object of the type Service for each service to be created using yield (not with return). For checks that can only occur once per host, no further information is needed:

def discover_linux_usbstick(section):
    yield Service()

2.8. Writing the check function

So now we can come to the actual check function, which finally decides on the basis of current agent outputs which state a service should assume. Since our check has no parameters and there is only ever one per host, our function is also called with the single argument section.

Since we really need the content this time, we have to deal with the format of this argument. Unless you have explicitly defined a parse function, Checkmk will split each line of the section into a list of words at the spaces. The whole section in turn becomes a list of these word lists, so the end result is always a list of lists.

In the simple case in which our agent plug-in only finds two devices, it will then look like this (here there is only one word per line):

[['ata-APPLE_SSD_SM0512F_S1K5NYBF810191'], ['wwn-0x5002538655584d30']]
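The list-of-lists structure can be reproduced outside of Checkmk with a one-line list comprehension. This only illustrates the behaviour described above and is not Checkmk's actual parsing code: the function split_into_words and the line "sda 512 GB" are invented for this sketch.

```python
def split_into_words(lines):
    """Split every line at whitespace, giving one word list per line.

    Hypothetical helper for illustration, not part of Checkmk.
    """
    return [line.split() for line in lines]


raw_lines = [
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "wwn-0x5002538655584d30",
]
print(split_into_words(raw_lines))

# A made-up line with several words becomes a list with several elements.
print(split_into_words(["sda 512 GB"]))
```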

The check function now goes through line by line and looks for a line whose first (and only) word begins with usb-SCSI_DISK. If this is found, the state will become CRIT. Here is the implementation:

def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")

And here is the explanation:

  1. With for line in section we loop through all of the lines in the agent’s output.

  2. We then check whether the first word in the line — the respective device — begins with usb-SCSI_DISK.

  3. If yes, we generate a check result with the status CRIT and the text Found USB stick. And we then end the function with a return.

  4. If the loop is run without finding anything, it will generate the status OK and the text No USB stick found.
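If you would like to try out the logic of the check function before installing anything, a sketch like the following may help. The State and Result classes here are simplified stand-ins written for this example so that it runs without Checkmk. In a real plug-in both names come from the import of .agent_based_api.v1.

```python
from enum import Enum
from typing import NamedTuple


class State(Enum):
    """Stand-in for the State class of the Check API (assumption for this sketch)."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


class Result(NamedTuple):
    """Stand-in for the Result class of the Check API (assumption for this sketch)."""
    state: State
    summary: str


def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")


without_stick = [["ata-APPLE_SSD_SM0512F_S1K5NYBF810191"], ["wwn-0x5002538655584d30"]]
with_stick = without_stick + [["usb-SCSI_DISK-0:0"]]

# The check function is a generator, so it must be consumed, e.g. with list().
print(list(check_linux_usbstick(without_stick)))
print(list(check_linux_usbstick(with_stick)))
```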

2.9. The complete plug-in at a glance

And here is the complete plug-in one more time:

local/lib/check_mk/base/plugins/agent_based/linux_usbstick.py
from .agent_based_api.v1 import *

def discover_linux_usbstick(section):
    yield Service()

def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")

register.check_plugin(
    name = "linux_usbstick",
    service_name = "USB stick",
    discovery_function = discover_linux_usbstick,
    check_function = check_linux_usbstick,
)

And this is the plug-in for the Linux agent:

/usr/lib/check_mk_agent/plugins/linux_usbstick
#!/bin/sh
echo '<<<linux_usbstick>>>'
ls /dev/disk/by-id/ | grep -v -- -part

3. Checks with more than one service (items) per host

3.1. Basic principles

In our example, we have built a very simple check that creates a service on a host — or not. A very common situation is, of course, that there can be several services with one check on one host.

The most common example of this is the file systems for a host. The plug-in named df creates one service per file system on the host. To distinguish these services, the mount point of the file system (e.g. /var), or the drive letter (e.g. C:) is built into the service name. This then results in service names such as Filesystem /var or Filesystem C:. The word /var or C: is referred to here as the item. So we also speak of a check with items.

If you want to build a check with multiple items, you need to implement the following things:

  • The discovery function must generate a service for each of the items that are to be meaningfully monitored on the host.

  • In the service name you must include this item using the %s wildcard (e.g. "Filesystem %s").

  • The check function is invoked once separately for each item and receives this as an argument. It must then fish out the relevant data for this item from the agent data.

3.2. A simple example

To be able to test the whole thing practically, we will simply build another agent section that only outputs dummy data. A small shell script is sufficient for this. The section should be called foobar in this example:

/usr/lib/check_mk_agent/plugins/foobar
#!/bin/sh
echo "<<<foobar>>>"
echo "West 100 100"
echo "East 197 200"
echo "North 0 50"

The foobar section contains three sectors here: West, East and North (whatever that means). In each sector there are a number of seats, some of which are occupied (e.g. in West 100 of 100 seats are occupied).

Now we will create a matching check plug-in for this. The registration is as usual, but with the important difference that the service name now contains exactly one %s. At this position the item’s name will be inserted later by Checkmk:

register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
)

The discovery function now has the task of determining the items to be monitored. As usual, it receives the section argument. And again, this is a list of lines, which in turn are lists of words.

In our example the list looks like this:

[['West', '100', '100'], ['East', '197', '200'], ['North', '0', '50']]

You can loop through such a list with Python and give meaningful names to these three words grouped in each line:

for sector, _used, _slots in section:
    ...

In each line, the first word — here the sector — is our item. Whenever we find an item, we return that with yield, creating an object of type Service that gets the sector name as its item. The underscore indicates that for now we don’t care about the other two columns in the output, since in a discovery it ultimately doesn’t matter how many slots are occupied.

Overall it looks like this:

def discover_foobar(section):
    for sector, _used, _slots in section:
        yield Service(item=sector)

Of course, it would be easy to omit some lines here on the basis of arbitrary criteria. Maybe there are sectors which have a size of 0 and which you would never want to monitor? Simply omit such rows so that no item will be generated for them.
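A discovery function that skips such empty sectors could look roughly like the following sketch. The Service class is a minimal stand-in so that the example runs without Checkmk, and the sector South with 0 slots is invented test data that does not appear in the agent plug-in above.

```python
from typing import NamedTuple, Optional


class Service(NamedTuple):
    """Stand-in for the Service class of the Check API (assumption for this sketch)."""
    item: Optional[str] = None


def discover_foobar(section):
    for sector, _used, slots in section:
        if int(slots) == 0:
            # A sector without any slots is not worth monitoring: create no service.
            continue
        yield Service(item=sector)


section = [
    ["West", "100", "100"],
    ["East", "197", "200"],
    ["North", "0", "50"],
    ["South", "0", "0"],  # hypothetical empty sector, added for this example
]

print([service.item for service in discover_foobar(section)])
```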

Then later, when the host is being monitored, the check function is called separately for each service — and thus for each item. Therefore, in addition to the section, it also receives the item argument with the item it is looking for. Now we go through all of the lines one after the other. When doing so, we will find the line that corresponds to the desired item:

def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            ...

Now all that is missing is the actual logic which determines when the item should in fact be OK, WARN or CRIT. We do that here like this:

  • When all slots have been used, the state becomes CRIT.

  • If 10 or fewer slots are still free, it becomes WARN.

  • Otherwise it is OK.

The occupied and total slots always appear as the second and third words in each line. However, here we are dealing with strings, not numbers — but we need numbers to be able to compare and calculate. We therefore convert the strings into numbers using int().

We then return the check result by supplying an object of type Result via yield. This takes the parameters state and summary:

def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)   # convert string to int
            slots = int(slots)   # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return

In this context, please note the following:

  1. The command return ensures that the check function is terminated immediately after processing the found item. There is nothing more to be done, after all.

  2. If the loop is processed without finding the item being searched for, Checkmk automatically generates the result UNKNOWN - Item not found in monitoring data. This is intentional and a good thing. Do not handle this case yourself. If you don’t find an item you are looking for, just let Python run its course through the function and let Checkmk do its work.

  3. With the argument summary you define the text that the service displays as its status output. This is purely informational and will not be evaluated further by Checkmk.

By the way, for the common situation where you want to check a simple metric for thresholds, there is a helper function called check_levels. This helper function is explained in the Check API documentation, which you can access in Checkmk via Help > Plugin API reference > Agent based API ("Check API").

Now let’s first try out the discovery. For the sake of clarity we will restrict this action to our plug-in by using the --detect-plugins=foobar option:

OMD[mysite]:~$ cmk --detect-plugins=foobar -vI myhost123
  3 foobar
SUCCESS - Found 3 services, 1 host labels

And now right away we can test the checking process (here also limited to foobar):

OMD[mysite]:~$ cmk --detect-plugins=foobar -v myhost123
Foobar Sector East   WARN - used 197 out of 200 slots
Foobar Sector North  OK - used 0 out of 50 slots
Foobar Sector West   CRIT - used 100 out of 100 slots
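The three states in this output follow directly from the rules of the check function. As a rough cross-check outside of Checkmk, the decision can be isolated in a small helper. The name slot_state is hypothetical and the states are plain strings here instead of State objects:

```python
def slot_state(used, slots):
    """Same decision logic as in check_foobar, with plain strings as states."""
    if used == slots:
        return "CRIT"
    if slots - used <= 10:
        return "WARN"
    return "OK"


# The example section from above
section = [["West", "100", "100"], ["East", "197", "200"], ["North", "0", "50"]]
states = {
    sector: slot_state(int(used), int(slots))
    for sector, used, slots in section
}
print(states)
```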

3.3. The example — a recap

And here again our example in full. To avoid errors due to undefined function names, the functions must always be defined before registering.

local/lib/check_mk/base/plugins/agent_based/foobar.py
from .agent_based_api.v1 import *
import pprint


def discover_foobar(section):
    for sector, _used, _slots in section:
        yield Service(item=sector)


def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return


register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
)

4. Performance values

4.1. Determining values in the check function

Not always, but very often checks work with numbers. With its graphing system Checkmk has a component to store, evaluate and display such numbers. This works completely independently from the generation of any resulting OK, WARN or CRIT states.

Such measured values — or metrics — are determined by the check function and simply returned as an additional result. For this purpose the Metric object is used, which requires at least the two arguments name and value. Here is an example:

    yield Metric("fooslots", used)

4.2. Threshold information

Furthermore there are two optional arguments. With the argument levels you can provide information about thresholds for WARN and CRIT, in the form of a pair of two numbers. This is then usually plotted on the graph as a yellow and a red line. The first number (yellow line) represents the warning threshold, the second (red line) the critical one. The convention is that the check already goes to WARN when the warning threshold is reached (analogous for CRIT).

The coding could then look like this (here with hard coded thresholds):

    yield Metric("fooslots", used, levels=(190,200))

Notes:

  • If only one of the two thresholds is to be defined, simply enter None for the other one, e.g. levels=(None, 200).

  • Floating point numbers are also allowed, but not strings.

  • Attention: the check function itself is responsible for checking the thresholds. The specification of levels serves only as marginal information for the graphing system!
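Since the state has to be determined by the check function itself, it can help to derive it from the same pair of numbers that is passed as levels, so that graph and state cannot drift apart. The following is only a sketch: state_for is a hypothetical helper and not part of the Check API, and the states are plain strings. It follows the convention described above that a threshold counts as soon as it is reached, and it accepts None for a missing threshold.

```python
def state_for(value, levels):
    """Return "OK", "WARN" or "CRIT" for upper levels given as (warn, crit)."""
    warn, crit = levels
    if crit is not None and value >= crit:
        return "CRIT"
    if warn is not None and value >= warn:
        return "WARN"
    return "OK"


levels = (190, 200)
print(state_for(150, levels), state_for(190, levels), state_for(200, levels))
print(state_for(195, (None, 200)))  # no warning threshold defined
```

The same levels tuple could then be handed to Metric and to such a helper.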

4.3. The value range

Analogous to the threshold values, you can also provide the graphing system with information about the range of possible values. This denotes the smallest and largest possible value. This is done with the boundaries argument. Here too, None can be used for one of the two boundaries.

Example:

    yield Metric(name="fooslots", value=used, boundaries=(0, 200))

And now once again our check function from the above example, but this time with the return of metric information including threshold values and a value range (this time of course not with fixed but with calculated values):

def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int

            yield Metric(
                "fooslots",
                used,
                levels=(slots-10, slots),
                boundaries=(0, slots))

            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return

5. Checks with multiple partial results

In order to prevent the number of services on a host from growing out of all control, several partial results are often combined in a single service. For example, the Memory used service under Linux checks not only RAM and swap usage, but also shared memory, page tables and various other information.

The API provided by Checkmk offers a very convenient interface for this. In this way, a check function may simply generate a result with yield any number of times. The overall status for the service is then based on the 'worst' partial result according to the scheme OK → WARN → UNKNOWN → CRIT.

Here is an abbreviated, fictitious example:

def check_foobar(section):
    yield Result(state=State.OK, summary="Knulf rate optimal")
    # ...
    yield Result(state=State.WARN, summary="Gnarz required")
    # ...
    yield Result(state=State.OK, summary="Last Szork was good")

The summary of the service in the GUI then looks like this: "Knulf rate optimal, Gnarz required WARN, Last Szork was good". And the overall status will be WARN.

You can return multiple metrics in the same way. Simply call yield Metric(…​) once for each metric.
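The 'worst' state is not simply the highest number, because UNKNOWN ranks between WARN and CRIT. The following sketch illustrates the ordering. The State class is a stand-in that uses the conventional Nagios-style numbers, and worst is a hypothetical helper written for this illustration, not Checkmk's own code:

```python
from enum import Enum


class State(Enum):
    """Stand-in with the conventional numeric states (assumption for this sketch)."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


# Severity order from the text: OK, WARN, UNKNOWN, CRIT (not the numeric order)
SEVERITY = [State.OK, State.WARN, State.UNKNOWN, State.CRIT]


def worst(states):
    """Return the most severe of the given states."""
    return max(states, key=SEVERITY.index)


print(worst([State.OK, State.WARN, State.OK]))
print(worst([State.UNKNOWN, State.CRIT]))
print(worst([State.WARN, State.UNKNOWN]))
```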

6. Summary and Details

In the Checkmk monitoring, each service also has a line of text in addition to its status OK, WARN, and so on. Up until version 1.6.0 this was called Output of check plugin. As of version 2.0.0 this is now called Summary — so it has the task of providing a concise summary of the status. The idea is that this text does not exceed a length of 60 characters, so that it is always easy to read and ensures a clear table display without annoying line breaks.

Next to this there is the Details field, which used to be called Long output of check plugin (multiline). Here all of the details of the state are displayed, the idea being that all of the summary information is included here as well.

When calling yield Result(…​) you can determine which information is so important that it should be displayed in the summary and for which information it is sufficient that it appears in the details. The default rule is that partial results that lead to a WARN/CRIT will always be visible in the summary.

In our examples so far we have always used the following call:

    yield Result(state=State.OK, summary="some important text")

This will cause some important text to always appear in the Summary (and additionally in the Details), so you should only use this for important information. If a partial result is of secondary importance, replace summary with notice, and the text will appear only in the details as long as the state is OK.

    yield Result(state=State.OK, notice="some additional text")

If the state is WARN or CRIT, the text will then automatically appear as an addition in the summary:

    yield Result(state=State.CRIT, notice="some additional text")

Thus, in the summary it will be immediately clear why the service is not OK.

Last but not least, you have — for both summary and notice — the possibility to specify an alternative text for the details, which may contain more information about the partial result:

    yield Result(state=State.OK,
                 summary="55% used space",
                 details="55.2% of 160 GB used (82 GB)")

To summarize, this means:

  • The full text of the summary (for services that are OK) should not exceed 60 characters.

  • Always use either summary or notice — not both, and not neither.

  • Add details as necessary if you want the details text to be an alternative one.
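The rules above can be modelled in a few lines. This is a simplified illustration only: partial results are plain dictionaries, render_service is a hypothetical name, and the way Checkmk actually assembles and decorates the texts may differ in detail.

```python
def render_service(results):
    """Build (summary, details) from partial results according to the rules above."""
    summary_parts, detail_lines = [], []
    for result in results:
        is_notice = "notice" in result
        text = result["notice"] if is_notice else result["summary"]
        # A summary text is always shown; a notice only if the state is not OK.
        if not is_notice or result["state"] != "OK":
            summary_parts.append(text)
        # The details use the alternative text if one was given.
        detail_lines.append(result.get("details", text))
    return ", ".join(summary_parts), "\n".join(detail_lines)


results = [
    {"state": "OK", "summary": "55% used space",
     "details": "55.2% of 160 GB used (82 GB)"},
    {"state": "OK", "notice": "some additional text"},
    {"state": "CRIT", "notice": "Gnarz required"},
]
summary, details = render_service(results)
print(summary)
print(details)
```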

7. Error handling

7.1. Exceptions and crash reports

The correct handling of errors (unfortunately) consumes a large chunk of the programming work. The good news is that the Checkmk API already does most of the work for you. Consequently, in most cases, it is important for you to simply not deal with errors.

When Python gets into a situation that is in some way unexpected, it responds with what is known as an exception. Here are a few examples:

  • You convert a string into a number with int(…​), but the string does not actually include a number, e.g. int("foo").

  • You access the fifth element of bar with bar[4], but this in fact has only four elements.

  • You are calling a function that does not exist.

Here the important general rule applies: Don’t catch exceptions yourself! Checkmk will always do this for you in a consistent and efficient way, in most cases accompanied by a crash report. Such a report will look like this, for example:

[Screenshot: crash report]

By clicking on the crash icon, the user is taken to a page where they can

  • view a display of the file in which the crash took place.

  • get all information about the crash, for instance any error messages, call stack, agent output, the current values of local variables and much more.

  • send the report to us (Checkmk GmbH) as feedback.

Submitting the report to Checkmk GmbH of course only makes sense for check plug-ins which are official Checkmk components. But you can also ask your users to simply send you the data. The users can then help you to find the error. It is often the case that a check plug-in works for you, but other users may experience sporadic errors. Working together you can then usually identify these problems very easily.

But if you were to intercept the exception yourself, all of this information would simply be unavailable. You would perhaps set the service to UNKNOWN and issue an error message, but all of the background circumstances that led to an error (e.g. the data from the agent) would simply be invisible.

7.2. Viewing exceptions on the command line

If you run your plug-in on the command line, no crash reports will be generated — you will only see the summarized error message:

OMD[mysite]:~$ cmk -II --detect-plugins=foobar myhost123
  WARNING: Exception in discovery function of check plugin 'foobar': invalid literal for int() with base 10: 'foo'

BUT: if you simply append the --debug option to this, you will then receive the Python stack trace:

OMD[mysite]:~$ cmk --debug -II --detect-plugins=foobar myhost123
Traceback (most recent call last):
  File "/omd/sites/myhost123/bin/cmk", line 82, in <module>
    exit_status = modes.call(mode_name, mode_args, opts, args)
  File "/omd/sites/myhost123/lib/python3/cmk/base/modes/__init__.py", line 68, in call
    return handler(*handler_args)
  File "/omd/sites/myhost123/lib/python3/cmk/base/modes/check_mk.py", line 1577, in mode_discover
    discovery.do_discovery(set(hostnames), options.get("checks"), options["discover"] == 1)
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 345, in do_discovery
    _do_discovery_for(
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 397, in _do_discovery_for
    discovered_services = _discover_services(
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1265, in _discover_services
    service_table.update({
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1265, in <dictcomp>
    service_table.update({
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1337, in _execute_discovery
    yield from _enriched_discovered_services(hostname, check_plugin.name, plugins_services)
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1351, in _enriched_discovered_services
    for service in plugins_services:
  File "/omd/sites/myhost123/lib/python3/cmk/base/api/agent_based/register/check_plugins.py", line 69, in filtered_generator
    for element in generator(*args, **kwargs):
  File "/omd/sites/myhost123/local/lib/python3/cmk/base/plugins/agent_based/foobar.py", line 5, in discover_foobar
    int("foo")
ValueError: invalid literal for int() with base 10: 'foo'

7.3. Invalid outputs from an agent

The question is how to react when the output from the agent is not in the form you would normally expect — whether it is from the 'real' agent or when the data comes via SNMP. Let’s assume that you always expect three words per line. What should you do if only two words were to arrive?

Now, if this is a permitted and familiar agent behaviour, then of course you need to handle it and distinguish between the cases.

If, however, this is not actually allowed …​ then it is best to treat the line as if it always contains three words, e.g. with:

def check_foobar(section):
    for foo, bar, baz in section:
        ...  # process foo, bar and baz

If there should ever be a line that does not consist of exactly three words, a nice exception will be generated and you will receive the very helpful crash report that was just mentioned.
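What this looks like in plain Python can be tried out in a standalone script. In this sketch parse_rows is a hypothetical helper and the shortened second line is made-up data. The exception is caught here only in order to display its message; inside a plug-in you would let it propagate so that Checkmk can create the crash report.

```python
def parse_rows(section):
    # Deliberately no try/except: a malformed line is meant to raise.
    return [(foo, bar, baz) for foo, bar, baz in section]


good = [["West", "100", "100"]]
bad = [["West", "100", "100"], ["East", "197"]]  # second line has only two words

rows = parse_rows(good)

# Caught only to show the message in this standalone demonstration.
try:
    parse_rows(bad)
    error = None
except ValueError as exc:
    error = str(exc)

print(rows)
print(error)
```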

7.4. Missing items

What if the agent outputs data correctly, but the item to be checked is missing? So, like this, for example:

def check_foobar(item, section):
    for sector, used, slots in section:
        if item == sector:
            # ... Check state ...
            yield Result(...)
            return

If the item you are looking for is not there, the loop runs to completion and Python simply reaches the end of the function without yielding a result. And that is exactly the correct behaviour: Checkmk recognizes that the item to be monitored is missing, sets the status to UNKNOWN and generates a suitable standard text for it.
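The effect can be observed outside of Checkmk: for a missing item the generator simply produces nothing. In this sketch Result is a stand-in for the class from the Check API, the state is a plain string, and the item South is made up:

```python
from collections import namedtuple

# Stand-in for the Result class of the Check API (assumption for this sketch)
Result = namedtuple("Result", ["state", "summary"])


def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            yield Result(state="OK", summary=f"used {used} out of {slots} slots")
            return


section = [["West", "100", "100"], ["East", "197", "200"]]
present = list(check_foobar("East", section))
missing = list(check_foobar("South", section))  # no such sector in the data
print(len(present), len(missing))
```

The empty list for South is what Checkmk interprets as a missing item.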

8. SNMP-based checks

8.1. The fundamentals

Developing checks that work with SNMP is very similar to agent-based checks, except that you still need to specify which SNMP ranges (OIDs) the check requires. If you don’t as yet have any experience with SNMP, we strongly recommend reading the article on Monitoring via SNMP as preparation at this point.

The process of discovery and checking via SNMP is somewhat different from that for a normal agent. Unlike there, where the agent sends all of the relevant information on its own, with SNMP we have to say exactly which data ranges we require. A complete dump of all data would be theoretically possible (via SNMP walk), but this process can take minutes for fast devices and over an hour for complex switches. Thus, this is not viable for checking or even for discovery. Checkmk therefore proceeds in a more targeted manner.

SNMP detection

Service detection is divided into two phases. First, the SNMP detection is performed. This determines which plug-ins on the respective device are of actual interest. To do this, a few SNMP OIDs are retrieved individually, without a walk. The most important of these is the sysDescr (OID: 1.3.6.1.2.1.1.1.0). Under this OID, each SNMP device holds a description of itself, e.g. ‘Cisco NX-OS(tm) n5000, Software (n5000-uk9),…​’.

Based on this text, you can already define for very many plug-ins whether they can be useful for this application. If the text is still not specific enough, further OIDs are fetched and checked. The result of the SNMP detection will then be a list of candidates for check plug-ins.

Discovery

In the second step, the necessary monitoring data is fetched for each of these candidates with SNMP walks. These are then combined into a table and provided to the check’s discovery function with the section argument, which then as usual determines the items to be monitored.

Checking

When running checks, it is already known which plug-ins are to be executed for the device, so the SNMP detection is omitted. Here the monitoring data needed for the plug-ins is fetched directly by SNMP walks and used to fill the section argument for the check function.

Summary

So what do you need to do differently with an SNMP check compared to an agent-based one?

  1. You do not need a plug-in for the agent.

  2. You must define the single OIDs and search texts required for an SNMP detection.

  3. You have to define which SNMP areas must be fetched for monitoring.

8.2. A word about the MIBs

Before we continue, we want to say a word about the infamous SNMP MIBs, because there are many prejudices about these. Right at the beginning, some good news: Checkmk doesn’t need them. Really! But they are an important aid in being able to develop an SNMP check.

So what is a MIB? Literally, the abbreviation means Management Information Base — somewhat meaningless really. To be concrete, a MIB is a quite easy to read text file which describes a certain subtree in the SNMP world. Namely, it states which branch in the tree — that is, which OID — has which meaning. This includes a name for the OID, a comment on what values it can take (e.g. for enumerated data types, where things like 1=up, 2=down, etc. are defined) and sometimes a useful comment.

Checkmk provides a set of freely-available MIB files. These describe very general areas in the global OID tree, but do not contain any vendor-specific areas. Therefore they are not of much help for self-developed checks.

So try to find the MIB files relevant for your particular device somewhere on the manufacturer’s web pages or even on the device’s management interface, and install these in the Checkmk site in local/share/check_mk/mibs. You can then have SNMP walks convert OID numbers to names and can thus more quickly find where the data of interest for your monitoring is located. Also, MIBs, if done carefully, contain interesting information in their comments, as we noted above. You can easily read an MIB file with a text editor or with less.

8.3. Locating the correct OIDs

The crucial prerequisite for developing a plug-in is, of course, that you know which OIDs contain the necessary information. The first step in doing this (if the device doesn’t refuse) is to perform a complete SNMP walk. This will retrieve all of the available data via SNMP.

Checkmk can accomplish this very easily for you. To do so, first include the device (or one of the devices) for which you want to develop a plug-in into your monitoring. Let’s say this device is called mydevice01. Make sure that the device’s basic functions can be monitored. As a minimum, the SNMP Info and Uptime services need to be found, and probably at least one Interface as well. This is how you make sure that SNMP access works cleanly.

Then switch to the command line in the Checkmk site. Here you can perform a complete walk with the following command. We recommend using the -v (verbose) option when doing this for the very first time:

OMD[mysite]:~$ cmk -v --snmpwalk mydevice01
mydevice01:
Walk on ".1.3.6.1.2.1"...3898 variables.
Walk on ".1.3.6.1.4.1"...6025 variables.
Wrote fetched data to /omd/sites/mysite/var/check_mk/snmpwalks/mydevice01.

As mentioned earlier, such a complete walk can take minutes or even hours, although the latter is rare. So don’t become nervous if it takes a while to complete this process. The walk will be saved in the file var/check_mk/snmpwalks/mydevice01. This will be an easily readable text file that starts something like this:

var/check_mk/snmpwalks/mydevice01
.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots
.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.11863.1.1.3
.1.3.6.1.2.1.1.3.0 546522419
.1.3.6.1.2.1.1.4.0 hh@example.com
.1.3.6.1.2.1.1.5.0 sw-ks-01
.1.3.6.1.2.1.1.6.0 Core Switch Serverraum klein
.1.3.6.1.2.1.1.7.0 3
.1.3.6.1.2.1.2.1.0 27

In each line there is an OID and then its value. And right there in the first line you find the most important one, namely the sysDescr.
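Because of this simple structure, a stored walk is easy to search with a few lines of Python. The following sketch assumes exactly one OID and value per line, separated by the first space, and does not deal with values that span several lines. parse_walk is a hypothetical helper, and the sample lines are taken from the excerpt above:

```python
def parse_walk(lines):
    """Split each 'OID value' line at the first space."""
    table = {}
    for line in lines:
        oid, _, value = line.rstrip("\n").partition(" ")
        table[oid] = value
    return table


walk = [
    ".1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots",
    ".1.3.6.1.2.1.1.5.0 sw-ks-01",
    ".1.3.6.1.2.1.2.1.0 27",
]
table = parse_walk(walk)

# Which OIDs carry a value that mentions "switch"?
hits = [oid for oid, value in table.items() if "switch" in value.lower()]
print(hits)
```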

Now the OIDs themselves are not very informative. If the correct MIBs are installed, you can have them converted to names in a second step with the cmk --snmptranslate command. It is best to redirect the result — which would otherwise appear in the terminal — to a file:

OMD[mysite]:~$ cmk --snmptranslate mydevice01  > translated
Processing 9923 lines.
finished.

The translated file reads like the original walk, but has a translated value for the OID on each line after the -->:

translated
.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots --> SNMPv2-MIB::sysDescr.0
.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.11863.1.1.3 --> SNMPv2-MIB::sysObjectID.0
.1.3.6.1.2.1.1.3.0 546522419 --> DISMAN-EVENT-MIB::sysUpTimeInstance
.1.3.6.1.2.1.1.4.0 hh@example.com --> SNMPv2-MIB::sysContact.0
.1.3.6.1.2.1.1.5.0 sw-ks-01 --> SNMPv2-MIB::sysName.0
.1.3.6.1.2.1.1.6.0 Core Switch Serverraum klein --> SNMPv2-MIB::sysLocation.0
.1.3.6.1.2.1.1.7.0 3 --> SNMPv2-MIB::sysServices.0
.1.3.6.1.2.1.2.1.0 27 --> IF-MIB::ifNumber.0
.1.3.6.1.2.1.2.2.1.1.1 1 --> IF-MIB::ifIndex.1
.1.3.6.1.2.1.2.2.1.1.2 2 --> IF-MIB::ifIndex.2

Example: the OID .1.3.6.1.2.1.1.4.0 has the translated name SNMPv2-MIB::sysContact.0. This is an important hint — the rest is then practice, experience and of course experimentation.

8.4. Registering the SNMP section

So, once you have determined the necessary OIDs, it’s on to the actual development of the plug-in. This is done in three steps:

  1. For an SNMP detection, specify which OIDs must contain which texts for your plug-in to run.

  2. Declare which OID branches need to be fetched for the monitoring.

  3. Write a check plug-in analogous to those for agent-based checks.

The first two steps are performed by registering an SNMP section. You do this by calling register.snmp_section(). Here you specify at least three arguments: the name of the section (name), the details for the SNMP detection (detect), and the OID branches needed for the actual monitoring (fetch). Here is an example of a hypothetical check plug-in with the name foo:

local/lib/check_mk/base/plugins/agent_based/foo.py
register.snmp_section(
    name = "foo",
    detect = startswith(".1.3.6.1.2.1.1.1.0", "foobar device"),
    fetch = SNMPTree(
        base = '.1.3.6.1.4.1.35424.1.2',
        oids = [
            '4.0',
            '5.0',
            '8.0',
        ],
    ),
)

The SNMP detection

With the keyword detect you specify under which conditions the discovery function should be executed. In our example, this is the case if the value of the OID .1.3.6.1.2.1.1.1.0 (i.e. the sysDescr) starts with the text foobar device (case-insensitive in principle). In addition to startswith, there are a number of other possible attributes. There is also a negated form of each of these, which begins with not_:

Attribute | Negation | Function
equals(oid, needle) | not_equals(oid, needle) | The value of the OID is equal to the text needle
contains(oid, needle) | not_contains(oid, needle) | The value of the OID at some point contains the text needle
startswith(oid, needle) | not_startswith(oid, needle) | The value of the OID starts with the text needle
endswith(oid, needle) | not_endswith(oid, needle) | The value of the OID ends with the text needle
matches(oid, regex) | not_matches(oid, regex) | The value of the OID matches the regular expression regex, anchored at the start and at the end, so with an exact match. If you only need a substring, just add another .*
exists(oid) | not_exists(oid) | Met if the OID is available on the device. The value may be empty.

As well as the above, there is also the possibility of linking multiple tests with all_of or any_of. The option all_of requires all of the listed conditions to be met for a positive detection of the plug-in. The following example finds the plug-in on a device if sysDescr starts with the text foo (or FOO or Foo) and the OID .1.3.6.1.2.1.1.2.0 contains the text .4.1.11863.:

detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),
    contains(".1.3.6.1.2.1.1.2.0", ".4.1.11863.")
)

The any_of option, on the other hand, is satisfied if any single one of the criteria is met. Here is an example where different values are allowed for the sysDescr keyword:

detect = any_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 3 system"),
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 4 system"),
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 4.1 system"),
)

By the way, are you familiar with regular expressions? If so, you could probably simplify this whole thing and get by with just a single line. Note that the dot in 4.1 is escaped so that it only matches a literal dot:

detect = matches(".1.3.6.1.2.1.1.1.0", r"FOO Version (3|4|4\.1) .*"),
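If you want to try out such an expression before putting it into a plug-in, Python's re module is sufficient. This sketch assumes that matches behaves as described above, i.e. anchored at both ends and case-insensitive. would_match is a hypothetical helper and not a function from the Check API:

```python
import re


def would_match(value, regex):
    """Anchored, case-insensitive match, as described for matches() in the text."""
    return re.fullmatch(regex, value, re.IGNORECASE) is not None


pattern = r"FOO Version (3|4|4\.1) .*"

print(would_match("foo version 4.1 system", pattern))
print(would_match("foo version 5 system", pattern))
# Anchoring: leading text prevents a match unless the pattern allows for it
print(would_match("my foo version 3 system", pattern))
print(would_match("my foo version 3 system", ".*" + pattern))
```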

And one more important note: The OIDs you specify in the detect declaration of a plug-in will, in case of doubt, be fetched from every device that is monitored via SNMP. Therefore, be very sparing in your use of vendor-specific OIDs. Try to base your detection exclusively on the sysDescr (.1.3.6.1.2.1.1.1.0) and the sysObjectID (.1.3.6.1.2.1.1.2.0). If you still need another different OID, then reduce the number of devices where it is requested to a minimum by excluding as many devices as possible beforehand using the sysDescr, e.g. like this:

detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),   # first check sysDescr
    contains(".1.3.6.1.4.1.4455.1.3", "bar"),  # fetch vendor specific OID
)

The all_of() works in such a way that if the first condition fails, the second is not even tried (and thus the OID in question is not fetched). Here in the example, the OID .1.3.6.1.4.1.4455.1.3 is fetched only for those devices that have foo in their sysDescr.
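This short-circuit behaviour can be illustrated with a toy model. The toy_ functions below are simplified re-implementations written only for this sketch; they are not how Checkmk implements its detection, and the device data is made up. The model records which OIDs would have to be fetched from a device:

```python
SYS_DESCR = ".1.3.6.1.2.1.1.1.0"
VENDOR_OID = ".1.3.6.1.4.1.4455.1.3"


def toy_startswith(oid, needle):
    return lambda get: get(oid).lower().startswith(needle.lower())


def toy_contains(oid, needle):
    return lambda get: needle.lower() in get(oid).lower()


def toy_all_of(*conditions):
    # all() over a generator stops at the first condition that fails.
    return lambda get: all(condition(get) for condition in conditions)


def detect_on(device, condition):
    """Evaluate a condition and record which OIDs had to be fetched."""
    fetched = []

    def get(oid):
        fetched.append(oid)
        return device.get(oid, "")

    return condition(get), fetched


spec = toy_all_of(
    toy_startswith(SYS_DESCR, "foo"),   # first check sysDescr
    toy_contains(VENDOR_OID, "bar"),    # vendor-specific OID
)
other_device = {SYS_DESCR: "Some other switch"}
foo_device = {SYS_DESCR: "Foo appliance", VENDOR_OID: "barometer"}

print(detect_on(other_device, spec))
print(detect_on(foo_device, spec))
```

For the device whose sysDescr does not start with foo, only the sysDescr is requested.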

What happens if you have made the declaration incorrectly or at least not quite on target?

  • If the detection erroneously detects devices that do not have the necessary OIDs, your discovery function will not generate any services — so nothing 'bad' will happen. However, this will slow down the discovery on such devices, because now every time it will pointlessly try to retrieve the corresponding OIDs.

  • If the detection does not detect devices that are actually allowed, during the discovery no services will be found in the monitoring.

8.5. The OID ranges for monitoring

The most important part of the SNMP declaration is the specification of which OIDs are to be fetched for the monitoring. In almost all cases, a plug-in only needs selected branches from a single table to do this. Let’s consider the following example:

    fetch = SNMPTree(
        base = '.1.3.6.1.4.1.35424.1.2',
        oids = [
            '4.0',
            '5.0',
            '8.0',
        ],
    ),

The keyword base specifies an OID prefix here. All of the necessary data lies below it. Under oids you then specify a list of sub-OIDs to be fetched from there. In the above example, a total of three SNMP walks are then made, namely starting from the OIDs .1.3.6.1.4.1.35424.1.2.4.0 and .1.3.6.1.4.1.35424.1.2.5.0 and .1.3.6.1.4.1.35424.1.2.8.0. It is important that these walks fetch the same number of variables and that they also correspond to each other. This means that, for example, the nth element from each of the walks corresponds to the same monitored object.

Here is an example from the check plug-in snmp_quantum_storage_info:

    fetch = SNMPTree(
       base=".1.3.6.1.4.1.2036.2.1.1",  # qSystemInfo
       oids=[
           "4",   # qVendorID
           "5",   # qProdId
           "6",   # qProdRev
           "12",  # qSerialNumber
       ],
    ),

Here, the vendor ID, the product ID, the product revision and the serial number are retrieved from each storage device.

The discovery and check functions are presented with this data as a table, i.e. as a list of lists. The table is transposed so that each entry in the outer list holds all of the data for one item. Each entry has as many elements as you specified in oids. This allows you to loop through the list in a very practical way, e.g.:

    for vendor_id, prod_id, prod_rev, serial_number in section:
        ...

Please note:

  • All entries are strings, even if the OIDs in question are actually numbers.

  • Missing OIDs are presented as empty strings.

  • Remember the ability to output formatted data during development with pprint.
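How the individual walks turn into such a table can be pictured as a transposition. The following is a simplified model with made-up values, not Checkmk's actual implementation: each sub-OID delivers one column, and the nth entries of all columns together form the nth row.

```python
# One list of values per sub-OID, as the separate walks would deliver them
# (hypothetical sample data for two storage devices)
columns = {
    "4": ["ACME", "ACME"],      # qVendorID
    "5": ["DiskA", "DiskB"],    # qProdId
    "6": ["1.0", "1.2"],        # qProdRev
    "12": ["SN001", "SN002"],   # qSerialNumber
}

# Transpose: the nth value of every column ends up in the nth row
section = [list(row) for row in zip(*columns.values())]

for vendor_id, prod_id, prod_rev, serial_number in section:
    print(vendor_id, prod_id, prod_rev, serial_number)
```

Note that even the revision numbers are strings here, just as described in the list above.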

8.6. Other SNMP special features

We will describe here in the future:

  • How to retrieve multiple independent SNMP areas.

  • What OIDEnd() is all about

  • Other special cases when dealing with SNMP

9. Formatting numbers

9.1. The basics

In the summary or details for a service, numbers are often output. To make it as easy as possible for you to format them nicely and correctly, and also to standardize the output from all check plug-ins, there are helper functions for rendering different kinds of sizes. All of these are sub-functions of the render module and are consequently called with the prefix render. (including the dot). For example, render.bytes(2000) results in the text 1.95 KiB.

What all of these functions have in common is that they get their values in a so-called canonical or natural unit. This means that you never have to think about conversions, and no conversion difficulties or errors can arise. For example, times are always given in seconds, and the sizes of hard disks, files, etc. are always given in bytes and not in kilobytes, kibibytes, blocks or any other confusion of units.

Please use these functions even if you don’t like the display so much. After all, this is then consistent for the user. And future versions of Checkmk may be able to change the display or even make it configurable for the user. Your check plug-in will then also benefit from this.

Following the detailed description of all of the display functions (render functions), you will find a summary of these in the form of a clear table.

9.2. Times, time spans, frequencies

Absolute time specifications (timestamps) are formatted with render.date() or render.datetime(). The specifications are always in seconds from January 1, 1970, 00:00:00 UTC — the so-called epoch time. This is also the format used by the Python function time.time(). The advantage of this representation is that it can be used to calculate very easily, for example, the duration of an operation if the start and end times are known. The formula is then simply duration = end - start. These calculations also work independently of the time zone, daylight saving time changes or leap years.

render.date() outputs only the date, render.datetime() adds the time. The output is done according to the current time zone for the Checkmk server that is running the check!

Examples:

Call                           Output
render.date(0)                 Jan 01 1970
render.datetime(0)             Jan 01 1970 01:00:00
render.date(1600000000)        Sep 13 2020
render.datetime(1600000000)    Sep 13 2020 14:26:40

Now please don’t be surprised that render.datetime(0) outputs 01:00 as the time instead of 00:00. This is because we are writing this manual in the time zone for Germany, which is one hour ahead of UTC during standard time (and January 1 does not fall within European daylight saving time).

For time spans there is the function render.timespan(). It takes a duration in seconds and outputs it in a human-readable form. For larger time spans, seconds or minutes are omitted.
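The relationship between epoch time, durations and time zones can be reproduced with the Python standard library alone. The following is a minimal sketch and does not use the render module; the fixed offsets UTC+1 and UTC+2 are assumptions that stand in for the German time zone during standard time and daylight saving time, and the end timestamp is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Epoch timestamps count seconds since 1970-01-01 00:00:00 UTC.
start = 1600000000
end = 1600012345  # hypothetical end of an operation

# A duration is a plain subtraction, independent of any time zone.
duration = end - start

# Fixed offsets stand in for the server's time zone (assumption for this sketch).
cet = timezone(timedelta(hours=1))   # German standard time
cest = timezone(timedelta(hours=2))  # German daylight saving time

fmt = "%Y-%m-%d %H:%M:%S"
epoch_utc = datetime.fromtimestamp(0, tz=timezone.utc).strftime(fmt)
epoch_cet = datetime.fromtimestamp(0, tz=cet).strftime(fmt)
sept_cest = datetime.fromtimestamp(start, tz=cest).strftime(fmt)

print(duration)   # 12345
print(epoch_utc)  # 1970-01-01 00:00:00
print(epoch_cet)  # 1970-01-01 01:00:00
print(sept_cest)  # 2020-09-13 14:26:40
```

The same timestamp yields a different wall-clock time depending on the offset, while the computed duration stays the same.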

Call                         Output
render.timespan(1)           1 second
render.timespan(123)         2 minutes 3 seconds
render.timespan(12345)       3 hours 25 minutes
render.timespan(1234567)     14 days 6 hours
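How a number of seconds breaks down into larger units can be illustrated with integer division. The function below is a hypothetical stand-in, not the code behind render.timespan(); it only aims to reproduce the four examples in the table by keeping the two largest non-zero units.

```python
def sketch_timespan(seconds):
    """Render a duration using its two largest non-zero units (illustrative only)."""
    units = [("day", 86400), ("hour", 3600), ("minute", 60), ("second", 1)]
    parts = []
    remaining = int(seconds)
    for name, size in units:
        count, remaining = divmod(remaining, size)
        if count:
            parts.append(f"{count} {name}{'' if count == 1 else 's'}")
    # Keep only the two most significant units, so that seconds or minutes
    # disappear for larger time spans.
    return " ".join(parts[:2]) or "0 seconds"


for value in (1, 123, 12345, 1234567):
    print(value, "->", sketch_timespan(value))
```

For inputs other than those in the table, the real function may round or abbreviate differently.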

A frequency is effectively the reciprocal of time. The canonical unit is Hz, which is equivalent to once per second. One application is, for example, the clock rate of a CPU:

Call                               Output
render.frequency(111222333444)     111 GHz

9.3. Bytes

Whenever memory, files, hard disks, file systems and the like are concerned, the canonical unit is the byte. Since computers usually organize such things in powers of two, e.g. in units of 512, 1024 or 65536 bytes, it was accepted from the beginning that a kilobyte is not 1000 but 1024 bytes. This was very practical in itself, because it mostly produced round numbers. The legendary Commodore C64 had 64 kilobytes of memory and not 65.536.

Unfortunately, at some point hard disk manufacturers came up with the idea of specifying the sizes of their disks in units of 1000. Since the difference between 1000 and 1024 is 2.4% for each prefix step, and these factors multiply, a disk of 1 GB in the old sense (1024 × 1024 × 1024 bytes) suddenly becomes 1.07 GB. That sells better.

This annoying confusion persists to this day and continues to cause errors. As a remedy, new prefixes based on the binary system were defined by the International Electrotechnical Commission. Accordingly, nowadays a kilobyte is officially 1000 bytes, and a kibibyte is 1024 bytes (2 to the power of 10). In addition, one should say mebibyte, gibibyte and tebibyte (ever heard of these?). The abbreviations are KiB, MiB, GiB and TiB (note that the second letter is always an i).
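The growing gap between the binary and the decimal interpretation of a prefix is simple arithmetic. The following sketch computes the ratio for each prefix step; the variable names are chosen for illustration only.

```python
# Ratio between binary (1024-based) and decimal (1000-based) units per prefix.
prefixes = ["kilo", "mega", "giga", "tera"]
ratios = {}
for power, prefix in enumerate(prefixes, start=1):
    ratios[prefix] = 1024**power / 1000**power

for prefix, ratio in ratios.items():
    print(f"{prefix}: factor {ratio:.3f}")
```

At the giga step the factor is roughly 1.07, which is where the 1.07 GB in the example comes from.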

Checkmk adapts itself to this standard and helps you with multiple adapted render functions so that you can always produce correct outputs. So specifically for hard disks and file systems there is the render.disksize() function, which gives the output in powers of 1000.

Call                         Output
render.disksize(1000)        1.00 kB
render.disksize(1024)        1.02 kB
render.disksize(2000000)     2.00 MB

For the sizes of files it is common to specify the exact size in bytes without rounding. This has the advantage that you can see very quickly if a file has changed even minimally or that two files are (probably) the same. The render.filesize() function is responsible for this:

Call                         Output
render.filesize(1000)        1,000 B
render.filesize(1024)        1,024 B
render.filesize(2000000)     2,000,000 B

If you want to output a size that is not a disk or a file size, just use the generic render.bytes(). This will give you the output in the classic 1024’s in the new official notation:

Call                      Output
render.bytes(1000)        1000 B
render.bytes(1024)        1.00 KiB
render.bytes(2000000)     1.91 MiB
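The difference between the outputs of render.disksize() and render.bytes() comes down to the base used for scaling. The following sketch shows that idea with hypothetical stand-in functions (sketch_size, sketch_disksize, sketch_bytes); it is not the implementation in Checkmk and is only meant to reproduce the example values shown in the tables.

```python
def sketch_size(value, base, prefixes):
    """Scale a byte count by powers of `base` (illustrative, not Checkmk's code)."""
    if value < base:
        return f"{value} B"
    scaled = float(value)
    unit = "B"
    for prefix in prefixes:
        if scaled < base:
            break
        scaled /= base
        unit = prefix
    return f"{scaled:.2f} {unit}"


def sketch_disksize(value):
    # Powers of 1000 with decimal prefixes.
    return sketch_size(value, 1000, ["kB", "MB", "GB", "TB"])


def sketch_bytes(value):
    # Powers of 1024 with binary (IEC) prefixes.
    return sketch_size(value, 1024, ["KiB", "MiB", "GiB", "TiB"])


for value in (1000, 1024, 2000000):
    print(value, sketch_disksize(value), sketch_bytes(value))
```

In a real plug-in you would call the render functions instead, so that your output stays consistent with the rest of Checkmk.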

9.4. Bandwidths and data rates

Networkers have their own terms and ways of expressing things. And as always, in each domain Checkmk tries hard to adopt the way of communicating that is customary there. That’s why there are three different rendering functions for data rates and speeds. All of these have in common that the rates are passed in bytes per second, even when the output is in bits!

render.nicspeed() represents the maximum speed of a network card or switch port. Since they are not measured values, there is no need to do any rounding. Although no port can send single bits, the specifications are in bits for historical reasons. Attention: you must however always pass bytes per second here as well!

Examples:

Call                           Output
render.nicspeed(12500000)      100 MBit/s
render.nicspeed(100000000)     800 MBit/s

render.networkbandwidth() is for an actual measured transmission speed on the network. The input value is again bytes per second (or 'octets' as a networker would say):

Call                                   Output
render.networkbandwidth(123)           984 Bit/s
render.networkbandwidth(123456)        988 kBit/s
render.networkbandwidth(123456789)     988 MBit/s
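Because render.nicspeed() and render.networkbandwidth() both expect bytes per second, a value that a data source delivers in bits per second has to be divided by 8 first. A minimal sketch of that conversion follows; the helper name and the reported speed are made up for illustration.

```python
def bits_to_octets_per_second(bits_per_second):
    """Convert a rate in bit/s into the bytes (octets) per second the render functions expect."""
    return bits_per_second / 8


# Hypothetical example: a device reports its port speed as 100 Mbit/s.
reported_speed_bits = 100_000_000
speed_octets = bits_to_octets_per_second(reported_speed_bits)
print(speed_octets)  # 12500000.0
```

The result, 12500000, is the value that the nicspeed table above shows being rendered as 100 MBit/s.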

Where the network is not involved and data rates are nevertheless output, bytes are again common. The most prominent examples are the IO rates of hard disks. For this there is the render function render.iobandwidth(), which in Checkmk works with powers of 1000:

Call                              Output
render.iobandwidth(123)           123 B/s
render.iobandwidth(123456)        123 kB/s
render.iobandwidth(123456789)     123 MB/s

9.5. Percentages

The render.percent() function represents a percentage value, rounded to two decimal places. It is an exception among the render functions in that it is not passed the actual natural value (that is, the ratio), but really the percentage. So if something is half full, for example, you must pass 50 rather than 0.5.

Because it can sometimes be interesting to know whether a value is almost zero or exactly zero, values that are greater than zero but less than 0.01 are marked by adding a "<" sign.

Call                      Output
render.percent(0.004)     <0.01%
render.percent(18.5)      18.50%
render.percent(123)       123.00%
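Since a percentage and not a ratio is expected, a ratio has to be multiplied by 100 before it is passed on. The sketch below shows this conversion together with a hypothetical stand-in, sketch_percent, that mimics the three outputs in the table; it is not the code behind render.percent(), and the numbers 37 and 200 are invented.

```python
def sketch_percent(value):
    """Format a percentage the way the examples above suggest (illustrative only)."""
    if 0 < value < 0.01:
        # Tiny but non-zero values are marked instead of being rounded to 0.00%.
        return "<0.01%"
    return f"{value:.2f}%"


used, total = 37, 200
usage_percent = 100.0 * used / total  # turn the ratio 0.185 into the percentage 18.5

print(sketch_percent(usage_percent))  # 18.50%
print(sketch_percent(0.004))          # <0.01%
```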

9.6. Summary

Here to recap is an overview of all of the render functions:

Function            Input         Description                               Output example
date                Epoch         Date                                      Dec 18 1970
datetime            Epoch         Date and time                             Dec 18 1970 10:40:00
timespan            Seconds       Duration / age                            3d 5m
frequency           Hz            Frequency (e.g. clock rate)               110 MHz
disksize            Bytes         Hard disk size, base 1000                 1,234 GB
filesize            Bytes         Size of files, exact                      1,334,560 B
bytes               Bytes         Size in bytes, base 1024                  23.4 KiB
nicspeed            Octets/sec    Network card speed                        100 MBit/s
networkbandwidth    Octets/sec    Transmission speed                        23.50 GBit/s
iobandwidth         Bytes/sec     IO bandwidth                              124 MB/s
percent             Percent       Percentage value, meaningfully rounded    99.997%

10. Thresholds and check parameters

10.1. A rule set for the setup

In one of our previous examples, we generated the state WARN if there were only 10 or fewer slots left. The number 10 was coded directly into the check function — hard-coded, as programmers would say. In Checkmk, however, users are more used to being able to configure such thresholds and parameters by rule. We will therefore take a look at how you can improve your check so that it can be configured via the setup interface.

To do this, we need to distinguish between two scenarios:

  1. There is already a suitable rule set. This can actually only be the case if your new check performs a check for which Checkmk already has check plug-ins of the same type, such as monitoring a temperature. There is already a rule set for this that you can use directly.

  2. There is no matching rule set. In that case you will have to create a new one.

10.2. Using existing rule sets

The rule sets supplied for the parameters of checks can be found in the lib/check_mk/gui/plugins/wato/check_parameters/ directory. Let’s take the file memory_simple.py as an example. This defines a rule set with the following section:

lib/check_mk/gui/plugins/wato/check_parameters/memory_simple.py
rulespec_registry.register(
    CheckParameterRulespecWithItem(
        check_group_name="memory_simple",
        group=RulespecGroupCheckParametersOperatingSystem,
        item_spec=_item_spec_memory_simple,
        match_type="dict",
        parameter_valuespec=_parameter_valuespec_memory_simple,
        title=lambda: _("Main memory usage of simple devices"),
    ))

Here the important point for you is the keyword check_group_name, which is set here to 'memory_simple'. This establishes the connection to the check plug-in. You do this when registering the check with the keyword check_ruleset_name, for example:

register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
    check_ruleset_name="memory_simple",
    check_default_parameters={},
)

The definition of default parameters via the keyword check_default_parameters is also mandatory. These parameters apply to your check if the user has not yet created a rule. If there are no mandatory parameters, you can simply take the empty dictionary {} as the value.

We will see below how the value configured by the user arrives at the check function.

10.3. Defining your own rule set

If there is no suitable rule set (which is probably the usual situation), we will have to create a new one ourselves. To do this, we create a file in the local/share/check_mk/web/plugins/wato directory. The name of the file should be based on that of the check and, like all plug-in files, it must have the extension '.py'.

Let’s look at the structure of such a file step by step. First come some import commands. If you want the texts in your file to be translatable into other languages, import the _ (underscore) function. This function marks all translatable texts. For example, you write the function call _("Threshold for warn") instead of the plain string "Threshold for warn". The translation system in Checkmk, which is based on gettext, finds such texts and includes them in the list of texts to be translated. If you only build the check for yourself, you can do without it and do not need the following import command:

local/share/check_mk/web/plugins/wato/foobar_parameters.py
from cmk.gui.i18n import _

Next we import the so-called ValueSpecs. A ValueSpec is a very practical and universal tool that Checkmk uses in many places. It is used to generate customised input masks, to display and validate the entered values and to convert them into Python data structures. In the following example, Dictionary, Integer and TextInput are imported.

from cmk.gui.valuespec import (
    Dictionary,
    Integer,
    TextInput,
)

You will definitely need the Dictionary. Since version 2.0.0 of Checkmk it is mandatory that check parameters are Python dictionaries. Previously, it could also be a pair (a tuple of two numbers), e.g. Warn/Crit.

Integer is responsible for the input of a number without decimal places and TextInput for a Unicode text.

Next, symbols are imported that are needed for registration:

from cmk.gui.plugins.wato import (
    CheckParameterRulespecWithItem,
    rulespec_registry,
    RulespecGroupCheckParametersOperatingSystem,
)

If your check does not have an item, import CheckParameterRulespecWithoutItem instead. We will explain more about the RulespecGroup… below.

Now come the actual definitions. First we define an input field with which the user can specify the item for the check. This is necessary for the rule condition and also for the manual creation of checks that are to function without discovery. We do this with TextInput. This is assigned a title via title, which is then usually displayed as a heading for the input field:

def _item_valuespec_foobar():
    return TextInput(title=_("Sector name"))

You can freely choose the name of the function that returns this ValueSpec as it is only required at the point further down. So that it does not become visible beyond the module boundary, it should begin with an underscore.

Next comes the ValueSpec for entering the actual check parameter. For this, too, we create a function that generates it. Returning a Dictionary(…) is mandatory. Within it, you create the list of sub-parameters for this check with elements=[…]. In our example there is only one: the warning threshold for the free slots. This is required to be an integer, so we use Integer here.

def _parameter_valuespec_foobar():
    return Dictionary(
        elements=[
            ("warning_lower", Integer(title=_("Warning below free slots"))),
        ],
    )

Last but not least, we now register a new rule set using the imported and self-defined items. For this purpose there is the rulespec_registry.register() function:

rulespec_registry.register(
    CheckParameterRulespecWithItem(
        check_group_name="foobar",
        group=RulespecGroupCheckParametersOperatingSystem,
        match_type="dict",
        item_spec=_item_valuespec_foobar,
        parameter_valuespec=_parameter_valuespec_foobar,
        title=lambda: _("Free slots for Foobar sectors"),
    ))

A few more notes on this:

  • If your check does not use an item, the inner function is CheckParameterRulespecWithoutItem. The line item_spec is then omitted.

  • As mentioned above, the check_group_name provides the link to the checks that are to use this rule set. It must not be identical to the name of an already existing rule set, because this would overwrite the existing rule set.

  • The group determines in which category in the setup the rule set should appear. Most of these groups are defined in the file lib/check_mk/gui/plugins/wato/utils/__init__.py. There you will also find examples of how to create your own new group.

  • The match_type is always "dict". In older Checkmk versions there were also parameter rules with other types.

  • title defines the title of the rule set, but is not given directly as text, but as an executable function which returns the text (hence the lambda:).

Testing

When you have created this file, you should first try out whether everything works so far and not immediately continue working with the check function. To do this, you must first restart the Apache for the site so that the new file will be read. This is performed using the command:

OMD[mysite]:~$ omd restart apache

After that, the rule set should be found in the setup. Create a rule in this rule set and try out different values. If this functions without errors, you can now use the check parameters in the check function.

10.4. Applying the rule to the check plug-in

In order for the rule to take effect, we must allow the check plug-in to accept check parameters and tell it which rule to use. To do this, the check_default_parameters entry must be present in the registration. In the simplest case, we pass an empty dictionary.

Secondly, we pass the check_ruleset_name to the registration function, i.e. the name we gave to the rule set above using check_group_name. This way Checkmk knows from which rule set the parameters are to be determined.

The whole thing will then look like this, for example:

register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
    check_default_parameters={},
    check_ruleset_name="foobar",
)

Now Checkmk will try to pass parameters to the check function. For this to work, we need to extend the check function so that it expects the params argument as the second argument. This is inserted between item and section (if you build a check without an item, the item is of course omitted and params comes first):

def check_foobar(item, params, section):

It is highly recommended to now have the contents of the variable params printed out with a print command as a first test (or pprint if you want to have it a bit more convenient). Create different rules, and see which values arrive at params:

def check_foobar(item, params, section):
    print(params)
    for sector, used, slots in ...

Very important: When everything is ready, be sure to remove the print commands again! These can otherwise disrupt the internal communication in Checkmk.

Now we adapt our check function so that the parameter passed can produce its desired effect. We get the value from the parameters using the previously chosen key (here "warning_lower"):

def check_foobar(item, params, section):
    warn = params["warning_lower"]
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= warn:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return

If a rule has been configured, we can now monitor the "free slots" in our example. However, if no rule has been defined, this check function will crash: Since the default parameters of the plug-in are not filled, in the absence of a rule the plug-in will generate a KeyError.

We can fix this problem by inserting a suitable parameter during registration:

register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
    check_default_parameters = {"warning_lower": 10},
    check_ruleset_name = "foobar",
)

You should always pass default values in this way (and not use the check plug-in to catch missing parameters), as these default parameters can also be displayed in the setup interface. For this purpose, there is e.g. the entry Show check parameters in the menu Display on the service configuration page for a host.
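The effect of default parameters can be pictured with plain dictionaries. The following sketch rests on a simplifying assumption, namely that values configured by a rule override the defaults key by key; how Checkmk combines them internally is not covered here, and the helper effective_params is hypothetical.

```python
def effective_params(defaults, rule_values):
    """Combine default parameters with values from a rule (simplified assumption)."""
    merged = dict(defaults)
    merged.update(rule_values)
    return merged


defaults = {"warning_lower": 10}

no_rule = effective_params(defaults, {})
with_rule = effective_params(defaults, {"warning_lower": 3})

# Without defaults and without a rule, the key lookup in the check function fails.
try:
    {}["warning_lower"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(no_rule, with_rule, lookup_failed)
```

With the defaults in place, params["warning_lower"] always resolves to a value, whether or not the user has created a rule.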

By the way, having a single value as a threshold is very uncommon in Checkmk. Since services can be in the states OK, WARN and CRIT, it is only logical to always define the parameters as tuples with two entries, i.e. as a pair of thresholds for WARN and CRIT. To do this, we adapt the rule set as follows (Tuple must additionally be imported from cmk.gui.valuespec):

def _parameter_valuespec_foobar():
    return Dictionary(
        elements=[
            ("warning_lower", Tuple(
                title=_("Levels on free slots"),
                elements=[
                    Integer(title=_("Warning below")),
                    Integer(title=_("Critical below")),
                ],
            )),
        ],
    )

Note that such a change of data type is an incompatible change: Existing rules can now no longer be loaded from the interface. And also the check function may run into problems if instead of an expected pair of numbers there is a single number in params. You can simply edit such rules. When you save them again, the new format will be used.
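A check function can be written so that it tolerates both formats during such a transition. The following is a minimal sketch under stated assumptions: State is a local stand-in for the class of the same name in the check API, free_slot_state is a hypothetical helper, and interpreting an old single number as the pair (number, 0) is a choice made here to mirror the earlier check function, not something prescribed by Checkmk.

```python
from enum import Enum


class State(Enum):
    """Stand-in for the State class of the check API, for this sketch only."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


def free_slot_state(used, slots, levels):
    """Map the number of free slots to a state using lower levels (warn, crit)."""
    if isinstance(levels, int):
        # Rule saved in the old format: a single warning threshold.
        # "No free slots" stays critical, as in the earlier check function.
        levels = (levels, 0)
    warn, crit = levels
    free = slots - used
    if free <= crit:
        return State.CRIT
    if free <= warn:
        return State.WARN
    return State.OK


print(free_slot_state(92, 100, (10, 5)))  # new format
print(free_slot_state(95, 100, 10))       # old format
```

Once all rules have been saved again in the new format, the branch for the old format can be removed.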

10.5. Further ValueSpecs

In Checkmk there are numerous ValueSpecs for all kinds of situations. Here are a few more useful ones:

Float

Float is like Integer but allows the input of numbers with decimal places.

Percentage

Often one does not want to indicate thresholds in absolute numbers, but in percentages. For this purpose there is the Percentage ValueSpec:

def _parameter_valuespec_foobar():
    return Dictionary(
        elements=[
            ("levels_percent", Tuple(
                title=_("Relative levels"),
                elements=[
                    Percentage(title=_("Warning at"), default_value=80),
                    Percentage(title=_("Critical at"), default_value=90)
                ],
            )),
        ],
    )

With this ValueSpec, the check plug-in would receive the parameters {"levels_percent": (80.0, 90.0)}.

MonitoringState

The MonitoringState is useful if you want to allow the user to select one of the states OK, WARN, CRIT and UNKNOWN for each of various situations. It provides the user with a drop-down list of just these four options, which are then converted to one of the numbers 0, 1, 2 or 3.

Here you can set, for example, which status the service should have if no backup is configured or available:

def _parameter_valuespec_plesk_backups():
    return Dictionary(
        help=_("This check monitors backups configured for domains in plesk."),
        elements=[
            ("no_backup_configured_state",
             MonitoringState(title=_("State when no backup is configured"), default_value=1)),
            ("no_backup_found_state",
             MonitoringState(title=_("State when no backup can be found"), default_value=1)),
        ...

With this ValueSpec, the check plug-in would be passed the parameters {"no_backup_configured_state": 1, "no_backup_found_state": 1} if the default of WARN (=1) was kept in both cases. You can easily convert the number into a State object by passing it to the State() function:

    yield Result(
        state=State(params["no_backup_configured_state"]),
        summary="No backup is configured!",
    )

Age

The field Age allows the entry of an age, which is stored and transferred internally as a count of seconds:

def _parameter_valuespec_antivir_update_age():
    return Tuple(elements=[
        Age(title=_("Warning level for time since last update")),
        Age(title=_("Critical level for time since last update")),
    ],)
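Since Age delivers plain seconds, the values can be compared directly with a difference of two epoch timestamps. The following sketch uses a hypothetical helper, update_age_level, and returns the state as one of the numbers 0, 1 or 2; the thresholds and timestamps are invented for illustration.

```python
DAY = 86400  # seconds per day


def update_age_level(last_update, now, levels):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT) for the time since the last update."""
    warn, crit = levels  # both in seconds, as stored by Age
    age = now - last_update
    if age >= crit:
        return 2
    if age >= warn:
        return 1
    return 0


levels = (2 * DAY, 7 * DAY)  # hypothetical rule: warn after 2 days, crit after 7
now = 1600000000
print(update_age_level(now - 1 * DAY, now, levels))
print(update_age_level(now - 3 * DAY, now, levels))
print(update_age_level(now - 8 * DAY, now, levels))
```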

Filesize

The ValueSpec Filesize allows the input of file (or hard disk) sizes. Internally, the calculation is done with bytes, but the user may choose from KB, MB, GB or TB:

    Tuple(
        title=_("Maximum size of all files on backup space"),
        help=_("The maximum size of all files on the backup space. "
               "This might be set to the allowed quotas on the configured "
               "FTP server to be notified if the space limit is reached."),
        elements=[
            Filesize(title=_("Warning at")),
            Filesize(title=_("Critical at")),
        ],
    ),

The topic of ValueSpecs is extremely flexible and extensive and would go beyond the scope of this article. Please have a look at the examples of the rule definitions supplied by Checkmk in lib/check_mk/gui/plugins/wato/check_parameters/. There are more than 500 files with examples.

11. Customised presentation of metrics

11.1. The significance of metric definitions

In our example above, we have let the foobar plug-in generate the metric fooslots. Metrics are immediately visible in the Checkmk graphical interface without you having to do anything. A graph for each metric will be automatically generated in the service details.

However, there are a few limitations:

  • A "Perf-O-Meter", i.e. the graphical bar-like preview of the measurement value, does not automatically appear when the service is displayed in the list view (e.g. in the view showing all services of a host).

  • Matching metrics are not automatically combined in a single graph, but each appears separately.

  • The metric does not have a proper title, but the internal variable name of the metric is shown.

  • No unit is used that allows a meaningful representation (e.g. GB instead of individual bytes).

  • A colour is randomly selected.

To have a clear representation of your metrics in these aspects, you will need some more definitions in another file.

11.2. Using existing metric definitions

Before you do this, you should first check, as with the rule set for the parameters, whether Checkmk already comes with a suitable metric definition. The predefined metric definitions can be found in the lib/check_mk/gui/plugins/metrics/ directory. For example, in the file cpu.py you will find a metric for the CPU utilization:

metric_info["util"] = {
    "title": _("CPU utilization"),
    "unit": "%",
    "color": "26/a",
}

If this is suitable for your plug-in, you only need to use the name "util" in your call to the Metric() class. Everything else will then be automatically derived from it.

11.3. Own metric definitions

If there is no suitable metric, simply create one yourself. In our example we want to define our own metric for our fooslots. To do this, we create a file in local/share/check_mk/web/plugins/metrics:

local/share/check_mk/web/plugins/metrics/foobar_metric.py
from cmk.gui.i18n import _
from cmk.gui.plugins.metrics import metric_info

metric_info["fooslots"] = {
    "title": _("Used slots"),
    "unit": "count",
    "color": "15/a",
}

Here are a few pointers:

  • The key (here 'fooslots') is the metric name and must match what the check function outputs.

  • Importing and using the underscore for internationalisation is optional, as already discussed in the rules.

  • See the file lib/check_mk/gui/plugins/metrics/unit.py for the unit definitions.

  • The colour definition uses a palette. For each palette colour there are /a and /b. These are two shades of the same colour. In the existing definitions you will also find many direct colour codes like '#ff8800'. These will gradually be phased out and all replaced by palette colours as these provide a more uniform look and are also easier to match to the interface themes.

This definition now ensures that the colour, title and unit of the metric are displayed according to our requirements.

11.4. Graphs with multiple metrics

If you want to combine several metrics in one graph, which is often very useful, you need a graph definition, which can simply go in the same file. This is done via the global dictionary graph_info.

For example, let’s assume our check has two metrics, fooslots and fooslots_free. The metric definitions would be, for example:

local/share/check_mk/web/plugins/metrics/foobar_metric.py
from cmk.gui.i18n import _
from cmk.gui.plugins.metrics import (
    metric_info,
    graph_info,
)

metric_info["fooslots"] = {
    "title": _("Used slots"),
    "unit": "count",
    "color": "16/a",
}

metric_info["fooslots_free"] = {
    "title": _("Free slots"),
    "unit": "count",
    "color": "24/a",
}

Now we add a graph that draws these two metrics as lines:

graph_info["fooslots_combined"] = {
    "metrics": [
        ("fooslots", "line"),
        ("fooslots_free", "line"),
    ],
}

Notes on this:

  • Unfortunately, there is no description of the possibilities for this definition in the manual yet. But you will find many examples in the files in the directory lib/check_mk/gui/plugins/metrics.

  • Try area or stack instead of line.

11.5. Displaying the metrics in the Perf-O-Meter

If you would like to display a Perf-O-Meter in the service line in addition to our metric, you need another file, this time in the directory local/share/check_mk/web/plugins/perfometer.

Example:

local/share/check_mk/web/plugins/perfometer/foobar_perfometer.py
from cmk.gui.plugins.metrics import perfometer_info

perfometer_info.append({
    "type": "logarithmic",
    "metric": "fooslots",
    "half_value": 5,
    "exponent": 2.0,
})

Perf-O-Meters are a bit trickier than graphs because they have no legend, which makes the range of values difficult to convey. Since the poor Perf-O-Meter cannot know which readings are even possible and the space is very limited, many built-in check plug-ins use a logarithmic representation. This is also the case in our example. half_value is the measured value that is displayed exactly in the middle of the Perf-O-Meter: with a value of 5, the bar would be half filled. exponent describes the factor by which the value must grow so that another 10% of the range is filled. So in this example, a reading of 10 would be displayed at 60% and one of 20 at 70%.

The advantage of this method is that when you have a list of services of the same type, you can quickly compare all Perf-O-Meters visually because they all have the same scale. And despite the very small representations, you can easily see the differences in both very small and large readings. The values are NOT to scale, however.
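The logarithmic scale described above can be written down as a small formula: the half value sits at 50%, and every multiplication of the reading by the exponent adds ten percentage points. The sketch below implements exactly that description with a hypothetical function, sketch_log_perfometer; clamping to the range 0 to 100 is an assumption, and the code in Checkmk itself may differ in detail.

```python
import math


def sketch_log_perfometer(value, half_value, exponent):
    """Fill level in percent of a logarithmic Perf-O-Meter, per the description above."""
    if value <= 0:
        return 0.0
    # half_value maps to 50%; each multiplication by `exponent` adds 10 points.
    position = 50.0 + 10.0 * math.log(value / half_value, exponent)
    # Assumption: the bar cannot be less than empty or more than full.
    return max(0.0, min(100.0, position))


for reading in (5, 10, 20):
    print(reading, round(sketch_log_perfometer(reading, 5, 2.0), 1))
```

With half_value 5 and exponent 2.0 this yields 50, 60 and 70 percent for readings of 5, 10 and 20, matching the values given in the text.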

Alternatively, you can also use a linear Perf-O-Meter. This is always useful if there is a known maximum value. A typical situation would be measured values that represent percentages from 0 to 100. This would then look like this, for example:

perfometer_info.append({
    "type": "linear",
    "segments": ["fooslots_used_percent"],
    "total": 100.0,
})

There is another difference to the logarithmic representation — here segments is a list and allows multiple metrics to be displayed side by side.

As always, you can find examples in the many plug-ins supplied by Checkmk. These are also in the files in the directory lib/check_mk/gui/plugins/metrics.

12. Notes for users of the old API

Are you already experienced in developing check plug-ins with the previous API — the one up to version 1.6.0 of Checkmk? Then you will find some notes about important changes summarized here.

12.1. saveint() and savefloat()

The two functions saveint() and savefloat() have been dropped. As a reminder, saveint(x) returns 0 if x cannot be reasonably converted to a number, e.g. because it is an empty string or does not consist only of digits.

While there have been a few good use cases for this, it has been used incorrectly in the majority of cases, which in the past has resulted in many errors being obscured.

In a situation in which you want to get a 0 on an empty string — which is the most common 'good' use case of saveint(x) — you can simply code the following:

foo = int(x) if x else 0

For savefloat() everything applies analogously.
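The practical difference to the old behaviour is that malformed input is no longer silently turned into 0. A short sketch follows; the wrapper name to_int_or_zero is chosen for illustration only.

```python
def to_int_or_zero(x):
    """Replacement idiom: 0 for an empty string, otherwise a strict conversion."""
    return int(x) if x else 0


empty_result = to_int_or_zero("")
number_result = to_int_or_zero("42")

# Input that is not a number now raises an error instead of being hidden.
try:
    to_int_or_zero("n/a")
    malformed_raises = False
except ValueError:
    malformed_raises = True

print(empty_result, number_result, malformed_raises)
```

The float variant works the same way with float(x) in place of int(x).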

13. Taming complex agent outputs using the parse function

The next step is the so-called parse function. This has the task of parsing the 'raw' agent data and putting it into a logically-tidy form that is easy to process in all subsequent steps. The convention is that this is named after the agent section and begins with parse_. It gets string_table as its only argument. Please note that you are not free to choose the name of this argument. It really must be called that.

For now, we write our parse function in such a way that we simply output the data it receives to the console. To do this, we simply use the print function (note: since Python 3, parentheses are mandatory here):

def parse_linux_usbstick(string_table):
    print(string_table)

In order for this whole process to have any effect, we have to make our parse function and the new agent section known to Checkmk. To do this, we call up a registration function:

register.agent_section(
    name = "linux_usbstick",
    parse_function = parse_linux_usbstick,
)

Here it is important that the name of the section really does exactly match the section header in the agent output. Altogether, it now looks like this:

local/lib/check_mk/base/plugins/agent_based/linux_usbstick.py
from .agent_based_api.v1 import *

def parse_linux_usbstick(string_table):
    print(string_table)
    return string_table

register.agent_section(
    name = "linux_usbstick",
    parse_function = parse_linux_usbstick,
)

From this point on, every plug-in that uses the section linux_usbstick gets the return value from the parse function. As a rule, this will be the check plug-in of the same name.

In a way, we have now built the simplest possible plug-in. Although it has no real use yet, we can at least test it. To do this, we trigger a service detection (option -I) on the command line for the host whose agent we prepared earlier. If its output really contains a section linux_usbstick, then we should see our debug output:

OMD[mysite]:~$ cmk -I myhost123
[['ata-APPLE_SSD_SM0512F_S1K5NYBF810191'], ['wwn-0x5002538655584d30']]

The output becomes somewhat clearer if we replace the simple print with a Pretty-print from the module pprint. This is highly recommended for all further debugging output:

local/lib/check_mk/base/plugins/agent_based/linux_usbstick.py
from .agent_based_api.v1 import *
import pprint

def parse_linux_usbstick(string_table):
    pprint.pprint(string_table)
    return string_table

register.agent_section(
    name = "linux_usbstick",
    parse_function = parse_linux_usbstick,
)

It will look like this:

OMD[mysite]:~$ cmk -I myhost123
[['ata-APPLE_SSD_SM0512F_S1K5NYBF810191'],
 ['wwn-0x5002538655584d30']]

13.1. Composing the parse function

If you look closely, you will see that these are nested lists. In the argument string_table you get a list which contains, for each line of the agent output, a list of the words in that line. The words within a line are separated by sequences of spaces. Since our section contains only one word per line, the inner lists consist of only one entry each.

The following example makes the structure a little clearer:

local/lib/check_mk/base/plugins/agent_based/linux_usbstick.py
from .agent_based_api.v1 import *
import pprint

def parse_linux_usbstick(string_table):
    print("Number of lines: %d" % len(string_table))
    print("Number of words in first line: %d" % len(string_table[0]))
    print("Length of first word: %d" % len(string_table[0][0]))
    return string_table

register.agent_section(
    name = "linux_usbstick",
    parse_function = parse_linux_usbstick,
)

The output will look like this:

OMD[mysite]:~$ cmk -I myhost123
Number of lines: 2
Number of words in first line: 1
Length of first word: 36
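The nesting can also be reproduced outside of Checkmk. The following sketch only imitates the splitting described above with plain Python. `split_section` is a hypothetical helper, not a Checkmk function, and the real agent-based API may treat special cases differently.

```python
def split_section(raw_lines):
    """Imitate string_table: one list of words per line of agent output."""
    # str.split() without arguments splits at any run of whitespace.
    return [line.split() for line in raw_lines]

# The two lines of the example section.
string_table = split_section([
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "wwn-0x5002538655584d30",
])
print(len(string_table))      # number of lines
print(len(string_table[0]))   # number of words in the first line

# A hypothetical line with several words, separated by runs of spaces.
multi_word = split_section(["some   line with    five words"])
print(multi_word[0])
```

A section whose lines contain several columns would therefore arrive as inner lists with several entries, which the parse function can then address by index.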

For our example, we just need a simple list of device names, so we make our parse function unpack the single word from each line and package it into a nice new list:

def parse_linux_usbstick(string_table):
    parsed = []
    for line in string_table:
        parsed.append(line[0])
    pprint.pprint(parsed)
    return string_table

The debug output then looks like this (please look carefully, there is now only a single pair of square brackets):

['ata-APPLE_SSD_SM0512F_S1K5NYBF810191',
 'wwn-0x5002538655584d30']

For the parse function to be complete, we now need to remove the debug output and, very importantly, return the new result instead of the original string_table:

def parse_linux_usbstick(string_table):
    parsed = []
    for line in string_table:
        parsed.append(line[0])
    return parsed

Of course, from this point on, all of the relevant plug-ins must be able to work with the new data format.
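Since the parse function is plain Python, it can be tried out without a running site. The sketch below repeats the function from above and feeds it a hand-written `string_table` in the shape of the debug output shown earlier. The compact variant and its name `parse_linux_usbstick_compact` are assumptions added for illustration and are not part of the original example.

```python
def parse_linux_usbstick(string_table):
    parsed = []
    for line in string_table:
        parsed.append(line[0])
    return parsed

# Equivalent, more compact variant (hypothetical name).
def parse_linux_usbstick_compact(string_table):
    return [line[0] for line in string_table]

# Hand-written input in the shape of the debug output shown earlier.
string_table = [
    ["ata-APPLE_SSD_SM0512F_S1K5NYBF810191"],
    ["wwn-0x5002538655584d30"],
]

parsed = parse_linux_usbstick(string_table)
print(parsed)
# Both variants should deliver the same flat list.
print(parsed == parse_linux_usbstick_compact(string_table))
```

Any function that later receives this section now gets a flat list of device names and no longer has to deal with the nested structure.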

14. The outlook for the future

There are many more aspects and topics around developing your own plug-ins. Checkmk has many interfaces for custom extensions and is therefore very flexible. We are working on progressively describing these interfaces in the manual.

Should you have any questions or difficulties, our professional support and also the free forum are of course at your disposal.

15. Files and directories

local/lib/check_mk/base/plugins/agent_based

Location for self-written check plug-ins.

local/share/check_mk/web/plugins/wato

Storage location for your check parameter rule sets.

local/share/check_mk/web/plugins/metrics

Storage location for your own metric definitions.

local/share/check_mk/web/plugins/perfometer

Storage location for your own Perf-O-Meter definitions.

local/share/check_mk/mibs

Place SNMP MIB files here that are to be loaded automatically.

lib/check_mk/gui/plugins/wato/check_parameters

Here you will find the rule set definitions for all check plug-ins included in Checkmk.

lib/check_mk/gui/plugins/wato/utils/__init__.py

This file defines the groups of the setup interface in which you can store new rule sets.

lib/check_mk/gui/plugins/metrics/

Here you will find the metric definitions for the supplied plug-ins.

lib/check_mk/gui/plugins/metrics/unit.py

The predefined units for the metrics are in this file.

/usr/lib/check_mk_agent/plugins

This directory is located on a monitored Linux host. Here the Checkmk agent for Linux expects to find agent extensions (agent plug-ins).
