Checkmk

1. Introduction

Checkmk includes nearly 2000 ready-made check plug-ins for all imaginable hardware and software. These are maintained by the Checkmk team, and new plug-ins are added every week. On the Checkmk Exchange there are also more plug-ins contributed by our users.

And yet there are always situations where a device, an application, or just a specific metric that is important to you is not covered by any of these plug-ins — maybe because it is something that was developed within your own company and is therefore not available to anyone else.

1.1. Does it always have to be a real plug-in?

What options do you have for implementing effective monitoring here? Well, you could of course contact our support team and ask them to develop a suitable plug-in for you, but naturally it is quicker if you can do it yourself.

You have four options:

Method: Local check
How to do it: Extend a Checkmk agent with a simple script
Advantages: Very simple, possible in all programming languages offered by the monitored host’s operating system, and even supports service discovery
Disadvantages: Threshold configuration only on the agent itself, SNMP not possible or very cumbersome

Method: Nagios-compatible check plug-in
How to do it: Run the plug-in via MRPE from the Windows or Linux agent
Advantages: Access to all existing Nagios plug-ins, free choice of programming language
Disadvantages: Threshold configuration only on the agent itself, SNMP not possible or very cumbersome, no service discovery possible

Method: Evaluating log messages
How to do it: Monitor messages with the Event Console
Advantages: No development necessary, you only need to set up rules in the Event Console
Disadvantages: Only works if suitable log messages are available, no verified current status, no recording of metrics, no configurable thresholds

Method: Genuine Checkmk plug-in
How to do it: Explained in this article
Advantages: Integrates 100% with Checkmk, automatic service detection, central configuration of thresholds via graphical interface, very high performance, supports SNMP, automatic host and service labels are possible, supports HW/SW inventory, supported by standard libraries from Checkmk
Disadvantages: Requires more training and knowledge of the Python programming language

This article will show you how to develop real Checkmk check plug-ins — along with everything associated with them. Here we show you how to use the newly-developed API for programming plug-ins in version 2.0.0 of Checkmk.

1.2. What has changed compared to the old API?

Do you already have experience with developing check plug-ins for Checkmk Version 1.6.0 or earlier? If so, here is a concise overview of all of the changes introduced in the new Check API available from Version 2.0.0:

  • Plug-ins are now Python 3 modules, and the file names must have the .py extension.

  • The custom plug-ins are now located in the local/lib/check_mk/base/plugins/agent_based directory.

  • At the beginning of the file you will now need at least one special import statement.

  • The sections and the actual checks are now stored separately. For this purpose there are the new register.agent_section and register.check_plugin functions.

  • Several function and argument names have been renamed. Among other things, Discovery is now always used consistently (previously Inventory).

  • The Discovery function (formerly Inventory function) and also the Check function must now always work as generators (so use yield).

  • The names for the declared functions' arguments are now fixed.

  • Instead of the SNMP scan function, write a declaration of which OIDs are expected with which values.

  • The functions for representing numbers have been restructured (e.g. get_bytes_human_readable becomes render.bytes).

  • There is now a separate method for checks to exclude others (supersedes). This is no longer done in the SNMP scan function.

  • The auxiliary functions for working with counters, rates and averages have changed.

  • Instead of magic return values such as 2 for CRIT, there are now constants (e.g. State.CRIT).

  • Many possible programming errors in your plug-in are now recognised by Checkmk at a very early stage and can be immediately highlighted for you.

1.3. Will the old API still be supported?

Yes. The API for the development of check plug-ins valid up to version 1.6.0 of Checkmk will be supported with some minor restrictions for a few more years, because a significant number of plug-ins have been developed with it. During this time, Checkmk will offer both APIs in parallel. Details can be found in Werk #10601.

Nevertheless, we do recommend the new API for the development of new plug-ins, as it is more consistent and logical, better documented and is the most future-proof solution in the long term.

1.4. The different types of agents

Check plug-ins evaluate the data from the Checkmk agents. And that is why, before we leap into action here, we should first look at an overview of the types of agents that Checkmk actually recognises:

Checkmk Agent

The ‘normal’ plug-ins evaluate data that the Checkmk agent sends for Linux, Windows or other operating systems. This agent monitors operating system parameters and applications, and sometimes also server hardware. Each new check plug-in requires an extension of the agent to provide the necessary data. Therefore you first develop an agent plug-in, and then one or more check plug-ins that evaluate this data.

Special Agent / API-Integration

You need a special agent if you do not receive the data that is relevant for monitoring from either the normal Checkmk agent or SNMP. The most common application for special agents is querying HTTP-based APIs. Examples include monitoring AWS, Azure, or VMware. In this case you write a script that runs directly on the Checkmk server, connects to the API, and outputs data in the same format as an agent plug-in would. For this you write suitable check plug-ins in the same way as with the ‘agent-based’ monitoring.

SNMP

When monitoring via SNMP you do not need an extension of an agent, but simply evaluate the data retrieved from your device via SNMP, which provides this by default. Checkmk supports you and takes over all of the details and special features of the SNMP protocol. There is in fact an agent here as well — namely the SNMP agent which is pre-installed on the system being monitored.

Active Check

This check type plays a special role. Here you first write a classic Nagios-compatible plug-in which is intended for execution on the Checkmk server, and which from there uses a network protocol to directly query a service on the target device. The most prominent example is the check_http plug-in which allows you to monitor web servers and web pages. You can then integrate this plug-in into Checkmk so that it can be set up as usual via rules.

1.5. Prerequisites

If you feel like programming check plug-ins, you need to satisfy the following prerequisites:

  • Knowledge of the Python programming language

  • Experience with Checkmk, especially with regard to agents and checks

  • Experience with Linux on the command line

As preparation, the following articles are recommended:

2. A first simple check plug-in

After this long introduction, it’s time we programmed our first simple check plug-in. As an example, let’s take a simple monitoring for Linux. Since Checkmk itself runs on Linux, it is very likely that you also have access to a Linux system.

The check plug-in will create a new service that detects whether someone has plugged a USB stick into a Linux server. If so, the service should become critical. You might even find something like this useful, but it is really only a simplified example and possibly not programmed in a completely watertight way. For now, that is not what this exercise is about.

The whole procedure involves two steps:

  1. We find out which Linux command can be used to determine whether a USB stick has been plugged in, and then extend the Linux agent with a small script that calls this command.

  2. We then write a check plug-in in the Checkmk instance that evaluates this data.

Here we go…​

2.1. Find the right command

Every check programming activity begins with the necessary research! This means that we have to find out how we can get the information we need for monitoring. With Linux, this will often involve command line commands. In Windows, PowerShell, VBScript or WMI can help, and with SNMP we have to find the right OIDs (there is a separate chapter for this).

Unfortunately, there is no general procedure for determining the correct command, so we do not want to spend too much time on the subject here. We will, however, briefly explain how it works for a USB stick.

First we log in to the host we want to monitor. Under Linux, the agent runs as the root user by default, which is why we perform all our tests as root. For our task with the USB stick, there are convenient symbolic links in the directory /dev/disk/by-id. These point to all the Linux block devices, and a plugged-in USB stick is one such device. In addition, you can tell from the usb- prefix of its ID when a block device is a USB device.

The following command lists all entries in this directory:

root@linux# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5 -> ../../sda5
lrwxrwxrwx 1 root root  9 May 14 11:21 wwn-0x5002538655584d30 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part5 -> ../../sda5

And now the same listing with the USB stick plugged in:

root@linux# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5 -> ../../sda5
lrwxrwxrwx 1 root root  9 May 14 12:15 usb-SCSI_DISK-0:0 -> ../../sdc
lrwxrwxrwx 1 root root 10 May 14 12:15 usb-SCSI_DISK-0:0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 May 14 12:15 usb-SCSI_DISK-0:0-part2 -> ../../sdc2
lrwxrwxrwx 1 root root  9 May 14 11:21 wwn-0x5002538655584d30 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part5 -> ../../sda5

2.2. Purging the data

Strictly speaking, we could stop here, transport this whole output via the Checkmk agent to the Checkmk server and have it analysed there, because in Checkmk the following recommendation always applies: let the server do the complex work and keep the agent plug-in as simple as possible.

But there is still too much hot air in here. It is always good not to transfer unnecessary data. This saves network traffic, memory, computing time and also makes everything clearer. That is a better way of doing things!

First, we can omit the -l. This already makes the output of ls much leaner:

root@linux# ls /dev/disk/by-id/
ata-APPLE_SSD_SM0512F_S1K5NYBF810191        ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5  wwn-0x5002538655584d30-part3
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1  wwn-0x5002538655584d30-part4                ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2
wwn-0x5002538655584d30                      wwn-0x5002538655584d30-part5                ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3
wwn-0x5002538655584d30-part1                ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4  wwn-0x5002538655584d30-part2

Now the multi-column layout gets in the way, but this is only because the ls command recognises that it is running in an interactive terminal. Later, as part of the agent, it will output the data in a single column. But we can also easily force this here with the -1 option (for output in one column):

root@linux# ls -1 /dev/disk/by-id/
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5
wwn-0x5002538655584d30
wwn-0x5002538655584d30-part1
wwn-0x5002538655584d30-part2
wwn-0x5002538655584d30-part3
wwn-0x5002538655584d30-part4
wwn-0x5002538655584d30-part5

If you look closely, you will see not only the block devices themselves, but also any partitions that exist there. These are the entries that end in -part1, -part2, etc. We do not need these for our check and can get rid of them quite easily with a grep. For this we use the -v option, which inverts the match so that only non-matching lines are output:

root@linux# ls /dev/disk/by-id/ | grep -v -- -part
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
usb-SCSI_DISK-0:0
wwn-0x5002538655584d30

Here you can now see much more clearly that in our example there are in fact exactly three devices when the USB stick is plugged in.

Perfect! We now have a clear list of all block devices which has been compiled with a simple command. That’s all we need.

We have again omitted the -1 in the last command, because ls now writes into a pipe and outputs a single column by itself. And grep needs the -- because otherwise it would interpret the word -part as the four options -p, -a, -r and -t.

And by the way: Why don’t we simply "grep" for usb in addition so that only USB devices are output? Well, of course we could do that. But for one thing, our example then becomes increasingly boring, and besides, it is somehow more reassuring to get some content in the section in a normal situation and not simply nothing. In this way one can see immediately on the Checkmk server that the agent plug-in is working correctly.
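
For readers who think more easily in Python than in shell pipelines, the effect of the grep filter used above can be sketched as follows. This is only an illustration of what grep -v -- -part does to the listing. The function name drop_partitions is made up for this example and is not part of Checkmk or of the agent.

```python
def drop_partitions(entries):
    """Keep only entries that do not contain '-part', like `grep -v -- -part`."""
    return [entry for entry in entries if "-part" not in entry]


# Sample entries taken from the directory listings shown earlier
listing = [
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1",
    "usb-SCSI_DISK-0:0",
    "usb-SCSI_DISK-0:0-part1",
    "wwn-0x5002538655584d30",
    "wwn-0x5002538655584d30-part1",
]

for device in drop_partitions(listing):
    print(device)
```

The agent plug-in itself stays a shell script. The Python version is only meant to make the filter logic explicit.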

2.3. Include the command in the agent

In order for us to be able to retrieve this data from the Checkmk server, we need to make the new command part of the Checkmk agent on the system being monitored. We could of course simply edit the /usr/bin/check_mk_agent file there and include that. However, this would have the disadvantage that our command would disappear again when we update the agent’s software because the file will be replaced at that point.

It is therefore better if we make an agent plug-in. This is even simpler. All we need is an executable file with our command in the /usr/lib/check_mk_agent/plugins directory.

And one more point is important: we can’t just output our data like this. What we still need is a section header. This is a specially-formatted line that contains our new check’s name. By means of these section headers, Checkmk can later recognise where this plug-in’s data begins and the previous plug-in’s data ends.
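
To make the role of the section headers more tangible, here is a minimal Python sketch of how an agent output could be split into sections at the header lines. It is not Checkmk's actual parser. The function split_sections and the regular expression are assumptions made for this illustration, and header options such as sep(124) are simply ignored.

```python
import re

# Matches a header such as <<<linux_usbstick>>> or <<<lnx_thermal:sep(124)>>>.
# Everything after the first colon (header options) is ignored in this sketch.
HEADER = re.compile(r"^<<<([^:>]+)(?::[^>]*)?>>>$")


def split_sections(agent_output):
    """Group the lines of an agent output by the section header above them."""
    sections = {}
    current = None
    for line in agent_output.splitlines():
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


sample = """<<<local>>>
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_thermal:sep(124)>>>
thermal_zone0|-|BAT0|35600"""

print(split_sections(sample)["linux_usbstick"])
```

Each header ends the previous section and starts a new one, which is why the plug-in only has to print its own header before its data.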

So now we need a meaningful name for our new check. This name must consist of lower case letters, underscores and numbers and be unique. There may not be an existing check plug-in with this name. If you are curious about which names already exist, in a Checkmk instance you can list them on the command line with cmk -L:

OMD[mysite]:~$ cmk -L | head -n 20
3par_capacity                agent      HPE 3PAR: Capacity
3par_cpgs                    agent      HPE 3PAR: CPGs
3par_cpgs_usage              agent      HPE 3PAR: CPGs Usage
3par_hosts                   agent      HPE 3PAR: Hosts
3par_ports                   agent      HPE 3PAR: Ports
3par_remotecopy              agent      HPE 3PAR: Remote Copy
3par_system                  agent      HPE 3PAR: System
3par_volumes                 agent      HPE 3PAR: Volumes
3ware_disks                  agent      3ware ATA RAID Controller: State of Disks
3ware_info                   agent      3ware ATA RAID Controller: General Information
3ware_units                  agent      3ware ATA RAID Controller: State of Units
acme_agent_sessions          snmp       ACME Devices: Agent Sessions
acme_certificates            snmp       ACME Devices: Certificates
acme_fan                     snmp       ACME Devices: Fans
acme_powersupply             snmp       ACME Devices: Power Supplies
acme_realm                   snmp       ACME Devices: Realm
acme_sbc                     agent      ACME SBC: Health
acme_sbc_settings            agent      ACME SBC: Health Settings
acme_sbc_snmp                snmp       ACME SBC: Health (via SNMP)
acme_temp                    snmp       ACME Devices: Temperature

The second column shows how the respective check plug-in obtains its data.
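
The character rule stated above (lower case letters, underscores and numbers) can be checked with a few lines of Python before you settle on a name. This is only a sketch. The helper is_valid_section_name is hypothetical, and it cannot tell you whether a name is already taken, which is what cmk -L is for.

```python
import re


def is_valid_section_name(name):
    """Return True if name consists only of lower case letters, digits and underscores."""
    return re.fullmatch(r"[a-z0-9_]+", name) is not None


for candidate in ("linux_usbstick", "3par_capacity", "Linux-USB"):
    print(candidate, is_valid_section_name(candidate))
```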

For our example, let’s choose the name linux_usbstick. In this example the section header must look like this:

<<<linux_usbstick>>>

We can simply output this with echo. If we also remember the 'shebang' (this is not a venomous sting from the desert planet but a contraction of sharp and bang, the latter being a nickname for the exclamation mark!), by which Linux recognises that it should execute the script with the shell, our plug-in will look like this:

/usr/lib/check_mk_agent/plugins/linux_usbstick
#!/bin/sh
echo '<<<linux_usbstick>>>'
ls /dev/disk/by-id/ | grep -v -- -part

We have simply used linux_usbstick as the file name here, although the file name does not really matter. But one thing is still very important: Make the file executable!

root@linux# chmod +x /usr/lib/check_mk_agent/plugins/linux_usbstick

Of course, you can easily try out the plug-in manually by entering the complete path as a command:

root@linux# /usr/lib/check_mk_agent/plugins/linux_usbstick
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30

2.4. Test the agent

As always, the next most important tasks are testing and troubleshooting. It is best to proceed in three steps:

  1. Try out the plug-in on its own. We have just done that.

  2. Test the whole process locally with the agent.

  3. Retrieve the agent via the Checkmk server.

Testing the agent locally is very simple — as root user call the command check_mk_agent. The new section should appear somewhere in the output from this:

root@linux# check_mk_agent

Here is an excerpt from that output which contains the new section:

<<<lnx_thermal:sep(124)>>>
thermal_zone0|-|BAT0|35600
thermal_zone1|-|x86_pkg_temp|81000|0|passive|0|passive
<<<local>>>
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463274)>>>
accountsservice|0.6.45-1ubuntu1|amd64|deb|-||install ok installed
acl|2.2.52-3build1|amd64|deb|-||install ok installed
acpi|1.7-1.1|amd64|deb|-||install ok installed

By appending less you can scroll through the output (press the space bar to scroll, / to search and Q to exit):

root@linux# check_mk_agent | less

The third test is then performed directly from the Checkmk instance. Include the host in the monitoring (e.g. as myserver01) and then retrieve the agent data with cmk -d. You should get the same output here:

OMD[mysite]:~$ cmk -d myserver01 | less

By the way: grep has the -A option to output a few more lines after each hit. This allows you to conveniently search and output the section:

OMD[mysite]:~$ cmk -d myserver01 | grep -A5 '^<<<linux_usbstick'
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463559)>>>
accountsservice|0.6.45-1ubuntu1|amd64|deb|-||install ok installed

If this works, your agent is now ready! And what have we done to achieve this? We simply created a three-line script with the path /usr/lib/check_mk_agent/plugins/linux_usbstick and made it executable!

Everything that follows now only takes place on the Checkmk server: There we will write the actual check plug-in.

2.5. Declare the section

Preparing the agent is the most complicated part, but it is only half the battle. Now we have to teach Checkmk how to handle the information and the new agent section, which services it should generate, when they should go to OK or CRIT, etc. We do all this by programming a check plug-in in Python.

For your own check plug-ins you will find a directory prepared in the local hierarchy of the instance directory. This is local/lib/check_mk/base/plugins/agent_based/. Here in the path, base means the part of Checkmk that is responsible for actually monitoring and alerting. The agent_based is for all plug-ins that relate to the Checkmk agent (so not alerting plug-ins, for example). The easiest way to work with this is to switch to it:

OMD[mysite]:~$ cd local/lib/check_mk/base/plugins/agent_based

This directory belongs to the instance user and is therefore editable by you. You can edit your plug-in with any text editor installed on the Linux system.

So let’s create our plug-in here. The convention is that the file name reflects the agent section’s name. It is mandatory that the file name ends with .py, since from Checkmk version 2.0.0 onwards the plug-ins are always real Python modules.

First, we need to import the functions needed for the plug-ins from other Python modules. The simplest method for this is with a *. As you might guess, there is also a version number of the API for plug-in programming here. This will be version 1 until further notice, and is abbreviated here to v1:

local/lib/check_mk/base/plugins/agent_based/linux_usbstick.py
from .agent_based_api.v1 import *

This versioning allows us to eventually provide future new versions of the API parallel to the previous ones, so that existing check plug-ins continue to work without problems.

In the simplest case, you skip explicitly declaring the section. If you want to implement a parse function (which professional developers would always advise you to do), see the section on parse functions for more information.

2.6. Registering the check

In order for Checkmk to know that the new check exists, it must be registered. This is done by calling the function register.check_plugin. In doing so, you must always specify at least four things:

  1. name: The name of the check plug-in. If you don’t want to get into trouble, take the same name here as for your new agent section. This way the check will automatically know which section to evaluate.

  2. service_name: The name of the service as it should then appear in the monitoring.

  3. discovery_function: The function that discovers services of this type (more on this in a moment).

  4. check_function: The function to perform the actual check (more on this in a moment).

So for our check the registration will look like this:

register.check_plugin(
    name="linux_usbstick",
    service_name="USB stick",
    discovery_function=discover_linux_usbstick,
    check_function=check_linux_usbstick,
)

It’s best not to try this out just yet, because we still have to write the discover_linux_usbstick and check_linux_usbstick functions first, and these functions must appear in the source code before the above registration.

2.7. Writing the Discovery Function

A special feature of Checkmk is the automatic discovery of services to be monitored. For this to work, each check plug-in must define a function that uses the agent's output to decide whether a service of this type should be created for the host in question, or which ones if there can be several.

The discovery function is always called when the service discovery is carried out for a host. This function then decides whether or which services are to be created. In the standard procedure, it receives exactly one argument with the name section. This contains the data of the agent section in a parsed format (more on this later).

We implement the following simple logic: If the agent section linux_usbstick exists, then we also create a matching service. This service will then automatically appear on all hosts where our agent plug-in has been rolled out. We recognise the presence of the section simply by the fact that our discovery has actually been invoked!

The discovery function must return an object of the type Service for each service to be created using yield (not with return). For checks that can only occur once per host, no further information is needed:

def discover_linux_usbstick(section):
    yield Service()
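
If you would like to see what such a generator produces without a running Checkmk instance, the following sketch may help. The Service class here is only a stand-in written for this example (an assumption, not the real class from the Check API), so that the function can be called in a plain Python interpreter.

```python
from typing import NamedTuple, Optional


class Service(NamedTuple):
    """Stand-in for the Service class of the Check API (illustration only)."""
    item: Optional[str] = None


def discover_linux_usbstick(section):
    # The section content is not inspected: being called at all is enough
    yield Service()


sample_section = [["ata-APPLE_SSD_SM0512F_S1K5NYBF810191"]]
services = list(discover_linux_usbstick(sample_section))
print(len(services), "service(s) discovered")
```

Because the function uses yield, calling it returns a generator. Only iterating over it, here with list(), actually produces the Service objects.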

2.8. Writing the check function

So now we can come to the actual check function, which finally decides on the basis of current agent outputs which state a service should assume. Since our check has no parameters and there is only ever one per host, our function is also called with the single argument section.

Since we really need the content this time, we have to deal with the format of this argument. Unless you have explicitly defined a parse function Checkmk will parse each line of the section into a list of words using spaces. The whole thing then in turn becomes a list of these word lists. So the end result is that we will always have a list of lists.

In the simple case in which our agent plug-in only finds two devices, it will then look like this (here there is only one word per line):

[['ata-APPLE_SSD_SM0512F_S1K5NYBF810191'], ['wwn-0x5002538655584d30']]
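
The default behaviour described above can be approximated in a few lines of plain Python. This is a sketch of the idea, not the code Checkmk uses internally, and the function name split_into_words is an assumption made for this example.

```python
def split_into_words(lines):
    """Split every line at whitespace, giving a list of word lists."""
    return [line.split() for line in lines]


raw_lines = [
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "wwn-0x5002538655584d30",
]

print(split_into_words(raw_lines))
```

A line with several words, such as "West 100 100", would accordingly become a list with three entries.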

The check function now goes through line by line and looks for a line whose first (and only) word begins with usb-SCSI_DISK. If this is found, the state will become CRIT. Here is the implementation:

def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")

And here is the explanation:

  1. With for line in section we loop through all of the lines in the agent’s output.

  2. We then check whether the first word in the line — the respective device — begins with usb-SCSI_DISK.

  3. If yes, we generate a check result with the status CRIT and the text Found USB stick. And we then end the function with a return.

  4. If the loop is run without finding anything, it will generate the status OK and the text No USB stick found.
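
The four steps above can be tried out in a plain Python interpreter if the API classes are replaced by simple stand-ins. In the following sketch, State and Result are assumptions written for this example and only mimic the names used by the Check API. The check function itself is the one shown above.

```python
from enum import Enum
from typing import NamedTuple


class State(Enum):
    """Stand-in for the API's State constants (illustration only)."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


class Result(NamedTuple):
    """Stand-in for the API's Result class (illustration only)."""
    state: State
    summary: str


def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")


without_stick = [
    ["ata-APPLE_SSD_SM0512F_S1K5NYBF810191"],
    ["wwn-0x5002538655584d30"],
]
with_stick = without_stick + [["usb-SCSI_DISK-0:0"]]

for section in (without_stick, with_stick):
    for result in check_linux_usbstick(section):
        print(result.state.name, "-", result.summary)
```

Feeding the function hand-written sections like this is a quick way to exercise both branches before the plug-in is installed on a server.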

2.9. Testing the discovery

2.10. Testing the check

2.11. The complete plug-in at a glance

And here is the complete plug-in one more time:

local/lib/check_mk/base/plugins/agent_based/linux_usbstick.py
from .agent_based_api.v1 import *

def discover_linux_usbstick(section):
    yield Service()

def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")

register.check_plugin(
    name = "linux_usbstick",
    service_name = "USB stick",
    discovery_function = discover_linux_usbstick,
    check_function = check_linux_usbstick,
)

And this is the plug-in for the Linux agent:

/usr/lib/check_mk_agent/plugins/linux_usbstick
#!/bin/sh
echo '<<<linux_usbstick>>>'
ls /dev/disk/by-id/ | grep -v -- -part

3. Checks with more than one service (items) per host

3.1. Basic principles

In our example, we have built a very simple check that creates a service on a host — or not. A very common situation is, of course, that there can be several services with one check on one host.

The most common example of this is the file systems for a host. The plug-in named df creates one service per file system on the host. To distinguish these services, the mount point of the file system (e.g. /var), or the drive letter (e.g. C:) is built into the service name. This then results in the service name being, e.g. Filesystem /var, or Filesystem C:. The word /var or C: is referred to here as the item. So we also speak of a check with items.

If you want to build a check with multiple items, you need to implement the following things:

  • The discovery function must generate a service for each of the items that are to be meaningfully monitored on the host.

  • In the service name you must include this item using the %s wildcard (e.g. "Filesystem %s").

  • The check function is invoked once separately for each item and receives this as an argument. It must then fish out the relevant data for this item from the agent data.

3.2. A simple example

To be able to test the whole thing practically, we will simply build another agent section that only outputs dummy data. A small shell script is sufficient for this. The section should be called foobar in this example:

/usr/lib/check_mk_agent/plugins/foobar
#!/bin/sh
echo "<<<foobar>>>"
echo "West 100 100"
echo "East 197 200"
echo "North 0 50"

The foobar section contains three sectors here: West, East and North (whatever that means). Each sector has a number of slots, some of which are occupied (e.g. in West 100 of 100 slots are occupied).

Now we will create a matching check plug-in for this. The registration is as usual, but with the important difference that the service name now contains exactly one %s. At this position the item’s name will be inserted later by Checkmk:

register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
)
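
The effect of the %s placeholder can be illustrated with ordinary Python string formatting. This sketch only shows the resulting names for the three sectors of the sample data. How Checkmk performs the substitution internally is not covered here, and the helper expand_service_name is hypothetical.

```python
def expand_service_name(template, item):
    """Insert the item at the position of the %s placeholder."""
    return template % item


template = "Foobar Sector %s"
names = [expand_service_name(template, item) for item in ("West", "East", "North")]

for name in names:
    print(name)
```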

The discovery function now has the task of determining the items to be monitored. As usual, it receives the section argument. And again, this is a list of lines, which in turn are lists of words.

In our example the list looks like this:

[['West', '100', '100'], ['East', '197', '200'], ['North', '0', '50']]

You can loop through such a list with Python and give meaningful names to these three words grouped in each line:

for sector, _used, _slots in section:
    ...

In each line, the first word — here the sector — is our item. Whenever we find an item, we return that with yield, creating an object of type Service that gets the sector name as its item. The underscore indicates that for now we don’t care about the other two columns in the output, since in a discovery it ultimately doesn’t matter how many slots are occupied.

Overall it looks like this:

def discover_foobar(section):
    for sector, _used, _slots in section:
        yield Service(item=sector)

Of course, it would be easy to omit some lines here on the basis of arbitrary criteria. Maybe there are sectors which have a size of 0 and which you would never want to monitor? Simply omit such rows so that no item will be generated for them.
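
As a sketch of such a filter, the following variant skips every sector that has no slots at all. The Service class is again only a stand-in written for this example, and the sector South is a made-up addition to the sample data so that the filter has something to remove.

```python
from typing import NamedTuple, Optional


class Service(NamedTuple):
    """Stand-in for the Service class of the Check API (illustration only)."""
    item: Optional[str] = None


def discover_foobar(section):
    for sector, _used, slots in section:
        if int(slots) == 0:
            continue  # a sector without slots is not worth monitoring
        yield Service(item=sector)


section = [
    ["West", "100", "100"],
    ["East", "197", "200"],
    ["North", "0", "50"],
    ["South", "0", "0"],  # hypothetical extra line, not part of the agent plug-in above
]

items = [service.item for service in discover_foobar(section)]
print(items)
```

Note that North is kept: it has 50 slots, none of which are used. Only the total number of slots decides here.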

Then later, when the host is being monitored, the check function is called separately for each service — and thus for each item. Therefore, in addition to the section, it also receives the item argument with the item it is looking for. Now we go through all of the lines one after the other. When doing so, we will find the line that corresponds to the desired item:

def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            ...

Now all that is missing is the actual logic which determines when the item should in fact be OK, WARN or CRIT. We do that here like this:

  • When all slots have been used, the thing is to become CRIT.

  • If 10 or fewer slots are free, then it will become WARN.

  • Otherwise OK

The occupied and total slots always appear as the second and third words in each line. However, here we are dealing with strings, not numbers — but we need numbers to be able to compare and calculate. We therefore convert the strings into numbers using int().

We then return the check result by supplying an object of type Result via yield. This takes the parameters state and summary:

def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)   # convert string to int
            slots = int(slots)   # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return

In this context, please note the following:

  1. The command return ensures that the check function is terminated immediately after processing the found item. There is nothing more to be done, after all.

  2. If the loop is processed without finding the item being searched for, Checkmk automatically generates the result UNKNOWN - Item not found in monitoring data. This is intentional and a good thing. Do not handle this case yourself. If you don’t find an item you are looking for, just let Python run its course through the function and let Checkmk do its work.

  3. With the argument summary you define the text that the service displays as its status output. This is purely informational and will not be evaluated further by Checkmk.

By the way, for the common situation where you want to check a simple metric for thresholds, there is a helper function called check_levels. This is explained in a separate chapter.

Now let’s first try out the Discovery. For the sake of clarity we will restrict this action to our plug-in by using the --detect-plugins=foobar option:

OMD[mysite]:~$ cmk --detect-plugins=foobar -vI myhost123
  3 foobar
SUCCESS - Found 3 services, 1 host labels

And now right away we can test the checking process (here also limited to foobar):

OMD[mysite]:~$ cmk --detect-plugins=foobar -v myhost123
Foobar Sector East   WARN - used 197 out of 200 slots
Foobar Sector North  OK - used 0 out of 50 slots
Foobar Sector West   CRIT - used 100 out of 100 slots
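
If you want to reason about the state logic without a running site, the rules can be replayed in plain Python. The following is only a sketch: slot_state is a hypothetical helper name, and the states are represented as simple strings instead of State objects.

```python
def slot_state(used: int, slots: int) -> str:
    """Return the state name for one sector, using the rules of check_foobar."""
    if used == slots:
        return "CRIT"      # all slots are occupied
    if slots - used <= 10:
        return "WARN"      # only a few slots are left
    return "OK"


section = [["West", "100", "100"], ["East", "197", "200"], ["North", "0", "50"]]
for sector, used, slots in section:
    print(sector, slot_state(int(used), int(slots)))
```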

3.3. The example — a recap

And here again our example in full. To avoid errors due to undefined function names, the functions must always be defined before registering.

local/lib/check_mk/base/plugins/agent_based/foobar.py
from .agent_based_api.v1 import *
import pprint


def discover_foobar(section):
    for sector, used, slots in section:
        yield Service(item=sector)


def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return


register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
)

4. Performance values

4.1. Determining values in the check function

Not always, but very often checks work with numbers. With its graphing system Checkmk has a component to store, evaluate and display such numbers. This works completely independently from the generation of any resulting OK, WARN or CRIT states.

Such measured values — or metrics — are determined by the check function and simply returned as an additional result. For this purpose the Metric object is used, which requires at least the two arguments name and value. Here is an example:

    yield Metric("fooslots", used)

4.2. Threshold information

Furthermore there are two optional arguments. With the argument levels you can provide information about thresholds for WARN and CRIT, in the form of a pair of two numbers. This is then usually plotted on the graph as a yellow and a red line. The first number (yellow line) represents the warning threshold, the second (red line) the critical one. The convention is that the check already goes to WARN when the warning threshold is reached (analogous for CRIT).

The coding could then look like this (here with hardcoded thresholds):

    yield Metric("fooslots", used, levels=(190,200))

Notes:

  • If only one of the two thresholds is to be defined, simply enter None for the other one, e.g. levels=(None, 200).

  • Floating point numbers are also allowed, but not strings.

  • Attention: the check function itself is responsible for checking the thresholds. The specification of levels serves only as additional information for the graphing system!

4.3. The values range

Analogous to the threshold values, you can also provide the graphing system with information about the range of possible values. This denotes the smallest and largest possible value. This is done in the boundaries argument, where None can again be used for one of the two boundaries.

Example:

    yield Metric(name="fooslots", value=used, boundaries=(0, 200))

And now once again our check function from the above example, but this time with the return of metric information including threshold values and a value range (this time of course not with fixed but with calculated values):

def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int

            yield Metric(
                "fooslots",
                used,
                levels=(slots-10, slots),
                boundaries=(0, slots))

            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return
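
Because the levels passed to Metric are only drawn in the graph, the state has to be derived from the same numbers in the check function. The following sketch shows one way to keep both in sync. The helper state_from_levels is a hypothetical name, not part of the Checkmk API, and states are plain strings here so that the code runs on its own.

```python
from typing import Optional, Tuple

Levels = Tuple[Optional[float], Optional[float]]


def state_from_levels(value: float, levels: Levels) -> str:
    """Apply the convention that a state is entered as soon as its threshold is reached."""
    warn, crit = levels
    if crit is not None and value >= crit:
        return "CRIT"
    if warn is not None and value >= warn:
        return "WARN"
    return "OK"


slots = 200
levels = (slots - 10, slots)  # the same pair that is passed to Metric(...)
print(state_from_levels(197, levels))
print(state_from_levels(250, (None, 200)))  # only a critical threshold is set
```

With levels=(slots - 10, slots) this produces the same states as the explicit if/elif/else in the check function above.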

5. Checks with multiple partial results

In order to prevent the number of services on a host from growing out of all control, several partial results are often combined in a single service. For example, the Memory used service under Linux checks not only RAM and swap consumption, but also shared memory, page tables and various other information.

The API provided by Checkmk offers a very convenient interface for this. In this way, a check function may simply generate a result with yield any number of times. The overall status for the service is then based on the 'worst' partial result according to the scheme OK → WARN → UNKNOWN → CRIT.

Here is an abbreviated, fictitious example:

def check_foobar(section):
    yield Result(state=State.OK, summary="Knulf rate optimal")
    # ...
    yield Result(state=State.WARN, summary="Gnarz required")
    # ...
    yield Result(state=State.OK, summary="Last Szork was good")

The summary of the service in the GUI then looks like this: "Knulf rate optimal, Gnarz required WARN, Last Szork was good". And the overall status will be WARN.

You can return multiple metrics in the same way. Simply call yield Metric(…​) once for each metric.
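
How the overall status follows from the partial results can be sketched in a few lines. This is a simplified model with states as plain strings, ordered from best to worst as OK, WARN, UNKNOWN, CRIT. The function name worst_state is an assumption for this illustration.

```python
from typing import Iterable

# States ordered from best to worst, following the scheme described in this chapter.
SEVERITY = ["OK", "WARN", "UNKNOWN", "CRIT"]


def worst_state(states: Iterable[str]) -> str:
    """Return the state that ranks worst among the partial results."""
    return max(states, key=SEVERITY.index)


print(worst_state(["OK", "WARN", "OK"]))  # the three partial results from the example
```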

6. Summary and Details

In the Checkmk monitoring, each service also has a line of text in addition to its status OK, WARN, and so on. Up until version 1.6.0 this was called Output of check plugin. As of version 2.0.0 this is now called Summary — so it has the task of providing a concise summary of the status. The idea is that this text does not exceed a length of 60 characters, so that it is always easy to read and ensures a clear table display without annoying line breaks.

Next to this there is the Details field, which used to be called Long output of check plugin (multiline). Here all of the details of the state are displayed, the idea being that all of the summary information is included here as well.

When calling yield Result(…​) you can determine which information is so important that it should be displayed in the summary and for which information it is sufficient that it appears in the details. The default rule is that partial results that lead to a WARN/CRIT will always be visible in the summary.

In our examples so far we have always used the following call:

    yield Result(state=State.OK, summary="some important text")

This will cause some important text to always appear in the Summary, and additionally in the Details, so you should only use this for important information. If a partial result is of secondary importance, replace summary with notice, and the text will appear only in the details as long as the service is OK.

    yield Result(state=State.OK, notice="some additional text")

If the state is WARN or CRIT, the text will then automatically appear as an addition in the summary:

    yield Result(state=State.CRIT, notice="some additional text")

Thus, in the summary it will be immediately clear why the service is not OK.

Last but not least, you have — for both summary and notice — the possibility to specify an alternative text for the details, which may contain more information about the partial result:

    yield Result(state=State.OK,
                 summary="55% used space",
                 details="55.2% of 160 GB used (82 GB)")

To summarize, this means:

  • The full text of the summary (for services that are OK) should not exceed 60 characters.

  • Always use either summary or notice — not both, and not neither.

  • Add details as necessary if you want the details text to be an alternative one.
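
The rules above can be modeled in a few lines of Python. This is a deliberately simplified sketch and not the Checkmk implementation: PartialResult is a hypothetical stand-in for Result, states are plain strings, and joining the texts with a comma or a line break is an assumption about the presentation.

```python
from typing import List, NamedTuple, Optional, Tuple


class PartialResult(NamedTuple):
    """Hypothetical stand-in for Result; exactly one of summary/notice is set."""
    state: str
    summary: Optional[str] = None
    notice: Optional[str] = None
    details: Optional[str] = None


def render_texts(results: List[PartialResult]) -> Tuple[str, str]:
    summary_parts = []
    detail_lines = []
    for result in results:
        text = result.summary if result.summary is not None else result.notice
        # A summary text always shows up; a notice only if the state is not OK.
        if result.summary is not None or result.state != "OK":
            summary_parts.append(text)
        # The details use the alternative text if one was given.
        detail_lines.append(result.details if result.details is not None else text)
    return ", ".join(summary_parts), "\n".join(detail_lines)


results = [
    PartialResult("OK", summary="55% used space", details="55.2% of 160 GB used (82 GB)"),
    PartialResult("OK", notice="some additional text"),
    PartialResult("CRIT", notice="some critical text"),
]
summary, details = render_texts(results)
print(summary)
print(details)
```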

7. Error handling

7.1. Exceptions and Crash Reports

The correct handling of errors (unfortunately) consumes a large chunk of the programming work. The good news is that the Checkmk API already does most of the work for you. Consequently, in most cases, it is important for you to simply not deal with errors.

When Python gets into a situation that is in some way unexpected, it responds with what is known as an exception. Here are a few examples:

  • You convert a string into a number with int(…​), but the string does not actually include a number, e.g. int("foo").

  • You access the fifth element of bar with bar[4], but this in fact has only four elements.

  • You are calling a function that does not exist.

Here the important general rule applies: Don’t capture exceptions yourself! Checkmk will always do this for you in a consistent and efficient way, in most cases accompanied by a crash report. Such a report will look like this, for example:

[Screenshot: crash report]

By clicking on the crash icon, the user is navigated to a page where they can

  • view a display of the file in which the crash took place.

  • get all information about the crash, for instance any error messages, call stack, agent output, the current values of local variables and much more.

  • send the report to us (tribe29) as feedback.

Submitting the report to tribe29 of course only makes sense for check plug-ins which are official Checkmk components. But you can also ask your users to simply send you the data. The users can then help you to find the error. It is often the case that a check plug-in works for you, but other users may experience sporadic errors. Working together you can then usually identify these problems very easily.

But if you were to intercept the exception yourself, all of this information would simply be unavailable. You would perhaps set the service to UNKNOWN and issue an error message, but all of the background circumstances that led to an error (e.g. the data from the agent) would simply be invisible.

7.2. Viewing exceptions on the command line

If you run your plug-in on the command line, no crash reports will be generated — you will only see the summarized error message:

OMD[mysite]:~$ cmk -II --detect-plugins=foobar myhost123
  WARNING: Exception in discovery function of check plugin 'foobar': invalid literal for int() with base 10: 'foo'

BUT: if you simply append the --debug option to this, you will then receive the Python stack trace:

OMD[mysite]:~$ cmk --debug -II --detect-plugins=foobar myhost123
Traceback (most recent call last):
  File "/omd/sites/myhost123/bin/cmk", line 82, in <module>
    exit_status = modes.call(mode_name, mode_args, opts, args)
  File "/omd/sites/myhost123/lib/python3/cmk/base/modes/__init__.py", line 68, in call
    return handler(*handler_args)
  File "/omd/sites/myhost123/lib/python3/cmk/base/modes/check_mk.py", line 1577, in mode_discover
    discovery.do_discovery(set(hostnames), options.get("checks"), options["discover"] == 1)
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 345, in do_discovery
    _do_discovery_for(
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 397, in _do_discovery_for
    discovered_services = _discover_services(
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1265, in _discover_services
    service_table.update({
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1265, in <dictcomp>
    service_table.update({
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1337, in _execute_discovery
    yield from _enriched_discovered_services(hostname, check_plugin.name, plugins_services)
  File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1351, in _enriched_discovered_services
    for service in plugins_services:
  File "/omd/sites/myhost123/lib/python3/cmk/base/api/agent_based/register/check_plugins.py", line 69, in filtered_generator
    for element in generator(*args, **kwargs):
  File "/omd/sites/myhost123/local/lib/python3/cmk/base/plugins/agent_based/foobar.py", line 5, in discover_foobar
    int("foo")
ValueError: invalid literal for int() with base 10: 'foo'

7.3. Invalid outputs from an agent

The question is how to react when the output from the agent is not in the form you would normally expect — whether it is from the 'real' agent or when the data comes via SNMP. Let’s assume that you always expect three words per line. What should you do if only two words were to arrive?

Now, if this is a permitted and familiar agent behavior, then of course you need to recognize that case and handle it explicitly.

If, however, this is not actually allowed …​ then it is best to treat the line as if it always contains three words, e.g. with:

def check_foobar(section):
    for foo, bar, baz in section:
        ...  # evaluate the three words here

If there should ever be a line that does not consist of exactly three words, a nice exception will be generated and you will receive the very helpful crash report that was just mentioned.
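
The following sketch shows in plain Python what happens in this case. The second line of the made-up section contains only two words. The exception is caught here solely to display its message. In a real plug-in you would let it propagate so that Checkmk can create the crash report.

```python
section = [["West", "100", "100"], ["East", "197"]]  # second line lacks one word

message = ""
try:
    for sector, used, slots in section:
        pass  # a real check function would evaluate the values here
except ValueError as exc:
    # Caught only to display the message; a real plug-in lets it propagate.
    message = str(exc)

print(message)
```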

7.4. Missing items

What if the agent outputs data correctly, but the item to be checked is missing? So, like this, for example:

def check_foobar(item, section):
    for sector, used, slots in section:
        if item == sector:
            # ... Check state ...
            yield Result(...)
            return

If the item you are looking for is not there, the loop runs to completion and Python simply reaches the end of the function without 'yielding' a result. And that’s exactly the correct procedure! Checkmk recognizes that the item to be monitored is missing and generates the correct status, UNKNOWN, together with a suitable standard text.
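
You can observe this behavior in isolation: a generator function that never reaches a yield simply produces an empty sequence. In the following sketch Result is a small stand-in class so that the code runs without Checkmk, and the item South is a made-up example.

```python
from typing import Iterator, List, NamedTuple


class Result(NamedTuple):
    """Stand-in for Checkmk's Result class so that the sketch runs on its own."""
    state: str
    summary: str


def check_foobar(item: str, section: List[List[str]]) -> Iterator[Result]:
    for sector, used, slots in section:
        if item == sector:
            yield Result(state="OK", summary=f"used {used} out of {slots} slots")
            return


section = [["West", "100", "100"], ["East", "197", "200"]]
found = list(check_foobar("East", section))
missing = list(check_foobar("South", section))  # no such sector in the data
print(len(found), len(missing))
```

The empty result for the missing item is what Checkmk then turns into the UNKNOWN state.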

8. SNMP-based checks

8.1. The fundamentals

Developing checks that work with SNMP is very similar to agent-based checks, except that you additionally need to specify which SNMP ranges (OIDs) the check requires. If you don’t yet have any experience with SNMP, we strongly recommend reading the article on Monitoring via SNMP as preparation at this point.

The process of discovery and checking via SNMP is somewhat different from that for a normal agent. Unlike there, where the agent sends all of the relevant information on its own, with SNMP we have to say exactly which data ranges we require. A complete dump of all data would be theoretically possible (via SNMP walk), but this process can take minutes for fast devices and over an hour for complex switches. Thus, this is not viable for checking or even for discovery. Checkmk therefore proceeds in a more targeted manner.

SNMP detection

Service detection is divided into two phases. First, the SNMP detection is performed. This determines which plug-ins on the respective device are of actual interest. To do this, a few SNMP OIDs are retrieved individually, without a walk. The most important of these is the sysDescr (OID: 1.3.6.1.2.1.1.1.0). Under this OID, each SNMP device holds a description of itself, e.g. ‘Cisco NX-OS(tm) n5000, Software (n5000-uk9),…’.

Based on this text, you can already define for very many plug-ins whether they can be useful for this application. If the text is still not specific enough, further OIDs are fetched and checked. The result of the SNMP detection will then be a list of candidates for check plug-ins.

Discovery

In the second step, the necessary monitoring data is fetched for each of these candidates with SNMP walks. These are then combined into a table and provided to the check’s discovery function with the section argument, which then as usual determines the items to be monitored.

Checking

When running checks, it is already known which plug-ins are to be executed for the device, so the SNMP detection is omitted. Here the monitoring data needed for the plug-ins is fetched immediately by SNMP walks and used to fill the section argument for the check function.

Summary

So what do you need to do differently with an SNMP check compared to an agent-based one?

  1. You do not need a plug-in for the agent.

  2. You must define the single OIDs and search texts required for an SNMP detection.

  3. You have to define which SNMP areas must be fetched for monitoring.

8.2. A word about the MIBs

Before we continue, we want to say a word about the infamous SNMP MIBs, because there are many prejudices about these. Right at the beginning, some good news: Checkmk doesn’t need them. Really! But they are an important aid in being able to develop an SNMP check.

So what is a MIB? Literally, the abbreviation means Management Information Base — somewhat meaningless really. To be concrete, a MIB is a quite easy to read text file which describes a certain subtree in the SNMP world. Namely, it states which branch in the tree — that is, which OID — has which meaning. This includes a name for the OID, a comment on what values it can take (e.g. for enumerated data types, where things like 1=up, 2=down, etc. are defined) and sometimes a useful comment.

Checkmk provides a set of freely-available MIB files. These describe very general areas in the global OID tree, but do not contain any vendor-specific areas. Therefore they are not of much help for self-developed checks.

So try to find the MIB files relevant for your particular device somewhere on the manufacturer’s web pages or even on the device’s management interface, and install these in the Checkmk instance in local/share/check_mk/mibs. You can then have SNMP walks convert OID numbers to names and can thus more quickly find where the data of interest for your monitoring is located. Also, MIBs, if done carefully, contain interesting information in their comments, as we noted above. You can easily read a MIB file with a text editor or with less.

8.3. Locating the correct OIDs

The crucial prerequisite for developing a plug-in is, of course, that you know which OIDs contain the necessary information. The first step in doing this (if the device doesn’t refuse) is to perform a complete SNMP walk. This will retrieve all of the available data via SNMP.

Checkmk can accomplish this very easily for you. To do so, first add the device (or one of the devices) for which you want to develop a plug-in to your monitoring. Let’s say this device is called mydevice01. Use the device’s basic services to verify that it can be monitored at all. As a minimum, the SNMP Info and Uptime services need to be found, and probably at least one Interface as well. This is how you make sure that SNMP access works cleanly.

Then switch to the command line in the Checkmk instance. Here you can perform a complete walk with the following command. We recommend using the -v (verbose) option when doing this for the very first time:

OMD[mysite]:~$ cmk -v --snmpwalk mydevice01
mydevice01:
Walk on ".1.3.6.1.2.1"...3898 variables.
Walk on ".1.3.6.1.4.1"...6025 variables.
Wrote fetched data to /omd/sites/mysite/var/check_mk/snmpwalks/mydevice01.

As mentioned earlier, such a complete walk can take minutes or even hours, although the latter is rare. So don’t become nervous if it takes a while to complete this process. The walk will be saved in the file var/check_mk/snmpwalks/mydevice01. This will be an easily readable text file that starts something like this:

var/check_mk/snmpwalks/mydevice01
.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots
.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.11863.1.1.3
.1.3.6.1.2.1.1.3.0 546522419
.1.3.6.1.2.1.1.4.0 hh@example.com
.1.3.6.1.2.1.1.5.0 sw-ks-01
.1.3.6.1.2.1.1.6.0 Core Switch Serverraum klein
.1.3.6.1.2.1.1.7.0 3
.1.3.6.1.2.1.2.1.0 27

In each line there is an OID and then its value. And right there in the first line you find the most important one, namely the sysDescr.
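
Because of this simple structure, a stored walk is easy to examine with a few lines of Python. The following sketch uses three lines from the excerpt above as inline text and assumes that every entry fits on one line. With a real walk you would read the file from var/check_mk/snmpwalks instead.

```python
walk_text = """.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots
.1.3.6.1.2.1.1.5.0 sw-ks-01
.1.3.6.1.2.1.2.1.0 27"""

walk = {}
for line in walk_text.splitlines():
    # Everything up to the first blank is the OID, the rest is its value.
    oid, _separator, value = line.partition(" ")
    walk[oid] = value

print(walk[".1.3.6.1.2.1.1.1.0"])  # the sysDescr
```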

Now the OIDs themselves are not very informative. If the correct MIBs are installed, you can have them converted to names in a second step with the cmk --snmptranslate command. It is best to redirect the result — which would otherwise appear in the terminal — to a file:

OMD[mysite]:~$ cmk --snmptranslate mydevice01 > translated
Processing 9923 lines.
finished.

The translated file reads like the original walk, but has a translated value for the OID on each line after the -->:

translated
.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots --> SNMPv2-MIB::sysDescr.0
.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.11863.1.1.3 --> SNMPv2-MIB::sysObjectID.0
.1.3.6.1.2.1.1.3.0 546522419 --> DISMAN-EVENT-MIB::sysUpTimeInstance
.1.3.6.1.2.1.1.4.0 hh@example.com --> SNMPv2-MIB::sysContact.0
.1.3.6.1.2.1.1.5.0 sw-ks-01 --> SNMPv2-MIB::sysName.0
.1.3.6.1.2.1.1.6.0 Core Switch Serverraum klein --> SNMPv2-MIB::sysLocation.0
.1.3.6.1.2.1.1.7.0 3 --> SNMPv2-MIB::sysServices.0
.1.3.6.1.2.1.2.1.0 27 --> IF-MIB::ifNumber.0
.1.3.6.1.2.1.2.2.1.1.1 1 --> IF-MIB::ifIndex.1
.1.3.6.1.2.1.2.2.1.1.2 2 --> IF-MIB::ifIndex.2

Example: the OID .1.3.6.1.2.1.1.4.0 has the translated name SNMPv2-MIB::sysContact.0. This is an important hint — the rest is then practice, experience and of course experimentation.

8.4. Registering the SNMP section

So, once you have determined the necessary OIDs, it’s on to the actual development of the plug-in. This is done in three steps:

  1. For an SNMP detection, specify which OIDs must contain which texts for your plug-in to run.

  2. Declare which OID branches need to be fetched for the monitoring.

  3. Write a check plug-in analogous to those for agent-based checks.

The first two steps are performed by registering an SNMP section. You do this by calling register.snmp_section(). Here you specify at least three arguments: the name of the section (name), the details for the SNMP detection (detect), and the OID branches needed for the actual monitoring (fetch). Here is an example of a hypothetical check plug-in with the name foo:

local/lib/check_mk/base/plugins/agent_based/foo.py
register.snmp_section(
    name = "foo",
    detect = startswith(".1.3.6.1.2.1.1.1.0", "foobar device"),
    fetch = SNMPTree(
        base = '.1.3.6.1.4.1.35424.1.2',
        oids = [
            '4.0',
            '5.0',
            '8.0',
        ],
    ),
)

The SNMP Detection

With the keyword detect you specify under which conditions the discovery function should be executed. In our example, this is the case if the value of the OID .1.3.6.1.2.1.1.1.0 (i.e. the sysDescr) starts with the text foobar device (case-insensitive in principle). In addition to startswith, there are a number of other possible attributes. There is also a negated form of each of these, which begins with not_:

Attribute | Negation | Function
equals(oid, needle) | not_equals(oid, needle) | The value of the OID is equal to the text needle
contains(oid, needle) | not_contains(oid, needle) | The value of the OID at some point contains the text needle
startswith(oid, needle) | not_startswith(oid, needle) | The value of the OID starts with the text needle
endswith(oid, needle) | not_endswith(oid, needle) | The value of the OID ends with the text needle
matches(oid, regex) | not_matches(oid, regex) | The value of the OID matches the regular expression regex, anchored at the start and at the end, so with an exact match. If you only need a substring, just add another .*
exists(oid) | not_exists(oid) | Met if the OID is available on the device. The value may be empty.

As well as the above, there is also the possibility of linking multiple tests with all_of or any_of. The option all_of requires all of the given conditions to be met for a positive detection of the plug-in. The following example finds the plug-in on a device if sysDescr starts with the text foo (or FOO or Foo) and the OID .1.3.6.1.2.1.1.2.0 contains the text .4.1.11863.:

detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),
    contains(".1.3.6.1.2.1.1.2.0", ".4.1.11863.")
)

The any_of option, on the other hand, is satisfied if any single one of the criteria is met. Here is an example where different values are allowed for the sysDescr keyword:

detect = any_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 3 system"),
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 4 system"),
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 4.1 system"),
)

By the way — are you familiar with regular expressions? If so, with these you would probably simplify this whole process and still get by with just a single line:

detect = matches(".1.3.6.1.2.1.1.1.0", r"FOO Version (3|4|4\.1) .*")

And one more important note: The OIDs you specify in the detect declaration from a plug-in will, in case of doubt, be fetched from every device that is monitored via SNMP. Therefore, be very sparing in your use of vendor-specific OIDs. Try to make your discovery absolutely exclusive to the sysDescr (.1.3.6.1.2.1.1.1.0) and the sysObjectID (.1.3.6.1.2.1.1.2.0). If you still need another different OID, then reduce the number of devices where it is requested to a minimum by excluding as many devices as possible beforehand using the sysDescr, e.g. like this:

detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),   # first check sysDescr
    contains(".1.3.6.1.4.1.4455.1.3", "bar"),  # fetch vendor specific OID
)

The all_of() works in such a way that if the first condition fails, the second is not even tried (and thus the OID in question is not fetched). Here in the example, the OID .1.3.6.1.4.1.4455.1.3 is fetched only for those devices that have foo in their sysDescr.
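
The effect of this short-circuit behavior can be illustrated with a small model. Please note that the functions below are stand-ins written for this sketch. They share their names with the API helpers but are not the Checkmk implementation, and the two devices as well as logging_fetcher are invented for the example.

```python
from typing import Callable, Dict, List, Optional

# A fetcher returns the value of one OID, or None if the device lacks it.
Fetcher = Callable[[str], Optional[str]]
Condition = Callable[[Fetcher], bool]


def startswith(oid: str, needle: str) -> Condition:
    def test(fetch: Fetcher) -> bool:
        value = fetch(oid)
        return value is not None and value.lower().startswith(needle.lower())
    return test


def contains(oid: str, needle: str) -> Condition:
    def test(fetch: Fetcher) -> bool:
        value = fetch(oid)
        return value is not None and needle.lower() in value.lower()
    return test


def all_of(*conditions: Condition) -> Condition:
    # all() stops at the first failing condition, so later OIDs stay unfetched.
    return lambda fetch: all(condition(fetch) for condition in conditions)


def logging_fetcher(device: Dict[str, str], fetched: List[str]) -> Fetcher:
    def fetch(oid: str) -> Optional[str]:
        fetched.append(oid)  # record every OID that is actually requested
        return device.get(oid)
    return fetch


detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),   # first check sysDescr
    contains(".1.3.6.1.4.1.4455.1.3", "bar"),  # vendor-specific OID
)

switch = {".1.3.6.1.2.1.1.1.0": "JetStream 24-Port Gigabit L2 Managed Switch"}
foo_box = {".1.3.6.1.2.1.1.1.0": "Foo appliance", ".1.3.6.1.4.1.4455.1.3": "foobar"}

switch_log: List[str] = []
foo_log: List[str] = []
switch_detected = detect(logging_fetcher(switch, switch_log))
foo_detected = detect(logging_fetcher(foo_box, foo_log))

print(switch_detected, switch_log)
print(foo_detected, foo_log)
```

For the switch only the sysDescr is requested. The vendor-specific OID is fetched only from the device whose sysDescr starts with foo.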

What happens if you have made the declaration incorrectly or at least not quite on target?

  • If the detection erroneously detects devices that do not have the necessary OIDs, your discovery function will not generate any services — so nothing 'bad' will happen. However, this will slow down the discovery on such devices, because now every time it will pointlessly try to retrieve the corresponding OIDs.

  • If the detection does not detect devices that are actually allowed, during the discovery no services will be found in the monitoring.

8.5. The OID ranges for monitoring

The most important part of the SNMP declaration is the specification of which OIDs are to be fetched for the monitoring. In almost all cases, a plug-in only needs selected branches from a single table to do this. Let’s consider the following example:

    fetch = SNMPTree(
        base = '.1.3.6.1.4.1.35424.1.2',
        oids = [
            '4.0',
            '5.0',
            '8.0',
        ],
    ),

The keyword base specifies an OID prefix here. All of the necessary data lies below this prefix. With oids you then specify a list of sub-OIDs to be fetched from there. In the above example, a total of three SNMP walks are then made, starting from the OIDs .1.3.6.1.4.1.35424.1.2.4.0, .1.3.6.1.4.1.35424.1.2.5.0 and .1.3.6.1.4.1.35424.1.2.8.0. It is important that these walks fetch the same number of variables and that they also correspond to each other. This means that, for example, the nth element from each of the walks corresponds to the same monitored object.

Here is an example from the check plug-in snmp_quantum_storage_info:

    tree = SNMPTree(
        base=".1.3.6.1.4.1.2036.2.1.1",  # qSystemInfo
        oids=[
            "4",   # qVendorID
            "5",   # qProdId
            "6",   # qProdRev
            "12",  # qSerialNumber
        ],
    )

Here, the vendor ID, the product ID, the product revision and the serial number are retrieved from each storage device.

The discovery and check functions are presented with this data as a table, i.e. as a list of lists. The table is transposed so that each entry in the outer list holds all of the data for one item. Each entry has as many elements as you specified in oids. This allows you to loop through the list in a very practical way, e.g.:

    for vendor_id, prod_id, prod_rev, serial_number in section:
        ...

Please note:

  • All entries are strings, even if the OIDs in question are actually numbers.

  • Missing OIDs are presented as empty strings.

  • Remember the ability to output formatted data during development with pprint.
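
How the individual walks become such a table can be sketched with zip. All values below are invented, and the variable columns is only a model of the per-OID walk results. Checkmk does this step for you before your functions are called.

```python
# Hypothetical results of the four walks, one list per sub-OID, for two devices.
columns = [
    ["QUANTUM", "ACME"],   # "4":  qVendorID
    ["DXi4700", "Store9"], # "5":  qProdId
    ["2.3", ""],           # "6":  qProdRev, missing on the second device
    ["SN001", "SN002"],    # "12": qSerialNumber
]

# zip(*columns) turns the per-OID columns into one row per monitored object.
section = [list(row) for row in zip(*columns)]

for vendor_id, prod_id, prod_rev, serial_number in section:
    # All values are strings; a missing OID shows up as an empty string.
    print(vendor_id, prod_id, prod_rev or "(no revision)", serial_number)
```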

8.6. Other SNMP special features

We will describe here in the future:

  • How to retrieve multiple independent SNMP areas.

  • What OIDEnd() is all about

  • Other special cases when dealing with SNMP

9. Formatting numbers

9.1. The basics

In the summary or details for a service, numbers are often output. To make it as easy as possible for you to format them nicely and correctly, and also to standardize the output from all check plug-ins, there are helper functions for rendering different kinds of sizes. All of these are functions of the render module and are consequently called with the prefix render. followed by the function name. For example, render.bytes(2000) results in the text 1.95 KiB.

What all of these functions have in common is that they get their values in a so-called canonical or natural unit. So you never have to think about which unit to pass, and there are no difficulties or errors with the conversion. For example, times are always given in seconds, and the sizes of hard disks, files, etc. are always given in bytes and not in kilobytes, kibibytes, blocks or any other confusion of units.

Please use these functions even if you don’t like the display so much. After all, this is then consistent for the user. And future versions of Checkmk may be able to change the display or even make it configurable for the user. Your check plug-in will then also benefit from this.

Following the detailed description of all of the display functions (render functions), you will find a summary of these in the form of a clear table.

9.2. Times, time spans, frequencies

Absolute time specifications (timestamps) are formatted with render.date() or render.datetime(). The values are always given in seconds since January 1, 1970, 00:00:00 UTC, the so-called epoch time. This is also the format used by the Python function time.time(). The advantage of this representation is that calculations are very easy, for example the duration of an operation when the start and end times are known. The formula is then simply duration = end - start. These calculations also work independently of the time zone, daylight saving time changes or leap years.

render.date() outputs only the date, render.datetime() adds the time. The output uses the current time zone of the Checkmk server that is running the check!

Examples:

Call | Output
render.date(0) | Jan 01 1970
render.datetime(0) | Jan 01 1970 01:00:00
render.date(1600000000) | Sep 13 2020
render.datetime(1600000000) | Sep 13 2020 14:26:40

Please don't be surprised that render.datetime(0) outputs 01:00 as the time instead of 00:00! This is because we are writing this manual in the time zone for Germany, which is one hour ahead of UTC during standard time, and January 1 does not fall within European daylight saving time.

For time spans there is the function render.timespan(). It takes a duration in seconds and outputs it in a human-readable form. For larger time spans, seconds or minutes are omitted.
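The effect of the server's time zone can be reproduced with the Python standard library alone. The following is only a sketch for illustration: the fixed offsets CET and CEST and the helper show() are names made up for this example, and the format string merely imitates the output shown in the table above. It is not how Checkmk implements render.datetime().

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets assumed for this illustration (German local time).
CET = timezone(timedelta(hours=1))    # standard time, UTC+1
CEST = timezone(timedelta(hours=2))   # daylight saving time, UTC+2

# Imitates the layout of the examples in the table above.
FORMAT = "%b %d %Y %H:%M:%S"


def show(epoch: int, tz: timezone) -> str:
    """Format an epoch timestamp as seen from the given time zone."""
    return datetime.fromtimestamp(epoch, tz=tz).strftime(FORMAT)


print(show(0, timezone.utc))        # the same instant, seen from UTC
print(show(0, CET))                 # seen from Germany in winter
print(show(1_600_000_000, CEST))    # seen from Germany in summer
```

The timestamp itself never changes. Only its textual representation depends on the time zone of the machine that formats it.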

Call | Output
render.timespan(1) | 1 second
render.timespan(123) | 2 minutes 3 seconds
render.timespan(12345) | 3 hours 25 minutes
render.timespan(1234567) | 14 days 6 hours
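A time span is typically obtained by subtracting two epoch timestamps. The following sketch uses plain Python with two example timestamps chosen for this illustration. The resulting number of seconds is the kind of value you would hand to render.timespan().

```python
# Two example epoch timestamps in seconds (values chosen for illustration).
start = 1_600_000_000
end = 1_600_012_345

# The duration is a simple difference, independent of any time zone.
duration = end - start

# Break the duration down by hand to see what a renderer has to work with.
hours, remainder = divmod(duration, 3600)
minutes, seconds = divmod(remainder, 60)

print(duration)                  # seconds between the two timestamps
print(hours, minutes, seconds)   # the same duration in h / min / s
```

The duration here is 12345 seconds, which matches the third row of the table above: for larger spans the seconds are dropped in the output.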

A frequency is effectively the reciprocal of time. The canonical unit is Hz, which is equivalent to once per second. One application is, for example, the clock rate of a CPU:

Call | Output
render.frequency(111222333444) | 111 GHz

9.3. Bytes

Whenever memory, files, hard disks, file systems and the like are concerned, the canonical unit is the byte. Since computers usually organize such things in powers of two, e.g. in units of 512, 1024 or 65,536 bytes, it was accepted from the beginning that a kilobyte is not 1000 but 1024 bytes. This was very practical, because it mostly produced round numbers. The legendary Commodore C64 had 64 kilobytes of memory and not 65.536.

Unfortunately, at some point hard disk manufacturers came up with the idea of specifying the sizes of their disks in powers of 1000. Since the difference between 1000 and 1024 is 2.4% at each step, and these factors multiply, a disk of 1 GB in the classic sense (1024 * 1024 * 1024 bytes) suddenly becomes 1.07 GB. That sells better.
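The figure of 1.07 can be checked with a few lines of arithmetic. This is a plain Python sketch; the constant names are chosen only for this example.

```python
# One "gigabyte" in the classic binary sense and in the decimal sense.
BINARY_GB = 1024 ** 3    # 1,073,741,824 bytes
DECIMAL_GB = 1000 ** 3   # 1,000,000,000 bytes

# The 2.4% difference per step (kilo, mega, giga) multiplies up.
per_step = 1024 / 1000
ratio = BINARY_GB / DECIMAL_GB

print(f"{per_step:.3f}")   # factor per prefix step
print(f"{ratio:.2f}")      # factor after three steps
```

After three prefix steps the factor is 1.024 to the power of three, which rounds to 1.07.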

This annoying confusion persists to this day and continues to cause errors. As a remedy, new prefixes based on the binary system were defined by the International Electrotechnical Commission. Accordingly, nowadays a kilobyte is officially 1000 bytes, and a kibibyte is 1024 bytes (2 to the power of 10). In addition one should say mebibyte, gibibyte and tebibyte (ever heard of these?). The abbreviations are KiB, MiB, GiB and TiB (note the i in each of them).

Checkmk follows this standard and provides several specialized render functions so that you can always produce correct output. Specifically for hard disks and file systems there is the render.disksize() function, which gives the output in powers of 1000.

Call | Output
render.disksize(1000) | 1.00 kB
render.disksize(1024) | 1.02 kB
render.disksize(2000000) | 2.00 MB

For the sizes of files it is common to specify the exact size in bytes without rounding. This has the advantage that you can see very quickly if a file has changed even minimally or that two files are (probably) the same. The render.filesize() function is responsible for this:

Call | Output
render.filesize(1000) | 1,000 B
render.filesize(1024) | 1,024 B
render.filesize(2000000) | 2,000,000 B

If you want to output a size that is not a disk or a file size, just use the generic render.bytes(). This will give you the output in the classic powers of 1024, using the new official notation:

Call | Output
render.bytes(1000) | 1000 B
render.bytes(1024) | 1.00 KiB
render.bytes(2000000) | 1.91 MiB
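The difference between output in powers of 1000 and in powers of 1024 can be reproduced in plain Python. The following is only an illustrative sketch: the function scale() and the two unit tuples are hypothetical names made up for this example and are not Checkmk's implementation. The real render functions may round differently for other values.

```python
# Unit names for decimal (SI) and binary (IEC) prefixes.
SI_UNITS = ("B", "kB", "MB", "GB", "TB")
IEC_UNITS = ("B", "KiB", "MiB", "GiB", "TiB")


def scale(num_bytes: float, base: int, units: tuple) -> str:
    """Divide by `base` until the value is below it, then attach the unit."""
    value = float(num_bytes)
    index = 0
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    if index == 0:
        # Plain bytes are shown without decimal places.
        return f"{int(value)} {units[0]}"
    return f"{value:.2f} {units[index]}"


for size in (1000, 1024, 2_000_000):
    print(size, scale(size, 1000, SI_UNITS), scale(size, 1024, IEC_UNITS))
```

For the three example values, this sketch yields the same texts as the tables for render.disksize() and render.bytes() above.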

9.4. Bandwidths and data rates

Networkers have their own terms and ways of expressing things. And as always, in each domain Checkmk tries hard to adopt the way of communicating that is customary there. That's why there are three different rendering functions for data rates and speeds. All of these have in common that the rates are passed in bytes per second, even when the output is in bits!

render.nicspeed() represents the maximum speed of a network card or switch port. Since they are not measured values, there is no need to do any rounding. Although no port can send single bits, the specifications are in bits for historical reasons. Attention: you must however always pass bytes per second here as well!

Examples:

Call | Output
render.nicspeed(12500000) | 100 MBit/s
render.nicspeed(100000000) | 800 MBit/s
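Because the input is in bytes per second while many devices report port speeds in bits per second (as, for example, ifSpeed in the IF-MIB does), a conversion is often needed before calling the render function. The following plain Python sketch shows the arithmetic. The function names are hypothetical and chosen only for this example.

```python
def bits_to_bytes_per_second(bits_per_second: float) -> float:
    """Convert a rate in bits per second to bytes (octets) per second."""
    return bits_per_second / 8


def bytes_to_bits_per_second(bytes_per_second: float) -> float:
    """Convert a rate in bytes (octets) per second to bits per second."""
    return bytes_per_second * 8


# A port advertised as 100 MBit/s, reported by the device in bits per second.
port_speed_bits = 100_000_000
port_speed_bytes = bits_to_bytes_per_second(port_speed_bits)

print(port_speed_bytes)                # value to pass in bytes per second
print(bytes_to_bits_per_second(123))   # 123 octets per second in bits
```

The result of 12500000 bytes per second is exactly the input used in the first row of the table above.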

render.networkbandwidth() is for an actual measured transmission speed on the network. The input value is again bytes per second (or 'octets' as a networker would say):

Call | Output
render.networkbandwidth(123) | 984 Bit/s
render.networkbandwidth(123456) | 988 kBit/s
render.networkbandwidth(123456789) | 988 MBit/s

Where the network is not involved and data rates are nevertheless output, bytes are again common. The most prominent examples are the IO rates of hard disks. For this there is the render function render.iobandwidth(), which in Checkmk works with powers of 1000:

Call | Output
render.iobandwidth(123) | 123 B/s
render.iobandwidth(123456) | 123 kB/s
render.iobandwidth(123456789) | 123 MB/s

9.5. Percentages

The render.percent() function represents a percentage value, rounded to two decimal places. It differs from the other functions in that the actual natural value (that is, the ratio) is not passed here, but the percentage itself. So if something is half full, for example, you must pass 50 rather than 0.5.

Because it can sometimes be interesting to know whether a value is almost zero or exactly zero, values that are greater than zero but less than 0.01 are marked by adding a "<" sign.

Call | Output
render.percent(0.004) | <0.01%
render.percent(18.5) | 18.50%
render.percent(123) | 123.00%
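Since many data sources deliver a used and a total value rather than a finished percentage, the conversion usually happens in your own code. The following is a minimal sketch in plain Python. The function name as_percentage() and the sample numbers are made up for this example.

```python
def as_percentage(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole` (50.0 means half)."""
    return 100.0 * part / whole


# Example values chosen for illustration: 37 of 200 units are in use.
used = 37
total = 200

percentage = as_percentage(used, total)
print(percentage)   # this value, not the ratio used / total, is what to pass
```

The ratio here would be 0.185, but the value to pass is 18.5, as in the second row of the table above.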

9.6. Summary

To recap, here is an overview of all of the render functions:

Function | Input | Description | Output example
date | Epoch | Date | Dec 18 1970
datetime | Epoch | Date and time | Dec 18 1970 10:40:00
timespan | Seconds | Duration / age | 3d 5m
frequency | Hz | Frequency (e.g. clock rate) | 110 MHz
disksize | Bytes | Hard disk size, base 1000 | 1,234 GB
filesize | Bytes | Size of files, exact | 1,334,560 B
bytes | Bytes | Size in bytes, base 1024 | 23.4 KiB
nicspeed | Octets/sec | Network card speed | 100 MBit/s
networkbandwidth | Octets/sec | Transmission speed | 23.50 GBit/s
iobandwidth | Bytes/sec | IO bandwidth | 124 MB/s
percent | Percent | Percentage value, meaningfully rounded | 99.997%

10. Notes for users of the old API

Are you already experienced in developing check plug-ins with the previous API — the one up to version 1.6.0 of Checkmk? Then you will find some notes about important changes summarized here.

10.1. saveint() and savefloat()

The two functions saveint() and savefloat() have been dropped. As a reminder, saveint(x) returned 0 if x could not be reasonably converted to a number, e.g. because it was an empty string or did not consist only of digits.

While there have been a few good use cases for this, it has been used incorrectly in the majority of cases, which in the past has resulted in many errors being obscured.

In a situation in which you want to get a 0 on an empty string — which is the most common 'good' use case of saveint(x) — you can simply code the following:

foo = int(x) if x else 0

For savefloat() everything applies analogously.
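A slightly fuller sketch of both replacements in plain Python follows. The helper names to_int() and to_float() are hypothetical and chosen only for this example. Note that, unlike the old functions, these helpers still raise an error for genuinely malformed input, so such problems are no longer hidden.

```python
def to_int(x: str) -> int:
    """Return 0 for an empty string, otherwise convert strictly."""
    return int(x) if x else 0


def to_float(x: str) -> float:
    """Return 0.0 for an empty string, otherwise convert strictly."""
    return float(x) if x else 0.0


print(to_int(""))        # empty string is treated as zero
print(to_int("42"))
print(to_float("1.5"))

# Malformed input is not silently turned into 0 any more.
try:
    to_int("n/a")
except ValueError as error:
    print(f"conversion failed: {error}")
```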
