1. Introduction
Checkmk includes nearly 2000 ready-made check plug-ins for all imaginable hardware and software. These are maintained by the Checkmk team, and new plug-ins are added every week. On the Checkmk Exchange there are also more plug-ins contributed by our users.
And yet there are always situations where a device, an application, or just a specific metric that is important to you is not covered by any of these plug-ins — maybe because it is something that was developed within your own company and is therefore not available to anyone else.
1.1. Does it always have to be a real plug-in?
What options do you have for implementing an effective monitoring here? Well, you could of course contact our support team and request that they develop a suitable plug-in for you — but naturally it is quicker if you can do it yourself.
You have four options:
Method | How to do it | Advantages | Disadvantages |
---|---|---|---|
Extend a Checkmk Agent with a simple script | | Is very simple, is possible in all programming languages offered by the monitored host’s operating system, and even supports service discovery | Threshold configuration only for the agent itself, SNMP not possible or very cumbersome |
Nagios-compatible check plug-in | | Access to all existing Nagios plug-ins, also free choice of the programming language | Threshold configuration only for the agent itself, SNMP not possible or very cumbersome, no service discovery possible |
Evaluating log messages | Monitor messages with the Event Console | No development necessary; you only need to set up rules in the Event Console | Only works if suitable log messages are available, no verified current status, no recording of metrics, no configurable thresholds |
Genuine Checkmk plug-in | Will be explained in this article | Integrates 100% with Checkmk, automatic service detection, central configuration of thresholds via graphical interface, very high performance, supports SNMP, automatic host and service labels are possible, supports HW/SW inventory, supported by standard libraries from Checkmk | Requires more training and knowledge of the Python programming language |
This article will show you how to develop real Checkmk check plug-ins — along with everything associated with them. Here we show you how to use the newly-developed API for programming plug-ins in version 2.0.0 of Checkmk.
1.2. What has changed compared to the old API?
Do you already have experience with developing check plug-ins for Checkmk Version 1.6.0 or earlier? If so, here is a concise overview of all of the changes introduced in the new Check API available from Version 2.0.0:
Plug-ins are now Python 3 modules, and the file names must have the .py extension.
The custom plug-ins are now located in the local/lib/check_mk/base/plugins/agent_based directory.
At the beginning of the file you will now need at least one special import statement.
The sections and the actual checks are now stored separately. For this purpose there are the new register.agent_section and register.check_plugin functions.
Several function and argument names have been renamed. Among other things, Discovery is now always used consistently (previously Inventory).
The Discovery function (formerly Inventory function) and also the Check function must now always work as generators (so use yield).
The names for the declared functions' arguments are now fixed.
Instead of the SNMP scan function, write a declaration of which OIDs are expected with which values.
The functions for representing numbers have been restructured (e.g. get_bytes_human_readable becomes render.bytes).
There is now a separate method for checks to exclude others (supersedes). This is no longer done in the SNMP scan function.
The auxiliary functions for working with counters, rates and averages have changed.
Instead of magic return values such as 2 for CRIT, there are now constants (e.g. State.CRIT).
Many possible programming errors in your plug-in are now recognised by Checkmk at a very early stage and can be immediately highlighted for you.
1.3. Will the old API still be supported?
Yes. The API for the development of check plug-ins valid up to version 1.6.0 of Checkmk will be supported with some minor restrictions for a few more years, because a significant number of plug-ins have been developed with it. During this time, Checkmk will offer both APIs in parallel. Details can be found in Werk #10601.
Nevertheless, we do recommend the new API for the development of new plug-ins, as it is more consistent and logical, better documented and is the most future-proof solution in the long term.
1.4. The different types of agents
Check plug-ins evaluate the data from the Checkmk agents. And that is why, before we leap into action here, we should first look at an overview of the types of agents that Checkmk actually recognises:
Checkmk Agent | The ‘normal’ plug-ins evaluate data that the Checkmk agent sends for Linux, Windows or other operating systems. This agent monitors operating system parameters and applications, and sometimes also server hardware. Each new check plug-in requires an extension of the agent to provide the necessary data. Therefore you first develop an agent plug-in, and then one or more check plug-ins that evaluate this data. |
Special Agent / API-Integration | You need a special agent if you do not receive the data that is relevant for monitoring from either the normal Checkmk agent or SNMP. The most common application for a special agent is querying HTTP-based APIs. Examples are the monitoring of AWS, Azure or VMware. In this case you write a script that runs directly on the Checkmk server, connects to the API, and outputs data in the same format as an agent plug-in would. For this you write suitable check plug-ins in the same way as with the ‘agent-based’ monitoring. |
SNMP | When monitoring via SNMP you do not need an extension of an agent, but simply evaluate the data retrieved from your device via SNMP, which provides this by default. Checkmk supports you and takes over all of the details and special features of the SNMP protocol. There is in fact an agent here as well — namely the SNMP agent which is pre-installed on the system being monitored. |
Active Check | This check type plays a special role. Here you first write a classic Nagios-compatible plug-in which is intended for execution on the Checkmk server, and which from there uses a network protocol to directly query a service on the target device. The most prominent example is the |
1.5. Prerequisites
If you feel like programming check plug-ins, you need to satisfy the following prerequisites:
Knowledge of the Python programming language
Experience with Checkmk, especially with regard to agents and checks
Experience with Linux on the command line
2. A first simple check plug-in
After this long introduction, it’s time we programmed our first simple check plug-in. As an example, let’s take a simple monitoring for Linux. Since Checkmk itself runs on Linux, it is very likely that you also have access to a Linux system.
The check plug-in will create a new service that detects whether someone has plugged a USB stick into a Linux server. In this case, the service should become critical. You might even find something like this useful, but it is really only a simplified example and possibly not programmed in a completely watertight way. For now, that is not what this exercise is about.
The whole procedure involves two steps:
We find out which Linux command can be used to determine whether a USB stick has been plugged in, and then extend the Linux agent with a small script that calls this command.
We then write a check plug-in in the Checkmk site that evaluates this data.
Here we go…
2.1. Finding the right command
Every check programming project begins with research: we have to find out how to get the information we need for monitoring. On Linux, this often involves command line commands. On Windows, PowerShell, VBScript or WMI can help, and with SNMP we have to find the right OIDs (there is a separate chapter on this).
Unfortunately, there is no general procedure for finding the right command, so we will not spend much time on the subject here. We will, however, briefly explain how it works for a USB stick.
First we log in to the host we want to monitor. Under Linux, the agent runs as the root user by default, which is why we perform all our tests as root. For our task with the USB stick, there are convenient symbolic links in the directory /dev/disk/by-id. These point to all the Linux block devices, and a plugged-in USB stick is one such device. In addition, you can tell from the usb- prefix of the ID that a block device is a USB device.
The following command lists all entries in this directory:
root@linux# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 9 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5 -> ../../sda5
lrwxrwxrwx 1 root root 9 May 14 11:21 wwn-0x5002538655584d30 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part5 -> ../../sda5
And now the same listing with the USB stick plugged in:
root@linux# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 9 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5 -> ../../sda5
lrwxrwxrwx 1 root root 9 May 14 12:15 usb-SCSI_DISK-0:0 -> ../../sdc
lrwxrwxrwx 1 root root 10 May 14 12:15 usb-SCSI_DISK-0:0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 May 14 12:15 usb-SCSI_DISK-0:0-part2 -> ../../sdc2
lrwxrwxrwx 1 root root 9 May 14 11:21 wwn-0x5002538655584d30 -> ../../sda
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 May 14 11:21 wwn-0x5002538655584d30-part5 -> ../../sda5
2.2. Purging the data
Strictly speaking, we could stop here, transport this whole output via the Checkmk agent to the Checkmk server and have it analysed there, because in Checkmk the following recommendation always applies: let the server do the complex work, and keep the agent plug-in as simple as possible.
But there is still too much hot air in here. Transferring unnecessary data is always undesirable. Avoiding unnecessary transfers saves network traffic, memory, computing time and also makes everything clearer. That is simply a better way of doing things!
First, we can omit the -l option. This already makes the output of ls much leaner:
root@linux# ls /dev/disk/by-id/
ata-APPLE_SSD_SM0512F_S1K5NYBF810191        wwn-0x5002538655584d30
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1  wwn-0x5002538655584d30-part1
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2  wwn-0x5002538655584d30-part2
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3  wwn-0x5002538655584d30-part3
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4  wwn-0x5002538655584d30-part4
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5  wwn-0x5002538655584d30-part5
Now the multi-column layout gets in the way, but it only appears because the ls command recognises that it is running in an interactive terminal. Later, as part of the agent, it will output the data in a single column. We can also easily force this here with the -1 option (for output in one column):
root@linux# ls -1 /dev/disk/by-id/
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part2
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part3
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part4
ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part5
wwn-0x5002538655584d30
wwn-0x5002538655584d30-part1
wwn-0x5002538655584d30-part2
wwn-0x5002538655584d30-part3
wwn-0x5002538655584d30-part4
wwn-0x5002538655584d30-part5
If you look closely, you will see not only the block devices themselves, but also any partitions that exist there. These are the entries that end in -part1, -part2, etc. We do not need these for our check and can get rid of them quite easily with grep, using the -v option for negative logic:
root@linux# ls /dev/disk/by-id/ | grep -v -- -part
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
usb-SCSI_DISK-0:0
wwn-0x5002538655584d30
Here you can now see much more clearly that in our example there are in fact exactly three devices when the USB stick is plugged in.
Perfect! We now have a clear list of all block devices which has been compiled with a simple command. That’s all we need.
We have again omitted the -1 in the last command, because ls now writes into a pipe and outputs a single column by itself. And grep needs the -- because otherwise it would interpret the word -part as the four options -p, -a, -r and -t.
And by the way: why don’t we simply grep for usb in addition, so that only USB devices are output? Of course we could do that. But for one thing, our example would become rather boring, and besides, it is more reassuring to get some content in the section in a normal situation rather than nothing at all. This way you can see immediately on the Checkmk server that the agent plug-in is working correctly.
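If you prefer to prototype the filtering logic before committing it to the shell one-liner, the same idea can be sketched in Python. This is only an illustration: the function name without_partitions and the sample list are made up for this sketch, and the agent plug-in itself remains a shell script.

```python
def without_partitions(entries):
    """Drop entries whose name contains '-part', like `grep -v -- -part`."""
    return [name for name in entries if "-part" not in name]


# Hypothetical input: a few of the names from the directory listing.
entries = [
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191-part1",
    "usb-SCSI_DISK-0:0",
    "usb-SCSI_DISK-0:0-part1",
    "wwn-0x5002538655584d30",
    "wwn-0x5002538655584d30-part1",
]
devices = without_partitions(entries)
print("\n".join(devices))
```

Only the three entries without a -part suffix remain, which matches what the grep pipeline prints.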
2.3. Including the command in the agent
In order for us to be able to retrieve this data from the Checkmk server, we need to make the new command part of the Checkmk agent on the system being monitored. We could of course simply edit the /usr/bin/check_mk_agent file there and include it. However, this would have the disadvantage that our command would disappear again when we update the agent’s software, because the file is replaced at that point.
It is therefore better to write an agent plug-in. This is even simpler: all we need is an executable file with our command in the /usr/lib/check_mk_agent/plugins directory.
And one more point is important: we can’t just output our data like this. What we still need is a section header. This is a specially-formatted line that contains our new check’s name. By means of these section headers, Checkmk can later recognise where this plug-in’s data begins and the previous plug-in’s data ends.
So now we need a meaningful name for our new check.
This name is limited to lower case letters (only a-z, no accents, no umlauts), underscores and numbers and must be unique.
Avoid name collisions with existing check plug-ins.
If you are curious about which names already exist, in a Checkmk site you can list them on the command line with cmk -L:
OMD[mysite]:~$ cmk -L | head -n 20
3par_capacity agent HPE 3PAR: Capacity
3par_cpgs agent HPE 3PAR: CPGs
3par_cpgs_usage agent HPE 3PAR: CPGs Usage
3par_hosts agent HPE 3PAR: Hosts
3par_ports agent HPE 3PAR: Ports
3par_remotecopy agent HPE 3PAR: Remote Copy
3par_system agent HPE 3PAR: System
3par_volumes agent HPE 3PAR: Volumes
3ware_disks agent 3ware ATA RAID Controller: State of Disks
3ware_info agent 3ware ATA RAID Controller: General Information
3ware_units agent 3ware ATA RAID Controller: State of Units
acme_agent_sessions snmp ACME Devices: Agent Sessions
acme_certificates snmp ACME Devices: Certificates
acme_fan snmp ACME Devices: Fans
acme_powersupply snmp ACME Devices: Power Supplies
acme_realm snmp ACME Devices: Realm
acme_sbc agent ACME SBC: Health
acme_sbc_settings agent ACME SBC: Health Settings
acme_sbc_snmp snmp ACME SBC: Health (via SNMP)
acme_temp snmp ACME Devices: Temperature
The second column shows how the respective check plug-in obtains its data.
For our example, let’s choose the name linux_usbstick.
In this example the section header must look like this:
<<<linux_usbstick>>>
We can simply output this with echo. If we also remember the 'shebang' (this is not a venomous sting from the desert planet but an abbreviation for sharp and bang, the latter being a nickname for the exclamation mark), by which Linux recognises that it should execute the script with the shell, our plug-in looks like this:
#!/bin/sh
echo '<<<linux_usbstick>>>'
ls /dev/disk/by-id/ | grep -v -- -part
We have simply used linux_usbstick as the file name here, although the file name does not really matter.
But one thing is still very important: Make the file executable!
root@linux# chmod +x /usr/lib/check_mk_agent/plugins/linux_usbstick
Of course, you can easily try out the plug-in manually by entering the complete path as a command:
root@linux# /usr/lib/check_mk_agent/plugins/linux_usbstick
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
2.4. Testing the agent
As always, the next most important tasks are testing and troubleshooting. It is best to proceed in three steps:
Try out the plug-in on its own. We have just done that.
Test the agent as a whole locally.
Retrieve the agent via the Checkmk server.
Testing the agent locally is very simple: as the root user, call the check_mk_agent command. The new section should appear somewhere in its output:
root@linux# check_mk_agent
Here is an excerpt from that output which contains the new section:
<<<lnx_thermal:sep(124)>>>
thermal_zone0|-|BAT0|35600
thermal_zone1|-|x86_pkg_temp|81000|0|passive|0|passive
<<<local>>>
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463274)>>>
accountsservice|0.6.45-1ubuntu1|amd64|deb|-||install ok installed
acl|2.2.52-3build1|amd64|deb|-||install ok installed
acpi|1.7-1.1|amd64|deb|-||install ok installed
By piping the output into less you can scroll through it (press the space bar to scroll, / to search and Q to exit):
root@linux# check_mk_agent | less
The third test is then performed directly from the Checkmk site. Include the host in the monitoring (e.g. as myserver01) and then retrieve the agent data with cmk -d. You should get the same output here:
OMD[mysite]:~$ cmk -d myserver01 | less
By the way: grep has the -A option to output a few more lines after each hit. This allows you to conveniently search and output the section:
OMD[mysite]:~$ cmk -d myserver01 | grep -A5 '^<<<linux_usbstick'
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463559)>>>
accountsservice|0.6.45-1ubuntu1|amd64|deb|-||install ok installed
If this works, your agent is now ready! And what did it take? We simply created a three-line script at the path /usr/lib/check_mk_agent/plugins/linux_usbstick and made it executable!
Everything that follows now only takes place on the Checkmk server: There we will write the actual check plug-in.
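To make the role of the section headers more tangible, here is a minimal sketch of how an agent output can be split into sections by looking for lines of the form <<<name>>>. It is an illustration only and not Checkmk's own implementation: the function name split_sections and the regular expression are assumptions of this sketch, and header options such as sep(124) are simply ignored.

```python
import re

# Matches headers such as <<<linux_usbstick>>> or <<<lnx_thermal:sep(124)>>>.
# Everything after the first colon (the header options) is ignored here.
HEADER = re.compile(r"^<<<([^:>]+)(?::[^>]*)?>>>$")


def split_sections(agent_output):
    """Group the lines of an agent output under the header that precedes them."""
    sections = {}
    current = None
    for line in agent_output.splitlines():
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


sample = """<<<local>>>
<<<linux_usbstick>>>
ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30
<<<lnx_packages:sep(124):persist(1589463274)>>>
acl|2.2.52-3build1|amd64|deb|-||install ok installed
"""
sections = split_sections(sample)
print(sections["linux_usbstick"])
```

The sketch shows why every agent plug-in has to print its header first: without it, the two device lines would be attributed to whatever section happened to come before.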
2.5. Declaring the section
Preparing the agent is the most complicated part, but it is only half the battle. Now we have to teach Checkmk how to handle the information and the new agent section, which services it should generate, when they should go to OK or CRIT, etc. We do all this by programming a check plug-in in Python.
For your own check plug-ins you will find a directory prepared in the local hierarchy of the site directory. This is local/lib/check_mk/base/plugins/agent_based/. Here in the path, base means the part of Checkmk that is responsible for actually monitoring and alerting. The agent_based is for all plug-ins that relate to the Checkmk agent (so not alerting plug-ins, for example).
The easiest way to work with this is to switch to it:
OMD[mysite]:~$ cd local/lib/check_mk/base/plugins/agent_based
This directory belongs to the site user and is therefore editable by you. You can edit your plug-in with any text editor installed on the Linux system.
So let’s create our plug-in here. The convention is that the file name reflects the agent section’s name. It is mandatory that the file name ends in .py, since from Checkmk version 2.0.0 onwards the plug-ins are always real Python modules.
First, we need to import the functions needed for the plug-ins from other Python modules. The simplest method for this is with a *. As you might guess, there is also a version number of the API for plug-in programming here. This will be version 1 until further notice, and is abbreviated here to v1:
from .agent_based_api.v1 import *
This versioning allows us to eventually provide future new versions of the API parallel to the previous ones, so that existing check plug-ins continue to work without problems.
In the simplest case, you skip explicitly declaring the section. If you want to implement a parse function (which professional developers would always advise you to do), see the section on parse functions for more information.
2.6. Registering the check
In order for Checkmk to know that the new check exists, it must be registered. This is done by calling the function register.check_plugin. In doing so, you must always specify at least four things:
name: The name of the check plug-in. If you don’t want to get into trouble, take the same name here as for your new agent section. This way the check will automatically know which section to evaluate.
service_name: The name of the service as it should then appear in the monitoring.
discovery_function: The function that discovers services of this type (more on this in a moment).
check_function: The function to perform the actual check (more on this in a moment).
So for our check the registration will look like this:
register.check_plugin(
    name="linux_usbstick",
    service_name="USB stick",
    discovery_function=discover_linux_usbstick,
    check_function=check_linux_usbstick,
)
It’s best not to try this out just yet, because of course we still have to write the discover_linux_usbstick and check_linux_usbstick functions beforehand, and these functions must appear in the source code before the above declaration.
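The ordering requirement is ordinary Python behaviour: a module is executed from top to bottom, so a function name can only be passed on once its def statement has run. The following sketch demonstrates this outside of Checkmk. The function register_check is a hypothetical stand-in for the real registration call, and check_demo is a made-up check function.

```python
def register_check(**kwargs):
    """Hypothetical stand-in for a registration call; it only stores its arguments."""
    return kwargs


# Registering before the function exists fails, because the name is unknown.
try:
    register_check(check_function=check_demo)
except NameError as error:
    message = str(error)


def check_demo(section):
    yield "result"


# After the definition, the same call works.
registered = register_check(check_function=check_demo)
print(message)
```

For this reason the registration call conventionally sits at the very end of the plug-in file.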
2.7. Writing the discovery function
A special feature of Checkmk is the automatic discovery of services to be monitored. In order for this to work, each check plug-in must define a function that uses the agent’s output to decide whether services of this type are to be created for the host in question, and if so, which ones.
The discovery function is always called when the service discovery is carried out for a host. This function then decides whether services are to be created, and which ones. In the standard procedure, it receives exactly one argument with the name section. This contains the data of the agent section in a parsed format (more on this later).
We implement the following simple logic: If the agent section linux_usbstick exists, then we also create a matching service. This service will then automatically appear on all hosts where our agent plug-in has been rolled out. We recognise the presence of the section simply by the fact that our discovery has actually been invoked!
The discovery function must return an object of the type Service for each service to be created using yield (not with return). For checks that can only occur once per host, no further information is needed:
def discover_linux_usbstick(section):
    yield Service()
2.8. Writing the check function
So now we can come to the actual check function, which finally decides on the basis of the current agent output which state a service should assume. Since our check has no parameters and there is only ever one per host, our function is also called with the single argument section.
Since we really need the content this time, we have to deal with the format of this argument. Unless you have explicitly defined a parse function, Checkmk splits each line of the section into a list of words, using spaces as separators. These word lists are in turn collected in a list, so the end result is always a list of lists.
In the simple case in which our agent plug-in only finds two devices, it will then look like this (here there is only one word per line):
[['ata-APPLE_SSD_SM0512F_S1K5NYBF810191'], ['wwn-0x5002538655584d30']]
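The default splitting described above can be reproduced in a plain Python interpreter. This is a sketch of the idea rather than Checkmk's actual code, and the function name default_parse is made up for the example.

```python
def default_parse(lines):
    """Split every section line on whitespace, giving a list of word lists."""
    return [line.split() for line in lines]


raw_lines = [
    "ata-APPLE_SSD_SM0512F_S1K5NYBF810191",
    "wwn-0x5002538655584d30",
]
section = default_parse(raw_lines)
print(section)
```

A line with several words, such as "West 100 100", would accordingly become a list with three entries.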
The check function now goes through the section line by line and looks for a line whose first (and only) word begins with usb-SCSI_DISK. If this is found, the state will become CRIT. Here is the implementation:
def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")
And here is the explanation:
With for line in section we loop through all of the lines in the agent’s output.
We then check whether the first word in the line, i.e. the respective device, begins with usb-SCSI_DISK.
If yes, we generate a check result with the status CRIT and the text Found USB stick. And we then end the function with a return.
If the loop is run without finding anything, it will generate the status OK and the text No USB stick found.
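Because the plug-in can only be imported inside a Checkmk site, it can be handy to try out the pure logic of the check function in a plain Python interpreter first. The following sketch does this with minimal stand-ins: the State and Result classes defined here are simplified substitutes written for this example, not the classes of the Check API.

```python
from enum import Enum
from typing import NamedTuple


class State(Enum):  # simplified stand-in for the API's State constants
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


class Result(NamedTuple):  # simplified stand-in for the API's Result class
    state: State
    summary: str


def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")


without_stick = [["ata-APPLE_SSD_SM0512F_S1K5NYBF810191"], ["wwn-0x5002538655584d30"]]
with_stick = without_stick + [["usb-SCSI_DISK-0:0"]]

results_ok = list(check_linux_usbstick(without_stick))
results_crit = list(check_linux_usbstick(with_stick))
print(results_ok[0].state.name, results_crit[0].state.name)  # prints: OK CRIT
```

Note that each call produces exactly one result, since the return statement ends the generator right after the CRIT result.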
2.9. The complete plug-in at a glance
And here is the complete plug-in one more time:
from .agent_based_api.v1 import *

def discover_linux_usbstick(section):
    yield Service()

def check_linux_usbstick(section):
    for line in section:
        if line[0].startswith("usb-SCSI_DISK"):
            yield Result(state=State.CRIT, summary="Found USB stick")
            return
    yield Result(state=State.OK, summary="No USB stick found")

register.check_plugin(
    name = "linux_usbstick",
    service_name = "USB stick",
    discovery_function = discover_linux_usbstick,
    check_function = check_linux_usbstick,
)
And this is the plug-in for the Linux agent:
#!/bin/sh
echo '<<<linux_usbstick>>>'
ls /dev/disk/by-id/ | grep -v -- -part
3. Checks with more than one service (items) per host
3.1. Basic principles
In our example, we have built a very simple check that either creates one service on a host or none at all. A very common situation, of course, is that one check creates several services on one host.
The most common example of this is the file systems of a host. The plug-in named df creates one service per file system on the host. To distinguish these services, the mount point of the file system (e.g. /var), or the drive letter (e.g. C:), is built into the service name. This results in service names such as Filesystem /var or Filesystem C:. The word /var or C: is referred to here as the item. So we also speak of a check with items.
If you want to build a check with multiple items, you need to implement the following things:
The discovery function must generate a service for each of the items that are to be meaningfully monitored on the host.
In the service name you must include this item using the %s placeholder (e.g. "Filesystem %s").
The check function is invoked once separately for each item and receives this as an argument. It must then fish out the relevant data for this item from the agent data.
3.2. A simple example
To be able to test the whole thing in practice, we will simply build another agent section that outputs only dummy data. A small shell script is sufficient for this. The section should be called foobar in this example:
#!/bin/sh
echo "<<<foobar>>>"
echo "West 100 100"
echo "East 197 200"
echo "North 0 50"
The foobar section here contains three sectors: West, East and North (whatever that means). In each sector there are a number of slots, some of which are occupied (e.g. in West 100 of 100 slots are occupied).
Now we will create a matching check plug-in for this.
The registration is as usual, but with the important difference that the service name now contains exactly one %s. At this position the item’s name will be inserted later by Checkmk:
register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
)
The discovery function now has the task of determining the items to be monitored.
As usual, it receives the section argument. And again, this is a list of lines, which in turn are lists of words.
In our example the list looks like this:
[['West', '100', '100'], ['East', '197', '200'], ['North', '0', '50']]
You can loop through such a list with Python and give meaningful names to these three words grouped in each line:
for sector, _used, _slots in section:
    ...
In each line, the first word — here the sector — is our item.
Whenever we find an item, we return it with yield, creating an object of type Service that gets the sector name as its item. The underscore prefix indicates that for now we don’t care about the other two columns in the output, since in a discovery it ultimately doesn’t matter how many slots are occupied.
Overall it looks like this:
def discover_foobar(section):
    for sector, _used, _slots in section:
        yield Service(item=sector)
Of course, it would be easy to omit some lines here on the basis of arbitrary criteria. Maybe there are sectors which have a size of 0 and which you would never want to monitor? Simply omit such rows so that no item will be generated for them.
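As a sketch of such a filter, the following variant skips every sector whose size is 0. The rule itself, the extra sector South and the simplified Service class are assumptions made for this illustration. In a real plug-in, Service comes from the Check API import.

```python
from typing import NamedTuple, Optional


class Service(NamedTuple):  # simplified stand-in for the API's Service class
    item: Optional[str] = None


def discover_foobar(section):
    for sector, _used, slots in section:
        if int(slots) == 0:
            continue  # hypothetical rule: never monitor sectors without slots
        yield Service(item=sector)


# "South" is a made-up sector with a size of 0.
section = [["West", "100", "100"], ["East", "197", "200"], ["South", "0", "0"]]
items = [service.item for service in discover_foobar(section)]
print(items)
```

Only West and East are discovered, so no service would be created for the empty sector.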
Then later, when the host is being monitored, the check function is called separately for each service, and thus for each item. Therefore, in addition to the section, it also receives the item argument with the item it is looking for. Now we go through all of the lines one after the other until we find the line that corresponds to the desired item:
def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            ...
Now all that is missing is the actual logic which determines when the item should in fact be OK, WARN or CRIT. We do that here like this:
When all slots are used, the state becomes CRIT.
If 10 or fewer slots are free, the state becomes WARN.
Otherwise the state is OK.
The occupied and total slots always appear as the second and third words in each line. However, here we are dealing with strings, not numbers, and we need numbers to be able to compare and calculate. We therefore convert the strings into numbers using int().
We then return the check result by supplying an object of type Result via yield. This takes the parameters state and summary:
def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used) # convert string to int
            slots = int(slots) # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return
In this context, please note the following:
The return statement ensures that the check function is terminated immediately after processing the found item. There is nothing more to be done, after all.
If the loop is processed without finding the item being searched for, Checkmk automatically generates the result UNKNOWN - Item not found in monitoring data. This is intentional and a good thing. Do not handle this case yourself. If you don’t find an item you are looking for, just let Python run its course through the function and let Checkmk do its work.
With the summary argument you define the text that the service shows in its status output. This is purely informational and will not be evaluated further by Checkmk.
By the way, for the common situation where you want to check a simple metric against thresholds, there is a helper function called check_levels. This helper function is explained in the Check API documentation, which you can access in Checkmk via Help > Plugin API reference > Agent based API ("Check API").
Now let’s first try out the discovery. For the sake of clarity we will restrict
this action to our plug-in by using the --detect-plugins=foobar
option:
OMD[mysite]:~$ cmk --detect-plugins=foobar -vI myhost123
3 foobar
SUCCESS - Found 3 services, 1 host labels
And now right away we can test the checking process
(here also limited to foobar
):
OMD[mysite]:~$ cmk --detect-plugins=foobar -v myhost123
Foobar Sector East WARN - used 197 out of 200 slots
Foobar Sector North OK - used 0 out of 50 slots
Foobar Sector West CRIT - used 100 out of 100 slots
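To follow the logic of the check function without a running site, the sketch below replays it on hand-written section data. The State enum and the Result tuple are minimal stand-ins written for this illustration, not the classes from the Check API, and the sample rows are assumed to correspond to the agent output used above.

```python
from enum import Enum
from typing import NamedTuple


class State(Enum):
    """Stand-in for the Check API's State (illustration only)."""
    OK = 0
    WARN = 1
    CRIT = 2


class Result(NamedTuple):
    """Stand-in for the Check API's Result (illustration only)."""
    state: State
    summary: str


def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(state=s, summary=f"used {used} out of {slots} slots")
            return


# Assumed parsed agent section: one list of words per line
section = [["West", "100", "100"], ["East", "197", "200"], ["North", "0", "50"]]

for item in ("East", "North", "West", "South"):
    # "South" does not exist, so the generator yields nothing for it
    print(item, list(check_foobar(item, section)))
```

For the unknown item South the list of results is empty. In Checkmk, an empty result of this kind is what leads to the automatically generated UNKNOWN state described earlier.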
3.3. The example — a recap
And here again our example in full. To avoid errors due to undefined function names, the functions must always be defined before registering.
from .agent_based_api.v1 import *
import pprint
def discover_foobar(section):
    for sector, used, slots in section:
        yield Service(item=sector)
def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return
register.check_plugin(
    name = "foobar",
    service_name = "Foobar Sector %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
)
4. Performance values
4.1. Determining values in the check function
Not always, but very often checks work with numbers. With its graphing system Checkmk has a component to store, evaluate and display such numbers. This works completely independently from the generation of any resulting OK, WARN or CRIT states.
Such measured values, or metrics, are determined by the check function and simply returned as an additional result. For this purpose the Metric object is used, which requires at least the two arguments name and value. Here is an example:
yield Metric("fooslots", used)
4.2. Threshold information
Furthermore there are two optional arguments. With the argument levels
you can provide information about thresholds for WARN and CRIT,
in the form of a pair of two numbers. This is then usually plotted on the graph
as a yellow and a red line. The first number (yellow line) represents the warning
threshold, the second (red line) the critical one. The convention is that the
check already goes to WARN when the warning threshold is reached (analogous
for CRIT).
The coding could then look like this (here with hard coded thresholds):
yield Metric("fooslots", used, levels=(190,200))
Notes:
If only one of the two thresholds is defined, simply enter None for the other threshold, e.g. levels=(None, 200).
Floating point numbers are also allowed, but not strings.
Attention: the check function itself is responsible for checking the thresholds. The specification of levels serves only as additional information for the graphing system!
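Because the levels passed to Metric are purely informational, it can help to derive the state from the same pair of numbers so that graph and status never disagree. The following is a minimal sketch of that idea. The helper name state_from_levels is hypothetical and not part of the Check API, and states are plain strings here to keep the example self-contained.

```python
def state_from_levels(value, levels):
    """Return "OK", "WARN" or "CRIT" for a value and (warn, crit) upper levels.

    Either level may be None, meaning that this threshold is not defined.
    Following the convention above, a threshold counts as reached when
    value >= threshold.
    """
    warn, crit = levels
    if crit is not None and value >= crit:
        return "CRIT"
    if warn is not None and value >= warn:
        return "WARN"
    return "OK"


levels = (190, 200)
for used in (150, 190, 200):
    print(used, state_from_levels(used, levels))

# Only a critical threshold is defined here
print(250, state_from_levels(250, (None, 200)))
```

In a real check function, the same levels tuple would then be passed both to such a helper and to the levels argument of Metric.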
4.3. The values range
Analogous to the threshold values, you can also provide the graphing system with information about the range of possible values. This denotes the smallest and largest possible value. This is done in the boundaries argument, where None can again be used for one of the two boundaries.
Example:
yield Metric(name="fooslots", value=used, boundaries=(0, 200))
And now once again our check function from the above example, but this time with the return of metric information including threshold values and a value range (this time of course not with fixed but with calculated values):
def check_foobar(item, section):
    for sector, used, slots in section:
        if sector == item:
            used = int(used)    # convert string to int
            slots = int(slots)  # convert string to int
            yield Metric(
                "fooslots",
                used,
                levels=(slots-10, slots),
                boundaries=(0, slots))
            if used == slots:
                s = State.CRIT
            elif slots - used <= 10:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state = s,
                summary = f"used {used} out of {slots} slots")
            return
5. Checks with multiple partial results
In order to prevent the number of services on a host from growing out of all control, several partial results are often combined in a single service. For example, the Memory used service under Linux checks not only RAM and swap usage, but also shared memory, page tables and various other information.
The API provided by Checkmk offers a very convenient interface for this: a check function may simply generate a result with yield any number of times. The overall status for the service is then based on the 'worst' partial result according to the scheme OK → WARN → UNKNOWN → CRIT.
Here is an abbreviated, fictitious example:
def check_foobar(section):
    yield Result(state=State.OK, summary="Knulf rate optimal")
    # ...
    yield Result(state=State.WARN, summary="Gnarz required")
    # ...
    yield Result(state=State.OK, summary="Last Szork was good")
The summary of the service in the GUI then looks like this: "Knulf rate optimal, Gnarz required WARN, Last Szork was good". And the overall status will be WARN.
You can return multiple metrics in the same way.
Simply call yield Metric(…)
once for each metric.
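The 'worst of all partial results' rule described above can be sketched in a few lines. The State enum below is a stand-in written for this illustration, and worst_state is a hypothetical helper name. The only point is the ordering OK → WARN → UNKNOWN → CRIT, in which UNKNOWN ranks below CRIT although its numeric value is higher.

```python
from enum import Enum


class State(Enum):
    """Stand-in for the Check API's State (illustration only)."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


# Severity order from the text: OK -> WARN -> UNKNOWN -> CRIT
SEVERITY = [State.OK, State.WARN, State.UNKNOWN, State.CRIT]


def worst_state(states):
    """Return the most severe state of a non-empty list of partial states."""
    return max(states, key=SEVERITY.index)


print(worst_state([State.OK, State.WARN, State.OK]))
print(worst_state([State.WARN, State.UNKNOWN]))
print(worst_state([State.UNKNOWN, State.CRIT]))
```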
6. Summary and Details
In the Checkmk monitoring, each service also has a line of text in addition to its status OK, WARN, and so on. Up until version 1.6.0 this was called Output of check plugin. As of version 2.0.0 this is now called Summary — so it has the task of providing a concise summary of the status. The idea is that this text does not exceed a length of 60 characters, so that it is always easy to read and ensures a clear table display without annoying line breaks.
Next to this there is the Details field, which used to be called Long output of check plugin (multiline). Here all of the details of the state are displayed, the idea being that all of the summary information is included here as well.
When calling yield Result(…)
you can determine which information is
so important that it should be displayed in the summary and for which information
it is sufficient that it appears in the details. The default rule is that partial
results that lead to a WARN/CRIT will always be visible in the summary.
In our examples so far we have always used the following call:
yield Result(state=State.OK, summary="some important text")
This will cause some important text to always appear in the Summary, and additionally in the Details, so you should only use this for important information. If a partial result is of secondary importance, replace summary with notice and the text will appear only in the details, as long as the service is OK.
yield Result(state=State.OK, notice="some additional text")
If the state is WARN or CRIT, the text will then automatically appear as an addition in the summary:
yield Result(state=State.CRIT, notice="some additional text")
Thus, in the summary it will be immediately clear why the service is not OK.
Last but not least, for both summary and notice you have the possibility to specify an alternative text for the details, which may contain more information about the partial result:
yield Result(state=State.OK,
             summary="55% used space",
             details="55.2% of 160 GB used (82 GB)")
To summarize, this means:
The full text of the summary (for services that are OK) should not exceed 60 characters.
Always use either summary or notice, never both and never neither.
Add details as necessary if you want the details text to be an alternative one.
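The interplay of summary, notice and details can be modelled in plain Python. The sketch below is a simplified model of the rules stated in this chapter and not Checkmk's actual implementation. Partial and render_service_output are hypothetical names, states are plain strings, and state markers in the output are left out.

```python
from typing import NamedTuple, Optional


class Partial(NamedTuple):
    """Simplified stand-in for one partial result."""
    state: str                      # "OK", "WARN" or "CRIT"
    summary: Optional[str] = None
    notice: Optional[str] = None
    details: Optional[str] = None


def render_service_output(partials):
    """Return (summary_text, details_text) according to the rules above."""
    summary_parts, details_lines = [], []
    for p in partials:
        if (p.summary is None) == (p.notice is None):
            raise ValueError("use exactly one of summary or notice")
        text = p.summary if p.summary is not None else p.notice
        # A summary text is always shown; a notice only if the state is not OK
        if p.summary is not None or p.state != "OK":
            summary_parts.append(text)
        # The details show the alternative text if given, else the same text
        details_lines.append(p.details if p.details is not None else text)
    return ", ".join(summary_parts), "\n".join(details_lines)


partials = [
    Partial("OK", summary="55% used space",
            details="55.2% of 160 GB used (82 GB)"),
    Partial("OK", notice="some additional text"),
    Partial("CRIT", notice="some critical text"),
]
summary, details = render_service_output(partials)
print(summary)
print(details)
```

The OK notice shows up only in the details, while the CRIT notice is promoted into the summary as well.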
7. Error handling
7.1. Exceptions and crash reports
The correct handling of errors (unfortunately) consumes a large chunk of the programming work. The good news is that the Checkmk API already does most of the work for you. Consequently, in most cases the best approach is simply not to handle errors yourself.
When Python gets into a situation that is in some way unexpected, it responds with what is known as an exception. Here are a few examples:
You convert a string into a number with int(…), but the string does not actually contain a number, e.g. int("foo").
You access the fifth element of bar with bar[4], but this in fact has only four elements.
You are calling a function that does not exist.
Here the important general rule applies: Don’t catch exceptions yourself! Checkmk will always do this for you in a consistent and efficient way, in most cases accompanied by a crash report. Such a report will look like this, for example:
By clicking on the icon, the user is navigated to a page where they can
view a display of the file in which the crash took place.
get all information about the crash, for instance any error messages, call stack, agent output, the current values of local variables and much more.
send the report to us (Checkmk GmbH) as feedback.
Submitting the report to Checkmk GmbH of course only makes sense for check plug-ins which are official Checkmk components. But you can also ask your users to simply send you the data. The users can then help you to find the error. It is often the case that a check plug-in works for you, but other users may experience sporadic errors. Working together you can then usually identify these problems very easily.
But if you were to intercept the exception yourself, all of this information would simply be unavailable. You would perhaps set the service to UNKNOWN and issue an error message, but all of the background circumstances that led to an error (e.g. the data from the agent) would simply be invisible.
7.2. Viewing exceptions on the command line
If you run your plug-in on the command line, no crash reports will be generated — you will only see the summarized error message:
OMD[mysite]:~$ cmk -II --detect-plugins=foobar myhost123
WARNING: Exception in discovery function of check plugin 'foobar': invalid literal for int() with base 10: 'foo'
However, if you simply add the --debug option to this, you will then receive the Python stack trace:
OMD[mysite]:~$ cmk --debug -II --detect-plugins=foobar myhost123
Traceback (most recent call last):
File "/omd/sites/myhost123/bin/cmk", line 82, in <module>
exit_status = modes.call(mode_name, mode_args, opts, args)
File "/omd/sites/myhost123/lib/python3/cmk/base/modes/init.py", line 68, in call
return handler(*handler_args)
File "/omd/sites/myhost123/lib/python3/cmk/base/modes/check_mk.py", line 1577, in mode_discover
discovery.do_discovery(set(hostnames), options.get("checks"), options["discover"] == 1)
File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 345, in do_discovery
_do_discovery_for(
File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 397, in _do_discovery_for
discovered_services = _discover_services(
File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1265, in _discover_services
service_table.update({
File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1265, in <dictcomp>
service_table.update({
File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1337, in _execute_discovery
yield from _enriched_discovered_services(hostname, check_plugin.name, plugins_services)
File "/omd/sites/myhost123/lib/python3/cmk/base/discovery.py", line 1351, in _enriched_discovered_services
for service in plugins_services:
File "/omd/sites/myhost123/lib/python3/cmk/base/api/agent_based/register/check_plugins.py", line 69, in filtered_generator
for element in generator(*args, **kwargs):
File "/omd/sites/myhost123/local/lib/python3/cmk/base/plugins/agent_based/foobar.py", line 5, in discover_foobar
int("foo")
ValueError: invalid literal for int() with base 10: 'foo'
7.3. Invalid outputs from an agent
The question is how to react when the output from the agent is not in the form you would normally expect — whether it is from the 'real' agent or when the data comes via SNMP. Let’s assume that you always expect three words per line. What should you do if only two words were to arrive?
If this is permitted and known agent behaviour, then of course you need to handle that case explicitly in your code.
If, however, this is not actually allowed, then it is best to treat every line as if it always contains three words, e.g. with:
def check_foobar(section):
    for foo, bar, baz in section:
        # ...
If there should ever be a line that does not consist of exactly three words, a nice exception will be generated and you will receive the very helpful crash report that was just mentioned.
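What this exception looks like can be tried out in plain Python, independent of Checkmk. In the sketch below, parse_lines is a hypothetical helper that unpacks every line into exactly three words, just like the loop above. The try/except is there only to display the error in this demonstration. In a real plug-in you would let the exception propagate so that Checkmk can create the crash report.

```python
def parse_lines(section):
    """Unpack each line into exactly three words."""
    rows = []
    for foo, bar, baz in section:
        rows.append((foo, bar, baz))
    return rows


good = [["a", "1", "2"], ["b", "3", "4"]]
bad = [["a", "1", "2"], ["b", "3"]]  # second line has only two words

print(parse_lines(good))

try:
    parse_lines(bad)
except ValueError as exc:
    # Caught only for demonstration purposes
    print(f"ValueError: {exc}")
```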
7.4. Missing items
What if the agent outputs data correctly, but the item to be checked is missing? So, like this, for example:
def check_foobar(item, section):
    for sector, used, slots in section:
        if item == sector:
            # ... Check state ...
            yield Result(...)
            return
If the item you are looking for is not there, the loop is run and Python just falls out of the back at the end of the function without 'yielding' a result. And that’s exactly the correct procedure! Because Checkmk recognizes that the item to be monitored is missing and with UNKNOWN generates the correct status and a suitable standard text for it.
8. SNMP-based checks
8.1. The fundamentals
Developing checks that work with SNMP is very similar to developing agent-based checks, except that you additionally need to specify which SNMP ranges (OIDs) the check requires. If you don’t yet have any experience with SNMP, we strongly recommend reading the article on Monitoring via SNMP as preparation at this point.
The process of discovery and checking via SNMP is somewhat different from that for a normal agent. Unlike an agent, which sends all of the relevant information on its own, with SNMP we have to say exactly which data ranges we require. A complete dump of all data would be theoretically possible (via SNMP walk), but this process can take minutes even for fast devices and over an hour for complex switches. Thus, this is not viable for checking or even for discovery. Checkmk therefore proceeds in a more targeted manner.
SNMP detection
Service detection is divided into two phases. First, the SNMP detection is performed. This determines which plug-ins are of actual interest for the respective device. To do this, a few SNMP OIDs are retrieved individually, without a walk. The most important of these is the sysDescr (OID: 1.3.6.1.2.1.1.1.0). Under this OID, each SNMP device holds a description of itself, e.g. ‘Cisco NX-OS(tm) n5000, Software (n5000-uk9),…’.
Based on this text, it can already be decided for very many plug-ins whether they are relevant for this device. If the text is still not specific enough, further OIDs are fetched and checked. The result of the SNMP detection will then be a list of candidates for check plug-ins.
Discovery
In the second step, the necessary monitoring data is fetched for each of these
candidates with SNMP walks. These are then combined into a table and provided to
the check’s discovery function with the section
argument, which then as
usual determines the items to be monitored.
Checking
When running checks, it is already known which plug-ins are to be executed for
the device and the SNMP detection is now omitted. Here the monitoring data
needed for the plug-ins are fetched immediately by SNMP walks and from it the
section
argument for the check function is filled.
Summary
So what do you need to do differently with an SNMP check compared to an agent-based one?
You do not need a plug-in for the agent.
You must define the individual OIDs and search texts required for an SNMP detection.
You have to define which SNMP areas must be fetched for monitoring.
8.2. A word about the MIBs
Before we continue, we want to say a word about the infamous SNMP MIBs, because there are many prejudices about these. Right at the beginning, some good news: Checkmk doesn’t need them. Really! But they are an important aid in being able to develop an SNMP check.
So what is a MIB? Literally, the abbreviation means Management Information Base — somewhat meaningless really. To be concrete, a MIB is a quite easy to read text file which describes a certain subtree in the SNMP world. Namely, it states which branch in the tree — that is, which OID — has which meaning. This includes a name for the OID, a comment on what values it can take (e.g. for enumerated data types, where things like 1=up, 2=down, etc. are defined) and sometimes a useful comment.
Checkmk provides a set of freely-available MIB files. These describe very general areas in the global OID tree, but do not contain any vendor-specific areas. Therefore they are of little help for self-developed checks.
So try to find the MIB files relevant for your particular device somewhere on the manufacturer’s web pages or even on the device’s management interface, and install these in the Checkmk site in local/share/check_mk/mibs. You can then have SNMP walks convert OID numbers to names and can thus more quickly find where the data of interest for your monitoring is located. Also, carefully written MIBs contain interesting information in their comments, as we noted above. You can easily read a MIB file with a text editor or with less.
8.3. Locating the correct OIDs
The crucial prerequisite for developing a plug-in is, of course, that you know which OIDs contain the necessary information. The first step in doing this (if the device doesn’t refuse) is to perform a complete SNMP walk. This will retrieve all of the available data via SNMP.
Checkmk can accomplish this very easily for you. To do so, first include the device (or one of the devices) for which you want to develop a plug-in in your monitoring. Let’s say this device is called mydevice01. Make sure that the device’s basic functions can be monitored. As a minimum, the SNMP Info and Uptime services need to be found, and probably at least one Interface as well. This is how you make sure that SNMP access works cleanly.
Then switch to the command line in the Checkmk site. Here you can perform a
complete walk with the following command. We recommend using the -v
(verbose) option when doing this for the very first time:
OMD[mysite]:~$ cmk -v --snmpwalk mydevice01
mydevice01:
Walk on ".1.3.6.1.2.1"...3898 variables.
Walk on ".1.3.6.1.4.1"...6025 variables.
Wrote fetched data to /omd/sites/mysite/var/check_mk/snmpwalks/mydevice01.
As mentioned earlier, such a complete walk can take minutes or even
hours — although the latter is rare. So don’t become nervous if it takes a
while to complete this process. The walk will be saved in the file
var/check_mk/snmpwalks/mydevice01
.
This will be an easily readable text file that starts something like this:
.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots
.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.11863.1.1.3
.1.3.6.1.2.1.1.3.0 546522419
.1.3.6.1.2.1.1.4.0 hh@example.com
.1.3.6.1.2.1.1.5.0 sw-ks-01
.1.3.6.1.2.1.1.6.0 Core Switch Serverraum klein
.1.3.6.1.2.1.1.7.0 3
.1.3.6.1.2.1.2.1.0 27
In each line there is an OID and then its value. And right there in the first
line you find the most important one, namely the sysDescr
.
Now the OIDs themselves are not very informative. If the correct MIBs are
installed, you can have them converted to names in a second step with
the cmk --snmptranslate
command. It is best to redirect the result — which would otherwise appear in the terminal — to a file:
OMD[mysite]:~$ cmk --snmptranslate mydevice01 > translated
Processing 9923 lines.
finished.
The translated
file reads like the original walk, but has a translated
value for the OID on each line after the -->
:
.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots --> SNMPv2-MIB::sysDescr.0
.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.11863.1.1.3 --> SNMPv2-MIB::sysObjectID.0
.1.3.6.1.2.1.1.3.0 546522419 --> DISMAN-EVENT-MIB::sysUpTimeInstance
.1.3.6.1.2.1.1.4.0 hh@example.com --> SNMPv2-MIB::sysContact.0
.1.3.6.1.2.1.1.5.0 sw-ks-01 --> SNMPv2-MIB::sysName.0
.1.3.6.1.2.1.1.6.0 Core Switch Serverraum klein --> SNMPv2-MIB::sysLocation.0
.1.3.6.1.2.1.1.7.0 3 --> SNMPv2-MIB::sysServices.0
.1.3.6.1.2.1.2.1.0 27 --> IF-MIB::ifNumber.0
.1.3.6.1.2.1.2.2.1.1.1 1 --> IF-MIB::ifIndex.1
.1.3.6.1.2.1.2.2.1.1.2 2 --> IF-MIB::ifIndex.2
Example: the OID .1.3.6.1.2.1.1.4.0
has the translated name
SNMPv2-MIB::sysContact.0
. This is an important hint — the rest is then
practice, experience and of course experimentation.
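When searching a stored walk for interesting data, a few lines of Python can help to narrow it down to one subtree. The sketch below assumes that each line consists of an OID, a single space and the value, as in the excerpt above. parse_walk and below are hypothetical helper names, and the sample text is a shortened copy of that excerpt.

```python
def parse_walk(text):
    """Split each walk line into (oid, value) at the first space."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        oid, _, value = line.partition(" ")
        entries.append((oid, value))
    return entries


def below(entries, prefix):
    """Return only the entries whose OID lies below the given prefix."""
    prefix = prefix.rstrip(".") + "."
    return [(oid, value) for oid, value in entries if oid.startswith(prefix)]


walk = """\
.1.3.6.1.2.1.1.1.0 JetStream 24-Port Gigabit L2 Managed Switch with 4 Combo SFP Slots
.1.3.6.1.2.1.1.4.0 hh@example.com
.1.3.6.1.2.1.1.5.0 sw-ks-01
.1.3.6.1.2.1.2.1.0 27
"""

entries = parse_walk(walk)
for oid, value in below(entries, ".1.3.6.1.2.1.1"):
    print(oid, "=", value)
```

On a real site you would read the text from the file var/check_mk/snmpwalks/mydevice01 instead of using an inline string.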
8.4. Registering the SNMP section
So, once you have determined the necessary OIDs, it’s on to the actual development of the plug-in. This is done in three steps:
For an SNMP detection, specify which OIDs must contain which texts for your plug-in to run.
Declare which OID branches need to be fetched for the monitoring.
Write a check plug-in analogous to those for agent-based checks.
The first two steps are performed by registering an SNMP section.
You do this by calling register.snmp_section(). Here you specify at least three arguments: the name of the section (name), the details for the SNMP detection (detect), and the OID branches needed for the actual monitoring (fetch). Here is an example of a hypothetical check plug-in with the name foo:
register.snmp_section(
    name = "foo",
    detect = startswith(".1.3.6.1.2.1.1.1.0", "foobar device"),
    fetch = SNMPTree(
        base = '.1.3.6.1.4.1.35424.1.2',
        oids = [
            '4.0',
            '5.0',
            '8.0',
        ],
    ),
)
The SNMP detection
With the keyword detect you specify under which conditions the discovery function should be executed. In our example, this is the case if the value of the OID .1.3.6.1.2.1.1.1.0 (i.e. the sysDescr) starts with the text foobar device (upper and lower case are generally ignored). In addition to startswith, there are a number of other possible attributes. There is also a negated form of each of these, which begins with not_:
Attribute | Negation | Function |
---|---|---|
equals(oid, needle) | not_equals(oid, needle) | The value of the OID is equal to the text |
contains(oid, needle) | not_contains(oid, needle) | The value of the OID at some point contains the text |
startswith(oid, needle) | not_startswith(oid, needle) | The value of the OID starts with the text |
endswith(oid, needle) | not_endswith(oid, needle) | The value of the OID ends with the text |
matches(oid, regex) | not_matches(oid, regex) | The value of the OID matches the regular expression |
exists(oid) | not_exists(oid) | Met if the OID is available on the device. The value may be empty. |
As well as the above, there is also the possibility of linking multiple attributes with all_of or any_of. The option all_of requires all of the listed attributes to be met for a positive detection of the plug-in. The following example finds the plug-in on a device if sysDescr starts with the text foo (or FOO or Foo) and the OID .1.3.6.1.2.1.1.2.0 contains the text .4.1.11863.:
detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),
    contains(".1.3.6.1.2.1.1.2.0", ".4.1.11863.")
)
The any_of option, on the other hand, is satisfied if any single one of the criteria is met. Here is an example where different values are allowed for the sysDescr:
detect = any_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 3 system"),
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 4 system"),
    startswith(".1.3.6.1.2.1.1.1.0", "foo version 4.1 system"),
)
By the way, are you familiar with regular expressions? If so, you could probably simplify this whole declaration and get by with just a single line:
detect = matches(".1.3.6.1.2.1.1.1.0", r"FOO Version (3|4|4\.1) .*"),
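How such a pattern behaves can be tried out with Python's re module before putting it into a plug-in. The sketch below assumes that the expression must match the whole value and that upper and lower case are ignored. Both are assumptions made for this illustration, so verify them against the API documentation. In addition, the dot in 4.1 is escaped here so that it only matches a literal dot.

```python
import re

# Assumptions for this sketch: full match, case-insensitive
PATTERN = re.compile(r"FOO Version (3|4|4\.1) .*", re.IGNORECASE)


def matches_sysdescr(value):
    """Return True if the whole sysDescr value matches the pattern."""
    return PATTERN.fullmatch(value) is not None


for text in (
    "foo version 3 system",
    "Foo Version 4.1 system, build 77",
    "foo version 5 system",
    "foo version 4x1 system",   # would match if the dot were not escaped
):
    print(f"{text!r}: {matches_sysdescr(text)}")
```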
And one more important note: the OIDs you specify in the detect declaration of a plug-in will, in case of doubt, be fetched from every device that is monitored via SNMP. Therefore, be very sparing in your use of vendor-specific OIDs. Try to base your detection exclusively on the sysDescr (.1.3.6.1.2.1.1.1.0) and the sysObjectID (.1.3.6.1.2.1.1.2.0). If you still need a different OID, then reduce the number of devices from which it is requested to a minimum by excluding as many devices as possible beforehand using the sysDescr, e.g. like this:
detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),    # first check sysDescr
    contains(".1.3.6.1.4.1.4455.1.3", "bar"),   # fetch vendor specific OID
)
all_of() works in such a way that if the first condition fails, the second is not even tried (and thus the OID in question is not fetched). Here in the example, the OID .1.3.6.1.4.1.4455.1.3 is fetched only for those devices whose sysDescr starts with foo.
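The effect of this short-circuit evaluation can be made visible with a small model. The functions below are stand-ins that only borrow the names of the detection helpers. They are written for this illustration and say nothing about how Checkmk implements them internally. FakeDevice is a hypothetical class that records which OIDs were requested.

```python
# Stand-ins for illustration only, not the real Checkmk helpers
def startswith(oid, needle):
    return lambda fetch: (fetch(oid) or "").lower().startswith(needle.lower())


def contains(oid, needle):
    return lambda fetch: needle.lower() in (fetch(oid) or "").lower()


def all_of(*conditions):
    # all() stops at the first failing condition, so the OIDs of later
    # conditions are never requested from the device
    return lambda fetch: all(cond(fetch) for cond in conditions)


class FakeDevice:
    """Hypothetical device that records every OID requested from it."""

    def __init__(self, oids):
        self.oids = oids
        self.requested = []

    def fetch(self, oid):
        self.requested.append(oid)
        return self.oids.get(oid)


detect = all_of(
    startswith(".1.3.6.1.2.1.1.1.0", "foo"),    # first check sysDescr
    contains(".1.3.6.1.4.1.4455.1.3", "bar"),   # vendor-specific OID
)

other = FakeDevice({".1.3.6.1.2.1.1.1.0": "Some other switch"})
print(detect(other.fetch), other.requested)

foo = FakeDevice({".1.3.6.1.2.1.1.1.0": "Foo appliance",
                  ".1.3.6.1.4.1.4455.1.3": "model bar-7"})
print(detect(foo.fetch), foo.requested)
```

For the first device only the sysDescr is requested. The vendor-specific OID is requested solely from the device that passed the first condition.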
What happens if you have made the declaration incorrectly or at least not quite on target?
If the detection erroneously detects devices that do not have the necessary OIDs, your discovery function will not generate any services — so nothing 'bad' will happen. However, this will slow down the discovery on such devices, because now every time it will pointlessly try to retrieve the corresponding OIDs.
If the detection does not detect devices that are actually allowed, during the discovery no services will be found in the monitoring.
8.5. The OID ranges for monitoring
The most important part of the SNMP declaration is the specification of which OIDs are to be fetched for the monitoring. In almost all cases, a plug-in only needs selected branches from a single table to do this. Let’s consider the following example:
fetch = SNMPTree(
    base = '.1.3.6.1.4.1.35424.1.2',
    oids = [
        '4.0',
        '5.0',
        '8.0',
    ],
),
The keyword base specifies an OID prefix here. All necessary data lies below this prefix. With oids you then specify a list of sub-OIDs to be fetched from there. In the above example, a total of three SNMP walks are then made, namely starting from the OIDs .1.3.6.1.4.1.35424.1.2.4.0, .1.3.6.1.4.1.35424.1.2.5.0 and .1.3.6.1.4.1.35424.1.2.8.0. It is important that these walks fetch the same number of variables and that they also correspond to each other. This means that, for example, the nth element from each of the walks corresponds to the same monitored object.
Here is an example from the check plug-in snmp_quantum_storage_info:
tree = SNMPTree(
    base=".1.3.6.1.4.1.2036.2.1.1",  # qSystemInfo
    oids=[
        "4",   # qVendorID
        "5",   # qProdId
        "6",   # qProdRev
        "12",  # qSerialNumber
    ],
)
Here, the vendor ID, the product ID, the product revision and the serial number are retrieved from each storage device.
The discovery and check functions are presented with this data as a table, i.e. as a list of lists. The table is transposed so that each entry in the outer list contains all of the data for one item. Each entry has as many elements as you specified in oids.
This allows you to loop through the list in a very practical way, e.g.:
for vendor_id, prod_id, prod_rev, serial_number in section:
    ...
Please note:
All entries are strings, even if the OIDs in question are actually numbers.
Missing OIDs are presented as empty strings.
Remember the ability to output formatted data during development with
pprint
.
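How the individual walks turn into the row-wise table described above can be illustrated with zip(). In this sketch each inner list of columns stands for the values returned by the walk on one sub-OID. All values are invented sample data, including the empty string that represents a missing OID.

```python
# One list per fetched sub-OID, all with the same length (invented values)
columns = [
    ["ACME", "ACME"],          # sub-OID "4":  vendor ID
    ["DiskBox", "TapeBox"],    # sub-OID "5":  product ID
    ["1.0", ""],               # sub-OID "6":  revision, missing on device 2
    ["SN001", "SN002"],        # sub-OID "12": serial number
]

# Transpose: the nth element of every walk ends up in the nth row
section = [list(row) for row in zip(*columns)]

for vendor_id, prod_id, prod_rev, serial_number in section:
    # All entries are strings; a missing OID arrives as an empty string
    revision = prod_rev or "unknown"
    print(f"{vendor_id} {prod_id} rev {revision} ({serial_number})")
```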
8.6. Other SNMP special features
We will describe here in the future:
How to retrieve multiple independent SNMP areas.
What OIDEnd() is all about
Other special cases when dealing with SNMP
9. Formatting numbers
9.1. The basics
In the summary or details for a service, numbers are often output. To make it as easy as possible for you to format them nicely and correctly, and also to standardize the output from all check plug-ins, there are helper functions for rendering different kinds of sizes. All of these are functions of the render module and are consequently called with the prefix render. For example, render.bytes(2000) results in the text 1.95 KiB.
What all of these functions have in common is that they get their values in a so-called canonical or natural unit. This means that you never have to think about which unit to use, and conversion difficulties or errors do not arise. For example, times are always given in seconds, and the sizes of hard disks, files, etc. are always given in bytes and not in kilobytes, kibibytes, blocks or any other confusion of units.
Please use these functions even if you don’t like the display so much. After all, this is then consistent for the user. And future versions of Checkmk may be able to change the display or even make it configurable for the user. Your check plug-in will then also benefit from this.
Following the detailed description of all of the display functions (render functions), you will find a summary of these in the form of a clear table.
9.2. Times, time spans, frequencies
Absolute time specifications (timestamps) are formatted with
render.date()
or render.datetime()
. The specifications are
always in seconds from January 1, 1970, 00:00:00 UTC — the so-called
epoch time. This is also the format used by the Python function
time.time()
. The advantage of this representation is that it can be
used to calculate very easily, for example, the duration of an operation if the
start and end times are known. The formula is then simply
duration = end - start
. These calculations also work independently of
the time zone, daylight saving time changes or leap years.
render.date()
outputs only the date, render.datetime()
adds
the time. The output is done according to the current time zone for the Checkmk server that is running the check!
Examples:
Call | Output |
---|---|
render.date(0) | Jan 01 1970 |
render.datetime(0) | Jan 01 1970 01:00:00 |
render.date(1600000000) | Sep 13 2020 |
render.datetime(1600000000) | Sep 13 2020 14:26:40 |
Now please don’t be surprised that render.datetime(0) outputs 01:00 as the time instead of 00:00! This is because we are writing this manual in the time zone for Germany, which is one hour ahead of UTC during standard time (and January 1 is, as you know, not in European daylight saving time).
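Both points, the simple calculation with epoch timestamps and the influence of the time zone on the displayed time, can be reproduced with Python's standard library. The sketch uses fixed UTC offsets as stand-ins for the German time zone (one hour in winter, two hours in summer). It does not use the render module, and the helper fmt is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
CET = timezone(timedelta(hours=1), "CET")     # German standard time
CEST = timezone(timedelta(hours=2), "CEST")   # German summer time


def fmt(timestamp, tz):
    """Format an epoch timestamp for the given time zone."""
    return datetime.fromtimestamp(timestamp, tz).strftime("%Y-%m-%d %H:%M:%S")


print(fmt(0, UTC))             # the epoch itself
print(fmt(0, CET))             # same instant, one hour ahead
print(fmt(1600000000, CEST))   # a date in September, summer time applies

# Durations are independent of the time zone
start, end = 1600000000, 1600000123
duration = end - start
print(duration)
```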
For time spans there is also the function render.timespan(). This takes a duration in seconds and outputs it in a human-readable form. For larger time spans, seconds or minutes are omitted.
Call | Output |
---|---|
render.timespan(1) | 1 second |
render.timespan(123) | 2 minutes 3 seconds |
render.timespan(12345) | 3 hours 25 minutes |
render.timespan(1234567) | 14 days 6 hours |
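The principle behind these outputs, showing only the largest unit and the one following it, can be sketched as shown below. This is not the implementation of render.timespan(), and the real function may round or abbreviate differently. timespan_sketch is a hypothetical name, and the sketch merely reproduces the four values from the table.

```python
UNITS = [("day", 86400), ("hour", 3600), ("minute", 60), ("second", 1)]


def timespan_sketch(seconds):
    """Show the largest non-zero unit plus the next smaller one, if non-zero."""
    remaining = int(seconds)
    counts = []
    for name, size in UNITS:
        count, remaining = divmod(remaining, size)
        counts.append((name, count))
    # Index of the largest non-zero unit (fall back to seconds for 0)
    first = next((i for i, (_, c) in enumerate(counts) if c), len(counts) - 1)
    shown = [counts[first]]
    shown += [(n, c) for n, c in counts[first + 1:first + 2] if c]
    return " ".join(f"{c} {n}" + ("" if c == 1 else "s") for n, c in shown)


for value in (1, 123, 12345, 1234567):
    print(value, "->", timespan_sketch(value))
```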
A frequency is effectively the reciprocal of time. The canonical unit is Hz, which is equivalent to once per second. A field of application is, for example, for the clock rate of a CPU:
Call | Output |
---|---|
render.frequency(111222333444) | 111 GHz |
9.3. Bytes
Whenever memory, files, hard disks, file systems and the like are concerned, the canonical unit is the byte. Since computers usually organize such things in powers of two, e.g. in units of 512, 1024 or 65536 bytes, it was accepted from the beginning that a kilobyte is not 1000 but 1024 bytes. This is in itself very practical, because in this way mostly round numbers came out. The legendary Commodore C64 had 64 kilobytes of memory and not 65.536.
Unfortunately, at some point hard disk manufacturers came up with the idea of specifying the sizes of their disks in units of 1000. Since the difference between 1000 and 1024 is 2.4% for each prefix step, and these factors multiply, a 1 GB disk (1024 * 1024 * 1024 bytes) suddenly becomes 1.07 GB. That sells better.
This annoying confusion persists to this day and continues to cause errors. As a remedy, new prefixes based on the binary system were defined by the International Electrotechnical Commission. Accordingly, nowadays a kilobyte is officially 1000 bytes, and a kibibyte is 1024 bytes (2 to the power of 10). In addition one should say mebibyte, gibibyte and tebibyte (ever heard of these?). The abbreviations are KiB, MiB, GiB and TiB (note the i in each of them).
Checkmk adapts itself to this standard and helps you with multiple adapted render functions so that you can always produce correct outputs. So specifically for hard disks and file systems there is the render.disksize() function, which gives the output in powers of 1000.
Call | Output |
---|---|
render.disksize(1000) | 1.00 kB |
render.disksize(1024) | 1.02 kB |
render.disksize(2000000) | 2.00 MB |
For the sizes of files it is common to specify the exact size in bytes without rounding. This has the advantage that you can see very quickly if a file has changed even minimally or that two files are (probably) the same. The render.filesize() function is responsible for this:
Call | Output |
---|---|
render.filesize(1000) | 1,000 B |
render.filesize(1024) | 1,024 B |
render.filesize(2000000) | 2,000,000 B |
If you want to output a size that is not a disk or a file size, just use the generic render.bytes(). This will give you the output in the classic powers of 1024 in the new official notation:
Call | Output |
---|---|
render.bytes(1000) | 1000 B |
render.bytes(1024) | 1.00 KiB |
render.bytes(2000000) | 1.91 MiB |
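To make the difference between the two bases tangible, here is a small stand-alone sketch. The function format_size() is a hypothetical helper written for this illustration and is not part of the Checkmk render module; it only mimics the outputs shown in the tables above for a few simple values.

```python
# Simplified stand-in for illustration, NOT the Checkmk render module.
def format_size(value: int, base: int) -> str:
    """Format a byte count with base 1000 (kB, MB, ...) or base 1024 (KiB, MiB, ...)."""
    if base == 1000:
        units = ["kB", "MB", "GB", "TB"]
    elif base == 1024:
        units = ["KiB", "MiB", "GiB", "TiB"]
    else:
        raise ValueError("base must be 1000 or 1024")
    if value < base:
        return f"{value} B"
    scaled = float(value)
    unit = "B"
    for unit in units:
        scaled /= base
        if scaled < base:
            break
    return f"{scaled:.2f} {unit}"


print(format_size(2000000, 1000))  # 2.00 MB
print(format_size(2000000, 1024))  # 1.91 MiB
print(f"{2000000:,} B")            # 2,000,000 B (exact size, no rounding)
```

The same number of bytes therefore yields three different texts, depending on whether it is treated as a disk size, a generic size, or an exact file size.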
9.4. Bandwidths and data rates
Networkers have their own terms and ways of expressing things. And as always, in each domain Checkmk tries hard to adopt the way of communicating that is customary there. That’s why there are three different rendering functions for data rates and speeds. All of these have in common that the rates are passed in bytes per second, even when the output is in bits!
render.nicspeed() represents the maximum speed of a network card or switch port. Since they are not measured values, there is no need to do any rounding. Although no port can send single bits, the specifications are in bits for historical reasons. Attention: you must however always pass bytes per second here as well!
Examples:
Call | Output |
---|---|
render.nicspeed(12500000) | 100 MBit/s |
render.nicspeed(100000000) | 800 MBit/s |
render.networkbandwidth() is for an actual measured transmission speed on the network. The input value is again bytes per second (or 'octets' as a networker would say):
Call | Output |
---|---|
render.networkbandwidth(123) | 984 Bit/s |
render.networkbandwidth(123456) | 988 kBit/s |
render.networkbandwidth(123456789) | 988 MBit/s |
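Since the input is always in bytes per second and the output in bits, a factor of 8 lies between the two. The following sketch shows the conversion in both directions; the two helper functions are hypothetical and exist only for this illustration.

```python
# Hypothetical helpers for this sketch; they are not part of Checkmk.
def octets_to_bits_per_second(octets_per_second: float) -> float:
    """One octet (byte) consists of 8 bits."""
    return octets_per_second * 8


def bits_to_octets_per_second(bits_per_second: float) -> float:
    """Convert a nominal speed in bits/s into the bytes/s the render functions expect."""
    return bits_per_second / 8


print(octets_to_bits_per_second(12500000))   # 100000000, i.e. 100 MBit/s
print(bits_to_octets_per_second(100000000))  # 12500000.0
```

If your data source reports a speed in bits per second, divide by 8 before you pass the value to one of the render functions.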
Where the network is not involved and data rates are nevertheless output, bytes are again common. The most prominent examples are the IO rates of hard disks. For this there is the render function render.iobandwidth(), which in Checkmk works with powers of 1000:
Call | Output |
---|---|
render.iobandwidth(123) | 123 B/s |
render.iobandwidth(123456) | 123 kB/s |
render.iobandwidth(123456789) | 123 MB/s |
9.5. Percentages
The render.percent() function represents a percentage value, rounded to two decimal places. It is an exception to the other functions in that the actual natural value (that is, the ratio) is not passed here, but really the percentage. So if something is half full, for example, you must pass 50 rather than 0.5.
Because it can sometimes be interesting to know whether a value is almost zero or exactly zero, values that are greater than zero but less than 0.01 are marked by adding a "<" sign.
Call | Output |
---|---|
render.percent(0.004) | <0.01% |
render.percent(18.5) | 18.50% |
render.percent(123) | 123.00% |
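The behaviour described above can be sketched in a few lines. The function format_percent() is a hypothetical stand-in for this illustration, not the actual render.percent() implementation; it shows the conversion from a ratio and the special marking of very small values.

```python
# Hypothetical stand-in for illustration, NOT the Checkmk implementation.
def format_percent(percent: float) -> str:
    """Format a percentage with two decimal places, marking tiny non-zero values."""
    if 0 < percent < 0.01:
        return "<0.01%"
    return f"{percent:.2f}%"


ratio = 0.5                         # "half full" as a natural ratio
print(format_percent(ratio * 100))  # 50.00%
print(format_percent(0.004))        # <0.01%
```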
9.6. Summary
To recap, here is an overview of all of the render functions:
Function | Input | Description | Output example |
---|---|---|---|
date | Epoch | Date | Dec 18 1970 |
datetime | Epoch | Date and time | Dec 18 1970 10:40:00 |
timespan | Seconds | Duration / Age | 3d 5m |
frequency | Hz | Frequency (e.g. clock rate) | 110 MHz |
disksize | Bytes | Hard disk size, base 1000 | 1.234 GB |
filesize | Bytes | Size of files, exact | 1,334,560 B |
bytes | Bytes | Size in bytes, base 1024 | 23.4 KiB |
nicspeed | Octets/sec | Network card speed | 100 MBit/s |
networkbandwidth | Octets/sec | Transmission speed | 23.50 GBit/s |
iobandwidth | Bytes/sec | IO bandwidth | 124 MB/s |
percent | Percent | Percentage value, meaningfully rounded | 99.997% |
10. Thresholds and check parameters
10.1. A rule set for the setup
In one of our previous examples, we generated the state WARN if there were only 10 or fewer slots left. The number 10 was coded directly into the check function (hard-coded, as programmers would say). In Checkmk, however, users are more used to being able to configure such thresholds and parameters by rule. We will therefore take a look at how you can improve your check so that it can be configured via the setup interface.
To do this, we need to distinguish between two scenarios:
There is already a suitable rule set. This can actually only be the case if your new check performs a check for which Checkmk already has check plug-ins of the same type, such as monitoring a temperature. There is already a rule set for this that you can use directly.
There is no matching rule set. In that case you will have to create a new one.
10.2. Using existing rule sets
The rule sets supplied for the parameters of checks can be found in the lib/check_mk/gui/plugins/wato/check_parameters/ directory. Let’s take the file memory_simple.py as an example. This defines a rule set with the following section:
rulespec_registry.register(
    CheckParameterRulespecWithItem(
        check_group_name="memory_simple",
        group=RulespecGroupCheckParametersOperatingSystem,
        item_spec=_item_spec_memory_simple,
        match_type="dict",
        parameter_valuespec=_parameter_valuespec_memory_simple,
        title=lambda: _("Main memory usage of simple devices"),
    ))
Here the important point for you is the keyword check_group_name, which is set here to 'memory_simple'. This establishes the connection to the check plug-in. You do this when registering the check with the keyword check_ruleset_name, for example:
register.check_plugin(
    name="foobar",
    service_name="Foobar Sector %s",
    discovery_function=discover_foobar,
    check_function=check_foobar,
    check_ruleset_name="memory_simple",
    check_default_parameters={},
)
The definition of default parameters via the keyword check_default_parameters is also mandatory. These parameters apply to your check if the user has not yet created a rule. If there are no mandatory parameters, you can simply take the empty dictionary {} as the value.
We will see below how the value configured by the user arrives at the check function.
10.3. Defining your own rule set
If there is no suitable rule set (which is probably the usual situation), we will have to create a new one ourselves. To do this, we create a file in the local/share/check_mk/web/plugins/wato directory. The name of the file should be based on that of the check and, like all plug-in files, it must have the extension '.py'.
Let’s look at the structure of such a file step by step. First come some import commands. If you want the texts in your file to be translatable into other languages, import the _ (underscore) function. This is an identifier for all translatable texts. For example, you can code calling the function _("Threshold for warn") instead of "Threshold for warn". The translation system in Checkmk, which is based on gettext, finds such texts and includes them in the list of texts to be translated. If you only build the check for yourself, you can do without it and do not need the following import command:
from cmk.gui.i18n import _
Next we import the so-called ValueSpecs. A ValueSpec is a very practical and universal tool that Checkmk uses in many places. It is used to generate customised input masks, to display and validate the entered values and to convert them into Python data structures. In the following example, Dictionary, Integer and TextInput will be imported.
from cmk.gui.valuespec import (
    Dictionary,
    Integer,
    TextInput,
)
You will definitely need the Dictionary. Since version 2.0.0 of Checkmk it is mandatory that check parameters are Python dictionaries. Previously, it could also be a pair (a tuple of two numbers), e.g. Warn/Crit. Integer is responsible for the input of a number without decimal places and TextInput for a Unicode text.
Next, symbols are imported that are needed for registration:
from cmk.gui.plugins.wato import (
    CheckParameterRulespecWithItem,
    rulespec_registry,
    RulespecGroupCheckParametersOperatingSystem,
)
If your check does not have an item, import CheckParameterRulespecWithoutItem instead. We will explain more about the RulespecGroup below.
Now come the actual definitions. First we define an input field with which the user can specify the item for the check. This is necessary for the rule condition and also for the manual creation of checks that are to function without discovery. We do this with TextInput. This is assigned a title via title, which is then usually displayed as a heading for the input field:
def _item_valuespec_foobar():
    return TextInput(title=_("Sector name"))
You can freely choose the name of the function that returns this ValueSpec as it is only required at the point further down. So that it does not become visible beyond the module boundary, it should begin with an underscore.
Next comes the ValueSpec for entering the actual check parameter. For this, as well, we create a function that generates it. The return Dictionary(…) is mandatory. Within it, you create the list of sub-parameters for this check with elements=[…]. In our example there is only one: the warning threshold for the free slots. This is required to be an integer, so we use Integer here.
def _parameter_valuespec_foobar():
    return Dictionary(
        elements=[
            ("warning_lower", Integer(title=_("Warning below free slots"))),
        ],
    )
Last but not least, we now register a new rule set using the imported and self-defined items. For this purpose there is the rulespec_registry.register() function:
rulespec_registry.register(
    CheckParameterRulespecWithItem(
        check_group_name="foobar",
        group=RulespecGroupCheckParametersOperatingSystem,
        match_type="dict",
        item_spec=_item_valuespec_foobar,
        parameter_valuespec=_parameter_valuespec_foobar,
        title=lambda: _("Free slots for Foobar sectors"),
    ))
A few more notes on this:
If your check does not use an item, the inner function is CheckParameterRulespecWithoutItem. The line item_spec is then omitted.
As mentioned above, the check_group_name provides the link to the checks that are to use this rule set. It must not be identical to the name of an already existing rule set, because this would overwrite the existing rule set.
The group determines in which category in the setup the rule set should appear. Most of these groups are defined in the file lib/check_mk/gui/plugins/wato/utils/__init__.py. There you will also find examples of how to create your own new group.
The match_type is always "dict". In older Checkmk versions there were also parameter rules with other types.
title defines the title of the rule set, but is not given directly as text, but as an executable function which returns the text (hence the lambda:).
Testing
When you have created this file, you should first check whether everything works so far before you continue with the check function. To do this, you must first restart the Apache of the site so that the new file is read. This is done with the command:
OMD[mysite]:~$ omd restart apache
After that, the rule set should be found in the setup. Create a rule in this rule set and try out different values. If this works without errors, you can now use the check parameters in the check function.
10.4. Applying the rule to the check plug-in
In order for the rule to take effect, we must allow the check plug-in to accept check parameters and tell it which rule to use. To do this, the check_default_parameters entry must be present in the registration. In the simplest case, we pass an empty dictionary.
Secondly, we pass the check_ruleset_name to the registration function, i.e. the name we gave to the rule set above using check_group_name. This way Checkmk knows from which rule set the parameters are to be determined.
The whole thing will then look like this, for example:
register.check_plugin(
    name="foobar",
    service_name="Foobar Sector %s",
    discovery_function=discover_foobar,
    check_function=check_foobar,
    check_default_parameters={},
    check_ruleset_name="foobar",
)
Now Checkmk will try to pass parameters to the check function. For this to work, we need to extend the check function so that it expects the params argument as the second argument. This is inserted between item and section (if you build a check without an item, the item is of course omitted and params will be at the beginning):
def check_foobar(item, params, section):
It is highly recommended to now have the contents of the variable params printed out with a print command as a first test (or pprint if you want it to be a bit more convenient). Create different rules, and see which values arrive at params:
def check_foobar(item, params, section):
    print(params)
    for sector, used, slots in ...
Very important: When everything is ready, be sure to remove the print commands again! These can otherwise disrupt the internal communication in Checkmk.
Now we adapt our check function so that the parameter passed can produce its desired effect. We get the value from the parameters using the key chosen in the rule set (here "warning_lower"):
def check_foobar(item, params, section):
    warn = params["warning_lower"]
    for sector, used, slots in section:
        if sector == item:
            used = int(used)  # convert string to int
            slots = int(slots)  # convert string to int
            if used == slots:
                s = State.CRIT
            elif slots - used <= warn:
                s = State.WARN
            else:
                s = State.OK
            yield Result(
                state=s,
                summary=f"used {used} out of {slots} slots")
            return
If a rule has been configured, we can now monitor the "free slots" in our example. However, if no rule has been defined, this check function will crash: Since the default parameters of the plug-in are not filled, in the absence of a rule the plug-in will generate a KeyError.
We can fix this problem by inserting a suitable parameter during registration:
register.check_plugin(
    name="foobar",
    service_name="Foobar Sector %s",
    discovery_function=discover_foobar,
    check_function=check_foobar,
    check_default_parameters={"warning_lower": 10},
    check_ruleset_name="foobar",
)
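The effect of missing and present default parameters can be reproduced outside of Checkmk with a small stand-alone sketch. The function slot_state() is a hypothetical simplification of the check logic for this illustration: it returns plain strings instead of State objects and takes the numbers directly instead of a section.

```python
# Stand-alone sketch: plain strings replace the Checkmk State objects.
def slot_state(params: dict, used: int, slots: int) -> str:
    warn = params["warning_lower"]  # raises KeyError if the key is missing
    if used == slots:
        return "CRIT"
    if slots - used <= warn:
        return "WARN"
    return "OK"


default_parameters = {"warning_lower": 10}

print(slot_state(default_parameters, used=95, slots=100))  # WARN
try:
    slot_state({}, used=95, slots=100)  # no rule and no default parameters
except KeyError as error:
    print(f"missing parameter: {error}")  # missing parameter: 'warning_lower'
```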
You should always pass default values in this way (and not use the check plug-in to catch missing parameters), as these default parameters can also be displayed in the setup interface. For this purpose, there is e.g. the entry Show check parameters in the menu Display on the service configuration page for a host.
By the way, having a single value as a threshold is very uncommon in Checkmk. Since services can be in the states OK, WARN, CRIT, it is only logical to always define the parameters as tuples with two entries, i.e. as a pair of thresholds for WARN and CRIT. To do this, we adapt the rule set as follows (the Tuple ValueSpec must then also be imported from cmk.gui.valuespec):
def _parameter_valuespec_foobar():
    return Dictionary(
        elements=[
            ("warning_lower", Tuple(
                title=_("Levels on free slots"),
                elements=[
                    Integer(title=_("Warning below")),
                    Integer(title=_("Critical below")),
                ],
            )),
        ],
    )
Note that such a change of data type is an incompatible change: Existing rules can now no longer be loaded from the interface. And also the check function may run into problems if instead of an expected pair of numbers there is a single number in params. You can simply edit such rules. When you save them again, the new format will be used.
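How the check logic might read such a pair is sketched below as a stand-alone function. The name free_slot_state(), the comparison with "less than or equal" and the example levels (10, 5) are assumptions made for this illustration; plain strings again stand in for the State objects.

```python
# Hypothetical sketch: evaluates the number of free slots against a (warn, crit) pair.
def free_slot_state(params: dict, used: int, slots: int) -> str:
    warn, crit = params["warning_lower"]  # unpack the pair from the rule
    free = slots - used
    if free <= crit:
        return "CRIT"
    if free <= warn:
        return "WARN"
    return "OK"


params = {"warning_lower": (10, 5)}
print(free_slot_state(params, used=97, slots=100))  # CRIT
print(free_slot_state(params, used=92, slots=100))  # WARN
print(free_slot_state(params, used=50, slots=100))  # OK
```

The CRIT threshold is checked first, so that a value that violates both thresholds is reported with the more severe state.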
10.5. Further ValueSpecs
In Checkmk there are numerous ValueSpecs for all kinds of situations. Here are a few more useful ones:
Float
Float is like Integer but allows the input of numbers with decimal places.
Percentage
Often one does not want to indicate thresholds in absolute numbers, but in percentages. For this purpose there is the Percentage ValueSpec:
def _parameter_valuespec_foobar():
    return Dictionary(
        elements=[
            ("levels_percent", Tuple(
                title=_("Relative levels"),
                elements=[
                    Percentage(title=_("Warning at"), default_value=80),
                    Percentage(title=_("Critical at"), default_value=90),
                ],
            )),
        ],
    )
With this ValueSpec, the check plug-in would receive the parameters {"levels_percent": (80.0, 90.0)}.
MonitoringState
The MonitoringState is useful if you want to allow the user to select one of the states OK, WARN, CRIT and UNKNOWN for each of various situations. It provides the user with a drop-down list of just these four options, which are then converted to one of the numbers 0, 1, 2 or 3.
Here you can set, for example, which status the service should have if no backup is configured or available:
def _parameter_valuespec_plesk_backups():
    return Dictionary(
        help=_("This check monitors backups configured for domains in plesk."),
        elements=[
            ("no_backup_configured_state",
             MonitoringState(title=_("State when no backup is configured"), default_value=1)),
            ("no_backup_found_state",
             MonitoringState(title=_("State when no backup can be found"), default_value=1)),
            ...
With this ValueSpec, the check plug-in would be passed the parameters {"no_backup_configured_state": 1, "no_backup_found_state": 1} if in both cases the default of WARN (=1) was taken over. You can easily convert the number into a State object by passing it to the State() function:
yield Result(
    state=State(params["no_backup_configured_state"]),
    summary="No backup is configured!",
)
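The conversion from number to state can be tried out without a Checkmk site. In the following sketch, the State class is only a stand-in defined for this illustration, using the four numbers mentioned above; in a real plug-in you use the State class of the Checkmk API instead.

```python
import enum


class State(enum.Enum):
    """Stand-in for the State class of the Checkmk API, for illustration only."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


params = {"no_backup_configured_state": 1, "no_backup_found_state": 1}

# Calling the class with a number returns the matching member.
state = State(params["no_backup_configured_state"])
print(state)  # State.WARN
```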
Age
The field Age allows the entry of an age, which is stored and transferred internally as a count of seconds:
def _parameter_valuespec_antivir_update_age():
    return Tuple(elements=[
        Age(title=_("Warning level for time since last update")),
        Age(title=_("Critical level for time since last update")),
    ],)
Filesize
The ValueSpec Filesize allows the input of file (or hard disk) sizes. Internally, the calculation is done with bytes, but the user may choose from KB, MB, GB or TB:
Tuple(
    title=_("Maximum size of all files on backup space"),
    help=_("The maximum size of all files on the backup space. "
           "This might be set to the allowed quotas on the configured "
           "FTP server to be notified if the space limit is reached."),
    elements=[
        Filesize(title=_("Warning at")),
        Filesize(title=_("Critical at")),
    ],
),
The topic of ValueSpecs is extremely flexible and extensive and would go beyond the scope of this article. Please have a look at the examples of the rule definitions supplied by Checkmk in lib/check_mk/gui/plugins/wato/check_parameters/. There are more than 500 files with examples.
11. Customised presentation of metrics
11.1. The significance of metric definitions
In our example above, we have let the foobar plug-in generate the metric fooslots. Metrics are immediately visible in the Checkmk graphical interface without you having to do anything. A graph for each metric will be automatically generated in the service details.
However, there are a few limitations:
A "Perf-O-Meter", i.e. the graphical bar-like preview of the measurement value, does not automatically appear when the service is displayed in the list view (e.g. in the view showing all services of a host).
Matching metrics are not automatically combined in a single graph, but each appears separately.
The metric does not have a proper title, but the internal variable name of the metric is shown.
No unit is used that allows a meaningful representation (e.g. GB instead of individual bytes).
A colour is randomly selected.
To have a clear representation of your metrics in these aspects, you will need some more definitions in another file.
11.2. Using existing metric definitions
Before you do this, you should, as with the rule set for the parameters, first check whether Checkmk already comes with a suitable metric definition. The predefined metric definitions can be found in the lib/check_mk/gui/plugins/metrics/ directory. For example, in the file cpu.py you will find a metric for the CPU utilization:
metric_info["util"] = {
    "title": _("CPU utilization"),
    "unit": "%",
    "color": "26/a",
}
If this is suitable for your plug-in, you only need to use the name "util" in your call to the Metric() class. Everything else will then be automatically derived from it.
11.3. Own metric definitions
If there is no suitable metric, simply create one yourself. In our example we want to define our own metric for our fooslots. To do this, we create a file in local/share/check_mk/web/plugins/metrics:
from cmk.gui.i18n import _
from cmk.gui.plugins.metrics import metric_info
metric_info["fooslots"] = {
    "title": _("Used slots"),
    "unit": "count",
    "color": "15/a",
}
Here are a few pointers:
The key (here 'fooslots') is the metric name and must match what the check function outputs.
Importing and using the underscore for internationalisation is optional, as already discussed in the rules.
See the file lib/check_mk/gui/plugins/metrics/unit.py for the unit definitions.
The colour definition uses a palette. For each palette colour there are /a and /b. These are two shades of the same colour. In the existing definitions you will also find many direct colour codes like '#ff8800'. These will gradually be phased out and all replaced by palette colours as these provide a more uniform look and are also easier to match to the interface themes.
This definition now ensures that the colour, title and unit of the metric are displayed according to our requirements.
11.4. Graphs with multiple metrics
If you want to combine several metrics in one graph, which is often very useful, you need a graph definition, which can simply go in the same file. This is done via the global dictionary graph_info.
For example, let’s assume our check has two metrics, fooslots and fooslots_free. The metric definitions would be, for example:
from cmk.gui.i18n import _
from cmk.gui.plugins.metrics import (
    metric_info,
    graph_info,
)
metric_info["fooslots"] = {
    "title": _("Used slots"),
    "unit": "count",
    "color": "16/a",
}
metric_info["fooslots_free"] = {
    "title": _("Free slots"),
    "unit": "count",
    "color": "24/a",
}
Now we add a graph that draws these two metrics as lines:
graph_info["fooslots_combined"] = {
    "metrics": [
        ("fooslots", "line"),
        ("fooslots_free", "line"),
    ],
}
Notes on this:
Unfortunately, there is no description of the possibilities for this definition in the manual yet. But you will find many examples in the files in the directory lib/check_mk/gui/plugins/metrics.
Try area or stack instead of line.
11.5. Displaying the metrics in the Perf-O-Meter
If you would like to display a Perf-O-Meter for our metric in the service line as well, you need another file, this time in the directory local/share/check_mk/web/plugins/perfometer.
Example:
from cmk.gui.plugins.metrics import perfometer_info
perfometer_info.append({
    "type": "logarithmic",
    "metric": "fooslots",
    "half_value": 5,
    "exponent": 2.0,
})
Perf-O-Meters are a bit trickier than graphs because they have no legend, which makes the range of values difficult to handle. Since the poor Perf-O-Meter cannot know which readings are even possible and the space is very limited, many built-in check plug-ins use a logarithmic representation. This is also the case in our example: half_value is the measured value that is displayed exactly in the middle of the Perf-O-Meter. With a value of 5, the bar would be half filled. And exponent describes the factor by which the value must grow so that another 10% of the range is filled. So in this example, a reading of 10 would be displayed at 60% and one of 20 at 70%.
The advantage of this method is that when you have a list of services of the same type, you can quickly compare all Perf-O-Meters visually because they all have the same scale. And despite the very small representations, you can easily see the differences in both very small and large readings. The values are NOT to scale, however.
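The logarithmic scale can be expressed as a short formula. The following function is only one formula that is consistent with the description above; it is not taken from the Checkmk source code, and the clamping to the range from 0 to 100 is an assumption of this sketch.

```python
import math


# Illustrative formula, derived from the description and not from Checkmk itself.
def logarithmic_position(value: float, half_value: float, exponent: float) -> float:
    """Return the fill level of the bar in percent."""
    if value <= 0:
        return 0.0
    # half_value lands at 50%; each multiplication by exponent adds 10%.
    position = 50.0 + 10.0 * math.log(value / half_value, exponent)
    return min(100.0, max(0.0, position))  # assumption: clamp to the bar


for reading in (5, 10, 20):
    print(reading, round(logarithmic_position(reading, half_value=5, exponent=2.0)))
```

With half_value set to 5 and exponent set to 2.0, the readings 5, 10 and 20 end up at 50%, 60% and 70% respectively, as described above.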
Alternatively, you can also use a linear Perf-O-Meter. This is always useful if there is a known maximum value. A typical situation would be measured values that represent percentages from 0 to 100. This would then look like this, for example:
perfometer_info.append({
    "type": "linear",
    "segments": ["fooslots_used_percent"],
    "total": 100.0,
})
There is another difference to the logarithmic representation: here segments is a list and allows multiple metrics to be displayed side by side.
As always, you can find examples in the many plug-ins supplied by Checkmk. These are also in the files in the directory lib/check_mk/gui/plugins/metrics.
12. Notes for users of the old API
Are you already experienced in developing check plug-ins with the previous API — the one up to version 1.6.0 of Checkmk? Then you will find some notes about important changes summarized here.
12.1. saveint() and savefloat()
The two functions saveint() and savefloat() have been dropped.
As a reminder, saveint(x) returns 0 if x cannot be reasonably converted to a number, e.g. because it is an empty string or does not consist only of digits.
While there have been a few good use cases for this, it has been used incorrectly in the majority of cases, which in the past has resulted in many errors being obscured.
In a situation in which you want to get a 0 on an empty string, which is the most common 'good' use case of saveint(x), you can simply code the following:
foo = int(x) if x else 0
For savefloat() everything applies analogously.
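The difference to the old behaviour becomes clear in a short sketch. The helper name int_or_zero() is made up for this illustration; it wraps exactly the expression shown above.

```python
# Hypothetical helper wrapping the replacement expression for saveint().
def int_or_zero(x: str) -> int:
    return int(x) if x else 0


print(int_or_zero(""))    # 0
print(int_or_zero("42"))  # 42
try:
    int_or_zero("n/a")    # unlike saveint(), this is no longer silently turned into 0
except ValueError:
    print("not a number")
```

Unexpected input now leads to a visible exception instead of a plausible-looking 0, so errors in the agent data are no longer obscured.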
13. Taming complex agent outputs using the parse function
The next step is the so-called parse function. This has the task of parsing the 'raw' agent data and putting it into a logically tidy form that is easy to process in all subsequent steps. The convention is that this is named after the agent section and begins with parse_. It gets string_table as its only argument. Please note that you are not free to choose the name of the argument here. It really must be called that.
For now, we write our parse function in such a way that we simply output the data it receives to the console. To do this, we simply use the print function (note: since Python 3, parentheses are mandatory here):
def parse_linux_usbstick(string_table):
    print(string_table)
In order for this whole process to have any effect, we have to make our parse function and the new agent section known to Checkmk. To do this, we call up a registration function:
register.agent_section(
    name="linux_usbstick",
    parse_function=parse_linux_usbstick,
)
Here it is important that the name of the section really does exactly match the section header in the agent output. Altogether, it now looks like this:
from .agent_based_api.v1 import *
def parse_linux_usbstick(string_table):
    print(string_table)
    return string_table
register.agent_section(
    name="linux_usbstick",
    parse_function=parse_linux_usbstick,
)
From this point on, every plug-in that uses the section linux_usbstick gets the return value from the parse function. As a rule, this will be the check plug-in of the same name.
In a way, we have now built the simplest possible plug-in. It has no real use yet, but we can at least test it. To do this, we trigger a service detection (option -I) on the command line for the host whose agent we prepared earlier. If its output really contains a section linux_usbstick, then we should see our debug output:
OMD[mysite]:~$ cmk -I myhost123
[['ata-APPLE_SSD_SM0512F_S1K5NYBF810191'], ['wwn-0x5002538655584d30']]
The output becomes somewhat clearer if we replace the simple print with a pretty-print from the module pprint. This is highly recommended for all further debugging output:
from .agent_based_api.v1 import *
import pprint
def parse_linux_usbstick(string_table):
    pprint.pprint(string_table)
    return string_table
register.agent_section(
    name="linux_usbstick",
    parse_function=parse_linux_usbstick,
)
It will look like this:
OMD[mysite]:~$ cmk -I myhost123
[['ata-APPLE_SSD_SM0512F_S1K5NYBF810191'],
['wwn-0x5002538655584d30']]
13.1. Composing the parse function
If you look closely, you will see that these are nested lists. In the argument string_table you get a list which contains a list of words per line of the agent output. The words within a line are split at sequences of spaces. Since our section contains only one word per line, the inner lists consist of only one entry each.
The following example makes the structure a little clearer:
from .agent_based_api.v1 import *
import pprint
def parse_linux_usbstick(string_table):
    print("Number of lines: %d" % len(string_table))
    print("Number of words in first line: %d" % len(string_table[0]))
    print("Length of first word: %d" % len(string_table[0][0]))
    return string_table
register.agent_section(
    name="linux_usbstick",
    parse_function=parse_linux_usbstick,
)
The output will look like this:
OMD[mysite]:~$ cmk -I myhost123
Number of lines: 2
Number of words in first line: 1
Length of first word: 36
For our example, we just need a simple list of device names, so we make our parse function unpack the single word from each line and package it into a nice new list:
def parse_linux_usbstick(string_table):
    parsed = []
    for line in string_table:
        parsed.append(line[0])
    pprint.pprint(parsed)
    return string_table
The debug output then looks like this (please look carefully, there are now only a single pair of square brackets):
['ata-APPLE_SSD_SM0512F_S1K5NYBF810191',
'wwn-0x5002538655584d30']
For the parse function to be complete, we now need to remove the debug message and, very importantly, deliver the new result with return:
def parse_linux_usbstick(string_table):
    parsed = []
    for line in string_table:
        parsed.append(line[0])
    return parsed
Of course, from this point on, all of the relevant plug-ins must be able to work with the new data format.
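The whole path from the raw section to the parsed list can be followed in a stand-alone sketch that runs without Checkmk. The variable raw_section contains sample text for this illustration, and splitting the lines with str.split() is a simplification of what Checkmk does when it builds string_table.

```python
# Sample agent output for this sketch (section content without the header line).
raw_section = """ata-APPLE_SSD_SM0512F_S1K5NYBF810191
wwn-0x5002538655584d30"""

# Simplified imitation of string_table: one list of words per line.
string_table = [line.split() for line in raw_section.splitlines()]


def parse_linux_usbstick(string_table):
    parsed = []
    for line in string_table:
        parsed.append(line[0])  # take the single word of each line
    return parsed


print(string_table)
print(parse_linux_usbstick(string_table))
```

The first print shows the nested lists, the second the flat list of device names that all subsequent functions receive as their section.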
14. The outlook for the future
There are many more aspects and topics around the development of your own plug-ins. Checkmk has many interfaces for custom extensions and is therefore very flexible. We are working on progressively describing these interfaces in the manual.
Should you have any questions or difficulties, our professional support and also the free forum are of course at your disposal.
15. Files and directories
File path | Function |
---|---|
local/lib/check_mk/base/plugins/agent_based | Location for self-written check plug-ins. |
local/share/check_mk/web/plugins/wato | Storage location for your check parameter rule sets. |
local/share/check_mk/web/plugins/metrics | Storage location for own metric definitions. |
local/share/check_mk/web/plugins/perfometer | Storage location for own Perf-O-Meter definitions. |
local/share/check_mk/mibs | Place SNMP MIB files here that are to be loaded automatically. |
lib/check_mk/gui/plugins/wato/check_parameters | Here you will find the rule set definitions for all check plug-ins included in Checkmk. |
lib/check_mk/gui/plugins/wato/utils/__init__.py | This file defines the groups of the setup interface in which you can store new rule sets. |
lib/check_mk/gui/plugins/metrics/ | Here you will find the metric definitions for the supplied plug-ins. |
lib/check_mk/gui/plugins/metrics/unit.py | The predefined units for the metrics are in this file. |
/usr/lib/check_mk_agent/plugins | This directory refers to a monitored Linux host. Here the Checkmk agent for Linux expects to find agent extensions (agent plug-ins). |