1. Introduction
An important advantage with Checkmk compared to other monitoring systems is the large number of well-maintained check plug-ins supplied as standard. In order for these plug-ins to have a uniformly high quality there are standardised criteria that each plug-in must meet. In this article we will introduce these criteria.
An important note regarding the criteria: please don’t simply assume that all plug-ins supplied with Checkmk conform to all current standards. Please avoid copy & paste. It is more advisable to orient your work according to the information in this article.
Should you be developing plug-ins solely for your own use, you are of course quite free and are not bound by our standards.
2. Preface
2.1. Quality
For check plug-ins that are official components in Checkmk, or are planned to be such, higher quality is demanded in comparison to those written ‘for your own use’. This expectation applies to their ‘outer’ quality (as seen by the user), as well as to their internal quality (the readability of code, etc.).
Please code the plug-in well, and to as high a standard as you are able to.
2.2. The basics of a check plug-in
Every check plug-in must as a minimum include the following components:
The check plug-in itself
A Manpage
Plug-ins with check parameters require a WATO-Definition for the applicable Rule Set
Metrics definitions for graphs and the Perf-O-Meter if the check produces metrics data
A definition for the Agent Bakery if an agent plug-in is present
A number of complete, and diverse examples of the agents outputs, or respectively SNMP-Walks
3. Naming conventions
3.1. Plug-in names
Choosing a name for a plug-in is especially critical since this name cannot be altered at a later date.
A plug-in’s name must be short, sufficiently explicit, and understandable. Example:
firewall_status
is only a good name if the plug-in functions for all, or at least many firewalls.A name is composed of lower case letters and numerals. An underscore is permitted as a separator.
The words
status
orstate
are unnecessary in a name, since of course every plug-in monitors a status. The same applies for the superfluous wordcurrent
. So, rather thanfoobar_current_temp_status
simply use justfoobar_temp
.Checks that apply to a physical thing (e.g. fan, power supply), should have a name in the singular – for example,
casa_fan
,oracle_tablespace
. Checks in which each item refers to a number or multiples should be named using a plural – for example,user_logins
,printer_pages
.Product-specific checks should be prefixed with the product name — e.g.,
oracle_tablespace
.Manufacturer-specific check plug-ins that do NOT apply to a SPECIFIC product should be prefixed with the manufacturer’s abreviation — e.g.,
fsc_
for Fujitsu Siemens Computers.SNMP-based check plug-ins that use a common component of the MIB which may well be supported by more than one manufacturer should be named after the MIB, rather than after a manufacturer — e.g., the
hr_*
-check plug-ins.
3.2. Service names
Use common and well-defined abbreviations (e.g. CPU, DB, VPN, IO,…).
Write abbreviations in upper case.
Use sentence case (e.g. CPU utilization, not Cpu Utilization) — proper names are an exception.
Write product names as defined by the vendor (e.g. vSphere).
Use American English (e.g. utilization, not utilisation).
Stick to existing naming schemes, if they exist. For example, all interface services use the template "Interface %s".
Try to be brief as those names show up in dashboards, views and reports and might be cut, if too long.
3.3. Names for metrics
Metrics for which a meaningful definition already exists should always be used.
Otherwise similar rules as used for the naming of check plug-ins apply (product-specific, manufacturer-specific, etc.)
3.4. WATO-rule group names
The same convention applies as with metrics.
4. Constructing check plug-ins
4.1. General structure
The actual Python data under share/check_mk/checks/
should have the
following structure (complying with the coding sequence):
A file header with a GPL-Notice
The name and email address of the original author if the plug-in has not been developed by the Checkmk-Projekt.
A short sample of the agent’s output
Default values for the check parameter (
factory_settings
)Help functions, if available
The Parse function, if available
The Discovery function
The Check function
The
check_info
-declaration
4.2. Coding guidelines
Author
If the plug-in has not been developed by the Checkmk-Team, the author’s name and email address should be coded directly after the GPL-header.
Readability
Avoid long lines of code — the maximum permitted length is 100 characters.
In each case the indentation is four blank characters — do not use tabs.
Orientate yourself to Python-Standard PEP 8
Sample agent outputs
Including a sample of an agent’s output greatly simplifies the reading of the code. When doing so it is important to include various possible outputs in the sample. Make the sample no longer than necessary. With SNMP-based checks provide an SNMP-Walk:
Example excerpt from SNMP data:
.1.3.6.1.4.1.2.3.51.2.2.7.1.0 255
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.1.1 1
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.2.1 "Good"
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.3.1 "No critical or warning events"
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.4.1 "No timestamp"
If, for example, differing output formats are produced due to differing firmware
standards in the target devices, then an example noting the version should be provided for each.
A good example of this case can be found in the multipath
check plug-in.
SNMP-MIBs
When defining the snmp_info
the readable path to the OID should be given in the
comments. Example:
'snmp_info' : [(".1.3.6.1.2.1.47.1.1.1.1", [
OID_END,
"2", # ENTITY-MIB::entPhysicalDescription
"5", # ENTITY-MIB::entPhysicalClass
"7", # ENTITY-MIB::entPhysicalName
]),
Using lambda
Avoid complex expressions with lambda
. Permitted is lambda
in
the lambda oid: …
scan function, and when you wish to invoke existing functions
with only an altered argument — for example:
"inventory_function" : lambda info: inventory_foobar_generic(info, "temperature")
Iterating through SNMP-agent data
With checks that parse SNMP-data, an index like this should not be used…
for line in info:
if line[1] != '' and line[0] ...
It is better to unpack each line as meaningful variables:
for *sensor_id, state_state, foo, bar* in info:
if sensor_state != '1' and sensor_id ...
Parse functions
Always use parse functions whenever parsing an agent’s output is not trivial.
The parse function’s argument should always be named info
,
and in the discovery and check functions the argument should be named parsed
instead of info
.
In this way it will be clear to the reader that this result is from a parse function.
Checks with multiple partial results
A check that produces multiple partial results — for example, current allocations
and growth — must return these with yield
. Checks that produce only a
single result must use return
.
if "abs_levels" in params:
warn, crit = params["abs_levels"]
if value >= crit:
yield 2, "...."
elif value >= warn:
yield 1, "...."
else:
yield 0, "..."
if "perc_levels" in params:
warn, crit = params["perc_levels"]
if percentage >= crit:
yield 2, "...."
elif percentage >= warn:
yield 1, "...."
else:
yield 0, "..."
The (!)
and (!!)
markers are obsolete and may
no longer be used. These should be replaced by yield
.
Keys in check_info[…]
Only store keys which will be used In your entry in check_info
.
The only required entries are ‘service_description’
and ‘check_function’
.
Only insert ‘has_perfdata’
and other keys with boolean values if
their value is True
.
4.3. Agent plug-ins
If your check plug-in requires an agent plug-in, then be aware of the following rules:
Store the plug-in in
share/check_mk/agents/plugins
for Unix-type systems, and set the execution rights to755
.In Windows the directory is called
share/check_mk/agents/windows/plugins
.Shell and Python scripts should have no file name extension (omit
.sh
and.py
).Use
!/bin/sh
in the first lines of shell scripts. Only use!/bin/bash
if BASH features are required.Use the standard Checkmk-file heading with the GPL-notice.
Your plug-in must not damage the target system, especially if the plug-in is not actually supported by the system.
Do not forget the reference to the plug-in on the check’s manpage.
If the component that the plug-in is to monitor doesn’t actually exist on a system, the plug-in must not output a section head.
If the plug-in requires a configurations file this should (in Linux) be searched for in the
$MK_CONFDIR
directory, and the file must have the same name as the plug-in — apart from the.cfg
extension, and without a possiblemk_
prefix. The procedure is similar for Windows — the directory in Windows is%MK_CONFDIR%
.Do not code plug-ins for Windows in Powershell. This is not portable, and is in any case very resource-greedy. Use VBS.
Do not code Plug-ins in Java.
4.4. Don’ts
Do not use
import
in your check files. All permitted Python modules have already been imported.Do not use
datetime
for parsing and calculating time specifications – usetime
. This can perform all needed tasks. Really!Arguments that receive your functions must in no way modify the functions. This especially applies for
params
andinfo
.Should you really want to work with regular expressions (they are slow!), invoke these with the
regex()
function — do not usere
directly.Naturally it is not permitted to use
print
, or otherwise route outputs tostdout
, or communicate with the outside world in any way!The SNMP-scan function is not allowed to retrieve OIDs other than
.1.3.6.1.2.1.1.1.0
and.1.3.6.1.2.1.1.2.0
. Exception: the SNMP-scan function has ensured via a Check of one of these OIDs that further OIDs will retrieve only a strictly-limited number of devices.
5. Behaviour of check plug-ins
5.1. Exceptions
Your check plug-in should not, rather it must always assume that an agent’s output is syntactically valid. The plug-in is in no case permitted to attempt to handle unknown error situations in the output itself!
Why is this so? Checkmk has a very refined function for automatically handling such errors.
For the user it can generate comprehensive crash reports, and it also sets the status of the plug-in to UNKNOWN. This is much more helpful than if the check, for example, simply produces an unknown SNMP code 17
.
The discovery, parse and/or check function should generally enter an exception if the agent’s output is not in the defined, known format for which the plug-in was developed.
5.2. saveint() and savefloat()
The saveint()
and savefloat()
functions convert a string into
an int
or float
and produce a 0
if the string cannot
be converted (e.g. it is an empty string).
Only use these functions if the empty or invalid value is a known condition — otherwise important error messages will be supressed (see above).
5.3. Item not found
A check that doesn’t find an item being monitored should simply produce a None
,
and not generate its own error message. In such a case Checkmk will produce
a standardised, consistent error message, and set the service to UNKNOWN.
5.4. Threshholds
Many check plug-ins have parameters which define thresholds for specific metrics, and thus determine when the check assumes a WARN or CRIT status. Please be aware of the following rules that ensure Checkmk reacts consistently.
The thresholds for WARN and CRIT should always be verified with
>=
and<=
. Example: a plug-in monitors the length of a mail queue. The critical upper limit is 100. This means that if the actual value is ‘100’ it is already critical!If there are ONLY upper, or ONLY lower thresholds (the commonest cases), then the entry fields in WATO should be coded with Warning at and Critical at .
If there are upper AND lower thresholds, the coding should be as follows: Warning at or above _, Critical at or above , _Warning at or below and _Critical at or below _.
5.5. Check plug-in outputs
Every check produces one line of text — the plug-in output. To achieve a consistent behavier for all plug-ins, the following rules apply:
For showing measured values, exactly one blank character should separate the value and the unit (e.g.
17.4 V
). The only exception to this rule is with%
, where there is no blank:89.5%
.When listing measured values, the value’s name with an initial capital is followed by a colon. Example:
Voltage: 24.5 V, Phase: negative, Flux-Compensator: operational
Do not show internal keys, codewords, SNMP-internals or other rubbish in plug-in outputs which is of no use to the user. Use meaningful human-readable terms. Use terms that the user normally expects! Example: Use
route monitor has failed
rather thanrouteMonitorFail
.If the check item has an additional specification, code this in square brackets at the beginning of the output (e.g.
Interface 2 - [eth0] …
)In listings, items are separated by commas, and following items have initial capitals:
Swap used: …, Total virtual memory used: …
5.6. Default thresholds
Every plug-in that works with thresholds should have meaningful default threshold values defined for it. The following rules apply:
The default thresholds used in the check should also be defined 1:1 as default parameters in the applicable WATO-rule.
The default thresholds should be defined in
factory_settings
(if the check has a dictionary as a parameter).The default thresholds should be selected on a technically-sound basis. Is there a manufacturer’s specification? Are there best Practices?
It is essential that the source of the thresholds be documented in the check.
5.7. Nagios vs. CMC
Ensure that your check also functions with a Nagios core. That is usually the case automatically, but not always.
6. Metrics
6.1. Formats for metrics
The check plug-in always returns metric data as
int
orfloat
. Strings are not allowed.If you wish to output the sixtuple from a metric value field, use
None
in its position. Example:[("taple_util", utilization, None, None, 0, size)]
If you don’t require the entry at the end, simply shorten the tuple. Do not use a
None
at the end.
6.2. Naming the metrics
Metric names consist of lower case letters and underscores. Numerals are permitted, but not leading.
Metric names should be, as with check plug-ins, short and specific. Metrics that will be used by multiple plug-ins should have generic names.
Avoid using the pointless filler word
current
. The measured value is always the current one.The metric should be named after the ‘thing’, not after the unit of measurement. Thus, for example,
current
rather thanampere
, orsize
rather thanbytes
.Important always use the canonical size. Really! Checkmk scales the data itself as appropriate. Example:
Measurement type | Canonical unit |
---|---|
Duration |
Seconds |
File size |
Bytes |
Temperature |
Celsius |
Network throughput |
Octets per second (not bits/sec!) |
Percentage value |
A value from 0 to 100 (not 0.0 to 1.0) |
Events per time period |
1 per second |
Electrical performance |
Watts (not mW) |
6.3. Flags for metric data
Only set
‘has_perfdata’
incheck_info
toTrue
if the check actually outputs metric data (or can output it).
6.4. Definitions for graphs and the Perf-O-Meter
The definitions for graphs should be like the definitions in
web/plugins/metrics/check_mk.py
. Do not create definitions for PNP-graphs.
In the Raw Edition as well these will be generated on the basis of the metric
definitions in Checkmk itself.
7. WATO-Definition
7.1. Check group names
Check plug-ins with parameters require a compulsory WATO-rule definition.
The connection between a plug-in and a rule is made through the check group
(the entry ‘group’
in check_info
). All checks
that are configured with the same rule set are consolidated via the group.
If your plug-in should sensibly be configured with an existing rule set, then also use an existing group.
If your plug-in is so specific that it in any case requires its own group, then create an own group for it where the group’s name should reference the plug-in.
Should it be foreseeable that in the future further plug-ins could use the same rule set, then use an appropriately generic name.
7.2. Default values for ValueSpecs
When defining your parameter definitions (ValueSpecs) use the exact same
default values as the defaults actually used in the checks (if possible).
Example: if without a rule the check assumes the threshold (5, 10)
for WARN
and CRIT, then the ValueSpec should be so defined that 5
and 10
will be automatically offered as thresholds.
7.3. Choosing ValueSpecs
For some types of data there are specialised ValueSpecs. An example is
Age
for a certain number of seconds. This must be used wherever it is
appropriate. Do not, for example, use Integer
in such a case.
8. Include-files
For a number of types of checks there are already-prepared implementations in include-files, that not only can be used, but SHOULD be used. Important include-files are:
temperature.include |
Monitoring of temperatures |
elphase.include |
Electrical AC phases (e.g. in USV) |
fan.include |
Fans |
if.include |
Network interfaces |
df.include |
File system levels |
mem.include |
Monitoring of RAM (Main storage) |
ps.include |
Operating system processes |
Important: use existing Include files only if these have been designed for the purpose at hand, and not simply because they are an approximate fit!
9. Manpages
Each check plug-in must have a Manpage. If you have programmed several plug-ins in one check file (subchecks), each of these must of course have its own Manpage.
The Manpage is intended for the user! Write information that will help them. Here it is not about documenting what you have programmed, but about giving the user the useful information that they need.
A Manpage must be:
complete
precise
short
helpful!
A Manpage consists of several sections — some of which are optional:
9.1. Title
With the title:
macro you determine the heading. This consists of:
The exact device name or device group for which the check is written
What the check monitors, e.g., System Health
These two parts are separated by a colon — only in this way can existing checks be easily searched for and, above all, found.
9.2. Agent categories
The agents:
macro can have different categories. There are basically
three categories:
Agents: In this case the operating systems for which the check was built and is available for are specified. For example
linux
, orlinux, windows, solaris
SNMP: In this case there is only the entry
snmp
Active checks: If an active check has been integrated into the Checkmk interface, use the category
active
9.3. Catalogue entries
With the header catalog:
you define where in the Checkmanpages catalogue
the plug-in is stored. If a category is missing — for example, for a new
manufacturer — the category must be defined in the catalog_titles
variable in the cmk/man_pages.py
file — or from Version 1.6.0
in the cmk/utils/man_pages.py
file.
Currently this file cannot be extended in local/
by plug-ins,
so only the developers of Checkmk can make changes here.
Please note the exact capitalization of product and company names! This applies not only to the catalogue entry, but also to all other texts where these occur. Example: NetApp is always written NetApp, and not netapp, NETAPP, Netapp, or similar. Google can help to find the correct spelling!
9.4. Description of the plug-in
The following information must be included in the description:
in the Manpage:
Exactly what hardware or software does the check monitor? Are there special features of certain firmware or product versions of the devices? Do not refer to a MIB, but to product designations. Example: It is not helpful if you write “This check works for all devices that support the Wrdpfrmpft-17.11-MIB”. Write precisely which product lines or similar are supported.
Which aspect of this is monitored? What does the check do?
Under what conditions is the check OK, WARN or CRIT?
Is an agent plug-in required for the check? If yes — how is it installed? This must work without the Agent Bakery.
Are there any other requirements for the check to work (preparation of the target system, installation of drivers, etc.). These should only be listed if they are not normally fulfilled anyway (e.g. mounting of
/proc
under Linux).
Do not write anything that affects ALL checks together. For example, do not repeat general things like how to set up SNMP-based checks.
9.5. Item
For checks that have an item (i.e., a %s
in the service name),
the Manpage under item:
must describe how it is formed.
If the check plug-in does not use an item, you can omit this line completely.
9.6. Service Discovery
Under inventory:
, write under which conditions this check’s service(s)
will be found automatically, i.e. how the ‘Service Discovery’ behaves.
An example from nfsmounts
:
inventory:
All NFS mounts are found automatically. This is done
by scanning {/proc/mounts}. The file {/etc/fstab} is irrelevant.
Make sure that the text is understandable without deeper knowledge of an MIB or the code — so do not write:
One service is created for each temperature sensor if the state is 1.