1. Introduction
An important advantage with Checkmk compared to other monitoring systems is the large number of well-maintained check plug-ins supplied as standard. In order for these plug-ins to have a uniformly high quality there are standardized criteria that each plug-in must meet.
An important note regarding the criteria: do not simply assume that all plug-ins supplied with Checkmk conform to all current standards. Avoid copy & paste. It is more advisable to orient your work according to the information in this article.
Should you be developing plug-ins solely for your own use, you are of course quite free and are not bound by our standards.
1.1. Quality
For check plug-ins that are official components in Checkmk, or are planned to be such, higher quality is demanded in comparison to those written ‘for your own use’. This expectation applies to their ‘outer’ quality (as seen by the user), as well as to their internal quality (the readability of code, etc.).
Please code the plug-in well, and to as high a standard as you are able to.
1.2. Scope of a check plug-in
Every check plug-in must as a minimum include the following components:
The check plug-in itself.
A man page.
Plug-ins with check parameters require a definition for the applicable rule set.
Metric definitions for graphs and the Perf-O-Meter if the check produces metric data.
A definition for the Agent Bakery if an agent plug-in is present.
A number of complete, and diverse examples of the agents outputs, or respectively SNMP walks.
2. Naming conventions
2.1. Check plug-in name
Choosing a name for a plug-in is especially critical since this name cannot be altered at a later date.
A plug-in’s name must be short, sufficiently explicit, and understandable. Example:
firewall_status
is only a good name if the plug-in functions for all, or at least many firewalls.A name is composed of lower case letters and numerals. An underscore is permitted as a separator.
The words
status
orstate
are unnecessary in a name, since of course every plug-in monitors a status. The same applies for the superfluous wordcurrent
. So, rather thanfoobar_current_temp_status
simply use justfoobar_temp
.Check plug-ins where the item represents a physical thing (e.g. fan, power supply), should have a name in the singular — for example,
casa_fan
,oracle_tablespace
. Check plug-ins in which each item refers to a number or multiples should be named using a plural — for example,user_logins
,printer_pages
.Product-specific check plug-ins should be prefixed with the product name — e.g.,
oracle_tablespace
.Manufacturer-specific check plug-ins that do not apply to a specific product should be prefixed with the manufacturer’s abbreviation — e.g.,
fsc_
for Fujitsu Siemens Computers.SNMP-based check plug-ins that use a common component of the MIB which may well be supported by more than one manufacturer should be named after the MIB, rather than after a manufacturer — e.g., the
hr_*
check plug-ins.
2.2. Service name
Use common and well-defined abbreviations (e.g. CPU, DB, VPN, IO,…).
Write abbreviations in upper case.
Use sentence case (e.g.
CPU utilization
, notCpu Utilization
) — proper names are an exception.Write product names as defined by the vendor (e.g.
vSphere
).Use American English (e.g.
utilization
, notutilisation
).Stick to existing naming schemes, if they exist. For example, all interface services use the template
Interface %s
.Try to be brief and keep descriptions short as service names are truncated in dashboards, views and reports if they are too long.
2.3. Metric names
Metrics for which a meaningful definition already exists should be reused.
Otherwise similar rules as used for the naming of check plug-ins apply (product-specific, manufacturer-specific, etc.)
2.4. Check group name for the rule set
The same convention applies as with metrics.
3. Constructing a check plug-in
3.1. General structure
The actual Python file under ~/share/check_mk/checks/
should have the following structure (complying with the coding sequence):
A file header with a GPL notice.
The name and email address of the original author if the plug-in has not been developed by the Checkmk project.
A short sample of the agent’s output.
Default values for the check parameters (
factory_settings
).Auxiliary functions, if available.
The parse function, if available.
The discovery function.
The check function.
The
check_info
-declaration.
3.2. Coding guidelines
Author
If the plug-in has not been developed by the Checkmk team, the author’s name and email address should be coded directly after the file header.
Readability
Avoid long lines of code — the maximum permitted length is 100 characters.
In each case the indentation is four blank characters — do not use tabs.
Orientate yourself to Python standard PEP 8.
Sample agent output
Including a sample of an agent’s output greatly simplifies the reading of the code. When doing so it is important to include various possible outputs in the sample. Make the sample no longer than necessary. With SNMP-based checks provide an SNMP walk:
Example excerpt from SNMP data:
.1.3.6.1.4.1.2.3.51.2.2.7.1.0 255
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.1.1 1
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.2.1 "Good"
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.3.1 "No critical or warning events"
.1.3.6.1.4.1.2.3.51.2.2.7.2.1.4.1 "No timestamp"
If, for example, differing output formats are produced due to differing firmware versions in the target devices, then an example noting the version should be provided for each.
A good example of this case can be found in the multipath
check plug-in.
SNMP MIBs
When defining the snmp_info
the readable path to the OID should be given in the comments.
Example:
'snmp_info' : (".1.3.6.1.2.1.47.1.1.1.1", [
OID_END,
"2", # ENTITY-MIB::entPhysicalDescription
"5", # ENTITY-MIB::entPhysicalClass
"7", # ENTITY-MIB::entPhysicalName
]),
Using lambda
Avoid complex expressions with lambda
.
Permitted is lambda
in the lambda oid: …
scan function, and when you wish to invoke existing functions with only an altered argument — for example:
"inventory_function" : lambda info: inventory_foobar_generic(info, "temperature")
Iterating through SNMP agent data
With checks that parse SNMP data, an index like this should not be used:
for line in info:
if line[1] != '' and line[0] ...
It is better to unpack each line as meaningful variables:
for *sensor_id, state_state, foo, bar* in info:
if sensor_state != '1' and sensor_id ...
Parse functions
Always use parse functions whenever parsing an agent’s output is not trivial.
The parse function’s argument should always be named info
, and in the discovery and check functions the argument should be named parsed
instead of info
.
In this way it will be clear to the reader that this result is from a parse function.
Checks with multiple partial results
A check that produces multiple partial results — for example, current allocations and growth — must return these with yield
.
Checks that produce only a single result must use return
.
if "abs_levels" in params:
warn, crit = params["abs_levels"]
if value >= crit:
yield 2, "...."
elif value >= warn:
yield 1, "...."
else:
yield 0, "..."
if "perc_levels" in params:
warn, crit = params["perc_levels"]
if percentage >= crit:
yield 2, "...."
elif percentage >= warn:
yield 1, "...."
else:
yield 0, "..."
The (!)
and (!!)
markers are obsolete and may no longer be used.
These should be replaced by yield
.
Keys in check_info[…]
Only store keys which will be used in your entry in check_info
.
The only required entries are "service_description"
and "check_function"
.
Only insert "has_perfdata"
and other keys with boolean values if their value is True
.
3.3. Agent plug-ins
If your check plug-in requires an agent plug-in, then be aware of the following rules:
Store the plug-in in
~/share/check_mk/agents/plugins
for Unix-like systems, and set the execution rights to755
.In Windows the directory is called
~/share/check_mk/agents/windows/plugins
.Shell and Python scripts should have no file name extension (omit
.sh
and.py
).Use
#!/bin/sh
in the first line of shell scripts. Only use#!/bin/bash
if Bash features are required.Use the standard Checkmk file header with the GPL notice.
Your plug-in must not damage the target system, especially if the plug-in is not actually supported by the system.
Do not forget the reference to the plug-in on the check plug-in’s man page.
If the component that the plug-in is to monitor doesn’t actually exist on a system, the plug-in must not output a section header.
If the plug-in requires a configuration file this should (in Linux) be searched for in the
$MK_CONFDIR
directory, and the file must have the same name as the plug-in — apart from the.cfg
extension, and without a possiblemk_
prefix. The procedure is similar for Windows — the directory in Windows is%MK_CONFDIR%
.Do not code plug-ins for Windows in PowerShell. This is not portable, and is in any case very resource-greedy. Use VBScript.
Do not code plug-ins in Java.
3.4. Don’ts
Do not use
import
in your check plug-in file. All permitted Python modules have already been imported.Do not use
datetime
for parsing and calculating time specifications — usetime
. This can perform all needed tasks. Really!Arguments that receive your functions must in no way modify the functions. This especially applies for
params
andinfo
.Should you really want to work with regular expressions (they are slow!), invoke these with the
regex()
function — do not usere
directly.Naturally it is not permitted to use
print
, or otherwise route outputs tostdout
, or communicate with the outside world in any way!The SNMP scan function is not allowed to retrieve OIDs other than
.1.3.6.1.2.1.1.1.0
and.1.3.6.1.2.1.1.2.0
. Exception: the SNMP scan function has previously ensured by checking one of these two OIDs that further OIDs are only fetched from a strictly-limited number of devices.
4. Check plug-in behavior
4.1. Exceptions
Your check plug-in should not, rather it must always assume that an agent’s output is syntactically valid. The plug-in is in no case permitted to attempt to handle unknown error situations in the output itself!
Why is this so?
Checkmk has a very refined function for automatically handling such errors.
For the user it can generate comprehensive crash reports, and it also sets the status of the plug-in to UNKNOWN.
This is much more helpful than if the check, for example, simply produces an unknown SNMP code 17
.
The discovery, parse and/or check function should generally enter an exception if the agent’s output is not in the defined, known format for which the plug-in was developed.
4.2. saveint() and savefloat()
The saveint()
and savefloat()
functions convert a string into int
or float
and produce a 0
if the string cannot be converted (e.g. it is an empty string).
Only use these functions if the empty or invalid value is a known condition — otherwise important error messages will be suppressed (see above).
4.3. Item not found
A check that doesn’t find an item being monitored should simply produce a None
, and not generate its own error message.
In such a case Checkmk will produce a standardized, consistent error message, and set the service to UNKNOWN.
4.4. Thresholds
Many check plug-ins have parameters which define thresholds for specific metrics, and thus determine when the check assumes a WARN or CRIT status. Be aware of the following rules that ensure Checkmk reacts consistently:
The thresholds for WARN and CRIT should always be verified with
>=
and<=
. Example: a plug-in monitors the length of a mail queue. The critical upper limit is 100. This means that if the actual value is ‘100’ it is already critical!If there are only upper, or only lower thresholds (the commonest cases), then the entry fields in the rule set should be coded with Warning at and Critical at.
If there are upper and lower thresholds, the coding should be as follows: Warning at or above, Critical at or above, Warning at or below and Critical at or below.
4.5. Check plug-in output
Every check produces one line of text — the plug-in output. To achieve a consistent behavior for all plug-ins, the following rules apply:
For showing measured values, exactly one blank character should separate the value and the unit (e.g.
17.4 V
). The only exception to this rule is with%
, where there is no blank:89.5%
.When listing measured values, the value’s name with an initial capital is followed by a colon. Example:
Voltage: 24.5 V, Phase: negative, Flux-Compensator: operational
Do not show internal keys, codewords, SNMP-internals or other rubbish in plug-in outputs which is of no use to the user. Use meaningful human-readable terms. Use terms that the user normally expects. Example: Use
route monitor has failed
rather thanrouteMonitorFail
.If the check item has an additional specification, code this in square brackets at the beginning of the output (e.g.
Interface 2 - [eth0] …
).In listings, items are separated by commas, and following items have initial capitals:
Swap used: …, Total virtual memory used: …
4.6. Default thresholds
Every plug-in that works with thresholds should have meaningful default threshold values defined for it. The following rules apply:
The default thresholds used in the check should also be defined 1:1 as default parameters in the applicable rule set.
The default thresholds should be defined in
factory_settings
(if the check has a dictionary as a parameter).The default thresholds should be selected on a technically-sound basis. Is there a manufacturer’s specification? Are there best practices?
It is essential that the source of the thresholds be documented in the check.
4.7. Nagios vs. CMC
Ensure that your check also functions with a Nagios monitoring core. That is usually the case automatically, but not always.
5. Metrics
5.1. Formats for metrics
The check plug-in always returns metric data as
int
orfloat
. Strings are not allowed.If you wish to output the six-tuple from a metric value field, use
None
in its position. Example:[("taple_util", utilization, None, None, 0, size)]
If you do not require the entry at the end, simply shorten the tuple. Do not use a
None
at the end.
5.2. Naming the metrics
Metric names consist of lower case letters and underscores. Numerals are permitted, but not leading.
Metric names should be, as with check plug-ins, short and specific. Metrics that will be used by multiple plug-ins should have generic names.
Avoid using the pointless filler word
current
. The measured value is always the current one.The metric should be named after the ‘thing’, not after the unit of measurement. Thus, for example,
current
rather thanampere
, orsize
rather thanbytes
.
Important: Always use the canonical size. Really! Checkmk scales the data itself as appropriate. Examples:
Measurement type | Canonical unit |
---|---|
Duration |
Seconds |
File size |
Bytes |
Temperature |
Celsius |
Network throughput |
Octets per second (not bits per second!) |
Percentage value |
A value from 0 to 100 (not 0.0 to 1.0) |
Events per time period |
1 per second |
Electrical performance |
Watt (not mW) |
5.3. Flag for metric data
Only set "has_perfdata"
in check_info
to True
if the check actually outputs metric data (or can output it).
5.4. Definition for graph and Perf-O-Meter
The definitions for graphs should be like the definitions in ~/web/plugins/metrics/check_mk.py
.
Do not create definitions for PNP graphs.
In Checkmk Raw as well these will be generated on the basis of the metric definitions in Checkmk itself.
6. Definition of the rule set
6.1. Check group name
Check plug-ins with parameters require a compulsory rule set definition.
The connection between a plug-in and a rule set is made through the check group (the entry "group"
in check_info
).
All checks that are configured with the same rule set are consolidated via the group.
If your plug-in should sensibly be configured with an existing rule set, then also use an existing group.
If your plug-in is so specific that it in any case requires its own group, then create an own group for it where the group’s name should reference the plug-in.
Should it be foreseeable that in the future further plug-ins could use the same rule set, then use an appropriately generic name.
6.2. Default values for ValueSpecs
When defining your parameter definitions (ValueSpecs) use the exact same default values as the defaults actually used in the checks (if possible).
Example: if without a rule the check assumes the threshold (5, 10)
for WARN and CRIT, then the ValueSpec should be so defined that 5
and 10
will be automatically offered as thresholds.
6.3. Choosing ValueSpecs
For some types of data there are specialized ValueSpecs.
An example is Age
for a certain number of seconds.
This must be used wherever it is appropriate.
Do not, for example, use Integer
in such a case.
7. Include files
For a number of types of checks there are already-prepared implementations in include files, that not only can be used, but should be used. Important include files are:
|
Monitoring of temperatures |
|
Electrical AC phases (e.g. in USV) |
|
Fans |
|
Network interfaces |
|
File system levels |
|
Monitoring of RAM (Main storage) |
|
Operating system processes |
Important: use existing include files only if these have been designed for the purpose at hand, and not simply because they are an approximate fit!
8. Man pages
Each check plug-in must have a man page. If you have programmed several plug-ins in one check plug-in file, each of these must of course have its own man page.
The man page is intended for the user! Write information that will help them. Here it is not about documenting what you have programmed, but about giving the user the useful information that they need.
A man page must be
complete,
precise,
short,
helpful.
A man page consists of several sections — some of which are optional:
8.1. Title
With the title:
macro you determine the heading.
This consists of:
the exact device name or device group for which the check is written,
information on what the check monitors (e.g. system health).
These two parts are separated by a colon — only in this way can existing checks be easily searched for and, above all, found.
8.2. Agent categories
The agents:
macro can have different categories.
There are basically three categories:
Agents: In this case the operating systems for which the check was built and is available for are specified, for example
linux
, orlinux, windows, solaris
.SNMP: In this case there is only the entry
snmp
.Active checks: If an active check has been integrated into the Checkmk interface, use the category
active
.
8.3. Catalog entry
Use the catalog:
header to specify where the man page is to be stored in the check plug-in catalog.
If a category is missing — for example, for a new manufacturer — the category must be defined in the catalog_titles
variable in the cmk/utils/man_pages.py
file.
Currently this file cannot be extended in local/
by plug-ins, so only the developers of Checkmk can make changes here.
Note the exact capitalization of product and company names!
This applies not only to the catalog entry, but also to all other texts where these occur.
Example: NetApp
is always written NetApp
, and not netapp
, NETAPP
, Netapp
, or similar.
Google can help to find the correct spelling.
8.4. Plug-in description
The following information must be included in the description:
in the man page:
Exactly what hardware or software does the check monitor? Are there special features of certain firmware or product versions of the devices? Do not refer to a MIB, but to product designations. Example: It is not helpful if you write ‘This check works for all devices that support the Foobar-17.11-MIB’. Write precisely which product lines or similar are supported.
Which aspect of this is monitored? What does the check do?
Under what conditions is the check OK, WARN or CRIT?
Is an agent plug-in required for the check? If yes — how is it installed? This must work without the Agent Bakery.
Are there any other requirements for the check to work (preparation of the target system, installation of drivers, etc.). These should only be listed if they are not normally fulfilled anyway (e.g. mounting of
/proc
under Linux).
Do not write anything that affects all checks together. For example, do not repeat general things like how to set up SNMP-based checks.
8.5. Item
For checks that have an item (i.e., a %s
in the service name), the man page under item:
must describe how it is formed.
If the check plug-in does not use an item, you can omit this line completely.
8.6. Service discovery
Under inventory:
, write under which conditions this check’s service(s) will be found automatically, i.e. how the service discovery behaves.
An example from nfsmounts
:
inventory:
All NFS mounts are found automatically. This is done
by scanning {/proc/mounts}. The file {/etc/fstab} is irrelevant.
Make sure that the text is understandable without deeper knowledge of an MIB or the code — so do not write:
One service is created for each temperature sensor if the state is 1.
Instead, it is better to translate as much as possible:
One service is created for each temperature sensor, if the state is "active".