Checkmk
to checkmk.com

1. Introduction

An important advantage with Checkmk compared to other monitoring systems is the large number of well-maintained check plug-ins supplied as standard. In order for these plug-ins to have a uniformly high quality there are standardized criteria that each plug-in must meet.

An important note regarding the criteria: do not simply assume that all plug-ins supplied with Checkmk conform to all current standards. Avoid copy & paste. It is more advisable to orient your work according to the information in this article.

Should you be developing plug-ins solely for your own use, you are of course quite free and are not bound by our standards.

1.1. Quality

For check plug-ins that are official components in Checkmk, or are planned to be such, higher quality is demanded in comparison to those written ‘for your own use’. This expectation applies to their ‘outer’ quality (as seen by the user), as well as to their internal quality (the readability of code, etc.).

Please code the plug-in well, and to as high a standard as you are able to.

1.2. Scope of a check plug-in

Every check plug-in must as a minimum include the following components:

  • The check plug-in itself.

  • A man page.

  • Plug-ins with check parameters require a definition for the applicable rule set.

  • Metric definitions for graphs and the Perf-O-Meter if the check produces metric data.

  • A definition for the Agent Bakery if an agent plug-in is present.

  • A number of complete, and diverse examples of the agents outputs, or respectively SNMP walks.

2. Naming conventions

2.1. Check plug-in name

Choosing a name for a plug-in is especially critical since this name cannot be altered at a later date.

  • A plug-in’s name must be short, sufficiently explicit, and understandable. Example: firewall_status is only a good name if the plug-in functions for all, or at least many firewalls.

  • A name is composed of lower case letters and numerals. An underscore is permitted as a separator.

  • The words status or state are unnecessary in a name, since of course every plug-in monitors a status. The same applies for the superfluous word current. So, rather than foobar_current_temp_status simply use just foobar_temp.

  • Check plug-ins where the item represents a physical thing (e.g. fan, power supply), should have a name in the singular — for example, casa_fan, oracle_tablespace. Check plug-ins in which each item refers to a number or multiples should be named using a plural — for example, user_logins, printer_pages.

  • Product-specific check plug-ins should be prefixed with the product name — e.g., oracle_tablespace.

  • Manufacturer-specific check plug-ins that do not apply to a specific product should be prefixed with the manufacturer’s abbreviation — e.g., fsc_ for Fujitsu Siemens Computers.

  • SNMP-based check plug-ins that use a common component of the MIB which may well be supported by more than one manufacturer should be named after the MIB, rather than after a manufacturer — e.g., the hr_* check plug-ins.

2.2. Service name

  • Use common and well-defined abbreviations (e.g. CPU, DB, VPN, IO,…).

  • Write abbreviations in upper case.

  • Use sentence case (e.g. CPU utilization, not Cpu Utilization) — proper names are an exception.

  • Write product names as defined by the vendor (e.g. vSphere).

  • Use American English (e.g. utilization, not utilisation).

  • Stick to existing naming schemes, if they exist. For example, all interface services use the template Interface %s.

  • Try to be brief and keep descriptions short as service names are truncated in dashboards, views and reports if they are too long.

2.3. Metric names

  • Metrics for which a meaningful definition already exists should be reused.

  • Otherwise similar rules as used for the naming of check plug-ins apply (product-specific, manufacturer-specific, etc.)

2.4. Check group name for the rule set

The same convention applies as with metrics.

3. Constructing a check plug-in

3.1. General structure

The actual Python file under ~/share/check_mk/checks/ should have the following structure (complying with the coding sequence):

  1. A file header with a GPL notice.

  2. The name and email address of the original author if the plug-in has not been developed by the Checkmk project.

  3. A short sample of the agent’s output.

  4. Default values for the check parameters (factory_settings).

  5. Auxiliary functions, if available.

  6. The parse function, if available.

  7. The discovery function.

  8. The check function.

  9. The check_info-declaration.

3.2. Coding guidelines

Author

If the plug-in has not been developed by the Checkmk team, the author’s name and email address should be coded directly after the file header.

Readability

  • Avoid long lines of code — the maximum permitted length is 100 characters.

  • In each case the indentation is four blank characters — do not use tabs.

  • Orientate yourself to Python standard PEP 8.

Sample agent output

Including a sample of an agent’s output greatly simplifies the reading of the code. When doing so it is important to include various possible outputs in the sample. Make the sample no longer than necessary. With SNMP-based checks provide an SNMP walk:

 Example excerpt from SNMP data:
 .1.3.6.1.4.1.2.3.51.2.2.7.1.0  255
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.1.1  1
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.2.1  "Good"
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.3.1  "No critical or warning events"
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.4.1  "No timestamp"

If, for example, differing output formats are produced due to differing firmware versions in the target devices, then an example noting the version should be provided for each. A good example of this case can be found in the multipath check plug-in.

SNMP MIBs

When defining the snmp_info the readable path to the OID should be given in the comments. Example:

    'snmp_info' : (".1.3.6.1.2.1.47.1.1.1.1", [
        OID_END,
        "2",    # ENTITY-MIB::entPhysicalDescription
        "5",    # ENTITY-MIB::entPhysicalClass
        "7",    # ENTITY-MIB::entPhysicalName
    ]),

Using lambda

Avoid complex expressions with lambda. Permitted is lambda in the lambda oid: …​ scan function, and when you wish to invoke existing functions with only an altered argument — for example:

     "inventory_function" : lambda info: inventory_foobar_generic(info, "temperature")

Iterating through SNMP agent data

With checks that parse SNMP data, an index like this should not be used:

    for line in info:
        if line[1] != '' and line[0] ...

It is better to unpack each line as meaningful variables:

    for *sensor_id, state_state, foo, bar* in info:
        if sensor_state != '1' and sensor_id ...

Parse functions

Always use parse functions whenever parsing an agent’s output is not trivial. The parse function’s argument should always be named info, and in the discovery and check functions the argument should be named parsed instead of info. In this way it will be clear to the reader that this result is from a parse function.

Checks with multiple partial results

A check that produces multiple partial results — for example, current allocations and growth — must return these with yield. Checks that produce only a single result must use return.

    if "abs_levels" in params:
        warn, crit = params["abs_levels"]
        if value >= crit:
            yield 2, "...."
        elif value >= warn:
            yield 1, "...."
        else:
            yield 0, "..."

    if "perc_levels" in params:
        warn, crit = params["perc_levels"]
        if percentage >= crit:
            yield 2, "...."
        elif percentage >= warn:
            yield 1, "...."
        else:
            yield 0, "..."

The (!) and (!!) markers are obsolete and may no longer be used. These should be replaced by yield.

Keys in check_info[…​]

Only store keys which will be used in your entry in check_info. The only required entries are "service_description" and "check_function". Only insert "has_perfdata" and other keys with boolean values if their value is True.

3.3. Agent plug-ins

If your check plug-in requires an agent plug-in, then be aware of the following rules:

  • Store the plug-in in ~/share/check_mk/agents/plugins for Unix-like systems, and set the execution rights to 755.

  • In Windows the directory is called ~/share/check_mk/agents/windows/plugins.

  • Shell and Python scripts should have no file name extension (omit .sh and .py).

  • Use #!/bin/sh in the first line of shell scripts. Only use #!/bin/bash if Bash features are required.

  • Use the standard Checkmk file header with the GPL notice.

  • Your plug-in must not damage the target system, especially if the plug-in is not actually supported by the system.

  • Do not forget the reference to the plug-in on the check plug-in’s man page.

  • If the component that the plug-in is to monitor doesn’t actually exist on a system, the plug-in must not output a section header.

  • If the plug-in requires a configuration file this should (in Linux) be searched for in the $MK_CONFDIR directory, and the file must have the same name as the plug-in — apart from the .cfg extension, and without a possible mk_ prefix. The procedure is similar for Windows — the directory in Windows is %MK_CONFDIR%.

  • Do not code plug-ins for Windows in PowerShell. This is not portable, and is in any case very resource-greedy. Use VBScript.

  • Do not code plug-ins in Java.

3.4. Don’ts

  • Do not use import in your check plug-in file. All permitted Python modules have already been imported.

  • Do not use datetime for parsing and calculating time specifications — use time. This can perform all needed tasks. Really!

  • Arguments that receive your functions must in no way modify the functions. This especially applies for params and info.

  • Should you really want to work with regular expressions (they are slow!), invoke these with the regex() function — do not use re directly.

  • Naturally it is not permitted to use print, or otherwise route outputs to stdout, or communicate with the outside world in any way!

  • The SNMP scan function is not allowed to retrieve OIDs other than .1.3.6.1.2.1.1.1.0 and .1.3.6.1.2.1.1.2.0. Exception: the SNMP scan function has previously ensured by checking one of these two OIDs that further OIDs are only fetched from a strictly-limited number of devices.

4. Check plug-in behavior

4.1. Exceptions

Your check plug-in should not, rather it must always assume that an agent’s output is syntactically valid. The plug-in is in no case permitted to attempt to handle unknown error situations in the output itself!

Why is this so? Checkmk has a very refined function for automatically handling such errors. For the user it can generate comprehensive crash reports, and it also sets the status of the plug-in to UNKNOWN. This is much more helpful than if the check, for example, simply produces an unknown SNMP code 17.

The discovery, parse and/or check function should generally enter an exception if the agent’s output is not in the defined, known format for which the plug-in was developed.

4.2. saveint() and savefloat()

The saveint() and savefloat() functions convert a string into int or float and produce a 0 if the string cannot be converted (e.g. it is an empty string).

Only use these functions if the empty or invalid value is a known condition — otherwise important error messages will be suppressed (see above).

4.3. Item not found

A check that doesn’t find an item being monitored should simply produce a None, and not generate its own error message. In such a case Checkmk will produce a standardized, consistent error message, and set the service to UNKNOWN.

4.4. Thresholds

Many check plug-ins have parameters which define thresholds for specific metrics, and thus determine when the check assumes a WARN or CRIT status. Be aware of the following rules that ensure Checkmk reacts consistently:

  • The thresholds for WARN and CRIT should always be verified with >= and <=. Example: a plug-in monitors the length of a mail queue. The critical upper limit is 100. This means that if the actual value is ‘100’ it is already critical!

  • If there are only upper, or only lower thresholds (the commonest cases), then the entry fields in the rule set should be coded with Warning at and Critical at.

  • If there are upper and lower thresholds, the coding should be as follows: Warning at or above, Critical at or above, Warning at or below and Critical at or below.

4.5. Check plug-in output

Every check produces one line of text — the plug-in output. To achieve a consistent behavior for all plug-ins, the following rules apply:

  • For showing measured values, exactly one blank character should separate the value and the unit (e.g. 17.4 V). The only exception to this rule is with %, where there is no blank: 89.5%.

  • When listing measured values, the value’s name with an initial capital is followed by a colon. Example: Voltage: 24.5 V, Phase: negative, Flux-Compensator: operational

  • Do not show internal keys, codewords, SNMP-internals or other rubbish in plug-in outputs which is of no use to the user. Use meaningful human-readable terms. Use terms that the user normally expects. Example: Use route monitor has failed rather than routeMonitorFail.

  • If the check item has an additional specification, code this in square brackets at the beginning of the output (e.g. Interface 2 - [eth0] …​).

  • In listings, items are separated by commas, and following items have initial capitals: Swap used: …​, Total virtual memory used: …​

4.6. Default thresholds

Every plug-in that works with thresholds should have meaningful default threshold values defined for it. The following rules apply:

  • The default thresholds used in the check should also be defined 1:1 as default parameters in the applicable rule set.

  • The default thresholds should be defined in factory_settings (if the check has a dictionary as a parameter).

  • The default thresholds should be selected on a technically-sound basis. Is there a manufacturer’s specification? Are there best practices?

  • It is essential that the source of the thresholds be documented in the check.

4.7. Nagios vs. CMC

Ensure that your check also functions with a Nagios monitoring core. That is usually the case automatically, but not always.

5. Metrics

5.1. Formats for metrics

  • The check plug-in always returns metric data as int or float. Strings are not allowed.

  • If you wish to output the six-tuple from a metric value field, use None in its position. Example: [("taple_util", utilization, None, None, 0, size)]

  • If you do not require the entry at the end, simply shorten the tuple. Do not use a None at the end.

5.2. Naming the metrics

  • Metric names consist of lower case letters and underscores. Numerals are permitted, but not leading.

  • Metric names should be, as with check plug-ins, short and specific. Metrics that will be used by multiple plug-ins should have generic names.

  • Avoid using the pointless filler word current. The measured value is always the current one.

  • The metric should be named after the ‘thing’, not after the unit of measurement. Thus, for example, current rather than ampere, or size rather than bytes.

Important: Always use the canonical size. Really! Checkmk scales the data itself as appropriate. Examples:

Measurement type Canonical unit

Duration

Seconds

File size

Bytes

Temperature

Celsius

Network throughput

Octets per second (not bits per second!)

Percentage value

A value from 0 to 100 (not 0.0 to 1.0)

Events per time period

1 per second

Electrical performance

Watt (not mW)

5.3. Flag for metric data

Only set "has_perfdata" in check_info to True if the check actually outputs metric data (or can output it).

5.4. Definition for graph and Perf-O-Meter

The definitions for graphs should be like the definitions in ~/web/plugins/metrics/check_mk.py. Do not create definitions for PNP graphs. In Checkmk Raw as well these will be generated on the basis of the metric definitions in Checkmk itself.

6. Definition of the rule set

6.1. Check group name

Check plug-ins with parameters require a compulsory rule set definition. The connection between a plug-in and a rule set is made through the check group (the entry "group" in check_info). All checks that are configured with the same rule set are consolidated via the group.

If your plug-in should sensibly be configured with an existing rule set, then also use an existing group.

If your plug-in is so specific that it in any case requires its own group, then create an own group for it where the group’s name should reference the plug-in.

Should it be foreseeable that in the future further plug-ins could use the same rule set, then use an appropriately generic name.

6.2. Default values for ValueSpecs

When defining your parameter definitions (ValueSpecs) use the exact same default values as the defaults actually used in the checks (if possible).

Example: if without a rule the check assumes the threshold (5, 10) for WARN and CRIT, then the ValueSpec should be so defined that 5 and 10 will be automatically offered as thresholds.

6.3. Choosing ValueSpecs

For some types of data there are specialized ValueSpecs. An example is Age for a certain number of seconds. This must be used wherever it is appropriate. Do not, for example, use Integer in such a case.

7. Include files

For a number of types of checks there are already-prepared implementations in include files, that not only can be used, but should be used. Important include files are:

temperature.include

Monitoring of temperatures

elphase.include

Electrical AC phases (e.g. in USV)

fan.include

Fans

if.include

Network interfaces

df.include

File system levels

mem.include

Monitoring of RAM (Main storage)

ps.include

Operating system processes

Important: use existing include files only if these have been designed for the purpose at hand, and not simply because they are an approximate fit!

8. Man pages

Each check plug-in must have a man page. If you have programmed several plug-ins in one check plug-in file, each of these must of course have its own man page.

The man page is intended for the user! Write information that will help them. Here it is not about documenting what you have programmed, but about giving the user the useful information that they need.

A man page must be

  • complete,

  • precise,

  • short,

  • helpful.

A man page consists of several sections — some of which are optional:

8.1. Title

With the title: macro you determine the heading. This consists of:

  • the exact device name or device group for which the check is written,

  • information on what the check monitors (e.g. system health).

These two parts are separated by a colon — only in this way can existing checks be easily searched for and, above all, found.

8.2. Agent categories

The agents: macro can have different categories. There are basically three categories:

  • Agents: In this case the operating systems for which the check was built and is available for are specified, for example linux, or linux, windows, solaris.

  • SNMP: In this case there is only the entry snmp.

  • Active checks: If an active check has been integrated into the Checkmk interface, use the category active.

8.3. Catalog entry

Use the catalog: header to specify where the man page is to be stored in the check plug-in catalog.

If a category is missing — for example, for a new manufacturer — the category must be defined in the catalog_titles variable in the cmk/utils/man_pages.py file. Currently this file cannot be extended in local/ by plug-ins, so only the developers of Checkmk can make changes here.

Note the exact capitalization of product and company names! This applies not only to the catalog entry, but also to all other texts where these occur. Example: NetApp is always written NetApp, and not netapp, NETAPP, Netapp, or similar. Google can help to find the correct spelling.

8.4. Plug-in description

The following information must be included in the description: in the man page:

  • Exactly what hardware or software does the check monitor? Are there special features of certain firmware or product versions of the devices? Do not refer to a MIB, but to product designations. Example: It is not helpful if you write ‘This check works for all devices that support the Foobar-17.11-MIB’. Write precisely which product lines or similar are supported.

  • Which aspect of this is monitored? What does the check do?

  • Under what conditions is the check OK, WARN or CRIT?

  • Is an agent plug-in required for the check? If yes — how is it installed? This must work without the Agent Bakery.

  • Are there any other requirements for the check to work (preparation of the target system, installation of drivers, etc.). These should only be listed if they are not normally fulfilled anyway (e.g. mounting of /proc under Linux).

Do not write anything that affects all checks together. For example, do not repeat general things like how to set up SNMP-based checks.

8.5. Item

For checks that have an item (i.e., a %s in the service name), the man page under item: must describe how it is formed. If the check plug-in does not use an item, you can omit this line completely.

8.6. Service discovery

Under inventory:, write under which conditions this check’s service(s) will be found automatically, i.e. how the service discovery behaves. An example from nfsmounts:

nfsmounts
inventory:
  All NFS mounts are found automatically. This is done
  by scanning {/proc/mounts}. The file {/etc/fstab} is irrelevant.

Make sure that the text is understandable without deeper knowledge of an MIB or the code — so do not write:

One service is created for each temperature sensor if the state is 1.

Instead, it is better to translate as much as possible:

One service is created for each temperature sensor, if the state is "active".
On this page