
1. Introduction

An important advantage with Checkmk compared to other monitoring systems is the large number of well-maintained check plug-ins supplied as standard. In order for these plug-ins to have a uniformly high quality there are standardised criteria that each plug-in must meet. In this article we will introduce these criteria.

An important note regarding the criteria: please don’t simply assume that all plug-ins supplied with Checkmk conform to all current standards. Please avoid copy & paste. It is more advisable to orient your work according to the information in this article.

Should you be developing plug-ins solely for your own use, you are of course quite free and are not bound by our standards.

2. Preface

2.1. Quality

For check plug-ins that are official components in Checkmk, or are planned to be such, higher quality is demanded in comparison to those written ‘for your own use’. This expectation applies to their ‘outer’ quality (as seen by the user), as well as to their internal quality (the readability of code, etc.).

Please code the plug-in well, and to as high a standard as you are able to.

2.2. The basics of a check plug-in

Every check plug-in must as a minimum include the following components:

  • The check plug-in itself

  • A Manpage

  • Plug-ins with check parameters require a WATO-Definition for the applicable Rule Set

  • Metrics definitions for graphs and the Perf-O-Meter if the check produces metrics data

  • A definition for the Agent Bakery if an agent plug-in is present

  • Several complete and diverse examples of the agent’s output or, for SNMP-based checks, of SNMP walks

3. Naming conventions

3.1. Plug-in names

Choosing a name for a plug-in is especially critical since this name cannot be altered at a later date.

  • A plug-in’s name must be short, sufficiently explicit, and understandable. Example: firewall_status is only a good name if the plug-in functions for all, or at least many firewalls.

  • A name is composed of lower case letters and numerals. An underscore is permitted as a separator.

  • The words status or state are unnecessary in a name, since of course every plug-in monitors a status. The same applies for the superfluous word current. So, rather than foobar_current_temp_status simply use just foobar_temp.

  • Checks that apply to a physical thing (e.g. fan, power supply), should have a name in the singular – for example, casa_fan, oracle_tablespace. Checks in which each item refers to a number or multiples should be named using a plural – for example, user_logins, printer_pages.

  • Product-specific checks should be prefixed with the product name — e.g., oracle_tablespace.

  • Manufacturer-specific check plug-ins that do NOT apply to a SPECIFIC product should be prefixed with the manufacturer’s abbreviation, e.g., fsc_ for Fujitsu Siemens Computers.

  • SNMP-based check plug-ins that use a common component of the MIB which may well be supported by more than one manufacturer should be named after the MIB, rather than after a manufacturer — e.g., the hr_*-check plug-ins.
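
The purely formal rules above (permitted characters, no filler words) can be checked mechanically. The following is a minimal sketch of such a check; the helper plugin_name_problems is hypothetical and not part of Checkmk, and it cannot judge whether a name is short, explicit or correctly prefixed.

```python
import string

# Hypothetical helper, not part of Checkmk.
_ALLOWED = set(string.ascii_lowercase + string.digits + "_")
_FILLER_WORDS = ("status", "state", "current")


def plugin_name_problems(name):
    """Return a list of problems with a proposed name; empty means OK."""
    problems = []
    if not name or set(name) - _ALLOWED:
        problems.append("use only lower case letters, numerals and underscores")
    for word in name.split("_"):
        if word in _FILLER_WORDS:
            problems.append("drop the filler word '%s'" % word)
    return problems
```

With this sketch, foobar_temp yields an empty list, while foobar_current_temp_status is flagged once for current and once for status.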

3.2. Service names

  • Use common and well-defined abbreviations (e.g. CPU, DB, VPN, IO,…).

  • Write abbreviations in upper case.

  • Use sentence case (e.g. CPU utilization, not Cpu Utilization) — proper names are an exception.

  • Write product names as defined by the vendor (e.g. vSphere).

  • Use American English (e.g. utilization, not utilisation).

  • Stick to existing naming schemes, if they exist. For example, all interface services use the template "Interface %s".

  • Try to be brief as those names show up in dashboards, views and reports and might be cut, if too long.

3.3. Names for metrics

  • If a meaningful definition already exists for a metric, always reuse it.

  • Otherwise similar rules as used for the naming of check plug-ins apply (product-specific, manufacturer-specific, etc.)

3.4. WATO-rule group names

  • The same convention applies as with metrics.

4. Constructing check plug-ins

4.1. General structure

The actual Python file under share/check_mk/checks/ should have the following structure, in this order:

  1. A file header with a GPL-Notice

  2. The name and email address of the original author if the plug-in has not been developed by the Checkmk project.

  3. A short sample of the agent’s output

  4. Default values for the check parameter (factory_settings)

  5. Help functions, if available

  6. The Parse function, if available

  7. The Discovery function

  8. The Check function

  9. The check_info-declaration
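
As an illustration of this order, here is a minimal sketch of a check file for a hypothetical foobar_queue plug-in. The agent section, the threshold values, the service description and the group name are invented for this example. The two dictionaries at the top are defined locally only so that the sketch runs on its own; in a real check file they are provided by Checkmk.

```python
# (1) The GPL file header would go here.
# (2) Author: Jane Doe <jane@example.com>  (external contributions only)

# (3) Example agent output (invented):
# <<<foobar_queue>>>
# incoming 5
# deferred 120

# Stand-ins so that the sketch is runnable. Do not define these in a real
# check file.
factory_settings = {}
check_info = {}

# (4) Default parameters. Placeholder values, not a recommendation.
factory_settings["foobar_queue_default_levels"] = {"levels": (50, 100)}

# (5) Helper functions would go here.


# (6) Parse function: the raw lines arrive as "info".
def parse_foobar_queue(info):
    return {name: int(length) for name, length in info}


# (7) Discovery function: one service per queue.
def inventory_foobar_queue(parsed):
    for name in parsed:
        yield name, {}


# (8) Check function: a single result, so it uses return.
def check_foobar_queue(item, params, parsed):
    if item not in parsed:
        return None  # Checkmk reports the missing item itself
    length = parsed[item]
    warn, crit = params["levels"]
    if length >= crit:
        state = 2
    elif length >= warn:
        state = 1
    else:
        state = 0
    return state, "Length: %d" % length, [("queue_length", length, warn, crit)]


# (9) The check_info declaration, with only the keys that are needed.
check_info["foobar_queue"] = {
    "parse_function": parse_foobar_queue,
    "inventory_function": inventory_foobar_queue,
    "check_function": check_foobar_queue,
    "service_description": "Queue %s",
    "has_perfdata": True,
    "group": "foobar_queue",
    "default_levels_variable": "foobar_queue_default_levels",
}
```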

4.2. Coding guidelines

Author

  • If the plug-in has not been developed by the Checkmk-Team, the author’s name and email address should be coded directly after the GPL-header.

Readability

  • Avoid long lines of code — the maximum permitted length is 100 characters.

  • In each case the indentation is four blank characters — do not use tabs.

  • Follow the Python standard PEP 8.

Sample agent outputs

Including a sample of an agent’s output greatly simplifies the reading of the code. When doing so it is important to include various possible outputs in the sample. Make the sample no longer than necessary. With SNMP-based checks provide an SNMP-Walk:

 Example excerpt from SNMP data:
 .1.3.6.1.4.1.2.3.51.2.2.7.1.0  255
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.1.1  1
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.2.1  "Good"
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.3.1  "No critical or warning events"
 .1.3.6.1.4.1.2.3.51.2.2.7.2.1.4.1  "No timestamp"

If, for example, the target devices produce differing output formats because of differing firmware versions, provide one example for each format and note the version. A good example of this case can be found in the multipath check plug-in.

SNMP-MIBs

When defining the snmp_info the readable path to the OID should be given in the comments. Example:

    'snmp_info' : [(".1.3.6.1.2.1.47.1.1.1.1", [
        OID_END,
        "2",    # ENTITY-MIB::entPhysicalDescr
        "5",    # ENTITY-MIB::entPhysicalClass
        "7",    # ENTITY-MIB::entPhysicalName
    ]),

Using lambda

Avoid complex expressions with lambda. lambda is permitted in the scan function (lambda oid: …) and when you wish to invoke an existing function with only an altered argument, for example:

     "inventory_function" : lambda info: inventory_foobar_generic(info, "temperature")

Iterating through SNMP-agent data

With checks that parse SNMP-data, an index like this should not be used…​

    for line in info:
        if line[1] != '' and line[0] ...

It is better to unpack each line as meaningful variables:

    for sensor_id, sensor_state, foo, bar in info:
        if sensor_state != '1' and sensor_id ...

Parse functions

Always use parse functions whenever parsing an agent’s output is not trivial. The parse function’s argument should always be named info, and in the discovery and check functions the argument should be named parsed instead of info. In this way it will be clear to the reader that this result is from a parse function.

Checks with multiple partial results

A check that produces multiple partial results — for example, current allocations and growth — must return these with yield. Checks that produce only a single result must use return.

    if "abs_levels" in params:
        warn, crit = params["abs_levels"]
        if value >= crit:
            yield 2, "...."
        elif value >= warn:
            yield 1, "...."
        else:
            yield 0, "..."

    if "perc_levels" in params:
        warn, crit = params["perc_levels"]
        if percentage >= crit:
            yield 2, "...."
        elif percentage >= warn:
            yield 1, "...."
        else:
            yield 0, "..."

The (!) and (!!) markers are obsolete and may no longer be used. These should be replaced by yield.

Keys in check_info[…​]

In your entry in check_info, only store keys that are actually used. The only required entries are ‘service_description’ and ‘check_function’. Only insert ‘has_perfdata’ and other keys with boolean values if their value is True.

4.3. Agent plug-ins

If your check plug-in requires an agent plug-in, then be aware of the following rules:

  • Store the plug-in in share/check_mk/agents/plugins for Unix-type systems, and set the execution rights to 755.

  • In Windows the directory is called share/check_mk/agents/windows/plugins.

  • Shell and Python scripts should have no file name extension (omit .sh and .py).

  • Use #!/bin/sh in the first line of shell scripts. Only use #!/bin/bash if Bash features are required.

  • Use the standard Checkmk-file heading with the GPL-notice.

  • Your plug-in must not damage the target system, especially if the plug-in is not actually supported by the system.

  • Do not forget the reference to the plug-in on the check’s manpage.

  • If the component that the plug-in is to monitor doesn’t actually exist on a system, the plug-in must not output a section header.

  • If the plug-in requires a configuration file, it should (in Linux) look for it in the $MK_CONFDIR directory. The file must have the same name as the plug-in, but with the .cfg extension and without a possible mk_ prefix. The procedure is similar for Windows, where the directory is %MK_CONFDIR%.

  • Do not code plug-ins for Windows in Powershell. This is not portable, and is in any case very resource-greedy. Use VBS.

  • Do not code Plug-ins in Java.
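
Two of the rules above, the location of the configuration file and the omission of the section header, can be sketched for a Python agent plug-in as follows. Everything named foobar here (the plug-in mk_foobar, the section, the foobarctl binary) is invented, and the fallback directory /etc/check_mk is an assumption for the case that MK_CONFDIR is not set.

```python
#!/usr/bin/env python3
# Hypothetical agent plug-in "mk_foobar"; all foobar names are invented.
import os
import shutil


def config_path(plugin_name, environ):
    """Return $MK_CONFDIR/<plug-in name without mk_ prefix>.cfg."""
    base = plugin_name[3:] if plugin_name.startswith("mk_") else plugin_name
    # Assumed fallback if the agent did not export MK_CONFDIR.
    confdir = environ.get("MK_CONFDIR", "/etc/check_mk")
    return os.path.join(confdir, base + ".cfg")


def section_lines(foobar_binary):
    """Return the lines to print, or nothing if foobar is not installed."""
    if foobar_binary is None:
        return []  # component missing: no section header either
    return ["<<<foobar>>>", "binary %s" % foobar_binary]


if __name__ == "__main__":
    for line in section_lines(shutil.which("foobarctl")):
        print(line)
```

On a system without foobarctl the script prints nothing at all, so no empty section reaches the check.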

4.4. Don’ts

  • Do not use import in your check files. All permitted Python modules have already been imported.

  • Do not use datetime for parsing and calculating time specifications – use time. This can perform all needed tasks. Really!

  • Your functions must in no way modify the arguments they receive. This especially applies to params and info.

  • Should you really want to work with regular expressions (they are slow!), invoke these with the regex() function — do not use re directly.

  • Naturally it is not permitted to use print, or otherwise route outputs to stdout, or communicate with the outside world in any way!

  • The SNMP scan function is not allowed to retrieve OIDs other than .1.3.6.1.2.1.1.1.0 and .1.3.6.1.2.1.1.2.0. Exception: the scan function has first ensured, by checking one of these two OIDs, that further OIDs are retrieved only from a strictly limited number of devices.
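
A scan function that respects the last rule could look like the following sketch. The device string "foobar" is invented, and fake_oid is only a stand-in for the lookup callable that Checkmk passes to a scan function; the real callable may behave differently for missing OIDs.

```python
# Sketch of a check_info entry fragment with a scan function that reads
# only sysDescr (.1.3.6.1.2.1.1.1.0). The matched string is made up.
check_info_entry = {
    "snmp_scan_function": lambda oid: "foobar" in oid(".1.3.6.1.2.1.1.1.0").lower(),
}


def fake_oid(table):
    """Stand-in for the OID lookup callable, for trying out the sketch."""
    return lambda o: table.get(o, "")
```

Calling check_info_entry["snmp_scan_function"](fake_oid({".1.3.6.1.2.1.1.1.0": "FooBar Switch"})) evaluates to True for this invented device.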

5. Behaviour of check plug-ins

5.1. Exceptions

Your check plug-in not only may, but in fact must always assume that an agent’s output is syntactically valid. Under no circumstances may the plug-in attempt to handle unknown error situations in the output itself!

Why is this so? Checkmk has a very refined mechanism for handling such errors automatically. It generates a comprehensive crash report for the user and sets the status of the plug-in to UNKNOWN. This is much more helpful than a check that, for example, simply outputs “unknown SNMP code 17”.

The discovery, parse and/or check function should generally raise an exception if the agent’s output is not in the defined, known format for which the plug-in was developed.
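
In practice this mostly means leaving out defensive code. The sketch below, for a hypothetical device with three known state codes, deliberately contains no try/except and no fallback value, so that unexpected data raises an exception.

```python
# Known state codes of a hypothetical device. Anything else is unexpected.
FOOBAR_STATES = {"1": (0, "ok"), "2": (1, "degraded"), "3": (2, "failed")}


def parse_foobar_state(info):
    parsed = {}
    for sensor_id, sensor_state in info:
        # An unknown code raises KeyError, a line with the wrong number of
        # columns raises ValueError. Both are left unhandled on purpose.
        parsed[sensor_id] = FOOBAR_STATES[sensor_state]
    return parsed
```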

5.2. saveint() and savefloat()

The saveint() and savefloat() functions convert a string into an int or float and produce a 0 if the string cannot be converted (e.g. it is an empty string).

Only use these functions if the empty or invalid value is a known condition. Otherwise important error messages will be suppressed (see above).
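
The difference can be made concrete with a small sketch. The saveint() below is only a stand-in that mimics the behaviour described above; inside a check file the real function is already available and must not be redefined. The input values are invented.

```python
# Stand-in with the behaviour described above, for illustration only.
def saveint(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


# Appropriate: the device is known to send "" for an unused counter.
unused_counter = saveint("")

# Risky: garbage silently becomes 0 and the format error stays hidden.
hidden_error = saveint("n/a???")
```

Both calls produce 0, which is exactly why the second use hides a problem that int() would have reported.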

5.3. Item not found

A check that doesn’t find the item being monitored should simply return None and not generate its own error message. In such a case Checkmk will produce a standardised, consistent error message, and set the service to UNKNOWN.

5.4. Thresholds

Many check plug-ins have parameters which define thresholds for specific metrics, and thus determine when the check assumes a WARN or CRIT status. Please be aware of the following rules that ensure Checkmk reacts consistently.

  • The thresholds for WARN and CRIT should always be checked with >= and <=. Example: a plug-in monitors the length of a mail queue. The critical upper limit is 100. This means that an actual value of 100 is already critical!

  • If there are ONLY upper or ONLY lower thresholds (the most common cases), the entry fields in WATO should be labeled Warning at and Critical at.

  • If there are upper AND lower thresholds, the fields should be labeled as follows: Warning at or above, Critical at or above, Warning at or below and Critical at or below.
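
The inclusive comparison from the first rule can be sketched as follows. The helper state_from_levels is hypothetical and only illustrates the comparison operators; it is not a Checkmk function.

```python
def state_from_levels(value, upper=None, lower=None):
    """Hypothetical helper: upper and lower are (warn, crit) pairs or None."""
    state = 0
    if upper is not None:
        warn, crit = upper
        if value >= crit:      # reaching the limit already counts
            state = 2
        elif value >= warn:
            state = 1
    if lower is not None:
        warn, crit = lower
        if value <= crit:
            state = max(state, 2)
        elif value <= warn:
            state = max(state, 1)
    return state
```

For the mail queue example, state_from_levels(100, upper=(50, 100)) gives 2, because the value has reached the critical limit.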

5.5. Check plug-in outputs

Every check produces one line of text, the plug-in output. To achieve consistent behavior across all plug-ins, the following rules apply:

  • For showing measured values, exactly one blank character should separate the value and the unit (e.g. 17.4 V). The only exception to this rule is with %, where there is no blank: 89.5%.

  • When listing measured values, the value’s name with an initial capital is followed by a colon. Example: Voltage: 24.5 V, Phase: negative, Flux-Compensator: operational

  • Do not show internal keys, codewords, SNMP-internals or other rubbish in plug-in outputs which is of no use to the user. Use meaningful human-readable terms. Use terms that the user normally expects! Example: Use route monitor has failed rather than routeMonitorFail.

  • If the check item has an additional specification, code this in square brackets at the beginning of the output (e.g. Interface 2 - [eth0] …​)

  • In listings, items are separated by commas, and following items have initial capitals: Swap used: …​, Total virtual memory used: …​
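
The spacing and listing rules above can be captured in two small helpers. This is only a sketch; render_value and render_infotext are hypothetical names and not part of Checkmk.

```python
def render_value(title, value, unit):
    """Return 'Title: value unit', with no blank before a % sign."""
    separator = "" if unit == "%" else " "
    return "%s: %s%s%s" % (title, value, separator, unit)


def render_infotext(parts):
    """Join several rendered parts with commas, as in a listing."""
    return ", ".join(parts)
```

render_value("Voltage", 24.5, "V") produces "Voltage: 24.5 V", while a percentage such as render_value("Usage", 89.5, "%") produces "Usage: 89.5%".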

5.6. Default thresholds

Every plug-in that works with thresholds should have meaningful default threshold values defined for it. The following rules apply:

  • The default thresholds used in the check should also be defined 1:1 as default parameters in the applicable WATO-rule.

  • The default thresholds should be defined in factory_settings (if the check has a dictionary as a parameter).

  • The default thresholds should be selected on a technically sound basis. Is there a manufacturer’s specification? Are there best practices?

  • It is essential that the source of the thresholds be documented in the check.

5.7. Nagios vs. CMC

Ensure that your check also functions with a Nagios core. That is usually the case automatically, but not always.

6. Metrics

6.1. Formats for metrics

  • The check plug-in always returns metric data as int or float. Strings are not allowed.

  • If you wish to omit fields from the six-tuple of a metric value, use None in their position. Example: [("taple_util", utilization, None, None, 0, size)]

  • If you don’t require the entry at the end, simply shorten the tuple. Do not use a None at the end.
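
The three rules above can be combined in one small helper. This is a sketch with a hypothetical function named metric; the field order (name, value, warn, crit, minimum, maximum) follows the six-tuple shown in the example.

```python
def metric(name, value, warn=None, crit=None, minimum=None, maximum=None):
    """Hypothetical helper that builds one metric tuple."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("metric values must be int or float")
    fields = [name, value, warn, crit, minimum, maximum]
    while fields[-1] is None:
        fields.pop()  # shorten the tuple instead of ending it with None
    return tuple(fields)
```

With this sketch, metric("fs_used", 10, minimum=0, maximum=100) keeps the two None placeholders in the middle, whereas metric("fs_used", 10) returns just the name and the value.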

6.2. Naming the metrics

  • Metric names consist of lower case letters and underscores. Numerals are permitted, but not leading.

  • Metric names should be, as with check plug-ins, short and specific. Metrics that will be used by multiple plug-ins should have generic names.

  • Avoid using the pointless filler word current. The measured value is always the current one.

  • The metric should be named after the ‘thing’, not after the unit of measurement. Thus, for example, current rather than ampere, or size rather than bytes.

  • Important: always use the canonical unit. Really! Checkmk scales the data itself as appropriate. Example:

Measurement type          Canonical unit
Duration                  Seconds
File size                 Bytes
Temperature               Celsius
Network throughput        Octets per second (not bits/sec!)
Percentage value          A value from 0 to 100 (not 0.0 to 1.0)
Events per time period    1 per second
Electrical power          Watts (not mW)
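
If a device reports values in other units, the check converts them before returning the metric. The following sketch shows the arithmetic for four of the rows above; the function to_canonical and all dictionary keys are invented for illustration.

```python
def to_canonical(raw):
    """Convert readings in device units (invented keys) to canonical units."""
    return {
        "uptime": raw["uptime_minutes"] * 60,           # minutes -> seconds
        "if_in_octets": raw["in_bits_per_sec"] / 8.0,   # bits/s -> octets/s
        "util": raw["util_fraction"] * 100.0,           # 0.0-1.0 -> 0-100
        "power": raw["power_mw"] / 1000.0,              # mW -> W
    }
```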

6.3. Flags for metric data

  • Only set ‘has_perfdata’ in check_info to True if the check actually outputs metric data (or can output it).

6.4. Definitions for graphs and the Perf-O-Meter

The definitions for graphs should follow the model of the definitions in web/plugins/metrics/check_mk.py. Do not create definitions for PNP graphs. In the Raw Edition, too, graphs are generated from the metric definitions in Checkmk itself.

7. WATO-Definition

7.1. Check group names

Check plug-ins with parameters require a compulsory WATO-rule definition. The connection between a plug-in and a rule is made through the check group (the entry ‘group’ in check_info). All checks that are configured with the same rule set are consolidated via the group.

If your plug-in should sensibly be configured with an existing rule set, then also use an existing group.

If your plug-in is so specific that it requires its own group in any case, then create a separate group for it whose name references the plug-in.

Should it be foreseeable that in the future further plug-ins could use the same rule set, then use an appropriately generic name.

7.2. Default values for ValueSpecs

When defining your parameter definitions (ValueSpecs) use the exact same default values as the defaults actually used in the checks (if possible). Example: if without a rule the check assumes the threshold (5, 10) for WARN and CRIT, then the ValueSpec should be so defined that 5 and 10 will be automatically offered as thresholds.

7.3. Choosing ValueSpecs

For some types of data there are specialised ValueSpecs. An example is Age for a certain number of seconds. This must be used wherever it is appropriate. Do not, for example, use Integer in such a case.

8. Include-files

For a number of types of checks there are already-prepared implementations in include-files, that not only can be used, but SHOULD be used. Important include-files are:

temperature.include    Monitoring of temperatures
elphase.include        Electrical AC phases (e.g. in a UPS)
fan.include            Fans
if.include             Network interfaces
df.include             File system levels
mem.include            Monitoring of RAM (main memory)
ps.include             Operating system processes

Important: use existing Include files only if these have been designed for the purpose at hand, and not simply because they are an approximate fit!

9. Manpages

Each check plug-in must have a Manpage. If you have programmed several plug-ins in one check file (subchecks), each of these must of course have its own Manpage.

The Manpage is intended for the user! Write information that will help them. Here it is not about documenting what you have programmed, but about giving the user the useful information that they need.

A Manpage must be:

  • complete

  • precise

  • short

  • helpful!

A Manpage consists of several sections — some of which are optional:

9.1. Title

With the title: macro you determine the heading. This consists of:

  • The exact device name or device group for which the check is written

  • What the check monitors, e.g., System Health

These two parts are separated by a colon — only in this way can existing checks be easily searched for and, above all, found.

9.2. Agent categories

The agents: macro can have different categories. There are basically three categories:

  • Agents: In this case the operating systems for which the check was built and is available for are specified. For example linux, or linux, windows, solaris

  • SNMP: In this case there is only the entry snmp

  • Active checks: If an active check has been integrated into the Checkmk interface, use the category active

9.3. Catalogue entries

With the header catalog: you define where in the Checkmanpages catalogue the plug-in is stored. If a category is missing — for example, for a new manufacturer — the category must be defined in the catalog_titles variable in the cmk/man_pages.py file — or from Version 1.6.0 in the cmk/utils/man_pages.py file.

Currently this file cannot be extended in local/ by plug-ins, so only the developers of Checkmk can make changes here.

Please note the exact capitalization of product and company names! This applies not only to the catalogue entry, but also to all other texts where these occur. Example: NetApp is always written NetApp, and not netapp, NETAPP, Netapp, or similar. Google can help to find the correct spelling!

9.4. Description of the plug-in

The following information must be included under description: in the Manpage:

  • Exactly what hardware or software does the check monitor? Are there special features of certain firmware or product versions of the devices? Do not refer to a MIB, but to product designations. Example: It is not helpful if you write “This check works for all devices that support the Wrdpfrmpft-17.11-MIB”. Write precisely which product lines or similar are supported.

  • Which aspect of this is monitored? What does the check do?

  • Under what conditions is the check OK, WARN or CRIT?

  • Is an agent plug-in required for the check? If yes — how is it installed? This must work without the Agent Bakery.

  • Are there any other requirements for the check to work (preparation of the target system, installation of drivers, etc.)? These should only be listed if they are not normally fulfilled anyway (e.g. mounting of /proc under Linux).

Do not write anything that affects ALL checks together. For example, do not repeat general things like how to set up SNMP-based checks.

9.5. Item

For checks that have an item (i.e., a %s in the service name), the Manpage under item: must describe how it is formed. If the check plug-in does not use an item, you can omit this line completely.

9.6. Service Discovery

Under inventory:, write under which conditions this check’s service(s) will be found automatically, i.e. how the ‘Service Discovery’ behaves. An example from nfsmounts:

nfsmounts
inventory:
  All NFS mounts are found automatically. This is done
  by scanning {/proc/mounts}. The file {/etc/fstab} is irrelevant.

Make sure that the text is understandable without deeper knowledge of an MIB or the code — so do not write:

One service is created for each temperature sensor if the state is 1.