Checkmk Synthetic Monitoring

For now, this is a mostly machine-translated version of the same article in German. A non-native English speaker has proofread it to provide a minimum viable article for getting started with Robotmk. We believe it is better for the planet to run the LLM (Large Language Model) translation only once, with an initial proofreading for better quality, than to point our readers to their favorite online translation tool and have them translate the German version over and over with varying quality.

However, we know about the limitations of LLMs and have therefore queued this article for our human translator to proofread and partially retranslate.

1. Synthetic Monitoring with Robot Framework

Checkmk Synthetic Monitoring is available in the commercial Checkmk editions, but requires an additional subscription. However, you can test the function with up to three tests free of charge and without a time limit.

With Checkmk you can monitor your own infrastructure very closely, right down to the question of whether a particular service, such as a web server, is running properly. If your website is operated via a third-party cloud service, you will not have access to the service itself, but you can use an HTTP check to verify that the website is accessible. But what does that say about the user experience? The fact that an online store is accessible does not mean that navigation, ordering processes and the like work smoothly.

This is where Checkmk Synthetic Monitoring comes in. With the Robotmk plugin, Checkmk offers genuine end-to-end monitoring, i.e. the monitoring of running applications from the user’s perspective. The actual testing is carried out by the open source software Robot Framework — of which Checkmk GmbH is also a member.

The automation software can be used to completely automate user behavior, for example to simulate order processes in online stores click by click. The special thing about Robot Framework is that tests are not written in a fully-fledged programming language, but are defined using easy-to-use keywords such as Open Browser or Click Button. An Open Browser checkmk.com is sufficient to call up the Checkmk website. Several test cases are then combined in so-called test suites (in the form of a .robot file).

Robotmk can now trigger these Robot Framework test suites on the host and monitor their execution and results as services in Checkmk. In the Checkmk web interface you will then find the status, associated performance graphs and the original evaluations of Robot Framework itself.

1.1. Components

Different components play together for this end-to-end monitoring, so here is a brief overview.

Checkmk server

Checkmk Synthetic Monitoring is implemented via Robotmk, which uses an agent plugin as a data collector and the Robotmk scheduler (on the monitored host) to trigger Robot Framework projects. Synthetic monitoring is activated and configured via the Robotmk scheduler rule. Here you specify which test suites should be executed and how exactly Robot Framework should start them, summarized in a plan. Once rolled out, the Robotmk scheduler on the target host ensures the scheduled execution of your Robot Framework suites.

In monitoring, you will ultimately receive several new services: RMK Scheduler Status shows the status of the scheduler itself, i.e. whether test suites could be started successfully. There are also services for all configured test plans (such as RMK MyApp1 Plan) and individual tests from test suites (such as RMK MyApp1 Test). The services of the individual tests also include the original Robot Framework reports.

Last but not least, there are two optional service rules: Robotmk plan and Robotmk test let you fine-tune the plan and test services, for example to trigger status changes at certain runtimes.

The Robotmk rules in Checkmk

Testing host

You must provide the Robot Framework test suites on a Windows host. For execution, Robot Framework requires access to their dependencies (Python, libraries, drivers for browser automation and so on). This configuration is independent of Checkmk and can be stored declaratively in a portable package. This is done by the open source command line tool RCC: This tool uses your configuration files in YAML format to build virtual Python environments including dependencies and the Robot Framework itself. The Robotmk scheduler running as a background process triggers this build and then executes the tests itself.

Such an RCC automation package with the package configuration (robot.yaml), the definition of the execution environment (conda.yaml) and the test suites (tests.robot) is also called a robot. RCC and the scheduler are rolled out with the Checkmk agent; the automation package itself must already be available on the host.

The great advantage of RCC is that the executing Windows host itself does not require a configured Python environment.

The agent itself is only required for the transfer of results, logs and screenshots. This also enables the monitoring of very long-running or locally very resource-intensive suites — provided that your Windows host has the corresponding capacities.

2. Monitoring test suites with Robotmk

In the following, we will show you how to include a test suite in the monitoring and monitor it. A simple Hello World suite is used as an example, which only outputs two strings and waits briefly in between. An introduction to Robot Framework is of course not the topic here, but a brief look at the automation package and the demo test suite is necessary so that you can see which data ends up where in the monitoring.

The example runs on the basis of RCC, so that the Windows host does not have to be configured separately. The rcc.exe is rolled out with the agent and can be found under C:\ProgramData\checkmk\agent\bin\. You can download the sample suite as a ZIP file via GitHub. The directory of the suite:

C:\robots\mybot\

conda.yaml
robot.yaml
tests.robot

Important: RCC can also process test suites based on a number of other programming languages, but for use in Checkmk it must be the Robot Framework declaration.

The suite folder now contains two important files: The declaration of the environment required for execution in the file conda.yaml and the actual tests in the file tests.robot (the suite). The robot.yaml is not relevant for use in Checkmk, but is required by RCC.

In this case, only the Python, Pip and Robot Framework dependencies are installed for the environment. The environment build later appears in the monitoring as RCC environment build status. The tests can only be processed and consequently monitored if the environment is built successfully.

C:\robots\mybot\conda.yaml

channels:
  - conda-forge

dependencies:
  - python=3.10.12
  - pip=23.2.1
  - pip:
     - robotframework==7.0

The actual test suite now looks like this:

C:\robots\mybot\tests.robot

*** Settings ***
Documentation    Template robot main suite.

*** Variables ***
${MYVAR}    Hello Checkmk!

*** Test Cases ***
My Test
    Log    ${MYVAR}
    Sleep    3
    Log    Done.

This suite merely logs the value of the variable MYVAR, waits for 3 seconds and finally logs Done. You can set the value of the variable later in the Checkmk web interface. Otherwise the default Hello Checkmk! set here will be used.

You can run this test suite manually. To do this, the agent and RCC must already be installed (or you can download the RCC binary yourself). First navigate to your test suite folder, where the tests.robot is also located. Then start the RCC shell with C:\ProgramData\checkmk\agent\bin\rcc.exe task shell. The virtual environment defined in conda.yaml is then created. Then start the suite with robot tests.robot.

And this is exactly what the Robotmk scheduler does as soon as the agent plugin is active.

2.1. Configure rule for the agent plugin

You can find the Robotmk scheduler under Setup > Agent rules > Robotmk scheduler (Windows). As the rule is quite extensive, here is a look at the empty configuration:

Configuration of the agent plugin

First, the scheduler requires the base directory in which all your test suites are located. Enter this arbitrary, absolute path under Base directory of suites, for example C:\robots.

Base directory for all Robot Framework projects

The Parallel plan groups shown next are a concept specific to Checkmk.

To explain this, we must first go down one hierarchy level: Here you can see the item Sequential plans. Such a sequential plan defines which suites are to be executed with which parameters. Robot Framework then processes these suites one after the other. The reason is simple: in practice, tests are sometimes run on the desktop, and several test suites running at the same time could get in each other’s way (think of them stealing the mouse pointer from each other).

The plan groups are now an encapsulation for sequentially executed plans — and are themselves executed in parallel. Again, the reason is simple: this allows test suites that do not rely on the desktop to be executed in their own plans without delay — such as the test suite used in this article.
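The relationship between plan groups and plans can be pictured with a small Python model. This is only a sketch of the concept described above, not how the Robotmk scheduler is implemented; the function names, the group names and the stand-in plans are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_group(plans):
    """Run the plans of one group strictly one after the other."""
    results = []
    for plan in plans:
        # The next plan only starts once the previous one has returned.
        results.append(plan())
    return results


def run_all_groups(groups):
    """Run every plan group in its own worker thread, i.e. in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = {name: pool.submit(run_group, plans) for name, plans in groups.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical plans: plain callables standing in for suite executions.
groups = {
    "desktop": [lambda: "shop_de done", lambda: "shop_en done"],
    "headless": [lambda: "mybot done"],
}
results = run_all_groups(groups)
```

In this model, the two desktop plans never overlap, while the headless plan does not have to wait for either of them.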

Back to the dialog: The only explicit setting is the execution interval, which you set under Group execution interval.

Execution interval for execution groups.

Interval for the (parallel) execution of plan groups

Attention: The plans in the plan group naturally have a certain runtime themselves, determined by the timeout of a single execution and the maximum number of repeated executions in the event of failed tests. The execution interval of the plan group must therefore be greater than the sum of the maximum runtimes of all plans in the group. The maximum runtime of a plan is calculated as follows: Limit per attempt × (1 + Maximum number of re-executions).
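The rule above is simple arithmetic and can be checked quickly before you save the rule. The following is a minimal Python sketch; the function names and the example values are hypothetical and not part of Robotmk.

```python
def max_plan_runtime(limit_per_attempt: int, max_reexecutions: int) -> int:
    """Maximum runtime of one plan: limit per attempt x (1 + re-executions)."""
    return limit_per_attempt * (1 + max_reexecutions)


def min_group_runtime(plans: list[tuple[int, int]]) -> int:
    """Sum of the maximum runtimes of all plans in one plan group."""
    return sum(max_plan_runtime(limit, retries) for limit, retries in plans)


# Hypothetical group: (limit per attempt in seconds, maximum re-executions)
plans = [(60, 2), (120, 0)]
lower_bound = min_group_runtime(plans)  # 60 * 3 + 120 * 1 = 300 seconds

# The group execution interval must be greater than this lower bound.
interval_ok = 600 > lower_bound
```

With these example values, an execution interval of 10 minutes leaves enough room, whereas 5 minutes would be too short.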

Now it’s time to configure the first plan. You can enter any name under Application name. This name does not have to be unique! The name of the application to be monitored makes sense here, for example OnlineShop or here in the example simply MyApplication. Of course, it can happen that this online store is tested several times, either by other test suites or by the same test suite with different parameters. In such cases, the Variant field is used to achieve unambiguous results despite identical names. For example, if the application OnlineShop is tested once in German and once in English (via corresponding parameters), you could use corresponding abbreviations here. The monitoring will then return results for My OnlineShop_en and My OnlineShop_de.

The specification under Relative path to test suite file or folder, however, is mandatory. The path is relative to the base directory specified above, e.g. mybot\tests.robot for C:\robots\. Alternatively, a directory (with several robot files) can also be specified here, in which case it would simply be mybot.

Plan for the execution of suites

Continue with the Execution configuration. Under Timeout per attempt you define the maximum time a test suite may run — per attempt. With Robot Framework re-executions you can now instruct Robot Framework to repeat test suites completely or incrementally if tests fail. If the individual tests in a test suite are independent of each other, the incremental strategy is the best way to save time. If, on the other hand, the test suite tests a logical sequence, such as "Login → Call up product page → Product in shopping cart → Checkout", the test suite must of course be completely reprocessed. In the end, there is always only one result.

In the case of complete retries, only self-contained suite results are taken into account for the final result: If a test fails on the last retry, the test suite is counted as a failure. In the case of incremental retries, the final result is made up of the best partial results: If some tests only run successfully on the third attempt, the final result is also counted as a success. Reminder: The combination of attempts and maximum run times of all plans in a plan group determines their minimum execution interval.

Configuration of execution runtimes and repetitions.

Failed tests/suites can be repeated
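The difference between the two retry strategies can be illustrated with a small Python model. It is a simplified sketch of the behavior described above, not Robotmk's actual merging logic; the data layout (one dictionary of test name to PASS or FAIL per attempt) and the function names are assumptions made for this example.

```python
def final_result_complete(attempts: list[dict[str, str]]) -> bool:
    """Complete retries: only the last, self-contained attempt counts."""
    last_attempt = attempts[-1]
    return all(status == "PASS" for status in last_attempt.values())


def final_result_incremental(attempts: list[dict[str, str]]) -> bool:
    """Incremental retries: the best partial result of each test counts."""
    best: dict[str, str] = {}
    for attempt in attempts:
        for test, status in attempt.items():
            # A PASS always wins; a FAIL is only recorded for unseen tests.
            if status == "PASS" or test not in best:
                best[test] = status
    return all(status == "PASS" for status in best.values())


# Incremental: only the failed test is repeated until it passes.
incremental_attempts = [
    {"Login": "PASS", "Checkout": "FAIL"},
    {"Checkout": "FAIL"},
    {"Checkout": "PASS"},
]

# Complete: the whole suite is repeated and still fails on the last retry.
complete_attempts = [
    {"Login": "PASS", "Checkout": "FAIL"},
    {"Login": "PASS", "Checkout": "FAIL"},
]
```

In this model, the incremental run counts as a success because Checkout passed on the third attempt, while the complete run counts as a failure because its last attempt still contains a failed test.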

By default, execution via RCC is activated under Automated environment setup (via RCC), for which you must enter two values. Firstly, RCC requires the specification of where the robot.yaml file is located. Its primary purpose is to reference the conda.yaml file, which is responsible for setting up the Python environment, i.e. installing Python and dependencies. This specification is relative to the base directory that you have set above under Base directory of suites. The YAML files can be saved in subfolders, but the best practice is to keep them in the top-level suite directory. For the above base directory C:\robots\ and the suite directory C:\robots\mybot it is accordingly mybot\robot.yaml.
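How the relative paths in the rule combine with the base directory can be tried out with Python's pathlib. This is only an illustration of the path arithmetic using the example directory C:\robots from this article; the helper name resolve_in_base is hypothetical, and Robotmk itself may resolve paths differently in detail.

```python
from pathlib import PureWindowsPath


def resolve_in_base(base_dir: str, relative_path: str) -> PureWindowsPath:
    """Join a path from the rule onto the base directory of suites."""
    relative = PureWindowsPath(relative_path)
    if relative.anchor:
        # A drive letter or leading backslash would silently replace the base.
        raise ValueError(f"not a relative path: {relative_path}")
    return PureWindowsPath(base_dir) / relative


base_dir = r"C:\robots"
suite_file = resolve_in_base(base_dir, r"mybot\tests.robot")
robot_yaml = resolve_in_base(base_dir, r"mybot\robot.yaml")
```

PureWindowsPath works on any operating system, so the check can also be run on a workstation that is not the Windows test host.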

When setting the following time limit for building the Python environment, bear in mind that large amounts of data sometimes have to be downloaded and set up. The required browsers in particular quickly add up to several hundred megabytes, but only for the first run. RCC only rebuilds environments if the content of conda.yaml has changed.

Time limit for building virtual environments

Under Robot Framework parameters you have the possibility to use some of the command line parameters of Robot Framework (which are also displayed by the command robot --help). If you want to use additional parameters, the option Argument files will help. A file specified here can contain any robot parameters. Further information about the individual parameters can be found in the inline help.

For our example project, only the option Variables is activated and a variable MYVAR with the value My Value is set — remember the command Log ${MYVAR} at the top of the file tests.robot? This is the corresponding reference.

Some options of the robot command

At the end of the suite configuration, there are three largely self-explanatory options. Execute plan as a specific user allows Robotmk to be executed in the context of a specific user account. Background: By default, Robotmk is executed in the context of the Checkmk agent (LocalSystem account), which has no authorization to access the desktop. A user can be specified here who must be permanently logged in to a desktop session and has access to graphical desktop applications accordingly.

With Assign plan/test result to piggyback host the results of the plan/test can be assigned to another host. For example, if Robot Framework is testing the ordering process of an online store, the results can be assigned to the corresponding web server.

Each test run produces data that is stored under C:\ProgramData\checkmk\agent\robotmk_output\working\suites\. By default, the results of the last 14 days are retained, but you should bear in mind that large amounts of data can quickly pile up here. At least 500 kilobytes of data are generated per run; with more complex test suites and embedded screenshots, for example, this can quickly add up to several megabytes. Depending on the execution interval, the size of the report and your documentation requirements, you should adjust the retention period here.

Options for user context, host assignment and automatic cleanup

Automatic cleanup of the large amount of data generated
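To get a feeling for the data volume, the figures mentioned above (at least 500 kilobytes per run, 14 days of retention) can be turned into a rough estimate. The following Python sketch is only a back-of-the-envelope calculation for a single plan; the function name is hypothetical and 1 megabyte is taken as 1000 kilobytes.

```python
def estimated_storage_mb(
    interval_minutes: float,
    retention_days: int = 14,
    kb_per_run: float = 500,
) -> float:
    """Rough lower bound for the stored results of one plan, in megabytes."""
    runs_per_day = 24 * 60 / interval_minutes
    return runs_per_day * retention_days * kb_per_run / 1000


# One plan, executed every 10 minutes and kept for the default 14 days.
every_ten_minutes = estimated_storage_mb(10)

# The same plan with screenshots, assuming about 5 megabytes per run.
with_screenshots = estimated_storage_mb(10, kb_per_run=5000)
```

Even the minimal case adds up to roughly one gigabyte per plan, so shortening the retention period can be worthwhile for frequently executed plans.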

At this point, you can create further plans in this plan group or further plan groups.

At the end there are two more options, which in turn relate to the complete Robotmk scheduler configuration.

RCC profile configuration allows you to specify proxy servers and hosts to be excluded.

Grace period before scheduler starts can also be very useful: The scheduler starts together with the Checkmk agent before the desktop logon — which, of course, means that any tests on the desktop must fail. The start can be manually delayed using a lead time.

Options for proxy server and a lead time for the scheduler start.

A grace period prevents failures

This completes the configuration and you can bake a new agent with the plugin and then roll it out, manually or via the automatic agent updates.

Data in the agent output

The output in the agent is quite extensive: error messages, status, configuration and test data are transmitted in several sections. The latter can be found in the robotmk_suite_execution_report section, here is an abbreviated excerpt:

mysite-robot-host-agent.txt

<<<robotmk_suite_execution_report:sep(0)>>>
{
    "attempts": [
        {
            "index": 1,
            "outcome": "AllTestsPassed",
            "runtime": 20
        }
    ],
    "rebot": {
        "Ok": {
            "xml": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n
			<robot generator=\"Rebot 6.1.1 (Python 3.10.12 on win32)\"
			generated=\"20240319 16:23:19.944\"
			rpa=\"true\"
			schemaversion=\"4\">\r\n<suite id=\"s1\"
			name=\"Mybot\"
			source=\"C:\\robots\\mybot\">\r\n<suite id=\"s1-s1\"
			name=\"Tests\"
			source=\"C:\\robots\\mybot\\tests.robot\">\r\n<test id=\"s1-s1-t1\"
			name=\"Mytest\"
			line=\"6\">\r\n<kw
			name=\"Sleep\"
			library=\"BuiltIn\">\r\n<arg>3 Seconds</arg>\r\n<doc>Pauses the test executed for the given time.</doc>\r\n<msg
			timestamp=\"20240319 16:23:02.936\"
			level=\"INFO\">Slept 3 seconds</msg>\r\n<status
			status=\"PASS\"
			starttime=\"20240319 16:23:00.934\"
			endtime=\"20240319 16:23:02.936\"/>"
        }
    },
    "suite_id": "mybot",
    "timestamp": 1710861778
}
...
"html_base64":"PCFET0NUWVBFIGh0bWw+DQo8aHRtbCBsYW ...

Two areas are of particular interest here. Firstly, rebot: The rebot tool has produced the actual status report for Robot Framework from several partial results (hence re-bot). Secondly, the last line html_base64: The HTML reports from Robot Framework are transferred in base64-encoded form. Screenshots taken by the tests are also transferred in this way, so the data volume of the agent output can be correspondingly large.
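If you want to look at such a report outside of Checkmk, the value of html_base64 can be decoded with the Python standard library. The following is a minimal sketch; the helper name decode_report is hypothetical, and how you extract the value from the agent output is left open here because the excerpt above is abbreviated.

```python
import base64


def decode_report(html_base64: str) -> str:
    """Turn the base64-encoded report back into HTML text."""
    return base64.b64decode(html_base64).decode("utf-8")


# The first 32 characters of the excerpt above are already valid base64.
excerpt_start = decode_report("PCFET0NUWVBFIGh0bWw+DQo8aHRtbCBs")

# Round trip with a small stand-in document instead of a real report.
sample_html = "<!DOCTYPE html>\r\n<html><body>Report</body></html>"
encoded = base64.b64encode(sample_html.encode("utf-8")).decode("ascii")
decoded = decode_report(encoded)
```

The decoded text can be written to a file with the extension .html and opened in a browser.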

Data in monitoring

As soon as the Robotmk scheduler and the test suite have been run, the service discovery will produce three new services:

The newly discovered Robotmk services

The service RMK Scheduler Status exists once and immediately after deployment. The services for plans and tests, here RMK MyApplication_mybot Plan and RMK MyApplication_mybot Test: /Test: My Test, are added to the monitoring as soon as the associated suites have been run for the first time.

2.2. Configure service rules

Create rule for plan status

Reminder: Maximum runtimes for plans were defined in the agent rule above. These runtimes can be evaluated with the Robotmk plan rule. For example, you can set the service to CRIT when 90 percent of all calculated timeouts have been reached.

Configuration dialog for threshold values for runtimes of test suites.

Threshold values for status changes based on runtimes

In the Conditions area, there is the option of restricting the rule to certain plans.

Dialog with restriction to the test suite 'mybot'.

Optional restriction to certain plans

Create rule for test status

Additional data can also be evaluated for individual tests in the test suites via the Robotmk test rule. Here you will again find the option to monitor runtimes, both for tests and keywords. The monitoring of keywords is a Checkmk-specific function. The suite-internal status in the Robot Framework report can therefore be 'OK' because the test suite was processed within the maximum permitted runtime, while the service in Checkmk is WARN or CRIT because a status change takes place at, for example, 80 percent of this maximum permitted runtime.

In addition, the Enable metrics for high-level keywords option can be used to generate metrics for higher-level keywords. This is particularly useful if your tests are organized in such a way that the higher-level keywords describe the "what" and the lower-level keywords describe the "how" — this gives you more abstract evaluations.

In this example, the threshold values for the maximum runtime of a test are 2 and 4 seconds. You will see the effects below in the chapter Robotmk in monitoring.

Rule for monitoring keywords with example values.

Monitoring can be expanded using keyword metrics
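How the example thresholds of 2 and 4 seconds translate into a service state can be sketched in a few lines of Python. This is a simplified model, not the code of the Robotmk check; in particular, it is an assumption that a threshold counts as reached when the runtime is equal to or greater than it.

```python
def runtime_state(runtime: float, warn: float, crit: float) -> str:
    """Map a measured runtime in seconds to a monitoring state."""
    if runtime >= crit:
        return "CRIT"
    if runtime >= warn:
        return "WARN"
    return "OK"


# Thresholds from the example rule: WARN at 2 seconds, CRIT at 4 seconds.
state_for_demo_test = runtime_state(3.0, warn=2.0, crit=4.0)
```

A test that sleeps for 3 seconds, such as the demo test, therefore ends up in WARN even though Robot Framework reports it as passed.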

Once again, there is an explicit filter option in the Conditions area, here for individual tests.

Dialog with option to restrict to tests.

Optional restriction to certain tests

2.3. Robotmk in monitoring

In monitoring, you will find services for the status of the Robotmk scheduler as well as the individual plans and tests — even if you have not created any separate service rules.

Scheduler status

The service RMK Scheduler Status is OK if the scheduler is running and has successfully built the execution environments.

RCC was able to build the environments — in just one second

Here in the image you can see the note Environment build took 1 second. One second to build a virtual Python environment with Pip and Robot Framework? This is possible because RCC is clever: files that have already been downloaded are reused and a new build is only carried out after changes have been made in conda.yaml. The first build would have taken 30 seconds or more.
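The principle of rebuilding only after a change can be illustrated with a content hash. The following Python sketch shows the general caching technique only; it is not RCC's actual implementation, and the class and function names are hypothetical.

```python
import hashlib


def environment_key(conda_yaml_text: str) -> str:
    """Derive a cache key from the content of the environment definition."""
    return hashlib.sha256(conda_yaml_text.encode("utf-8")).hexdigest()


class EnvironmentCache:
    """Remembers which environment definitions have already been built."""

    def __init__(self) -> None:
        self._built: set[str] = set()
        self.build_count = 0

    def ensure(self, conda_yaml_text: str) -> str:
        key = environment_key(conda_yaml_text)
        if key not in self._built:
            # Stand-in for the expensive download and build step.
            self.build_count += 1
            self._built.add(key)
        return key


cache = EnvironmentCache()
definition = "dependencies:\n  - python=3.10.12\n"
cache.ensure(definition)
cache.ensure(definition)  # unchanged content: no second build
builds_after_repeat = cache.build_count
cache.ensure(definition + "  - pip=23.2.1\n")  # changed content: new build
builds_after_change = cache.build_count
```

In this model, the second call with identical content returns immediately, which corresponds to the very short build time shown in the service output.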

Plan status

The status of a plan is reflected in a service named after the application name and suite, for example RMK MyApplication_mybot Plan.

The execution of a plan — especially relevant for administrators

Test status

The evaluation of the tests is where it gets really interesting. In the image you can now see the effect of the threshold values set above for the runtime of tests, here the 2 seconds for the WARN status. As the Sleep 3 Seconds instruction in the test itself already ensures a longer runtime, this service must go to WARN, although the test itself was successful. The fact that the test was successful is shown by the Robot Framework report, which you can access via the log icon.

Results of a specific suite — especially relevant for developers

The report now clearly shows that the test and test suite have run successfully.

Robot framework report for 'Mybot' test suite.

The Robot Framework log, here in optional dark mode

At the bottom of the data you can also see the individual keywords, here for example Log ${MYVAR} together with the value My value set in Checkmk for MYVAR.

Robot framework report at keyword level.

The log file can be expanded down to the smallest details

Dashboards

Of course, you can build your own dashboards as usual — but you can also find two built-in overviews under Monitor > Synthetic Monitoring.

The complete Checkmk Synthetic Monitoring at a glance (shortened)

3. Troubleshooting

3.1. Scheduler reports `No Data`

If the scheduler does not receive any data, building the environment probably did not work. A common cause is network problems that prevent certain dependencies from being downloaded. In this case, take a look at the corresponding log file under C:\ProgramData\checkmk\agent\robotmk_output\working\environment_building.

3.2. Environment building fails: `post-install script execution`

This is a particularly interesting error that you might encounter on fresh Windows systems. The conda.yaml can also contain instructions that are to be executed after the installation of the dependencies, for example the initialization of the Robot Framework browser. These instructions call Python commands. By default, however, Windows 11 has aliases for python.exe and python3.exe that refer to the Microsoft Store. You must deactivate these aliases under 'Settings/Aliases for app execution'.
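Whether a python.exe found on the search path is such an alias can be checked with a short Python sketch. It rests on the assumption that the Store aliases are located in a folder named WindowsApps within the user profile; the function name and the example paths are hypothetical.

```python
from pathlib import PureWindowsPath


def is_store_alias(executable_path: str) -> bool:
    """Guess whether a path points to a Microsoft Store alias."""
    parts = [part.lower() for part in PureWindowsPath(executable_path).parts]
    # Assumption: the aliases are stubs inside a 'WindowsApps' directory.
    return "windowsapps" in parts


alias_path = r"C:\Users\robot\AppData\Local\Microsoft\WindowsApps\python.exe"
real_path = r"C:\Python310\python.exe"
alias_detected = is_store_alias(alias_path)
real_detected = is_store_alias(real_path)
```

On the test host, you could pass the result of shutil.which("python") to this function to see which executable a plain python call would reach.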

4. Files and directories

Path	Content
`C:\ProgramData\checkmk\agent\robotmk_output\working\suites\`	Logs and results of the suites
`C:\ProgramData\checkmk\agent\robotmk_output\working\environment_building`	Logs for building virtual environments
`C:\ProgramData\checkmk\agent\robotmk_output\working\rcc_setup`	Messages of the RCC execution
`C:\ProgramData\checkmk\agent\logs\robotmk_scheduler_rCURRENT.log`	Log of the agent plugin
`C:\ProgramData\checkmk\agent\bin\`	`rcc.exe` and `robotmk_scheduler.exe`
`C:\ProgramData\checkmk\agent\plugins\`	agent plugin `robotmk_agent_plugin.exe`
