Fine-Tuning The Monitoring
We have seen some Checkmk beginners add many systems to the monitoring in a short space of time, perhaps because it is so easy to do in Checkmk.
When they then shortly afterwards activated the notifications for all users, their
colleagues were flooded with hundreds of emails per day, and after only a few days their
enthusiasm for monitoring was effectively destroyed.
Even if Checkmk makes a real effort to define appropriate and sound default values for all
possible settings, it simply cannot know precisely enough how things should be in your IT
environment under normal conditions. Therefore, a bit of manual work is required on
your part to fine-tune the monitoring until even the last false alarm has been eliminated. Apart
from that, Checkmk will also find quite a few real problems that you and your colleagues
have not yet suspected. These, too, must first be properly remedied — in reality, not in the
monitoring!
The following principle has proven itself: first quality, then quantity. Or in other words:
1. Make sure that all services that do not really have a problem are reliably OK.
2. Activate notifications by email or SMS only after Checkmk has been running reliably for a while with no or very few false alarms.
Note: Strictly speaking, false alarms can of course only occur once the notification function has been switched on. What we need to address here is therefore the preliminary stage of notifications: avoiding the states DOWN, WARN or CRIT for problems that are not actually critical.
In the following chapters on configuration, we will show you what fine-tuning options you
have — so that everything that does not cause problems will be green — and how to get
any occasional drop-outs under control.
2. Rules-based configuration
Before we start configuring, we must first briefly look at the settings of hosts and services
in Checkmk. Since Checkmk was developed for large and complex environments, this is
done using rules. This concept is very powerful and brings many advantages to smaller
environments as well.
The basic idea is that you don’t explicitly specify every single parameter for every service,
but implement something like: 'On all production Oracle servers, file systems with the
prefix /var/ora at 90% filled will be WARN and at 95% will be CRIT'.
Such a rule can set thresholds for thousands of file systems with a single action. At the
same time, it also documents very clearly which monitoring policies apply in your
company.
Based on a basic rule, you can then define exceptions for individual cases separately. A
suitable rule might look like this: 'On the Oracle server srvora123 , the file system
/var/ora/db01 at 96% filled will be WARN and at 98% will be CRIT'. This exception
rule is set in Checkmk in the same way as the basic rule.
Each rule has the same structure. It always consists of a condition and a value. You can
also add a description and a comment to document the purpose of the rule.
The rules are organised in rule sets. For each type of parameter, Checkmk has a suitable
rule set ready, so you can choose from several hundred rule sets. For example, there is one
called Filesystems (used space and growth) that sets the thresholds for all services that
monitor filesystems. To implement the above example, you would set the basic rule and
the exception rule from this rule set. To determine which thresholds are valid for a
particular file system, Checkmk goes in sequence through all of the rules valid for the
check. The first rule for which the condition applies sets the value — in this case, the
percentage value at which the file system check becomes WARN or CRIT.
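If you want to picture how this first-match evaluation works, the following small Python sketch may help. It is not Checkmk code: the rule format, host names and tags are invented for illustration, and only the fallback levels of 80% and 90% correspond to Checkmk's defaults for file systems.

# Conceptual sketch, not Checkmk's internal code: rules are checked in order,
# and the first rule whose condition matches supplies the value.
# Hosts, tags and the rule format here are invented for illustration only.
RULES = [
    # Exception rule first: one specific file system on one host
    ({"host": "srvora123", "mount": "/var/ora/db01"}, (96.0, 98.0)),
    # Basic rule: all file systems below /var/ora on production Oracle servers
    ({"tag": "prod-oracle", "mount_prefix": "/var/ora"}, (90.0, 95.0)),
]

DEFAULT_LEVELS = (80.0, 90.0)  # Checkmk's default WARN/CRIT levels for file systems


def thresholds_for(host: str, tags: set[str], mount: str) -> tuple[float, float]:
    """Return the (WARN, CRIT) levels of the first matching rule."""
    for condition, levels in RULES:
        if condition.get("host") not in (None, host):
            continue
        if condition.get("mount") not in (None, mount):
            continue
        if "tag" in condition and condition["tag"] not in tags:
            continue
        if "mount_prefix" in condition and not mount.startswith(condition["mount_prefix"]):
            continue
        return levels          # first match wins, later rules are ignored
    return DEFAULT_LEVELS      # no rule matched


print(thresholds_for("srvora123", {"prod-oracle"}, "/var/ora/db01"))  # (96.0, 98.0)
print(thresholds_for("srvora042", {"prod-oracle"}, "/var/ora/db02"))  # (90.0, 95.0)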
3. Finding rules
You have various options for accessing the rule sets in Checkmk.
On the one hand, you can find the rule sets in the Setup menu under the topics of the
objects for which there are rule sets (Hosts, Services and Agents) in different categories.
For services, there are the following rule set entries: Service monitoring rules, Discovery
rules, Enforced services, HTTP, TCP, Email, … and Other services. If you select one of
these entries, the associated rule sets will be listed on the main page. Depending on the entry, this may be only a handful of rule sets or a great many, as with the Service monitoring rules. You can therefore filter the results page using the Filter field in the menu bar.
If you are unsure in which category the rule set can be found, you can also search through
all rules in one go, either by using the search field in the setup menu or by opening the
rule search page via Setup > General > Rule search. We will take the latter route in the
following chapter, in which we will introduce the process of rule creation.
With the large number of rule sets available, it is not always easy to find the right one,
with or without a search. However, there is another way that you can access the
appropriate rules for an existing service. In a view that includes the service, open the service's menu and select the Parameters for this service entry:
You will receive a page from which you can access all of the rule sets for this service:
In the first box entitled Check origin and parameters, the Filesystems (used space and
growth) entry takes you directly to the set of rules for the file system monitoring
thresholds. However, you can see in the overview that Checkmk has already set default
values, so you only need to create a rule if you want to modify those defaults.
4. Creating rules
What does a rule look like in practice? The best way to start is to formulate the rule you
want to implement in a sentence, like this: 'On all production Oracle servers, tablespaces
DW20 and DW30 will be WARN at a 90% fill level and CRIT at 95%'.
You can then search for an appropriate rule set — in this example via the rule search:
Setup > General > Rule search. This opens a page in which you can search for 'Oracle' or
for 'tablespace' (case-insensitive) and find all of the rule sets that contain this text in their
name or in their description (not shown here):
The rule set Oracle Tablespaces is found in two categories. The number following the title
(here 0 in each case) shows the number of rules that have already been created from this
rule set. If you click on the name in the Service monitoring rules category, you will land in
the overview page for this rule set:
This rule set does not yet contain any rules. You can create the first one with the Create
rule in folder button. In doing so, you will already be defining the first part of the rule’s
condition, namely in which folder it is to apply. If you change the Main directory default
setting to Windows, for example, the new rule only applies to hosts that are directly in or
are below the Windows folder.
Creating — and later editing — this rule opens an input form with three boxes: Rule
Properties, Value and Conditions. We will look at each of these three in turn.
In the Rule Properties box, all entries are optional. In addition to the informative texts,
here you also have the option of temporarily deactivating a rule. This is practical because
you can sometimes avoid deleting and recreating a rule if you temporarily do not need it.
What you find in the Value box depends in each specific case on the content of what is
being regulated:
As you can see, this can be quite a number of parameters. The example shows a typical
case — each individual parameter can be activated by a checkbox, and the rule will then
only apply to this parameter. You can, for example, let another parameter be determined
by a different rule if that simplifies your configuration. In this example, only the threshold
values for the percentage of free space in the tablespace will be defined.
The Conditions box looks a little more confusing at first glance:
In this example we will only go into the parameters that we absolutely need for defining this
rule:
You have already selected the folder when creating the rule, but here you can change it
again if required.
The Host tags are a very important feature in Checkmk, so we will be devoting a separate
chapter to them right after this chapter. At this point, you use one of the predefined host
tags to specify that the rule should only apply to production systems. First select the host
tag group Criticality from the list and then click Add tag condition and select the value
Productive system.
Very important in this example are the Explicit Tablespaces, which restrict the rule to very
specific services. Two points are important here:
The name of this condition adapts to the rule type. If it says Explicit Services, specify
the names of the services concerned. For example, one such could be Tablespace
DW20 — that is, including the word Tablespace . In the example shown, however,
Checkmk only wants to know the name of the tablespace itself, thus DW20 .
The entered texts are always matched against the beginning of the name. Entering DW20 therefore also matches a fictitious tablespace DW20A . If you want to prevent this, append the $ character to the end, i.e. DW20$ , because these entries are so-called regular expressions (see the sketch below).
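If you would like to verify this matching behaviour yourself, the following short Python snippet reproduces it with the standard re module. It is only an illustration of regular expressions anchored at the beginning, not Checkmk code.

import re

# Conditions on service or item names behave like regular expressions that are
# matched against the beginning of the name.
print(bool(re.match("DW20", "DW20A")))    # True:  "DW20" also matches "DW20A"
print(bool(re.match("DW20$", "DW20A")))   # False: "$" anchors the match at the end
print(bool(re.match("DW20$", "DW20")))    # True:  only the exact name matches now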
Note: A detailed description of all of the other parameters and a detailed explanation of
the important concept of rules can be found in the article on rules. By the way, you can
find out more about the service labels, the last parameter in the image above, in the article
on labels.
After all entries for the definition are complete, save the rule with Save. After saving, there
will be exactly one new rule in the rule set:
Tip: If you later work with hundreds of rules rather than just one, there is a danger of losing the overall view. To help you keep that overview, Checkmk provides very helpful entries in the Related menu on every page that lists rules. With these you can display the rules used in the current site (Used rulesets) and, similarly, those that are not used at all (Ineffective rules).
5. Host tags
In the previous chapter we saw an example of a rule that should only apply to production
systems. More specifically, in that rule we defined a condition using the Productive
system host tag. Why did we define the condition as a tag and not simply set it for the
folder? Well, you can only define a single folder structure, and each host can only be in
one folder. But a host can have many different tags, and the folder structure is simply too
limited and not flexible enough for that.
In contrast, you can assign host tags to the hosts as freely and arbitrarily as you like —
regardless of the folder in which the hosts are located. You can then refer to these tags in
your rules. This makes the configuration not only simpler, but also easier to understand
and less error-prone than if you were to define everything explicitly for each host.
But how and where do you define which hosts should have which tag? And how can you
define your own customised tags?
For the definition of host tag groups, see the Setup menu: Setup > Hosts > Tags:
As can be seen, some tag groups have already been predefined. You cannot change most
of these. We also recommend that you do not touch the two predefined example groups
Criticality and Networking Segment. It is better to define your own groups:
Click Add tag group. This opens the page for creating a new tag group. In the first box
Basic settings you assign — as so often in Checkmk — an internal ID that serves as a key
and which cannot be changed later. In addition to the ID, you define a descriptive title,
which you can change at any later time. With Topic you can determine where the tag will
be offered later in the properties for the host. If you create a new topic here, the tag will be
displayed in a separate box in the host properties.
The second box Tag choices is about the actual tags, i.e. the selection options in the group.
Click Add tag choice to create a tag and assign an internal ID and a title for each tag:
Notes:
Groups with only one selection are also allowed and can even be useful. These then
appear as checkboxes. Each host will then have the tag — or not.
At this point, you can ignore the auxiliary tags for now. You can get all the
information on auxiliary tags in particular and on host tags in general in the article
on rules.
Once you have saved this new host tags group with Save, you can start using it.
Now that you have learned the important principles of configuration with rules and host
tags, in the remaining chapters we would like to give you some practical guidelines on
how to reduce false alarms in a new Checkmk system.
By default, Checkmk uses the thresholds 80% for WARN and 90% for CRIT for the fill level of file systems. On a 2 TByte hard disk, 80% used still leaves 400 GByte free, which is perhaps a rather large buffer for a warning. So here are a few tips on the subject of file systems:
Create your own rules in the Filesystems (used space and growth) rule set.
The parameters allow thresholds that depend on the size of the file system. To do this, select Levels for filesystems > Levels for filesystem used space > Dynamic levels. With the Add new element button you can then define your own threshold values per disk size (a sketch of the idea follows after these tips).
It is even easier with the magic factor, which we will introduce in the final chapter.
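To illustrate the idea behind such dynamic levels, here is a small Python sketch. The size brackets and percentages are invented for the example, and the data structure is not the one Checkmk uses internally.

# Hypothetical sketch of size-dependent levels, not Checkmk's implementation.
# Each entry: (minimum file system size in GB, (WARN %, CRIT %)).
DYNAMIC_LEVELS = [
    (1000.0, (92.0, 96.0)),   # 1 TB and larger: warn late, there is plenty of room
    (100.0,  (87.0, 93.0)),   # 100 GB and larger
    (0.0,    (80.0, 90.0)),   # everything smaller: the usual defaults
]


def levels_for(size_gb: float) -> tuple[float, float]:
    """Return the WARN/CRIT percentages of the first (largest) matching bracket."""
    for min_size, levels in DYNAMIC_LEVELS:   # ordered from large to small
        if size_gb >= min_size:
            return levels
    return DYNAMIC_LEVELS[-1][1]


print(levels_for(2000.0))  # (92.0, 96.0) for a 2 TB file system
print(levels_for(50.0))    # (80.0, 90.0) for a small one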
In the Checkmk Raw Edition you first define a time period that covers the
times of the reboot. You can find out how to do this in the article on time periods. Then
create a rule in each of the rule sets Notification period for hosts and Notification period
for services for the affected hosts and select the previously-defined time period there. The
second rule for the services is necessary so that any services that go to CRIT during this
time do not trigger a notification. If problems occur within this time frame — and are also
resolved within the same time frame — no notification will be triggered.
One option is to use scheduled downtimes for this purpose, which you can set for any affected hosts.
Tip: An alternative to creating downtimes for hosts, which we have already described in
the chapter on scheduled downtimes, is the Recurring downtimes for hosts rule set in the
Enterprise Editions. This has the great advantage that hosts that are added to the
monitoring later automatically receive these scheduled downtimes.
You can tell Checkmk that it is perfectly OK for a host to be powered off. To do this, find
the Host check command rule set and set its value to Always assume host to be up:
In the Conditions box, make sure that this rule is really only applied to the appropriate
hosts — depending on the structure you have chosen. For instance, you can define a host
tag and use it here, or you can set the rule for a folder in which all of the printers are
located.
Now, all printers will always be displayed as UP — no matter what their actual status is.
However, the services of the printer will continue to be checked and any timeout would
result in a CRIT state. To avoid this as well, configure a rule for the affected hosts in the
Status of the Checkmk services rule set, in which you set timeouts and connection
problems to OK respectively:
9. Configuring switchports
If you monitor a switch with Checkmk, you will notice that during the service
configuration, a service is automatically created for each port that is UP at the time. This
is a sensible default setting for core and distribution switches — i.e. those to which only
infrastructure devices or servers are connected. However, for switches to which end
devices such as workstations or printers are connected, this leads on the one hand to
continuous notifications if a port goes DOWN, and on the other hand to new services
being continuously found because a previously unmonitored port goes UP.
Two approaches have proven successful for such situations. Firstly, you can restrict the
monitoring to the uplink ports. To do this, create a rule in the Disabled services rule set that excludes the other ports from monitoring.
However, the second method is much more interesting. With this method you monitor all
ports, but allow DOWN to be a valid state. The advantage is that you will have
transmission-error monitoring even for ports to which end devices are connected and can
thus very quickly detect bad patch cables or errors in auto-negotiation. To implement this
function, you need two rules:
The first rule set, Network interface and switch port discovery, defines the conditions
under which switch ports are to be monitored. Create a rule for the desired switches and
select whether individual interfaces (Configure discovery of single interfaces), or groups
(Configure grouping of interfaces) are to be discovered. Then, under Conditions for this
rule to apply > Match port states, activate 2 - down in addition to 1 - up:
In the service configuration of the switches, the ports with the DOWN state will now also
be presented, and you can add these to the list of monitored services. Before you activate
the change, you will still need the second rule that ensures that this state is evaluated as
OK.
This rule set is called Network interfaces and switch ports. Create a new rule, activate the Operational state option, deactivate Ignore the operational state below it, and then activate the 1 - up and 2 - down states for the Allowed operational states (and any other states as may be required).
It is much better to define rules according to which specific services will systematically
not be monitored. For this purpose there is the Disabled services rule set. Here you can,
for example, create a rule according to which file systems with the /var/test mount
point are by definition not to be monitored.
Tip: If you disable an individual service in the service configuration of a host by clicking on the corresponding icon, a rule is automatically created for the host in this very rule set. You can edit this rule manually and, for example, remove the explicit host name. The affected service will then be disabled on all hosts.
You can read more information about this in the article on configuring services.
Some measured values fluctuate strongly, and brief peaks above a threshold are often not a real problem. For this reason, quite a number of check plug-ins offer the option of averaging their metrics over a longer period of time before the thresholds are applied. An example of this is the rule set for CPU utilisation on non-Unix systems, called CPU utilization for simple devices, which provides the Averaging for total CPU utilization parameter:
If you activate this and enter 15 , the CPU load will first be averaged over a period of 15
minutes and only afterwards will the threshold values be applied to this average value.
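The effect of such averaging can be sketched in a few lines of Python. The window size, sample values and thresholds below are made up for the example; the code is not part of Checkmk.

from collections import deque

# Illustrative sketch, not Checkmk code: average the last N samples before
# comparing against the WARN/CRIT thresholds, so short peaks do not alert.
WINDOW = 15            # e.g. 15 one-minute samples, roughly 15 minutes of averaging
WARN, CRIT = 80.0, 90.0

samples: deque[float] = deque(maxlen=WINDOW)


def check(cpu_util: float) -> str:
    """Add a sample, then rate the average of the window instead of the raw value."""
    samples.append(cpu_util)
    avg = sum(samples) / len(samples)
    if avg >= CRIT:
        return "CRIT"
    if avg >= WARN:
        return "WARN"
    return "OK"


for value in [35.0, 40.0, 98.0, 99.0, 42.0, 38.0]:    # a short two-sample peak
    print(f"raw={value:5.1f}  state={check(value)}")  # stays OK despite the peak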
If you create a rule in the Maximum number of check attempts for service rule set and set its value to, say, 3 , a service that goes from OK to WARN, for example, will not yet trigger a notification and will not yet be displayed as a problem in the Overview. The intermediate state the service is now in is called a 'soft state'. Only when the state remains non-OK for three consecutive checks (with the default one-minute check interval, a total duration of just over two minutes) will a persistent problem be reported. Only such a hard state will trigger a notification.
This is admittedly not an attractive solution. You should always try to get to the root of
any problem, but sometimes things are just the way they are, and with the 'check
attempts' you at least have a viable way around such situations.
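As a rough illustration of the soft and hard state logic described above, here is a short Python sketch. It is a simplification, not how Checkmk implements the behaviour.

# Simplified sketch of soft and hard states with a maximum of 3 check attempts.
# Not Checkmk code: a notification is only triggered when a non-OK state has
# persisted for MAX_ATTEMPTS consecutive checks and thus becomes "hard".
MAX_ATTEMPTS = 3


def replay(results: list[str]) -> None:
    attempts = 0
    for state in results:
        attempts = attempts + 1 if state != "OK" else 0
        hard = state == "OK" or attempts >= MAX_ATTEMPTS
        notify = state != "OK" and attempts == MAX_ATTEMPTS
        print(f"{state:4}  {'hard' if hard else 'soft'} state  notify={notify}")


# The first two WARN results stay "soft"; the third becomes "hard" and notifies.
replay(["OK", "WARN", "WARN", "WARN", "WARN", "OK"])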
By default, the Check_MK Discovery service checks every two hours whether new (not yet monitored) services have been found or whether existing services have disappeared. If this is the case, the discovery service goes to WARN. You can then call up the service configuration (on the Services of host page) and bring the list of services back up to date.
Detailed information on this 'Discovery Check' can be found in the article on configuring
services. There you can also learn how you can have unmonitored services added
automatically, which makes the work in a large configuration much easier.
Tip: With Monitor > Analyze > Unmonitored services you can call up a view that shows
you any new or dropped services.