Mastering Zabbix - Second Edition - Sample Chapter

Chapter No. 6: Managing Alerts. Learn how to monitor your large IT environments using Zabbix with this one-stop, comprehensive guide to the Zabbix world. For more information: http://bit.ly/1MoBmDH


Mastering Zabbix
Second Edition

Andrea Dalle Vacche

Learn how to monitor your large IT environments using Zabbix with this
one-stop, comprehensive guide to the Zabbix world.

Nowadays, monitoring systems play a crucial role in any IT
environment. They are extensively used to not only measure
your system's performance, but also to forecast capacity
issues. This is where Zabbix, one of the most popular
monitoring solutions for networks and applications, comes
into the picture.
This new edition will provide you with all the knowledge you
need to make strategic and practical decisions about the
Zabbix monitoring system. The setup you'll do with this book
will fit your environment and monitoring needs like a glove.
You will be guided through the initial steps of choosing the
correct size and configuration for your system, to what to
monitor and how to implement your own custom monitoring
component. Exporting and integrating your data with other
systems is also covered.
By the end of this book, you will have a tailor-made and
well-configured monitoring system and will understand with
absolute clarity how crucial it is to your IT environment.

Who this book is written for

This book is intended for system administrators and IT architects
who need to better integrate their Zabbix installation with their
surrounding environment. A basic working knowledge of Zabbix and
Linux is assumed so that the book can focus on how to use every
component to its full potential.

What you will learn from this book

Efficiently collect data from a large variety of monitoring objects
Organize your data in graphs, charts, maps, and slide shows
Write your own custom probes and monitoring scripts to extend Zabbix
Configure Zabbix and its database to be highly available and fault-tolerant
Automate repetitive procedures using Zabbix's API
Integrate Zabbix with external systems
Understand the protocol and how to interact with it by writing your own custom agent
Build intelligent triggers and alarms to monitor your network proactively

$49.99 US / £31.99 UK
Prices do not include local sales tax or VAT where applicable.

Visit www.PacktPub.com for books, eBooks, code, downloads, and PacktLib.

In this package, you will find:

The author biography
A preview chapter from the book, Chapter 6, 'Managing Alerts'
A synopsis of the book's content
More information on Mastering Zabbix Second Edition

About the Author


Andrea Dalle Vacche is a highly skilled IT professional with over 15 years of
industry experience.

He graduated from Università degli Studi di Ferrara with an information technology
certification. This laid the technology foundation that Andrea has built on ever since.
He has acquired various other industry-respected accreditations from big players
in the IT industry, which include Cisco, Oracle, ITIL, and of course, Zabbix. He also
has a Red Hat Certified Engineer certification. Throughout his career, he has worked
on many large-scale environments, often in very complex roles, on a consultancy
basis. This has further enhanced his growing skillset, adding to his practical
knowledge base and cementing his appetite for theoretical technical study.
Andrea's love for Zabbix came from the time he spent in the Oracle world as a
database administrator/developer. His time was mainly spent on reducing "ownership
costs" with specialization in monitoring and automation. This is where he came
across Zabbix and the technical and administrative flexibility that it offered. With
this as a launch pad, Andrea was inspired to develop Orabbix, the first piece of open
source software to monitor Oracle that is completely integrated with Zabbix. He has
published a number of articles on Zabbix-related software, such as DBforBIX. His
projects are publicly available on his website at http://www.smartmarmot.com.
Currently, Andrea is working as a senior architect for a leading global investment
bank in a very diverse and challenging environment. His involvement is very wide
ranging, and he deals with many critical aspects of the Unix/Linux platforms and
pays due diligence to the many different types of third-party software that are
strategically aligned to the bank's technical roadmap.
Andrea also plays a critical role within the extended management team for the
security awareness of the bank, dealing with disciplines such as security, secrecy,
standardization, auditing, regulator requirements, and security-oriented solutions.
In addition to this book, he has also authored the following books:

Mastering Zabbix, Packt Publishing

Zabbix Network Monitoring Essentials, Packt Publishing

Preface
Ever since its first public release in 2001, Zabbix has distinguished itself as a very
powerful and effective monitoring solution. As an open source product, it's easy to
obtain and deploy, and its unique approach to metrics and alarms has helped to set
it apart from its competitors, both open and commercial. It's a powerful, compact
package with very low requirements in terms of hardware and supporting software
for a basic yet effective installation. If you add its relative ease of use, it's clear that it
can be a very good contender for small environments with a tight budget. But it's
when it comes to managing a huge number of monitored objects, with complex
configurations and dependencies, that Zabbix's scalability and inherently distributed
architecture really shine. More than anything, Zabbix can be an ideal solution
in large and complex distributed environments, where being able to manage
efficiently and extract meaningful information from monitored objects and events
is just as important, if not more important, than the usual considerations about costs,
accessibility, and ease of use.
This is a second edition book, the first edition having been coauthored by Andrea Dalle
Vacche and Stefano Kewan Lee. The purpose of this book is to help you make the
most of your Zabbix installation to leverage all of its power to monitor any large and
complex environment effectively.

What this book covers


Chapter 1, Deploying Zabbix, focuses on choosing the optimal hardware and software
configuration for the Zabbix server and database in relation to the current IT
infrastructure, monitoring goals, and possible evolution. This chapter also includes
a section that covers an interesting database-sizing digression, which is useful in
calculating the final database size using a standard environment as the baseline.
Correct environment sizing and a brief discussion about metrics and measurements
that can also be used for capacity planning will be covered here. The chapter contains
practical examples and calculations framed in a theoretical approach to give the
reader the skills required to adapt the information to real-world deployments.
Chapter 2, Distributed Monitoring, explores various Zabbix components both on the server
side and the agent side. Different distributed solutions will be given to the same example
networks to highlight the advantages and possible drawbacks of each. In addition to
the deployment and configuration of agents, the chapter takes proxies, maintenance,
and change management into account too. This section will cover all the possible
architectural implementations of Zabbix and add the pros and cons considerations.
Chapter 3, High Availability and Failover, covers the subjects of high availability and
failover. For each of the three main Zabbix tiers, you will learn to choose among different
HA options. The discussion will build on the information provided in the previous
two chapters in order to end the first part of the book with a few complete deployment
scenarios that will include high-availability servers and databases hierarchically
organized in tiered, distributed architectures geared toward monitoring thousands of
objects scattered in different geographical locations. This chapter will include a real-world, practical example and certain possible scenarios that have been implemented.
Chapter 4, Collecting Data, moves beyond simple agent items and SNMP queries
to tackle a few complex data sources. The chapter will explore powerful Zabbix
built-in functionalities, how to use them, and how to choose the best metrics to
ensure thorough monitoring without overloading the system. There will also be
special considerations about aggregated values and their use in monitoring complex
environments with clusters or the more complex grid architectures.
Chapter 5, Visualizing Data, focuses on getting the most out of the data visualization
features of Zabbix. This is quite a useful chapter, especially if you need to explain
or justify a hardware expansion/improvement to the business unit. You will learn
how to leverage live monitoring data to make dynamic maps and how to organize
a collection of graphs for big-screen visualization in control centers and implement
a general qualitative view. This chapter will cover the data center quality view slide
show completely, which is really useful in highlighting problems and warning the
first-level support in a proactive approach. The chapter will also explore some best
practices concerning the IT services and SLA-reporting features of Zabbix.

Chapter 6, Managing Alerts, gives examples of complex triggers and trigger conditions
as well as advice on choosing the right amount of trigger and alerting actions. The
purpose is to help you walk the fine line between being blind to possible problems
and being overwhelmed by false positives. You will also learn how to use actions
to automatically fix simple problems, raise actions without the need for human
intervention to correlate different triggers and events, and tie escalations to your
operations management workflow. This section will make you aware of what
can be automated, reducing your administrative workload and optimizing the
administration process in a proactive way.
Chapter 7, Managing Templates, offers guidelines for effective template management:
building complex template schemes out of simple components, understanding and
managing the effects of template modification, maintaining existing monitored
objects, and assigning templates to discovered hosts. This will conclude the second
part of the book that is dedicated to the different Zabbix monitoring and data
management options. The third and final part will discuss Zabbix's interaction with
external products and all its powerful extensibility features.
Chapter 8, Handling External Scripts, helps you learn how to write scripts to monitor
objects that are not covered by the core Zabbix features. The relative advantages and
disadvantages of keeping the scripts on the server side or agent side, how to launch
or schedule them, and a detailed analysis of the Zabbix agent protocol will also be
covered. This chapter will make you aware of all the possible side effects, delays,
and load caused by scripts; you will be able to implement all the needed external
checks, as you will be well aware of all that is connected with them and the relative
observer effect. The chapter will include different implementations of working with
Bash, Java, and Python so that you can easily write your own scripts to extend and
enhance Zabbix's monitoring possibilities.
Chapter 9, Extending Zabbix, delves into the Zabbix API and how to use it to build
specialized frontends and complex extensions. It also covers how to harvest
monitoring data for further elaboration and reporting. It will include simple example
implementations written in Python that will illustrate how to export and further
manipulate data, how to perform massive and complex operations on monitored
objects, and finally, how to automate different management aspects such as user
creation and configuration, trigger activation, and the like.
Chapter 10, Integrating Zabbix, wraps things up by discussing how to make other
systems know about Zabbix and the other way around. This is key to the successful
management of any large and complex environment. You will learn how to use
built-in Zabbix features, API calls, or direct database queries to communicate with
different upstream and downstream systems and applications. There will be concrete
examples of possible interaction with inventory applications, trouble ticket systems,
and data warehouse systems.

Managing Alerts
Checking conditions and alarms is the most characteristic function of any monitoring
system, and Zabbix is no exception. What really sets Zabbix apart is that every alarm
condition or trigger (as it is known in this system) can be tied not only to a single
measurement, but also to an arbitrary complex calculation based on all of the data
available to the Zabbix server. Furthermore, just as triggers are independent from
items, the actions that the server can take based on the trigger status are independent
from the individual trigger, as you will see in the subsequent sections.
In this chapter, you will learn the following things about triggers and actions:

Creating complex, intelligent triggers

Minimizing the possibility of false positives

Setting up Zabbix to take automatic actions based on the trigger status

Relying on escalating actions

An efficient, correct, and comprehensive alerting configuration is a key to the success
of a monitoring system. It's based on extensive data collection, as discussed in
Chapter 4, Collecting Data, and eventually leads to managing messages, recipients,
and delivery media, as we'll see later in the chapter. But all this revolves around the
conditions defined for the checks, and this is the main business of triggers.


Understanding trigger expressions


Triggers are quite simple to create and configure: choose a name and a severity,
define a simple expression using the expression form, and you are done. The
expression form, accessible through the Add button, lets you choose an item, a
function to perform on the item's data, and some additional parameters, and gives an
output as shown in the following screenshot:

You can see how there's a complete item key specification, not just the name, to
which a function is applied. The result is then compared to a constant using a
greater than operator. The syntax for referencing item keys is very similar to that for
a calculated item. In addition to this basic way of referring to item values, triggers
also add a comparison operator that wraps all the calculations up to a Boolean
expression. This is the one great unifier of all triggers; no matter how complex the
expression, it must always return either a True value or a False value. This value
is, of course, directly related to the state of a trigger, which can only be OK if the
expression evaluates to False, or PROBLEM if the expression evaluates to True. There
are no intermediate or soft states for triggers.
A trigger can also be in an UNKNOWN state if it's impossible to
evaluate the trigger expression (because one of the items has
no data, for example).

A trigger expression has two main components:

Functions applied to the item data

Arithmetical and logical operations performed on the functions' results


From a syntactical point of view, the item and function component has to be enclosed
in curly brackets, as illustrated in the preceding screenshot, while the arithmetical
and logical operators stay outside the brackets.
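For example, a complete, if minimal, expression built this way (using a hypothetical host named Alpha) could be:
{Alpha:system.cpu.load[all,avg1].last(0)}>5
Here, the item key and the function sit inside the curly brackets, while the comparison with the constant 5 stays outside them.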

Selecting items and functions


You can reference as many items as you want in a trigger expression as long as
you apply a single function to every single item. This means that, if you want to
use the same item twice, you'll need to specify it twice completely, as shown in the
following code:
{Alpha:log[/tmp/operations.log,,,10,skip].nodata(600)}=1 or
{Alpha:log[/tmp/operations.log,,,10,skip].str(error)}=1

The previously discussed trigger will evaluate to PROBLEM if there are no new lines in
the operations.log file for more than 10 minutes or if an error string is found in the
lines appended to that same file.
Zabbix doesn't apply short-circuit evaluation of the and and or
(previously, until Zabbix 2.4, they were expressed with & and |)
operators; every comparison will be evaluated regardless of the
outcome of the preceding ones.

Of course, you don't have to reference items from the same host; you can reference
different items from different hosts and on different proxies too (if you can access
them), as shown in the following code:
{Proxy1:Alpha:agent.ping.last(0)}=0 and
{Proxy2:Beta:agent.ping.last(0)}=0

Here, the trigger will evaluate to PROBLEM if both the hosts Alpha and Beta are
unreachable. It doesn't matter that the two hosts are monitored by two different
proxies. Everything will work as expected as long as the proxy where the trigger is
defined has access to the two monitored hosts' historical data. You can apply all the
same functions available for calculated items to your items' data. The complete list
and specification are available on the official Zabbix wiki (https://www.zabbix.
com/documentation/2.4/manual/appendix/triggers/functions), so it would
be redundant to repeat them here, but a few common aspects among them deserve a
closer look.


Choosing between seconds and a number of measurements
Many trigger functions take a sec or #num argument. This means that you can
either specify a time period in seconds or a number of measurements, and the
trigger will take all of the item's data in the said period and apply the function to it.
So, the following code will take the minimum value of Alpha's CPU idle time in the
last 10 minutes:
{Alpha:system.cpu.util[,idle].min(600)}

The following code, unlike the previous one, will perform the same operation on the
last ten measurements:
{Alpha:system.cpu.util[,idle].min(#10)}

Instead of a value in seconds, you can also specify shortcuts
such as 10m for 10 minutes, 2d for 2 days, and 6h for 6 hours.
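With these shortcuts, the first of the two expressions shown previously could equally be written as:
{Alpha:system.cpu.util[,idle].min(10m)}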

Which one should you use in your triggers? While it obviously depends on your
specific needs and objectives, each one has its strengths that make it useful in the
right context. For all kinds of passive checks initiated by the server, you'll often want
to stick to a time period expressed as an absolute value. A #5 parameter will vary
quite dramatically as a time period if you vary the check interval of the relative item.
It's not usually obvious that such a change will also affect related triggers. Moreover,
a time period expressed in seconds may be closer to what you really mean to check
and thus may be easier to understand when you'll visit the trigger definition at a
later date. On the other hand, you'll often want to opt for the #num version of the
parameter for many active checks, where there's no guarantee that you will have a
constant, reliable interval between measurements. This is especially true for trapper
items of any kind and for log files. With these kinds of items, referencing the number
of measurements is often the best option.


The date and time functions


All the functions that return a time value, whether it's the current date, the current
time, the day of the month, or the day of the week, still need a valid item as part of
the expression. These can be useful to create triggers that change their status
only during certain times of the day or on certain specific days or, better yet,
to define well-known exceptions to common triggers when we know that some
otherwise unusual behavior is to be expected. Consider, for example, a case where there's a bug
in one of your company's applications that causes a rogue process to quickly fill up
a filesystem with huge log files. While the development team is working on it, they
ask you to keep an eye on the said filesystem and kill the process if it's filling the
disk up too quickly. As with many things in Zabbix, there's more than one way to
approach this problem, but you decide to keep it simple and find that, after watching
the trending data on the host's disk usage, a good indicator that the process is going
rogue is that the filesystem has grown by more than 3 percent in 10 minutes:
{Alpha:vfs.fs.size[/var,pused].delta(600)}>3

The only problem with this expression is that there's a completely unrelated process
that makes a couple of big file transfers to this same filesystem every night at 2 a.m.
While this is a perfectly normal operation, it could still make the trigger switch to a
PROBLEM state and send an alert. Adding a couple of time functions will take care of
that, as shown in the following code:
{Alpha:vfs.fs.size[/var,pused].delta(600)}>3 and
({Alpha:vfs.fs.size[/var,pused].time(0)}<020000 or
{Alpha:vfs.fs.size[/var,pused].time(0)}>030000 )

Just keep in mind that all the trigger functions return a numerical value, including
the date and time ones, so it's not really practical to express fancy dates, such as the
first Tuesday of the month or last month (instead of the last 30 days).


Trigger severity
Severity is little more than a simple label that you attach to a trigger. The web
frontend will display different severity values with different colors, and you will
be able to create different actions based on them, but they have no further meaning
or function in the system. This means that the severity of a trigger will not change
over time based on how long that trigger has been in a PROBLEM state, nor can you
assign a different severity to different thresholds in the same trigger. If you really
need a warning alert when a disk is over 90 percent full and a critical alert when it's
100 percent full, you will need to create two different triggers with two different
thresholds and severities. This may not be the best course of action though, as it
could lead to warnings that are ignored and not acted upon, critical alerts that
fire up when it's already too late and you have already lost service availability,
or simply a redundant configuration with redundant messages, more room for
mistakes, and a worse signal-to-noise ratio.
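For example, using a hypothetical host Alpha and purely illustrative thresholds, the two triggers could be defined as follows, the first with a Warning severity and the second with a Disaster severity:
{Alpha:vfs.fs.size[/var,pused].last(0)}>90
{Alpha:vfs.fs.size[/var,pused].last(0)}>99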
A better approach would be to clearly assess the actual severity of the potential for
the disk to fill up and create just one trigger with a sensible threshold and, possibly,
an escalating action if you fear that the warning could get lost among the others.

Choosing between absolute values and percentages
If you look at many native agent items, you'll see that a lot of them can express
measurements either as absolute values or as percentages. It often makes sense to
do this while creating one's own custom items as both representations can be quite
useful in and of themselves. When it comes to creating triggers on them, though, the
two can differ quite a lot, especially if you have the task of keeping track of available
disk space.
Filesystem sizes and disk usage patterns vary quite a lot between different servers,
installations, application implementations, and user engagements. While a free space
of 5 percent of a hypothetical disk A could be small enough that it would make sense
to trigger a warning and act upon it, the same 5 percent could mean a lot more space
for a large disk array, enough for you to not really need to act immediately but plan a
possible expansion without any urgency. This may lead you to think that percentages
are not really useful in these cases, and even that you can't really put disk-space-related
triggers in templates, as it would be better to evaluate every single case and
build triggers that are tailor-made for every particular disk with its particular usage
pattern. While this can certainly be a sensible course of action for particularly sensitive
and critical filesystems, it can quickly become too much work in a large environment
where you may need to monitor hundreds of different filesystems.


This is where the delta function can help you create triggers that are general enough
that you can apply them to a wide variety of filesystems so that you can still get
a sensible warning about each one of them. You will still need to create more
specialized triggers for those special, critical disks, but you'd have to anyway.
While it's true that the same percentages may mean quite a different thing for disks
with a great difference in size, a similar percentage variation of available space on a
different disk could mean quite the same thing: the disk is filling up at a rate that can
soon become a problem:
{Template_fs:vfs.fs.size[/,pfree].last(0)}<5 and
({Template_fs:vfs.fs.size[/,pfree].delta(1d)} /
{Template_fs:vfs.fs.size[/,pfree].last(0,1d)} > 0.5)

The previously discussed trigger would report a PROBLEM state not just if the
available space is less than 5 percent on a particular disk, but also if the available
space has been reduced by more than half in the last 24 hours (don't miss the
time-shift parameter in the last function). This means that no matter how big the disk
is, based on its usage pattern it could quickly fill up. Note also how the trigger would
need progressively smaller and smaller percentages for it to assume a PROBLEM state,
so you'd automatically get more frequent and urgent notifications as the disk is
filling up.
For these kinds of checks, percentage values should prove more flexible and easy to
understand than absolute ones, so that's what you probably want to use as a baseline
for templates. On the other hand, absolute values may be your best option if you
want to create a very specific trigger for a very specific filesystem.

Understanding operations as correlations


As you may have already realized, practically every interesting trigger expression is
built as a logical operation between two or more simpler expressions. Naturally, this
is not the only way to create useful triggers. Many simple checks on the
status of an agent.ping item can literally save the day when quickly acted upon,
but Zabbix also makes it possible, and relatively easy, to define powerful checks that
would require a lot of custom coding to implement in other systems. Let's see a few
more examples of relatively complex triggers.


Going back to the date and time functions, let's say that you have a trigger that
monitors the number of active sessions in an application and fires up an alert if that
number drops too low during certain hours because you know that there should
always be a few automated processes creating and using sessions in that window of
time (from 10:30 to 12:30 in this example). During the rest of the day, the number of
sessions is neither predictable, nor that significant, so you keep sampling it but don't
want to receive any alert. A first, simple version of your trigger could look like the
following code:
{Appserver:sessions.active[myapp].min(300)}<5 and
{Appserver:sessions.active[myapp].time(0)}>103000 and
{Appserver:sessions.active[myapp].time(0)}<123000

The sessions.active item could be a custom script,
calculated item, or anything else. It's used here as a label to
make the example easier to read and not as an instance of
an actual ready-to-use native item.

The only problem with this trigger is that if the number of sessions drops below
five in that window of time but it doesn't come up again until after 12:30, the trigger
will stay in the PROBLEM state until the next day. This may be a great nuisance if you
have set up multiple actions and escalations on that trigger as they would go on for
a whole day no matter what you do to address the actual session's problems. But
even if you don't have escalating actions, you may have to give accurate reports on
these event durations, and an event that looks as if it's going on for almost 24 hours
would be both incorrect in itself and for any SLA reporting. Even if you don't have
reporting concerns, displaying a PROBLEM state when it's not there anymore is a kind
of false positive that will not let your monitoring team focus on the real problems
and, over time, may reduce their attention on that particular trigger.
A possible solution is to make the trigger return to the OK state outside the target
hours if it was in a PROBLEM state, as shown in the following code:
({Appserver:sessions.active[myapp].min(300)}<5 and
{Appserver:sessions.active[myapp].time(0)}>103000 and
{Appserver:sessions.active[myapp].time(0)}<123000) or
({TRIGGER.VALUE}=1 and
{Appserver:sessions.active[myapp].min(300)}<0 and
({Appserver:sessions.active[myapp].time(0)}<103000 or
{Appserver:sessions.active[myapp].time(0)}>123000))


The first three lines are identical to the trigger defined before. This time, there is one
more complex condition, as follows:

The trigger is in a PROBLEM state (see the note about the TRIGGER.VALUE
macro)

The number of sessions is less than zero (this can never be true)

We are outside the target hours (the last two lines are the opposite of those
defining the time frame preceding it)
The TRIGGER.VALUE macro represents the current value of
the trigger expressed as a number. A value of 0 means OK, 1
means PROBLEM, and 2 means UNKNOWN. The macro can be used
anywhere you can use an item.function pair, so you'll typically
enclose it in curly brackets. As you've seen in the preceding
example, it can be quite useful when you need to define different
thresholds and conditions depending on the trigger's status itself.

The condition about the number of sessions being less than zero makes sure
that outside the target hours, if the trigger was in a PROBLEM state, the whole
expression will evaluate to false anyway. False means that the trigger is switching
to the OK state.
Here, you have not only made a correlation between an item value and a window
of time to generate an event, but you have also made sure that the event will always
spin down gracefully instead of potentially going out of control.
Another interesting way to build a trigger is to combine different items from the
same hosts or even different items from different hosts. This is often used to spot
incongruities in your system state that would otherwise be very difficult to identify.
An obvious case could be that of a server that serves content over the network.
Its overall performance parameters may vary a lot depending on a great number
of factors, so it would be very difficult to identify sensible trigger thresholds that
wouldn't generate a lot of false positives or, even worse, missed events. What may
be certain though is that if you see a high CPU load while network traffic is low, then
you may have a problem, as shown in the following code:
{Alpha:system.cpu.load[all,avg5].last(0)} > 5 and
{Alpha:net.if.total[eth0].avg(300)} < 1000000


An even better example concerns the need to check for hanging or
frozen sessions in an application. The actual way to do this would depend a lot on
the specific implementation of the said application, but for illustrative purposes,
let's say that a frontend component keeps a number of temporary session files in a
specific directory, while the database component populates a table with the session
data. Even if you have created items on two different hosts to keep track of these
two sources of data, each number taken alone will certainly be useful for trending
analysis and capacity planning, but they need to be compared to check whether
something's wrong in the application's workflow. Assuming that we have previously
defined a local command on the frontend's Zabbix agent that will return the number
of files in a specific directory, and that we have defined an odbc item on the database
host that will query the DB for the number of active sessions, we could then build a
trigger that compares the two values and reports a PROBLEM state if they don't match:
{Frontend:dir.count[/var/sessions].last(0)} <>
{Database:sessions.count.last(0)}

The <> term in the expression is the not equal operator: what
was previously expressed as # is now expressed with <>,
starting with Zabbix 2.4.
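
For reference, the local command mentioned previously could be implemented on the frontend's agent as a flexible UserParameter along the following lines (only a sketch: the actual command is an assumption and would depend on how your application stores its session files):
UserParameter=dir.count[*],ls -1 "$1" 2>/dev/null | wc -l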

Aggregated and calculated items can also be very useful in building effective
triggers. The following one will make sure that the ratio between active workers and
the available servers doesn't drop too low in a grid or cluster:
{ZbxMain:grpsum["grid", "proc.num[listener]", last, 0].last(0)} /
{ZbxMain:grpsum["grid", "agent.ping", last, 0].last(0)} < 0.5

All these examples should help drive home the fact that once you move beyond
checking for simple thresholds with single-item values and start correlating different
data sources together in order to have more sophisticated and meaningful triggers,
there is virtually no end to all the possible variations of trigger expressions that you
can come up with.
By identifying the right metrics, as explained in Chapter 4, Collecting Data, and
combining them in various ways, you can pinpoint very specific aspects of your
system behavior; you can check log files together with the login events and
network activity to track down possible security breaches, compare a single server's
performance with the average server performance in the same group to identify
possible problems in service delivery, and do much more.
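As a sketch of that last idea, assuming that the host Alpha belongs to the grid host group used earlier and that the aggregate items are defined on the ZbxMain host, such a comparison could fire when Alpha's load is more than twice the group average:
{Alpha:system.cpu.load[all,avg5].last(0)} >
{ZbxMain:grpavg["grid", "system.cpu.load[all,avg5]", last, 0].last(0)}*2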


This is, in fact, one of Zabbix's best-kept secrets, and one that really deserves more publicity:
its triggering system is actually a sophisticated correlation engine that draws its
power from a clear and concise method to construct expressions as well as from the
availability of a vast collection of both current and historical data. Spending a bit of
your time studying it in detail and coming up with interesting and useful triggers
that are tailor-made for your needs will certainly pay you back tenfold as you will
end up not only with a perfectly efficient and intelligent monitoring system, but also
with a much deeper understanding of your environment.

Managing trigger dependencies


It's quite common that the availability of a service or a host doesn't depend only on
the said host by itself, but also on the availability of any other machine that may
provide connectivity to it. For example, if a router goes down, whereby an entire
subnet is isolated, you would still get alerts about all the hosts in the said network
that will suddenly be seen as unavailable from Zabbix's point of view even if it's
really the router's fault. A dependency relationship between the router and the hosts
behind it would help alleviate the problem because it would make the server skip
any trigger check for the hosts in the subnet in case the router becomes unavailable.
While Zabbix doesn't support the kind of host-to-host dependencies that other
systems do, it does have a trigger-to-trigger dependency feature that can largely
perform the same function. For every trigger definition, you can specify a different
trigger upon which your new trigger is dependent. If the parent trigger is in a
PROBLEM state, the trigger you are defining won't be checked until the parent returns
to the OK state. This approach is certainly quite flexible and powerful, but it also has
a couple of downsides. The first one is that one single host can have a significant
number of triggers, so if you want to define a host-to-host dependency, you'll need
to update every single trigger, which may prove to be quite a cumbersome task.
In this kind of situation, you can probably simplify the problem by adding your
triggers to a custom template. However, if you only have specific cases, this will
not help, as it would end up creating a template for each host, which is not ideal
and simply moves the problem to the template level. You can, of course, rely on the mass
update feature of the web frontend as a partial workaround. A second problem is
that you won't be able to look at a host definition and see that there is a dependency
relationship with another host. Short of looking at a host's trigger configuration,
there's simply no easy way to display or visualize this kind of relationship in Zabbix.


A distinct advantage of having a trigger-level dependency feature is that you can
define dependencies between single services on different hosts. As an example, you
could have a database that serves a bunch of web applications on different web
servers. If the database is unavailable, none of the related websites will work, so
you may want to set up a dependency between the web monitoring triggers and
the availability of the database. On the same servers, you may also have some other
service that relies on a separate license server or an identity and authentication server.
You could then set up the appropriate dependencies so that you could end up having
some triggers depend on the availability of one server and other triggers depend on
the availability of another one, all in the same host. While this kind of configuration
can easily become quite complex and difficult to maintain efficiently, a select few,
well-placed dependencies can help cut down the amount of redundant alerts in a large
environment. This, in turn, would help you to focus immediately on the real problems
where they arise instead of having to hunt them down in a long list of trigger alerts.

Taking an action
Just as items only provide raw data and triggers are independent from them as they
can access virtually any item's historical data, triggers, in turn, only provide a status
change. This change is recorded as an event just as measurements are recorded as
item data. This means that triggers don't provide any reporting functionality; they
just check their conditions and change the status accordingly. Once again, what may
seem to be a limitation and lack of power turns out to be the exact opposite as the
Zabbix component in charge of actually sending out alerts or trying to automatically
resolve some problems is completely independent from triggers. This means that just
as triggers can access any item's data, actions can access any trigger's name, severity,
or status so that, once again, you can create the perfect mix of very general and very
specific actions without being stuck in a one-action-per-trigger scheme.
Unlike triggers, actions are also completely independent from hosts and templates.
Every action is always globally defined and its conditions checked against every
single Zabbix event. As you'll see in the following paragraphs, this may force you to
create certain explicit conditions instead of implicit conditions, but that's balanced
out by the fact that you won't have to create similar but different actions for similar
events just because they are related to different hosts.
An action is composed of the following three different parts that work together to
provide all the functionality needed:

Action definition

Action conditions

Action operations

The fact that every action has a global scope is reflected in every one of its
components, but it assumes critical importance when it comes to action conditions
as it's the place where you decide which action should be executed based on which
events. But let's not get ahead of ourselves, and let's see a couple of interesting things
about each component.

Defining an action
This is where you decide a name for the action and can define a default message that
can be sent as a part of the action itself. In the message, you can reference specific
data about the event, such as the host, item, and trigger names, item and trigger
values, and URLs. Here, you can leverage the fact that actions are global by using
macros so that a single action definition could be used for every single event in
Zabbix and yet provide useful information in its message.
You can see a few interesting macros already present in the default message when
you create a new action, as shown in the following screenshot:
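That default message is built entirely from macros and looks roughly like the following (a sketch of the Zabbix 2.4-era default; the exact text may differ slightly in your installation):
Trigger: {TRIGGER.NAME}
Trigger status: {TRIGGER.STATUS}
Trigger severity: {TRIGGER.SEVERITY}
Trigger URL: {TRIGGER.URL}
Item values:
1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}
Original event ID: {EVENT.ID}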


Most of them are pretty self-explanatory, but it's interesting to see how you can, of
course, reference a single trigger: the one that generated the event. On the other
hand, as a trigger can check multiple items from multiple hosts, you can reference
all the hosts and items involved (up to nine different hosts and/or items) so that you
can get a picture of what's happening by just reading the message.
Other interesting macros can make the message even more useful and expressive.
Just remember that the default message can be sent not only via e-mail, but also
via chat or SMS; you'll probably want to create different default actions with
different messages for different media types so that you can calibrate the amount of
information provided based on the media available.
You can see the complete list of supported macros in the official documentation wiki
at https://www.zabbix.com/documentation/2.4/manual/appendix/macros/
supported_by_location, so we'll look at just a couple of the most interesting ones.

The {EVENT.DATE} and {EVENT.TIME} macros


These two macros can help you to differentiate between the time a message is sent and
the time of the event itself. It's particularly useful not only for repeated or escalated
actions, but also for all media where a timestamp is not immediately apparent.

The {INVENTORY.SERIALNO.A} and friends macros


When it comes to hardware failure, information about a machine's location, admin
contact, serial number, and so on, can prove quite useful to track it down quickly or
to pass it on to external support groups.

Defining the action conditions


This part lets you define conditions based on the event's hosts, trigger, and trigger
values. Just as with trigger expressions, you can combine different simple conditions
with a series of AND/OR logical operators, as shown in the next screenshot. You
can either have all AND, all OR, or a combination of the two, where conditions
of different types are combined with AND, while conditions of the same type are
combined with OR:
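As a purely illustrative pair of conditions of different types, evaluated with the And/Or calculation, you might have (the host group name is only an example):
(A) Trigger value = PROBLEM
(B) Host group = Linux servers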


Observe how one of the conditions is Trigger value = PROBLEM. Since actions are
evaluated for every event and since a trigger switching from PROBLEM to OK is an
event in itself, if you don't specify this condition, the action will be executed both
when the trigger switches to PROBLEM and when the trigger switches back to OK.
Depending on how you have constructed your default message and what operations
you intend to do with your actions, this may very well be what you intended, and
Zabbix will behave exactly as expected.
However, if you created a different recovery message in the Action definition form
and you forget the condition, you'll get two messages when a trigger switches back
to OK: one will be the standard message, and one will be the recovery message. This
can certainly be a nuisance as any recovery message would be effectively duplicated,
but things can get ugly if you rely on external commands as part of the action's
operations. If you forget to specify the condition Trigger value = PROBLEM, the
external, remote command would also be executed twice: once when the trigger
switches to PROBLEM (this is what you intended) and once when it switches back to
OK (this is quite probably not what you intended). Just to be on the safe side, and if
you don't have very specific needs for the action you are configuring, it's probably
better if you get into the habit of putting Trigger value = PROBLEM for every new
action you create or at least checking whether it's present in the actions you modify.
The most typical application to create different actions with different conditions is to
send alert and recovery messages to different recipients. This is the part where you
should remember that actions are global.

Let's say that you want all the database problems sent over to the database
administrators group and not the default Zabbix administrators group. If you just
create a new action with the condition that the host group must be DB Instances and,
as message recipients, choose your DB admins, they will certainly receive a message
for any DB-related event, but so will your Zabbix admins if the default action has no
conditions configured. The reason is that since actions are global, they are always
executed whenever their conditions evaluate to True. In this case, both the specific
action and the default one would evaluate to True, so both groups would receive a
message. What you could do is add an opposite condition in the default action so
that it would be valid for every event, except for those related to the DB Instances
host group. The problem is that this approach can quickly get out of control, and
you may find yourself with a default action full of 'not in this host group' conditions.
Truth is, once you start creating actions specific to message recipients, you either
disable the default action or take advantage of it to populate a message archive for
administration and reporting purposes.
Starting with Zabbix 2.4, there is another supported way of calculating action
conditions. As you can easily imagine, the And/Or type of calculation clearly
suffers from many limitations. To take a practical example with two groups of the
same condition type, you can't use AND within one group and OR within the
other group. Zabbix 2.4 removes this limitation. If you take a look at the possible
options for calculating the action condition, you can see that we can now also
choose the Custom expression option, as shown in the following screenshot:


This new way allows us to use calculated formulas, such as:

(A and B) and (C or D)

(A and B) or (C and D)

But you can even mix the logical operators, as with this example:

((A or B) and C) or D

This opens up quite a few interesting usage scenarios, bypassing the previous
limitations.

Choosing the action operations


If the first two parts were just preparation, this is where you tell the action what it
should actually do. The following are the two main aspects to this:

Operation steps

The actual operations available for each step

As with almost everything in Zabbix, the simplest cases are very straightforward
and most often self-explanatory; you just have a single step, and this step consists
of sending the default message to a group of defined recipients. Still, this simple
scenario can become increasingly complex and sophisticated, yet remain manageable,
depending on your specific needs. Let's see a few interesting details about each part.

Steps and escalations


Even if an action is tied to a single event, that does not mean it can only perform a single
operation. In fact, it can perform an arbitrary number of operations called steps,
which can even go on for an indefinite amount of time or until the conditions for
performing the action are no longer valid.


You can use multiple steps to both send messages as well as perform automated
operations. Alternatively, you can use the steps to send alert messages to different
groups or even multiple times to the same group with the time intervals that you
want as long as the event is unacknowledged or even not yet resolved. The following
screenshot shows a combination of different steps:


As you can see, step 1 starts immediately, is set to send a message to a user group,
and then delays the subsequent step by just 1 minute. After 1 minute, step 2 starts
and is configured to perform a remote command on the host. As step 2 has a default
duration (which is defined in the main Action definition tab), step 3 will start after
about an hour. Steps 3, 4, and 5 are all identical and have been configured together:
they will send a message to a different user group every 10 minutes. You can't see
it in the preceding screenshot, but step 6 will only be executed if the event is not yet
acknowledged, just like step 7, which is still being configured. The other interesting bit
of step 7 is that it's actually set to configure steps 7 to 0. It may seem counterintuitive,
but in this case, step 0 simply means forever. You can't really have further steps if
you create a step N to 0, because the latter will repeat itself with the time interval
set in the step's Duration(sec) field. Be very careful in using step 0 because it will
really go on until the trigger's status changes. Even then, if you didn't add a Trigger
value = PROBLEM condition to your action, step 0 can be executed even if the
trigger switched back to OK. In fact, it's probably best never to use step 0 at all unless
you really know what you are doing.

Messages and media


For every message step, you can choose to send the default message that you
configured in the first tab of the Action creation form or send a custom message that
you can craft in exactly the same way as the default one. You might want to add
more details about the event if you are sending the message via e-mail to a technical
group. On the other hand, you might want to reduce the amount of details or the
words in the message if you are sending it to a manager or supervisor or if you are
limiting the message to an SMS.


Remember that in the Action operation form, you can only choose recipients as
Zabbix users and groups, while you still have to specify, for every user, the media
addresses at which they are reachable. This is done in the Administration tab of the Zabbix
frontend by adding media instances for every single user. You also need to keep in
mind that every media channel can be enabled or disabled for a user; it may be active
only during certain hours of the day or just for one or more specific trigger severities,
as shown in the following screenshot:

This means that even if you configure an action to send a message, some recipients
may still not receive it based on their own media configuration.
While Email, Jabber, and SMS are the default options to send messages, you still
need to specify how Zabbix is supposed to send them. Again, this is done in the
Media types section of the Administration tab of the frontend. You can also create
new media types there that will be made available both in the media section of user
configuration and as targets to send messages to in the Action operations form.
A new media type can be a different e-mail, Jabber, or SMS server, which is useful if you
have more than one server and need to use them for different purposes or with
different sender identities. It can also be a script, and this is where things can become
interesting, if not potentially misleading.


A custom media script has to reside on the Zabbix server in the directory that is
indicated by the AlertScriptPath variable of zabbix_server.conf. When called
upon, it will be executed with the following three parameters passed by the server:

$1: The recipient of the message

$2: The subject of the message

$3: The body of the main message
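
A minimal sketch of such a script, in Bash, might look like the following; the file name, path, and delivery method (logging to a file) are purely illustrative:
#!/bin/bash
# Hypothetical alert script, for example filelogger.sh, placed in the directory
# pointed to by AlertScriptPath in zabbix_server.conf.
# Zabbix invokes it as: filelogger.sh <recipient> <subject> <body>
recipient="$1"
subject="$2"
body="$3"
# Illustrative "delivery": append the alert to a per-recipient log file.
logdir="/var/log/zabbix-alerts"
mkdir -p "$logdir"
printf '%s | %s | %s\n' "$(date '+%F %T')" "$subject" "$body" >> "$logdir/${recipient}.log"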

The recipient will be taken from the appropriate user-media property that you
defined for your users while creating the new media type. The subject and the
message body will be the default ones configured for the action or some step-specific
ones, as explained before. Then, from Zabbix's point of view, whether it's an old
UUCP link, a modern mail server that requires strong authentication, or a post to an
internal microblogging server, the script should send the message to the recipient
by whatever custom methods you intend to use. The fact is that you can actually
do what you want with the message; you can simply log it to a directory, send it to
a remote file server, morph it to a syslog entry and send it over to a log server, run
a speech synthesis program on it and read it aloud on some speakers, or record a
message on an answering machine (as with every custom solution); the sky's the
limit with custom media types. This is why you should not confuse custom media
with the execution of a remote commandwhile you could potentially obtain
roughly the same results with one or the other, custom media scripts and remote
commands are really two different things.

Remote commands
These are normally used to try to perform corrective actions in order to resolve
a problem without human intervention. After you've chosen the target host that
should execute the command, the Zabbix server will connect to it and ask it to
perform it. If you are using the Zabbix agent as a communication channel, you'll
need to set EnableRemoteCommands to 1, or the agent will refuse to execute any
command. Other possibilities include SSH, Telnet, and IPMI (if you have compiled
in the relevant options during server installation).
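As a minimal sketch, the agent-side prerequisite looks like the following, while the command itself is whatever you put in the action operation (the restart command here is purely illustrative and would also require the appropriate sudoers configuration):
# In zabbix_agentd.conf on the target host
EnableRemoteCommands=1
# An illustrative command for the action operation
sudo /usr/sbin/service httpd restart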


Remote commands can be used to do almost anything: kill or restart a process,
make space on a filesystem by zipping or deleting old files, reboot a machine,
and so on. They tend to seem powerful and exciting to new implementers, but in
the authors' experience, they are fragile solutions that tend to break things
almost as often as they fix them. It's harder than it looks to make them run safely
without accidentally deleting files or rebooting servers when there's no need to. The
real problem with remote commands is that they tend to hide problems instead of
revealing them, which should really be the job of a monitoring system. Yes, they
can prove useful as a quick patch to ensure the smooth operation of your services,
but use them too liberally and you'll quickly forget that there actually are recurring
problems that need to be addressed because some fragile command somewhere
is trying to fix things in the background for you. It's usually better to really try to
solve a problem than to just hide it behind an automated temporary fix. This is not
just from a philosophical point of view as, when these patches fail, they tend to fail
spectacularly and with disastrous consequences.
So, our advice is that you use remote commands very sparingly and only if you
know what you are doing.

Summary
This chapter focused on what is usually considered the core business of a monitoring
system: its triggering and alerting features. By concentrating separately and
alternately on the two parts that contribute to this function, triggers and actions, it
should be clear to you how, once again, Zabbix's philosophy of separating all the
different functions can give great rewards to the astute user. You learned how to
create complex and sophisticated trigger conditions that will help you have a better
understanding of your environment and have more control over what alerts you
should receive. The various triggering functions and options as well as some of the
finer aspects of item selection, along with the many aspects of action creation, are not
a secret to you now.
In the next chapter, you will explore the final part of Zabbix's core monitoring
components: templates and discovery functions.


Get more information on Mastering Zabbix Second Edition

Where to buy this book


You can buy Mastering Zabbix Second Edition from the Packt Publishing website.
Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals and most internet
book retailers.
Click here for ordering and shipping details.

www.PacktPub.com
