Mastering Zabbix - Second Edition - Sample Chapter
Nowadays, monitoring systems play a crucial role in any IT
environment. They are extensively used to not only measure
your system's performance, but also to forecast capacity
issues. This is where Zabbix, one of the most popular
monitoring solutions for networks and applications, comes
into the picture.
This new edition will provide you with all the knowledge you
need to make strategic and practical decisions about the
Zabbix monitoring system. The setup you'll do with this book
will fit your environment and monitoring needs like a glove.
You will be guided from the initial steps of choosing the
correct size and configuration for your system to deciding what
to monitor and how to implement your own custom monitoring
component. Exporting and integrating your data with other
systems is also covered.
By the end of this book, you will have a tailor-made and
well-configured monitoring system and will understand with
absolute clarity how crucial it is to your IT environment.
Learn how to monitor your large IT environments using Zabbix
with this one-stop, comprehensive guide to the Zabbix world
Preface
Ever since its first public release in 2001, Zabbix has distinguished itself as a very
powerful and effective monitoring solution. As an open source product, it's easy to
obtain and deploy, and its unique approach to metrics and alarms has helped to set
it apart from its competitors, both open and commercial. It's a powerful, compact
package with very low requirements in terms of hardware and supporting software
for a basic yet effective installation. If you add its relative ease of use, it's clear that it
can be a very good contender for small environments with a tight budget. But it's
when it comes to managing a huge number of monitored objects, with a complex
configuration and dependencies, that Zabbix's scalability and inherently distributed
architecture really shine. More than anything, Zabbix can be an ideal solution
in large and complex distributed environments, where being able to manage
monitored objects and events efficiently and extract meaningful information from them
is just as important as, if not more important than, the usual considerations about costs,
accessibility, and ease of use.
This is a second edition book, the first having been coauthored by Andrea Dalle
Vacche and Stefano Kewan Lee. The purpose of this book is to help you make the
most of your Zabbix installation and leverage all of its power to monitor any large and
complex environment effectively.
Chapter 6, Managing Alerts, gives examples of complex triggers and trigger conditions
as well as advice on choosing the right number of triggers and alerting actions. The
purpose is to help you walk the fine line between being blind to possible problems
and being overwhelmed by false positives. You will also learn how to use actions
to automatically fix simple problems, raise actions without the need for human
intervention, correlate different triggers and events, and tie escalations to your
operations management workflow. This chapter will make you aware of what
can be automated, reducing your administrative workload and optimizing the
administration process in a proactive way.
Chapter 7, Managing Templates, offers guidelines for effective template management:
building complex template schemes out of simple components, understanding and
managing the effects of template modification, maintaining existing monitored
objects, and assigning templates to discovered hosts. This will conclude the second
part of the book that is dedicated to the different Zabbix monitoring and data
management options. The third and final part will discuss Zabbix's interaction with
external products and all its powerful extensibility features.
Chapter 8, Handling External Scripts, helps you learn how to write scripts to monitor
objects that are not covered by the core Zabbix features. The relative advantages and
disadvantages of keeping the scripts on the server side or agent side, how to launch
or schedule them, and a detailed analysis of the Zabbix agent protocol will also be
covered. This chapter will make you aware of all the possible side effects, delays,
and load caused by scripts; you will be able to implement all the needed external
checks, as you will be well aware of all that is connected with them and the relative
observer effect. The chapter will include different implementations of working with
Bash, Java, and Python so that you can easily write your own scripts to extend and
enhance Zabbix's monitoring possibilities.
Chapter 9, Extending Zabbix, delves into the Zabbix API and how to use it to build
specialized frontends and complex extensions. It also covers how to harvest
monitoring data for further elaboration and reporting. It will include simple example
implementations written in Python that will illustrate how to export and further
manipulate data, how to perform massive and complex operations on monitored
objects, and finally, how to automate different management aspects such as user
creation and configuration, trigger activation, and the like.
Chapter 10, Integrating Zabbix, wraps things up by discussing how to make other
systems know about Zabbix and the other way around. This is key to the successful
management of any large and complex environment. You will learn how to use
built-in Zabbix features, API calls, or direct database queries to communicate with
different upstream and downstream systems and applications. There will be concrete
examples of possible interaction with inventory applications, trouble ticket systems,
and data warehouse systems.
Managing Alerts
Checking conditions and alarms is the most characteristic function of any monitoring
system, and Zabbix is no exception. What really sets Zabbix apart is that every alarm
condition or trigger (as it is known in this system) can be tied not only to a single
measurement, but also to an arbitrarily complex calculation based on all of the data
available to the Zabbix server. Furthermore, just as triggers are independent from
items, the actions that the server can take based on the trigger status are independent
from the individual trigger, as you will see in the subsequent sections.
In this chapter, you will learn the following things about triggers and actions:
You can see how there's a complete item key specification, not just the name, to
which a function is applied. The result is then compared to a constant using a
greater than operator. The syntax for referencing item keys is very similar to that for
a calculated item. In addition to this basic way of referring to item values, triggers
also add a comparison operator that wraps all the calculations up to a Boolean
expression. This is the one great unifier of all triggers; no matter how complex the
expression, it must always return either a True value or a False value. This value
is, of course, directly related to the state of a trigger, which can only be OK if the
expression evaluates to False, or PROBLEM if the expression evaluates to True. There
are no intermediate or soft states for triggers.
A trigger can also be in an UNKNOWN state if it's impossible to
evaluate the trigger expression (because one of the items has
no data, for example).
From a syntactical point of view, the item and function component has to be enclosed
in curly brackets, as illustrated in the preceding screenshot, while the arithmetical
and logical operators stay outside the brackets:
The previously discussed trigger will evaluate to PROBLEM if there are no new lines in
the operations.log file for more than 10 minutes or if an error string is found in the
lines appended to that same file.
Zabbix doesn't apply short-circuit evaluation of the and and or
(previously, until Zabbix 2.4, they were expressed with & and |)
operators; every comparison will be evaluated regardless of the
outcome of the preceding ones.
Of course, you don't have to reference items from the same host; you can reference
different items from different hosts and on different proxies too (if you can access
them), as shown in the following code:
{Alpha:agent.ping.last(0)}=0 and
{Beta:agent.ping.last(0)}=0
Here, the trigger will evaluate to PROBLEM if both the hosts Alpha and Beta are
unreachable. It doesn't matter that the two hosts are monitored by two different
proxies. Everything will work as expected as long as the Zabbix server, which
evaluates the trigger, has access to the two monitored hosts' historical data. You can
apply all the same functions available for calculated items to your items' data. The
complete list and specification are available on the official Zabbix wiki
(https://www.zabbix.com/documentation/2.4/manual/appendix/triggers/functions), so it would
be redundant to repeat them here, but a few common aspects among them deserve a
closer look.
The following code, unlike the previous one, will perform the same operation on the
last ten measurements:
{Alpha:system.cpu.util[,idle].min(#10)}
Which one should you use in your triggers? While it obviously depends on your
specific needs and objectives, each one has its strengths that make it useful in the
right context. For all kinds of passive checks initiated by the server, you'll often want
to stick to a time period expressed as an absolute value. A #5 parameter will vary
quite dramatically as a time period if you vary the check interval of the relative item.
It's not usually obvious that such a change will also affect related triggers. Moreover,
a time period expressed in seconds may be closer to what you really mean to check
and thus may be easier to understand when you revisit the trigger definition at a
later date. On the other hand, you'll often want to opt for the #num version of the
parameter for many active checks, where there's no guarantee that you will have a
constant, reliable interval between measurements. This is especially true for trapper
items of any kind and for log files. With these kinds of items, referencing the number
of measurements is often the best option.
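To make the difference between the two parameter styles concrete, here is a minimal Python sketch. It is not Zabbix code; the function names are hypothetical, and the arithmetic is only an approximation that assumes a perfectly regular check interval. It shows how the time span covered by a #5 parameter changes with the check interval, while a 300-second period simply holds more or fewer values.

```python
def approx_window_seconds(num_values: int, check_interval: int) -> int:
    """Rough time span covered by the last `num_values` samples of an item
    that is collected every `check_interval` seconds (hypothetical helper)."""
    return num_values * check_interval


def values_in_period(period_seconds: int, check_interval: int) -> int:
    """Rough number of samples that fall inside a fixed time period."""
    return period_seconds // check_interval


if __name__ == "__main__":
    # Compare a count-based parameter (#5) with a time-based one (300 seconds)
    # for three different check intervals.
    for interval in (30, 60, 300):
        print(
            f"interval={interval}s: #5 spans ~{approx_window_seconds(5, interval)}s, "
            f"300s holds ~{values_in_period(300, interval)} values"
        )
```

With a 60-second interval the two parameters cover roughly the same data; change the interval to 300 seconds and #5 suddenly looks back 25 minutes instead of 5.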
The only problem with this expression is that there's a completely unrelated process
that makes a couple of big file transfers to this same filesystem every night at 2 a.m.
While this is a perfectly normal operation, it could still make the trigger switch to a
PROBLEM state and send an alert. Adding a couple of time functions will take care of
that, as shown in the following code:
{Alpha:vfs.fs.size[/var,pused].delta(600)}>3 and
({Alpha:vfs.fs.size[/var,pused].time(0)}<020000 or
{Alpha:vfs.fs.size[/var,pused].time(0)}>030000 )
Just keep in mind that all the trigger functions return a numerical value, including
the date and time ones, so it's not really practical to express fancy dates, such as the
first Tuesday of the month or last month (instead of the last 30 days).
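As an illustration of why the time comparison works with plain numbers, the following Python sketch mimics the HHMMSS value that the time function is compared against in the expression above. The helper names are hypothetical and this is only a model of the comparison, not Zabbix code.

```python
from datetime import time


def hhmmss_value(t: time) -> int:
    """Encode a time of day as the number HHMMSS, e.g. 02:30:00 -> 23000."""
    return t.hour * 10000 + t.minute * 100 + t.second


def outside_transfer_window(t: time) -> bool:
    """True when the time is before 02:00:00 or after 03:00:00, mirroring
    the `< 020000 or > 030000` part of the trigger expression."""
    value = hhmmss_value(t)
    return value < 20000 or value > 30000


if __name__ == "__main__":
    for sample in (time(1, 59, 59), time(2, 30, 0), time(14, 0, 0)):
        print(sample, hhmmss_value(sample), outside_transfer_window(sample))
```

Because the value is just a number, anything that cannot be reduced to a numeric comparison of this kind (such as "the first Tuesday of the month") is hard to express.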
Trigger severity
Severity is little more than a simple label that you attach to a trigger. The web
frontend will display different severity values with different colors, and you will
be able to create different actions based on them, but they have no further meaning
or function in the system. This means that the severity of a trigger will not change
over time based on how long that trigger has been in a PROBLEM state, nor can you
assign a different severity to different thresholds in the same trigger. If you really
need a warning alert when a disk is over 90 percent full and a critical alert when it's
100 percent full, you will need to create two different triggers with two different
thresholds and severities. This may not be the best course of action though, as it
could lead to warnings that are ignored and not acted upon, to critical alerts that
fire when it's already too late and you have already lost service availability,
to a redundant configuration with redundant messages and more possibilities of
mistakes, or to a decreased signal-to-noise ratio.
A better approach would be to clearly assess the actual severity of the potential for
the disk to fill up and create just one trigger with a sensible threshold and, possibly,
an escalating action if you fear that the warning could get lost among the others.
This is where the delta function can help you create triggers that are general enough
that you can apply them to a wide variety of filesystems so that you can still get
a sensible warning about each one of them. You will still need to create more
specialized triggers for those special, critical disks, but you'd have to anyway.
While it's true that the same percentages may mean quite a different thing for disks
with a great difference in size, a similar percentage variation of available space on a
different disk could mean quite the same thing: the disk is filling up at a rate that can
soon become a problem:
{Template_fs:vfs.fs.size[/,pfree].last(0)}<5 or
({Template_fs:vfs.fs.size[/,pfree].delta(1d)} /
{Template_fs:vfs.fs.size[/,pfree].last(0,1d)} > 0.5)
The previously discussed trigger would report a PROBLEM state not just if the
available space is less than 5 percent on a particular disk, but also if the available
space has been reduced by more than half in the last 24 hours (don't miss the
time-shift parameter in the last function). This means that no matter how big the disk
is, based on its usage pattern it could quickly fill up. Note also how the trigger would
need progressively smaller and smaller percentages for it to assume a PROBLEM state,
so you'd automatically get more frequent and urgent notifications as the disk is
filling up.
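The logic just described (less than 5 percent free, or free space reduced by more than half over 24 hours) can be sketched in Python as follows. This is only a model of the arithmetic under stated assumptions: the function and parameter names are hypothetical, and the three inputs stand in for the current free percentage, the free percentage 24 hours ago, and the variation over the last day.

```python
def disk_trigger(pfree_now: float, pfree_day_ago: float, pfree_delta_1d: float) -> bool:
    """Return True (PROBLEM) when free space is below 5 percent, or when the
    variation over the last day exceeds half of the value seen a day ago."""
    low_space = pfree_now < 5
    # Guard against a zero baseline; this guard is an assumption of the sketch.
    fast_fill = pfree_day_ago > 0 and (pfree_delta_1d / pfree_day_ago) > 0.5
    return low_space or fast_fill


if __name__ == "__main__":
    print(disk_trigger(80, 90, 10))  # slow change on a roomy disk
    print(disk_trigger(40, 90, 50))  # more than half of the free space gone
    print(disk_trigger(10, 25, 15))  # smaller absolute drop, same alarm
    print(disk_trigger(4, 4.5, 0.5)) # below the 5 percent floor
```

The third call shows the effect noted above: as the disk fills up, a smaller absolute drop in free space is enough to cross the 50 percent ratio.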
For these kinds of checks, percentage values should prove more flexible and easy to
understand than absolute ones, so that's what you probably want to use as a baseline
for templates. On the other hand, absolute values may be your best option if you
want to create a very specific trigger for a very specific filesystem.
Going back to the date and time functions, let's say that you have a trigger that
monitors the number of active sessions in an application and fires an alert if that
number drops too low during certain hours because you know that there should
always be a few automated processes creating and using sessions in that window of
time (from 10:30 to 12:30 in this example). During the rest of the day, the number of
sessions is neither predictable, nor that significant, so you keep sampling it but don't
want to receive any alert. A first, simple version of your trigger could look like the
following code:
{Appserver:sessions.active[myapp].min(300)}<5 and
{Appserver:sessions.active[myapp].time(0)} > 103000 and
{Appserver:sessions.active[myapp].time(0) } < 123000
The only problem with this trigger is that if the number of sessions drops below
five in that window of time but it doesn't come up again until after 12:30, the trigger
will stay in the PROBLEM state until the next day. This may be a great nuisance if you
have set up multiple actions and escalations on that trigger as they would go on for
a whole day no matter what you do to address the actual session problems. But
even if you don't have escalating actions, you may have to give accurate reports on
these event durations, and an event that looks as if it's going on for almost 24 hours
would be incorrect in itself and would also skew any SLA reporting. Even if you don't have
reporting concerns, displaying a PROBLEM state when it's not there anymore is a kind
of false positive that will not let your monitoring team focus on the real problems
and, over time, may reduce their attention on that particular trigger.
A possible solution is to make the trigger return to the OK state outside the target
hours if it was in a PROBLEM state, as shown in the following code:
({Appserver:sessions.active[myapp].min(300)}<5 and
{Appserver:sessions.active[myapp].time(0)} > 103000 and
{Appserver:sessions.active[myapp].time(0)} < 123000) or
({TRIGGER.VALUE}=1 and
{Appserver:sessions.active[myapp].min(300)}<0 and
({Appserver:sessions.active[myapp].time(0)} < 103000 or
{Appserver:sessions.active[myapp].time(0)} > 123000))
The first three lines are identical to the trigger defined before. This time, there is an
additional compound condition, made up of the following parts:
The trigger is in a PROBLEM state (see the note about the TRIGGER.VALUE
macro)
The number of sessions is less than zero (this can never be true)
We are outside the target hours (the last two lines are the opposite of those
defining the time frame preceding it)
The TRIGGER.VALUE macro represents the current value of
the trigger expressed as a number. A value of 0 means OK, 1
means PROBLEM, and 2 means UNKNOWN. The macro can be used
anywhere you can use an item.function pair, so you'll typically
enclose it in curly brackets. As you've seen in the preceding
example, it can be quite useful when you need to define different
thresholds and conditions depending on the trigger's status itself.
The condition about the number of sessions being less than zero makes sure
that outside the target hours, if the trigger was in a PROBLEM state, the whole
expression will evaluate to false anyway. False means that the trigger is switching
to the OK state.
Here, you have not only made a correlation between an item value and a window
of time to generate an event, but you have also made sure that the event will always
spin down gracefully instead of potentially going out of control.
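To trace how the two branches of this expression behave over a sequence of evaluations, here is a small Python simulation. It is a sketch, not Zabbix's evaluator: the function names and the sample data are hypothetical, and the trigger state is modelled as 0 (OK) or 1 (PROBLEM) only.

```python
def evaluate_trigger(min_sessions: int, hhmmss: int, trigger_value: int) -> int:
    """Return 1 (PROBLEM) or 0 (OK) for one evaluation of the expression."""
    inside = 103000 < hhmmss < 123000
    outside = hhmmss < 103000 or hhmmss > 123000
    first_branch = min_sessions < 5 and inside
    # A session count can never be negative, so this branch is always False.
    second_branch = trigger_value == 1 and min_sessions < 0 and outside
    return 1 if (first_branch or second_branch) else 0


def replay(samples):
    """Feed (min_sessions, hhmmss) samples through the trigger in order,
    passing the previous result in as the TRIGGER.VALUE equivalent."""
    state = 0
    history = []
    for min_sessions, hhmmss in samples:
        state = evaluate_trigger(min_sessions, hhmmss, state)
        history.append(state)
    return history


if __name__ == "__main__":
    samples = [(8, 100000), (3, 110000), (3, 124500), (9, 130000)]
    print(replay(samples))
```

In this run the trigger goes to PROBLEM at 11:00 and is back to OK at the first evaluation after 12:30, even though the session count is still low at that point.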
Another interesting way to build a trigger is to combine different items from the
same hosts or even different items from different hosts. This is often used to spot
incongruities in your system state that would otherwise be very difficult to identify.
An obvious case could be that of a server that serves content over the network.
Its overall performance parameters may vary a lot depending on a great number
of factors, so it would be very difficult to identify sensible trigger thresholds that
wouldn't generate a lot of false positives or, even worse, missed events. What may
be certain though is that if you see a high CPU load while network traffic is low, then
you may have a problem, as shown in the following code:
{Alpha:system.cpu.load[all,avg5].last(0)} > 5 and
{Alpha:net.if.total[eth0].avg(300)} < 1000000
An even better example would be about the necessity to check for hanging or
frozen sessions in an application. The actual way to do this would depend a lot on
the specific implementation of the said application, but for illustrative purposes,
let's say that a frontend component keeps a number of temporary session files in a
specific directory, while the database component populates a table with the session
data. You may have created items on two different hosts to keep track of these
two sources of data, and each number taken alone will certainly be useful for trending
analysis and capacity planning, but the two need to be compared to check whether
something's wrong in the application's workflow. Assuming that we have previously
defined a local command on the frontend's Zabbix agent that will return the number
of files in a specific directory, and that we have defined an ODBC item on the database
host that will query the DB for the number of active sessions, we could then build a
trigger that compares the two values and reports a PROBLEM state if they don't match:
{Frontend:dir.count[/var/sessions].last(0)} <>
{Database:sessions.count.last(0)}
The <> term in the expression is the not-equal operator. It was
previously expressed as #; starting with Zabbix 2.4, it is
expressed as <>.
Aggregated and calculated items can also be very useful in building effective
triggers. The following one will make sure that the ratio between active workers and
the available servers doesn't drop too low in a grid or cluster:
{ZbxMain:grpsum["grid", "proc.num[listener]", last, 0].last(0)} /
{ZbxMain:grpsum["grid", "agent.ping", last, 0].last(0)} < 0.5
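The ratio in the expression above can be reproduced with a few lines of Python. This is a sketch with hypothetical names and made-up sample values; it assumes one listener count and one agent.ping value (1 for reachable) per host in the group, and it returns None when no host is reachable, which is a choice of this sketch rather than a statement about how Zabbix handles the division.

```python
from typing import List, Optional


def worker_ratio(listeners_per_host: List[int], ping_per_host: List[int]) -> Optional[float]:
    """Sum of listener processes divided by the sum of reachable hosts."""
    total_listeners = sum(listeners_per_host)
    total_reachable = sum(ping_per_host)
    if total_reachable == 0:
        return None  # nothing to divide by; treated as "cannot evaluate"
    return total_listeners / total_reachable


def grid_trigger(listeners_per_host: List[int], ping_per_host: List[int]) -> bool:
    """True (PROBLEM) when fewer than half of the reachable hosts run a listener."""
    ratio = worker_ratio(listeners_per_host, ping_per_host)
    return ratio is not None and ratio < 0.5


if __name__ == "__main__":
    print(grid_trigger([1, 1, 0, 1], [1, 1, 1, 1]))
    print(grid_trigger([1, 0, 0, 0], [1, 1, 1, 1]))
```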
All these examples should help drive home the fact that once you move beyond
checking for simple thresholds with single-item values and start correlating different
data sources together in order to have more sophisticated and meaningful triggers,
there is virtually no end to all the possible variations of trigger expressions that you
can come up with.
By identifying the right metrics, as explained in Chapter 4, Collecting Data, and
combining them in various ways, you can pinpoint very specific aspects of your
system behavior; you can check log files together with the login events and
network activity to track down possible security breaches, compare a single server's
performance with the average server performance in the same group to identify
possible problems in service delivery, and do much more.
This is, in fact, one of Zabbix's best-kept secrets that really deserves more publicity;
its triggering system is actually a sophisticated correlation engine that draws its
power from a clear and concise method to construct expressions as well as from the
availability of a vast collection of both current and historical data. Spending a bit of
your time studying it in detail and coming up with interesting and useful triggers
that are tailor-made for your needs will certainly pay you back tenfold as you will
end up not only with a perfectly efficient and intelligent monitoring system, but also
with a much deeper understanding of your environment.
Taking an action
Just as items only provide raw data and triggers are independent from them as they
can access virtually any item's historical data, triggers, in turn, only provide a status
change. This change is recorded as an event just as measurements are recorded as
item data. This means that triggers don't provide any reporting functionality; they
just check their conditions and change the status accordingly. Once again, what may
seem to be a limitation and lack of power turns out to be the exact opposite as the
Zabbix component in charge of actually sending out alerts or trying to automatically
resolve some problems is completely independent from triggers. This means that just
as triggers can access any item's data, actions can access any trigger's name, severity,
or status so that, once again, you can create the perfect mix of very general and very
specific actions without being stuck in a one-action-per-trigger scheme.
Unlike triggers, actions are also completely independent from hosts and templates.
Every action is always globally defined and its conditions checked against every
single Zabbix event. As you'll see in the following paragraphs, this may force you to
create certain explicit conditions instead of implicit conditions, but that's balanced
out by the fact that you won't have to create similar but different actions for similar
events just because they are related to different hosts.
An action is composed of the following three different parts that work together to
provide all the functionality needed:
Action definition
Action conditions
Action operations
The fact that every action has a global scope is reflected in every one of its
components, but it assumes critical importance when it comes to action conditions
as it's the place where you decide which action should be executed based on which
events. But let's not get ahead of ourselves, and let's see a couple of interesting things
about each component.
Defining an action
This is where you decide a name for the action and can define a default message that
can be sent as a part of the action itself. In the message, you can reference specific
data about the event, such as the host, item, and trigger names, item and trigger
values, and URLs. Here, you can leverage the fact that actions are global by using
macros so that a single action definition could be used for every single event in
Zabbix and yet provide useful information in its message.
You can see a few interesting macros already present in the default message when
you create a new action, as shown in the following screenshot:
Most of them are pretty self-explanatory, but it's interesting to see how you can, of
course, reference a single trigger: the one that generated the event. On the other
hand, as a trigger can check multiple items from multiple hosts, you can reference
all the hosts and items involved (up to nine different hosts and/or items) so that you
can get a picture of what's happening by just reading the message.
Other interesting macros can make the message even more useful and expressive.
Just remember that the default message can be sent not only via e-mail, but also
via chat or SMS; you'll probably want to create different default actions with
different messages for different media types so that you can calibrate the amount of
information provided based on the media available.
You can see the complete list of supported macros in the official documentation wiki
at https://www.zabbix.com/documentation/2.4/manual/appendix/macros/supported_by_location,
so we'll look at just a couple of the most interesting ones.
Observe how one of the conditions is Trigger value = PROBLEM. Since actions are
evaluated for every event and since a trigger switching from PROBLEM to OK is an
event in itself, if you don't specify this condition the action will be executed both
when the trigger switches to PROBLEM and when the trigger switches back to OK.
Depending on how you have constructed your default message and what operations
you intend to do with your actions, this may very well be what you intended, and
Zabbix will behave exactly as expected.
However, if you created a different recovery message in the Action definition form
and you forget the condition, you'll get two messages when a trigger switches back
to OK: one will be the standard message, and one will be the recovery message. This
can certainly be a nuisance as any recovery message would be effectively duplicated,
but things can get ugly if you rely on external commands as part of the action's
operations. If you forget to specify the condition Trigger value = PROBLEM, the
external, remote command would also be executed twice: once when the trigger
switches to PROBLEM (this is what you intended) and once when it switches back to
OK (this is quite probably not what you intended). Just to be on the safe side, and if
you don't have very specific needs for the action you are configuring, it's probably
better if you get into the habit of putting Trigger value = PROBLEM for every new
action you create or at least checking whether it's present in the actions you modify.
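The double execution is easy to see in a toy model. The Python sketch below is not Zabbix code: the Event class, the condition functions, and the trigger name are all hypothetical. It treats an action as a condition that is checked against every event and counts how many times the operation would run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    trigger: str
    value: str  # "PROBLEM" or "OK"


def executions(events: List[Event], condition: Callable[[Event], bool]) -> List[str]:
    """Return the trigger values of the events for which the action would run."""
    return [event.value for event in events if condition(event)]


def no_condition(event: Event) -> bool:
    """An action with no Trigger value condition matches every event."""
    return True


def only_problem(event: Event) -> bool:
    """Equivalent of the condition Trigger value = PROBLEM."""
    return event.value == "PROBLEM"


if __name__ == "__main__":
    events = [Event("Service down", "PROBLEM"), Event("Service down", "OK")]
    print(executions(events, no_condition))  # runs on both events
    print(executions(events, only_problem))  # runs once
```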
The most typical reason to create different actions with different conditions is to
send alert and recovery messages to different recipients. This is the part where you
should remember that actions are global.
Let's say that you want all the database problems sent over to the database
administrators group and not the default Zabbix administrators group. If you just
create a new action with the condition that the host group must be DB Instances and,
as message recipients, choose your DB admins, they will certainly receive a message
for any DB-related event, but so will your Zabbix admins if the default action has no
conditions configured. The reason is that since actions are global, they are always
executed whenever their conditions evaluate to True. In this case, both the specific
action and the default one would evaluate to True, so both groups would receive a
message. What you could do is add an opposite condition in the default action so
that it would be valid for every event, except for those related to the DB Instances
host group. The problem is that this approach can quickly get out of control, and
you may find yourself with a default action full of the not in group conditions.
Truth is, once you start creating actions specific to message recipients, you either
disable the default action or take advantage of it to populate a message archive for
administration and reporting purposes.
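The following Python sketch models the situation just described. It is a simplified illustration with hypothetical names: an event is reduced to a dictionary with a host_group key, and each action is a condition plus a recipient group. Every action whose condition matches is executed, which is why both groups are notified until the default action excludes the DB Instances group.

```python
from typing import Callable, Dict, List, NamedTuple


class Action(NamedTuple):
    name: str
    condition: Callable[[Dict[str, str]], bool]
    recipients: str


def notified_groups(event: Dict[str, str], actions: List[Action]) -> List[str]:
    """Actions are global: every action is checked against every event."""
    return [action.recipients for action in actions if action.condition(event)]


db_action = Action("DB problems", lambda e: e["host_group"] == "DB Instances", "DB admins")
default_open = Action("Default", lambda e: True, "Zabbix admins")
default_excluding_db = Action(
    "Default", lambda e: e["host_group"] != "DB Instances", "Zabbix admins"
)

if __name__ == "__main__":
    db_event = {"host_group": "DB Instances"}
    print(notified_groups(db_event, [db_action, default_open]))
    print(notified_groups(db_event, [db_action, default_excluding_db]))
```

Each additional recipient-specific action would need another exclusion in the default one, which is how the list of not in group conditions grows.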
Starting with Zabbix 2.4, there is another supported way of calculating action
conditions. As you can easily imagine, the And/Or type of calculation
suffers from many limitations. Taking a practical example with two groups of the
same condition type, you can't use the AND operator within one group and the
OR operator within the other group. The new calculation type removes this
limitation. If you take a look at the possible options for calculating the action
condition, you can see that you can now also choose the Custom expression option,
as shown in the following screenshot:
(A and B) and (C or D)
(A and B) or (C and D)
But you can even mix the logical operators, as with this example:
((A or B) and C) or D
This opens quite a few interesting scenarios of usage, bypassing the previous
limitations.
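Since the custom expressions use the same and/or keywords as Python, their behavior can be explored with ordinary Boolean functions. The sketch below is only an illustration: A, B, C, and D stand for the outcome of four arbitrary action conditions, and the function names are hypothetical.

```python
from itertools import product


def and_then_or(a: bool, b: bool, c: bool, d: bool) -> bool:
    """(A and B) and (C or D)"""
    return (a and b) and (c or d)


def or_of_ands(a: bool, b: bool, c: bool, d: bool) -> bool:
    """(A and B) or (C and D)"""
    return (a and b) or (c and d)


def mixed(a: bool, b: bool, c: bool, d: bool) -> bool:
    """((A or B) and C) or D"""
    return ((a or b) and c) or d


def truth_table(expression):
    """Evaluate an expression for all 16 combinations of A, B, C, and D."""
    return {combo: expression(*combo) for combo in product((False, True), repeat=4)}


if __name__ == "__main__":
    for combo, result in truth_table(mixed).items():
        print(combo, result)
```

Printing the truth table before saving a custom expression is a cheap way to check that the action matches exactly the events you have in mind.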
Operation steps
As with almost everything in Zabbix, the simplest cases that are very straightforward
are most often self-explanatory; you just have a single step, and this step consists
of sending the default message to a group of defined recipients. Also, this simple
scenario can become increasingly complex and sophisticated but still manageable,
depending on your specific needs. Let's see a few interesting details about each part.
You can use multiple steps to both send messages as well as perform automated
operations. Alternatively, you can use the steps to send alert messages to different
groups or even multiple times to the same group with the time intervals that you
want as long as the event is unacknowledged or even not yet resolved. The following
screenshot shows a combination of different steps:
As you can see, step 1 starts immediately, is set to send a message to a user group,
and then delays the subsequent step by just 1 minute. After 1 minute, step 2 starts
and is configured to perform a remote command on the host. As step 2 has a default
duration (which is defined in the main Action definition tab), step 3 will start after
about an hour. Steps 3, 4, and 5 are all identical and have been configured together;
they will send a message to a different user group every 10 minutes. You can't see
it in the preceding screenshot, but step 6 will only be executed if the event is not yet
acknowledged, just like step 7, which is still being configured. The other interesting bit
of step 7 is that it's actually set to configure steps 7 to 0. It may seem counterintuitive,
but in this case, step 0 simply means forever. You can't really have further steps if
you create a step N to 0, because the latter will repeat itself with the time interval
set in the step's Duration(sec) field. Be very careful in using step 0 because it will
really go on until the trigger's status changes. Even then, if you didn't add a Trigger
value = PROBLEM condition to your action, step 0 can be executed even if the
trigger switched back to OK. In fact, it's probably best never to use step 0 at all unless
you really know what you are doing.
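The timing of the steps described above can be worked out with a short Python sketch. The function name is hypothetical, and the numbers are assumptions taken from the description: a 60-second duration for step 1, a default duration of 3600 seconds (the "about an hour" mentioned in the text), and 600 seconds for steps 3 to 5. A duration of 0 stands for "use the action's default".

```python
from typing import List


def step_start_offsets(step_durations: List[int], default_duration: int) -> List[int]:
    """Return, for each step, the number of seconds after the event at which
    the step starts. A duration of 0 means the default duration is used."""
    offsets = []
    elapsed = 0
    for duration in step_durations:
        offsets.append(elapsed)
        elapsed += duration if duration > 0 else default_duration
    return offsets


if __name__ == "__main__":
    # Steps 1 to 6: 1 minute, default, then three 10-minute steps, then default.
    durations = [60, 0, 600, 600, 600, 0]
    for step, offset in enumerate(step_start_offsets(durations, 3600), start=1):
        print(f"step {step} starts after {offset // 60} minutes")
```

The sketch ignores acknowledgment conditions and the repeat-forever behavior of step 0; it only shows how one step's duration pushes back the start of the next.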
Remember that in the Action operation form, you can only choose recipients as
Zabbix users and groups, while you still have to specify the media addresses at which
every user can be reached. This is done in the Administration tab of the Zabbix
frontend by adding media instances for every single user. You also need to keep in
mind that every media channel can be enabled or disabled for a user; it may be active
only during certain hours of the day or just for one or more specific trigger severities,
as shown in the following screenshot:
This means that even if you configure an action to send a message, some recipients
may still not receive it based on their own media configuration.
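The effect of these per-user media settings can be sketched in a few lines of Python. This is only a simplified model of the three checks described above (enabled flag, active hours, and severity), not Zabbix's actual implementation; the UserMedia class, its fields, and the would_deliver function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class UserMedia:
    """Hypothetical model of one media entry attached to a user."""
    enabled: bool
    active_from: time
    active_until: time
    severities: frozenset  # severities this media accepts, e.g. {"High"}


def would_deliver(media, severity, at):
    """Return True only if all three per-user checks pass."""
    if not media.enabled:
        return False
    if not (media.active_from <= at <= media.active_until):
        return False
    return severity in media.severities


# An e-mail address that is only active during office hours
# and only for the two highest severities.
office_mail = UserMedia(True, time(8, 0), time(18, 0),
                        frozenset({"High", "Disaster"}))
print(would_deliver(office_mail, "High", time(9, 30)))
print(would_deliver(office_mail, "High", time(22, 0)))
```

In this model, the same action reaches the user at 09:30 but is silently skipped at 22:00, which is exactly the kind of situation to keep in mind when a recipient reports not receiving an alert.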
While Email, Jabber, and SMS are the default options to send messages, you still
need to specify how Zabbix is supposed to send them. Again, this is done in the
Media types section of the Administration tab of the frontend. You can also create
new media types there that will be made available both in the media section of user
configuration and as targets to send messages to in the Action operations form.
If you have more than one server and you need to use them for different purposes
or with different sender identifications, a new media type can be a different e-mail,
Jabber, or SMS server. It can also be a script, and this is where things can become
interesting, if not potentially misleading.
A custom media script has to reside on the Zabbix server in the directory that is
indicated by the AlertScriptsPath variable of zabbix_server.conf. When called
upon, it will be executed with the following three parameters passed by the server,
in this order: the recipient, the subject, and the message body.
The recipient will be taken from the appropriate user-media property that you
defined for your users while creating the new media type. The subject and the
message body will be the default ones configured for the action or some step-specific
ones, as explained before. Then, from Zabbix's point of view, whether it's an old
UUCP link, a modern mail server that requires strong authentication, or a post to an
internal microblogging server, the script should send the message to the recipient
by whatever custom methods you intend to use. The fact is that you can actually
do what you want with the message: you can simply log it to a directory, send it to
a remote file server, morph it to a syslog entry and send it over to a log server, run
a speech synthesis program on it and read it aloud on some speakers, or record a
message on an answering machine. As with every custom solution, the sky's the
limit with custom media types. This is why you should not confuse custom media
with the execution of a remote command. While you could potentially obtain
roughly the same results with one or the other, custom media scripts and remote
commands are really two different things.
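As an illustration of the simplest case mentioned above, logging the message to a file, here is a minimal sketch of a custom media script written in Python. It assumes that the server passes the recipient, the subject, and the message body as the three command-line arguments, in that order. The log_alert function, the log file name, and its location in the system's temporary directory are hypothetical choices made for this example only.

```python
#!/usr/bin/env python3
import os
import sys
import tempfile
from datetime import datetime, timezone

# Hypothetical log location; any path writable by the Zabbix server user will do.
DEFAULT_LOG = os.path.join(tempfile.gettempdir(), "zabbix_custom_media.log")


def log_alert(recipient, subject, message, log_path=DEFAULT_LOG):
    """Append one line per alert to log_path and return the line written."""
    stamp = datetime.now(timezone.utc).isoformat()
    # Collapse newlines and repeated spaces so each alert stays on one line.
    body = " ".join(message.split())
    line = f"{stamp} to={recipient} subject={subject!r} body={body!r}"
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line


# Only act when called with exactly the three expected parameters.
if __name__ == "__main__" and len(sys.argv) == 4:
    log_alert(sys.argv[1], sys.argv[2], sys.argv[3])
```

Note that the script runs on the Zabbix server itself and only decides what to do with the message; it does not execute anything on the monitored host, which is what distinguishes it from a remote command.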
Remote commands
These are normally used to try to perform corrective actions in order to resolve
a problem without human intervention. After you've chosen the target host that
should execute the command, the Zabbix server will connect to it and ask it to
perform it. If you are using the Zabbix agent as a communication channel, you'll
need to set EnableRemoteCommands to 1, or the agent will refuse to execute any
command. Other possibilities include SSH, Telnet, and IPMI (if you have compiled
the relevant options during server installation).
Summary
This chapter focused on what is usually considered the core business of a monitoring
system: its triggering and alerting features. By concentrating separately and
alternately on the two parts that contribute to this function (triggers and actions), it
should be clear to you how, once again, Zabbix's philosophy of separating all the
different functions can give great rewards to the astute user. You learned how to
create complex and sophisticated trigger conditions that will help you have a better
understanding of your environment and have more control over what alerts you
should receive. The various triggering functions and options as well as some of the
finer aspects of item selection, along with the many aspects of action creation, are not
a secret to you now.
In the next chapter, you will explore the final part of Zabbix's core monitoring
components: templates and discovery functions.