Building A Large Scale Machine Learning-Based Anomaly Detection System
INTRODUCTION
It has become a business imperative for high-velocity online businesses to
analyze patterns in their data streams and look for anomalies that can reveal
something unexpected. Most online companies already use data metrics to tell
them how the business is doing, and detecting anomalies in that data can lead to
saving money or creating new business opportunities. This is where the online
world is going: everything is data- and metric-driven to gauge the state of the
business right now.
The types of metrics that companies capture and compare differ from industry to
industry, and each company has its own key performance indicators (KPIs).
What's more, regardless of industry, all online businesses measure the
operational performance of their infrastructure and applications. Monitoring and
analyzing these data patterns in real time can help detect subtle (and sometimes
not-so-subtle) unexpected changes whose root causes warrant investigation.
The challenge lies in knowing which data models or algorithms to apply. Companies
that use the right models can detect even the most subtle anomalies. Those that do
not apply the right models can suffer through storms of false positives or, worse,
fail to detect a significant number of anomalies, leading to lost revenue, dissatisfied
customers, broken machinery or missed business opportunities.
Anodot was founded in 2014 with the purpose of creating a commercial system
for real-time analytics and automated anomaly detection. Our technology has
been built by a core team of highly skilled and experienced data scientists and
technologists who have developed and patented numerous machine learning
algorithms to isolate and correlate issues across multiple parameters in real-time.
The techniques described within this white paper series are grounded in data
science principles and have been adapted or utilized extensively by the
mathematicians and data scientists at Anodot. The veracity of these techniques
has been proven in practice across hundreds of millions of metrics from Anodot's
large customer base. A company that wants to create its own automated
anomaly detection system would encounter challenges similar to those described
within this document series.
Anomalies in one area can affect other areas, but the association might never be
made if the metrics are not analyzed on a holistic level. This is what a large-scale
anomaly detection system should do.
Still, the question arises: why care about anomalies, especially if they seem to be
just blips in the business that appear from time to time? Those blips might
represent significant opportunities to save money (or prevent losing it) and to
potentially create new business opportunities. Consider these real-life incidents.
Supervised learning problems are ones in which the computer can be given a set
of data and a human tells the computer how to classify that data. Future datasets
can then be evaluated against what the machine has already learned. Supervised
learning involves giving the machine many examples and classifications that it
can compare against. This is not feasible for anomaly detection, since it involves a
sequence of data where no one has marked where there are anomalies. The
machine has to detect for itself where the anomalies are, if they even exist.
Thus, a mathematical algorithm that processes the data must be able to detect
an anomaly without being given any examples of what an anomaly is. This is
known as unsupervised machine learning.
WHAT IS AN ANOMALY?
The challenge for both machines and humans is identifying an anomaly. Very
often the problem is ill-posed, making it hard to tell what an anomaly is. Consider
the set of images in Figure 2.
In this collection of dog pictures, which one is the anomaly? Is it the sleeping dog,
because all the others are awake? Is it the dog with the very long fur? Is it the dog
with balls in his mouth, or the one with a tongue hanging out? Depending on
what criteria are used to define an anomaly, there could be many answers to the
question. Thus, there must be some constraints on identifying an anomaly;
otherwise almost any answer can be considered correct.
Viewing these examples, humans would probably categorize the signal changes
as anomalies because humans have some concept in mind as to what
"unexpected" means. We look at a long history of the signal and create a pattern
in our minds of what is normal. Based on this history, we expect certain things to
happen in the future.
Looking at the example in Figure 6 below, we could say that the two days
highlighted in the red box are anomalous because every other day is much
higher. We would be correct in some sense, but it is necessary to see more
history to know whether those days are real anomalies or not. Many definitions and
constraints must go into an algorithm before it can determine, with a high degree of
confidence, what an anomaly is.
But anomaly detection is not impossible. There are certain things that most
people agree on about what it means to be "the same," and that is true for time series
signals as well, especially in the online world where companies measure the number
of users, number of clicks, revenue numbers and other similar metrics. People do
have a notion of what "normal" is, and they understand it because, in the end, it
affects revenue. By understanding normal, they also understand when
something is abnormal. Thus, knowing what an anomaly is isn't completely
philosophical or abstract.
Some characteristics of online machine learning algorithms are that they scale
more easily to more metrics and to large datasets, and that it is not possible to iterate
over the data: once data is read, it is not considered again. This method is more
prone to presenting false-positives because the algorithm gets a data point and
has to produce a result; it cannot go back to fix that result at a later time. We will
go into more detail about online machine learning in Part 2 of this document
series.
Consider the need to learn the average number of users that come to a particular
website every day. There are two ways to learn this. One is to collect all the data
over a period of several months and then compute that average. The alternative
is to compute it as the data comes in. The latter method means that the company
cannot go back and fix issues in what was previously learned about the average.
The algorithm sees the data, uses it and then sets it aside; however, it does not
go back and fix anything if there were anomalies that skewed the calculation of
the average. This example illustrates why online machine learning algorithms
tend to be more prone to false-positives. On the other hand, online learning
methods tend to be more scalable; it is easier to scale them to a much higher
number of data points/metrics to learn on.
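As a minimal illustration of this difference (the function and class names below are our own, not part of any particular system), the batch approach stores all the history and computes the average once, while the online approach updates a running mean with each new data point and then discards it:

```python
# Batch: keep all the history, compute the average once at the end.
def batch_average(daily_users):
    return sum(daily_users) / len(daily_users)

# Online: update the estimate with each new value, never revisiting old data.
class OnlineAverage:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental-mean update; past samples are not stored or corrected later.
        self.mean += (value - self.mean) / self.count
        return self.mean
```

If an anomalous day skews the online estimate, there is no later pass to correct it, which is exactly the trade-off described above.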
RATE OF CHANGE
In our experience, most of the businesses that we work with have constant
change in their metrics. Their environment changes; they release new products
or new versions of their applications, which in turn changes the patterns of how
people use them.
Some systems change very slowly. They tend to be closed systems that are not
impacted by outside events. For example, automated manufacturing processes
tend to be closed systems that do not change much over time. When a company
manufactures some sort of widget, the process is typically fairly steady. The
machine operates with the same amount of force to stamp out each widget; the
widget is heated in a furnace that is within a strict temperature range; the
conveyor belt moves the widgets down the line at a constant speed; and so on
throughout the manufacturing process. A metric with a slow rate of change might
look like the graphic shown in Figure 8 below.
The rate of change has implications on the learning algorithms that an anomaly
detection system should use. If a system changes constantly, as most
online business systems do, then the system needs adaptive algorithms that
take into account that things change. However, if the rate of change is very slow,
the system can collect a year's worth of data and learn what is normal from that
dataset. Then the model should not need to be updated for a long time. For
example, the process to manufacture a widget is not going to change, so the
anomaly detection system can apply an algorithm that does not have to be
adaptive to frequent changes. On the other hand, data pertaining to an e-
commerce website will change frequently because that is just the nature of the
business.
At Anodot, the most interesting problem to solve is the one that changes
frequently, so we utilize highly adaptive algorithms that take this into account.
CONCISENESS
Conciseness means the system takes multiple metrics into account at once for a
holistic look at what is happening.
Consider the analogy of a doctor monitoring a patient's vital signs, such as pulse rate
and body temperature. How does the doctor look at all of these vital signs? If he or she
looks at each measurement by itself, it will not be clear what is going on. When
the pulse rate decreases, by itself, it does not tell the doctor very much. Only the
combination of these metrics can begin to tell a story of what is going on.
In terms of design of the anomaly detection system, there are two methods that
take conciseness into consideration: univariate anomaly detection and
multivariate anomaly detection.
There are downsides to using multivariate anomaly detection techniques. For one
thing, these methods are very hard to scale. They are best when used with just
several hundred or fewer metrics. Also, it is often hard to interpret the cause of
the anomaly. All of the metrics are taken as input, but the output simply says
there is something strange (an anomaly) without identifying which metric(s) it is
associated with. In the healthcare analogy, the doctor would put in the vital signs
and receive only "the patient is sick," without any further explanation as to why.
Without having insight into what is happening with each metric, it is hard to know
which one(s) affect the output, making it hard to interpret the results.
Another technical issue with these multivariate techniques is that they require all
the measured metrics to be somewhat homogeneous in their behavior; i.e., the
signal type must be more or less similar. If the set of signals or metrics behave
very differently from each other, then these techniques tend to not work well.
A HYBRID APPROACH
The univariate method causes alert storms that make it hard to diagnose why
there is an anomaly, and the multivariate methods are hard to apply. Anodot
utilizes a hybrid approach to take advantage of the good aspects of each method,
without the technical challenges they present. Anodot learns what is normal for
each metric by itself, and after detecting anomalies at the single metric level, the
system checks whether it can combine those anomalies into groups and then give
an interpretation to each group.
We never have a model that indicates how all the metrics should behave
together. Instead we have a model for each metric by itself, but when some of
them become anomalous, we look for smart ways to combine related anomalies
into a single incident. This hybrid approach offers a practical way to achieve very
good results. The main challenge in this approach is how to know which metrics
are related to each other. We will describe how to use machine learning
algorithms to automatically discover these relationships in the third part of this
series.
DEFINITION OF INCIDENTS
The last design principle asks the question: are incidents well-defined? While
the answer to this question is typically no, as incidents for an online business
are almost never well-defined, we will cover it because it provides the opportunity
to further discuss supervised versus unsupervised learning and apply it to the
design principle. In addition, it may be that over time, a business can define some
incidents, leading to semi-supervised learning techniques.
A well-defined incident is one in which all (or at least most) of the potential
causes of anomalies can be enumerated. This typically applies to a closed system
with a very limited number of metrics. It might apply, for example, to a relatively
simple machine where the product designers or engineers have written
documentation of what could go wrong. That list could be used to learn how to
detect anomalies. However, for an e-commerce website, it would be a Sisyphean
task to try to list what could go wrong and break those things down to tangible
incidents where mathematical models could be applied.
Can a system still benefit from the few labeled examples a business does have? The
answer is yes. There is a whole field called semi-supervised learning, and this
is a technique that Anodot uses. We collect some feedback in our system to
improve how we learn what is normal and how we detect anomalies, based on a
few examples that we get from our users. This feedback helps refine the assumptions
about what is normal.
SUMMARY
Building an automated anomaly detection system for large scale analytics is a
tremendously complex endeavor. The sheer volume of metrics, as well as data
patterns that evolve and interact, make it challenging to understand what data
models to apply. Numerous algorithm design principles must be considered, and
if any of them are overlooked or misjudged, the system might overwhelm users with
false positives or fail to detect important anomalies that can affect the business. Data
scientists must consider, for example:
Timeliness: how quickly a determination must be made on whether something is an anomaly or not
Scale: how many metrics must be processed, and what volume of data each metric has
Rate of change: how quickly a data pattern changes, if at all
Conciseness: whether all the metrics must be considered holistically when looking for anomalies, or if they can be analyzed and assessed individually
Definition of incidents: how well anomalous incidents can be defined in advance
Anodot has built an anomaly detection system with these design considerations
in mind. In parts 2 and 3 of the series, we will explain how these design
considerations played into building the Anodot system. Part 2 looks at various
ways that an anomaly detection system can learn the normal behavior of time
series data. The concept of what is normal is a critical consideration in deciding
what is abnormal; i.e., an anomaly. Part 3 explores the processes of
identifying and correlating abnormal behavior.
North America
669-600-3120
[email protected]
International
+972-9-7718707
[email protected]
ABOUT ANODOT
Anodot was founded in 2014, and since its launch in January 2016 has been
providing valuable business insights through anomaly detection to its customers
in financial technology (fin-tech), ad-tech, web apps, mobile apps, e-commerce
and other data-heavy industries. Over 40% of the company's customers are
publicly traded companies, including Microsoft, VF Corp, Waze (a Google
company), and many others. Anodot's real-time business incident detection uses
patented machine learning algorithms to isolate and correlate issues across
multiple parameters in real time, supporting rapid business decisions. Learn
more at http://www.anodot.com/.
Copyright 2017, Anodot. All trademarks, service marks and trade names
referenced in this material are the property of their respective owners.
INTRODUCTION
Anomaly detection helps companies determine when something changes in their
normal business patterns. When done well, it can give a company the insight it
needs to investigate the root cause of the change, make decisions, and take
actions that can save money (or prevent losing it) and potentially create new
business opportunities.
This is Part 2 of our white paper series on building a machine learning-based anomaly
detection system for large scale analytics. In Part 1, we outlined the various types of
machine learning and the critical design principles of an anomaly detection
system. We highly recommend reading Part 1 to get the foundational information
necessary to comprehend this document.
In Part 2, we will continue the discussion with information about how systems
can learn what normal behavior looks like, in order to identify anomalous
behavior. Part 3 of our white paper series will cover the processes of identifying
and correlating abnormal behavior. In each of the documents, we discuss the
general technical challenges and Anodot's solutions to these challenges.
The techniques described within this paper are well grounded in data science
principles and have been adapted or utilized extensively by the mathematicians
and data scientists at Anodot. The veracity of these techniques has been proven
in practice across hundreds of millions of metrics from Anodot's large customer
base. A company that wants to create its own automated anomaly detection
system would encounter challenges like those described within this document.
Consider the data pattern in Figure 1 below. The shaded area was produced by a
statistical analysis of the metric's history, for example by assuming a normal
distribution and computing bounds of three standard deviations around the mean.
We could, therefore, apply statistical tests such that any data point outside of the
shaded area is defined as abnormal and anything within it is normal.
Making this assumption means that if the data comes from a known distribution,
then 99.7% of the data points should fall within these bounds. If a data point is
outside these bounds, it can be called an anomaly because the probability of it
happening normally is very small.
This is a very simple model to use and to estimate. It is well known and taught in
basic statistics classes, requiring only computation of the average and the
standard deviation. However, assuming any type of data will behave like the
normal distribution is naïve; most data does not behave this way. This model is,
therefore, simple to apply, but usually much less accurate than other models.
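A minimal sketch of this simple model (assuming the data really were normally distributed) estimates the mean and standard deviation from history and flags anything outside the mean plus or minus three standard deviations, the roughly 99.7% band:

```python
import numpy as np

def three_sigma_anomalies(series):
    """Return (index, value) pairs that fall outside mean +/- 3 standard deviations."""
    x = np.asarray(series, dtype=float)
    mu, sigma = x.mean(), x.std()
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [(i, v) for i, v in enumerate(x) if v < lower or v > upper]
```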
At Anodot, we look at vast amounts of time series data and see a wide variety of
data behaviors, many kinds of patterns, and diverse distributions that are
inherent to that data. There is not one type of distribution that fits all possible
metrics. There has to be some way to classify each signal to decide which should
be modeled with a normal distribution, and which should be modeled with a
different type of distribution and technique.
Choosing just one model does not work, and we have seen it even within a single
company when they measure many different metrics. Each metric behaves
differently. In Part 1 of this document series, we used the example of a person's
vital signs operating as a complete system. Continuing with that example, the
technique for modeling the normal behavior of a person's heart rate may be very
different from that which models his or her temperature reading.
For companies that choose to build their own anomaly detection system, this is
often where the first part of the complexity comes into play. Most open source
techniques deal with "smooth metrics." These metrics may not follow a normal
distribution, but they tend to be very regularly sampled and to have stationary
behavior; their behavior does not change rapidly or exhibit other irregularities.
Applying open source techniques therefore only covers a fraction of what is
measured, and if they are applied to metrics that are not smooth, the result will
either be a lot of false positives or many anomalies that go undetected (i.e.,
false negatives), because the metric is fitted with the wrong model.
Not everything is smooth and stationary, and those models only work on a
fraction of the metrics. Worse, it is difficult to know which metrics are like this.
Those datasets would somehow have to be identified.
Consider the pattern in the signal shown in Figure 4. If the smooth techniques are
applied on this data, the little spikes that seem completely normal would be
considered anomalous and would generate alerts every minute. The smooth
model would not work here.
Knowing what a data pattern looks like in order to apply an appropriate model is
a very complex task.
There is another aspect we have observed quite often with the data we see from
our customers: the model that is right today may not be right tomorrow. In
Figure 5, we see how a metric's behavior can change overnight.
We have seen this happen many times, and each time, it was totally unexpected;
the data starts out one way and then goes into a totally different mode. It may
start out somewhat smooth and then change to steep peaks and valleys, and stay there.
That pattern becomes the new normal. It is acceptable to say at the beginning of
the new pattern, that the behavior is anomalous, but if it persists, we must call it
the new normal.
Let us consider how this affects the company building its own detection system.
The company's data scientist might spend several weeks classifying the data for the
company's 1,000 metric measurements and determining an appropriate model for
each metric. It could be that a week from now, that classification is irrelevant for
some of them, but it may not be clear for which ones.
We know that many different metrics that are measured have seasonal patterns,
but the pattern might be unknown. Nevertheless, it is important to take the
seasonal pattern into consideration for the model. Why? If the model of what is
normal accounts for a metric's seasonal pattern, then it is possible to
detect the anomalies in samples that vary from the seasonal pattern. Without
considering the seasonal pattern, too many samples might be falsely identified as
anomalies.
Often we see not just a single seasonal pattern but multiple seasonal patterns,
and even different types of multiple seasonal patterns, like the two examples
shown in Figure 7.
Figure 7 shows an example of a real metric with two seasonal patterns working
together at the same time. In this case, they are weekly and daily seasonal
patterns. The image shows that Fridays and weekends tend to be lower, while the
other days of the week are higher. There is a pattern that repeats itself week after
week, so this is the weekly seasonal pattern. There is also a daily seasonal pattern
that illustrates the daytime hours and nighttime hours; the pattern tends to be
higher during the day and lower during the night.
These two patterns are intertwined in a complicated way. There is almost a sine
wave for the weekly pattern, and another faster wave for the daily pattern. In
signal processing, this is called amplitude modulation, and it is normal for this
metric. If we do not account for the fact that these patterns co-exist, then we do
not know what normal is. If we know how to detect it and take it into account, we
can detect very fine anomalies like the ones shown in orange in Figure 7 above.
The values in orange indicate a drop in activity which may be normal on a
weekend but not on a weekday. If we do not know how to distinguish between these
patterns, we will not understand the anomaly, so we will either miss it or create
false-positives.
In the example above, we see a clear daily pattern. In addition, we see an event
that occurs every four hours which causes a spike that lasts for an hour and then
comes down. The spikes are normal because of a process or something that
happens regularly. The orange line shows an anomaly that would be very hard to
detect if we did not take into account that there is both the daily pattern and the
spikes every four hours. We call this pattern additive because the spikes are
added to what normally happens during the day; the pattern shows a consistent
spike every four hours on top of the daily pattern.
If we assume there are no seasonal patterns in any of the metrics and we apply
standard techniques, we are either going to be very insensitive to anomalies or
too sensitive to them, depending on what technique we use. However, making
assumptions about the existence of a seasonal pattern has its issues as well.
Second, if the wrong seasonal pattern is assumed, the resulting normal model
may be completely off. For example, if the data is assumed to have a daily
seasonal pattern but actually follows a 7-hour pattern, then comparing 8 AM one day
to 8 AM another day is not relevant. We would need to compare 8 AM one day to
3 PM that same day. Improperly defining the seasons will lead to many false-
positives due to the poor initial baseline.
Figure 9 Comparing a 7-hour seasonal pattern with an assumed 24-hour seasonal pattern
For this reason, some tools require the user to define the season in order for the
tool to estimate the baseline. Of course, this is not scalable for more than a few
dozen metrics. What is needed is a system that will automatically and accurately
detect seasonality (if it exists). If this capability is not built into the system,
assumptions will have to be made that are going to cause an issue, either from
the statistics side in needing more data, or from the accuracy side in identifying
anomalies.
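One common way to detect seasonality automatically (a sketch of the general idea, not necessarily the method Anodot uses) is to look for the lag at which a metric is most strongly correlated with itself; the parameters below assume hourly samples and are illustrative only:

```python
import numpy as np

def detect_season(series, max_lag=24 * 14, threshold=0.5):
    """Return the lag (number of samples) of the strongest seasonal period, or None."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    best_lag, best_corr = None, threshold
    for lag in range(2, min(max_lag, len(x) // 2)):
        # Autocorrelation at this lag: correlate the series with a shifted copy of itself.
        corr = np.corrcoef(x[:-lag], x[lag:])[0, 1]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```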
This underscores the need for online adaptive learning algorithms which learn
the model with every new data point that comes in. This type of learning
algorithm does not wait to receive a batch of data points to learn from; rather, it
updates what has been learned so far with every new data point that arrives.
These so-called online learning algorithms do not have to be adaptive, but by
nature they usually are, which means that every new data point changes what
has been learned up to that time.
We can contrast an online learning model to a model that uses data in batch
mode. For example, a video surveillance system that needs to recognize human
images will learn to recognize faces by starting with a dataset of a million pictures
that includes faces and non-faces. It learns what a face is and what a non-face is
in batch mode before it starts receiving any real data points.
1. What do we mean by "online" machine learning? This is not a reference to the Internet or the World Wide Web.
Rather, "online" is a data science term that means the learning algorithm takes every data point, uses each one to
update the model and then does not concern itself with that data point ever again. The algorithm never looks back at
the history of all the data points, but rather goes through them sequentially.
It is not necessarily real-time because time is not a factor here. In an e-commerce example, time can be a factor, but in
the general sense of an online learning algorithm, it just means that if there are 1,000 data points to learn from, the
algorithm goes through them one by one to learn from them, throws them away and then moves on to the next one. It
is more of a sequential learning algorithm. The word "online" is widely used in the world of data science but it has
nothing to do with the Internet; this is simply the term used in literature. For more information, see the Wikipedia entry
about online machine learning.
The machine never goes back to previously viewed data points to put them into a
current context. The machine cannot say, "Based on what I see now, I know that
this data point from five days ago is actually not an anomaly." It cannot consider,
"Maybe I should have done something different." The luxury of going back in time
and reviewing the data points again does not exist, which is one of the
shortcomings of this paradigm. The advantage of this approach is that it is fast
and adaptive; it can produce a result now and there is no need to wait to collect a
lot of data before results can be produced. In cases where a rapid analysis is
needed, the advantages of this approach far outweigh its disadvantages.
There are various examples of online adaptive learning models that learn the
normal behavior of time series data, which can be found in the data science,
statistics and signal processing literature. Among them are the Simple Moving
Average, Double/Triple Exponential Smoothing (Holt-Winters), and Kalman Filters +
ARIMA and their variations.
The following is an example of how a simple moving average is calculated and how
it is applied to anomaly detection. We want to compute the average over a time
series, but we do not want the average from the
beginning of time until present. Instead, we want the average during a window of
time because we know we need to be adaptive and things could change over
time. In this case, we have a moving average with a window size of seven days,
and we measure the metric every day. For example, we look at the stock price at
the end of every trading day. The simple moving average would compute the
average of the stock price over the last seven days. Then we compare tomorrow's
value to that average and see if it deviates significantly. If it does deviate
significantly from the average value, it is an anomaly and if not, then it is not an
anomaly. Using a simple moving average is a straightforward way of considering
whether we have an anomaly or not.
The other models listed above are (much) more complex versions of that, but if
one can understand a simple moving average, then the other models can be
understood as well.
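To make the stock price example concrete, here is a minimal sketch of a moving-average detector with a seven-day window; the three-standard-deviation threshold is an illustrative assumption, not part of the original example:

```python
from collections import deque
import statistics

class MovingAverageDetector:
    def __init__(self, window=7, n_sigmas=3.0):
        self.values = deque(maxlen=window)   # sliding window of recent observations
        self.n_sigmas = n_sigmas

    def is_anomaly(self, value):
        if len(self.values) < self.values.maxlen:
            self.values.append(value)
            return False  # not enough history yet to judge
        mean = statistics.mean(self.values)
        std = statistics.stdev(self.values)
        anomalous = std > 0 and abs(value - mean) > self.n_sigmas * std
        self.values.append(value)  # the window slides forward either way
        return anomalous
```

Fed one closing price per day, the detector compares each new price against the average of the previous seven and flags large deviations.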
If our learning rate is too slow, meaning our moving average window is very large,
then we would adapt very slowly to any changes in that stock price. If there are
big changes in the stock price, then the baseline (the confidence interval of
where the average should be) will be very large, and we will be very insensitive
to changes.
If we make the rate too fast, i.e., the window is very small, then we will adapt
too quickly and we might miss things. We might think that anomalies are not
anomalies because we are adapting to them too quickly.
How do we know what the learning rate should be? If we have a small number of
time series, say 100 or fewer, we could manually inspect and change the parameters as
needed. However, a manual method will not work when we have a large number
of time series, so the algorithms need to automatically tune themselves.
There are many different metrics and each one has its own behavior. The rate at
which these metrics change could be fast or slow depending on what they are;
there is no one set of parameters that fits them all well. Auto-tuning these
parameters is necessary to provide an accurate baseline for millions of metrics.
This is something that is often overlooked by companies building anomaly
detection systems (incidentally, auto-tuning is built into the Anodot system).
Auto-tuning is not an easy task, but it is an important one for achieving more
accurate results.
There is another pitfall to be aware of. If we have a metric and we tune the
learning rate to fit it well when it behaves normally, what happens when there is
an anomaly?
Consider a scenario where we have a data point that is an anomaly. Recall that
the three steps of online learning are to read the sample, update the model using
the data point, and move on to the next data point. What happens if a data point
is an anomaly? Do we update the model with the new data point or not?
Updating the model with every data point (including anomalous ones) is one
strategy, but it is not a very good one.
An example of this would be a company that does a stock split. All of a sudden,
the stock price is cut in half; instead of it being $100 a share, it suddenly drops to
$50. It will stay around $50 for a while and the anomaly detection system must
adapt to that new state. Identifying that the drop is an anomaly is not a bad thing,
especially if we are unaware there was a split, but eventually we want our normal
value to go down to around that $50 state.
In the online world, these types of changes happen a lot. For example, a company
has a Web application and after a large marketing campaign, the number of users
quickly increases 25 percent. If the campaign was good, the number of users may
stay elevated for the long term. When a SaaS company adds a new customer, its
application metrics will jump, and that is normal. They might want to know about
that anomaly in the beginning, but then they will want the anomaly detection
system to learn the new normal.
These kinds of events happen frequently; we cannot simply ignore them by never
allowing those data points to affect the model. On the
other hand, we do not want the system to learn too quickly; otherwise all
anomalies will be very short, and if the metric goes back to the previous normal state,
measurements will be off. There is a fine balance here: how fast we learn versus
how adaptive we are.
In the Anodot system, when we see anomalies, we adapt the learning rate in the
model by giving the anomalous data points a lower weight. If the anomaly
persists for a long enough time, we begin to apply higher and higher weights until
the anomalous data points have a normal weight like any other data point, and
then we model to that new state. If it goes back to normal, then nothing happens;
it just goes back to the previous state and everything is okay.
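The following is a simple sketch in the spirit of that idea (it is not the actual Anodot model; the specific weights and rates are assumptions): an exponentially weighted moving average in which anomalous points initially get a low weight that grows back toward normal if the anomaly persists.

```python
class AnomalyAwareAverage:
    def __init__(self, alpha=0.1, anomaly_weight=0.1, recovery=2.0):
        self.alpha = alpha                    # normal learning rate
        self.anomaly_weight = anomaly_weight  # initial weight given to an anomalous point
        self.recovery = recovery              # how fast the weight grows if the anomaly persists
        self.mean = None
        self.current_weight = 1.0

    def update(self, value, is_anomaly):
        if self.mean is None:
            self.mean = value
            return self.mean
        if is_anomaly:
            if self.current_weight >= 1.0:
                self.current_weight = self.anomaly_weight  # first anomalous point: low weight
            else:
                # The anomaly persists: gradually restore the weight toward normal.
                self.current_weight = min(1.0, self.current_weight * self.recovery)
        else:
            self.current_weight = 1.0                      # back to normal: full weight again
        self.mean += self.alpha * self.current_weight * (value - self.mean)
        return self.mean
```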
These two approaches to updating the learning rate are shown below. In Figure
11, the model is updated without weighting the anomalies. In this instance, most
of the anomaly is actually missed by the model being created.
SUMMARY
This document outlines a general framework for learning normal behavior in a
time series of data. This is important because any anomaly detection system needs a
model of normal behavior to determine whether a new data point is normal or
abnormal.
There are many patterns and distributions that are inherent to data. An anomaly
detection system must model the data, but a single model does not fit all metrics.
It is especially important to consider whether seasonality is present in the data
pattern when selecting a model.
In Part 3 of this series, we will look at the processes of identifying and correlating
abnormal behavior, which help to distill the list of anomalies down to the most
significant ones that warrant investigation. Without these important processes, a
system could identify too many anomalies to investigate in a reasonable amount
of time.
INTRODUCTION
Many high-velocity online business systems today have reached a point of such
complexity that it is impossible for humans to pay attention to everything
happening within the system. There are simply too many metrics and too many
data points for the human brain to discern. Most online companies already use
data metrics to tell them how the business is doing, and detecting anomalies in
the data can lead to saving money or creating new business opportunities. Thus,
it has become imperative for companies to use machine learning in large scale
systems to analyze patterns of data streams and look for anomalies.
Consider an airline's pricing system that calculates the price it should charge for
each and every seat on all of its routes in order to maximize revenue. Seat pricing
can change multiple times a day based on thousands of factors, both internal and
external to the company. The airline must consider those factors when deciding
to increase, decrease or hold a fare steady. An anomaly in any given factor can be
an opportunity to raise the price of a particular seat to increase revenue, or lower
the price to ensure the seat gets sold.
Here in Part 3, the final document of our white paper series, we will cover the
processes of identifying, ranking and correlating abnormal behavior. Many of the
aspects we discuss in this document are unique to Anodot, such as ranking and
scoring anomalies and correlating metrics together. Most other vendors that
provide anomaly detection solutions do not include these steps in their
analysis, and we believe they are a real differentiator and a major reason why
Anodot's solution goes beyond merely bringing accurate anomalies to light with
minimal false positives and negatives; it also puts them into the context of the full
story to provide actionable information.
Steps 1 and 2 were covered in detail in the previous two white papers. This
document covers steps 3 and 4. Step 5 is not in the scope of this white paper
series.
In our earlier documents, we used the example of the human body as a complex
system with many metrics and data points for each metric. Body temperature is
one of those metrics; an individual's body temperature typically changes by
about a half to one degree between its highest and lowest points each day. A
slight temperature rise to, say, 37.8°C (100.0°F), would be anomalous but not a
cause for great concern, as taking an aspirin might help lower the temperature
back to normal. However, an anomalous rise to 40°C (104.0°F) will certainly
warrant a trip to the doctor for treatment. These are both anomalies, but one is
more significant than the other in terms of what it means within the overall
system.
Looking at a series of anomalies with the human eye, one could posit what is more
or less significant based on intuition, and this method can be encoded into an
algorithm.
Having such a score provides the ability to filter anomalies based on their
significance. In some cases, the user would want to be alerted only if the score or
the significance is very high; and in other cases, the user would want to see all
anomalies. For example, if a business is looking at a metric that represents the
company's revenue, then the user would probably want to see anomalies
pertaining to anything that happens, even if they are very small. But if the same
business is looking at the number of users coming into its application from a
specific location like Zimbabwe (assuming the company doesn't do a lot of
business in Zimbabwe), then maybe the user only wants to see the big
anomalies; i.e., highly significant anomalies. In the Anodot system, this is
configured using a simple slider as seen in Figure 2a.
The user needs this input mechanism because all the anomaly detection is
unsupervised, and the system has no knowledge of what the user cares about
more.
Note that the significance slider in the Anodot system does not adjust the
baseline or the normal behavior model; it only defines which anomalies the user
chooses to consume. This helps users focus on what is most important to them,
preventing alert fatigue. If there are too many alerts, such as one for every single
anomaly, the alerts eventually become overwhelming and meaningless.
Scoring occurs through machine learning since the scores are relative to past
anomalies of that metric, not an absolute value.
Now suppose the big spike, the one labeled 90, was not there. Without a
significant anomaly to compare to, the other anomalies would look bigger, more
significant. In fact, we would probably change the scale of the graph.
This is an important distinction because there are other scoring mechanisms that
look at the absolute deviation without context of what happened in the past.
Anodot initially took this approach but we saw quickly, from a human
perspective, that when people look at a long history of a time series and see the
anomalies within it, in their minds they consider the anomalies relative to each
other as well as relative to normal. Anodot's algorithms now mimic this human
thought process using probabilistic Bayesian models.
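As a simple illustration of scoring relative to past anomalies (this sketch only captures the intuition; it is not the probabilistic Bayesian models Anodot uses), an anomaly's deviation can be ranked against the deviations of past anomalies on the same metric, so the same absolute spike scores lower if bigger spikes have been seen before:

```python
def relative_score(deviation, past_deviations):
    """Return a 0-100 score: the percentile of this deviation among past anomaly deviations."""
    if not past_deviations:
        return 100.0  # the first anomaly on a metric looks maximally significant
    not_larger = sum(1 for d in past_deviations if d <= deviation)
    return 100.0 * not_larger / len(past_deviations)
```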
In the screenshot in Figure 2b, the significance slider is set to 70, meaning that
only the orange anomalies would be alerted on, and not the gray ones, which fall
below that score.
Figure 2a, The significance slider in the Anodot system lets users select the level of anomalies to be alerted
on.
Figure 2b, With significance set at 70, users would be alerted on the two orange alerts that are above 70,
but not the smaller gray alerts below 70.
If there are many anomalies at the single metric level and they are not combined
into a story that describes the whole incident, then it is very hard to understand
what is going on. However, combining them into a concise story requires an
understanding of which metrics are related, because otherwise the system runs
the risk of combining things that are completely unrelated. The individual metrics
could be anomalous at the same time just by chance.
Behavioral topology learning provides the means to learn the actual relationships
among different metrics. This type of learning is not well-known; consequently,
many solutions do not work this way. Moreover, finding these relationships at
scale is a real challenge. If there are millions of metrics, how can the relationships
among them be discovered efficiently?
As shown in Figure 3, there are several ways to figure out which metrics are
related to each other.
The first method of relating metrics to each other is abnormal based similarity.
Intuitively, human beings know that when something is anomalous, it will
typically affect more than one key performance indicator (KPI). In the other
papers in this series, we have been using the example of the human body. When
someone has the flu, the illness will affect his or her temperature, and possibly
also heart rate, skin pH, and so on. Many parts of this system, the body, will be
affected in a related way.
Based on these intuitions, one can design algorithms that find the abnormal
based similarity between metrics. One way to find abnormal based similarity is to
apply clustering algorithms. One possible input to the clustering algorithm would
be the representation of each metric as anomalous or not over time (vectors of
0s and 1s); the output is groups of metrics that are found to belong to the same
cluster. There are a variety of clustering algorithms, including K-means,
hierarchical clustering and the Latent Dirichlet Allocation algorithm (LDA). LDA is
one of the more advanced algorithms, and Anodot's abnormal based similarity
processes have been developed on LDA with some additional enhancements.i
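As an illustration only (Anodot's enhanced version is not shown here), the basic idea can be sketched with scikit-learn's LDA implementation by treating each time bucket as a "document" and each metric as a "word" that appears in the bucket when it was anomalous; the topics then act as soft groups of metrics that tend to become anomalous together:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def soft_group_metrics(anomaly_matrix, n_groups=10, seed=0):
    """anomaly_matrix: array of shape (n_time_buckets, n_metrics), 1 where the metric was anomalous."""
    lda = LatentDirichletAllocation(n_components=n_groups, random_state=seed)
    lda.fit(anomaly_matrix)
    # components_[k, m] is the (unnormalized) weight of metric m in group k;
    # normalizing per metric gives a soft membership of each metric across groups.
    weights = lda.components_
    return weights / weights.sum(axis=0, keepdims=True)  # shape (n_groups, n_metrics)
```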
One advantage LDA has over most other clustering algorithms is that those
algorithms allow a data point (or a metric, in this case) to belong to only
one group. There could be hundreds of different groups, but in the end, a metric
will belong to just one. Often, it is not that clear-cut. For example, a mobile
app's latency metric could be in a group with the metric related to the
application's revenue, but it could also be related to the latency of that app on
desktops alone. By using clustering algorithms that force a choice of just one
group, the system might miss out on important relationships. LDA clusters things
in such a way that they can belong to more than one group, i.e., "soft clustering"
as opposed to "hard clustering."
Another advantage is that most clustering algorithms rely on a distance function
between the items being measured that captures how similar they are. The LDA
algorithm allows a metric to be only partially similar to the other metrics. This
comes back to the softness of the algorithm: it allows partial similarity for a metric
to still belong to a group. In the context of learning metric relationships,
this is an important feature because, for example, application latency doesn't
always have to be anomalous when the revenue is anomalous. It is not always the
case that latency goes up anomalously and revenue goes down, and there can be
times when the revenue becomes anomalous but the latency does not go up or
down accordingly. The anomaly detection system must be able to take that
partiality into account.
The primary issue with abnormal based similarity is that it does not scale well;
we discuss scaling later in the paper. In addition, it requires seeing enough
historical data containing anomalies so it can capture these relationships. Are
there additional types of information that can help capture the metric topology
with less (or no) history? We will discuss two additional methods of capturing
relationships between metrics next.
NAME SIMILARITY
Consider an app, XYZ, that is available in multiple countries. The name for the
metric that measures its revenue in Germany might be something like
appName=XYZ.Country=Germany.what=revenue, and the name of the corresponding
revenue metric in another country would differ only in the country term. By looking
at the similarity between two such metric names, we have a measure of how similar
the metrics are. If
they are very similar, then we say they should be grouped because they probably
describe the same system. It is reasonable to associate metrics using this
method; it is essentially based on term similarity, by comparing terms to see
whether they are equal and how much overlap they have.
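A minimal sketch of this term-based comparison follows (the helper and the assumed counterpart metric name for another country are hypothetical, and a TF-IDF weighting of the terms would be a natural refinement):

```python
import re

def name_similarity(name_a, name_b):
    """Jaccard similarity between the sets of terms in two metric names."""
    terms_a = set(re.split(r"[^A-Za-z0-9]+", name_a.lower())) - {""}
    terms_b = set(re.split(r"[^A-Za-z0-9]+", name_b.lower())) - {""}
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

# Hypothetical counterpart metric for another country; the two names share most terms.
print(name_similarity("appName=XYZ.Country=US.what=revenue",
                      "appName=XYZ.Country=Germany.what=revenue"))
```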
NORMAL BEHAVIOR SIMILARITY
Another way to relate metrics is by the similarity of their normal behavior, for
example by correlating the metrics after removing their trends. It is also necessary
to remove seasonal patterns from the metrics; otherwise, anything with a seasonal
pattern will be correlated with anything else that has the same seasonal pattern. If two
metrics both have a 24-hour seasonal pattern, the result will be a very high
similarity score regardless of whether they are related or not. In fact, many
metrics do have the same seasonal patterns but they are not related at all. For
instance, we could have two online apps that are not related, but if we look at the
number of visitors to both apps throughout the day, we will see the same pattern
because both apps are primarily used in the US and have the same type of users.
It could be the XYZ app and a totally unrelated news application.
Unlike abnormal based similarity, which creates very few false positives but
depends on anomalies happening (which occurs rarely) and thus requires more
time to pass, normal behavior similarity requires much less data in order to be
computed. However, if it is not done right, e.g., if the data patterns are not
de-trended and de-seasonalized, this method could create many false positives.
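A minimal sketch of such a computation, assuming hourly samples and a known 24-hour seasonal period (both assumptions made for the sake of the example): remove a simple trend and the daily pattern from each metric, then correlate the residuals.

```python
import numpy as np

def deseasonalize(series, period=24):
    """Subtract the per-phase average (a simple form of seasonal removal)."""
    x = np.asarray(series, dtype=float)
    seasonal = np.array([x[i::period].mean() for i in range(period)])
    return x - np.tile(seasonal, len(x) // period + 1)[: len(x)]

def normal_behavior_similarity(series_a, series_b, period=24):
    # Differencing removes slow trends; corrcoef gives the Pearson correlation.
    resid_a = np.diff(deseasonalize(series_a, period))
    resid_b = np.diff(deseasonalize(series_b, period))
    return np.corrcoef(resid_a, resid_b)[0, 1]
```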
The Pearson correlation is a simple algorithm and is quite easy to implement, but
there are better approaches that are less prone to false positives, such as the
pattern dictionary based approach. Suppose each time series metric can be
partitioned into segments, where each segment is classified to one of N
prototypical patterns that are defined in a dictionary of known patterns like a
daily sine wave, a saw tooth, a square wave-like pattern, or other classifiable
shapes. Once the user has a dictionary of typical shapes, he or she can describe
each metric based on what shapes appeared in it at each segment.
As an example, from 8 AM to 12 PM, the metric had shape number 3 from the
dictionary of shapes, and from 12 PM to 5 PM, it had shape number 10. This
sequence of shapes becomes a compact description of the metric, and metrics whose
shape sequences closely match can be considered similar.
The main challenge in the shape dictionary based approach is how to create the
dictionary. A variety of algorithms can be employed for learning the dictionary,
but they all follow a similar approach: Given a (large) set of time series metric
segments, apply a clustering technique (or soft clustering technique such as LDA)
on all the segments, and then use the representations of the clusters as the
dictionary of shapes. Given a new segment of a metric, find the most
representative cluster in the dictionary and use its index as the new
representation of the segment.
One of the most promising algorithms tested at Anodot for creating such a
dictionary is a Neural-Network based approach (Deep Learning), namely, Stacked
Autoencoders. Stacked autoencoders are a multi-layer Neural Network designed
to discover a high-level representation of the input vectors in the form of activations
of the output nodes. Training stacked autoencoders is done with a set of
segments of the time series; the activated nodes at the output of the network are
the dictionary representing prototypical shapes of the input segments. The
details of implementing this deep learning technique to accomplish this task are
out of the scope of this white paper.
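For illustration only, here is a minimal sketch of the idea in PyTorch (the layer sizes, segment length and training settings are arbitrary assumptions; it is not Anodot's implementation): an autoencoder is trained to reconstruct time series segments, and the learned bottleneck activations serve as a segment's shape representation.

```python
import torch
import torch.nn as nn

SEG_LEN = 48  # assumed number of samples per segment

class StackedAutoencoder(nn.Module):
    def __init__(self, seg_len=SEG_LEN, code_size=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(seg_len, 32), nn.ReLU(),
                                     nn.Linear(32, code_size), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code_size, 32), nn.ReLU(),
                                     nn.Linear(32, seg_len))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

def train_shape_dictionary(segments, epochs=200, lr=1e-3):
    """segments: float tensor of shape (n_segments, SEG_LEN), already normalized."""
    model = StackedAutoencoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        reconstruction, _ = model(segments)
        loss = loss_fn(reconstruction, segments)  # learn to reconstruct the input segments
        loss.backward()
        optimizer.step()
    return model  # model.encoder(segment) yields the segment's compact shape code
```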
USER INPUT
The second method is indirect input, in which the user manipulates the metrics to
create new metrics out of them. If there is revenue of XYZ app in multiple
countries, the user can now create a new metric by calculating the sum of the
revenue from all the countries. It can be assumed that if it makes sense to create
a composite metric of multiple metrics, then the individual metrics are likely
related to each other.
A MATTER OF SCALE
Of the various methods discussed above, one of the major challenges is scale. How
can these comparisons be applied at very large scale? The algorithm-based methods
are computationally expensive when there are a lot of metrics to work with. It either
requires a lot of machines or a lot of time to get results. How can it be done
efficiently on a large scale, such as a billion metrics?
One method is to group the metrics. We would start with one billion metrics
sorted into 100 different groups that are roughly related to each other. We can then
go into each group and perform the heavy computation, because each group now
contains far fewer metrics. If we have a group of one million metrics and we separate
them into 10 groups, we end up with 10 groups of 100,000 metrics each, which is a
much smaller, more manageable number. A mechanism is needed to enable fast and
accurate partitioning.
How can this be done without knowing what things are similar? A locality
sensitive hashing (LSH) algorithm can help here. For every metric a company
measures, the system computes a hash value that determines which group it
belongs to. Then, additional algorithms can be run on each group separately. This
breaks one big problem into a lot of smaller problems that can be parsed out to
different machines for faster results. This methodology does have a certain
probability of false positives and false negatives; however, the system can be tuned
depending on how many false positives and false negatives users are
willing to tolerate.
In this case, false positive means that two things are grouped together, despite
not exhibiting characteristics that would cause them to be grouped together.
False negative means that two things are put into separate groups when they
should be in the same group. The tuning mechanism allows the user to specify
the size of the groups based on the total number of metrics, as well as the
tolerance of false positives and false negatives that he or she is willing to accept.
One way to reduce the number of false negatives is to run the groups through
the algorithms a few times, changing the size of the group each time. If the
groups are small enough, they can run rapidly while not being computationally
expensive.
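A minimal sketch of the partitioning idea using random-hyperplane LSH (the per-metric feature vectors assumed here and the parameter choices are illustrative, not Anodot's production design):

```python
import numpy as np
from collections import defaultdict

def lsh_group(metric_vectors, num_bits=10, seed=0):
    """metric_vectors: dict mapping metric name -> 1-D numpy array (all the same length)."""
    rng = np.random.default_rng(seed)
    dim = len(next(iter(metric_vectors.values())))
    planes = rng.normal(size=(num_bits, dim))  # random hyperplanes
    groups = defaultdict(list)
    for name, vec in metric_vectors.items():
        # Each bit records which side of a hyperplane the vector falls on; similar
        # vectors tend to share the same signature and thus land in the same group.
        signature = tuple(bool(b) for b in (planes @ vec) > 0)
        groups[signature].append(name)
    return groups
```

Increasing num_bits yields more, smaller groups (fewer false positives, more false negatives), and decreasing it does the opposite, which mirrors the tuning trade-off described above.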
Any large-scale system with a high number of metrics will yield many anomalies,
perhaps too many for the business to investigate in a meaningful time frame.
This is why all of the steps discussed across our series of three white papers are
important. Each step helps reduce the number of anomalies to a manageable
number of truly significant insights. This is illustrated in Figure 4 below.
This chart illustrates the importance of all the steps in an anomaly detection
system: normal behavior learning, abnormal behavior learning, and behavioral
topology learning. Consider a company that is tracking 4 million metrics. Out of
this, we found 158,000 single metric anomalies in a given week, meaning any
anomaly on any metric. This is the result of using our system to do anomaly
detection only at the single metric level, without anomaly scoring and without
metric grouping. Without the means to filter things, the system gives us all the
anomalies, and that is typically a very large number. Even though we started with
4 million metrics, 158,000 is still a very big number, too big to effectively
investigate; thus, we need the additional techniques to whittle down that
number.
If we look at only the anomalies that have a high significance score, in this case a
score of 70 or above, the number of anomalies drops off dramatically, by orders of
magnitude, to just over 910. This is the number of significant anomalies we had for
single metrics out of 4 million metrics for one week: 910 of them.
Better, but still too many to investigate thoroughly.
The bottom of the funnel shows how many grouped anomalies with high
significance we end up with after applying behavioral topology learning techniques.
This is another order-of-magnitude reduction, from 910 to 147. This number of
anomalies is far more manageable to investigate. Any organization with 4 million
metrics is large enough to have numerous people assigned to dig into the
distilled number of anomalies, typically looking at those anomalies that are
relevant to their areas of responsibility.
Figure 4 does not necessarily show the accuracy of the anomalies; rather, it
shows why all these steps are important; otherwise the number of anomalies can
be overwhelming. Even if they are good anomalies (they found the right things),
it would be impossible to investigate everything in a timely manner. Users
would simply stop paying attention because it would take them a long time to
understand what is happening. This demonstrates the importance of grouping:
really reducing the forest of 158,000 anomalies into 147 grouped anomalies per
week. This goes back to the notion of conciseness covered in the design
principles white paper (part 1 of this series). Concise anomalies help to tell the
story of what is happening without being overwhelming, enabling a human to
investigate more quickly. Then the business can take advantage of an opportunity
that might be presented through the anomaly, or take care of any problem that
the anomaly has highlighted.
The flow of data comes from Customer Data Sources, as shown at the bottom of
the illustration, into what we call Anodotd, or Anodot Daemon, which does the
learning. When a data point comes in from a metric, the system already has the
pre-constructed normal model for that metric in its memory. If there is an
anomaly, it scores it using the abnormal model and sends it to the Anomaly
Events Queue (we use Kafka) on the left side of the illustration. If there is no
anomaly, Anodotd simply updates the model that it has so far and stores that
model in the database.
Many machine learning systems do not work this way. They pull data from a
database, do their learning and then push the data back to a database. However,
if you want the system to scale and find anomalies on 100% of the metrics,
because it is unknown which metrics are important, then the learning must be
done on all the samples that come in. If the data has already been stored in a
database and then must be pulled out in order to do the learning, the system will
not be able to scale up. There is no database system in the world that can both read
efficiently and write rapidly at this scale. Enlarging the database system is a
possibility, but it
will increase costs significantly. Certainly, to get the system to scale, learning
must be done on the data stream itself.
On the right side of Figure 5 is Hadoop/Spark HIVE offline learning. There are
some processes that Anodot runs offline, for example, the behavioral topology
learning or seasonality detection can be run offline; we do not have to run this
process on the data stream itself. Discovering that one metric is related to
another is not something that will change from data point to data point. Finding
that something has a weekly seasonal pattern does not have to be detected on
every data point that comes in for that metric. There is a price to pay when
processes run on the data stream, often in the form of accuracy. With online
learning, there is no luxury of going back and forth; thus, Anodot performs these
activities offline. This combination of online and offline learning optimizes
accuracy and efficiency.
Not all anomaly detection systems have all these components, but Anodot
believes they are all important to yield the best results in an efficient and timely
manner.
Figure 7: A screenshot of the Anodot Anoboard Dashboard showing anomalies detected in time series
data
At a minimum, you will need a team of data scientists with a specialty in time
series data and online machine learning. Just as chefs and doctors have their own
specialties, data scientists do as well. While there is a shortage of data scientists
in the market in general, the scarcity is even more acutely felt when searching for
particular specialties such as time series, and you may find yourself in
competition for talent with companies such as Google, Facebook and other
industry giants.
Besides the data scientists, you need a team of developers and other experts to
build a system around the algorithms: one that handles scalable stream processing
efficiently, has solid backend systems and offers an easy-to-use user interface. At
the bare minimum, you would need backend developers creating
data flows, storage, and management of the large scale backend system, in
addition to UI experts and developers, QA and product management.
Note that this team not only has to develop and deploy the system, but also
maintain it over time.
Based on our own experience and discussions with our customers who have
faced the build or buy decision, we estimate that it would take a minimum of 12
human years (a team of data scientists, developers, UI and QA) to build even the
most rudimentary anomaly detection system. And this basic system could still
encounter various technical issues that are far beyond the scope of this paper.
SUMMARY
Across this series of three white papers, we have covered the critical processes
and various types of learning of a large scale anomaly detection system.
Hopefully these documents have given the reader some insight into the complexity
of designing and developing a large scale anomaly detection system. The Anodot
system has been carefully designed using sophisticated data science principles
and algorithms, and as a result, we can provide our customers with truly
meaningful information about the anomalies in their business systems.
North America
669-600-3120
[email protected]
International
+972-9-7718707
[email protected]
ABOUT ANODOT
Anodot provides valuable business insights through anomaly
detection. Automatically uncovering outliers in vast amounts of time series data,
Anodot's business incident detection uses patented machine learning algorithms
to isolate and correlate issues across multiple parameters in real-time,
supporting rapid business decisions. Anodot customers in fintech, ad-tech, web
apps, mobile apps and other data-heavy industries use Anodot to drive real
business benefits like significant cost savings, increased revenue and upturn in
customer satisfaction. The company was founded in 2014, is headquartered in
Raanana, Israel, and has offices in Silicon Valley and Europe. Learn more
at: http://www.anodot.com/.
Copyright 2017, Anodot. All trademarks, service marks and trade names
referenced in this material are the property of their respective owners.
i
Anodot uses an enhanced version of the latent Dirichlet allocation (LDA) algorithm in a unique way to
calculate abnormal based similarity. In natural language processing, LDA is a generative statistical
model that allows sets of observations to be explained by unobserved groups that explain why some
parts of the data are similar. For example, if observations are words collected into documents, it posits
that each document is a mixture of a small number of topics and that each word's creation is
attributable to one of the document's topics.
In LDA, each document may be viewed as a mixture of various topics, where each document is
considered to have a set of topics that are assigned to it via LDA. In practice, this results in more
reasonable mixtures of topics in a document.
For example, an LDA model might have topics that can be classified as CAT_related and DOG_related.
A topic has probabilities of generating various words, such as "milk," "meow" and "kitten," which can
be classified and interpreted by the viewer as CAT_related. Naturally, the word "cat" itself will have
high probability given this topic. The DOG_related topic likewise has probabilities of generating each
word: "puppy," "bark" and "bone" might have high probability. Words without special relevance, such
as "the," will have roughly even probability between classes (or can be placed into a separate
category). A topic is not strongly defined, neither semantically nor epistemologically. It is identified on
the basis of supervised labeling and (manual) pruning on the basis of their likelihood of co-occurrence.
A lexical word may occur in several topics with a different probability, however, with a different typical
set of neighboring words in each topic. (Wikipedia)
ii
TF-IDF is short for term frequency-inverse document frequency. It is a numerical statistic intended
to reflect how important a word is to a document in a collection or corpus. It is often used as a
weighting factor in information retrieval and text mining. (Wikipedia)