Beginning Anomaly Detection Using Python-Based Deep Learning
Implement Anomaly Detection Applications with Keras and PyTorch
Second Edition
Table of Contents
Introduction
Networking
Medicine
Video Surveillance
Environment
Summary
Chapter 11: Practical Use Cases and Future Trends of Anomaly Detection
    Anomaly Detection
    Real-World Use Cases of Anomaly Detection
        Telecom
        Banking
        Environmental
        Health Care
        Transportation
        Social Media
        Finance and Insurance
        Cybersecurity
        Video Surveillance
        Manufacturing
        Smart Home
        Retail
    Implementation of Deep Learning–Based Anomaly Detection
    Future Trends
    Summary
Index
About the Authors
Suman Kalyan Adari is currently a machine learning research engineer. He obtained a
B.S. in computer science at the University of Florida and an M.S. in computer science,
specializing in machine learning, at Columbia University. He has been conducting
deep learning research in adversarial machine learning since his freshman year at the
University of Florida and has presented at the IEEE Dependable Systems and Networks
workshop on Dependable and Secure Machine Learning held in Portland, Oregon,
USA in June 2019. Currently, he works on various anomaly detection tasks spanning
behavioral tracking and geospatial trajectory modeling.
He is quite passionate about deep learning, and specializes in various fields ranging
from video processing to generative modeling, object tracking, time-series modeling,
and more.
Sridhar Alla is the co-founder and CTO of Bluewhale, which helps organizations big
and small in building AI-driven big data solutions and analytics, as well as SAS2PY, a
powerful tool to automate migration of SAS workloads to Python-based environments
using Pandas or PySpark. He is a published author of books and an avid presenter
at numerous Strata, Hadoop World, Spark Summit, and other conferences. He also
has several patents filed with the US PTO on large-scale computing and distributed
systems. He has extensive hands-on experience in several technologies, including Spark,
Flink, Hadoop, AWS, Azure, TensorFlow, Cassandra, and others. He spoke on anomaly
detection using deep learning at Strata SFO in March 2019 and at Strata London in
October 2019. He was born in Hyderabad, India, and now lives in New Jersey with his
wife, Rosie, his daughters, Evelyn and Madelyn, and his son, Jayson. When he is not busy
writing code, he loves to spend time with his family and also training, coaching, and
organizing meetups.
About the Technical Reviewers
Puneet Sinha has accumulated more than 12 years of work
experience in developing and deploying end-to-end models
in credit risk, multiple marketing optimization, A/B testing,
demand forecasting and brand evaluation, profit and price
analyses, anomaly and fraud detection, propensity modeling,
recommender systems, upsell/cross-sell models, modeling
response to incentives, price optimization, natural language
processing, and OCR using ML/deep learning algorithms.
Acknowledgments
Suman Kalyan Adari
I would like to thank my parents, Krishna and Jyothi, my sister, Niha, and my loving dog,
Pinky, for supporting me throughout the entire process of writing this book as well as my
various other endeavors.
Sridhar Alla
I would like to thank my wonderful, loving wife, Rosie Sarkaria, and my beautiful,
loving children, Evelyn, Madelyn, and Jayson, for all their love and patience during the
many months I spent writing this book. I would also like to thank my parents, Ravi and
Lakshmi Alla, for their blessings and all the support and encouragement they continue
to bestow upon me.
Introduction
Congratulations on your decision to explore the exciting world of anomaly detection
using deep learning!
Anomaly detection involves finding patterns that do not adhere to what is
considered normal or expected behavior. Businesses could lose millions of dollars
due to abnormal events. Consumers could also lose millions of dollars. In fact, there are
many situations every day where people’s lives are at risk and where their property is at
risk. If your bank account gets cleaned out, that’s a problem. If your water line breaks,
flooding your basement, that’s a problem. If all flights at an airport get delayed due to a
technical glitch in the traffic control system, that’s a problem. If you have a health issue
that is misdiagnosed or not diagnosed, that’s a very big problem that directly impacts
your well-being.
In this book, you will learn how anomaly detection can be used to solve business
problems. You will explore how anomaly detection techniques can be used to address
practical use cases and address real-life problems in the business landscape. Every
business and use case is different, so while we cannot copy and paste code and build a
successful model to detect anomalies in any dataset, this book will cover many use cases
with hands-on coding exercises to give you an idea of the possibilities and concepts
behind the thought process. All the code examples in the book are presented in Python
3.8. We choose Python because it is truly the best language for data science, with a
plethora of packages and integrations with scikit-learn, deep learning libraries, etc.
We will start by introducing anomaly detection, and then we will look at legacy
methods of detecting anomalies that have been used for decades. Next, we will get a
taste of deep learning before exploring autoencoders and variational
autoencoders, which are paving the way for the next generation of generative models.
Following that, we will explore generative adversarial networks (GANs) as a way to detect
anomalies, delving directly into generative AI.
After that, we’ll look at long short-term memory (LSTM) models to see how temporal data
can be processed. We will cover temporal convolutional networks (TCNs), which are
excellent for temporal data anomaly detection. We will also touch upon the transformer
CHAPTER 1
Introduction to Anomaly Detection
In this chapter, you will learn about anomalies in general, the categories of anomalies,
and anomaly detection. You will also learn why anomaly detection is important, how
anomalies can be detected, and the use case for such a mechanism.
In a nutshell, this chapter covers the following topics:
• What is an anomaly?
What Is an Anomaly?
Before you get started with learning about anomaly detection, you must first understand
what exactly you are targeting. Generally, an anomaly is an outcome or value that
deviates from what is expected, but the exact criteria for what determines an anomaly
can vary from situation to situation.
Anomalous Swans
To get a better understanding of what an anomaly is, let’s take a look at some swans
sitting by a lake (Figure 1-1).
© Suman Kalyan Adari, Sridhar Alla 2024
S. K. Adari and S. Alla, Beginning Anomaly Detection Using Python-Based Deep Learning,
https://doi.org/10.1007/979-8-8688-0008-5_1
Let’s say that we want to observe these swans and make assumptions about the
color of the swans at this particular lake. Our goal is to determine what the normal
color of swans is and to see if there are any swans that are of a different color than this
(Figure 1-2).
Figure 1-2. More swans show up, all of which are white
We continue to observe swans for a few years and all of them have been white. Given
these observations, we can reasonably conclude that every swan at this lake should be
white. The very next day, we are observing swans at the lake again. But wait! What’s this?
A black swan has just flown in (Figure 1-3).
Considering our previous observations, we thought that we had seen enough swans
to assume that the next swan would also be white. However, the black swan defies that
assumption entirely, making it an anomaly. It’s not really an outlier, which would be, for
example, a really big white swan or a really small white swan; it’s a swan of an entirely
different color. In our scenario, the overwhelming majority of
swans are white, making the black swan extremely rare.
In other words, given a swan by the lake, the probability of it being black is very
small. We can explain our reasoning for labeling the black swan as an anomaly with one
of two approaches (though we aren’t limited to only these two approaches).
First, given that a vast majority of swans observed at this particular lake are white, we
can assume that, through a process similar to inductive reasoning, the normal color for a
swan here is white. Naturally, we would label the black swan as an anomaly purely based
on our prior assumptions that all swans are white, considering that we’d only seen white
swans before the black swan arrived.
Another way to look at why the black swan is an anomaly is through probability.
Now assume that there is a total of 1000 swans at this lake and only two are black swans;
the probability of a swan being black is 2 / 1000, or 0.002. Depending on the probability
threshold, meaning the lowest probability for an outcome or event that will be accepted
as normal, the black swan could be labeled as anomalous or normal. In our case, we will
consider it an anomaly because of its extreme rarity at this lake.
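The probability argument above can be sketched in a few lines of Python. The counts (2 black swans out of 1000) come from the example; the threshold of 0.01 and the helper name `is_anomalous` are assumptions chosen purely for illustration.

```python
from collections import Counter

# Observed swan colors at the lake: 998 white and 2 black, as in the example.
observations = ["white"] * 998 + ["black"] * 2

# Empirical probability of each color.
counts = Counter(observations)
total = sum(counts.values())
probabilities = {color: n / total for color, n in counts.items()}

# Assumed threshold: the lowest probability still accepted as normal.
PROBABILITY_THRESHOLD = 0.01


def is_anomalous(color, probabilities, threshold=PROBABILITY_THRESHOLD):
    """Flag a color whose empirical probability falls below the threshold."""
    # A color that was never observed has probability 0.
    return probabilities.get(color, 0.0) < threshold


for color in ("white", "black"):
    print(color, probabilities[color], is_anomalous(color, probabilities))
```

With a stricter threshold such as 0.001, the same black swan would be accepted as normal, which is why the choice of threshold matters as much as the probability itself.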
The intersections of the dotted lines have created several different regions containing
data points. Of interest is the bounding box (solid lines) created from the intersection of
both sets of dotted lines since it contains the data points for samples deemed acceptable
(Figure 1-5). Any data point outside of that specific box will be considered anomalous.
Figure 1-5. Data points are identified as “good” or “anomaly” based on their
location
Now that we know which points are and aren’t acceptable, let’s pick out a sample from
a new batch of screws and check its data to see where it falls on the graph (Figure 1-6).
Figure 1-6. A new data point representing the new sample screw is generated,
with the data falling within the bounding box
The data for this sample screw falls within the acceptable range. That means that this
batch of screws is good to use since its density and tensile strength are appropriate
for use by the consumer. Now let’s look at a sample from the next batch of screws and
check its data (Figure 1-7).
Figure 1-7. A new data point is generated for another sample, but this falls
outside the bounding box
The data falls far outside the acceptable range. For its density, the screw has abysmal
tensile strength and is unfit for use. Since it has been flagged as an anomaly, the
factory can investigate why this batch of screws turned out to be brittle. For a factory of
considerable size, it is important to hold a high standard of quality as well as maintain
a high volume of steady output to keep up with consumer demands. For a monumental
task like that, automation to detect any anomalies to avoid sending out faulty screws is
essential and has the benefit of being extremely scalable.
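As a rough sketch of how such a check could be automated, the bounding box can be expressed as one acceptable range per measurement. The numeric limits, the units, and the names `AcceptableRange` and `is_good_screw` are hypothetical; the example in this chapter does not specify actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptableRange:
    """Closed interval [low, high] of acceptable values for one measurement."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


# Hypothetical limits; real limits would come from the factory's specification.
DENSITY_RANGE = AcceptableRange(low=7.7, high=8.0)       # e.g., g/cm^3
TENSILE_RANGE = AcceptableRange(low=400.0, high=600.0)   # e.g., MPa


def is_good_screw(density: float, tensile_strength: float) -> bool:
    """A sample is good only if it lies inside the bounding box on both axes."""
    return DENSITY_RANGE.contains(density) and TENSILE_RANGE.contains(tensile_strength)


samples = {
    "first batch": (7.85, 510.0),   # inside the box
    "next batch": (7.90, 150.0),    # acceptable density, poor tensile strength
}
for name, (density, tensile) in samples.items():
    label = "good" if is_good_screw(density, tensile) else "anomaly"
    print(f"{name}: {label}")
```

Because the check is just two comparisons per sample, it scales to any production volume, which is the point made above about automation.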
So far, we have explored anomalies as data points that are either out of place, in the
case of the black swan, or unwanted, in the case of faulty screws. So what happens when
we introduce time as a new variable?
Assume the initial spike in expenditures at the start of the month is due to the
payment of bills such as rent and insurance. During the weekdays, our example person
occasionally eats out, and on the weekends goes shopping for groceries, clothes, and
various other items. Also assume that this month does not include any major holidays.
These expenditures can vary from month to month, especially in months with major
holidays. Assume that our person lives in the United States, in which the holiday of
Thanksgiving falls on the fourth Thursday of the month of November. Many U.S. employers
also include the Friday after Thanksgiving as a day off for employees. U.S. retailers have
leveraged that fact to entice people to begin their Christmas shopping by offering special
deals on what has colloquially become known as “Black Friday.” With that in mind, let’s
take a look at our person’s spending pattern in November (Figure 1-9). As expected, a
massive spike in purchases occurred on Black Friday, some of them quite expensive.
Figure 1-9. Spending habits for the same person during the month of November
Now assume that, unfortunately, our person has had their credit card information
stolen, and the criminals responsible for it have decided to purchase various items of
interest to them. Using the same month as in the first example (Figure 1-8; no major
holidays), the graph in Figure 1-10 depicts what could happen.
Figure 1-10. Purchases in the person’s name during the same month as in
Figure 1-8
Let’s assume we have a record of purchases for this user going back many years.
Thanks to this established prior history, this sudden influx of purchases would be
flagged as anomalous. Such a cluster of purchases might be normal for Black Friday or
in the weeks before Christmas, but in any other month without a major holiday, it looks
out of place. In this case, our person might be contacted by the credit card company to
confirm whether or not they made the purchases.
Some companies might even flag purchases that follow normal societal trends. What
if that TV wasn’t really bought by our person on Black Friday? In that case, the credit
card company’s software can ask the client directly through a phone app, for example,
whether or not they actually bought the item in question, allowing for some additional
protection against fraudulent purchases.
Taxi Cabs
As another example of anomalies in a time series, let’s look at some sample data for taxi
cab pickups and drop-offs over time for a random city and an arbitrary taxi company and
see if we can detect any anomalies.
On an average day, the total number of pickups can look somewhat like the pattern
shown in Figure 1-11.
Figure 1-11. Number of pickups for a taxi company throughout the day
From the graph, we see that there’s a bit of post-midnight activity that drops off
to near zero during the late-night hours. However, customer traffic picks up suddenly
around morning rush hour and remains high until the early evening, when it peaks
during evening rush hour. This is essentially what an average day looks like.
Let’s expand the scope out a bit more to gain some perspective on passenger traffic
throughout the week (Figure 1-12).
Figure 1-12. Number of pickups for a taxi company throughout the week
As expected, most of the pickups occur on weekdays, when commuters
must get to and from work. On the weekends, a fair number of people still go out to get
groceries or just go out somewhere for the weekend.
On a small scale like this, causes for anomalies would be anything that prevents
taxis from operating or incentivizes customers not to use a taxi. For example, say that a
terrible thunderstorm hits on Friday. Figure 1-13 shows that graph.
Figure 1-13. Number of pickups for a taxi company throughout the week, with a
heavy thunderstorm on Friday
The thunderstorm likely influenced some people to stay indoors, resulting in a lower
number of pickups than usual for a weekday. However, these sorts of anomalies are
usually too small-scale to have any noticeable effect on the overall pattern.
Let’s take a look at the data over the entire year, as shown in Figure 1-14.
Figure 1-14. Number of pickups for a taxi company throughout the year
The largest dips occur during the winter months when snowstorms are expected.
These are regular patterns that can be observed at similar times every year, so they
are not an anomaly. But what happens to customer traffic levels when a relatively rare
polar vortex descends on the city in early April and unleashes several intense blizzards?
Figure 1-15 shows the graph.
Figure 1-15. Number of pickups for a taxi company throughout the year, with a
polar vortex descending on the city in April
As you can see in Figure 1-15, the intense blizzards severely slowed down all traffic
in the first week of April and burdened the city in the following two weeks. Comparing
this graph to the graph shown in Figure 1-14, there’s a clearly defined anomaly in April
caused by the polar vortex. Since this pattern is extremely rare for the month of April, it
would be flagged as an anomaly.
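One way to make this comparison concrete is to judge each week against the same calendar week in previous years, so that routine winter dips are not flagged while an April collapse is. The sketch below uses invented pickup totals and an assumed cutoff of three standard deviations; none of these numbers come from the figures.

```python
from statistics import mean, stdev

# Hypothetical weekly pickup totals for the same calendar week in five past years.
history = {
    "January, week 2": [6100, 5900, 6300, 6000, 5800],   # winter dips are routine
    "April, week 1": [10200, 9800, 10050, 9950, 10100],
}

# Hypothetical totals for the current year.
current_year = {
    "January, week 2": 6050,   # an ordinary snowy week
    "April, week 1": 3100,     # the week the polar vortex hit
}

Z_THRESHOLD = 3.0  # assumed cutoff, in sample standard deviations


def seasonal_z_score(observed, past_values):
    """Distance of `observed` from the historical mean, in standard deviations."""
    return (observed - mean(past_values)) / stdev(past_values)


flagged = []
for week, observed in current_year.items():
    z = seasonal_z_score(observed, history[week])
    if abs(z) > Z_THRESHOLD:
        flagged.append(week)
    print(f"{week}: z = {z:.1f}")

print("Flagged:", flagged)
```

The January total is low in absolute terms but ordinary for January, so only the April week is flagged.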
Categories of Anomalies
Now that you have more perspective on what anomalies can be in various situations, you
can see that they generally fall into these broad categories:
• Context-based anomalies
• Pattern-based anomalies
are expected to be present in the data set and can be caused by unavoidable random
errors or from systematic errors relating to how the data was sampled. Anomalies would
be outliers or other values that one doesn’t expect to exist. These data anomalies might
be present wherever a data set of values exists.
As an example of a data set in which data point–based anomalies may exist,
consider a data set of thyroid diagnostic values, where the majority of the data points are
indicative of normal thyroid functionality. In this case, anomalous values represent sick
thyroids. While they are not necessarily outliers, they have a low probability of existing
when taking into account all the normal data.
We can also detect individual purchases totaling excessive amounts and label
them as anomalies since, by definition, they are not expected to occur or have a very
low probability of occurrence. In this case, they are labeled as potentially fraudulent
transactions, and the card holder is contacted to ensure the validity of the purchase.
Basically, we can say this about the difference between anomalies and outliers:
we should expect a data set to include outliers, but we should not expect it to include
anomalies. Though the terms “anomaly” and “outlier” are sometimes interchanged,
anomalies are not always outliers, and not all outliers are anomalies.
Context-Based Anomalies
Context-based anomalies consist of data points that might seem normal at first but
are considered anomalies in their respective contexts. Returning to our earlier personal
spending example, we might expect a sudden surge in purchases near certain holidays,
but these purchases could seem unusual in the middle of August. The person’s high
volume of purchases on Black Friday was not flagged because it is typical spending
behavior for people on Black Friday. However, if the purchases were made in a month
where they are out of place given previous purchase history, they would be flagged as
anomalies. This might seem similar to the example presented for data point–based
anomalies, but the distinction for context-based anomalies is that the individual purchase
does not have to be expensive. If a person never buys gasoline because they own an
electric car, sudden purchases of gasoline would be out of place given the context. Buying
gasoline is normal behavior for many people, but in this context, it is an anomaly.
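The gasoline example can be sketched as a check against the card holder's own history rather than against the purchase amount. The purchase records, the category names, and the helper `is_out_of_context` are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase history for one card holder: (category, amount in dollars).
history = [
    ("groceries", 82.10), ("restaurant", 23.50), ("ev_charging", 12.00),
    ("groceries", 64.75), ("clothing", 120.00), ("ev_charging", 9.50),
]
seen_categories = Counter(category for category, _ in history)


def is_out_of_context(category, seen, min_count=1):
    """Flag a purchase in a category this card holder has not used before."""
    # Counter returns 0 for categories that never appeared in the history.
    return seen[category] < min_count


new_purchases = [("groceries", 71.20), ("gasoline", 35.00)]
flags = [is_out_of_context(category, seen_categories) for category, _ in new_purchases]

for (category, amount), flag in zip(new_purchases, flags):
    print(f"{category} (${amount:.2f}): {'anomaly' if flag else 'normal'}")
```

Note that the amount never enters the decision: the gasoline purchase is inexpensive but is still flagged because of who is making it.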
Pattern-Based Anomalies
Pattern-based anomalies are patterns and trends that deviate from their historical
counterparts, and they often occur in time-series or other sequence-based data. In the
earlier taxi cab company example, the customer pickup counts for the month of April
were pretty consistent with the rest of the year. However, once the polar vortex hit, the
numbers tanked visibly, resulting in a huge drop in the graph, labeled as an anomaly.
Similarly, when monitoring network traffic in the workplace, expected patterns of
network traffic are formed from constant monitoring of data over several months or
even years for some companies. If an employee attempts to download or upload large
volumes of data, it generates a certain pattern in the overall network traffic flow that
could be considered anomalous if it deviates from the employee’s usual behavior.
As another example of pattern-based anomalies, if an external hacker decided to hit
a company’s website with a distributed denial-of-service (DDoS) attack, an effort
to overwhelm the server that handles network flow to a certain website in order
to bring the entire website down or stop its functionality, every single attempt would
register as an unusual spike in network traffic. All of these spikes clearly deviate
from normal traffic patterns and would be considered anomalous.
Anomaly Detection
Now that you have a better understanding of the different types of anomalies, we can
proceed to discuss approaches to creating models to detect anomalies. This section presents
a few approaches we can take, but keep in mind we are not limited to just these methods.
Recall our reasoning for labeling the black swan as an anomaly. One reason was
that since all the swans we have seen thus far were white, the single black swan is an
obvious anomaly. A statistical way to explain this reasoning is that as of the most recent
set of observations, we have one black swan and tens of thousands of white swans.
Thus, the probability of occurrence of the black swan is one out of tens of thousands
of all observed swans. Since this probability is so low, it would make the black swan an
anomaly just because we do not expect to see it at all.
The anomaly detection models we will explore in this book follow these approaches
either by training on unlabeled data, training on normal, nonanomalous data, or training
on labeled data for normal and anomalous data. In that last case, in the context of
identifying swans, we would be told which swans are normal and which swans are anomalies.
So, what is anomaly detection? Quite simply, anomaly detection is the process in
which an advanced algorithm identifies certain data or data patterns as anomalous.
Falling under anomaly detection are the tasks of outlier detection, noise removal, novelty
detection, event detection, change point detection, and anomaly score calculation.
In this book, we will explore all of these as they are all basically anomaly detection
methods. The following tasks of anomaly detection are not exhaustive, but are some of
the more common anomaly detection tasks today.
Outlier Detection
Outlier detection is a technique that aims to detect anomalous outliers within a given
data set. As previously discussed, three methods that can be applied to this situation are
to train a model only on normal data to identify anomalies (by a high reconstruction error,
described next), to model a probability distribution in which anomalies would be labeled
based on their association with really low probabilities, or to train a model to recognize
anomalies by teaching it what an anomaly looks like and what a normal point looks like.
Regarding the high reconstruction error, think of it this way: the model trains on a
set of normal data and learns the patterns corresponding to normal data. When exposed
to an anomalous data point, the patterns do not line up with what the model learned to
associate with normal data. The reconstruction error is analogous to the deviation in
learned patterns between the anomalous point and the normal points the model trained
on. We will formally go over reconstruction error in Chapter 6. Going back to the example
of the swans, the black swan is different based on the patterns that we learned by observing
swans at the lake, and was thus anomalous because it did not follow the color pattern.
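To give a feel for reconstruction error before the formal treatment, here is a minimal sketch that swaps in a much simpler model than the neural networks used later in the book: a one-component principal axis fitted to two-dimensional normal data. Each point is encoded as a single number (its position along the learned axis) and decoded back to two values; the distance between the original and the decoded point is the reconstruction error. The data and function names are assumptions for illustration only.

```python
import math
from statistics import mean


def fit_principal_axis(points):
    """Learn the center and main direction of 2-D normal data (1-component PCA)."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def reconstruction_error(point, center, direction):
    """Encode a point as one number, decode it, and measure what was lost."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    code = dx * direction[0] + dy * direction[1]   # encode: 2 values -> 1
    rx = center[0] + code * direction[0]           # decode: 1 value -> 2
    ry = center[1] + code * direction[1]
    return math.hypot(point[0] - rx, point[1] - ry)


# Hypothetical normal data lying close to the line y = 2x.
normal = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9)]
center, direction = fit_principal_axis(normal)

typical = (3.5, 7.0)   # follows the learned pattern
unusual = (3.5, 1.0)   # similar x value, but breaks the pattern

print("typical:", round(reconstruction_error(typical, center, direction), 3))
print("unusual:", round(reconstruction_error(unusual, center, direction), 3))
```

The point that follows the learned pattern is reconstructed almost perfectly, while the point that breaks it loses a great deal in the round trip; thresholding that loss is what turns the model into a detector.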
Noise Removal
Noise removal involves filtering out any constant background noise in the data set.
Imagine that you are at a party and you are talking with a friend. There is a lot of
background noise, but your brain focuses on your friend’s voice and isolates it because
that’s what you want to hear. Similarly, a model learns to efficiently encode the original
sound to represent only the essential information, for example, the pitch of
your friend’s voice, the tempo, and the vocal inflections. Then, it is able to reconstruct the
original sound without the anomalous interference noise.
This can also be a case where an image has been altered in some form, such as by
having perturbations, loss of detail, fog, etc. The model learns an accurate representation
of the original image and outputs a reconstruction without any of the anomalous
elements in the image.
Novelty Detection
Novelty detection is very similar to outlier detection. In this case, a novelty is a data
point that was not part of the training set and is shown to the model afterward
to determine whether it is an anomaly. The key difference between novelty detection and
outlier detection is that in outlier detection, the job of the model is to determine what is
an anomaly within the training data set. In novelty detection, the model learns what is a
normal data point and what isn’t and tries to classify anomalies in a new data set that it
has never seen before.
Examples can include quality assurance in factories to make sure new batches of
created products are up to par, such as with the example of screws from earlier. Another
case is network security. Incoming traffic data can be monitored to ensure there is no
anomalous behavior going on. Both of these situations involve a constant stream of
novelties (new data) on which the model must make predictions.
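The fit-once, score-new-data workflow of novelty detection can be sketched with a simple nearest-neighbor rule. This is only one possible approach, not the method used later in the book: the threshold is learned from the training data alone, and each unseen point is flagged if it lies farther from every training point than that threshold. The training values, the factor of 2, and the function names are assumptions.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbor in `others`."""
    return min(math.dist(point, other) for other in others)


def fit_threshold(train, factor=2.0):
    """Learn a distance threshold from the training data alone (leave-one-out)."""
    leave_one_out = [
        nearest_distance(p, train[:i] + train[i + 1:]) for i, p in enumerate(train)
    ]
    return factor * max(leave_one_out)


def is_novelty(point, train, threshold):
    """A new point is a novelty if it is far from everything seen in training."""
    return nearest_distance(point, train) > threshold


# Hypothetical training data: (density, tensile strength) of known-good screws,
# rescaled to arbitrary units.
train = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.0, 0.8)]
threshold = fit_threshold(train)

# New samples the model has never seen.
new_batch = [(1.05, 1.0), (3.0, 0.2)]
results = [is_novelty(p, train, threshold) for p in new_batch]
print(results)
```

In outlier detection, by contrast, the same scoring would be applied to the points of the training set itself rather than to a separate batch.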
Event Detection
Event detection involves the detection of points in a time-series dataset that deviate
anomalously from the norm. For example, in the taxi cab company example earlier in the
chapter, the polar vortex reduced the customer pickup counts for April. These deviations
from the norm for April were all associated with an anomalous event that occurred
in the dataset, which an event detector algorithm would identify. Another example of
event detection would be the tracking of sea-ice levels at the poles over time, forming a
time series. An event detector algorithm could flag exactly when sea-ice levels deviate
anomalously from the usual norm, such as occurred recently in July 2023, when the sea-
ice levels were detected to be six standard deviations below the established average.
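A very small event detector along these lines compares each reading with the mean and standard deviation of the readings just before it. The sea-ice values below are invented, and the window length and the six-standard-deviation cutoff are assumptions that echo the example above.

```python
from statistics import mean, stdev


def detect_events(series, window=6, threshold=6.0):
    """Return indices whose value lies more than `threshold` standard deviations
    from the mean of the preceding `window` values."""
    events = []
    for t in range(window, len(series)):
        baseline = series[t - window:t]
        spread = stdev(baseline)
        if spread == 0:
            continue  # a flat baseline gives no scale to measure against
        z = (series[t] - mean(baseline)) / spread
        if abs(z) > threshold:
            events.append(t)
    return events


# Hypothetical monthly sea-ice extent readings (arbitrary units).
sea_ice = [15.1, 15.0, 14.9, 15.2, 15.0, 14.8, 15.1, 14.9, 15.0, 13.2, 15.0, 14.9]

print(detect_events(sea_ice))
```

One caveat of this simple design: once an extreme value enters the window, it inflates the baseline's spread, so readings immediately after an event are judged more leniently.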
in statistical trends. A good example is global average temperatures over time. A change
point detection algorithm could identify periods of sustained warming where the
statistical properties start to shift over time. It could likewise flag periods of
accelerated warming that differ from the normal rate of warming.
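As a deliberately naive sketch of the idea, the snippet below looks for the single split that best separates a series of year-over-year warming rates into two segments with different means. The temperature values are made up, and practical change point detection methods are considerably more robust to noise than this criterion.

```python
from statistics import mean


def find_change_point(values, min_segment=3):
    """Return the index that best splits `values` into two segments with
    different means, or None if the series is too short."""
    best_index, best_gap = None, 0.0
    for i in range(min_segment, len(values) - min_segment + 1):
        gap = abs(mean(values[i:]) - mean(values[:i]))
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


# Hypothetical yearly temperature anomalies: warming of about 0.01 per year
# at first, then about 0.03 per year.
temps = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
         0.12, 0.15, 0.18, 0.21, 0.24, 0.27]

# Year-over-year warming rate; rates[i] is the change from year i to year i + 1.
rates = [later - earlier for earlier, later in zip(temps, temps[1:])]

change = find_change_point(rates)
print("Warming rate changes at rate index:", change)
```

Working on the rates rather than the raw temperatures is what lets the sketch separate steady warming from accelerated warming, since both periods have rising temperatures.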
Supervised anomaly detection is a technique in which the training data has labels
for both anomalous data points and normal data points. Basically, we tell the model
during the training process if a data point is an anomaly or not. Unfortunately, this
isn’t the most practical method of training, especially because the entire data set needs
to be processed and each data point needs to be labeled. Since supervised anomaly
detection is basically a type of binary classification task, meaning the job of the model is
to categorize data under one of two labels, any classification model can be used for the
task, though not every model can attain a high level of performance. Chapter 9 provides
an example in the context of a temporal convolutional network.
Semi-supervised anomaly detection involves partially labeling the training data set.
Exact implementations and definitions for what “semi-supervised” entails may differ, but
the gist of it is that you are working with partially labeled data. In the context of anomaly
detection, this can be a case where only the normal data is labeled. Ideally, the model will
learn what normal data points look like so that it can flag as anomalous data points that
differ from normal data points. Examples of models that can use semi-supervised learning
for anomaly detection include autoencoders, which you will learn about in Chapter 6.
Unsupervised anomaly detection, as the name implies, involves training the model
on unlabeled data. After the training process, the model is expected to know what data
points are normal and what points are anomalous within the data set. Isolation forest, a
model we will explore in Chapter 4, is one such model that can be used for unsupervised
anomaly detection.
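As a small preview, the following sketch fits scikit-learn's IsolationForest on unlabeled synthetic data. The data, the two planted outliers, and the contamination value of 0.01 are assumptions chosen for illustration; Chapter 4 covers the model properly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Unlabeled synthetic data: 300 ordinary points plus two far-away points
rng = np.random.default_rng(7)
ordinary = rng.normal(0.0, 1.0, size=(300, 2))
far_away = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([ordinary, far_away])

# contamination is our assumed guess at the fraction of anomalies in the data
forest = IsolationForest(contamination=0.01, random_state=0)

# fit_predict returns 1 for points judged normal and -1 for points judged anomalous
labels = forest.fit_predict(X)

# score_samples returns lower scores for points that look more anomalous
scores = forest.score_samples(X)

print(labels[-2:])
print(int((labels == -1).sum()), "points flagged in total")
```

No labels are passed to the model at any point. It decides which points are anomalous purely from how easily each one can be isolated from the rest.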
Data Breaches
In today’s age of big data, where huge volumes of information are stored about users
in various companies, information security is vital. Any information breaches must be
reported and flagged immediately, but it is hard to do so manually at such a scale. Data
leaks can range from simple accidents, such as an employee losing a USB stick containing
sensitive company information that someone else then picks up and accesses, to
intentional actions, such as an employee deliberately sending data to an outside
party or an attacker gaining access to a database via an intrusion attack. Several
high-profile data leaks have been widely reported in news media, including security
breaches at Facebook/Meta, iCloud, and Google in which millions of passwords or photos
were leaked. All of those companies operate on an international scale, requiring automation
to monitor everything in order to ensure the fastest response time to any data breach.
The data breaches might not even need network access. For example, an employee
could email an outside party or another employee with connections to rival companies
about travel plans to meet up and exchange confidential information. Of course, these
emails would not be so obvious as to state such intentions directly. However, monitoring
these emails could be helpful in a post-breach analysis to identify anyone suspicious
within the company, or as part of real-time monitoring software that ensures data
confidentiality compliance across teams. Anomaly detection models can sift
through and process employee emails to flag any suspicious activity. The
software can pick up keywords and process them to understand the context and decide
whether or not to flag an employee’s email for review.
The following are a few more examples of how anomaly detection software can
detect an internal data breach:
In this case, something won’t add up, which the software will detect before flagging
the employee. It could turn out to be a one-off sanctioned event, in which case the
model did its job even though nothing was wrong this time, or it could turn out that
the employee somehow accessed data they shouldn’t have, which would mean there was a
data breach.
The key benefit to using anomaly detection in the workplace is how easy it is to scale
up. These models can be used for small companies as well as large-scale international
companies.
Identity Theft
Identity theft is another common problem in today’s society. Thanks to the development of
online services allowing for ease of access when purchasing items, the volume of credit
and debit card transactions that take place every day has grown immensely. However,
this development also makes it easier to steal credit and debit card information or bank
account information, allowing the criminals to purchase anything they want if the card
isn’t deactivated or if the account isn’t secured again. Because of the huge volume
of transactions, monitoring everything is difficult. However, this is where anomaly
detection can step in and help, since it is highly scalable and can help detect fraudulent
transactions the moment the request is sent.
As we discussed earlier, context matters. When a payment card transaction is made, the
payment card company’s anomaly detection software takes into account the card holder’s
previous history to determine if the new transaction should be flagged or not. Obviously, a
series of high value purchases made suddenly would raise alarms immediately, but what
if the criminals were smart enough to realize that and just make a series of purchases over
time that won’t put a noticeable hole in the card holder’s account? Depending on
the context, the software would pick up on these transactions and flag them as well.
For example, let’s say that someone’s grandmother was recently introduced to
Amazon and to the concept of buying things online. One day, unfortunately, she
stumbles upon an Amazon lookalike website and enters her credit card information.
On the other side, some criminal takes it and starts buying random things, but not all
at once so as not to raise suspicion—or so he thought. The grandmother’s identity theft
insurance company starts noticing some recent purchases of batteries, hard drives, flash
drives, and other electronic items. While these purchases might not be that expensive,
they certainly stand out when all the prior purchases made by the grandmother
consisted of groceries, pet food, and various decorative items. Based on this previous
history, the detection software would flag the new purchases and the grandmother
would be contacted to verify these purchases. These transactions might even be flagged
as soon as an attempt to purchase is made. In this case, either the location of the
purchaser or the nature of the transactions could raise alarms and stop the transaction
from being successful.
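The grandmother example can be sketched in a few lines. The idea is to compare each new purchase category against how often it appears in the card holder's history. The purchase lists, the helper name is_unusual, and the 5 percent cutoff are all hypothetical; real fraud detection systems use far richer context, such as location, amount, and timing.

```python
from collections import Counter

# Hypothetical purchase history for the card holder
history = [
    "groceries", "pet food", "groceries", "decor",
    "groceries", "pet food", "decor", "groceries",
]

# Hypothetical new purchases to be checked against that history
new_purchases = ["groceries", "hard drive", "batteries", "pet food", "flash drive"]

category_counts = Counter(history)
total = sum(category_counts.values())


def is_unusual(category, min_share=0.05):
    """Flag a category that makes up less than min_share of past purchases."""
    share = category_counts.get(category, 0) / total
    return share < min_share


flagged = [category for category in new_purchases if is_unusual(category)]
print(flagged)
```

Here the electronics purchases are flagged because they never appear in the history, while groceries and pet food pass through.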
Manufacturing
We have already explored a use case of anomaly detection in manufacturing. Manufacturing
plants usually have a certain level of quality that they must ensure their products meet
before shipping them out. When factories are configured to produce massive quantities
of output at a near constant rate, it becomes necessary to automate the process of
checking the quality of various samples. Similar to the fictitious example of the screw
manufacturer, manufacturing plants in real life might test and use anomaly detection
software to ensure the quality of various metal parts, tools, engines, food, clothes, etc.
Networking
Perhaps one of the most important use cases for anomaly detection is in networking.
The Internet is host to a vast array of websites located on servers all around the
world. Unfortunately, due to the ease of access to the Internet, many individuals use
it for nefarious purposes. Similar to the data leaks that were discussed earlier
in the context of protecting company data, hackers can also launch attacks on websites
to leak their information.
One such example would be hackers attempting to leak government secrets
through a network attack. With such sensitive information as well as the high volumes
of expected attacks every day, automation is a necessary tool to help cybersecurity
professionals deal with the attacks and preserve state secrets. On a smaller scale, hackers
might attempt to breach a cloud network or a local area network and try to leak data.
Even in smaller cases like this, anomaly detection can help detect network intrusion
attacks as they happen and notify the proper officials. An example data set for network
intrusion anomaly detection is the KDD Cup 1999 data set. This data set contains a
large number of entries that detail various types of network intrusion attacks as well as a
detailed list of variables for each attack that can help a model identify each type of attack.
Medicine
Anomaly detection also has a massive role to play in the field of medicine. For example,
models can detect subtle irregularities in a patient’s heartbeat in order to classify
diseases, or they can measure brainwave activity to help doctors diagnose certain
conditions. Beyond that, they can help analyze raw diagnostic data for a patient’s organ
and process it in order to quickly diagnose any possible problems, similarly to the
thyroid example discussed earlier.
Anomaly detection can even be used in medical imagery to determine whether a given
image contains anomalous objects. For example, suppose an anomaly detection model
was trained by exposing it only to MRI imagery of normal bones. When shown an image of
a broken bone, it would flag the new image as an anomaly. Similarly, anomaly detection
can even be extended to tumor detection, allowing for the model to analyze every image in
a full-body MRI scan and look for the presence of abnormal growth or patterns.
Video Surveillance
Anomaly detection also has uses in video surveillance. Anomaly detection software can
monitor video feeds and flag any videos that capture anomalous action. While this might
seem dystopian, it can certainly help catch criminals and maintain public safety on
busy streets and other transportation systems. For example, this type of software could
potentially identify a mugging in a street at night as an anomalous event and alert the
nearest police department. Additionally, this type of software can detect unusual events
at crossroads, such as an accident or some unusual obstruction, and immediately call
attention to the footage.
Environment
Anomaly detection can be used to monitor environmental conditions as well. For
example, anomaly detection systems are used to monitor heavy-metal levels in rivers
to pick up potential spills or leaks into the water supply. Another example is air quality
monitoring, which can detect anything from seasonal pollen to wildfire smoke coming in
from far away. Additionally, anomaly detection can be utilized to monitor soil health in
agricultural or environmental survey cases to gauge the health of an ecosystem. Any dips
in soil moisture or specific nutrient levels could indicate some kind of problem.
Summary
Generally, anomaly detection is utilized heavily in medicine, finance, cybersecurity,
banking, networking, transportation, and manufacturing, but it is not limited to those fields.
For nearly every case imaginable involving data collection, anomaly detection can be put to
use to help users automate the process of detecting anomalies and possibly removing them.
Many fields in science can benefit from anomaly detection because of the large volume of
raw data collection that goes on. Anomalies that would interfere with the interpretation
of results or otherwise introduce some sort of bias into the data could be detected and
removed, provided that the anomalies are caused by systematic or random errors.
In this chapter, we discussed what anomalies are and why detecting anomalies can
be very important to the data processing we perform at our organizations.
Next, Chapter 2 introduces you to core data science concepts that you need to know
to follow along for the rest of the book.
CHAPTER 2
Introduction to Data
Science
This chapter introduces you to the basic principles of the data science workflow. These
concepts will help you prepare your data as necessary to be fed into a machine learning
model as well as understand its underlying structure through analysis.
You will learn how to use the libraries pandas, NumPy, scikit-learn, and Matplotlib
to load, manipulate, and analyze data. Furthermore, you will learn how to perform data
I/O; manipulate, transform, and process the data to your liking; analyze and plot the
data; and select/create relevant features for the machine learning modeling task.
In a nutshell, this chapter covers the following topics:
• Data I/O
• Data manipulation
• Data analysis
• Data visualization
• Data processing
• Feature engineering/selection
Note Code examples are provided in Python 3.8. Package versions of all
frameworks used are provided. You will need some type of program to run
Python code, so make sure to set this up beforehand. The code repository for this
book is available at
https://github.com/apress/beginning-anomaly-detection-python-deep-learning-2e/tree/master.
© Suman Kalyan Adari, Sridhar Alla 2024
S. K. Adari and S. Alla, Beginning Anomaly Detection Using Python-Based Deep Learning,
https://doi.org/10.1007/979-8-8688-0008-5_2
The repository also includes a requirements.txt file to check your packages and their
versions.
Code examples for this chapter are available at
https://github.com/apress/beginning-anomaly-detection-python-deep-learning-2e/blob/master/Chapter%202%20Introduction%20to%20Data%20Science/chapter2_datascience.ipynb.
Navigate to “Chapter 2 Introduction to Data Science” and then click
chapter2_datascience.ipynb. The code is provided as a .py file as well, though it is
the exported version of the notebook.
We will be using JupyterLab (https://jupyter.org) to present all of the code
examples.
Data Science
“Data science” is quite the popular term and buzzword nowadays, so what exactly is it?
In recent times, the term “data science” has come to represent a wide range of roles and
responsibilities. Depending on the company, a data scientist can be expected to perform
anything from data processing (often at scale, dipping into “big data” territory), to
statistical analysis and visualization, to training and deploying machine learning models.
In fact, data scientists often perform two or more of these roles at once.
This chapter focuses on concepts from all three roles, walking you through the
process of preparing and analyzing the dataset before you explore the modeling aspect
in Chapter 3.
Be advised that this will be a brief, high-level walkthrough over the most relevant
functionality that these various data science packages offer. Each package is so
comprehensive that a full book would be required to cover it in depth. Therefore, you are
encouraged to explore each package’s online documentation, as well as other guides and
walkthroughs, to learn as much as you can.
Dataset
A popular introductory dataset for budding data scientists is the Titanic Dataset,
available at Kaggle: https://www.kaggle.com/c/titanic. You can also find this dataset
hosted on this book’s repository at
https://github.com/apress/beginning-anomaly-detection-python-deep-learning-2e/blob/master/data/train.csv.
Kaggle is a great place to find datasets. It also hosts machine learning modeling
competitions, some of which even have prize money. Kaggle is an excellent resource
for practicing your skills, and you are encouraged to do so! If you would like to practice
the concepts that you learn in this book, search Kaggle for various anomaly detection
datasets.
To download the Titanic dataset from the Kaggle website, follow the instructions
provided next. If you prefer, Kaggle offers an API that you can install through pip,
available at https://github.com/Kaggle/kaggle-api.
Figure 2-1. Overview page for the Titanic dataset on Kaggle (as it looks as of
April, 2023)
2. Click the Data tab. You should see a brief description of the
dataset as well as a Data Explorer, as shown in Figure 2-2.
Figure 2-2. Click Download All to download the data as a zip file
4. Return to the Data tab, scroll down, and click Download All. A zip
file will download.
5. Extract this zip file anywhere that you’d like and make a note of
the directory path.
After you have extracted the zip file, you are almost ready to start processing this
dataset in Python. First, make sure you have an IDE to develop Python code with. To
easily follow the examples in this book, we recommend using Jupyter Notebook
or JupyterLab.
Next, make sure you have the following libraries and versions installed, though you
might also want to check the requirements.txt file available on the GitHub repository or
use it to prepare your environment:
• pandas==2.0.0
• numpy==1.22.2
• scikit-learn==1.2.2
• matplotlib==3.7.1
It is not necessary to have the exact same versions, but keep in mind that older
versions may not contain features we explore later in the book. Newer versions should be
fine unless they are significantly more recent, which might result in reworked syntax or
features and thus introduce incompatibility.
You can easily check the version in Python. Figure 2-3 introduces code to import
these packages and print their versions.
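import pandas
import numpy
import sklearn
import matplotlib
# Print each package's name and installed version
for package in (pandas, numpy, sklearn, matplotlib):
    print(package.__name__, package.__version__)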
Figure 2-3. Code to import pandas, numpy, sklearn, and matplotlib and print
their versions
Figure 2-4. The text output of running the code in Figure 2-3
Data I/O
With our environment set up, let’s jump straight into the content. Before we conduct any
type of data analysis, we need to actually have data. There are a myriad of ways to load
data in Pandas, but we will keep it simple and load from a csv file.
For the sake of convenience, let’s reimport our libraries with aliases, as shown in
Figure 2-5.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Figure 2-5. Importing pandas, numpy, and matplotlib’s pyplot as aliases, for
convenience
Once you have executed this code, let’s move on to loading our dataset.
Data Loading
First, make sure that the path to your dataset is defined, like in Figure 2-6.
You may optionally define this to be the full, absolute path, just to make sure pandas
will find this file if you are having problems. Here, the data folder resides in a directory
level above our notebook, as it contains data files common to every chapter.
To load the data, we will use pd.read_csv(data_path), as shown in Figure 2-7. The
function read_csv() reads a csv file and all the data contained within it. Pandas can
also read from other input formats, including JSON, SAS7BDAT, and Excel files, through
functions such as read_json(), read_sas(), and read_excel(). You can find more
details in the Pandas documentation: https://pandas.pydata.org/docs/reference/index.html.
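The path definition and loading step described above might look like the following sketch. The relative path '../data/train.csv' is an assumption based on the description of the data folder sitting one level above the notebook, so adjust it to wherever you extracted the file. So that the snippet runs even without the file, it falls back to a tiny illustrative CSV with a few Titanic-style column names.

```python
import io
from pathlib import Path

import pandas as pd

# Assumed location: the data folder sits one directory level above the notebook
data_path = Path('../data/train.csv')

if data_path.exists():
    # Normal case: load the Titanic training data from disk
    df = pd.read_csv(data_path)
else:
    # Fallback for illustration only: three sample rows held in memory
    sample_csv = io.StringIO(
        "PassengerId,Survived,Pclass\n"
        "1,0,3\n"
        "2,1,1\n"
        "3,1,3\n"
    )
    df = pd.read_csv(sample_csv)

print(df.shape)
```

Note that read_csv() accepts either a path or a file-like object, which is why the in-memory fallback works without any other changes.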
If this runs without producing any errors, you now have a pandas dataframe loaded.
An easy way to visualize this dataframe is to use df.head(N), where N is an optional
parameter to define how many rows you want to see. If you don’t pass N, the default is
five displayed rows. Simply run the line of code shown in Figure 2-8.
df.head(2)
To check the dimensions of the dataframe, use the shape attribute:
df.shape
This returns a tuple (M, N) with the table’s dimensions. It shows M rows and N
columns. For this example, you should see the following printed output:
(891, 12)
Data Saving
To save your dataset, call df.to_csv(save_path), where save_path is a variable that
contains a string path to where you want to save the dataframe. (There are many other
output formats available besides csv.) As an example, run the code shown in Figure 2-9.
df2 = df.head(2)
df2.to_csv('two_rows.csv', index=False)
Figure 2-9. df.head(2) returns two rows of df, which is saved as df2. This is then
saved as a csv to two_rows.csv, with the parameter index=False passed. This
parameter tells pandas not to save the index as part of the csv, so an extra column
is not introduced into the data where there was none before
The parameter index=False tells pandas to not save the dataframe index to the csv.
Pandas creates an index when you load in data, which you can override with a custom
index if it is relevant. You can change this to index=True (which is the default) and see
how that changes the csv output.
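To see the effect of the index parameter without writing any files, you can call to_csv() with no path, in which case pandas returns the csv text as a string. The small dataframe below is made up for this comparison.

```python
import pandas as pd

# A small made-up dataframe for the comparison
df_small = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# With no path given, to_csv() returns the csv contents as a string
without_index = df_small.to_csv(index=False)
with_index = df_small.to_csv(index=True)

print(without_index)
print(with_index)
```

With index=True, the output gains an extra unnamed first column holding the index values 0 and 1. If you load that file again with read_csv(), the column comes back as an ordinary data column unless you tell pandas which column is the index.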
DataFrame Creation
Besides loading data from a specific source, you can create a dataframe from
scratch given a list or a dictionary. This is very useful to do when you are conducting
experiments in an iterative manner over several different variable settings and you want
to save the data into a nice table.
The code shown in Figure 2-10 creates a dataframe of arbitrary metrics.
Figure 2-10. Creating a dataframe from a list of lists (two rows, three columns)
Note The ↵ symbol in Figure 2-10 (and subsequent code displays) indicates
that the code has been truncated and that it’s still the same line. So 'Model1',
'Model2', 'Model3']) is the actual ending of this line.
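A dataframe built from a list of lists with two rows and three columns might look like the following sketch. Only the column names come from the note above. The variable names and the metric values are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical metric values: each inner list is one row of the table
metrics_rows = [
    [0.91, 0.88, 0.93],
    [0.45, 0.52, 0.40],
]

# Each inner list becomes a row; the columns argument names the three columns
metrics = pd.DataFrame(metrics_rows, columns=['Model1', 'Model2', 'Model3'])

print(metrics)
print(metrics.shape)
```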
To create a dataframe from a dictionary, execute the code shown in Figure 2-11. In
this format, the keys of the dictionary are the columns themselves, and the values are
lists with the data corresponding to the keys.
After executing this code, you should once again see the dataframe in Table 2-2, as
this code creates the same output as the code shown in Figure 2-10.
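The dictionary form can be sketched as follows. The metric values are hypothetical placeholders. The final comparison builds the same table from a list of lists to confirm that both approaches yield an identical dataframe.

```python
import pandas as pd

# Keys are the column names; each value is the list of data for that column
metrics_dict = {
    'Model1': [0.91, 0.45],
    'Model2': [0.88, 0.52],
    'Model3': [0.93, 0.40],
}
metrics_from_dict = pd.DataFrame(metrics_dict)

# The same hypothetical table expressed row by row as a list of lists
metrics_from_rows = pd.DataFrame(
    [[0.91, 0.88, 0.93], [0.45, 0.52, 0.40]],
    columns=['Model1', 'Model2', 'Model3'],
)

# equals() checks that values, column names, index, and dtypes all match
same = metrics_from_dict.equals(metrics_from_rows)
print(same)
```

The dictionary form is organized by column while the list form is organized by row, so pick whichever matches how your experiment produces its results.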
Now that you know the very basics of data I/O in pandas, let’s move on to the many
ways we can manipulate this data to our liking.