Machine Learning For Time Series Anomaly Detection: Ihssan Tinawi

This thesis explores the use of machine learning techniques for anomaly detection in time series data from Internet-of-Things sensors, specifically focusing on satellite telemetry signals. The author implemented various models, including LSTM and autoregression, achieving an F0.5 score of 76%, surpassing previous benchmarks. The work emphasizes the importance of automated anomaly detection systems in managing the increasing volume of data generated by organizations.

Machine Learning for Time Series Anomaly Detection
by
Ihssan Tinawi
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Masters of Engineering in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2019
© Massachusetts Institute of Technology 2019. All rights reserved.

Author:
Department of Electrical Engineering and Computer Science
May 10, 2019

Certified by:
Kalyan Veeramachaneni
Principal Research Scientist
Thesis Supervisor

Accepted by:
Katrina LaCurts
Chairman, Department Committee on Graduate Theses
Machine Learning for Time Series Anomaly Detection
by
Ihssan Tinawi

Submitted to the Department of Electrical Engineering and Computer Science

on May 10, 2019, in partial fulfillment of the
requirements for the degree of
Masters of Engineering in Computer Science and Engineering

Abstract
In this thesis, I explored machine learning and other statistical techniques for anomaly
detection on time series data obtained from Internet-of-Things sensors. The data,
obtained from satellite telemetry signals, were used to train models to forecast a
signal based on its historic patterns. When the prediction error passed a dynamic
threshold, that point was flagged as anomalous. I used multiple models such
as Long Short-Term Memory (LSTM), autoregression, Multi-Layer Perceptron, and
Encoder-Decoder LSTM.
I used the "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric
Dynamic Thresholding" paper as a basis for my analysis, and was able to beat their
performance on anomaly detection by obtaining an 𝐹0.5 score of 76%, an improvement
over their 69% score.

Thesis Supervisor: Kalyan Veeramachaneni

Title: Principal Research Scientist

Acknowledgments
Bismillah. In the name of God, the Most Gracious, the Most Merciful.
I would like to thank SES for their support in our project.
I would also like to thank my supervisor Kalyan for all I have learned from him.
Thanks to my friends and especially my roommate Rami for making this year a
memorable one.
Finally, special thanks to my parents without whom none of this would be possible.

Contents

1 Introduction
1.1 Machine Learning for Anomaly Detection
1.2 Motivations for Anomaly Detection
1.3 Contributions & Key Findings

2 Relevant Work
2.1 Anomaly Detection for Spacecraft
2.2 Using Neural Networks for Anomaly Detection

3 Data

4 System Architecture
4.1 Overview
4.2 Data Pre-processing
4.2.1 Time series Aggregation
4.2.2 Rolling Window Sequences
4.3 Models
4.3.1 Multilayer Perceptron (MLP)
4.3.2 Long Short-Term Memory (LSTM)
4.3.3 LSTM Autoencoder
4.3.4 Linear Autoregression
4.4 Post-Processing: Anomaly Detection
4.5 Sample Pipeline

5 Analysis
5.1 Mean Squared Error
5.2 F-0.5 Score
5.3 Further Work
5.4 Conclusion

List of Figures

1-1 Divergence of growths between data and human resources within organizations.
3-1 NASA Signal A-6 plotted to show its continuous nature.
3-2 NASA Data Format.
4-1 Mechanism of time series aggregation.
4-2 Mechanism of rolling window sequences. The sliding window corresponds to different overlapping sequences.
4-3 Tanh Activation Function.
4-4 ReLU Activation Function.
4-5 Multilayer Perceptron Architecture.
4-6 LSTM Architecture.
4-7 Signal P-11.
4-8 Signal P-11 and the forecast from NASA LSTM Model.
4-9 Close-up on signal P-11 and the forecast from NASA LSTM Model.
4-10 Signal P-11 and Forecast from NASA LSTM Model. Flagged Anomaly in Shaded Region.

List of Tables

3.1 Data Summary
5.1 MSE Summary across 82 signals. Autoregression doesn’t have a training MSE as the neural network models do, since the model is fitted differently.
5.2 𝐹0.5 Scores of the Models.
5.3 Precision & Recall Scores for the Models.

Chapter 1

Introduction

1.1 Machine Learning for Anomaly Detection

Machine learning is a field within statistics in which programs use historic data to
make predictions about future data points or unknown labels. A model is said to learn
when it is able to produce estimates of a data point based on whatever structure it
has inferred from previous data. Machine learning can be used in the context of time
series forecasting to predict the value at the next time step; in classification problems
where given an input, the model’s goal is to predict the label; and in clustering to
determine which points resemble each other in high-dimensional space. These are
just three examples of common uses of machine learning.
A common goal of machine learning is to map high-dimensional data into a representation
in which data points can be separated based on their labels. Models try to minimize
distances between similar points and classify them correctly. At the core of machine
learning are data and loss functions. Models learn features from the data by minimizing
a loss function, with the aim of generalizing to previously unseen data.
Time series anomaly detection is a field that has historically relied on statistical
methods to identify anomalies in sequential data. The term time series
refers to data whose values are associated with timesteps. Examples of time
series data are stock prices or temperature readings from a thermometer. In both
cases data is sampled at a certain frequency, and the values at those instances are
recorded. Often, time series data exhibits patterns such as periodicity, cyclic growth, or
exponential decay. The data can also sometimes undergo
shocks or changes that go against its predictable nature. For example, a company’s
stock price can tank if news leaks that it has sustained a big loss. Similarly, a sudden
increase in the temperature of a room in a factory can signal that there is a fire. In
any of these situations, it would be beneficial to have a system that can detect and
flag anomalies, giving administrators the ability to minimize losses. We will utilize a
publicly available spacecraft data set to analyze the validity of using machine learning
systems for anomaly detection tasks in spacecraft.

There are several ways to carry out anomaly detection, using statistical methods
such as clustering, wavelet transforms, autoregressive models, and other machine
learning and digital signal processing techniques. Within machine learning, the time
series anomaly detection problem can be classified into two categories: supervised
and unsupervised. The first category occurs when historic anomalies are known and
models are fitted to identify similar anomalies, and this is known as “supervised
learning” because data points are labeled with their true nature. In other settings,
the labels are not known and the model tries to predict whether a point is anomalous
based on how close it is to previously seen data, and these instances are referred to
as “unsupervised learning” tasks.

Unsupervised learning settings are much more common, because machines don’t
know a priori whether they are experiencing glitches, so the data doesn’t contain
explicit information about when and where an anomaly is taking place. In these
settings, it is expensive to hire experts to label whether data points are anomalous,
and it is even possible for the experts to misidentify or entirely miss abnormalities. A
good model is tuned so that it is neither too strict, leading to too few anomalies being
noticed, nor over-sensitive, such that many normal points are flagged as anomalies. This
enables companies to adjust their services to situations of high demand or to replace
failing components of vehicles before they endanger passengers. In spacecraft, where
the machine is already launched and cannot be maintained, knowing ahead of time
about failure of components enables the operators to migrate their services to other
craft or to rely on other components of the same machine.

1.2 Motivations for Anomaly Detection

If we were to plot the amount of data being generated within organizations, it would
be an exponential curve as in Figure 1-1 below. To make sense of all this data,
companies need to hire analysts to match the large rate of growth of the data. How-
ever, organizations can at best only afford to hire in a linear manner due to financial
constraints. Therefore, there is a growing gap between the amount of data being
generated and the number of personnel available to go through it. This causes an
intelligence gap, and it represents a large problem that companies in the digital age
are facing. Namely, it is too expensive to hire humans to review data at the rate
it is being generated. This is where computers can contribute. Machine intelligence
programs can be built in order to monitor information from critical systems and flag
anomalies as they occur. Human experts can then view these warnings, and decide
to deal with them on a case-by-case basis. This would decrease the amount of work
that the human experts have to do, thereby decreasing the burden on companies to
hire more and more experts.
Building machine learning systems that can process information and identify
anomalies is more cost effective than creating human-based teams. Such systems can also
detect changes in the signal that are too subtle for humans to identify. Statistical methods
can be used to detect contextual anomalies, which are values that are abnormal only
relative to their context, such as a seasonal trend. For instance, the average temperature
is much lower in the winter than in the summer. So, if you encounter a temperature of
80 degrees Fahrenheit in the winter, it’s an anomaly, whereas in the summer it would be
a normal temperature.
In our work, we focus on creating an anomaly detection system for satellites and
spacecraft. The input to the system is telemetry data coming from the satellite, and
the output is anomalies. The system flags anomalies and alerts technicians, reducing
the overall amount of data they need to review. The best system has a small number
of false positives and false negatives. That is, the system does not over-sensitively flag
any and all variations in the data (increasing the amount of work technicians have to do),
and at the same time its threshold is not so strict that it cannot identify anomalies.

Figure 1-1: Divergence of growths between data and human resources within organizations.

1.3 Contributions & Key Findings

In this work, we explore several time series forecasting methods and apply them in
the context of spacecraft data to build an anomaly detection system. We study the
performance of multilayer perceptrons (MLP), Long Short-Term Memory (LSTM),
LSTM Encoder-Decoder, and linear autoregression models to contrast which archi-
tectures perform the best for time series data forecasting.
We work with a public dataset provided by researchers at NASA. The data comes
from two of their spacecraft, and is available in a format that is standard to time
series data. As such, our system can be adapted to work with any dataset that shares
the format of the NASA data. More information about the data and its format can
be found in chapter 3.

Anomaly detection is a technique used in the field of statistics to determine out-
liers from signals. In our work, we apply state of the art machine learning and
traditional statistical techniques to spacecraft telemetry data to infer anomalies from
data in an unsupervised manner. We adapt the dynamic error thresholding technique
found in NASA’s LSTM and Nonparametric Thresholding Paper [2] in order to detect
anomalies. Below, we summarize our main contributions and findings.

∙ Built an end-to-end machine learning pipeline that can detect anomalies and is
modular for trying out different models.

∙ Tested out the pipeline with 4 different architectures (MLP, LSTM, LSTM
AutoEncoder, and Linear Autoregression).

∙ We found that increasing the length of the input sequence to the models yielded
improved results.

∙ By decreasing the number of hidden units, we were able to outperform NASA’s
model even without using their pruning technique, obtaining an 𝐹0.5 score of
0.76 over the reference paper’s 0.69 score.

∙ The LSTM Autoencoder achieved a high precision score of 0.83, close to
the NASA model’s precision score of 0.88.
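Because the 𝐹0.5 score is the headline metric in the findings above, a minimal Python sketch of how an F-beta score can be computed from detection counts may be helpful. It uses the standard F-beta definition; the function name and the counts in the usage lines are hypothetical and are not taken from our experiments.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=0.5):
    """F-beta score from detection counts.

    With beta < 1 (for example 0.5), precision is weighted more heavily than recall.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only to illustrate the asymmetry of F0.5.
precise_detector = f_beta(true_positives=5, false_positives=0, false_negatives=5)
noisy_detector = f_beta(true_positives=5, false_positives=5, false_negatives=0)
print(precise_detector > noisy_detector)
```

With beta set to 0.5, a detector with perfect precision and 50% recall scores higher than one with perfect recall and 50% precision, which matches the goal of limiting false alarms for technicians.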

Chapter two discusses related work in the literature, and describes the difference
between our approach and other papers in the field.
Chapter three details the data that we are using.
Chapter four provides an overview of the system that was implemented. It de-
scribes the architecture of the anomaly detection system, and breaks it down into
pre-processing, modeling, and post-processing.
Chapter five is a comprehensive summary of our results and contains an analysis
of the different models we tested out. Possible future expansions to the project are
also discussed.

Chapter 2

Relevant Work

In recent years, researchers have employed many deep learning models
for time series forecasting. State-of-the-art models now apply methods such as Long
Short-Term Memory (LSTM), stacked LSTMs, and autoencoders, among other architectures,
to forecast time series data and then detect anomalies from those predictions
once the real measurements are observed. For example, in a paper titled “Detecting
Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding” [2],
researchers from NASA apply a stacked LSTM model to telemetry data obtained
from two of their spacecraft. The signal is passed through the model, which generates
a forecast for the next timestamp. The predicted value is then compared to the realized
value to compute an error. After applying smoothing techniques, the error is
passed through a dynamic threshold that determines whether the observed value is
an anomaly, based on the previous errors of the model. We draw upon several such
papers from the literature to implement different techniques and build
several pipelines that we can train on our data and use to detect anomalies.
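The error computation and smoothing steps described above can be sketched in a few lines of Python. This is only an illustration: the text does not fix the smoothing method here, so the exponentially weighted moving average below, the function names, and the value of alpha are assumptions.

```python
def absolute_errors(actual, predicted):
    """Point-wise absolute error between realized and forecast values."""
    return [abs(a - p) for a, p in zip(actual, predicted)]


def ewma(values, alpha=0.3):
    """Exponentially weighted moving average (assumed smoothing choice).

    A larger alpha follows the raw errors more closely; a smaller alpha smooths more.
    """
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical readings and forecasts for five timesteps.
actual = [0.10, 0.12, 0.11, 0.90, 0.12]
predicted = [0.11, 0.11, 0.12, 0.12, 0.13]
smoothed_errors = ewma(absolute_errors(actual, predicted))
```

Smoothing dampens isolated spikes in the raw errors, so that the thresholding step reacts to sustained deviations rather than to single noisy readings.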
The NASA paper introduced a novel technique called dynamic error thresholding.
This method differs from previously used error cutoffs, such as fixed error
thresholds like x-sigma, and from distance-based approaches such as clustering [4].
Non-stationary data is characterized by a mean and variance that shift with time.
For instance, the data can have growth trends or exhibit cyclic behavior. This
non-stationarity makes it difficult for fixed error thresholding techniques to work in
practice. Thus, dynamic error techniques, which are not constrained by non-stationary
data, have the potential to perform better. The error thresholding technique
proposed by the NASA team is nonparametric, dynamic, and unsupervised. We use this
error thresholding technique and compare its performance to x-sigma thresholding
to see which provides a better detection rate for anomalies. More information about the
theory underpinning dynamic error thresholding can be found in chapter 4 (System
Architecture).
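For comparison, the fixed x-sigma baseline mentioned above can be sketched as follows. This is a minimal illustration that assumes the threshold is the mean of the errors plus x population standard deviations, computed once over the whole error sequence; the function name and the sample errors are hypothetical.

```python
from statistics import mean, pstdev


def x_sigma_flags(errors, x=3.0):
    """Flag errors lying more than x standard deviations above the mean.

    The threshold is computed once over the full sequence, which is what
    makes it "fixed": it does not adapt when the error distribution drifts.
    """
    threshold = mean(errors) + x * pstdev(errors)
    return [error > threshold for error in errors]


# Hypothetical smoothed errors: a flat baseline with one large spike at the end.
errors = [0.1] * 20 + [5.0]
flags = x_sigma_flags(errors)
```

Because the mean and standard deviation are global, a slow drift in the errors inflates the threshold for the whole sequence, which is the weakness that dynamic thresholding aims to address.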

2.1 Anomaly Detection for Spacecraft

In the field of aerospace engineering, there has been extensive use of anomaly detec-
tion. As previously mentioned, these systems are difficult to create, because on the
one hand you do not want to have a threshold that is too sensitive, resulting in too
many false flags. Such a sensitive threshold would be expensive to maintain, because
technicians and observers will have to verify a large number of anomalies. And on
the other hand, you do not want a threshold that is too strict, meaning that a lot of
anomalies will go by undetected.
Regardless, many anomaly detection techniques exist for spacecraft. For example,
on the International Space Station, there is the Inductive Monitoring System
(IMS), built by NASA to deal with anomalous behavior. The IMS uses a clustering-based
approach to determine the health of signals, as compared to nominal performance.
Clustering techniques are distance-based, meaning that clusters are formed
based on how far a data point is from others. Clustering methods do not require
the data to be labeled, making them a good fit for our unsupervised task. However, the
problem with many clustering algorithms is that you need to know the number of
clusters a priori, and they are very sensitive to outlier data points. For instance, one
outlier data point can warp the shape of a cluster so that it spans incorrect regions.
To outline how a clustering algorithm would carry out anomaly detection, we
describe a setup that utilizes 𝑘-means clustering. 𝐾-means clustering is not to be
confused with 𝑘-nearest neighbors, which is used in classification and regression.
𝐾-means clustering serves to partition 𝑛 observations into 𝑘 clusters, where each
data point is assigned to the cluster with the closest mean. The problem can be
solved approximately with an iterative algorithm that alternates between two steps:
assignment and update. To conduct time series anomaly detection, the signal is passed
in as individual data points, and the number of clusters can be set to 2 (one anomalous
and one normal). Alternatively, we can experiment with 3 clusters (anomalous, normal,
and outliers) to see how that improves the performance of the system.
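The assignment and update steps can be illustrated with a small one-dimensional 𝑘-means sketch in plain Python. The function name, the deterministic initialization (spreading the initial means over the sorted data), and the toy signal are assumptions made for the example; treating the smaller cluster as the anomalous one is likewise only an illustrative convention.

```python
def k_means_1d(points, k, iterations=100):
    """Cluster scalar values by alternating assignment and update steps."""
    ordered = sorted(points)
    # Assumed initialization: spread the initial means evenly over the sorted data.
    means = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest mean.
        new_assignments = [
            min(range(k), key=lambda j: abs(p - means[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each mean moves to the average of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                means[j] = sum(members) / len(members)
    return means, assignments


# Hypothetical signal values: four nominal readings and one extreme reading.
signal = [0.1, 0.0, -0.1, 0.05, 3.0]
means, assignments = k_means_1d(signal, k=2)
```

In this toy example the extreme reading ends up alone in its own cluster, which also shows how strongly a single outlier shapes the result.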

The problem with a clustering approach is that typically algorithms assume that
the data is equally distributed across the clusters, which is not a valid assumption to
make in the case of anomalies. For instance, a signal might have little to no anomalies
in a given sample. Another limitation of clustering-based approaches is that the data
might go through multiple phases. For instance, if we are looking at a thermometer’s
signal, and the thermometer is on Earth, then data in the summer will be higher on
average than data in the winter. This will lead the clustering algorithm to assign
the temperature readings into one cluster in the summer and another in the winter.
Thus, contextual anomalies are lost, where an unexpectedly high temperature in the
winter is just assigned to the summer cluster.

Beyond clustering-based approaches for anomaly detection, regression error techniques
are the most common in time series forecasting and anomaly detection, and
this is the approach that was taken by several papers on which we base our work.
The NASA paper, mentioned above, uses LSTMs as the regressor in the model. In
another paper, researchers from TCS Research in India use an LSTM-based
encoder-decoder model for multi-sensor anomaly detection in spacecraft [3]. The system we
build carries out modeling of the signal using different neural networks, adapted from
such papers, and then applies NASA’s dynamic error thresholding technique to
determine whether the residual error is too large, indicating that the input wasn’t in
line with what was expected. The process behind this approach is outlined further in
chapter 4 (System Architecture).

2.2 Using Neural Networks for Anomaly Detection
Recently, there has been great interest among research groups in applying neural
architectures to time series forecasting. With the advent of faster computing
chips and the abundance of data, a new Golden Era of machine learning has started,
making researchers interested in applying deep learning techniques to decades-old
statistical fields. Among these contributions are several papers that we use as the basis
for the time series forecasting in our system. These methods are in contrast to the
“traditional” machine learning tools such as nearest neighbor or 𝑘-means clustering
that we explored earlier. They are also different from the statistical techniques such
as low-pass filters and other digital signal processing methods.
In deep learning, architectures such as Recurrent Neural Networks (RNNs) and
Long Short-Term Memory (LSTM) networks were found to perform better on sequential
data than feed-forward neural network architectures. Due to their ability to encode
temporal properties, RNNs and LSTMs perform well in time series forecasting, speech
recognition, and natural language processing tasks, since all of these tasks rely on
modeling sequences. LSTMs have the ability to learn the relationship between past data
and current values. The capacity of LSTMs to model this dependency has led to a large
academic interest in using these units in machine learning for time series forecasting.
The NASA and TCS Research papers that we implement are some examples, and we
compare and contrast these with different architectures in our work.

Chapter 3

Data

The data we are using is in time series format. This means that every value is a
reading from the sensors of the spacecraft that occurred at a specific moment in
time. The data was obtained from a public dataset used in a study published by
research scientists at NASA. The telemetry data comes from two spacecraft: the Soil
Moisture Active Passive satellite (SMAP) and the Curiosity rover of the Mars Science
Laboratory (MSL) mission.
The data was anonymized with respect to time, but it retained its sequential nature.
The anonymization does not affect any time series forecasting of the data, as the
timestamp itself doesn’t matter, and we only care about the values in the sequences.
Furthermore, the telemetry signals are divided into several channels, where the
channel category and the signal number describe the signals and give hints as to which
of them are related. For example, signals “P-1” and “P-2” are related, as they come
from components representing “Power” signals of the craft. There are 82 signals avail-
able in the NASA dataset, with 55 coming from the SMAP satellite and 27 from the
MSL Mars Rover. The signals were individually plotted and evaluated for qualitative
features, such as being discrete or continuous and exhibiting any periodic trends. 54
of the 82 signals were found to be continuous by inspection, and the remaining sig-
nals were discrete. In order to determine characteristics for each signal, the Python
Plotly toolkit was used to plot each signal. After that, we visually inspected each
graph, zooming in on parts of the signal to determine whether the data was indeed
continuous or discrete at the lowest level.

Figure 3-1: NASA Signal A-6 plotted to show its continuous nature.

In Figure 3-1, we see the example of Signal A-6 from the NASA data. The graph
shows the data from timesteps 5000 to 7500. As mentioned above, the data has been
anonymized with respect to the actual timestamp, so we use the timestep sequence
number in place of timestamps. Signal A-6 is continuous, and oscillates between
values of −0.356 and 0.356. Additionally, there seems to be an outlier value at around
timestep 5500. This qualitative analysis is useful when we later try to explain why
some signals might have been better modeled using certain architectures. It also
provides an additional layer of insight when we try to see why a certain signal was
difficult to predict. The qualitative inspection was carried out for all signals.
For a given signal, let’s say “M-1”, the data has 𝑚 rows and 𝑛 columns. The 𝑚
rows hold the commands sent to different modules, in binary format, with the
last row holding the telemetry value itself. The layout of the data can be found in
Figure 3-2, adapted from the NASA paper. There are 𝑛 columns, corresponding to the
number of timestamps or readings taken for that signal. Therefore, the signal at each
timestep is an 𝑚 × 1 matrix. For our purposes, we do not use the commands as input
features for our models, so we only create sequences from the telemetry value itself.
So, if we decide to use a sequence of length 250, then the input to the model would be
of size 250 × 1, instead of 250 × 𝑚. The structure of input and output data to and from
the models is explained in more detail in chapter 4 (System Architecture).

Figure 3-2: NASA Data Format.
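To make the shapes concrete, the following Python sketch extracts one model input of size 250 × 1 from a signal stored in the layout described above (𝑚 rows, 𝑛 columns, telemetry in the last row). The function name, the nested-list representation, and the synthetic signal are assumptions made for illustration only.

```python
def telemetry_window(signal, start, sequence_length=250):
    """Return one input sequence of shape sequence_length x 1.

    `signal` is assumed to be a list of m rows with n values each, where the
    command rows come first and the last row holds the telemetry values.
    """
    telemetry = signal[-1]  # keep only the telemetry row, drop the commands
    end = start + sequence_length
    if start < 0 or end > len(telemetry):
        raise ValueError("window does not fit inside the signal")
    return [[value] for value in telemetry[start:end]]


# Synthetic signal: m = 4 rows (3 command rows of zeros, 1 telemetry row), n = 1000.
m, n = 4, 1000
signal = [[0] * n for _ in range(m - 1)] + [[t / n for t in range(n)]]
window = telemetry_window(signal, start=100)
print(len(window), len(window[0]))  # 250 1
```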

Upon looking at the data, one notices that all the telemetry values are between
-1 and 1. These are not the real values that were captured from the devices: the
researchers pre-processed the data to scale its values to that range. This
scaling is an important step for time series forecasting, as it confines the
range of predictions the model has to learn. Therefore, our system assumes that data
has already been scaled to the aforementioned range, and an algorithm to apply
this to new and real-time data can be found in section 5.3, entitled “Further Work.”
The original researchers of the paper scaled the public data, but we also made two
changes to the data before carrying out the analysis.
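As an illustration of the kind of scaling involved, the sketch below applies min-max scaling to map values into the range [-1, 1]. This is an assumed, commonly used method and not necessarily the exact procedure the NASA researchers applied; the function name and the sample readings are hypothetical.

```python
def scale_min_max(values, v_min=None, v_max=None, low=-1.0, high=1.0):
    """Linearly map values into [low, high].

    v_min and v_max default to the extremes of `values`. For new or real-time
    data they would instead be fixed in advance, for example from training data.
    """
    v_min = min(values) if v_min is None else v_min
    v_max = max(values) if v_max is None else v_max
    if v_max == v_min:
        # A constant signal carries no range information; map it to the midpoint.
        return [(low + high) / 2 for _ in values]
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


# Hypothetical raw sensor readings.
scaled = scale_min_max([10.0, 15.0, 20.0])
print(scaled)  # [-1.0, 0.0, 1.0]
```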

First, we combined the “train” and “test” signals into one data file for each signal.
This enables us to try out different train/test splits. The paper had limited train/test
splits of the data, and in some cases there were more testing points than
training points. In time series anomaly detection, you implicitly assume
that the training data does not contain any anomalies. Therefore, the train/test split
has a large effect on the outcome of the experiment. We found that to be true in our
experiments, where we encountered poorer performance when we used a train/test
split different from the one used in the NASA paper. We discuss this further in our
analysis (chapter 5). The second important change made to the data is that
we use only the telemetry value and not the series of commands that was input
to the different modules. While these input commands can act as features (such
as the mode of the satellite), the NASA paper ignores this part of the signal, and
instead focuses on re-creating telemetry values from previous ones rather than
contextualizing them with other features of the craft. That is, at each point we want to
predict the value of the telemetry signal, and we do not necessarily want to guess the
type of command that was sent to different modules. As mentioned earlier,
this means that the input is now of the shape sequence_length × 1 rather
than sequence_length × 𝑚, where 𝑚 is the number of rows in the input matrix.
In both our work and the NASA paper, the dimension of the input vector is limited
to 𝑚 = 1.

Table 3.1: Data Summary

Number of Continuous Signals    53
Number of Discrete Signals      29
Total Number of Anomalies       106
Average Anomalies Per Signal    1.3
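The first change described above, merging the train and test portions and then re-splitting them, can be sketched as follows. The function name and the train_fraction parameter are hypothetical; the sketch assumes the split is chronological, so that the training data always precedes the test data.

```python
def merge_and_split(train, test, train_fraction=0.7):
    """Concatenate train and test in time order, then re-split chronologically."""
    combined = list(train) + list(test)
    cut = int(len(combined) * train_fraction)
    return combined[:cut], combined[cut:]


# Hypothetical signal that originally had more test points than training points.
original_train = [1, 2, 3]
original_test = [4, 5, 6, 7, 8, 9, 10]
new_train, new_test = merge_and_split(original_train, original_test, train_fraction=0.5)
```

Keeping the split chronological avoids training on values that occur after the test period, but it also means that any anomaly falling inside the new training portion violates the assumption that the training data is anomaly-free.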
Despite the task being unsupervised, NASA researchers hired experts to sift
through the collected spacecraft logs and determine which of the signals were
anomalous. This is part of a normal analysis that typically occurs after mission data has
been collected. In a separate file called labeled_anomalies, NASA provided a
summary of the anomalous regions of each of the 82 signals, as identified by its engineers.
The file is publicly available on the GitHub repository of the paper.¹ It’s important to
note, however, that there may have been anomalies that were overlooked during the
annotation process due to human misjudgment. Nevertheless, the labeled_anomalies
file is taken as the ground truth in our analysis. The labeled anomalies file has 82 rows
and 5 columns: the anomaly sequences, the name of the spacecraft, the Channel ID,
the type of anomaly (contextual or point), and the number of telemetry values in the
stream. In the field of anomaly detection, a point anomaly is defined as data that is
far too large or too small compared to the entire dataset. A contextual anomaly,
however, is anomalous only in the context of the data around it. For example, on vacation
spending $100+ on food might be normal, whereas on normal days it is considered
anomalous. In this work, we apply NASA’s dynamic error thresholding, which takes
into account the context of a data point, mitigating the incorrect classifications that
may occur if errors are compared to a static threshold. A more thorough description
of the technique can be found in chapter 4 (System Architecture).

¹ https://github.com/khundman/telemanom
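To show how such a ground-truth file can be used, the sketch below parses a small inline sample with the five columns listed above and checks whether a timestep falls inside a labeled anomalous region. The column names, the channel names, and the sample rows are hypothetical stand-ins rather than contents of the real file, and the region end points are assumed to be inclusive.

```python
import ast
import csv
import io

# Hypothetical sample mimicking the five columns described in the text.
SAMPLE = '''chan_id,spacecraft,anomaly_sequences,class,num_values
X-1,SMAP,"[[100, 150]]",[point],1000
X-2,MSL,"[[20, 40], [300, 320]]","[contextual, point]",500
'''


def load_anomaly_windows(csv_text):
    """Map each channel ID to its list of (start, end) anomalous regions."""
    windows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sequences = ast.literal_eval(row["anomaly_sequences"])
        windows[row["chan_id"]] = [tuple(pair) for pair in sequences]
    return windows


def is_labeled_anomalous(windows, chan_id, timestep):
    """True if the timestep lies inside any labeled region (inclusive ends assumed)."""
    return any(start <= timestep <= end for start, end in windows.get(chan_id, []))


windows = load_anomaly_windows(SAMPLE)
print(is_labeled_anomalous(windows, "X-1", 120))  # True
```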

Chapter 4

System Architecture

4.1 Overview
We are building an anomaly detection system, which is capable of swapping similar
primitives to rapidly prototype different architectures and converge on the model that
performs best for a given task. With that design consideration, we split our anomaly
detection task into three modular components. In the first one, we carry out all pre-
processing tasks, readying the data for a time series forecasting model. In the second
module, we can use any linear, non-linear, or machine-learning regressor to generate a
prediction for the next time step. In the last module, we compute the error residuals
between the predicted values and the realized values. Based on the error thresholding
technique used, anomalies are flagged and stored into a database.
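The three-module design described above can be sketched as a function that accepts interchangeable primitives. All names below are hypothetical, and the stand-in primitives (a last-value forecaster and a fixed error limit) are deliberately simple placeholders, not the models or the thresholding technique used in our system.

```python
def make_windows(signal, length=3):
    """Pre-processing stand-in: build (input window, next value) pairs."""
    inputs = [signal[i:i + length] for i in range(len(signal) - length)]
    targets = signal[length:]
    return inputs, targets


def last_value_forecast(window):
    """Modeling stand-in: predict that the next value equals the last one seen."""
    return window[-1]


def fixed_threshold(errors, limit=1.0):
    """Post-processing stand-in: indices of errors above a fixed limit."""
    return [i for i, error in enumerate(errors) if error > limit]


def run_pipeline(signal, preprocess, forecast, detect):
    """Chain the three modules; any of them can be swapped independently."""
    inputs, targets = preprocess(signal)
    predictions = [forecast(window) for window in inputs]
    errors = [abs(t - p) for t, p in zip(targets, predictions)]
    return detect(errors)


# Hypothetical signal with a sudden jump at timestep 4.
signal = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
flagged = run_pipeline(signal, make_windows, last_value_forecast, fixed_threshold)
print(flagged)  # [1]
```

The returned indices refer to positions in the error sequence; with a window length of 3, error index 1 corresponds to timestep 4 of the signal. Replacing any one of the three callables changes that stage without touching the others.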
In each of these sections, we talk about the algorithms that were implemented,
detailing the inputs and outputs, as well as which parameters could be changed and
tuned to achieve different results. In the results section, we delve deeper into the
performance results on the anomaly detection, and highlight which models and error
thresholding techniques achieved better performance in practice.
Chapter 4 is organized as follows. In section 4.2, we describe the pre-processing
phase and the two algorithms we use to shape the input to the models. In section
4.3, we present the different architectures that we have used in our work and
discuss the theory behind the models. Section 4.4 covers the post-processing
and dynamic error thresholding technique that we use to detect anomalies. Finally,
section 4.5 walks through a sample run of the pipeline using one of the NASA
signals.

4.2 Data Pre-processing

The raw telemetry data cannot be directly ingested into our pipeline; it first needs
to undergo some processing. We implement two methods that change the data
into a format that enables us to carry out time series forecasting. The first method is
called 𝑡𝑖𝑚𝑒 𝑠𝑒𝑟𝑖𝑒𝑠 𝑎𝑔𝑔𝑟𝑒𝑔𝑎𝑡𝑖𝑜𝑛, and it combines data across time intervals, thereby
changing the frequency of the data. The other is called 𝑟𝑜𝑙𝑙𝑖𝑛𝑔 𝑤𝑖𝑛𝑑𝑜𝑤 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠, and
it creates overlapping sequences that serve as inputs to the time series forecasting
models. We will now go over each of these methods and describe its functionality.

4.2.1 Time series Aggregation

Inputs

∙ 𝑋: time series data, with values and time stamps

∙ 𝑖𝑛𝑡𝑒𝑟𝑣𝑎𝑙_𝑙𝑒𝑛𝑔𝑡ℎ: length of the interval, in seconds, that we want to aggregate over

Outputs

∙ 𝑣𝑎𝑙𝑢𝑒𝑠: data aggregated by taking the average across intervals

∙ 𝑖𝑛𝑑𝑒𝑥: the corresponding time stamp for each new aggregated data point

Value                0.78  0.98  0.51  0.21  -0.31  -0.66  0.15  0.57  -0.65  0.99
Timestamp            1     2     3     4     5      6      7     8     9      10

Aggregated Value (average)   0.434  0.08
Aggregated Timestamp         1      6

Figure 4-1: Mechanism of time series aggregation.

The time series aggregation method takes in time series data and groups the data
points that fall within each interval, combining them with an aggregation function
such as the mean or the median. The method is useful because it reduces the size of
data sets by changing the frequency at which the data is sampled. Oftentimes, in IoT
sensor data, information is collected at a very high rate, but a lot of the information is
redundant, and can be summarized by taking averages across an interval of one hour,
for example. This also cleans up the data that is being used, since outlier readings are
smoothed out across the interval. Though this is a limitation if we are looking for
individual anomalous events, it does help to focus on long term trends. If users of the
system believe that their data is sampled at the right frequency, then the interval can
be set to the current sampling interval. In the case of the NASA data, each timestamp
was equivalent to one anonymized time step, and thus the interval was set to 1.
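
As a rough illustration of the aggregation step, the sketch below averages values over consecutive, non-overlapping intervals and reproduces the example values of Figure 4-1. The function name follows the 𝑡𝑖𝑚𝑒_𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠_𝑎𝑣𝑒𝑟𝑎𝑔𝑒 method mentioned in section 4.5, but the signature and the choice to label each interval with its starting time stamp are assumptions made for this example, not the actual implementation.

```python
from statistics import mean

def time_segments_average(values, timestamps, interval_length):
    """Average values over consecutive, non-overlapping intervals.

    Assumed convention: each interval covers [start, start + interval_length)
    and is labeled with its starting time stamp. Empty intervals are skipped.
    """
    if interval_length <= 0:
        raise ValueError("interval_length must be positive")
    agg_values, agg_index = [], []
    bucket, start = [], None
    for value, ts in zip(values, timestamps):
        if start is None:
            start = ts
        # Close finished intervals until ts falls inside the current one.
        while ts >= start + interval_length:
            if bucket:
                agg_values.append(mean(bucket))
                agg_index.append(start)
                bucket = []
            start += interval_length
        bucket.append(value)
    if bucket:
        agg_values.append(mean(bucket))
        agg_index.append(start)
    return agg_values, agg_index

values = [0.78, 0.98, 0.51, 0.21, -0.31, -0.66, 0.15, 0.57, -0.65, 0.99]
timestamps = list(range(1, 11))
agg_values, agg_index = time_segments_average(values, timestamps, interval_length=5)
print([round(v, 3) for v in agg_values], agg_index)  # [0.434, 0.08] [1, 6]
```

With `interval_length=1`, as used for the NASA data, every point forms its own interval and the signal is returned unchanged.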

4.2.2 Rolling Window Sequences

Inputs

∙ 𝑋: time series data, with values and time stamps

∙ 𝑖𝑛𝑑𝑒𝑥: the corresponding time stamps for values

∙ 𝑤𝑖𝑛𝑑𝑜𝑤_𝑠𝑖𝑧𝑒: the size of the sequence window that you want to generate

∙ 𝑡𝑎𝑟𝑔𝑒𝑡_𝑠𝑖𝑧𝑒: the number of steps ahead to predict

Outputs

∙ 𝑜𝑢𝑡_𝑋: the sequence of 𝑤𝑖𝑛𝑑𝑜𝑤_𝑠𝑖𝑧𝑒 starting at each time stamp

∙ 𝑜𝑢𝑡_𝑦: the corresponding next step(s) ahead to predict, based on 𝑡𝑎𝑟𝑔𝑒𝑡_𝑠𝑖𝑧𝑒

∙ 𝑋_𝑖𝑛𝑑𝑒𝑥: time stamps associated with 𝑜𝑢𝑡_𝑋 values

∙ 𝑌_𝑖𝑛𝑑𝑒𝑥: time stamps associated with 𝑜𝑢𝑡_𝑦 values

Value       0.78  0.98  0.51  0.21  -0.31  -0.66  0.15  0.57  -0.65  0.99
Timestamp   1     2     3     4     5      6      7     8     9      10

Figure 4-2: Mechanism of rolling window sequences. The sliding window corresponds
to different overlapping sequences.

Rolling window sequences is a method that is commonly used to prepare time
series data for building forecasting models. The 𝑜𝑢𝑡_𝑋 output acts as the input to the
forecasting problem, giving the model context in the form of the previous values. The model
then uses the values in the sequence to predict the next 𝑡𝑎𝑟𝑔𝑒𝑡_𝑠𝑖𝑧𝑒 values. In this
project, we set the sequence 𝑤𝑖𝑛𝑑𝑜𝑤_𝑠𝑖𝑧𝑒 to 250 steps, as used in the NASA
LSTM paper. In the results section, we discuss the effects of varying the length of
this sequence on the performance of the models.
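
The windowing step can be sketched as follows, using the example values of Figure 4-2. This is a minimal illustration for one-dimensional data; the argument and output names mirror the inputs and outputs listed above, but the exact signature and the choice to report the first time stamp of each window and target are assumptions.

```python
def rolling_window_sequences(values, index, window_size, target_size=1):
    """Build overlapping input windows and the targets that follow them."""
    out_X, out_y, X_index, y_index = [], [], [], []
    last_start = len(values) - window_size - target_size
    for start in range(last_start + 1):
        end = start + window_size
        out_X.append(values[start:end])             # window fed to the model
        out_y.append(values[end:end + target_size]) # next step(s) to predict
        X_index.append(index[start])                # time stamp where the window starts
        y_index.append(index[end])                  # time stamp of the first target
    return out_X, out_y, X_index, y_index

values = [0.78, 0.98, 0.51, 0.21, -0.31, -0.66, 0.15, 0.57, -0.65, 0.99]
index = list(range(1, 11))
out_X, out_y, X_index, y_index = rolling_window_sequences(values, index, window_size=4)
print(len(out_X), out_X[0], out_y[0], X_index[0], y_index[0])
```

With ten points, a window of four, and a target size of one, this yields six overlapping sequences, which matches the count "number of points minus window size" used later in section 4.4.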

4.3 Models
We use a host of machine learning and statistical models in order to carry out the
time series forecasting. The models use the data that was pre-processed using the
methods described in section 4.2 above. The input to forecasting models is sequences
of aggregated data that is scaled to be in the range [-1, 1]. There’s a wide literature
of work that discusses different methods of normalization. In many machine learning
setups, the models perform better when they use normalization techniques [1].
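
The scaling to [-1, 1] can be illustrated with a simple min-max transform. The text does not state which scaler was used, so the sketch below is an assumption: it fits the minimum and maximum on the training data only and reuses them for later data, and the helper name is hypothetical.

```python
def fit_min_max(train_values, low=-1.0, high=1.0):
    """Return a function that maps the training range onto [low, high]."""
    lo, hi = min(train_values), max(train_values)
    span = hi - lo

    def scale(values):
        if span == 0:
            # Constant training signal: map everything to the middle of the range.
            return [0.5 * (low + high) for _ in values]
        return [low + (high - low) * (v - lo) / span for v in values]

    return scale

scale = fit_min_max([2.0, 4.0, 6.0])
print(scale([2.0, 4.0, 6.0]))
print(scale([8.0]))
```

Note that a test value outside the training range, such as 8.0 here, is mapped outside [-1, 1]. Under the setup described in this section, such out-of-range values are exactly the kind of input the model has not seen during training.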
As our problem is an unsupervised learning setup, the training data used to
fit the models is assumed to be correct, i.e. without anomalies. The model
is primed so that all the data it has seen in the training phase is regarded
as normal. The intuition is that the model is accustomed to seeing the data in the
training set. When the model goes live or is subjected to testing data,
any data that exhibits characteristics different from the training data will not
be successfully predicted by the model. As such, there will be a larger than usual
prediction error for any anomalous points. That’s why we would like to train
models that are capable of first capturing periodic trends in the data and second
performing poorly when bad data is ingested into the pipeline.
In our setup, we implemented around 10 models, and we have selected 4 models to
carry out our analysis. The models are: multilayer perceptron, stacked LSTM model,
LSTM encoder-decoder, and linear autoregressive model. They all rely on the same
principle of using past sequential data to try and predict future data, with the exception
of the LSTM encoder-decoder, which seeks to recreate the input itself after compressing
the data with the encoder and then decompressing it with the decoder. Before going into
the results of the models, we discuss the theory and fundamentals underpinning each
of these models, and any implementation details that we found necessary along
the way.

4.3.1 Multilayer Perceptron (MLP)

Multilayer perceptron (MLP) is a type of feedforward artificial neural network
architecture that comprises at least three layers: an input layer, one or more hidden
layers, and an output layer. The hidden layer units have a non-linear activation
function, and the output layer in the context of a time series forecasting problem is a
single unit that predicts the value at the next time step. MLP is considered to be the
“vanilla” neural network, because it does not include any recurrent, convolutional, or
gated layers that more recent architectures adopted. It acts like a linear perceptron,
with the only difference being the nonlinearity introduced by adding activation
functions such as tanh, rectified linear (RELU), and others.
The hyperbolic tangent, 𝑡𝑎𝑛ℎ, function maps all the real numbers to an output
range of (-1, 1). For the output node, this works well for our prediction task, because
we normalize the range of our input values to [-1, 1]. This technique, called feature
scaling, is important in machine learning algorithms, because many objective
functions behave poorly when features have very different ranges. For example, when
a method computes the Euclidean distance between two data points, the distance is
often dominated by features that have a broad range of values. Normalization also
helps training with stochastic gradient descent converge faster.

Figure 4-3: Tanh Activation Function.
Figure 4-4: ReLU Activation Function.

The RELU is another type of activation function that we utilize in the training of
our deep neural networks, where the function definition is

𝑓 (𝑥) = 𝑚𝑎𝑥(0, 𝑥)

RELUs apply a non-linear transformation to inputs, keeping only the positive part
of the input. RELUs are often preferred over other activation functions like 𝑠𝑖𝑔𝑚𝑜𝑖𝑑 and
𝑡𝑎𝑛ℎ, because they offer two benefits: sparsity and better learning. Better learning
is achieved because the vanishing gradient problem is less likely to occur. This
is because, when 𝑥 > 0, the gradient has a constant value of 1, as opposed to the gradient
of tanh, which becomes increasingly smaller the larger |𝑥| gets. The constant gradient
results in faster learning. The second advantage, sparsity, can be observed when 𝑥 ≤ 0
and the output is exactly 0. This transformation leads to a sparser representation,
since many units are inactive for a given input, which can speed up the learning
process for the network and reduce its effective size. Tanh, on the other hand, will
generate non-zero values almost everywhere, resulting in dense
representations.
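
The difference in gradient behavior can be checked numerically. The short sketch below uses only the standard definitions of the two functions; the helper names are chosen for this example.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    # Derivative of tanh(x) is 1 - tanh(x)^2, which shrinks as |x| grows.
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.5, 2.0, 5.0):
    print(x, relu(x), relu_grad(x), round(tanh_grad(x), 6))
```

For large positive inputs the RELU gradient stays at 1 while the tanh gradient approaches 0, and for negative inputs the RELU output is exactly 0, which is the source of the sparsity discussed above.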

Time series forecasting problems are in essence a supervised learning task. While
the data initially is not “labeled”, a few easy transformations, described in Data
Pre-processing (section 4.2 of System Architecture), allow us to treat the data as
if it were labeled. The main trick is to add a lag to the data, so that the realized
value at each later time step serves as the 𝑦 that we are trying to predict. Additionally,
𝑟𝑜𝑙𝑙𝑖𝑛𝑔_𝑤𝑖𝑛𝑑𝑜𝑤_𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠 creates the sequences that are used as input to the
model. In supervised learning settings, artificial neural networks are trained with
a technique called gradient descent (GD). The aim of GD is to minimize an objective
function, usually a cost function that measures how far the predictions are from
the training values. In our models, we used the mean squared error as our cost
function:

$$J(\theta) = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where n is the size of the training data.

In gradient descent, in order to find the best value of a parameter, we repeatedly
subtract the gradient of the cost function with respect to that parameter, scaled by
a learning rate [5]:

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

where 𝛼 is the learning rate.
However, gradient descent turns out to be slow in training, so we use a variant
called Stochastic Gradient Descent (SGD), where instead of using the cost gradient
of all examples, we calculate it for only one example at each iteration. In practice,
SGD often performs comparably but trains faster than regular gradient descent. The SGD
algorithm proceeds as follows. First, we choose an initial vector of parameters 𝑤 and
a learning rate 𝜂. Next, we repeat the following until convergence (that is, until the
error stops decreasing):

∙ Randomly shuffle examples in the training set.

∙ For 𝑖 = 1, 2, ..., 𝑛 do 𝑤 := 𝑤 − 𝜂∇𝑄_𝑖(𝑤), where 𝑄_𝑖(𝑤) is the cost on the 𝑖-th training example.
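
The two steps above can be sketched for the simplest possible model, a linear predictor trained on the squared error of one example at a time. This is an illustrative toy, not the training code used for the neural networks in this work; the function name, learning rate, epoch count, and data are all assumptions.

```python
import random

def sgd_linear(X, y, eta=0.05, epochs=200, seed=0):
    """Fit y ~ w . x + b with per-example SGD on the squared error."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                      # randomly shuffle the examples
        for i in order:                         # one update per example
            pred = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            err = pred - y[i]
            # Q_i(w) = (pred - y_i)^2, so its gradient is 2 * err * x_i (and 2 * err for b).
            w = [wj - eta * 2 * err * xj for wj, xj in zip(w, X[i])]
            b -= eta * 2 * err
    return w, b

# Noise-free toy data generated from y = 2x - 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
X = [[x] for x in xs]
y = [2.0 * x - 1.0 for x in xs]
w, b = sgd_linear(X, y)
print(round(w[0], 3), round(b, 3))
```

On this toy problem the parameters approach the generating values of 2 and -1.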

The architecture that we used for the multilayer perceptron is comprised of an
input layer, 3 hidden layers each followed by a dropout layer, and then one output layer.
A dropout layer is added as a regularization technique. It reduces the overfitting of
the model by preventing the coadaptation of the units based on the training data [6].
All of the hidden layers have RELU activation functions, and the output layer has
a 𝑡𝑎𝑛ℎ activation. The models were trained for 35 epochs, where an epoch is one
pass of the entire training dataset through the neural network. The number of
epochs was determined empirically, such that our model neither overfits nor underfits
the data.

Input Layer → Dense #1 → Dropout Layer → Dense #2 → Dropout Layer → Dense #3 → Dropout Layer → Output Layer

Figure 4-5: Multilayer Perceptron Architecture.

4.3.2 Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) is a type of deep neural network architecture that
is recurrent, meaning that the output at one step is fed back into the system creating
a loop. LSTMs are well suited to classification and prediction problems on sequences,
because their architecture is conducive to learning long-term dependencies between data
points. An LSTM unit does not automatically apply previous signals to the current step, but
uses a set of gates to learn which previous time steps are useful to predict the current
one. The unit is comprised of a cell, an input gate, an output gate, and a forget gate
that work together to forecast the next step. These gate units learn to
open and close, controlling the constant error flow.
As it’s a recurrent layer, an LSTM can be adapted to time series data. In this
model, we replicate the architecture of the NASA LSTM paper, described in Figure
4-6. The architecture is comprised of an input layer, two LSTM layers with 80 hidden
units each, a dropout layer after each LSTM layer, and one output layer. The output layer
is the value at the next time step that we are trying to forecast. If we wanted to
predict multiple steps ahead, then we would need to adjust the number of units at the
output layer. Similarly, if we wanted to change the length of the sequence, which is the
pattern that the model uses to generate a prediction, then we would have to adjust the
input layer to reflect that. The dropout layers are there to introduce regularization to
the network. In neural networks, models tend to overfit to the training data, leading
to poor performance on the test data. That’s why the introduction of regularization
techniques such as dropout tends to yield better performance on the testing set.

Preprocess data → Input Layer → LSTM #1 → Dropout Layer → LSTM #2 → Dropout Layer → Output Layer

Figure 4-6: LSTM Architecture.

During testing, we varied multiple parameters in the architecture of the stacked
LSTM system. We varied the number of hidden units in each LSTM layer, the number
of stacked LSTM layers, and the train/test split of the data. Each of these led to
different results, which we go over in the next section.

4.3.3 LSTM Autoencoder

In machine learning, an autoencoder is a kind of artificial neural network that is used
to learn efficient codings of patterns in an attempt to recreate them. Autoencoders are
unsupervised, as you do not need to provide labels for the input sequences. The model
aims to replicate the input sequence based on the data representation it has learned
from previous sequences. Thus, the autoencoder compresses data in the input layer
to an intermediate representation and then decompresses the encoding to an output
that resembles the original data. As such, autoencoders are tools of dimensionality
reduction. The performance of any autoencoder model is measured by the ability of
the model to recreate the input sequence.
There are multiple types of autoencoders such as regular, denoising, sparse, and
variational autoencoders. Variational autoencoders contain a generative
model. Their latent layers are continuous, allowing random sampling and interpolation.
A regular autoencoder is comprised of an input layer, an output layer, and
one or more hidden layers connecting them. The input layer has to have the same
number of nodes as the output layer, because we are trying to predict the input 𝑋
rather than an output 𝑌.
Regular autoencoders have a problem dealing with variable length sequences, be-
cause they are designed to work with fixed length inputs. One way to handle this
problem is to use LSTM-based autoencoders that organize their architecture using
what is known as the Encoder-Decoder LSTM model. This type of autoencoder
is capable of supporting variable length input sequences. It also capitalizes on the
recurrent LSTM layer to learn the temporal ordering of input sequences and other
long-term dependencies.
Using Keras, an open-source machine learning library, the LSTM autoencoder can
be implemented as follows:

1. LSTM Layer.

2. RepeatVector, which repeats the input 𝑛 times, i.e. if the input was of shape
(𝑛𝑢𝑚_𝑡𝑟𝑎𝑖𝑛, 𝑠𝑒𝑞_𝑙𝑒𝑛𝑔𝑡ℎ) it changes it to (𝑛𝑢𝑚_𝑡𝑟𝑎𝑖𝑛, 𝑛, 𝑠𝑒𝑞_𝑙𝑒𝑛𝑔𝑡ℎ).

3. LSTM Layer (with 𝑟𝑒𝑡𝑢𝑟𝑛_𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠 = 𝑇𝑟𝑢𝑒).

4. TimeDistributed, which applies a dense layer with 𝑚 = 1 unit to every time step,
producing one output value per point in the sequence.

The LSTM Encoder-Decoder takes the time series sequence as its input.
The output is the recreated sequence, as reconstructed by the LSTM decoder layer.
In order to measure how accurately the model has reconstructed the input sequence,
we use the mean squared error, which is outlined in chapter 5.
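
To make the shape changes in the four steps above concrete, the sketch below mimics only the array shapes of the RepeatVector and TimeDistributed steps with NumPy. It is not a Keras model: the two LSTM layers are replaced by placeholder arrays, and the helper names and sizes are assumptions chosen for illustration.

```python
import numpy as np

def repeat_vector(encoded, n):
    """(batch, features) -> (batch, n, features), the shape change RepeatVector(n) makes."""
    return np.repeat(encoded[:, np.newaxis, :], n, axis=1)

def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer at every time step:
    (batch, steps, units) @ (units, m) -> (batch, steps, m)."""
    return sequence @ weights + bias

rng = np.random.default_rng(0)
batch, steps, units = 3, 5, 4

encoded = rng.normal(size=(batch, units))   # placeholder for the encoder LSTM output
repeated = repeat_vector(encoded, steps)    # one copy of the encoding per time step
decoded = repeated                          # placeholder for the decoder LSTM output
output = time_distributed_dense(decoded, rng.normal(size=(units, 1)), np.zeros(1))

print(encoded.shape, repeated.shape, output.shape)
```

The final array has one value per time step for each sequence, which is the shape needed to compare the reconstruction against the input sequence.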

4.3.4 Linear Autoregression

Linear regression is a statistical technique that has been used for a long time across
the sciences and humanities. Data scientists, physicists, economists, and
many others benefit from the power of the method. Linear regression studies the
relationship between a dependent variable and one or more independent variables, trying to
estimate the coefficients of the latter. Fundamentally, we have data for
both the dependent and independent variables. Often, linear regressors use a method
called ordinary least squares to determine the coefficients of the variables and the
intercept, if any. Linear regression written in matrix notation is

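$$y = X\beta + \epsilon$$

where

$$
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}.
$$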

The aim of linear regression is to estimate values for the 𝛽 coefficients, using data
points of the dependent variable, 𝑦, and the independent variables, 𝑥. There are
multiple methods of estimation for these coefficients, such as maximum likelihood
estimation, ridge regression, and generalized least squares. However, the most com-
monly used estimation method is ordinary least squares (OLS). In OLS, the objective
function we are trying to minimize is the sum of the squares of the differences between
the predicted values of 𝑦 obtained from the regressor and the observed dependent val-
ues.

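$$\hat{\beta} = \operatorname{argmin}_{\beta} \lVert y - X\beta \rVert^2$$

In other words, we seek the least-squares solution of $X\hat{\beta} \approx y$, which
satisfies the normal equations:

$$X^T X \hat{\beta} = X^T y$$

$$\hat{\beta} = (X^T X)^{-1} X^T y$$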

The matrix $X^T X$ is invertible if and only if the columns of $X$ are linearly independent.
That occurs when the independent variables are not perfectly multicollinear, meaning
that no independent variable can be written as an exact linear combination of the
others. In situations
where there is perfect multicollinearity, the matrix $X$ has less than full column rank, and
therefore the matrix $X^T X$ cannot be inverted, meaning that for a general linear
model, with 𝑦 = 𝑋𝛽 + 𝜖, there doesn’t exist a unique OLS estimator.
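
The closed-form estimator and the rank condition can be checked numerically. The following sketch uses NumPy on synthetic data; the data, coefficients, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + 0.01 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y rather than forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perfect multicollinearity: the third column is exactly twice the second.
X_collinear = np.column_stack([np.ones(n), x1, 2.0 * x1])
rank = np.linalg.matrix_rank(X_collinear)

print(beta_hat, rank)
```

The two estimates agree, while the collinear design matrix has rank 2 instead of 3, so the product of its transpose with itself is singular and the normal equations no longer have a unique solution.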

Autoregressive (AR) models are a type of linear regression in which the independent
variables are lagged values of the dependent or output variable.
Because the regressors are past values of the output itself, AR models do not
automatically satisfy the standard assumptions for least squares
regression. Under some assumptions, such as stationarity and independently and identically
distributed (iid) errors with zero mean and constant variance, AR models can be
accurately estimated by least squares. In the spacecraft time series context, we make
the assumption that the aforementioned conditions hold, even though in practice we
know that they may not, to test the effectiveness of linear regressors in detecting
anomalies. For example, we know for sure that the stationarity assumption does
not hold, due to unforeseen shocks that the spacecraft can experience or different
seasons/periods of time in which the environmental factors of the vehicle change.
Nevertheless, we make the assumption that these factors will not affect our data, and
rely on our dynamic error thresholding techniques to capture this non-stationarity.

As linear regression is such a widely-used tool, there were multiple toolkits that
could have been used in our implementation. We relied on the 𝐿𝑖𝑛𝑒𝑎𝑟𝑅𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛
class from sklearn.linear_model when running experiments with a linear regressor.
Sklearn’s linear regression can be adapted to an autoregression setup when the
input is pre-processed into time series sequences. This was already done
by the 𝑟𝑜𝑙𝑙𝑖𝑛𝑔_𝑤𝑖𝑛𝑑𝑜𝑤_𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠 method that was outlined above. The architecture
of the system remained very similar to that of the LSTM. In fact, the project aimed to use
modular components and functions that could be easily re-deployed to test different
models, so all the pre-processing and post-processing functions remained the same
for this model, and we only changed the statistical learning method in the middle.
Currently, the pipeline is set up to handle univariate time series data, but it can be
adapted to multivariate signals by using vector autoregressions (VARs).
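
The idea of turning a linear regressor into an autoregressive forecaster can be sketched as follows. To keep the example dependency-light, it uses NumPy's least-squares solver as a stand-in for the sklearn regressor named above; the helper names, the synthetic signal, and the window size of 10 are assumptions made for illustration.

```python
import numpy as np

def lag_matrix(series, window_size):
    """Rows are windows of past values; y holds the value following each window."""
    n = len(series) - window_size
    X = np.stack([series[i:i + window_size] for i in range(n)])
    y = series[window_size:]
    return X, y

def fit_ar(series, window_size):
    X, y = lag_matrix(series, window_size)
    A = np.column_stack([np.ones(len(X)), X])     # intercept plus lagged values
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares fit
    return coef

def predict_ar(coef, X):
    return coef[0] + X @ coef[1:]

# Synthetic periodic signal with a small amount of noise.
rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 25) + 0.01 * rng.normal(size=t.size)

train, test = series[:300], series[300:]
coef = fit_ar(train, window_size=10)
X_test, y_test = lag_matrix(test, window_size=10)
test_mse = float(np.mean((predict_ar(coef, X_test) - y_test) ** 2))
print(coef.shape, test_mse)
```

Because only the regressor in the middle changes, the same windows could be passed to any other model that maps a window to a one-step-ahead prediction.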

In a lot of settings, linear regression performs just as well as advanced machine
learning techniques. The ability of linear regression to do well makes it hard for
data scientists and researchers to justify using advanced machine learning and deep
learning techniques. In industry, where decisions are driven more by strategy and costs, a lot
of experts cannot provide justification for the application of state-of-the-art models.
This results in the adoption of linear regression in the technology stacks of companies.
As such, we use linear regression in our analysis to answer the question of whether
using more sophisticated models provides a tangible difference in the performance of
the system.

4.4 Post-Processing: Anomaly Detection

In our work, we implement the anomaly detection method that was used in the
NASA paper. In this section, we outline the theory behind the anomaly detection
method. The input to the post-processing module is a list of forecast values and their
corresponding realized values. The first step is to compute the prediction error
at each time step 𝑡, as given by:

$$e_t = |y_t - \hat{y}_t|$$

As such, a vector of errors is generated for the data points in the test set:

$$\mathbf{e} = [e_{t-h}, \ldots, e_{t-1}, e_t]$$

where h is the length of the window sequences (the number of sequences is the
length of data points minus the window size). After computing the errors, they are
passed through a smoothing algorithm such as the exponentially-weighted moving
average (EWMA). Thus, we now have the smoothed errors vector $\mathbf{e}_s$.
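
The error computation and smoothing can be sketched in a few lines. The recursive form of the EWMA and the smoothing factor used below are assumptions, since the text does not specify them; the function names are chosen for this example.

```python
def prediction_errors(y_true, y_pred):
    """Absolute error between each realized value and its forecast."""
    return [abs(t - p) for t, p in zip(y_true, y_pred)]

def ewma(errors, alpha=0.3):
    """Exponentially-weighted moving average.

    Assumed convention: alpha is the weight given to the newest error,
    and the first smoothed value equals the first error.
    """
    smoothed, prev = [], None
    for e in errors:
        prev = e if prev is None else alpha * e + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

errors = prediction_errors([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 3.0, 1.0])
smoothed = ewma(errors, alpha=0.5)
print(errors, smoothed)  # [0.0, 0.0, 2.0, 0.0] [0.0, 0.0, 1.0, 0.5]
```

Smoothing spreads a single large error over the following steps and dampens its peak, which reduces the influence of isolated spikes in the raw errors.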
To determine whether the computed errors constitute anomalies, we use an adap-
tation of dynamic error thresholding technique mentioned in the NASA paper. The
main difference, as we outline below, is that we do not implement the pruning method.

The unsupervised method aims to compute a threshold, 𝜖, chosen from candidates
of the form

$$\epsilon = \mu(\mathbf{e}_s) + z\,\sigma(\mathbf{e}_s)$$

where the selected 𝜖 is the candidate that maximizes the following objective:

$$
\epsilon = \operatorname{argmax}(\epsilon) =
\frac{\dfrac{\Delta\mu(\mathbf{e}_s)}{\mu(\mathbf{e}_s)} + \dfrac{\Delta\sigma(\mathbf{e}_s)}{\sigma(\mathbf{e}_s)}}
{|\mathbf{e}_a| + |\mathbf{E}_{seq}|^2}
\tag{4.1}
$$

where

$$\Delta\mu(\mathbf{e}_s) = \mu(\mathbf{e}_s) - \mu(\{e_s \in \mathbf{e}_s \mid e_s < \epsilon\})$$

$$\Delta\sigma(\mathbf{e}_s) = \sigma(\mathbf{e}_s) - \sigma(\{e_s \in \mathbf{e}_s \mid e_s < \epsilon\})$$

$$\mathbf{e}_a = \{e_s \in \mathbf{e}_s \mid e_s > \epsilon\}$$

$$\mathbf{E}_{seq} = \text{contiguous sequences of anomalies in } \mathbf{e}_a$$

The value of z lies in a range between 0 and 10, and the error thresholding mechanism
tries out different values to see which one maximizes the objective in
equation 4.1 above. The intuition behind the dynamic error thresholding technique is
that we wish to identify anomalies that, if removed, would result in the largest percent
decrease in the mean and standard deviation of the vector of smoothed errors. The
thresholding technique also computes a score for each anomaly, in order to be able to
compare the severity of the outliers. The formula for the severity of the anomaly
is given by:

$$s^{(i)} = \frac{\max(\mathbf{e}_{seq}^{(i)}) - \operatorname{argmax}(\epsilon)}{\mu(\mathbf{e}_s) + \sigma(\mathbf{e}_s)}$$
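
The threshold search can be sketched as a loop over candidate values of z that scores each candidate with the objective in equation 4.1. This is a minimal reading of the formulas above, not the reference implementation: the function names, the use of the population standard deviation, the grid of z values, and the handling of candidates that flag nothing are all assumptions.

```python
from statistics import mean, pstdev

def count_sequences(flags):
    """Number of contiguous runs of flagged points."""
    return sum(1 for i, f in enumerate(flags) if f and (i == 0 or not flags[i - 1]))

def find_threshold(smoothed, z_values):
    """Return the candidate threshold with the highest score under equation 4.1."""
    mu, sigma = mean(smoothed), pstdev(smoothed)
    best_eps, best_score = None, float("-inf")
    for z in z_values:
        eps = mu + z * sigma
        below = [e for e in smoothed if e < eps]
        flags = [e > eps for e in smoothed]
        n_anomalous = sum(flags)
        # Skip candidates that flag nothing or leave the score undefined.
        if n_anomalous == 0 or not below or mu == 0 or sigma == 0:
            continue
        delta_mu = mu - mean(below)
        delta_sigma = sigma - pstdev(below)
        score = (delta_mu / mu + delta_sigma / sigma) / (
            n_anomalous + count_sequences(flags) ** 2
        )
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps

def severity(sequence_errors, eps, smoothed):
    """Severity score of one anomalous sequence."""
    return (max(sequence_errors) - eps) / (mean(smoothed) + pstdev(smoothed))

smoothed = [0.1, 0.12, 0.09, 0.11, 0.1, 0.1, 2.0, 2.2, 0.1, 0.09, 0.11, 0.1]
eps = find_threshold(smoothed, [0.5 * k for k in range(1, 21)])
flagged = [i for i, e in enumerate(smoothed) if e > eps]
print(flagged, severity(smoothed[6:8], eps, smoothed))
```

In this toy example the two large smoothed errors at positions 6 and 7 form a single anomalous sequence, and removing them produces a large relative drop in both the mean and the standard deviation.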
Figure 4-7: Signal P-11.

To mitigate false positives, the NASA paper introduces a pruning measure that
is aimed at reducing the sensitivity towards outliers that aren’t far enough from
normal values to be anomalies. To do this, a list is created that contains the maximum
errors in all anomaly sequences, sorted in descending order. At the end of the list, the
largest error value that hasn’t been flagged as anomalous is also added. The list
is then stepped through and the percent decrease at each step is calculated. If
the percent decrease exceeds a certain threshold 𝑝, which they found to be good in
practice at 𝑝 = 0.13, then the error sequence remains flagged as an anomaly. In our work,
we do not use this pruning procedure, but still maintain good performance across
different models. Using this pruning method would likely improve the precision
and therefore the 𝐹0.5 score, but we leave this as an additional step for further work.
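
Although pruning is not part of our pipeline, the procedure described above can be sketched as follows. This is one possible reading of the description: the function name is hypothetical, errors are assumed to be positive, and the rule that every sequence ranked above the last sufficiently large drop stays anomalous is an interpretation rather than a verified reproduction of the NASA implementation.

```python
def prune(sequence_maxima, max_normal_error, p=0.13):
    """Return the sequence maxima that stay flagged as anomalous after pruning.

    sequence_maxima: maximum smoothed error of each anomalous sequence (positive).
    max_normal_error: largest smoothed error that was not flagged as anomalous.
    """
    ranked = sorted(sequence_maxima, reverse=True) + [max_normal_error]
    keep = 0
    for i in range(1, len(ranked)):
        decrease = (ranked[i - 1] - ranked[i]) / ranked[i - 1]
        if decrease > p:
            keep = i  # every sequence ranked above position i stays anomalous
    return ranked[:keep]

kept = prune([0.9, 2.2, 0.85], max_normal_error=0.8)
print(kept)  # [2.2]
```

In this example only the sequence with maximum error 2.2 is separated from the rest by a drop larger than 13 percent, so the two sequences that sit just above the normal errors would be reclassified as normal.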

4.5 Sample Pipeline

As shown in the previous sections of this chapter, our pipeline consists of three main
components: data pre-processing, model, and post-processing. To better outline the
way the different components work together, we will step through each module and
show how an input signal changes in each section. We choose a sample signal, called
P-11, to exhibit the way a pipeline would run on one signal.
As can be seen in Figure 4-7, signal P-11 is continuous, and follows a somewhat
periodic trend. The spike in the middle of the signal is actually an anomaly, as it has
been marked as anomalous in the 𝑙𝑎𝑏𝑒𝑙𝑒𝑑_𝑎𝑛𝑜𝑚𝑎𝑙𝑖𝑒𝑠 file. In this section, we try to
see if our model can correctly detect that this point is an anomaly.

The first thing that the pipeline does is aggregate the data according to a specific
interval. This is done using the 𝑡𝑖𝑚𝑒_𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠_𝑎𝑣𝑒𝑟𝑎𝑔𝑒 method, outlined in section
4.2.1. Since the NASA paper did not aggregate their data, we set the interval equal
to 1, meaning that we average each data point with itself, resulting in no change for
the input signal. This method is nevertheless very important to have if the data is
sampled at a high frequency, which is very common in time series applications.

The next step in the pre-processing section is the creation of sequences from the
input signal. To this end, we used the 𝑟𝑜𝑙𝑙𝑖𝑛𝑔_𝑤𝑖𝑛𝑑𝑜𝑤_𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠 method described
in section 4.2.2 which takes in the time series data and the window size and target
size, corresponding to the length of the sequence and number of values ahead to
predict respectively. In our experiments, we varied the window size, using lengths of
100 and 250; however, we kept the target size to one, meaning that we are only trying
to predict the next time step given the input sequence.

Now that the signal has been formatted, the next step is to pass it through the
model. We have outlined four different models in section 4.3 above, which are LSTM,
MLP, LSTM Autoencoder, and AR. These models all have different parameters that
can be tuned, but they operate on the same inputs/outputs, with the exception of
the autoencoder, which uses the input as its output when it’s encoding and then
decoding the sequence. For more details about the mechanism of these models, we
refer the reader to the corresponding subsections in 4.3 that discuss in more detail
the theory behind each model. The model uses the training part of the data to learn
the weights of the time series regressor that will be used to predict the testing data.
When we pass in the testing data, a vector of forecast values is generated as the
output. This output is carried forward to the next step of the pipeline, which is the
post-processing. As we can see in Figure 4-8, the model predicts the signal in orange,
but there is a discrepancy between the actual signal in blue and the forecast. Figure
4-9 contains a zoomed-in version of the forecast. We can see that the model doesn’t
exactly replicate the signal, but it follows it closely. These qualitative descriptions
are difficult to measure, so we resort to metrics such as mean squared error that we
discuss in section 5.1.

Figure 4-8: Signal P-11 and the forecast from NASA LSTM Model.

In the post-processing phase, the generated forecast, $\hat{y}$, is compared with
the realized signal $y_{true}$ to see whether the true signal was in line with the model’s
expectations or not. If the difference between the two is very large,
then the signal has diverged from the assumptions that the model has
learned and was therefore operating under. Thus, we flag the error as an anomaly,
according to the dynamic error thresholding explained in section 4.4. Figure 4-10
shows the region that is flagged by our algorithm as anomalous. The region is the
one that is just preceding the huge spike that we referenced earlier. As such, the
model was able to accurately predict the anomaly by flagging the region right before
it happened. What likely happened in this case is that the large anomaly was preceded
by some minor abnormalities, which the model detected and flagged.

Figure 4-9: Close-up on signal P-11 and the forecast from NASA LSTM Model.

Figure 4-10: Signal P-11 and Forecast from NASA LSTM Model. Flagged Anomaly
in Shaded Region.

Chapter 5

Analysis

5.1 Mean Squared Error

The NASA spacecraft telemetry dataset has 82 signals, and for each architecture a
model was trained for each of the signals. We primarily dealt with univariate signals,
but the models can easily be adapted to work with multivariate signals, by changing
the size of the input vector and making it a matrix. Working with multivariate
data can reveal additional insights and improve the accuracy of predictions, because
oftentimes signals are correlated, especially when telemetry signals are coming from
the same vehicle. However, using multivariate signals limits the interpretability of
anomalies. For instance, if our models worked with tens of signals at a time and an
anomaly were detected, it would be hard to determine which signals
caused the anomaly to arise. This is why in our analysis we focused on evaluating
univariate time signal models. A sample training script for one of the architectures
can be found in Appendix B. The code shows the procedure to train 82 different
models given a certain architecture.
During training of the models, the mean squared error was used as a metric
for each of the signals. Mean squared error (MSE) is the average of
the squares of the errors between the predicted values and the real ones. The best
possible value is 0, and it is achieved when every predicted value is identical to the
real one. The larger the MSE, the worse the predictions are, and there is no theoretical
bound on how large MSE can get. The formula for MSE is given by

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

The MSE was used to assess the models in two ways. First, while training, the
MSE was used as a measure of how well a certain architecture was able to learn
the data and start making accurate predictions on the training sample. When the
validation error exceeded the training error, we took it as a sign that the model didn’t
generalize well, and that we had overfit the data. Alternatively, when the validation
error was smaller than the training error, then the model was able to generalize, since
it performed better on previously unseen data. Second, we used the training MSEs,
averaged across all 82 signals, to determine which architectures performed better. A
summary of the MSEs can be found in Table 5.1.

We ran our experiments on all 82 signals, training the model once on each signal.
The reason that we don’t train one model for all the signals is that they exhibit
different features that the model seeks to learn. Therefore, a model that is trained on
the entire dataset won’t generalize well in our case, because fitting signal A-1, for
instance, would have no added benefit when trying to predict signal D-5. Thus, when
we compute the MSE and anomalies in this chapter, we do so by generating a script
that goes through each signal independently, trains a model, forecasts the testing data,
and then computes anomalies as predicted by the anomaly detection method. The
MSE of a model is computed by averaging the MSE across all the signals, and that’s
why we have an associated standard deviation in Table 5.1. Moreover, the anomalies
were aggregated across the signals and for each model the 𝐹0.5 score was computed.
The 𝐹0.5 results are discussed in section 5.2.

Additionally, the MSE was evaluated by comparing it to a randomly shuffled
MSE. The intuition behind this baseline is that we want to capture how well our
model forecasted the signals, as opposed to a random shuffle of the input signal. We
set up a script that shuffled each signal and computed the MSE between the shuffled
signal and the original one. The MSEs of the 82 signals were averaged to give an
average MSE of 5.4 × 10^5, which is far larger than the MSE that we achieve
with our models, but it is nevertheless a baseline. Another baseline that we
computed, across all signals, is the MSE of the signal shifted to the right by one.
That is, the MSE measures the difference between $x_t$ and $x_{t+1}$ at each point.
This MSE is a good measure if the signals exhibit low variance across time steps.
However, we found that this MSE measure was larger than the shuffled MSE, with a
value of 2.1 × 10^6 (also larger than the MSE of our models).

Table 5.1: MSE Summary across 82 signals. Autoregression doesn’t have a training
MSE as the neural network models do, since the model is fitted differently.

Model Name                        MSE Value (std deviation)

Shuffled MSE (baseline)           5.4 × 10^5 (1 × 10^5)
Shifted MSE (baseline)            2.1 × 10^6 (2 × 10^6)
NASA LSTM                         0.03 (0.05)
NASA LSTM hidden_units=100        0.03 (0.05)
NASA LSTM 3-stacked               0.03 (0.05)
MLP x_len=250                     0.095 (0.2)
MLP 6-dense layers                0.085 (0.2)
MLP x_len=100                     0.053 (0.1)
LSTM EncDec x_len=100             1.6 × 10^5 (1 × 10^6)
LSTM EncDec x_len=250             0.015 (0.05)
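The two baselines can be sketched for a single signal as shown below. The function
names, the fixed random seed, and the toy signal are assumptions for illustration; the
original script is not reproduced in this chapter.

```python
import random


def mean_squared_error(actual, predicted):
    """Mean of squared differences between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


def shuffled_baseline_mse(signal, seed=0):
    """MSE between a signal and a randomly shuffled copy of itself."""
    rng = random.Random(seed)  # seeded so that the baseline is reproducible
    shuffled = list(signal)
    rng.shuffle(shuffled)
    return mean_squared_error(signal, shuffled)


def shifted_baseline_mse(signal):
    """MSE between x_t and x_{t+1}, i.e. the signal compared with itself shifted by one."""
    return mean_squared_error(signal[:-1], signal[1:])


# Toy ramp: every consecutive pair differs by exactly 1.
signal = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted = shifted_baseline_mse(signal)
shuffled = shuffled_baseline_mse(signal)
```

In the experiments, each baseline would be computed per signal and then averaged over
the 82 signals, in the same way as the model MSEs.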

5.2 F-0.5 Score

As we can see from the mean squared error values in Table 5.1, MSE is not good at
separating which models perform better on learning the data, because the numbers are
only marginally different. Moreover, the most important metric we are looking at is
the ability to detect anomalies, which is not determined by the MSE. As such, we work
with another metric that is often used to assess the quality of classifications. For
the system we are building, we would like to detect as many anomalies as possible,
while reducing the number of false alarms that we get. In statistical terms, a type I
error occurs when a normal point is incorrectly labeled as anomalous; this is also
known as a false positive. The metric that corresponds to type I errors is precision,
which is the fraction of flagged anomalies that are true positives (true positives
divided by true positives + false positives). A type II error, on the other hand,
occurs when a point that is anomalous in nature is not detected. Such an error is
also called a false negative, and the recall metric captures the fraction of all the
anomalies (true positives + false negatives) that were flagged as true positives.

Table 5.2: $F_{0.5}$ Scores of the Models.

Model Name                          F0.5 Score

NASA LSTM (paper)                   0.69
NASA LSTM (our implementation)      0.61
NASA LSTM hidden_units=100          0.34
NASA LSTM hidden_units=40           0.76
NASA LSTM 3-stacked                 0.69
MLP x_len=250                       0.42
MLP 6-dense layers                  0.59
MLP x_len=100                       0.23
AR x_len=100                        0.25
AR x_len=250                        0.57
LSTM EncDec x_len=100               0.34
LSTM EncDec x_len=250               0.66

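For reference, the standard F-beta score combines precision P and recall R as
(1 + beta^2) * P * R / (beta^2 * P + R); with beta = 0.5, precision is weighted more
heavily than recall. The sketch below assumes this standard definition, and the function
name `f_beta` is introduced only for this example. The inputs are the precision and
recall reported in Table 5.3 for the NASA LSTM with 40 hidden units.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing was detected correctly
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall of the NASA LSTM with hidden_units=40 (Table 5.3).
score = f_beta(0.78, 0.67)
rounded_score = round(score, 2)
```

Rounded to two decimals, this reproduces the 0.76 reported for that model in Table 5.2.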
We created a script that looks at the anomalies generated by the model and checks
whether they are present in the 𝑙𝑎𝑏𝑒𝑙𝑒𝑑_𝑎𝑛𝑜𝑚𝑎𝑙𝑖𝑒𝑠 file. If an anomaly sequence in
our prediction overlaps a sequence in the 𝑙𝑎𝑏𝑒𝑙𝑒𝑑_𝑎𝑛𝑜𝑚𝑎𝑙𝑖𝑒𝑠 file, then
the anomaly is considered to be a true positive. If an anomaly was predicted by our
model but does not exist in the anomalies file, then it is counted as a false positive.
Finally, if an anomaly exists in the 𝑙𝑎𝑏𝑒𝑙𝑒𝑑_𝑎𝑛𝑜𝑚𝑎𝑙𝑖𝑒𝑠 file but is not predicted
by our model, then we count it as a false negative. With these counts, we can now
compute the $F_{0.5}$ score, which represents the anomaly detection power of
our system.
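The counting rules above can be sketched as follows. This is an illustration, not the
original script: it assumes that both the predicted and the labeled anomalies are given
as inclusive (start, end) index pairs, that true and false positives are counted per
predicted sequence, and that false negatives are counted per labeled sequence. The names
`overlaps` and `count_outcomes` and the example intervals are hypothetical.

```python
def overlaps(a, b):
    """True if the inclusive index ranges a = (start, end) and b = (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_outcomes(predicted, labeled):
    """Count true positives, false positives, and false negatives by interval overlap."""
    # A predicted sequence is a true positive if it overlaps any labeled sequence.
    tp = sum(1 for p in predicted if any(overlaps(p, l) for l in labeled))
    # Every other predicted sequence is a false alarm.
    fp = len(predicted) - tp
    # A labeled sequence that no prediction overlaps was missed.
    fn = sum(1 for l in labeled if not any(overlaps(p, l) for p in predicted))
    return tp, fp, fn


# Invented intervals for illustration.
predicted = [(100, 120), (300, 310), (500, 520)]
labeled = [(110, 150), (700, 720)]
tp, fp, fn = count_outcomes(predicted, labeled)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Here only the first predicted sequence overlaps a labeled one, so the example yields one
true positive, two false positives, and one false negative.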
In the NASA paper, the model without parametric pruning (𝑝 = 0, as in our
system) achieved 0.88 (precision), 0.67 (recall), and 0.69 ($F_{0.5}$). We can
see from Table 5.2 that only our LSTM model with 40 hidden units outperformed the
NASA baseline, while the 3-stacked LSTM and the LSTM autoencoder with 𝑥_𝑙𝑒𝑛 = 250
performed close to the reference paper. We will now explore each model individually
to discuss what changes worked well for each architecture.

Table 5.3: Precision & Recall Scores for the Models.

Model Name                        Precision    Recall

NASA LSTM (paper)                 0.88         0.67
NASA LSTM                         0.61         0.62
NASA LSTM hidden_units=100        0.35         0.32
NASA LSTM hidden_units=40         0.78         0.67
NASA LSTM 3-stacked               0.69         0.70
MLP x_len=250                     0.42         0.43
MLP 6-dense layers                0.59         0.61
MLP x_len=100                     0.23         0.22
AR x_len=100                      0.25         0.28
AR x_len=250                      0.58         0.53
LSTM EncDec x_len=100             0.38         0.24
LSTM EncDec x_len=250             0.83         0.35

For the NASA LSTM model, we obtained similar performance to the paper when
we implemented the model as described there ($F_{0.5}$ of 0.61 versus the reported
0.69). Varying the number of hidden units worked very well when we decreased the
number from 80 to 40, but performed poorly when we increased the number to 100. In
fact, the model with 40 hidden units was the most successful one we tried, out of
all the different architectures, yielding an $F_{0.5}$ score of 0.76. Another good
improvement over the NASA architecture was adding a third LSTM layer to the
two-stacked model that they implemented, which yielded an $F_{0.5}$ score of 0.69,
our second-best model in practice. The LSTM model, however, took the most time to
train out of all the architectures that we tried. We also kept the length of the
sequences constant, with 𝑥_𝑙𝑒𝑛 = 250 for all the LSTM experiments that we ran.

The MLP model trained much faster than the LSTM model, and was the second
fastest after the linear autoregression model. Using an MLP as described in Section
4.3.1 with 𝑥_𝑙𝑒𝑛 = 100 yielded the lowest score, 0.23. This score improved
to 0.42 when we set 𝑥_𝑙𝑒𝑛 = 250. Finally, when we increased the number of dense
layers from 3 to 6, the $F_{0.5}$ score went up to 0.59. The MLP does not give us the
best results, but given that it trains faster than the LSTM, it provides a more
efficient alternative.

In many machine learning experiments, linear AR models serve as a reference for
the performance of deep neural networks, since linear autoregressions are among the
most efficient and commonly used tools in data science. In our experiments, the AR
model performed reasonably well when 𝑥_𝑙𝑒𝑛 = 250, giving an $F_{0.5}$ score of 0.57.
The model did not do as well when 𝑥_𝑙𝑒𝑛 = 100 (0.25), but that was the trend across
the other models as well, such as MLP and LSTM EncDec. Given that AR trains
much faster than all the other models, it is good for use cases where computing power
is limited, but overall it is not a good architecture if we are looking for
top-of-the-line results.

Finally, the LSTM Encoder-Decoder model performed generally well, achieving
an $F_{0.5}$ score of 0.66 when 𝑥_𝑙𝑒𝑛 = 250. The most notable thing about this model
can be found in Table 5.3, which outlines the precision and recall scores of the
models. We can see in the last row that the precision score of the LSTM EncDec model
was the highest among all the models we implemented, with a score of 0.83. This means
that most of the anomalies flagged by the LSTM EncDec were true anomalies.
Interestingly, the model’s recall score was on the lower end of the spectrum (0.35),
indicating that it missed many of the labeled anomalies.

5.3 Further Work

There are several next steps that would make sense for this project. We identify three
main tasks that can be done as follow-ups to the work that we presented.

First, people interested in conducting further research can try out different
parameters for our models and different anomaly detection techniques altogether. A
more thorough analysis can be carried out by users who choose a few signals of
interest and run autotuning tools on them to figure out what the best
hyperparameters are. We could not do this for all 82 signals and 4 models, as it
would be too complex to do for all the signals simultaneously. Another area that can
be explored is trying a different anomaly detection technique, such as x-sigma
thresholding.
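As a rough sketch of what x-sigma thresholding could look like, the example below flags
every error that lies more than x standard deviations above the mean error. The function
name, the choice of a one-sided threshold, the use of the population standard deviation,
and the toy error values are all assumptions for illustration; this technique was not
implemented in this work.

```python
from statistics import mean, pstdev


def x_sigma_anomalies(errors, x=3.0):
    """Return the indices of errors that exceed mean + x * standard deviation."""
    mu = mean(errors)
    sigma = pstdev(errors)
    threshold = mu + x * sigma
    return [i for i, e in enumerate(errors) if e > threshold]


# Invented prediction errors: twenty small values followed by one large spike.
errors = [0.1] * 20 + [5.0]
flagged = x_sigma_anomalies(errors)
```

With x = 3, only the final spike is flagged in this toy example. Unlike a dynamic
threshold, the value of x has to be chosen by the user.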

Second, for those who would like to apply this model in industry, it would be very
important to be able to run this system in real time and detect anomalies
as they happen. In order to do this, engineers need to keep in mind different data
normalization techniques, in order to keep the input in the range of [-1, 1]. One
effective way of doing this is to clip any data that falls outside the minimum and
maximum possible values that we know ahead of time the sensors cannot and should not
go beyond.
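A minimal sketch of this clipping step, followed by linear scaling into [-1, 1], is shown
below. The function name `clip_and_scale`, the linear scaling, and the sensor limits of
0 and 100 are assumptions chosen for the example.

```python
def clip_and_scale(values, lower, upper):
    """Clip readings to the known sensor limits, then scale them linearly into [-1, 1]."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    scaled = []
    for v in values:
        clipped = min(max(v, lower), upper)  # force the reading inside [lower, upper]
        scaled.append(2.0 * (clipped - lower) / (upper - lower) - 1.0)
    return scaled


# Invented readings; -5.0 and 130.0 fall outside the assumed sensor limits.
readings = [-5.0, 0.0, 50.0, 100.0, 130.0]
normalized = clip_and_scale(readings, lower=0.0, upper=100.0)
```

Because the limits are fixed ahead of time, the same transformation can be applied to
each new reading as it arrives, without recomputing statistics over the whole signal.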
Third, we encourage researchers to make use of our modular system to try out
different forecasting architectures. In our work, we explored linear AR, MLP, stacked
LSTMs, and LSTM autoencoders, but there are many other models that can be
used. For instance, there are bidirectional LSTMs, variations of linear regression,
variational autoencoders, convolutional neural networks, recurrent neural networks,
memory-based models, sequence-to-sequence models, and gated recurrent units. It would
also be useful and interesting to explore how some of these models can be adapted to
work with multivariate signals. The current architecture should work with multivariate
signals, with only a few minor changes.

5.4 Conclusion

In this work, we analyzed 82 signals coming from NASA spacecraft in order to detect
anomalies. We built a modular infrastructure that lets us rapidly prototype
different models, and we compared different architectures: modified
versions of the NASA LSTM model, an autoregressive (AR) model, an LSTM autoencoder,
and a multilayer perceptron (MLP) model. We also varied the parameters of these
models to improve the ability to detect anomalies, as measured by the $F_{0.5}$ score.

We found that increasing the length of the input sequence 𝑥_𝑙𝑒𝑛 improved the
performance of the models. We improved on the NASA paper’s $F_{0.5}$ score of 0.69
when we decreased the number of hidden units to 40, achieving a score of 0.76. In
contrast, when we increased the number of hidden units from 80, as in the paper, to
100, the score dropped to 0.34. We also obtained a score of 0.69, equal to the
paper’s, when we added a third LSTM layer to the stacked LSTM model.

The AR model with a sequence length of 250 performed about as well as the best MLP
model ($F_{0.5}$ of 0.57 versus 0.59), but both models scored lower than the NASA
paper baseline. The LSTM Encoder-Decoder did not obtain the best $F_{0.5}$ score
(0.66), but it yielded the highest precision among our models, 0.83. Overall, the
best performance came from small modifications to the NASA LSTM model.

In terms of next steps, we would recommend exploring alternative error
thresholding techniques such as x-sigma. We would also try out different machine
learning models such as variational autoencoders, bidirectional LSTMs, seq2seq, and
gated recurrent units, among other models. Exploring different architectures would
allow us to converge to systems that best model the data and work together with
humans to help detect anomalies as soon as they happen.

Bibliography

[1] Samit Bhanja and Abhishek Das. Impact of data normalization on deep neural
network for time series forecasting. CoRR, abs/1812.05519, 2018.

[2] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and
Tom Soderstrom. Detecting spacecraft anomalies using LSTMs and nonparametric
dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, KDD ’18, pages 387–395,
New York, NY, USA, 2018. ACM.

[3] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet
Agarwal, and Gautam Shroff. LSTM-based encoder-decoder for multi-sensor
anomaly detection. CoRR, abs/1607.00148, 2016.

[4] Friedrich Pukelsheim. The three sigma rule. The American Statistician,
48(2):88–91, 1994.

[5] Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR,
abs/1609.04747, 2016.

[6] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15:1929–1958, 2014.
