Spacecraft Time-Series Online Anomaly Detection Using Deep Learning


Sriram Baireddy∗, Sundip R. Desai†, Richard H. Foster†, Moses W. Chan†, Mary L. Comer∗, Edward J. Delp∗
∗Video and Image Processing Lab (VIPER), School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, USA
†Advanced Technology Center, Lockheed Martin Corporation, Sunnyvale, California, USA

Abstract—Anomaly detection in spacecraft telemetry channels is of great importance, especially considering the extreme environment in which spacecraft operate. These anomalies often function as precursors for system failure. Currently, domain experts manually monitor telemetry channels, which is time-consuming and limited in scope. An automated approach to anomaly detection would be ideal, considering that each satellite system has thousands of channels to monitor. Deep learning models have been shown to be effective at capturing the normal behavior of the channels and flagging any abnormalities. However, each channel needs a unique model trained on it, and high-performing models have been shown to require an increased training time. We instead propose training deep learning models in an online manner to quickly understand the behavior of a given channel and identify anomalies in real-time. This greatly reduces the amount of training time required to obtain a model for each channel. We present the results of our approach to show that we can achieve performance comparable to state-of-the-art spacecraft anomaly detection methods with minimal training time.

2023 IEEE Aerospace Conference | 978-1-6654-9032-0/23/$31.00 ©2023 IEEE | DOI: 10.1109/AERO55745.2023.10115783

TABLE OF CONTENTS
1. INTRODUCTION
2. RELATED WORK
3. PROPOSED APPROACH
4. EXPERIMENTAL RESULTS
5. CONCLUSIONS
ACKNOWLEDGMENTS
REFERENCES
BIOGRAPHY

1. INTRODUCTION

A growing number of satellites are in orbit around Earth, collecting huge amounts of data and making many modern conveniences, like high-speed communications, possible. Given their importance to modern society and high degree of complexity, satellites must be constantly monitored, especially as they operate in space, the harshest environment known to man. Fortunately, precursors for system failure may show themselves as abnormalities in internal system data that is usually collected by the satellite. A few examples of satellite telemetry anomalies are shown in Figure 1. Because of how complicated these systems are, there are often thousands of time-series channels to monitor so that anomalous behavior can be fully cataloged. Currently, domain experts manually watch these channels to flag anomalies, which limits the extent to which the systems can be monitored given the scarcity of experts and the time required [1].

Figure 1: Some examples of time-series anomalies identified by experts (highlighted in various colors) in two spacecraft telemetry channels.

With the recent increased emphasis on machine learning, the effectiveness of an automated approach to time-series anomaly detection has been shown in [2–4]. The scarcity of spacecraft time-series domain experts still remains a factor, as the effectiveness of these techniques is usually dependent on the amount of labeled data available for training. Since anomalous data cannot be labeled easily, one solution is to adopt a semi-supervised approach. An efficient semi-supervised anomaly detection approach is to learn the normal and expected behavior of a telemetry channel, so any deviations from this behavior can be flagged in post-processing [5]. This can be done effectively by utilizing a neural network as a predictor and a mathematical model of expected prediction errors [1, 6]. However, high-performing models for spacecraft anomaly detection also require increased training times [7], which may hamstring this strategy for any realistic application. There can be thousands of channels to monitor in each satellite system, and each channel will need a unique network to be trained from scratch.

Thus, we investigate the effectiveness of using an online
learning scenario for developing deep learning models for spacecraft time-series anomaly detection. Online learning means that the model sees a data point from the time-series only once, as the data is received sequentially. This dramatically reduces the processing time required by the model, and to maintain computational speed, the models are also spatially less complex.

To summarize, in this paper, we describe several online time-series predictor models consisting of simple recurrent neural networks. While the model receives the time-series data in an online manner, i.e., sequentially, we identify anomalies by modeling the prediction error in real-time and thresholding the log-likelihood of the prediction error. We show that it is possible to achieve spacecraft time-series anomaly detection performance comparable to previous approaches while requiring minimal training time.

2. RELATED WORK

Time-Series Prediction

Using historical trends to predict future trends is a task of great importance in many fields such as engineering, weather forecasting, and finance. Approaches for time-series prediction have been developed for decades, with many involving statistical modeling like autoregressive integrated moving average (ARIMA). However, most of these statistical approaches make assumptions about the data and are not as effective on time-series data with more irregularity and nonlinearity [8]. As a result, in the past few years, machine learning techniques have begun to take precedence in this field. A recurrent neural network (RNN) is a deep learning system first introduced in 1986 [9]. It enables time-series modeling by introducing connections between hidden layers, allowing the system to retain information about past inputs. The original system suffers from the vanishing gradient problem, making it unable to learn from long sequences [10]. Two modified RNN architectures have since risen in popularity: (1) Long Short-Term Memory (LSTM) models, proposed in [11]; and (2) Gated Recurrent Units (GRUs), proposed in [12]. Both approaches are able to learn on longer time-series sequences by utilizing gates to filter what information is retained. Many modern sequence-based processes utilize one of these neural networks, and they are further discussed in Section 3.

In an attempt to model the neocortex of the human brain, hierarchical temporal memory (HTM) was introduced by [13]. It learns trends in the data by using spatial pooling and temporal memory after transforming the data into sparse representations. As a result of its design, HTM works well in the online learning scenario, meaning it requires minimal training data. However, it was found that explicit temporal information was integral to its performance [14], which is not ideal for robust time-series predictors. The work in [15] discusses other state-of-the-art online time-series predictors. They can be categorized as support vector regression (SVR)-based approaches and extreme learning machine (ELM)-based approaches. SVR is a machine learning technique that essentially learns the function that best fits the data points provided. ELMs [16] are simple neural networks known for their fast learning capability [15], achieved by removing the need for backpropagation and relying on random initialization. It has been shown that leveraging these models in an ensemble leads to very promising prediction results [15, 17].

Anomaly Detection

The approach for time-series anomaly detection changes depending on the amount of labeled data available. Sufficient ground-truth information about anomalies is rarely available; when it is, supervised approaches can treat detection as a classification problem. When using supervised approaches, techniques to address the scarcity of labeled data often need to be employed. For example, recent work in supervised anomaly detection has involved transfer learning. It was shown that modified U-Nets can be used for supervised anomaly detection after pre-training on synthetically generated normal and abnormal time-series data [18].

The lack of labeled anomaly data in spacecraft telemetry channels makes a semi-supervised approach more promising than a supervised one. The motivation behind a semi-supervised approach is the time-series prediction equivalent of a "one-class" approach [5, 19]. Since the anomalies are unknown and varying by nature, it is more prudent to focus on learning the normal behavior of the time-series data. Then, during inference, any large prediction errors can be marked as abnormal, as the data must have deviated from the expected value [20].

The simplest form of semi-supervised anomaly detection has been out-of-limit (OOL) approaches. OOL approaches use pre-defined thresholds and the actual values of the signal data samples to determine anomalous behavior. Improving on OOL approaches, the work in [5] investigated more advanced detection techniques based on nearest-neighbors and clustering. These techniques have disadvantages relating to parameter specification, interpretability, or computational expense [1]. Recently, deep learning approaches using RNNs, such as LSTMs or GRUs, have proven to be very effective at detecting anomalies in time-series data. For example, the results in [6] showed the utility of LSTMs for detecting abnormal behavior in time-series signals like ECG data. The work in [3] enabled anomaly detection in multi-channel data by using an LSTM-based encoder-decoder structure. This application of RNNs has extended to spacecraft data as well. Hundman et al. [1] showed that LSTMs were effective at detecting anomalies in satellite and rover data, and the results were improved by an ensemble of LSTMs and an SVM [21]. Using transfer learning to reduce the amount of time required to obtain these detector models was shown to be effective as well [7]. Recent work involving extreme learning machines [16, 17] provided another approach to reduce the training time for anomaly detector models. Outside of recurrent neural networks, deep learning approaches using generative adversarial networks (GANs) have been shown to be useful for time-series anomaly detection. A GAN consisting of LSTMs was used to distinguish anomalous data [22], and it was subsequently shown that similar GANs trained with cycle-consistency loss could be used for unsupervised anomaly detection [23].

3. PROPOSED APPROACH

Suppose we have the multi-channel time-series $\mathbf{X} \in \mathcal{X}^{m \times n}$ consisting of telemetry data from a spacecraft subsystem (represented by $\mathcal{X}$), where $m$ is the number of telemetry channels and $n$ is the number of time samples in each channel. In the task of spacecraft time-series anomaly detection, our goal is to identify samples in $\mathbf{X}$ that are unexpected or anomalous.

Our anomaly detector model consists of a time-series predictor and a prediction error model. As mentioned previously, the focus of this paper is on achieving reasonable anomaly detection performance with minimal training time. To that end, we use recurrent neural networks (RNNs) in an online learning scenario as our time-series predictors. Traditionally, in offline learning, deep learning models receive data in batches, where the number of samples in each batch is defined as the "batch size" and training time is described in terms of "epochs" (the number of times the model sees the entire training dataset). In online learning, both the batch size and the number of epochs are equal to 1. We also model the prediction error in real time by observing the prediction errors as they occur to calculate a running mean and variance. Let us first discuss our options for the RNN-based time-series predictor.

Figure 2: Block diagrams of our various RNN-based predictor models: (a) the (RNN) model, (b) the Stacked(RNN) model, (c) the AR(RNN) model, and (d) the ARStacked(RNN) model. The RNN layers can be either LSTM or GRU layers, and the model names are adjusted accordingly, i.e., AR(RNN) becomes ARLSTM or ARGRU. The RNN layers consist of $N_e$ hidden nodes, and the fully-connected (FC) layers are set to provide a single output.

Time-Series Prediction

Given the previous $p$ values, our goal is to predict the next $l$ values of $\mathbf{X}$. This can be accomplished by dividing this time-series prediction problem into $m$ sub-problems, where a unique predictor model is used for each channel.

Let us formalize this problem using notation from the sequence modeling literature [24]. Since we are focusing on each channel individually, without loss of generality, let a time-series from $\mathbf{X}$ be denoted as $X$ and its value at time $t$ as $X_t$. Furthermore, the sequence $(X_a, X_{a+1}, \cdots, X_b)$ can be written in shorthand as $X_a^b$. We want to find a mapping function $f : \mathcal{X}^p \to \mathcal{X}^l$ by training a predictor for the channel $X$, where $p$ is the number of previous samples provided to each predictor and $l$ is the number of future samples predicted by each predictor. In this paper, we choose $l = 1$. With this, we can define the set of input-output pairs used as:

$$A = \{(X_{t-p}^{t-1}, X_t) : p \le t \le n\} = \{(x_i, z_i) : 1 \le i \le N\}, \tag{1}$$

where:

$N = n - p + 1$, the number of unique data samples available.

Recurrent Component

Recurrent neural networks are feedforward networks that include connections between adjacent hidden neurons, introducing the concept of time, or at least sequentiality, to deep learning models [9]. To train these models, we use backpropagation through time (BPTT) [25]. However, the standard RNN model suffers from the vanishing gradient problem. Since we are backpropagating through time, the gradients get smaller and smaller, meaning that the network, in essence, has short-term memory, hampering learning of longer sequences [10].

To address this, two modified RNN architectures have since risen in popularity: (1) Long Short-Term Memory (LSTM) models, proposed in [11]; and (2) Gated Recurrent Units (GRUs), proposed in [12]. Both approaches utilize gates to filter what information is retained, allowing learning on longer time-series sequences.

LSTMs contain three gates: (1) the input gate, (2) the forget gate, and (3) the output gate. These gates are sigmoidal units that control the flow of information within the LSTM. For example, the input gate multiplies the value of the input to the network along with the hidden state information from the previous time step. If the value of the input gate is close to 0, then this information is mostly ignored. If it is
close to 1, almost all of the information is passed through. The forget gate determines if the current contents of the LSTM should be forgotten, while the output gate multiplies the hidden state information to determine the output of the LSTM. This network successfully addressed the vanishing gradient problem, and has become one of the most prevalent neural network models in use today.

GRUs are newer than LSTMs, and contain only two gates: (1) the update gate and (2) the reset gate. The update gate determines if the cell state should be updated with the current information or not, similar to the input gate of the LSTM. The reset gate determines if the previous cell state should be kept or discarded, similar to the forget gate of the LSTM. There is no broad consensus on which of these two recurrent networks is more effective at time-series modeling, so we will be examining both structures.

To keep our predictor model spatially and computationally simple, we use a maximum of two recurrent layers. We will refer to models with two recurrent layers as "stacked" predictors. Using multiple recurrent layers enables us to capture more information about the data we are modeling, and this is true for other deep learning architectures as well.

Fully-Connected Layers

We use a single fully-connected layer to convert from the temporal feature information present in the hidden states of the RNNs to our predictor model's actual prediction. Additionally, we use an activation function to control the model output. For example, a sigmoid function would limit the output to between 0 and 1, while the hyperbolic tangent function would limit it to between −1 and 1. Both of these functions calculate an exponential of their input, which can be more computationally expensive. Since our primary goal here is to limit the range of the output values, we can simplify computation by using "hard" versions of these functions, defined below:

$$\mathrm{Hardsigmoid}(x) = \begin{cases} 0 & \text{if } x \le -3 \\ 1 & \text{if } x \ge 3 \\ \frac{x}{6} + \frac{1}{2} & \text{otherwise} \end{cases} \tag{2}$$

$$\mathrm{Hardtanh}(x) = \begin{cases} 1 & \text{if } x > 1 \\ -1 & \text{if } x < -1 \\ x & \text{otherwise} \end{cases} \tag{3}$$

Autoregressive Component

Due to the non-linear nature of recurrent networks, one observed issue with RNN models is that the scale of outputs is not always sensitive to the scale of inputs [26]. Unfortunately, in specific real datasets, the scale of input signals can constantly change in a non-periodic manner. To attempt to address this, we employ an approach similar to the work in [26]. The final prediction of the model is separated into a linear and non-linear part. The non-linear part contains recurrent patterns and is provided by the recurrent component of our predictor model. The linear part is captured by an autoregressive model that takes the model input and transforms it before adding it to the non-linear part.

To summarize, we are examining several RNN-based predictor models. These models can be described based on their major features: (1) LSTM-based or GRU-based, (2) Stacked or Single Layer, and (3) Autoregressive or Standard. These architectures are depicted in Figure 2.

Anomaly Detection

After we have our time-series predictor, we then need to learn to distinguish between expected and unexpected prediction errors. As stated previously, a common assumption in the time-series anomaly detection literature is that anomalous sequences are unknown, i.e., we only have samples of expected prediction errors. Based on this, we can state that our goal is to understand what expected prediction errors look like. As with previous approaches in the anomaly detection literature [6], our prediction error distribution can be modeled as a normal distribution, meaning error values that are unexpected will have a low probability of occurring.

Recall that to minimize the training time required for our model, we are looking at sequential input of data and online learning principles. Thus, the distribution will have to be estimated in real-time, i.e., the mean and variance are continuously updated based on the latest prediction error. Here, we assume that the majority of the provided time-series is not anomalous, i.e., the general behavior of the time-series is normal, with occasional deviations that are considered outliers.

We can formalize this task as follows, similar to the approach introduced in [17]. Let us consider a batch of data $A_0$, defined below in Equation 4. This is the data used to initialize the anomaly detector.

$$A_0 = \{(x_i, z_i) : 1 \le i \le N_0\} \subset A, \tag{4}$$

where:

$N_0$ is the number of initial data samples used.

Then we can define an initial set of prediction error data $B_0$ as:

$$B_0 = \{e_i : 1 \le i \le N_0\}, \tag{5}$$

where:

$e_i$ is the prediction error at the $i$th sample, or the difference between $z_i$ and $o(x_i)$,
$N_0$ is the number of initial data samples used.

We then find the initial mean $\mu_0$ and variance $\sigma_0^2$ as:

$$\mu_0 = \frac{1}{N_0} \sum_{i=1}^{N_0} e_i, \tag{6}$$

$$\sigma_0^2 = \frac{1}{N_0 - 1} \sum_{i=1}^{N_0} (e_i - \mu_0)^2. \tag{7}$$

In our case, the subsequent batches of data $A_i$ consist of $N_i = 1$ samples, where $i = 1, 2, \ldots$, i.e., we operate on a sample-by-sample basis. Welford [27] showed that a sequential approach to calculating the mean and variance of observed prediction errors can be defined as follows for the $k$th batch of data, where $e$ is the latest prediction error:

$$\mu_k = \mu_{k-1} + \frac{e - \mu_{k-1}}{N_0 + k}, \tag{8}$$

$$\sigma_k^2 = \frac{N_0 + k - 2}{N_0 + k - 1} \sigma_{k-1}^2 + \frac{(e - \mu_{k-1})(e - \mu_k)}{N_0 + k - 1}. \tag{9}$$

With the solutions in Equations 8 and 9, we can model our prediction error distribution as a normal distribution.
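The estimates in Equations 6 to 9 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `RunningErrorModel`, its method names, the variance floor of `1e-12`, and the threshold value `T = -10.0` are assumptions introduced here. The log-likelihood helper uses the standard normal log-density, in line with the normal-distribution assumption stated in this section.

```python
import math


class RunningErrorModel:
    """Running Gaussian model of prediction errors (hypothetical helper)."""

    def __init__(self, initial_errors):
        n0 = len(initial_errors)
        if n0 < 2:
            raise ValueError("need at least two errors to form a sample variance")
        self.count = n0
        # Equation 6: mean of the initial batch of errors.
        self.mean = sum(initial_errors) / n0
        # Equation 7: unbiased sample variance of the initial batch.
        self.var = sum((e - self.mean) ** 2 for e in initial_errors) / (n0 - 1)

    def update(self, e):
        """Fold in one new prediction error (a batch of size 1)."""
        self.count += 1  # this is N0 + k
        prev_mean = self.mean
        # Equation 8: sequential mean update.
        self.mean = prev_mean + (e - prev_mean) / self.count
        # Equation 9: sequential variance update.
        self.var = (
            (self.count - 2) * self.var + (e - prev_mean) * (e - self.mean)
        ) / (self.count - 1)

    def log_likelihood(self, e):
        """Log-density of e under the current normal model."""
        var = max(self.var, 1e-12)  # assumed floor to avoid division by zero
        return -0.5 * (math.log(2.0 * math.pi * var) + (e - self.mean) ** 2 / var)


errors = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01]
model = RunningErrorModel(errors[:3])  # N0 = 3 initial errors
for e in errors[3:]:
    model.update(e)

T = -10.0  # hypothetical threshold; no value is given in this section
print(model.log_likelihood(0.5) < T)   # a large error falls below the threshold
print(model.log_likelihood(0.01) < T)  # a typical error does not
```

Because Equations 8 and 9 are exact recurrences, the running values after the loop equal the batch mean and sample variance of all six errors. Whether an error is scored before or after it is folded into the estimate is a design choice that this sketch leaves to the caller.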
The error model is updated in a sequential manner as new errors are observed. We note that, in this paper, we are not smoothing the errors to account for brief, non-anomalous, abrupt changes in signal value that can occur in spacecraft telemetry, contrary to previous approaches in the literature [1]. This is of significance because adding any filtering does lead to a small delay in anomaly identification, as the corresponding filtered error is delayed. By not filtering, we move from near real-time anomaly identification to real-time identification. We define the anomalies as errors with a log-likelihood of occurrence less than a threshold $T$. In summary, we detect anomalies by modeling the prediction errors in a sequential manner and examining the resulting log-likelihoods.

4. EXPERIMENTAL RESULTS

Dataset

While the standard dataset for evaluating spacecraft anomaly detection is the SMAP/MSL dataset provided by Hundman et al. [1], there have been some concerns raised about the effectiveness of this data [28]. We instead use the MRO-SIN dataset, introduced in [29]. It was created using data from the Mars Reconnaissance Orbiter (MRO). The MRO dataset contains thousands of channels that encompass all spacecraft subsystems, and is considered an unlabeled dataset, i.e., even if anomalies are present, they are not labeled. For each channel, we have 8 days of data preceding a system reboot. The assumption when making the MRO-SIN dataset was that the first five days of data of the time-series in each channel represent normal functioning, i.e., no anomalies are present.

To create the MRO-SIN dataset, the first five days of data in each channel of the MRO dataset were normalized into the range of [−1, 1]. To account for channels with a lot of unchanging values, channels with a standard deviation less than 0.1 were removed. The first 5,000 data points in each channel can be considered training sequences, while the next 10,000 data points were used as test sequences, into which 1 to 4 anomalies were injected at random locations. Additionally, the anomalies were randomly chosen from six different types:

1. Noise: A noisy sequence sampled from a Gaussian distribution is added to the anomaly location
2. Magnification: The values in an anomaly location are multiplied by a value in the range [1.1, 2.5]
3. Shrinkage: The values in an anomaly location are divided by a value in the range [1.1, 2.5]
4. Peak: A large value is inserted in the center of the anomaly location and linear interpolation is used to replace the remaining values
5. Pit: A small value is inserted in the center of the anomaly location and linear interpolation is used to replace the remaining values
6. Data Lost: All values in an anomaly location are set to the most recent previous value, i.e., the data is frozen at a certain value until the anomaly ends

After injecting the anomalies, with the help of experts, 73 channels were selected to constitute the MRO-SIN dataset.

Prediction

We first evaluate the prediction ability of our various online RNN-based predictor models. To do this effectively, we use the training dataset of MRO-SIN, as it represents the spacecraft time-series data we are trying to model while remaining free of any known anomalous behavior. There are 73 channels of data, and each channel has 5000 data points. We use two input sequence lengths, i.e., $p = 10$ and $p = 100$. The Adam optimizer [30] was used to train the model, and mean squared error (MSE) loss was used as the loss function. The recurrent layers in our models all consist of $N_e = 64$ hidden nodes, while our fully-connected (FC) layers are set to provide a single output, i.e., $l = 1$. The activation function used with our FC layers is hardtanh, defined in Equation 3, as our data is between −1 and 1.

To evaluate the prediction performance, we use root mean square error (RMSE), but with a slight modification. We will call this error RMSE(s), and it is identical to RMSE, but with $s$ as a parameter representing how much of the initial data is discarded before calculating the error. Thus, RMSE(0) would represent the standard RMSE, while RMSE(3000) would represent the RMSE for the 3001st data point onward. Since we are evaluating our models in an online learning scenario, this enables us to examine our model performance with more nuance. Suppose a model has a large RMSE(0). A smaller RMSE(3000) shows us that a model is taking longer to adapt to the data but ultimately predicting effectively, while a larger RMSE(3000) indicates that it is not learning from the data well. RMSE(s) is defined as follows, using the set of data $A$ from Equation 1.

$$\mathrm{RMSE}(s) = \sqrt{\frac{1}{N} \sum_{i=1+s}^{N} (z_i - o(x_i))^2}, \tag{10}$$

where:

$N$ is the number of samples in the set of data $A$,
$z_i$ is the paired output corresponding to input $x_i$,
$o(x_i)$ is the output of the predictor model given $x_i$.

We also examine the average time it takes to run each model on the MRO-SIN channels. Since this is an online learning scenario, the processing time here really represents the training time for our models, as they learn while each sample is provided sequentially. We compare our approaches to the existing spacecraft time-series online anomaly detector, EnOS-ELM [17].

We report the prediction results of our RNN-based models in Tables 1 and 2. Let us first examine the difference in performance between LSTMs and GRUs for our MRO spacecraft data. In almost every case, the GRU-based models outperform the LSTM-based models, indicating that they are able to learn the nominal behavior of our telemetry channels better. More specifically, with both RMSE(0) and RMSE(3000) metrics, the GRU-based models do better, showing that they adapt quicker to the new data and are better trained towards the end of the data stream. Additionally, the GRU-based models are also faster to run on the data than the LSTM-based models. Thus, we conclude that our final anomaly detection model for our spacecraft data should use a GRU-based predictor.

We can more closely examine the results reported in Table 1, now comparing our predictor models' results with the EnOS-ELM, proposed in [17]. If we compare using the RMSE(3000) metric, we can see that two of our models, StackedGRU and ARStackedGRU, outperform EnOS-ELM when using input sequence length $p = 10$. The autoregressive component has some impact here as well, slightly reducing the prediction error compared to the model without it. However, this advantage depends on the input sequence length.
Table 1: The average prediction errors of the online GRU-based deep learning models we are examining, as well as the EnOS-ELM model introduced in [17]

Metric        GRU               ARGRU             StackedGRU        ARStackedGRU      EnOS-ELM
              in = 10  in = 100 in = 10  in = 100 in = 10  in = 100 in = 10  in = 100 in = 100
RMSE(0)       0.1673   0.1604   0.1882   0.1698   0.1607   0.1660   0.1736   0.1776   0.0997
RMSE(3000)    0.1047   0.1173   0.1076   0.1274   0.0942   0.1155   0.0932   0.1239   0.0965
Avg. Time     3.95     4.42     4.46     4.91     6.60     6.52     6.98     6.91     13.09

Table 2: The average prediction errors of the online LSTM-based deep learning models we are examining, as well as the EnOS-ELM model introduced in [17]

Metric        LSTM              ARLSTM            StackedLSTM       ARStackedLSTM     EnOS-ELM
              in = 10  in = 100 in = 10  in = 100 in = 10  in = 100 in = 10  in = 100 in = 100
RMSE(0)       0.1974   0.1727   0.2000   0.1762   0.2004   0.1935   0.2115   0.1877   0.0997
RMSE(3000)    0.1210   0.1154   0.1177   0.1218   0.1178   0.1157   0.1193   0.1196   0.0965
Avg. Time     4.02     4.49     4.58     5.02     7.00     6.58     7.15     7.19     13.09
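The RMSE(s) rows in Tables 1 and 2 follow Equation 10. The sketch below is a minimal illustration of that formula under stated assumptions: the function name `rmse_s` and the toy values are hypothetical, and the sum is divided by N, the full number of samples, as Equation 10 is written in this text (rather than by N − s).

```python
import math


def rmse_s(targets, predictions, s=0):
    """RMSE(s): skip the first s samples, then normalize by the full length N."""
    n = len(targets)
    if n != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not 0 <= s < n:
        raise ValueError("s must satisfy 0 <= s < N")
    squared_error = sum((z - o) ** 2 for z, o in zip(targets[s:], predictions[s:]))
    return math.sqrt(squared_error / n)


# Toy stream: the predictor is wrong on the first sample only.
targets = [1.0, 2.0, 3.0, 4.0]
predictions = [0.0, 2.0, 3.0, 4.0]
print(rmse_s(targets, predictions, s=0))  # 0.5
print(rmse_s(targets, predictions, s=1))  # 0.0
```

With this normalization, RMSE(s) can never exceed RMSE(0) for the same predictions, so the informative quantity is how far it drops once the early adaptation phase is excluded.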

Figure 3: The prediction results of our GRU, StackedGRU, and ARStackedGRU models, as well as the EnOS-ELM model, on the MRO-SIN channel A-0412. Panels: (a) GRU model, (b) StackedGRU model, (c) ARStackedGRU model, (d) EnOS-ELM model. We used an input sequence length $p = 10$ for the GRU-based models, and $p = 100$ for the EnOS-ELM model. All of the models adapt to the data quickly, but the EnOS-ELM clearly learns the fastest.
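The adaptation visible in Figure 3 can be imitated on a synthetic sinusoid. The sketch below is not one of the models evaluated in this paper: as an assumption made purely for illustration, it swaps in a normalized least-mean-squares (NLMS) linear predictor so that the example needs no deep learning library. What it shares with the online setting used here is the data flow: each sample is seen once, a prediction is made from the previous p values, and the model is updated immediately afterwards. The function name, step size, and signal period are hypothetical choices.

```python
import math


def online_nlms_errors(series, p=10, step=0.5, eps=1e-8):
    """Single pass over the series with batch size 1: predict, record error, update."""
    weights = [0.0] * p
    errors = []
    for t in range(p, len(series)):
        window = series[t - p:t]  # the previous p values
        prediction = sum(w * x for w, x in zip(weights, window))
        error = series[t] - prediction
        errors.append(error)
        # NLMS update (stand-in for one optimizer step on one sample).
        norm = eps + sum(x * x for x in window)
        weights = [w + step * error * x / norm for w, x in zip(weights, window)]
    return errors


def rms(values):
    """Root mean square of a list of errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Synthetic stand-in for a sinusoidal telemetry channel in [-1, 1].
series = [math.sin(2.0 * math.pi * t / 50.0) for t in range(5000)]
errors = online_nlms_errors(series)

early_rmse = rms(errors[:100])   # while the predictor is still adapting
late_rmse = rms(errors[-100:])   # after it has seen most of the stream
print(early_rmse, late_rmse)
```

Comparing the early and late error is the same idea as comparing RMSE(0) with RMSE(3000): a large gap means the predictor needed time to adapt but ended up tracking the signal.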

length p = 100, and for both lengths for all of the other models, EnOS-ELM has a lower prediction error. While even our slowest RNN models are still nearly twice as fast as the EnOS-ELM, we can also see the difference in performance in both RMSE(0) and RMSE(3000). The EnOS-ELM has nearly identical performance in those two metrics, indicating that it reaches a stable prediction performance almost instantaneously. For our predictor models, there is a significant difference in those metrics, showing that it takes longer to adapt to the data. We can see this visually in Figure 3.

The results shown are the prediction performance of the base GRU, the StackedGRU, the ARStackedGRU, and the EnOS-ELM models on the A-0412 channel from the MRO-SIN dataset. We note that as these results were obtained in an online learning scenario, we can see the learning process for the models as they sequentially progress through the data. This channel data is sinusoidal in nature, and its relative simplicity helps display the differences in adaptability for these models. We can visually see how the GRU model adapts to the data, and how the StackedGRU improves on its base model by better matching the data earlier, showing that stacking recurrent network layers helps capture information faster. The ARStackedGRU improves even further, scaling the outputs better to match the input magnitude and getting closer to the actual data almost at the start. However, the EnOS-ELM matches the data almost perfectly from the beginning, showing how it almost instantaneously achieves stable prediction performance. Taking this into consideration, we move on to evaluating the spacecraft anomaly detection performance of the GRU-based models.

Anomaly Detection
Based on the results observed during the prediction evaluation, we decided to use a GRU-based predictor model. We also determined it was better to first provide our predictor

Table 3: The anomaly detection results of our GRU-based models

Metric      GRU                ARGRU              StackedGRU         ARStackedGRU       EnOS-ELM
            in = 10  in = 100  in = 10  in = 100  in = 10  in = 100  in = 10  in = 100  in = 100
Precision   0.486    0.450     0.317    0.388     0.346    0.451     0.391    0.394     0.402
Recall      0.449    0.534     0.385    0.441     0.398    0.551     0.381    0.475     0.432
F1-Score    0.467    0.488     0.347    0.413     0.370    0.496     0.386    0.431     0.416
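
The scores in Table 3 are sequence-level: they follow the overlap criteria of Hundman et al. [1] that this section uses for true positives, false positives, and false negatives. The following is a minimal sketch of that counting, assuming anomaly sequences are given as inclusive (start, end) sample-index pairs. The function names and the example intervals are illustrative and are not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) sample indices


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one sample."""
    return a[0] <= b[1] and b[0] <= a[1]


def score_sequences(detected: List[Interval], true: List[Interval]):
    """Count sequence-level TP, FP, FN using the overlap criteria."""
    # TP: each true sequence hit by one or more detections counts once.
    tp = sum(any(overlaps(t, d) for d in detected) for t in true)
    # FN: true sequences with no overlapping detection.
    fn = len(true) - tp
    # FP: detections that overlap no true sequence.
    fp = sum(not any(overlaps(d, t) for t in true) for d in detected)
    return tp, fp, fn


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Two detections overlap the same true sequence (one TP), one detection
# overlaps nothing (one FP), and one true sequence is missed (one FN).
detected = [(10, 20), (15, 30), (100, 110)]
true = [(18, 25), (200, 220)]
tp, fp, fn = score_sequences(detected, true)
print(tp, fp, fn, precision_recall_f1(tp, fp, fn))

# Consistency check against Table 3 (GRU, in = 10).
print(round(f1_score(0.486, 0.449), 3))
```

The last line recomputes the F1-Score for the GRU column with in = 10 from its precision and recall, which agrees with the tabulated 0.467 after rounding.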

models with the training sequences for each channel before providing the corresponding test sequences. This enabled us to provide the potentially anomalous sequences after the GRU-based predictor had adapted to the channel data, enabling a proper anomaly detection evaluation. The EnOS-ELM model received the same treatment but, as observed above, did not benefit greatly from this additional data. Thus, the initial batch A0 defined in Equation 4 actually consisted of the final N0 = 2000 samples of the training sequence, while the threshold used to determine anomalies was T ∈ [−8, −4], depending on the predictor model used. These values can also be adjusted on a per-channel basis, if necessary.

We use the same anomaly detection evaluation criteria as Hundman et al. [1]:

1. We record a true positive (TP) if any detected anomaly sequence overlaps with a true anomaly sequence. If multiple detected anomalies overlap with the same true anomaly sequence, only one true positive is recorded.
2. We record a false positive (FP) if a detected anomaly sequence has no overlap with a true anomaly sequence.
3. We record a false negative (FN) if a true anomaly sequence has no overlap with a detected anomaly sequence.

We use precision, recall, and F1-score to evaluate anomaly detection performance, which is shown in Table 3. A higher precision indicates that fewer false anomalies are identified, while a higher recall means fewer false negatives occur. Depending on the operating situation, one can choose which behavior to prioritize. The F1-score is a combination of the two metrics, which we use to rank the performance of the models being examined. As we can see, some of our GRU-based models do better than the EnOS-ELM at anomaly detection, while others do worse. Interestingly, each model's better anomaly detection performance occurs when the input sequence length is p = 100 instead of 10, while the prediction performance tended to favor the shorter input sequence length.

We can focus on the models we examined in the prediction evaluation: (1) GRU, (2) StackedGRU, and (3) ARStackedGRU. The StackedGRU has the best anomaly detection F1-score, followed by the GRU, and then the ARStackedGRU. All three of these models also performed better at spacecraft anomaly detection than the existing online approach EnOS-ELM. However, the EnOS-ELM performed better than these three models in the prediction task when considering the input sequence length p = 100. This dichotomy of performance in the prediction and anomaly detection tasks has some interesting implications. While a general statement can be made that a good predictor model is a necessary component of a good anomaly detection model, we can now also conclude that the best prediction model does not necessarily lead to the best anomaly detection model. There could be several explanations for this result, one of which is that focusing solely on the input values, as the EnOS-ELM does, could lead to good prediction results on clean data but cause anomalous behavior to be missed as the model adapts too quickly to some changes. This explanation also addresses the ARStackedGRU performance, as the autoregressive component directly incorporates the input values back into the output of the network. The decrease in performance of ARStackedGRU can also be explained by examining the types of anomalies present in the MRO-SIN dataset. Magnification, shrinkage, and data loss could all have been easily missed with the autoregressive layer directly providing input information to the output.

After this analysis, we can see that our GRU-based anomaly detector models outperform the existing spacecraft time-series online anomaly detection approach EnOS-ELM, both in computational speed and anomaly detection performance. While they require some more training data to initialize properly for anomaly detection, the computational costs that are saved by using an online learning approach make this one-time initialization condition more acceptable. In a scenario where absolutely nothing is known about the data to be modeled, EnOS-ELM could be used to monitor the channels initially while our GRU-based models are initialized with data that is known to be non-anomalous. Additionally, the depth of research being done with recurrent neural networks means that in the future, further adjustments can be made to the performance of these models.

5. CONCLUSIONS
It is necessary to monitor spacecraft systems for anomalous behavior so potential mission-limiting issues can be identified. The current approach of manual monitoring by domain experts is limited in scope and time-consuming. Training RNN-based predictor models in an offline manner for each channel has been shown to be effective but, given the thousands of channels present in a system, is also computationally expensive. In this paper, we explore the use of RNN-based predictor models trained in an online learning scenario. Our models are able to learn from the time-series data in real time, dramatically reducing the amount of training required to obtain a predictor model for each channel. Furthermore, the anomaly detection performance of our models is comparable to state-of-the-art online spacecraft anomaly detection methods. Future work includes investigating the feasibility of using neural network compression techniques to optimize these models for computation and storage costs.

ACKNOWLEDGMENTS
This material is based on research sponsored by Lockheed Martin Corporation. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Lockheed Martin Corporation.

Address all correspondence to Edward J. Delp at [email protected].
REFERENCES

[1] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and T. Soderstrom, “Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 387–395, August 2018, London, United Kingdom.
[2] S. Chauhan and L. Vig, “Anomaly Detection in ECG Time Signals via Deep Long-Short Term Memory Networks,” Proceedings of the IEEE International Conference on Data Science and Advanced Analytics, pp. 1–7, October 2015, Paris, France.
[3] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, G. Shroff, and P. Agarwal, “LSTM-Based Encoder-Decoder for Multi-Sensor Anomaly Detection,” arXiv preprint arXiv:1607.00148, July 2016.
[4] A. Nanduri and L. Sherry, “Anomaly Detection in Aircraft Data Using Recurrent Neural Networks (RNN),” Proceedings of Integrated Communications Navigation and Surveillance, pp. 5C2–1–5C2–8, April 2016, Herndon, VA.
[5] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Computing Surveys, vol. 41, no. 3, July 2009.
[6] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, “Long Short Term Memory Networks for Anomaly Detection in Time Series,” Proceedings of the European Symposium on Artificial Neural Networks, April 2015, Bruges, Belgium.
[7] S. Baireddy, S. R. Desai, J. L. Mathieson, R. H. Foster, M. W. Chan, M. L. Comer, and E. J. Delp, “Spacecraft Time-Series Anomaly Detection Using Transfer Learning,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on AI for Space, pp. 1951–1960, June 2021, Virtual Conference.
[8] R. Samsudin, A. Shabri, and P. Saad, “A Comparison of Time Series Forecasting Using Support Vector Machine and Artificial Neural Network Model,” Journal of Applied Sciences, vol. 10, no. 11, pp. 950–958, 2010.
[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Representations by Back-Propagating Errors,” Nature, vol. 323, no. 6088, pp. 533–536, October 1986.
[10] R. Pascanu, T. Mikolov, and Y. Bengio, “On the Difficulty of Training Recurrent Neural Networks,” Proceedings of the International Conference on Machine Learning, pp. 1310–1318, June 2013, Atlanta, GA.
[11] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, November 1997.
[12] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734, October 2014, Doha, Qatar.
[13] Y. Cui, S. Ahmed, and J. Hawkins, “Continuous Online Sequence Learning with an Unsupervised Neural Network Model,” Neural Computation, vol. 28, no. 11, pp. 2474–2504, 2016.
[14] J. Struye and S. Latre, “Hierarchical Temporal Memory and Recurrent Neural Networks for Time Series Prediction: An Empirical Validation and Reduction to Multilayer Perceptrons,” Neurocomputing, vol. 396, pp. 291–301, 2020.
[15] R. Ye and Q. Dai, “A Novel Transfer Learning Framework for Time Series Forecasting,” Knowledge-Based Systems, vol. 156, pp. 74–99, 2018.
[16] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme Learning Machine: Theory and Applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489–501, 2006.
[17] S. Baireddy, S. R. Desai, R. H. Foster, M. W. Chan, M. L. Comer, and E. J. Delp, “Spacecraft Time-Series Online Anomaly Detection Using Extreme Learning Machines,” Proceedings of IEEE Aerospace, March 2022, Big Sky, MT.
[18] T. Wen and R. Keyes, “Time Series Anomaly Detection Using Convolutional Neural Networks and Transfer Learning,” arXiv preprint arXiv:1905.13628, May 2019.
[19] P. Zheng, S. Yuan, X. Wu, J. Li, and A. Lu, “One-Class Adversarial Nets for Fraud Detection,” Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1286–1293, July 2019, Honolulu, HI.
[20] P. Hayton, S. Utete, D. King, S. King, P. Anuzis, and L. Tarassenko, “Static and Dynamic Novelty Detection Methods for Jet Engine Health Monitoring,” Philosophical Transactions of the Royal Society A, vol. 365, pp. 493–514, 2006.
[21] T. Li, S. R. Desai, J. L. Mathieson, R. H. Foster, M. W. Chan, M. L. Comer, and E. J. Delp, “A Stacked Predictor and Dynamic Thresholding Algorithm for Anomaly Detection in Spacecraft,” Proceedings of the IEEE Military Communications Conference, pp. 165–170, November 2019, Norfolk, VA.
[22] D. Li, D. Chen, B. Jin, L. Shi, J. Goh, and S.-K. Ng, “MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks,” Proceedings of the International Conference on Artificial Neural Networks, pp. 703–716, September 2019, Munich, Germany.
[23] A. Geiger, D. Liu, S. Alnegheimish, A. Cuesta-Infante, and K. Veeramachaneni, “TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks,” arXiv preprint arXiv:2009.07769, November 2020.
[24] Z. Mariet and V. Kuznetsov, “Foundations of Sequence-to-Sequence Modeling for Time Series,” Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 408–417, April 2019, Naha, Japan.
[25] P. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[26] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks,” Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104, 2018, Ann Arbor, MI.
[27] B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419–420, 1962.
[28] R. Wu and E. J. Keogh, “Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the
Illusion of Progress,” arXiv preprint arXiv:2009.13807, August 2021.
[29] T. Li, S. R. Desai, R. H. Foster, M. W. Chan, M. L. Comer, and E. J. Delp, “A Matching-Based Method for Anomaly Verification in Spacecraft Telemetry,” Proceedings of IEEE Aerospace, March 2022, Big Sky, MT.
[30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Proceedings of the 3rd International Conference for Learning Representations, May 2015, San Diego, CA.

BIOGRAPHY

Sriram Baireddy is a PhD candidate in Electrical Engineering at Purdue University. He earned his B.S. and M.S. degrees in Electrical Engineering at Purdue in 2018 and 2021, respectively, with minors in economics, math, and physics. He currently investigates the application of machine learning techniques to signals, images, and videos for forensic and agricultural research.

Sundip R. Desai is a Guidance, Navigation and Controls engineer and Associate Fellow at Lockheed Martin Corporation. His research at the Advanced Technology Center has been focused on general machine learning, computer vision, explainable AI, recommender systems, pose estimation, anomaly detection and characterization of time series signals.

Richard H. Foster is an Engineering Manager/Senior Researcher at the Lockheed Martin Corporation Advanced Technology Center. His research interests include System Protection using multiphenomenology observables and applying AI/machine learning techniques to advance the methods for the protection of systems. His interests also include applying optimal estimation techniques to optimize the design of advanced communication systems.

Moses W. Chan is a Lockheed Martin technical fellow and an Adjunct Professor of Electrical and Computer Engineering at Purdue University. His research focus resides primarily on defensive systems with multi-sensor and multi-int fusion, missile defense, space tracking and surveillance, and spacecraft anomaly detection.

Mary L. Comer is an Associate Professor of Electrical and Computer Engineering at Purdue University. Her research interests include statistical image modeling and analysis, stochastic simulation of images, rare event modeling and simulation, and anomaly detection.

Edward J. Delp is the Charles William Harrison Distinguished Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering at Purdue University. His research interests include image and video processing, image analysis, computer vision, image and video compression, multimedia security, medical imaging, multimedia systems, communication and information theory.
