A Large-Scale Annotated Multivariate Time Series Aviation Maintenance Dataset from the NGAFID
Abstract
This paper presents the largest publicly available, non-simulated, fleet-wide aircraft
flight recording and maintenance log data for use in predicting part failure and
maintenance need. We present 31,177 hours of flight data across 28,935 flights,
which occur relative to 2,111 unplanned maintenance events clustered into 36 types
of maintenance issues. Flights are annotated as before or after maintenance, with
some flights occurring on the day of maintenance. Collecting data to evaluate
predictive maintenance systems is challenging because it is difficult, dangerous,
and unethical to generate data from compromised aircraft. To overcome this, we
use the National General Aviation Flight Information Database (NGAFID), which
contains flights recorded during regular operation of aircraft, and maintenance logs
to construct a part failure dataset. We use a novel framing of Remaining Useful
Life (RUL) prediction and consider the probability that the RUL of a part is greater
than 2 days. Unlike previous datasets generated with simulations or in laboratory
settings, the NGAFID Aviation Maintenance Dataset contains real flight records
and maintenance logs from different seasons, weather conditions, pilots, and flight
patterns. Additionally, we provide Python code to easily download the dataset
and a Colab environment to reproduce our benchmarks on three different models.
Our dataset presents a difficult challenge for machine learning researchers and a
valuable opportunity to test and develop prognostic health management methods.
1 Introduction
Maintenance-related issues are a notable risk in the operation of single-engine fixed-wing aircraft
for private use, known as General Aviation (GA), Goldman et al. (2002). With improvements in data
collection and processing capabilities, there is now an opportunity to indirectly monitor the condition
of a variety of components in GA aircraft for the purposes of Predictive Maintenance (PM).
The domain of prognostics and health management (PHM) has shifted towards the use of more
data-driven approaches Tsui et al. (2015), sometimes also involving deep learning methods. These
methods aim to predict the Remaining Useful Life (RUL), detect faults, or monitor the condition of
a system and its parts. However, it is particularly difficult to collect data for these purposes.
Rezaeianjouybari and Shang (2020) state that the data collection process is often too time-consuming
and the resulting data too “perfect”: data collected under laboratory conditions may not translate
well to the real world. There is a clear lack of high-quality, real-world data, yet a great need for
such data to evaluate new PHM methodologies.
Existing datasets for PHM often use simulated data, e.g. Liu et al. (2012), or model systems in
controlled environments, such as the power plant data in the 2015 Prognostics and Health Management
Society Data Challenge. However, PHM methodologies can also be applied to real-world systems in
largely uncontrolled environments, such as cars or aircraft, which collect ever more data; see Arena
et al. (2021) for a review of predictive maintenance in the automotive sector.
2 Related Work
2.1 Aircraft and Aircraft Maintenance Datasets
Publicly available aircraft maintenance datasets often use simulated data rather than data collected
from real life events. Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) Liu
et al. (2012) is a commonly used simulation program for aircraft engines. These simulations are
sometimes supplemented by real flight conditions, e.g. Arias Chao et al. (2021), where flight data is
used as the input for the simulation. However, these approaches rely on synthetic data, which may
not reflect the noisy nature of data collection. There are papers that reference real data, but they do
not publicly release their datasets, e.g. Celikmih et al. (2020) and Dangut et al. (2022). The task of
detecting anomalies from flight data was investigated by Chu et al. (2010); however, the dataset was
generated from a flight simulator. Yildirim and Kurt (2018) collect real-world data from aircraft, but
only during normal operation and not when there is a fault. A common difficulty in
collecting real world failure data is the associated cost and safety risk; it would be unethical to ask
pilots to fly faulty aircraft. Yang et al. (2021) presented a smaller version of this dataset for aviation
maintenance.
Aircraft flight data has also been used for other tasks, aside from maintenance prediction. Klein
(1989) used real data to estimate aerodynamic properties of aircraft. Khadilkar and Balakrishnan
(2012) used aircraft flight recordings to estimate fuel usage during the taxiing stage. Aircraft flight
data can also be used to improve estimation of the remaining range for the aircraft, Randle et al.
(2011). In most PHM datasets, e.g. Liu et al. (2012), the goal is to predict an estimate or probability
density function of the RUL prior to a fault occurring. Prior to this work, there were no publicly
available fleet-wide maintenance records of this size and scope coupled with flight data recordings in
both fault free and faulty states.
There are many time series classification datasets, such as Bagnall et al. (2018). These datasets are
used to evaluate both deep learning methods, such as InceptionTime, Fawaz et al. (2020), and
non-deep-learning methods, such as MiniRocket, Dempster et al. (2021), and HIVE-COTE, Lines et al. (2018).
Time series classification could also include audio datasets, such as Vincent et al. (2013), but such
datasets belong in their own domain of audio processing. As shown by Bagnall et al. (2018), a
collection of time series datasets will contain very diverse types of data, such as motion sensor data,
electroencephalogram data, traffic data and more. Few time series datasets exist that contain as many
training examples with so many data points per training example.
Several methods have been developed for multivariate time series (MTS) classification; for a review,
see Fawaz et al. (2019). Notable non-deep-learning methods include distance-based k-nearest
neighbors by Orsenigo and Vercellis (2010) and Dynamic Time Warping KNN by Seto et al. (2015).
Among deep learning methods, well-performing MTS classifiers tend to utilize some combination of
Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) methods, Karim et al.
(2017), or CNN-only methods, Wang et al. (2017). However, RNN methods struggle with long
sequences due to the vanishing gradient problem, as noted by Le and Zuidema (2016). CNN-only
methods can provide strong results for MTS classification, as shown by Assaf et al. (2019), but may
struggle when relevant features are temporally sparse yet related. Yang et al. (2021) also used multi-headed
self-attention methods.
3 Dataset Description
The full dataset is constructed using flight recordings from the NGAFID database and maintenance
records from MaintNet. Each maintenance record describes the type of maintenance performed on
an aircraft and the time period when maintenance occurred. From this information, we extract that
aircraft’s flights occurring before and after the maintenance period. Only the first 5 flights before and
after, and any flights during maintenance, are extracted. All maintenance activities were unplanned
and performed on request. When an issue is detected, FAA policy forbids the aircraft from flying.
This means that issues are always detected during or after the flight immediately preceding maintenance.
Each flight record contains readings from 23 different sensors every second. A description of the
sensors can be found in Table 1. All flight data recordings come from the same aircraft model, the
Cessna 172.
The NGAFID serves as a repository for general aviation flight data, with a web portal for viewing
and tracking flight safety events for individual pilots as well as for fleets of aircraft Karboviak et al.
(2018). The NGAFID currently contains over 900,000 hours of flight data generated by over 780,000
flights by 12 different types of aircraft, provided by 65 fleets and individual users, resulting in over
3.15 billion per-second flight data records across 103 potential flight data recorder parameters. Five
years of textual maintenance records from a fleet which provided data to the NGAFID have been
clustered by maintenance issue type and then validated by domain experts for the MaintNet project,
see Akhbardeh et al. (2020).
MaintNet’s maintenance record logbook data was clustered into 36 different maintenance issue types.
The count of flights per issue type is given in Table 3 in Appendix A. Although some maintenance
issues occur very rarely, all maintenance issues with flight data are included in the full dataset. It
is important to note that the NGAFID collects real flight data from aircraft flying with potentially
faulty parts (as is the case for any real world fleet of aircraft). This is because individual components
may fail without causing catastrophic failure of the aircraft during regular operation.
Column Name Description
volt1 Main electrical system bus voltage (alternators and main battery)
volt2 Essential bus (standby battery) bus voltage
amp1 Ammeter on the main battery (+ charging, - discharging)
amp2 Ammeter on the standby battery (+ charging, - discharging)
FQtyL Fuel quantity left
FQtyR Fuel quantity right
E1 FFlow Engine fuel flow rate
E1 OilT Engine oil temperature
E1 OilP Engine oil pressure
E1 RPM Engine rotations per minute
E1 CHT1 1st cylinder head temperature
E1 CHT2 2nd cylinder head temperature
E1 CHT3 3rd cylinder head temperature
E1 CHT4 4th cylinder head temperature
E1 EGT1 1st Exhaust gas temperature
E1 EGT2 2nd Exhaust gas temperature
E1 EGT3 3rd Exhaust gas temperature
E1 EGT4 4th Exhaust gas temperature
OAT Outside air temperature
IAS Indicated air speed
VSpd Vertical speed
NormAc Normal acceleration
AltMSL Altitude above mean sea level
Table 1: Description of the data collected by aircraft sensors
The collection of this data poses no additional safety risk to the pilots because data collection occurs
for all flights performed.
Attempting to train a model on the full dataset is particularly difficult. This is because of two main
issues. The first is the under-representation of certain classes. For example, the spark plug related
issue class contains only 15 flights. These classes are included in the dataset to allow future researchers
to train models that address the class imbalance issue, as class imbalance is an important area of
machine learning research Johnson and Khoshgoftaar (2019). The second is that each flight contains
a significant amount of data, which makes it hard to regularize a model trained on the data. In the
spark plug example, each of the 15 flights contains more than 40,000 data points, which would be
used to predict a single class out of 36. Regularization is another important research area of machine
learning, Kukačka et al. (2017), but it is beyond the scope of this paper. These two challenges are
inevitable when collecting real world data, as it is impossible to guarantee that each part in a real
world aircraft fails at the same rate.
A subset of the full dataset is used in this paper to benchmark time series classification methods and
provide a baseline. Since addressing class imbalance and regularization are beyond the scope of this
paper, the subset is designed to minimize the impact of those two issues. Limiting the subset to only
classes that have at least 50 flights removes flights where generalization is a problem and reduces
the impact of class imbalance. The subset also defines the binary RUL problem as $P(RUL > x)$
with $x = 2$ days. We label flights within 2 days before maintenance as $P(RUL > x) = 0$ and flights
within 2 days after maintenance as $P(RUL > x) = 1$, as such flights contain brand new parts.
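To make this labeling rule concrete, the following Python sketch assigns the binary label from flight and maintenance dates. It is a minimal illustration of the rule described above; the function and argument names are hypothetical, and since dates are withheld from the public release for privacy, the released subset ships with these labels already assigned.

```python
from datetime import date, timedelta

def binary_rul_label(flight_date, maint_start, maint_end, x=timedelta(days=2)):
    """Return the binary label for P(RUL > x), or None if the flight is
    outside the subset (during maintenance or beyond the 2-day window)."""
    if maint_start <= flight_date <= maint_end:
        return None  # flights during the maintenance period are excluded
    if maint_start - x <= flight_date < maint_start:
        return 0     # P(RUL > x) = 0: the part fails within x days
    if maint_end < flight_date <= maint_end + x:
        return 1     # P(RUL > x) = 1: the part has just been replaced
    return None      # outside the two-day window, not in the subset

# a flight one day before an unplanned maintenance event
assert binary_rul_label(date(2020, 5, 9), date(2020, 5, 10), date(2020, 5, 11)) == 0
```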
The data subset is defined as all flights within two days of the maintenance period, excluding any
flights during the maintenance period. Furthermore, any flights belonging to classes that would have
fewer than 50 flights before or after the maintenance period are also excluded.
Figure 1: Exhaust gas temperature before and after maintenance for an intake gasket issue. This shows
that the sensor readings before an issue occurs and after it is fixed are largely the same. Notice that the
EGT values are mostly in line with each other, but diverge at certain times. It is not obvious where
an anomaly occurs, nor is it obvious whether or not there is an issue. The difference in temperature
values also indicates a difference in flight plan.
This leaves us with a total of 5844 flights after maintenance and 5602 flights before maintenance
across 19 different classes of maintenance issues. The count of flights per issue type is given in
Table 3 in Appendix A.
The full dataset is provided without any preprocessing and contains the full flight data for each
flight. Some sensors recorded NaN values at certain time steps. This can be caused by many factors,
including but not limited to initialization of the sensor at the beginning of the flight, failure of the
sensor during flight, and failure of the recording system during flight. Approximately 1% of all
datapoints contain NaN values. We decided to leave the NaN values in place for future researchers, as
replacing them with another value, such as 0, would change their meaning.
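For example, a user who prefers imputation over the raw NaN values could fill gaps per flight; the sketch below uses forward/backward fill, which is one assumed choice among many, not a recommendation of this paper.

```python
import pandas as pd

# One possible way to handle the ~1% NaN sensor readings, which the dataset
# deliberately leaves in place: forward-fill mid-flight dropouts, then
# back-fill any NaNs at the start of the recording.
def fill_sensor_gaps(flight_df: pd.DataFrame) -> pd.DataFrame:
    return flight_df.ffill().bfill()
```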
For training a model in our benchmark experiments, the data subset is scaled. All values are scaled
via MinMax, with a scaled minimum value of 0 and a scaled maximum value of 1. Minimums
and maximums for each channel of data were calculated using all of the data. Please note that
the full dataset is not distributed with its values scaled, which allows for examination of different
normalization techniques by users of the dataset.
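A minimal sketch of this per-channel MinMax scaling, assuming flights are stacked into an array of shape (flights, channels, time steps):

```python
import numpy as np

def minmax_scale(flights: np.ndarray) -> np.ndarray:
    """Scale each sensor channel into [0, 1]. Channel minimums and maximums
    are computed over all flights and time steps, as in the benchmark setup;
    the (n_flights, n_channels, n_timesteps) layout is an assumption."""
    mins = np.nanmin(flights, axis=(0, 2), keepdims=True)
    maxs = np.nanmax(flights, axis=(0, 2), keepdims=True)
    return (flights - mins) / (maxs - mins)  # NaN readings remain NaN
```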
3.4 Visualization
Readings from 4 sensors across two flights are provided in Figure 1. Only the output from 4 sensors is
included to keep the graph readable. Even with only 4 sensors, it should be clear that determining the
probability of part failure would be extremely difficult for a human. Even with domain knowledge, it
would only be possible to detect that an issue has occurred, not that an issue is going to occur.
To ensure privacy, information identifying the serial number of the aircraft and the date of each
flight has been removed from the publicly released dataset. Latitude and longitude are also removed
from each flight. The specific text of the maintenance logs is also withheld.
3.6 Labels
Labels are assigned to flights based on the date of the flight and the date of the maintenance. As
per the Federal Aviation Administration regulations, aircraft are only permitted to operate in a safe
manner. This means that maintenance issues and part failures that occurred prior to the maintenance
date will be fixed after the maintenance date.
Part failures in the aircraft can be described as acute or chronic. Acute part failure describes a
complete failure of a part due to an unexpected event. For example, if the aircraft collides with a
small object, such as a bird, this would create data representing an acute failure and an instantaneous
change in the remaining useful life. Chronic part failure describes gradual wear and tear that renders
a part unsafe for flight, before scheduled maintenance can replace the part. For example, an intake
gasket may wear down more quickly than expected, leading to a leakage that negatively impacts the
safety of the aircraft.
It should be noted that dramatic decreases in remaining useful life are rare. In the intake gasket
leak/damage class, 9085 cases were described as leaking and only 15 cases were described as torn.
This may negatively impact anomaly detection methods and will be discussed
in a later section.
Because all maintenance records are related to unplanned maintenance, it is clear that the related
parts reached the end of their remaining useful life prematurely. These unplanned maintenance events
occurred outside of scheduled maintenance and can be considered anomalies in the expected wear and
tear of the associated parts.
The detection of maintenance issues and their associated part failures is quite challenging. We define
this as estimating $\min_{i \in I} P(RUL_i > x)$, where $I$ represents the set of all parts and $RUL_i$
is the remaining useful life of part $i$.
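In other words, an aircraft is flagged when the smallest per-part probability is low. A one-line sketch, with hypothetical per-part model outputs:

```python
import numpy as np

# Sketch of the detection target: the aircraft needs maintenance when any
# part is near failure, i.e. when min over parts of P(RUL_i > x) is low.
# The per-part probabilities here are hypothetical model outputs.
def detection_score(part_probs: np.ndarray) -> float:
    return float(np.min(part_probs))  # low score => some part close to failure
```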
While this problem may resemble anomaly detection, it is important to note that a significant portion
of variance in the data is caused by outside factors. Pilot actions can dramatically alter readings for
every sensor on the aircraft, causing no two flights to be the same. Weather conditions, which may
vary with altitude, are also inconsistent. For these reasons, an anomaly detection approach may find
a large number of anomalous sections on any given flight. These anomalies may be caused by a
maintenance issue or irregular flight activities.
It should be clear that the data collected by aircraft sensors poses a greater challenge than data from
sensors on laboratory or industrial assets. In those settings, the environment is much more controlled
than the environment of an aircraft in flight, and dedicated sensors may be used to collect data for the
purpose of detecting a specific issue. This is not the case for our dataset: the sensors collect general
flight data and were not originally designed to collect data for maintenance issue prediction and
binary RUL estimation.
An extension of the previous problem is to also classify the type of maintenance issue based on the
maintenance record left by the mechanic. This problem is significantly more difficult for two reasons.
First is the class imbalance, with some classes having thousands of flights and others having fewer than
one hundred. Second is the similarity in how maintenance issues affect flight characteristics. This is
because two different issues may cause very similar changes in flight characteristics.
Formally, this is defined as estimating $P(RUL_i > x)$ for all $i \in I$. This is significantly more
difficult than detection of the maintenance issue alone.
In addition to the class labels for specific maintenance issues, this dataset is provided with a simple
hierarchy for maintenance issues. There are 5 groupings of maintenance issues: engine, baffle, oil,
cylinder, and other. Argyriou et al. (2006) showed that hierarchies can provide benefits in terms of
regularization, as they improve the information provided by the labels. These hierarchies are included
for potential future research.
4 Benchmark Experiments
Three different tasks are defined, and all three use the same subset of data. The first task is
maintenance issue detection. Here, we assign after-maintenance flights a negative label and
before-maintenance flights a positive label. The second task is maintenance issue classification,
where we assign after-maintenance flights class 0 and before-maintenance flights their maintenance
issue class (from 1 to 19). The third is combined detection and classification, where a
network is trained on both tasks simultaneously.
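The sketch below shows how the three targets could be derived for a single flight; the argument names are assumptions rather than the released schema.

```python
def task_labels(is_before_maintenance: bool, issue_class: int):
    """Sketch of the three benchmark targets for one flight. `issue_class`
    is the maintenance issue (1..19) for before-maintenance flights."""
    detection = 1 if is_before_maintenance else 0                 # task 1: binary
    classification = issue_class if is_before_maintenance else 0  # task 2: 0..19
    return detection, classification  # task 3 trains on both simultaneously
```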
4.2 Models
4.2.2 HIVECOTEv2
We attempted to test HIVECOTEv2 Middlehurst et al. (2021) on the NGAFID maintenance dataset
using the sktime implementation by Löning et al. (2022). However, the training time for the model
exceeded the maximum time allocated for Google Colab instances (24 hours). Shifaz et al. (2020) note
that HIVECOTE scales polynomially with the data, which could explain why it could not finish training in time on
the NGAFID maintenance dataset. Not only is the NGAFID maintenance dataset large, it also has
many time steps and many channels. No results for HIVECOTEv2 are included.
4.2.4 ConvMHSA
Multi-Headed Self Attention (MHSA) modules were popularized by Devlin et al. (2018) for usage in
Natural Language Processing and by Dosovitskiy et al. (2020) for Computer Vision. The ConvMHSA
model used for benchmarking is the same as the one in Yang et al. (2021). The model implements
attention layers that mimic the functionality of the encoder layers present in BERT Devlin et al.
(2018). Instead of token embeddings, the model generates sequence embeddings with the use of 1D
convolutions along the temporal dimension. These learnable sequence embeddings capture local
relationships and compress the MTS to a shorter length. The model uses a series of 1D convolutions to
reduce the temporal resolution from 4096 to 512 and then employs 4 stacked MHSA encoder layers
with 8 heads each and 64 dense units per head. The output is globally average pooled and fed to a
dense layer for classification.
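For concreteness, a minimal PyTorch sketch of an architecture matching this description is shown below. The layer sizes come from the text above, while kernel sizes, activation functions, and other details are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvMHSASketch(nn.Module):
    """Sketch of a ConvMHSA-style classifier: 1D convolutions reduce 4096
    time steps to 512 embeddings, followed by 4 MHSA encoder layers with
    8 heads (64 units per head), global average pooling, and a dense head."""
    def __init__(self, n_channels=23, d_model=512, n_classes=2):
        super().__init__()
        # three stride-2 convolutions: 4096 -> 2048 -> 1024 -> 512 steps
        self.embed = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                    # x: (batch, 23 channels, 4096 steps)
        z = self.embed(x)                    # (batch, d_model, 512)
        z = self.encoder(z.transpose(1, 2))  # (batch, 512, d_model)
        return self.head(z.mean(dim=1))      # global average pool over time
```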
Model         Task    Binary Acc.  Multiclass Acc.  Loss   Train Loss
ConvMHSA      Binary  76.0%        N/A              0.526  0.003
ConvMHSA      Multi   N/A          52.8%            2.168  1.097
ConvMHSA      Both    76.1%        56.3%            1.756  1.377
InceptionTime Binary  75.5%        N/A              0.569  0.214
InceptionTime Multi   N/A          54.1%            2.251  1.365
InceptionTime Both    74.0%        55.4%            1.667  1.038
MiniRocket    Binary  59.8%        N/A              0.667  0.395
MiniRocket    Multi   N/A          50.4%            1.800  0.424
Table 2: Validation metrics for model and task combinations, averaged across 5 folds. Train loss is
included to measure overfitting.
4.3 Results
A summary of results can be found in Table 2. Overall, models tend to overfit the data, especially
when classifying the specific maintenance issue. This is because certain classes contain as few as 75
examples. MiniRocket performed relatively poorly compared to the deep learning models. Comparing
the training loss and the validation loss, it appears that MiniRocket may have difficulty generalizing
in the multi-class case. It also seems to suffer from underfitting in the binary case.
These results suggest that deep learning methods may have an advantage for this type of problem.
We can test out-of-fold early detection performance by creating an early detection dataset, consisting
of only validation flights before maintenance in the intake gasket leak/damage class. Note that this
does not account for false positives, since such flights would not be taken in for maintenance. This test
is repeated for each fold using the best InceptionTime model, trained on both tasks simultaneously.
The results are summarized in Figure 2. Based on the 5-fold validation, one can conclude that there is
no statistically significant difference in recall across the number of flights before maintenance. This
suggests that the model is capable of predicting a part failure or the need for maintenance several
flights in advance, not only on the flight immediately before maintenance.
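Because every flight in this early-detection set truly precedes maintenance, recall per group is just the mean of the binary predictions. A sketch of that grouping, with assumed variable names:

```python
import numpy as np

def recall_by_flights_before(preds: np.ndarray, flights_before: np.ndarray):
    """Every flight in the early-detection set precedes maintenance, so recall
    at k flights before maintenance is the mean 0/1 prediction on that group."""
    return {int(k): float(np.mean(preds[flights_before == k]))
            for k in np.unique(flights_before)}
```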
5 Discussion
Flight Safety and Predictive Maintenance Systems The Cessna 172 is the most-produced aircraft,
at over 43,000 units, Goyer et al. (2022), and is widely used in general aviation, including flight
schools, recreation, and many other applications. However, general aviation is one of the most
dangerous forms of civil aviation according to Board (2011). The NGAFID Aviation Maintenance
dataset can be used for training and testing systems that detect maintenance issues early to improve
aircraft safety.
Figure 2: Min, mean, and max early detection accuracy of maintenance issues on validation flights.
Only out-of-fold pre-maintenance flights are included in this calculation, done using each fold’s best
trained model based on validation accuracy.
Class Imbalance Research One of the major challenges in this dataset is the under-representation
of many classes. Half of all flights are recorded after maintenance, and two damage classes, intake
gasket leak/damage and rocker cover leak/loose/damage, represent the majority of all damage-class
flights. This is so problematic that models trained to predict the specific damage class for a flight will
predict that the flight is post-maintenance, or that it belongs to one of the two majority classes, more
than 95% of the time. Future researchers can use this dataset to test methods that mitigate problems
caused by imbalanced training data. This is particularly relevant for prognostic health management,
where it is often easy to gather data of a system in normal operation, but extremely difficult to gather
abnormal operation data.
Time Series Classification Benchmarking Benchmark results suggest that the NGAFID Aviation
Maintenance dataset is particularly challenging for non-deep-learning methods. Of the two main
tasks, deep learning models seem to perform reasonably well on maintenance issue detection, but
all models perform quite poorly on maintenance issue classification. One possible explanation is
that the data is more similar to audio data, where deep methods perform well, and less similar to
the time series benchmarks of Bagnall et al. (2018), where methods such as MiniRocket and
HIVECOTE perform well. While we were unable to test HIVECOTE, we invite other researchers to
evaluate their methods on the NGAFID Aviation Maintenance dataset.
Anomaly Detection While this dataset does not contain localized annotations of failures, one could
separate flights into regular operation and compromised operation and attempt anomaly detection. As
noted in earlier sections, the authors of this paper believe that anomaly detection in this dataset would
be extremely difficult. This is because the data collected contains a significant degree of variance
caused by other factors, such as the pilot’s experience, the weather, the flight plan, the payload carried,
and so on.
Furthermore, anomaly detection works best in cases of acute part failure, where the failure of a part
is dramatic or has a dramatic impact on the overall system. As noted in earlier sections, chronic part
failure is more common and does not create a dramatic impact on the overall system. While such
failures make the aircraft unsafe to fly, they do not immediately cause the aircraft to crash or cease
operation.
An autoencoder system, such as An and Cho (2015), would struggle with this dataset. First, the
encoder must accommodate significant degrees of variance even when there are no anomalies. Second,
the encoder must detect the most subtle changes in flight characteristics, which may be easily
overshadowed by other sources of variance. Imagine a pilot performing a barrel roll with a leaky
gasket; the variance in data caused by the barrel roll would eclipse any variance caused by a leaky
gasket.
Transfer Learning This dataset also presents opportunities for transfer learning, Torrey and Shavlik
(2010). While this dataset uses sensor data from the Cessna 172 collected from a flight school, the
same sensors can be placed on other Cessna 172 aircraft for other uses. For example, the Cessna 172
can be used for passenger, cargo, or military purposes. It is also possible that these sensors can be
mounted in similar single engine aircraft. This would allow researchers to develop PHM systems
without needing to collect as much data from the other aircraft; the data from the other aircraft can be
supplemented with the NGAFID data.
The authors believe that any transfer learning for a military purpose will not directly endanger more
lives. This is because this dataset only contains flight and maintenance information, which can be
used to improve the safety and reliability of systems.
6 Future Research
Flight Event Detection Datasets Currently, the NGAFID web application provides detection
services for certain flight events, such as an aircraft stall event. Using the existing flight data and expert
labeling of said flight data, it is possible to create a time series localization dataset, with the goal of
detecting both the presence and timing of such events.
Unsupervised and Self-Supervised Datasets The NGAFID database contains more than 900,000
hours of flight data from various fleets and aircraft models. While most of the data is unlabeled, it
can still be used for self-supervised and unsupervised learning, such as contrastive representation
learning, Wang and Isola (2020), or masked data modeling, similar to Devlin et al. (2018).
Dataset Expansion The authors are actively working with the Federal Aviation Administration
and existing flight schools to obtain more maintenance data to expand this dataset to multiple
airframes and fleets of aircraft. This process is both slow and costly due to data governance and legal
issues. Given the value of this dataset even from a single fleet, the authors will provide the currently
available data for the Cessna 172 and work on a future data release to include additional airframes
and maintenance issues.
7 Conclusion
In this paper we present 31,177 hours of flight data across 28,935 flights, which occur relative to
2,111 unplanned maintenance events clustered into 36 types of maintenance issues. Each flight
records information from 23 different sensors every second on the Cessna 172 aircraft, during normal
operation of a flight school. Our paper makes the significant contribution of providing non-simulated,
compromised aircraft flight data, collected ethically at no additional danger to the pilots involved.
This dataset is made easily accessible at the links in Section 1.
The large amount of flight data involving a compromised aircraft is particularly valuable to prognostic
health management and predictive maintenance. Because the aircraft in question, the Cessna 172, is
often used in flight schools, recreation, agriculture, and more, this dataset can help create systems
that can greatly improve flight safety.
Finally, the NGAFID Aviation Maintenance dataset will be of particular interest to machine learning
researchers working with time series data. It is our aim, by releasing this dataset and identifying
areas for future research, to encourage further work in the detection and classification of maintenance
issues. We hope this in turn leads to improved future detection and classification algorithms.
References
Akhbardeh, F., Desell, T., and Zampieri, M. (2020). MaintNet: A collaborative open-source library for
predictive maintenance language resources. In Proceedings of the 28th International Conference
on Computational Linguistics: System Demonstrations, pages 7–11, Barcelona, Spain (Online).
International Committee on Computational Linguistics (ICCL).
An, J. and Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction
probability. Special Lecture on IE, 2(1):1–18.
Arena, F., Collotta, M., Luca, L., Ruggieri, M., and Termine, F. G. (2021). Predictive maintenance in
the automotive sector: A literature review. Mathematical and Computational Applications, 27(1):2.
Argyriou, A., Evgeniou, T., and Pontil, M. (2006). Multi-task feature learning. Advances in neural
information processing systems, 19.
Arias Chao, M., Kulkarni, C., Goebel, K., and Fink, O. (2021). Aircraft engine run-to-failure dataset
under real flight conditions for prognostics and diagnostics. Data, 6(1):5.
Assaf, R., Giurgiu, I., Bagehorn, F., and Schumann, A. (2019). MTEX-CNN: Multivariate time series
explanations for predictions with convolutional neural networks. In 2019 IEEE International
Conference on Data Mining (ICDM), pages 952–957. IEEE.
Bagnall, A., Dau, H. A., Lines, J., Flynn, M., Large, J., Bostrom, A., Southam, P., and Keogh,
E. (2018). The UEA multivariate time series classification archive, 2018. arXiv preprint
arXiv:1811.00075.
Board (2011). Review of U.S. civil aviation accidents: review of aircraft accident data, 2007-2009.
U.S. National Transportation Safety Board.
Celikmih, K., Inan, O., and Uguz, H. (2020). Failure prediction of aircraft equipment using machine
learning with a hybrid data preparation method. Scientific Programming, 2020.
Chu, E., Gorinevsky, D., and Boyd, S. (2010). Detecting aircraft performance anomalies from cruise
flight data. In AIAA Infotech@ Aerospace 2010, page 3307.
Dangut, M. D., Jennions, I. K., King, S., and Skaf, Z. (2022). A rare failure detection model for
aircraft predictive maintenance using a deep hybrid learning approach. Neural Computing and
Applications, pages 1–19.
Dempster, A., Petitjean, F., and Webb, G. I. (2020). ROCKET: Exceptionally fast and accurate time
series classification using random convolutional kernels. Data Mining and Knowledge Discovery,
34(5):1454–1495.
Dempster, A., Schmidt, D. F., and Webb, G. I. (2021). MiniRocket: A very fast (almost) deterministic
transform for time series classification. In Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining, pages 248–257.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers
for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.-A. (2019). Deep learning for
time series classification: a review. Data mining and knowledge discovery, 33(4):917–963.
Fawaz, H. I., Lucas, B., Forestier, G., Pelletier, C., Schmidt, D. F., Weber, J., Webb, G. I., Idoumghar,
L., Muller, P.-A., and Petitjean, F. (2020). InceptionTime: Finding AlexNet for time series classifica-
tion. Data Mining and Knowledge Discovery, 34(6):1936–1962.
Goldman, S. M., Fiedler, E. R., and King, R. E. (2002). General aviation maintenance-related
accidents: A review of ten years of ntsb data.
Goyer, B. I., Staff, F., Staff, I. G., and Flying (2022). Cessna 172: Still relevant today?
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In
European conference on computer vision, pages 630–645. Springer.
Johnson, J. M. and Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance.
Journal of Big Data, 6(1):1–54.
Karboviak, K., Clachar, S., Desell, T., Dusenbury, M., Hedrick, W., Higgins, J., Walberg, J., and Wild,
B. (2018). Classifying aircraft approach type in the national general aviation flight information
database. In International Conference on Computational Science, pages 456–469. Springer.
Karim, F., Majumdar, S., Darabi, H., and Chen, S. (2017). LSTM fully convolutional networks for
time series classification. IEEE access, 6:1662–1669.
Khadilkar, H. and Balakrishnan, H. (2012). Estimation of aircraft taxi fuel burn using flight data
recorder archives. Transportation Research Part D: Transport and Environment, 17(7):532–537.
Klein, V. (1989). Estimation of aircraft aerodynamic parameters from flight data. Progress in
Aerospace Sciences, 26(1):1–77.
Kukačka, J., Golkov, V., and Cremers, D. (2017). Regularization for deep learning: A taxonomy.
arXiv preprint arXiv:1710.10686.
Le, P. and Zuidema, W. (2016). Quantifying the vanishing gradient and long distance dependency
problem in recursive neural networks and recursive lstms. arXiv preprint arXiv:1603.00423.
Lines, J., Taylor, S., and Bagnall, A. (2018). Time series classification with HIVE-COTE: The hierarchical
vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery
from Data, 12(5).
Liu, Y., Frederick, D. K., DeCastro, J. A., Litt, J. S., and Chan, W. W. (2012). User’s guide for the
commercial modular aero-propulsion system simulation (c-mapss): Version 2. Technical report.
Löning, M., Bagnall, T., Király, F., Middlehurst, M., Ganesh, S., Oastler, G., Lines, J., ViktorKaz,
Walter, M., Mentel, L., RNKuhns, chrisholder, Tsaprounis, L., Owoseni, T., Rockenschaub, P.,
Khrapov, S., jesellier, danbartl, Bulatova, G., eenticott shell, Lovkush, Take, K., Meyer, S. M.,
AidenRushbrooke, Gilbert, C., Schäfer, P., oleskiewicz, Xu, Y.-X., Ansari, A., and Sakshi, A.
(2022). alan-turing-institute/sktime: v0.11.4.
Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., and Bagnall, A. (2021). HIVE-COTE 2.0:
a new meta ensemble for time series classification. Machine Learning, 110(11):3211–3243.
Oguiza, I. (2022). tsai - a state-of-the-art deep learning library for time series and sequential data.
Github.
Orsenigo, C. and Vercellis, C. (2010). Combining discrete svm and fixed cardinality warping distances
for multivariate time series classification. Pattern Recognition, 43(11):3787–3794.
Randle, W. E., Hall, C. A., and Vera-Morales, M. (2011). Improved range equation based on aircraft
flight data. Journal of Aircraft, 48(4):1291–1298.
Rezaeianjouybari, B. and Shang, Y. (2020). Deep learning for prognostics and health management:
State of the art, challenges, and opportunities. Measurement, 163:107929.
Seto, S., Zhang, W., and Zhou, Y. (2015). Multivariate time series classification using dynamic time
warping template selection for human activity recognition. In 2015 IEEE Symposium Series on
Computational Intelligence, pages 1399–1406. IEEE.
Shifaz, A., Pelletier, C., Petitjean, F., and Webb, G. I. (2020). TS-CHIEF: A scalable and accurate forest
algorithm for time series classification. Data Mining and Knowledge Discovery, 34(3):742–775.
Torrey, L. and Shavlik, J. (2010). Transfer learning. In Handbook of research on machine learning
applications and trends: algorithms, methods, and techniques, pages 242–264. IGI global.
Tsui, K. L., Chen, N., Zhou, Q., Hai, Y., and Wang, W. (2015). Prognostics and health management:
A review on data driven approaches. Mathematical Problems in Engineering, 2015.
Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., and Matassoni, M. (2013). The second
‘CHiME’ speech separation and recognition challenge: Datasets, tasks and baselines. In 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, pages 126–130.
Wang, T. and Isola, P. (2020). Understanding contrastive representation learning through alignment
and uniformity on the hypersphere. In III, H. D. and Singh, A., editors, Proceedings of the 37th
International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning
Research, pages 9929–9939. PMLR.
Wang, Z., Yan, W., and Oates, T. (2017). Time series classification from scratch with deep neural
networks: A strong baseline. In 2017 International joint conference on neural networks (IJCNN),
pages 1578–1585. IEEE.
Yang, H., LaBella, A., and Desell, T. (2021). Predictive maintenance for general aviation using
convolutional transformers. arXiv preprint arXiv:2110.03757.
Yildirim, M. T. and Kurt, B. (2018). Aircraft gas turbine engine health monitoring system by real
flight data. International Journal of Aerospace Engineering, 2018.
A Appendix
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Section 3
(c) Did you discuss any potential negative societal impacts of your work? [No] The data
is collected from purpose-built sensors on aircraft. We believe that this data is so
specialized as to be only applicable in the aviation industry and only to single-engine
aircraft, which very few people interact with regularly.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [Yes] As a repository
with links to Colab Notebooks
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] See the code in the repo
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] But only for the early detection in Section 4.4. The overall
results in Section 4.3 only report mean values. However, you can reproduce results as the splits
are available in the dataset.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] See section 4.2.5
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [No] See data sheet supplementary material
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [Yes] Section 3.3
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]