
A Large-Scale Annotated Multivariate Time Series

Aviation Maintenance Dataset from the NGAFID

Hong Yang Travis Desell


Rochester Institute of Technology Rochester Institute of Technology
Rochester, NY 14623 Rochester, NY 14623
[email protected] [email protected]
arXiv:2210.07317v1 [cs.LG] 13 Oct 2022

Abstract

This paper presents the largest publicly available, non-simulated, fleet-wide aircraft
flight recording and maintenance log data for use in predicting part failure and
maintenance need. We present 31,177 hours of flight data across 28,935 flights,
which occur relative to 2,111 unplanned maintenance events clustered into 36 types
of maintenance issues. Flights are annotated as before or after maintenance, with
some flights occurring on the day of maintenance. Collecting data to evaluate
predictive maintenance systems is challenging because it is difficult, dangerous,
and unethical to generate data from compromised aircraft. To overcome this, we
use the National General Aviation Flight Information Database (NGAFID), which
contains flights recorded during regular operation of aircraft, and maintenance logs
to construct a part failure dataset. We use a novel framing of Remaining Useful
Life (RUL) prediction and consider the probability that the RUL of a part is greater
than 2 days. Unlike previous datasets generated with simulations or in laboratory
settings, the NGAFID Aviation Maintenance Dataset contains real flight records
and maintenance logs from different seasons, weather conditions, pilots, and flight
patterns. Additionally, we provide Python code to easily download the dataset
and a Colab environment to reproduce our benchmarks on three different models.
Our dataset presents a difficult challenge for machine learning researchers and a
valuable opportunity to test and develop prognostic health management methods.

1 Introduction
Maintenance related issues are a notable risk in the operation of single-engine fixed-wing aircraft
Goldman et al. (2002) for private use, known as General Aviation (GA). With improvements in data
collection and processing capabilities, there is now an opportunity to indirectly monitor the condition
of a variety of components in GA aircraft for the purposes of Predictive Maintenance (PM).
The domain of prognostics and health management (PHM) has shifted towards the use of more
data driven approaches Tsui et al. (2015), sometimes also involving deep learning methods. These
methods aim to predict the RUL, detect faults, or monitor the condition of a system and its parts.
However, it is particularly difficult to collect data for these purposes. Rezaeianjouybari and Shang
(2020) state that the data collection process is often too time consuming and too perfect (the data
collected in laboratory conditions may not translate well to the real world). It is clear that there is
a lack of high quality, real world data, yet there is a great need for such data to evaluate new PHM
methodologies.
Existing datasets for PHM often use simulated data Liu et al. (2012) or model systems in controlled
environments, such as the power plant data in the 2015 Prognostics and Health Management Society
Data Challenge. However, PHM methodologies can be extended to real world systems in largely
uncontrolled environments, such as cars or aircraft. As vehicles collect more data, see Arena et al.

Preprint. Under review.


(2021), there is an opportunity to combine said data with unplanned maintenance records to create a
novel RUL estimation dataset for predictive maintenance.
GA flight data is particularly challenging for PHM due to the nature of the aircraft sensors. Unlike
commercial aircraft, GA aircraft only possess basic sensors for monitoring critical systems, such
as the engine and battery, and flight instruments, for measuring air speed and altitude. A predictive
model relying on such basic sensors for predicting RUL of specific parts must be both highly sensitive
to the most subtle of changes and highly robust to noise. This is because a significant portion of the
variance in flight data is explained by pilot action and not the condition of the airplane’s parts. This
poses a difficult problem for conventional approaches, such as autoencoders.
In this paper we present the NGAFID Aviation Maintenance Dataset for use in binary RUL estimation.
This dataset is constructed using flight records and unplanned maintenance logs of a single flight
school over multiple years. For each unplanned breakdown of a part, which leads to unplanned
maintenance, there are multiple flights in the days before and after the maintenance event. From here, we
create two groups of flight data, one with parts that are going to break within x days and one with
new parts, which are not expected to break within x days. This challenging PHM problem uses one
of the largest real-world PHM datasets and it also poses a particularly difficult machine learning
problem for time series classification and time series anomaly detection.

1.1 Our Contributions

Data: https://doi.org/10.5281/zenodo.6624956 and
https://www.kaggle.com/datasets/hooong/aviation-maintenance-dataset-from-the-ngafid A dask dataframe
containing 31,177 hours of flight data across 28,935 flights, with a header csv that describes each
flight and the associated maintenance file. The data was collected automatically by a flight school
after each flight and uploaded to the NGAFID database. Flights occurred in a variety of seasons and
weather conditions. All flights were flown with Cessna 172 aircraft. 23 sensors record data every
second, resulting in more than 100 million rows of flight data with a total size of 4.3 GB.
Benchmarks: https://github.com/hyang0129/NGAFIDDATASET A repository containing
helper code to automatically download and process the dataset and 2 Colab notebooks for replicating
the benchmark experiments. Anyone can run the benchmark experiments with one click using
the Colab notebooks in a web browser, with a free Linux environment including GPUs and TPUs
provided by Google. This demonstrates how to use the dataset and replicate experiments, regardless
of the replicator’s hardware and software limitations.
The above files are licensed under GNU General Public License V3.0.
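
As a rough sanity check on the figures quoted above, the row count follows from the total recording time at one sample per second. The sketch below uses only numbers stated in this section; the variable names are illustrative, and because the hour count is itself rounded, the result should be read as approximate.

```python
# Figures quoted in the dataset description above.
TOTAL_HOURS = 31_177
NUM_FLIGHTS = 28_935
SAMPLES_PER_SECOND = 1  # each sensor is logged once per second

# One row per second of recorded flight time.
total_rows = TOTAL_HOURS * 3600 * SAMPLES_PER_SECOND

# Mean flight length in rows (equivalently, in seconds).
mean_rows_per_flight = total_rows / NUM_FLIGHTS

print(f"{total_rows:,} rows in total")
print(f"{mean_rows_per_flight:.0f} rows per flight on average")
```

This is consistent with the stated "more than 100 million rows" and implies an average flight of a little over an hour.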

2 Related Work
2.1 Aircraft and Aircraft Maintenance Datasets

Publicly available aircraft maintenance datasets often use simulated data rather than data collected
from real life events. Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) Liu
et al. (2012) is a commonly used simulation program for aircraft engines. These simulations are
sometimes supplemented by real flight conditions, e.g. Arias Chao et al. (2021), where flight data is
used as the input for the simulation. However, these approaches rely on synthetic data, which may
not reflect the noisy nature of data collection. There are papers that reference real data, but they do
not publicly release their datasets, e.g. Celikmih et al. (2020) and Dangut et al. (2022). The task of
detecting anomalies from flight data was investigated by Chu et al. (2010); however, the dataset was
generated from a flight simulator. Yildirim and Kurt (2018) collects real world data from aircraft, but
only during normal operation of the aircraft and not when there is a fault. A common difficulty in
collecting real world failure data is the associated cost and safety risk; it would be unethical to ask
pilots to fly faulty aircraft. Yang et al. (2021) presented a smaller version of this dataset for aviation
maintenance.
Aircraft flight data has also been used for other tasks, aside from maintenance prediction. Klein
(1989) used real data to estimate aerodynamic properties of aircraft. Khadilkar and Balakrishnan
(2012) used aircraft flight recordings to estimate fuel usage during the taxiing stage. Aircraft flight
data can also be used to improve estimation of the remaining range for the aircraft, Randle et al.

(2011). In most PHM datasets, Liu et al. (2012), the goal is to predict an estimate or probability
density function of the RUL, prior to a fault occurring. Prior to this work, there were no publicly
available fleet-wide maintenance records of this size and scope coupled with flight data recordings in
both fault free and faulty states.

2.2 Time Series Datasets

There are many time series classification datasets, such as Bagnall et al. (2018). These datasets are
used to evaluate both deep learning methods, such as Inception Time, Fawaz et al. (2020), and non
deep methods, such as Mini Rocket, Dempster et al. (2021) and HIVE-COTE, Lines et al. (2018).
Time series classification could also include audio datasets, such as Vincent et al. (2013), but such
datasets belong in their own domain of audio processing. As shown by Bagnall et al. (2018), a
collection of time series datasets will contain very diverse types of data, such as motion sensor data,
electroencephalogram data, traffic data and more. Few time series datasets exist that contain as many
training examples with so many data points per training example.

2.3 Classification Methods

Several methods have been developed for MTS classification, for a review see Fawaz et al. (2019).
Notable non-deep learning methods include distance based k-nearest neighbors by Orsenigo and
Vercellis (2010) and Dynamic Time Warping KNN by Seto et al. (2015). For deep learning methods,
well performing MTS classifiers tend to utilize some combination of Recurrent Neural Network
(RNN) and Convolutional Neural Network (CNN) methods, Karim et al. (2017), or CNN only
methods Wang et al. (2017). However, RNN methods struggle with long sequences due to the
vanishing gradient problem, as mentioned by Le and Zuidema (2016). CNN only methods can
provide strong results for MTS classification, as shown by Assaf et al. (2019), but may struggle
when relevant features are temporally sparse and related. Yang et al. (2021) also used multi-headed
self-attention methods.

3 Dataset Description

3.1 Data Acquisition

The full dataset is constructed using flight recordings from the NGAFID database and maintenance
records from MaintNet. Each maintenance record describes the type of maintenance performed on
an aircraft and the time period when maintenance occurred. From this information, we extract that
aircraft’s flights occurring before and after the maintenance period. Only the first 5 flights before and
after and any flights during maintenance are extracted. All maintenance activities were unplanned and
done on request. However, when issues are detected, FAA policy forbids the plane from flying. This
means that issues are always detected sometime during or after the first flight before maintenance.
Each flight record contains readings from 23 different sensors every second. A description of the
sensors can be found in Table 1. All flight data recordings come from the same aircraft model, the
Cessna 172.
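
The extraction rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, its arguments, and the reading of "first 5 flights" as the five flights nearest to the maintenance window on each side are all assumptions.

```python
from datetime import datetime

def select_flights(flight_times, maint_start, maint_end, k=5):
    """Split one aircraft's flights around a maintenance window.

    Returns (before, during, after), keeping only the k flights nearest
    to the window on each side. All names here are hypothetical.
    """
    ordered = sorted(flight_times)
    before = [t for t in ordered if t < maint_start][-k:]
    during = [t for t in ordered if maint_start <= t <= maint_end]
    after = [t for t in ordered if t > maint_end][:k]
    return before, during, after

# Usage with made-up timestamps: one flight per day for 20 days.
flights = [datetime(2020, 1, day) for day in range(1, 21)]
before, during, after = select_flights(
    flights, datetime(2020, 1, 10), datetime(2020, 1, 11)
)
print(len(before), len(during), len(after))
```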
The NGAFID serves as a repository for general aviation flight data, with a web portal for viewing
and tracking flight safety events for individual pilots as well as for fleets of aircraft Karboviak et al.
(2018). The NGAFID currently contains over 900,000 hours of flight data generated by over 780,000
flights by 12 different types of aircraft, provided by 65 fleets and individual users, resulting in over
3.15 billion per second flight data records across 103 potential flight data recorder parameters. Five
years of textual maintenance records from a fleet which provided data to the NGAFID have been
clustered by maintenance issue type and then validated by domain experts for the MaintNet project,
see Akhbardeh et al. (2020).
MaintNet’s maintenance record logbook data was clustered into 36 different maintenance issue types.
The count of flights per issue type is located in Table 3 in Appendix A. Although some maintenance
issues occur very rarely, all maintenance issues with flight data are included in the full dataset. It
is important to note that the NGAFID collects real flight data from aircraft flying with potentially
faulty parts (as is the case for any real world fleet of aircraft). This is because individual components
may fail without causing catastrophic failure of the aircraft during regular operation. The collection

Column Name   Description
volt1         Main electrical system bus voltage (alternators and main battery)
volt2         Essential bus (standby battery) bus voltage
amp1          Ammeter on the main battery (+ charging, - discharging)
amp2          Ammeter on the standby battery (+ charging, - discharging)
FQtyL         Fuel quantity left
FQtyR         Fuel quantity right
E1 FFlow      Engine fuel flow rate
E1 OilT       Engine oil temperature
E1 OilP       Engine oil pressure
E1 RPM        Engine rotations per minute
E1 CHT1       1st cylinder head temperature
E1 CHT2       2nd cylinder head temperature
E1 CHT3       3rd cylinder head temperature
E1 CHT4       4th cylinder head temperature
E1 EGT1       1st exhaust gas temperature
E1 EGT2       2nd exhaust gas temperature
E1 EGT3       3rd exhaust gas temperature
E1 EGT4       4th exhaust gas temperature
OAT           Outside air temperature
IAS           Indicated air speed
VSpd          Vertical speed
NormAc        Normal acceleration
AltMSL        Altitude above mean sea level
Table 1: Description of the data collected by aircraft sensors

of this data poses no additional safety risk to the pilots because data collection occurs for all flights
performed.

3.2 Dataset Subset for Benchmarking

Attempting to train a model on the full dataset is particularly difficult. This is because of two main
issues. The first is the under representation of certain classes. For example, the spark plug related
issue contains only 15 flights. These classes are included in the dataset to allow for future researchers
to train models that address the class imbalance issue, as class imbalance is an important area of
machine learning research Johnson and Khoshgoftaar (2019). The second is that each flight contains
a significant amount of data, which makes it hard to regularize a model trained on the data. In the
spark plug example, each of the 15 flights contains more than 40,000 data points, which would be
used to predict a single class out of 36. Regularization is another important research area of machine
learning, Kukačka et al. (2017), but it is beyond the scope of this paper. These two challenges are
inevitable when collecting real world data, as it is impossible to guarantee that each part in a real
world aircraft fails at the same rate.
A subset of the full dataset is used in this paper to benchmark time series classification methods and
provide a baseline. Since addressing class imbalance and regularization are beyond the scope of this
paper, the subset is designed to minimize the impact of those two issues. By limiting the subset to only
classes that have at least 50 flights, it removes flights where generalization is a problem and reduces
the impact of class imbalance. The subset also defines the binary RUL problem as $P(\mathrm{RUL} > x)$
with $x$ as 2 days. We label flights within 2 days before maintenance as $P(\mathrm{RUL} > x) = 0$ and flights
after 2 days after maintenance as $P(\mathrm{RUL} > x) = 1$, as such flights contain brand new parts.
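
The labelling rule can be sketched as a small function. This is only an illustration under stated assumptions: the function name and the `position` and `days_away` arguments are hypothetical, and flights outside the window or during maintenance are returned as `None` to mark them as excluded from the subset.

```python
from typing import Optional

def binary_rul_label(position: str, days_away: float, x: float = 2.0) -> Optional[int]:
    """Return the binary target for P(RUL > x), or None if the flight is excluded.

    position  -- "before", "during", or "after" the maintenance period
    days_away -- days between the flight and the maintenance period
    """
    if position not in ("before", "during", "after"):
        raise ValueError(f"unknown position: {position!r}")
    if position == "during" or days_away > x:
        return None  # not part of the benchmark subset
    # Before maintenance the part is about to fail; after, it is new.
    return 0 if position == "before" else 1

print(binary_rul_label("before", 1.0))  # part fails within x days
print(binary_rul_label("after", 0.5))   # freshly repaired part
```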
The data subset is defined as all flights within two days of the maintenance period, excluding any
flights during the maintenance period. Furthermore, any flights belonging to classes that would have
fewer than 50 flights before or after the maintenance period are also excluded. This leaves us with a

Figure 1: Exhaust Gas Temperature before and after Maintenance for Intake Gasket Issue. This shows
that the sensor readings before an issue occurs and after it is fixed are largely the same. Notice that the
EGT values are mostly in line with each other, but diverge at certain times. It is not obvious where
an anomaly occurs, nor is it obvious whether or not there is an issue. The difference in temperature
values indicates a difference in flight plan as well.

total of 5844 flights after maintenance and 5602 flights before maintenance in 19 different classes of
maintenance issues. The count of flights per issue type is located in Table 3 in Appendix A.

3.3 Data Preprocessing

The full dataset is provided without any preprocessing and contains the full flight data for each
flight. Some sensors recorded NaN values at certain time steps. This can be caused by many factors,
including but not limited to initialization of the sensor at the beginning of the flight, failure of the
sensor during flight, and failure of the recording system during flight. Approximately 1% of all
datapoints contain NaN values. We decided to present future researchers with NaN values as replacing
such values with another, such as 0, would change the meaning.
For training a model in our benchmark experiments, the data subset is scaled. All values are scaled
via MinMax, with a scaled minimum value of 0 and a scaled maximum value of 1. Minimums
and maximums for each channel of data were calculated using all of the data. Please note that
the full dataset is not distributed with its values scaled, which allows for examination of different
normalization techniques by users of the dataset.
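
A minimal sketch of per-channel MinMax scaling that tolerates the NaN values described above is given below. It assumes the data is held as a NumPy array of shape (rows, channels); the function names are illustrative, and leaving NaN entries as NaN after scaling is a choice made here, not something the paper specifies.

```python
import numpy as np

def fit_min_max(data):
    """Per-channel minimum and maximum, ignoring NaN readings."""
    return np.nanmin(data, axis=0), np.nanmax(data, axis=0)

def apply_min_max(data, lo, hi):
    """Scale each channel to [0, 1]; NaN inputs remain NaN."""
    # Guard against constant channels, which would divide by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (data - lo) / span

# Toy example: 3 time steps, 2 channels, one missing reading.
data = np.array([[1.0, 10.0],
                 [3.0, np.nan],
                 [5.0, 30.0]])
lo, hi = fit_min_max(data)
scaled = apply_min_max(data, lo, hi)
print(scaled)
```

Note that fitting the minimums and maximums on all of the data, as described above, means validation folds contribute to the scaling statistics; users who want a stricter protocol could fit on training folds only.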

3.4 Visualization

Readings from 4 sensors from two flights are provided in Figure 1. Only the output from 4 sensors is
included to keep the graph readable. Even with only 4 sensors, it should be clear that determining the
probability of part failure for a human would be extremely difficult. Even with domain knowledge, it
would only be possible to detect that an issue has occurred, but not that an issue is going to occur.

3.5 Data Privacy

To ensure privacy, information regarding the serial number of the aircraft of the flight and the date
of the flight have been removed from the publicly released dataset. Latitude and longitude are also
removed from the flight. The specific text of the maintenance logs is also withheld.

3.6 Labels

Labels are assigned to flights based on the date of the flight and the date of the maintenance. As
per the Federal Aviation Administration regulations, aircraft are only permitted to operate in a safe
manner. This means that maintenance issues and part failures that occurred prior to the maintenance
date will be fixed after the maintenance date.

Part failures in the aircraft can be described as acute or chronic. Acute part failure describes a
complete failure of a part due to an unexpected event. For example, if the aircraft collides with a
small object, such as a bird, this would create data representing an acute failure and an instantaneous
change in the remaining useful life. Chronic part failure describes gradual wear and tear that renders
a part unsafe for flight, before scheduled maintenance can replace the part. For example, an intake
gasket may wear down more quickly than expected, leading to a leakage that negatively impacts the
safety of the aircraft.
It should be noted that dramatic decreases in remaining useful life are rare. Based on analysis of the
intake gasket leak/damage class, 9085 cases were described as leaking and only 15 cases
were described as torn. This may negatively impact anomaly detection methods and will be discussed
in a later section.
Because all maintenance records are related to unplanned maintenance, it is clear that the related
parts reached the end of their remaining useful life prematurely. These unplanned maintenance events
occurred outside of scheduled maintenance and can be considered anomalies in the expected wear and
tear of the associated parts.

3.7 Problem Structure

3.7.1 Detection of Maintenance Issues

The detection of maintenance issues and their associated part failure is quite challenging. We define
this as $\min_{i \in I} P(\mathrm{RUL}_i > x)$, where $I$ represents the set of all parts and $\mathrm{RUL}_i$ is the remaining
useful life of part $i$.
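
To make the definition concrete, the sketch below takes per-part estimates of P(RUL_i > x) and returns the minimum, together with the part that attains it. The function name and the part names are hypothetical and serve only to illustrate the formula.

```python
def detection_score(p_rul_gt_x):
    """Return (part, probability) for the part least likely to outlast x days.

    p_rul_gt_x maps a part name to an estimate of P(RUL_i > x).
    """
    return min(p_rul_gt_x.items(), key=lambda item: item[1])

# Hypothetical per-part estimates for a single flight.
estimates = {"intake gasket": 0.40, "rocker cover": 0.90, "baffle": 0.70}
part, prob = detection_score(estimates)
print(part, prob)
```

A low minimum indicates that at least one part is likely to need maintenance within x days, which is what the detection task asks for.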
While this problem may resemble anomaly detection, it is important to note that a significant portion
of variance in the data is caused by outside factors. Pilot actions can dramatically alter readings for
every sensor on the aircraft, causing no two flights to be the same. Weather conditions, which may
vary with altitude, are also inconsistent. For these reasons, an anomaly detection approach may find
a large number of anomalous sections on any given flight. These anomalies may be caused by a
maintenance issue or irregular flight activities.
It should be clear that the data collected by aircraft sensors pose a greater challenge than sensors
on laboratory or industrial assets. In those situations, the environment is much more controlled, as
opposed to the environment of an aircraft in flight. They may also utilize dedicated sensors designed
to collect data for the purpose of detecting specific issues. This is not the case for our dataset, as the
sensors collect general flight data, which were not originally designed to collect data for maintenance
issue prediction and binary RUL estimation.

3.7.2 Classification of Maintenance Issues

An extension of the previous problem is to also classify the type of maintenance issue based on the
maintenance record left by the mechanic. This problem is significantly more difficult for two reasons.
First is the class imbalance, with some classes having thousands of flights and others having fewer than
one hundred. Second is the similarity in how maintenance issues affect flight characteristics. This is
because two different issues may cause very similar changes in flight characteristics.
Formally, this is defined as $\forall i \in I \, [P(\mathrm{RUL}_i > x)]$. This is significantly more difficult than detection
of the maintenance issue.

3.7.3 Hierarchical Classification

In addition to the class labels for specific maintenance issues, this dataset is provided with a simple
hierarchy for maintenance issues. There are 5 groupings of maintenance issues: engine, baffle, oil,
cylinder, and other. Argyriou et al. (2006) has shown that hierarchies can provide benefits in terms of
regularization, as they improve the information provided by the labels. These hierarchies are included
for potential future research.

4 Benchmark Experiments

4.1 Dataset and Task Definition

Three different tasks are defined and all three tasks use the same subset of data. The first task is
maintenance issue detection. Here, we assign after-maintenance flights a negative label and
before-maintenance flights a positive label. The second task is maintenance issue classification,
where we assign after-maintenance flights class 0 and before-maintenance flights their
maintenance issue class (from 1 to 19). The third is combined detection and classification, where a
network is trained on both tasks simultaneously.
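
The targets for the three tasks can be sketched as below. The function name and arguments are illustrative assumptions; the combined task simply uses both returned values at once.

```python
NUM_ISSUE_CLASSES = 19  # issue classes in the benchmark subset

def make_targets(is_before_maintenance, issue_class):
    """Return (detection_target, classification_target) for one flight.

    issue_class is the maintenance issue index (1 to 19) of the event the
    flight is associated with; it is ignored for after-maintenance flights.
    """
    if not is_before_maintenance:
        return 0, 0  # negative detection label, class 0
    if not 1 <= issue_class <= NUM_ISSUE_CLASSES:
        raise ValueError("issue_class must be between 1 and 19")
    return 1, issue_class

print(make_targets(True, 7))
print(make_targets(False, 7))
```

Note that a positive detection label corresponds to a flight before maintenance, that is, a flight for which P(RUL > x) is labelled 0.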

4.2 Models

4.2.1 Mini Rocket


Mini Rocket Dempster et al. (2021) is an improvement over the original Rocket classifier, Dempster
et al. (2020). While the original Rocket transforms the input time series using a large number of
random convolutional kernels and trains a linear classifier on the transformed features, Mini Rocket
improves upon this by using a smaller, fixed set of kernels. This method has performed quite well on
many time series datasets and trains much more quickly than deep learning methods. We evaluate a
GPU implementation of Mini Rocket by Oguiza (2022).
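
The idea behind Rocket-style transforms can be illustrated with a heavily simplified sketch. The code below is not MiniRocket and not the implementation benchmarked here: it handles a single univariate series, omits dilation and padding, and the kernel lengths, weight distribution, and bias range are assumptions chosen for illustration. Each random kernel contributes two features, the maximum activation and the proportion of positive values (PPV); these features would then be passed to a linear classifier.

```python
import numpy as np

def random_kernel_features(series, num_kernels=100, seed=0):
    """Rocket-style features for one univariate series (simplified)."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    features = np.empty(2 * num_kernels)
    for k in range(num_kernels):
        length = rng.choice([7, 9, 11])      # random kernel length
        weights = rng.normal(size=length)    # random kernel weights
        weights -= weights.mean()            # centre the weights
        bias = rng.uniform(-1.0, 1.0)
        # Slide the kernel over the series without padding.
        out = np.correlate(series, weights, mode="valid") + bias
        features[2 * k] = out.max()              # max activation
        features[2 * k + 1] = (out > 0).mean()   # PPV
    return features

# Toy input: a sine wave standing in for one sensor channel.
signal = np.sin(np.linspace(0.0, 20.0, 300))
print(random_kernel_features(signal, num_kernels=10).shape)
```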

4.2.2 HIVECOTEv2
We attempted to test HIVECOTEv2 Middlehurst et al. (2021) on the NGAFID maintenance dataset
using the sktime implementation by Löning et al. (2022). However, the training time for the model
exceeded the maximum time allocated for Google Colab instances (24hrs). Shifaz et al. (2020) notes
that HIVECOTE scales polynomially with data, which could explain why it could not train in time on
the NGAFID maintenance dataset. Not only is the NGAFID maintenance dataset large, it also has
many time steps and many channels. No results for HIVECOTEv2 are included.

4.2.3 Inception Time


Convolutional networks are very popular in the domain of computer vision, but can also perform
well in time series classification. Fawaz et al. (2020) proposed the InceptionTime model as an
ensemble of five Inception models. Each Inception model is composed of two residual blocks, each
containing three Inception modules, followed at the very end by a global average pooling layer and a
dense classification head layer. Inception modules contain convolutions of various kernel sizes and
a bottleneck layer. The residual blocks help mitigate the vanishing gradient issue by allowing for
direct gradient flow He et al. (2016). Fawaz et al. (2020) noted that ensembling was necessary due to
the high standard deviation in accuracy of single Inception models and the small size of time series
datasets.
For this study, we evaluate the Inception model without ensembling, so references to InceptionTime
refer to just the Inception model, without ensembling. This is because the compute cost is reduced
five-fold allowing for a more efficient use of limited resources.

4.2.4 ConvMHSA
Multi-Headed Self Attention (MHSA) modules were popularized by Devlin et al. (2018) for usage in
Natural Language Processing and by Dosovitskiy et al. (2020) for Computer Vision. The ConvMHSA
model used for benchmarking is the same as the one in Yang et al. (2021). The model implements
attention layers that mimic the functionality of the encoder layers present in BERT Devlin et al.
(2018). Instead of token embeddings, the model generates sequence embeddings with the use of 1D
convolutions along the temporal dimension. These learnable sequence embeddings capture local
relationships and compress the MTS to a shorter length. It uses a series of 1D convolutions to
reduce the temporal resolution from 4096 to 512 and then employs 4 stacked MHSA encoder layers
with 8 heads each and 64 dense units per head. The output is globally average pooled and fed to a
dense layer for classification.

Model          Task    Binary Acc.  Multiclass Acc.  Loss   Train Loss
ConvMHSA       Binary  76.0%        N/A              0.526  0.003
ConvMHSA       Multi   N/A          52.8%            2.168  1.097
ConvMHSA       Both    76.1%        56.3%            1.756  1.377
InceptionTime  Binary  75.5%        N/A              0.569  0.214
InceptionTime  Multi   N/A          54.1%            2.251  1.365
InceptionTime  Both    74.0%        55.4%            1.667  1.038
MiniRocket     Binary  59.8%        N/A              0.667  0.395
MiniRocket     Multi   N/A          50.4%            1.800  0.424
Table 2: Validation metrics for model and task combinations, averaged across 5 folds. Train loss is
included to measure overfitting

4.2.5 Model Configuration


All results reported were generated using a Google Colab instance with a v2-8 TPU. All models
were trained for 200 epochs with 200 steps per epoch using a batch size of 128, with 5-fold cross
validation. Flights are truncated to the last 4096 time steps and padded to be of the same size. Models
used an Adam optimizer with a learning rate of 3e-5 for ConvMHSA and 1e-4 for InceptionTime.
MiniRocket was trained using a Google Colab instance with an Nvidia V100 GPU. Models were
trained for 200 epochs with 143 steps and a batch size of 64, with 5-fold cross validation. Flights
are truncated to the last 4096 time steps and padded to be of the same size. Models used an Adam
optimizer with a learning rate of 2.5e-5.
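
The truncation and padding step can be sketched as follows. The text does not state on which side flights are padded or with what value, so padding at the start with zeros is an assumption made here, as are the function and argument names.

```python
import numpy as np

def last_n_steps(flight, n=4096, pad_value=0.0):
    """Keep the last n time steps of a (time, channels) array.

    Shorter flights are padded at the start so every output has n rows.
    """
    flight = np.asarray(flight, dtype=float)
    tail = flight[-n:]
    missing = n - tail.shape[0]
    if missing > 0:
        pad = np.full((missing, flight.shape[1]), pad_value)
        tail = np.concatenate([pad, tail], axis=0)
    return tail

# Toy flights with 23 channels: one shorter and one longer than 4096 steps.
print(last_n_steps(np.ones((1000, 23))).shape)
print(last_n_steps(np.ones((6000, 23))).shape)
```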
Notebooks used to train the models are available on Github and can be run in Colab through a web
browser. Colab notebooks can be exported as regular Jupyter notebooks to run the experiments locally
on a GPU. See https://github.com/hyang0129/NGAFIDDATASET for benchmark instructions.

4.3 Results

A summary of results can be found in Table 2. Overall, models tend to overfit the data, but especially
so on classifying the specific maintenance issue. This is because certain classes contain only 75
examples. MiniRocket performed relatively poorly compared to the deep learning models. If we
compare the training loss and the validation loss, it appears that MiniRocket may have difficulty in
generalization in the multi class case. It also seems to suffer from under fitting in the binary case.
These results suggest that deep learning methods may have an advantage for this type of problem.

4.4 Early Detection Testing

We can test out of fold early detection performance by creating an early detection dataset, consisting
of only validation flights before maintenance in the intake gasket leak/damage class. Note that this
does not account for false positives, since they would not be taken in for maintenance. This test is
repeated for each fold using the best InceptionTime model, trained on both tasks simultaneously. The
results are summarized in Figure 2. Based on the 5 fold validation, one can conclude that there is
no statistically significant difference in recall across the number of flights before maintenance. This
suggests that the model is capable of predicting a part failure or the need for maintenance before the
problem arises in the flight immediately before maintenance.
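
The per-flight recall computation can be sketched as below. Because the early detection set contains only flights before maintenance, every flight is a true positive case, so recall at a given position reduces to the fraction of flights predicted positive. The record layout and function name are assumptions made for illustration.

```python
from collections import defaultdict

def recall_by_flight_index(records):
    """Recall grouped by the number of flights before maintenance.

    records is an iterable of (flights_before_maintenance, predicted_positive)
    pairs, all drawn from before-maintenance flights.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for index, predicted_positive in records:
        totals[index] += 1
        hits[index] += int(predicted_positive)
    return {index: hits[index] / totals[index] for index in sorted(totals)}

# Made-up predictions for flights 1, 2 and 3 before maintenance.
example = [(1, True), (1, False), (2, True), (2, True), (3, False)]
print(recall_by_flight_index(example))
```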

5 Discussion of Dataset Research Potential

Flight Safety and Predictive Maintenance Systems The Cessna 172 is the most produced aircraft
at over 43,000 units Goyer et al. (2022) and is widely used in general aviation. This includes flight
schools, recreation, and many other applications. However, general aviation is one of the most
dangerous forms of civil aviation according to Board (2011). The NGAFID Aviation Maintenance
dataset can be used for training and testing systems that detect maintenance issues early to improve
aircraft safety.

Figure 2: Min, mean, and max early detection accuracy of maintenance issues on validation flights.
Only out of fold pre maintenance flights are included in this calculation, done using each fold’s best
trained model based on validation accuracy.

Class Imbalance Research One of the major challenges in this dataset is the under-representation
of many classes. Half of all flights are recorded after maintenance and two damage classes, intake
gasket leak/damage and rocker cover leak/loose/damage, represent the majority of all damage classes.
This is so problematic that models trained to predict the specific damage class for a flight will predict
that the flight is post maintenance or that the flight belongs to one of the two majority classes more
than 95% of the time. Future researchers can use this dataset to test methods that mitigate problems
caused by imbalanced training data. This is particularly relevant for prognostic health management,
where it is often easy to gather data of a system in normal operation, but extremely difficult to gather
abnormal operation data.
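One common mitigation is to weight the training loss inversely to class frequency. The sketch below is illustrative only and is not a method evaluated in the paper. It uses four subset counts copied from Table 3 and the weighting `total / (n_classes * count)`, which mirrors the "balanced" heuristic in scikit-learn; the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


# Four classes taken from the benchmark subset column of Table 3.
subset_counts = {
    "intake gasket leak/damage": 2098,
    "rocker cover leak/loose/damage": 1024,
    "oil cooler need maintenance": 75,
    "flights recorded after maintenance": 5844,
}

weights = balanced_class_weights(subset_counts)
for name, w in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name:40s} {w:8.3f}")
```

With these weights, every class contributes the same total weight to the loss, so a rare class such as the oil cooler class is no longer drowned out by post-maintenance flights. Whether this improves classification on this dataset is an open question.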

Contrastive Representation Learning Contrastive representation learning is a technique that
attempts to learn a representation of the input data in order to provide a benefit to a downstream task,
as described by Wang and Isola (2020). While some flights were excluded from the subset of data
used for benchmarking, those flights could be used to learn representations of flight data, which may
be useful for improving performance on the downstream detection and classification task.
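As a rough illustration of how learned flight embeddings could be assessed, the sketch below implements the alignment and uniformity measures discussed by Wang and Isola (2020) in plain Python. The defaults `alpha=2` and `t=2` follow that paper as best recalled; treating two views of the same flight as a positive pair is an assumption, and the embeddings are toy values.

```python
import math
from itertools import combinations


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(positive_pairs, alpha=2):
    """Mean distance**alpha between embeddings of positive pairs (lower is better)."""
    return sum(sq_dist(a, b) ** (alpha / 2) for a, b in positive_pairs) / len(positive_pairs)


def uniformity(embeddings, t=2):
    """Log of the mean Gaussian potential over all pairs (lower is more uniform)."""
    vals = [math.exp(-t * sq_dist(a, b)) for a, b in combinations(embeddings, 2)]
    return math.log(sum(vals) / len(vals))


# Toy 2-D embeddings; a real encoder would produce these from flight windows.
view_a = normalize([1.0, 0.1])
view_b = normalize([1.0, -0.1])   # hypothetical second view of the same flight
other = normalize([-1.0, 0.0])    # a different flight

print(alignment([(view_a, view_b)]))
print(uniformity([view_a, view_b, other]))
```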

Time Series Classification Benchmarking Benchmark results suggest that the NGAFID Aviation
Maintenance dataset is particularly challenging for non-deep-learning methods. Of the two main
tasks, deep learning models seem to perform reasonably well for maintenance issue detection, but
all models perform quite poorly on maintenance issue classification. One possible explanation is
that the data is more similar to audio data, where deep methods perform well, and less similar to
the time series data benchmarks by Bagnall et al. (2018), where methods such as MiniRocket and
HIVE-COTE perform well. While we were unable to test HIVE-COTE, we invite other researchers to
evaluate their methods on the NGAFID Aviation Maintenance dataset.

Anomaly Detection While this dataset does not contain localized annotations of failures, one could
separate flights into regular operation and compromised operation and attempt anomaly detection. As
noted in earlier sections, the authors of this paper believe that anomaly detection in this dataset would
be extremely difficult. This is because the data collected contains a significant degree of variance
caused by other factors, such as the pilot’s experience, the weather, the flight plan, the payload carried,
and so on.
Furthermore, anomaly detection works best in cases of acute part failure, where the failure of a part
is dramatic or has a dramatic impact on the overall system. As noted in earlier sections, chronic part
failure is more common and does not create a dramatic impact on the overall system. While such
failures make the aircraft unsafe to fly, they do not immediately cause the aircraft to crash or cease
operation.

An autoencoder system, such as that of An and Cho (2015), would struggle with this dataset. First, the
encoder must accommodate significant degrees of variance when there are no anomalies. Second,
the encoder must detect the most subtle of changes in flight characteristics, which may be easily
overshadowed by other sources of variance. Imagine a pilot performing a barrel roll with a leaky
gasket; the variance in data caused by the barrel roll would eclipse any variance caused by a leaky
gasket.
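The argument can be made concrete with a toy simulation. The sketch below is not an autoencoder: a simple z-score threshold detector stands in for the reconstruction-error test, and the noise level and fault offsets are invented for illustration. It compares how often a subtle (chronic) offset and a dramatic (acute) offset are flagged when operational variance is large.

```python
import random
from statistics import mean, pstdev


def simulate_flight(n, maneuver_std, fault_shift, rng):
    """One synthetic sensor channel: operational noise plus a constant fault offset."""
    return [rng.gauss(0.0, maneuver_std) + fault_shift for _ in range(n)]


def anomaly_rate(signal, mu, sigma, z=3.0):
    """Fraction of samples more than z standard deviations from the healthy mean."""
    return sum(abs(x - mu) > z * sigma for x in signal) / len(signal)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Healthy reference flight with large variance from piloting, weather, and so on.
healthy = simulate_flight(5000, maneuver_std=10.0, fault_shift=0.0, rng=rng)
mu, sigma = mean(healthy), pstdev(healthy)

chronic = simulate_flight(5000, maneuver_std=10.0, fault_shift=0.5, rng=rng)
acute = simulate_flight(5000, maneuver_std=10.0, fault_shift=50.0, rng=rng)

chronic_rate = anomaly_rate(chronic, mu, sigma)
acute_rate = anomaly_rate(acute, mu, sigma)
print(f"chronic fault flagged: {chronic_rate:.3f}, acute fault flagged: {acute_rate:.3f}")
```

Under these assumptions the chronic offset is flagged at roughly the same rate as healthy noise, while the acute offset is flagged almost always, which matches the intuition in the paragraph above.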

Transfer Learning This dataset also presents opportunities for transfer learning (Torrey and Shavlik,
2010). While this dataset uses sensor data from the Cessna 172 collected from a flight school, the
same sensors can be placed on other Cessna 172 aircraft for other uses. For example, the Cessna 172
can be used for passenger, cargo, or military purposes. It is also possible that these sensors can be
mounted in similar single engine aircraft. This would allow researchers to develop PHM systems
without needing to collect as much data from the other aircraft; the data from the other aircraft can be
supplemented with the NGAFID data.
The authors believe that any transfer learning for a military purpose will not directly endanger more
lives. This is because this dataset only contains flight and maintenance information, which can be
used to improve the safety and reliability of systems.

6 Future Research

Flight Event Detection Datasets Currently, the NGAFID web application provides detection
services for certain flight events, such as an aircraft stall event. Using the existing flight data and expert
labeling of said flight data, it is possible to create a time series localization dataset, with the goal of
detecting both the presence and timing of such events.

Unsupervised and Self-Supervised Datasets The NGAFID database contains more than 900,000
hours of flight data from various fleets and aircraft models. While most of the data is unlabeled, it
can still be used for self-supervised and unsupervised learning, such as contrastive representation
learning or masked data modeling, similar to Devlin et al. (2018).
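A masked data modeling objective for flight recordings could start from a corruption step like the one sketched below. This is an assumption-laden illustration: the 15% masking ratio is borrowed from BERT, masking whole time steps with a constant value is only one possible design, and the function name and toy series are hypothetical.

```python
import random


def mask_time_steps(series, mask_ratio=0.15, mask_value=0.0, seed=None):
    """Mask whole time steps of a multivariate series.

    series: list of time steps, each a list of sensor readings.
    Returns the corrupted series, the masked indices, and the original
    readings at those indices, which a model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    n = len(series)
    k = max(1, round(mask_ratio * n))
    masked_idx = sorted(rng.sample(range(n), k))
    masked_set = set(masked_idx)
    corrupted = [
        [mask_value] * len(row) if i in masked_set else list(row)
        for i, row in enumerate(series)
    ]
    targets = [list(series[i]) for i in masked_idx]
    return corrupted, masked_idx, targets


# Toy series: 100 one-second time steps with 23 channels, matching the sensor
# count reported for the dataset; the values themselves are made up.
toy_series = [[float(t + c) for c in range(23)] for t in range(100)]
corrupted, masked_idx, targets = mask_time_steps(toy_series, seed=0)
```

A model trained on `corrupted` to predict `targets` needs no maintenance labels, which is what makes the unlabeled portion of the database usable.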

Dataset Expansion The authors are actively working with the Federal Aviation Administration
and existing flight schools to obtain more maintenance data to expand this dataset to multiple
airframes and fleets of aircraft. This process is both slow and costly due to data governance and legal
issues. Given the value of this dataset even from a single fleet, the authors will provide the currently
available data for the Cessna 172 and work on a future data release to include additional airframes
and maintenance issues.

7 Conclusion

In this paper we present 31,177 hours of flight data across 28,935 flights, which occur relative to
2,111 unplanned maintenance events clustered into 36 types of maintenance issues. Each flight
records information from 23 different sensors every second on the Cessna 172 aircraft, during normal
operation of a flight school. Our paper makes the significant contribution of providing non-simulated,
compromised aircraft flight data, collected ethically at no additional danger to the pilots involved.
This dataset is made easily accessible at the links in Section 1.
The large amount of flight data involving a compromised aircraft is particularly valuable to prognostic
health management and predictive maintenance. Because the aircraft in question, the Cessna 172, is
often used in flight schools, recreation, agriculture, and more, this dataset can help create systems
that can greatly improve flight safety.
Finally, the NGAFID Aviation Maintenance dataset will be of particular interest to machine learning
researchers working with time series data. It is our aim, by releasing this dataset and identifying
areas for future research, to encourage further work in the detection and classification of maintenance
issues. We hope this in turn leads to improved future detection and classification algorithms.

References
Akhbardeh, F., Desell, T., and Zampieri, M. (2020). MaintNet: A collaborative open-source library for
predictive maintenance language resources. In Proceedings of the 28th International Conference
on Computational Linguistics: System Demonstrations, pages 7–11, Barcelona, Spain (Online).
International Committee on Computational Linguistics (ICCL).
An, J. and Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction
probability. Special Lecture on IE, 2(1):1–18.
Arena, F., Collotta, M., Luca, L., Ruggieri, M., and Termine, F. G. (2021). Predictive maintenance in
the automotive sector: A literature review. Mathematical and Computational Applications, 27(1):2.
Argyriou, A., Evgeniou, T., and Pontil, M. (2006). Multi-task feature learning. Advances in neural
information processing systems, 19.
Arias Chao, M., Kulkarni, C., Goebel, K., and Fink, O. (2021). Aircraft engine run-to-failure dataset
under real flight conditions for prognostics and diagnostics. Data, 6(1):5.
Assaf, R., Giurgiu, I., Bagehorn, F., and Schumann, A. (2019). Mtex-cnn: Multivariate time series
explanations for predictions with convolutional neural networks. In 2019 IEEE International
Conference on Data Mining (ICDM), pages 952–957. IEEE.
Bagnall, A., Dau, H. A., Lines, J., Flynn, M., Large, J., Bostrom, A., Southam, P., and Keogh,
E. (2018). The uea multivariate time series classification archive, 2018. arXiv preprint
arXiv:1811.00075.
Board (2011). Review of U.S. civil aviation accidents: review of aircraft accident data, 2007-2009.
U.S. National Transportation Safety Board.
Celikmih, K., Inan, O., and Uguz, H. (2020). Failure prediction of aircraft equipment using machine
learning with a hybrid data preparation method. Scientific Programming, 2020.
Chu, E., Gorinevsky, D., and Boyd, S. (2010). Detecting aircraft performance anomalies from cruise
flight data. In AIAA Infotech@ Aerospace 2010, page 3307.
Dangut, M. D., Jennions, I. K., King, S., and Skaf, Z. (2022). A rare failure detection model for
aircraft predictive maintenance using a deep hybrid learning approach. Neural Computing and
Applications, pages 1–19.
Dempster, A., Petitjean, F., and Webb, G. I. (2020). Rocket: exceptionally fast and accurate time
series classification using random convolutional kernels. Data Mining and Knowledge Discovery,
34(5):1454–1495.
Dempster, A., Schmidt, D. F., and Webb, G. I. (2021). Minirocket: A very fast (almost) deterministic
transform for time series classification. In Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining, pages 248–257.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers
for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.-A. (2019). Deep learning for
time series classification: a review. Data mining and knowledge discovery, 33(4):917–963.
Fawaz, H. I., Lucas, B., Forestier, G., Pelletier, C., Schmidt, D. F., Weber, J., Webb, G. I., Idoumghar,
L., Muller, P.-A., and Petitjean, F. (2020). Inceptiontime: Finding alexnet for time series classification.
Data Mining and Knowledge Discovery, 34(6):1936–1962.
Goldman, S. M., Fiedler, E. R., and King, R. E. (2002). General aviation maintenance-related
accidents: A review of ten years of ntsb data.

Goyer, B. I., Staff, F., Staff, I. G., and Flying (2022). Cessna 172: Still relevant today?
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In
European conference on computer vision, pages 630–645. Springer.
Johnson, J. M. and Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance.
Journal of Big Data, 6(1):1–54.
Karboviak, K., Clachar, S., Desell, T., Dusenbury, M., Hedrick, W., Higgins, J., Walberg, J., and Wild,
B. (2018). Classifying aircraft approach type in the national general aviation flight information
database. In International Conference on Computational Science, pages 456–469. Springer.
Karim, F., Majumdar, S., Darabi, H., and Chen, S. (2017). Lstm fully convolutional networks for
time series classification. IEEE access, 6:1662–1669.
Khadilkar, H. and Balakrishnan, H. (2012). Estimation of aircraft taxi fuel burn using flight data
recorder archives. Transportation Research Part D: Transport and Environment, 17(7):532–537.
Klein, V. (1989). Estimation of aircraft aerodynamic parameters from flight data. Progress in
Aerospace Sciences, 26(1):1–77.
Kukačka, J., Golkov, V., and Cremers, D. (2017). Regularization for deep learning: A taxonomy.
arXiv preprint arXiv:1710.10686.
Le, P. and Zuidema, W. (2016). Quantifying the vanishing gradient and long distance dependency
problem in recursive neural networks and recursive lstms. arXiv preprint arXiv:1603.00423.
Lines, J., Taylor, S., and Bagnall, A. (2018). Time series classification with hive-cote: The hierarchical
vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery
from Data, 12(5).
Liu, Y., Frederick, D. K., DeCastro, J. A., Litt, J. S., and Chan, W. W. (2012). User’s guide for the
commercial modular aero-propulsion system simulation (c-mapss): Version 2. Technical report.
Löning, M., Bagnall, T., Király, F., Middlehurst, M., Ganesh, S., Oastler, G., Lines, J., ViktorKaz,
Walter, M., Mentel, L., RNKuhns, chrisholder, Tsaprounis, L., Owoseni, T., Rockenschaub, P.,
Khrapov, S., jesellier, danbartl, Bulatova, G., eenticott shell, Lovkush, Take, K., Meyer, S. M.,
AidenRushbrooke, Gilbert, C., Schäfer, P., oleskiewicz, Xu, Y.-X., Ansari, A., and Sakshi, A.
(2022). alan-turing-institute/sktime: v0.11.4.
Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., and Bagnall, A. (2021). Hive-cote 2.0:
a new meta ensemble for time series classification. Machine Learning, 110(11):3211–3243.
Oguiza, I. (2022). tsai - a state-of-the-art deep learning library for time series and sequential data.
Github.
Orsenigo, C. and Vercellis, C. (2010). Combining discrete svm and fixed cardinality warping distances
for multivariate time series classification. Pattern Recognition, 43(11):3787–3794.
Randle, W. E., Hall, C. A., and Vera-Morales, M. (2011). Improved range equation based on aircraft
flight data. Journal of Aircraft, 48(4):1291–1298.
Rezaeianjouybari, B. and Shang, Y. (2020). Deep learning for prognostics and health management:
State of the art, challenges, and opportunities. Measurement, 163:107929.
Seto, S., Zhang, W., and Zhou, Y. (2015). Multivariate time series classification using dynamic time
warping template selection for human activity recognition. In 2015 IEEE Symposium Series on
Computational Intelligence, pages 1399–1406. IEEE.
Shifaz, A., Pelletier, C., Petitjean, F., and Webb, G. I. (2020). Ts-chief: a scalable and accurate forest
algorithm for time series classification. Data Mining and Knowledge Discovery, 34(3):742–775.
Torrey, L. and Shavlik, J. (2010). Transfer learning. In Handbook of research on machine learning
applications and trends: algorithms, methods, and techniques, pages 242–264. IGI global.

Tsui, K. L., Chen, N., Zhou, Q., Hai, Y., and Wang, W. (2015). Prognostics and health management:
A review on data driven approaches. Mathematical Problems in Engineering, 2015.

Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., and Matassoni, M. (2013). The second
‘chime’ speech separation and recognition challenge: Datasets, tasks and baselines. In 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, pages 126–130.

Wang, T. and Isola, P. (2020). Understanding contrastive representation learning through alignment
and uniformity on the hypersphere. In III, H. D. and Singh, A., editors, Proceedings of the 37th
International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning
Research, pages 9929–9939. PMLR.

Wang, Z., Yan, W., and Oates, T. (2017). Time series classification from scratch with deep neural
networks: A strong baseline. In 2017 International joint conference on neural networks (IJCNN),
pages 1578–1585. IEEE.

Yang, H., LaBella, A., and Desell, T. (2021). Predictive maintenance for general aviation using
convolutional transformers. arXiv preprint arXiv:2110.03757.

Yildirim, M. T. and Kurt, B. (2018). Aircraft gas turbine engine health monitoring system by real
flight data. International Journal of Aerospace Engineering, 2018.

A Appendix

Class Name                                        Overall Count of Flights  Subset Count of Flights
aircraft start/external issue                     90                        0
baffle bracket loose/damage                       35                        0
baffle crack/damage/loose/miss                    545                       304
baffle mount loose/damage                         42                        0
baffle plug need repair/replace                   465                       254
baffle rivet loose/miss/damage                    92                        0
baffle screw miss/loose                           348                       211
baffle seal loose/damage                          336                       197
baffle spring damage                              79                        0
baffle tie/tie rod loose or damage                303                       0
cowling miss/loose/damage                         89                        0
cylinder compression issue                        302                       143
cylinder crack/fail/need part repair              196                       108
cylinder exhaust valve/stuck valve issue          48                        0
cylinder head/exhaust gas temperature issue       71                        0
cylinder/exhaust push rod/tube damage             106                       0
drain line/tube damage                            127                       0
engine crankcase/crankshaft/firewall near repair  99                        0
engine failure/fire/time out                      236                       161
engine idle/rpm issue                             150                       93
engine need repair/reinstall/clean                148                       80
engine run rough                                  311                       141
engine seal/tube/bolt loose or damage             144                       76
engine/propeller overspeed or damage              137                       94
induction damage/hardware fail                    51                        0
intake gasket leak/damage                         4244                      2098
intake tube/bolt/seal/boot loose or damage        556                       269
magneto failure                                   51                        0
mixture fail/need adjust                          17                        0
oil cooler need maintenance                       123                       75
oil dipstick/tube need repair                     20                        0
oil leak/pressure issue                           53                        0
oil return line issue                             37                        0
pilot/in-flight noticed issue                     317                       75
rocker cover leak/loose/damage                    2157                      1024
spark plug need repair/replace                    15                        0
flights recorded after maintenance                10291                     5844
flights recorded during maintenance               6504                      0
Table 3: Count of flights by class in the full dataset and the subset used for the benchmark experiments.
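The degree of imbalance in the benchmark subset can be read directly off Table 3. The short sketch below copies the nonzero subset counts from the table and computes the share of post-maintenance flights and the share of the two largest damage classes. The variable names are illustrative, and the totals are simply sums of the table entries.

```python
# Nonzero "Subset Count of Flights" values for the damage classes in Table 3.
damage_subset = {
    "baffle crack/damage/loose/miss": 304,
    "baffle plug need repair/replace": 254,
    "baffle screw miss/loose": 211,
    "baffle seal loose/damage": 197,
    "cylinder compression issue": 143,
    "cylinder crack/fail/need part repair": 108,
    "engine failure/fire/time out": 161,
    "engine idle/rpm issue": 93,
    "engine need repair/reinstall/clean": 80,
    "engine run rough": 141,
    "engine seal/tube/bolt loose or damage": 76,
    "engine/propeller overspeed or damage": 94,
    "intake gasket leak/damage": 2098,
    "intake tube/bolt/seal/boot loose or damage": 269,
    "oil cooler need maintenance": 75,
    "pilot/in-flight noticed issue": 75,
    "rocker cover leak/loose/damage": 1024,
}
after_maintenance = 5844  # "flights recorded after maintenance", subset column

before_total = sum(damage_subset.values())
subset_total = before_total + after_maintenance

# Share of the two largest damage classes among pre-maintenance flights.
top_two = sorted(damage_subset.values(), reverse=True)[:2]
top_two_share = sum(top_two) / before_total

after_share = after_maintenance / subset_total

print(f"pre-maintenance flights in subset: {before_total}")
print(f"share of two largest damage classes: {top_two_share:.1%}")
print(f"share of post-maintenance flights: {after_share:.1%}")
```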

Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Section 3
(c) Did you discuss any potential negative societal impacts of your work? [No] The data
is collected from purpose built sensors on aircraft. We believe that this data is so
specialized as to be only applicable in the aviation industry and only to single engine
aircraft, which very few people interact with regularly.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental
results (either in the supplemental material or as a URL)? [Yes] As a repository
with links to Colab Notebooks
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] See the code in the repo
(c) Did you report error bars (e.g., with respect to the random seed after running experiments
multiple times)? [Yes] But only for the early detection in section 4.4. The overall
results in 4.3 only report mean values. However, you can reproduce results as the splits
are available in the dataset.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] See section 4.2.5
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [No] See data sheet supplementary material
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [Yes] Section 3.3
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
