Automated Aviation Occurrences Categorization: Kosio Marev and Krasin Georgiev


International Conference on Military Technologies (ICMT) 2019

May 30 – 31, 2019, Brno, Czech Republic

AUTOMATED AVIATION OCCURRENCES CATEGORIZATION


Kosio Marev and Krasin Georgiev†
† Department of Aeronautics, Technical University of Sofia, 1000 Sofia, Bulgaria, e-mail: krasin@tu-sofia.bg

Abstract: Information about aviation events is collected by all participants in the aviation system, e.g. airlines, maintenance organizations, and air traffic controllers. Reporting and initial assessment usually involve assigning categories from a predefined nomenclature (scheme) aligned with the purpose of the reporting system and the established processing practices. Such manual categorization is time and resource consuming and, more importantly, limits the application of the dataset. We apply and evaluate the effectiveness of a state-of-the-art neural-network-based algorithm for Natural Language Processing for classification of aviation safety report narratives. Multi-class, multi-label supervised learning is performed on two small datasets, 4500 and 8000 cases with 16 and 54 classes respectively, both extracted from the NASA Aviation Safety Reporting System. The results are promising when compared to recent studies, considering that an off-the-shelf algorithm without much customization is applied. Automatic categorization can relieve the current burden of manual categorization of the events by reducing the number of likely categories, targeting quality checks at the most ambiguous records and applying new or updated classification schemes.

Keywords: aviation safety, occurrence reporting, NLP, NASA ASRS, fastai, neural networks

I. INTRODUCTION

Multiple aviation safety information collection programs are in operation today [1]. They accumulate the shared experience of flight, operational, maintenance, and regulatory staff, either as individuals or as organizations. Some of the reports are mandatory, required by regulations to monitor the risk from known safety threats, and others are voluntary, allowing unrestricted exchange of personal observations and safety concerns [2]–[5]. In most cases, measured numerical parameters, event characteristics and categories (assigned originally or after additional assessments) are supplemented by a free-form event description provided as simple text.

What information is collected and how it is coded into categories is mostly determined by the purpose of each database and the established analysis practices. However, the reporting and analysis requirements can change over time. For example, the European Aviation Safety Agency (EASA) has begun applying a new European Risk Classification Scheme (ERCS) to historical occurrences [3], [6]; the Human Factors Analysis and Classification System (HFACS) has been applied to study past accidents [7], [8]. Applications other than the one a database was initially designed for are severely restricted by the manpower required for the necessary knowledge extraction. Careful reading, understanding, assessment and final labeling or number extraction of hundreds of thousands of narratives by experts is rarely an option.

Automated natural language processing (NLP) of incident and accident reports for aviation risk management has been studied in recent years [9]–[14]. Some success has been demonstrated on tasks such as multi-class and multi-label categorization, topic modeling and similarity search. An interactive browsing and exploration approach was proposed in [12]. Practical applications of the automatic classification discussed are
• reducing the number of possible categories when categorizing new events for current databases;
• using customized nomenclatures to re-categorize old datasets for specific tasks such as reliability and risk analysis;
• an interactive search based on keywords or sample reports.

A short overview of the techniques adopted for classification of aviation narratives follows. The basic approach splits the problem into two tasks: first represent the text by an array of numbers with fixed length (feature extraction), and then transform the features into one or more labels (classification).

Each document is converted to a vector of text unit frequencies ("bag" of words, e.g. see [11]). The document vector has a fixed length equal to the number of text units in the vocabulary of the dataset. Each value in the vector is the frequency of the corresponding term in the document. Hand-written, rule-based normalizers have been used in [11] to convert synonyms, abbreviations, and multi-word terms into single terms, numbers to signifiers, etc. Then different types of text units (also called tokens or terms) are constructed, from single terms or stems to n-grams of terms or characters. Each term frequency (TF) can be scaled to account for the "rarity" of the term in the collection, the so-called TF*IDF representation, where IDF stands for Inverse Document Frequency. Some approaches further apply Latent Semantic Analysis (LSA) to reduce the document vector from vocabulary word frequencies (several thousand terms) to topic frequencies (several hundred semantic topics) [13]–[15]. The document-term matrix is decomposed into document-topic and topic-term matrices using an appropriate matrix factorization. This is an unsupervised learning method for feature dimension reduction and topic extraction. Probabilistic alternatives for topic modelling exist [16], [17], and were applied to aviation safety reports in [9].

The above studies do not exploit the modern representation of documents as a sequence of word embeddings [18]–[20]. Each word is represented by a vector instead of an index in the vocabulary, and similar words are closer in the vector space, the so-called distributed representation.
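The bag-of-words and TF*IDF representations described in the overview above can be illustrated with a few lines of plain Python. This is only a minimal sketch under stated assumptions: the three narratives are invented, tokenization is a simple whitespace split standing in for the rule-based normalizers of [11], and the weighting uses the textbook form idf = log(N / df), which is not necessarily the variant used in the cited studies.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real normalization)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Map every distinct token in the collection to a column index."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(doc, vocab):
    """Fixed-length vector of raw term frequencies for one document."""
    counts = Counter(tokenize(doc))
    # Iterating the dict follows insertion order, i.e. the column order.
    return [counts.get(tok, 0) for tok in vocab]


def tf_idf(docs):
    """Return the vocabulary, the raw TF matrix and the TF*IDF matrix."""
    vocab = build_vocabulary(docs)
    tf = [bag_of_words(doc, vocab) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
    # Rare terms get a larger weight; a term present in every document gets 0.
    idf = [math.log(n_docs / d) for d in df]
    weighted = [[f * w for f, w in zip(row, idf)] for row in tf]
    return vocab, tf, weighted


# Hypothetical narratives, invented for illustration only.
narratives = [
    "engine failure during climb",
    "hydraulic failure during descent",
    "weather delay before departure",
]
vocab, tf, weighted = tf_idf(narratives)
```

Every row has the same length as the vocabulary, so documents of different length become comparable vectors. In the first narrative the term "engine" (one document) receives a larger weight than "failure" (two documents), which is the "rarity" scaling the text refers to.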

978-1-7281-4593-8/19/$31.00 ©2019 IEEE

Authorized licensed use limited to: Jyvaskylan Yliopisto. Downloaded on October 14,2020 at 06:44:21 UTC from IEEE Xplore. Restrictions apply.
The classification part is done using conventional algorithms or newly developed ones. A correlation between document-term and topic-term vectors combined with a threshold was applied in [10]; a K-Nearest-Neighbor (KNN) classifier based on the cosine similarity of document-topic vectors in [13], [14]; Support Vector Machines (SVM) over document-term vectors in [12]; Bayesian multivariate regression on document-topic vectors in [9], etc. Multi-label text classification is handled in different ways [21], e.g. training independent classifiers for each target category (37 SVM classifiers with linear kernels in [12]), splitting the document into sentences and taking the most frequent sentence classes (with a KNN classifier in [14]), or using a classifier that outputs a distribution of probabilities over all labels [9].

Again, it seems reasonable to consider the recent achievements in the field of machine learning and NLP. Many traditional algorithms in computer vision, machine translation and automatic control have been replaced by deep neural networks [22], [23]. The power of the new "data"-based approach is demonstrated by the advancements in self-driving cars, robotics and the game industry. The need for a big training dataset has been relaxed in image processing by so-called "transfer learning": existing models trained for a different problem and on unrelated data are fine-tuned for the particular task. Current research develops transfer learning techniques for NLP by using pretrained word vectors to build a document representation for further fine-tuning and classification [24], [25]. Howard and Ruder propose pretraining and fine-tuning of a whole language model and show that their method outperforms the existing state of the art on multiple representative text classification tasks [25].

The power and limitations of the neural networks (NN) based approach have not been demonstrated on aviation occurrence narrative data. The goal of this study is to apply an off-the-shelf NN NLP technique with minor modifications to reports from the NASA Aviation Safety Reporting System (ASRS) database. The algorithm is introduced in the next section and follows [25]. The achieved performance will be discussed in the context of similar metrics reported in the aviation safety literature [9], [12], [14].

II. MATERIALS AND METHODS

Data for the training and validation datasets were taken from ASRS Database Online [26]. The query reproduced the filter applied by Robinson (2018) to allow fair comparison of the results [14]. The dataset included passenger flights under FAR Part 121, with all records for years 2011 and 2012 in the training subset and all records for year 2009 in the validation subset. January 2013 was also included in the training subset. This resulted in 4500 training and 2948 validation reports. Each report can have a single primary problem and multiple contributing factors / situations. There are 16 possible labels. The primary and the contributing factors are treated as separate datasets. The above training dataset was expanded with an additional 6242 records from years 2010, 2013 and 2014 to form an extra-training dataset with 10742 reports.

Another study based on ASRS narratives tries to predict the type of the anomalous events [9]. There are 57 predefined classes, and a randomly selected subset of 10000 reports was explored. For our study, 10000 records from the expanded extra-training dataset described above were used to allow better reproducibility of the results. Training and validation datasets were produced using 5-fold cross-validation.

A variety of metrics can be used for performance evaluation of the model predictions [27]. We consider only the ones applicable to both multi-class and multi-label problems. Most metrics were calculated using the implementations in [27]. The performance metrics selected for this study follow the ones used in [14] and [9] to allow comparison:
- Hamming Loss (HL) – the fraction of wrong labels to the total number of labels. This is the least restrictive measure, as credit is given for every correctly predicted record-label combination.
- Exact Match Ratio (A) – the fraction of records with an exact match of all labels. This is the most restrictive measure, as no credit is given for correctly predicting only some of the labels of a record.
- Precision (P) and Recall (R) scores – the fraction of true predicted positives to all predicted positives (precision) and to all actual positives (recall). These are appropriate for measuring the performance on unbalanced data. Several averaging schemes exist to apply these metrics to multi-class and multi-label data: micro, macro, weighted and sample, i.e. averaging as a whole, averaging by labels, by labels with weighting based on label frequency, and by records, respectively.
- F1 scores – the harmonic mean of precision and recall, appropriate for measuring the performance on unbalanced data.
- Multilabel ranking metrics – label ranking average precision, coverage error and label ranking loss as defined and implemented in [9], [27]. These metrics do not require explicit prediction of the labels.
- One error – evaluates "how frequently the top ranked predicted label is not among the true labels" [9].

A method called Universal Language Model Fine-tuning (ULMFiT) was used for the classification [25]. The model is a sequence of a language model (LM) with a 3-layer AWD LSTM architecture and a pooling linear classifier [25], [28]. The training procedure has three steps: 1) general LM pretraining; 2) fine-tuning the LM; 3) classifier fine-tuning. All calculations were based on the code accompanying [25]. For the first step a pretrained model was imported. The fine-tuning of the LM was done with the safety narratives from the dataset. The classifier in the third step was modified to allow training with multi-label input data. The cross entropy loss function was replaced with binary cross entropy with logits:

$$\mathcal{L} = \sum_{l=1}^{L} \left[ x_l - x_l y_l + \log\left(1 + e^{-x_l}\right) \right], \tag{1}$$

where $x_l$ is the output for label $l$ at the last linear layer, without non-linear activation, $y_l$ is the true value for label $l$, i.e. 1 or 0, and $L$ is the number of labels. The prediction matrix was calculated from the output scores $x_l$ by threshold crossing. The threshold was selected based on the required precision-recall balance. As a whole, the model structure and model hyperparameters were preserved as in the original implementation [29].

The calculations were performed using Python (v.3.6.4, Anaconda Inc.) with Pandas (v.0.22.0), Numpy (v.1.14.2), SciKit-learn (v.0.19.1), spacy (v.2.0.11, Explosion AI), torch (v.0.4.1.post2) and Fastai (v.0.7.0) libraries.

III. RESULTS

Free text classification using a state-of-the-art, off-the-shelf, neural-network-based approach was applied to datasets of safety report narratives. The approach applies the so-called Universal Language Model Fine-tuning (ULMFiT) method with transfer learning for text classification [25]. A language model based on a recurrent neural network and a classifier based on a fully connected network are available as an open source pretrained model and code [25], [29]. The language model, pretrained on Wikipedia texts, was fine-tuned with safety record narratives. The safety report datasets were samples of narratives and the corresponding labels, e.g. primary problem / contributing factors, abnormal event types, etc., from the NASA ASRS database.

A. Data characteristics

The dataset of ASRS narratives with contributing factors will be explored in detail. The frequency distribution of the labels of the documents is shown for both the training and the validation subsets in Table I. There are 4500 records in the training dataset and 2948 in the validation dataset. The number of unlabeled records is only 15 for the training and 13 for the validation parts, so these can be safely ignored. The labels are selected from a set of 16 values without restriction on the number of labels per record. It is important to note that there is a data imbalance, as some of the categories are rare, with less than 3 % of the observations, while others are assigned to more than 50 % of all cases ("Aircraft" and "Human Factors"). This means that our classifier predictions for the rare categories will be poor, while the frequent ones will dominate most of the performance measures.

TABLE I
DOCUMENT FREQUENCY BY LABEL. DATA FROM ASRS

Category               Train count  Train %  Val count  Val %
Aircraft                      2841     63.1       1635   55.5
Human Factors                 2264     50.3       1672   56.7
Procedure                     1663     37.0        754   25.6
Company Policy                 711     15.8        624   21.2
Weather                        447      9.9        270    9.2
Environment (1)                446      9.9        238    8.1
Chart Or Publicat.             478     10.6        259    8.8
Airport                        278      6.2        212    7.2
ATC (2)                        107      2.4         60    2.0
Manuals                        321      7.1        163    5.5
MEL                            129      2.9         84    2.8
Equipment / Tooling             74      1.6         56    1.9
Part (3)                       113      2.5         71    2.4
Airspace Structure             107      2.4         92    3.1
Staffing                        71      1.6         52    1.8
Logbook Entry                  101      2.2         59    2.0
Sum                          10151    225.6       6301  213.7
Total records                 4500    100         2948  100

(1) Environment – Non Weather Related; (2) ATC Equipment / Nav Facility / Buildings; (3) Incorrect / Not Installed / Unavailable Part

The multi-label problem can be characterized by the number of labels per record. It varies from zero (0.35 %) to ten (< 0.1 %), and most of the records have a single label (36 %), two labels (30 %), three labels (20 %) or four labels (8 %). The average number of labels per record is 2.26. Predicting the exact number and combination of labels for each record is an unnecessarily ambitious task, and most studies rely on average precision, recall, and F1 scores as in [12], [14] or on different ranking metrics [9]. To reduce the penalty from under-represented labels, micro averaging was used. Other performance metrics were also calculated to allow discussion of individual categories and labels, and comparison with other studies.

B. Classification performance

The results reported in Robinson 2018 for the multi-label task [14] are replicated in Table II together with our performance metrics on the same datasets. On the multi-class and multi-label contributing factors dataset we reduce the Hamming loss by 54 % (from 0.216 to 0.099) and improve the F1 score by 37 % (from 0.484 to 0.663). In addition, two dummy predictions are shown to set a baseline for comparison: predicting all possible labels for each record (column 3) and always predicting the most frequent label "Aircraft" from the training dataset (column 4).

TABLE II
PERFORMANCE METRICS FOR CONTRIBUTING FACTORS LABELS

Metric         [14] (1)  [14] (2)  dummy ones (3)  dummy fixed (4)  ULMFiT
Hamming loss      0.216     0.364           0.866            0.125   0.099
Precision         0.351     0.255           0.134            0.570   0.608
Recall            0.781     0.935           1.0              0.265   0.729
F1 score          0.484     0.400           0.237            0.362   0.663

(1) contributing by primary algorithm; (2) contributing by contributing algorithm; (3) all labels as ones; (4) all zeros except "Aircraft"

The results reported in Agovic 2010 for the anomalous events labeling task [9] are replicated in Table III together with our performance metrics on a similar dataset (same database, same size of the dataset, 54 instead of 57 classes) and the same type of cross-validation. Again, the label ranking average precision is higher and the errors are lower for our approach.

TABLE III
PERFORMANCE METRICS FOR EVENT TYPE LABELS

Metric        BMR [9]        ULMFiT
OneError      38.5 ± 0.8     27.0 ± 2.7
AvePrec       64.0 ± 0.5     75.2 ± 2.4
Coverage      8.17 ± 0.14    5.01 ± 0.75
HammingLoss   4.4 ± 0.0      2.9 ± 0.4
RankLoss      5.7 ± 0.2      3.2 ± 0.4

Better predictions were reported in the literature for larger datasets. Tulechki applied classical NLP techniques supplemented by hand-written rules for categorization of 136861 aviation accident and incident reports into 37 event types [12]. The micro-average precision, recall and F1-score were P=86.79%, R=74.08%, F1=79.15% respectively. These values cannot be fairly compared to ours, as the dataset is more than 20 times larger. Unfortunately, ECCAIRS databases are not public. Moreover, we are interested in a limited number of training samples, from a few hundred to a few thousand, a number that can be easily prepared by a small team of experts in a reasonable time.

TABLE IV
PERFORMANCE METRICS PER LABEL. DATA FROM ASRS DATABASE. CONTRIBUTING FACTORS. ULMFiT ALGORITHM

Label                  f1-score  precision  recall  support
Human Factors              79.5       71.3    89.7     1672
Aircraft                   84.7       77.1    94.1     1635
Procedure                  52.5       37.2    89.1      754
Company Policy             57.7       65.6    51.4      624
Weather                    62.5       64.0    61.1      270
Chart Or Publication       51.1       46.9    56.0      259
Environment                17.4       33.7    11.8      238
Airport                    46.9       49.7    44.3      212
Manuals                    47.1       42.6    52.8      163
Airspace Structure         13.2       50.0     7.6       92
MEL                        40.0       43.7    36.9       84
Incorrect ... Part         16.5       30.8    11.3       71
ATC                         3.0       14.3     1.7       60
Logbook Entry               0.0        0.0     0.0       59
Equipment / Tooling         3.5      100.0     1.8       56
Staffing                    0.0        0.0     0.0       52
micro avg                  66.3       60.8    72.9     6301
macro avg                  36.0       45.4    38.1     6301
samples avg                69.1       68.0    78.9     6301
weighted avg               64.3       61.3    72.9     6301
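The per-label F1 values in Table IV can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below is an illustration only, not part of the original analysis: the function name and the 0.1 tolerance are assumptions, and the numbers are copied from Table IV in percent. Small gaps are expected because the published precision and recall are already rounded to one decimal.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Table IV.
TABLE_IV = {
    "Human Factors": (71.3, 89.7, 79.5),
    "Aircraft": (77.1, 94.1, 84.7),
    "Procedure": (37.2, 89.1, 52.5),
    "Company Policy": (65.6, 51.4, 57.7),
    "Weather": (64.0, 61.1, 62.5),
    "Chart Or Publication": (46.9, 56.0, 51.1),
    "Environment": (33.7, 11.8, 17.4),
    "Airport": (49.7, 44.3, 46.9),
    "Manuals": (42.6, 52.8, 47.1),
    "Airspace Structure": (50.0, 7.6, 13.2),
    "MEL": (43.7, 36.9, 40.0),
    "Incorrect ... Part": (30.8, 11.3, 16.5),
    "ATC": (14.3, 1.7, 3.0),
    "Logbook Entry": (0.0, 0.0, 0.0),
    "Equipment / Tooling": (100.0, 1.8, 3.5),
    "Staffing": (0.0, 0.0, 0.0),
}

# Recompute F1 for every label from its precision and recall.
recomputed = {
    label: f1_from_precision_recall(p, r)
    for label, (p, r, _) in TABLE_IV.items()
}

# Largest difference between the recomputed and the reported F1.
max_gap = max(
    abs(recomputed[label] - reported)
    for label, (_, _, reported) in TABLE_IV.items()
)

# Macro averaging: unweighted mean over labels, so rare labels count
# as much as frequent ones.
macro_f1 = sum(recomputed.values()) / len(recomputed)

print(f"largest gap to the reported F1: {max_gap:.3f}")
print(f"macro-averaged F1: {macro_f1:.1f}")
```

Because the macro average gives every label equal weight, the undetected rare labels pull it far below the micro average, which is dominated by "Human Factors" and "Aircraft".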
The prediction success was further evaluated at the label level for the contributing factors dataset (Table IV). The most frequent labels "Human Factors" and "Aircraft" have F1 scores of about 0.8, but rare ones such as "Logbook Entry" and "Staffing" are not detected at all. The general properties of the model described in [12], such as poor performance for rare classes and inherent difficulty with some classes, were also observed for our model-data setup. The F1 score for the label "Weather" (0.63) is much better than the scores for "Procedure", "Chart or Publication" and "Environment – Non Weather Related" (0.53, 0.51 and 0.17 respectively) even though the support samples are similar in number (Table IV).

Further training on the expanded training dataset leads to some increase of the micro scores (F1 = 0.68). The increase is more prominent for the rare cases, as shown in Figure 1, where the predictions of the updated model are evaluated on the same validation set. The macro averaged F1 score improved from 0.36 to 0.40.

Automatic classification can be applied for preliminary screening of records for further manual assessment. Recall is then important, as it determines what part of the target records will be missed. A tradeoff between precision and recall is easy to achieve in order to satisfy such user requirements. In Table II, our approach, while giving a better overall score, has a lower recall compared to [14]. This is fixed by reducing the threshold for positive labels, so that the precision/recall changes from 0.61/0.73 to 0.56/0.78 or 0.42/0.90. A recall score of 0.9 means that the classifier will catch 90% of the target records. With a precision of 0.42, a manual review will then be needed to discard 58% of the filtered records.

Automatic categorization can be used as a backup or quality control procedure for important manual assessments. A manual review of improperly labeled records has revealed obvious inconsistencies in the original report coding [12]. Moreover, perfect qualitative assessment is not expected even from human raters. A study specially designed to test human rater reliability in HFACS categorization gives Krippendorff Alpha values of 0.79 across the four tiers and 0.67 across the 19 categories [30]. Having a tool to select cases for expert review is especially sensible when the number of records is in the hundreds (1020 accidents analyzed in [8]) or even in the thousands (e.g. 14086 general aviation accidents and incidents with 31491 aircrew causal factors in [7]).

IV. CONCLUSION

Current state-of-the-art neural network models such as ULMFiT can be applied for classification of aviation safety narratives. We show that a model based on ULMFiT outperforms alternative models for classification of safety occurrence narratives for small training datasets. The F1 score of 0.484 reported by Robinson [14] was increased to 0.663 (using exactly the same ASRS training and validation datasets). This result was achieved without parameter tuning, and therefore further improvement can be expected.

The maturity and the accessibility of the tools for automatic text classification mean that such techniques should become a regular element in the toolbox of the aviation safety analyst. The neural networks approach is flexible enough to handle natively both multi-class and multi-label problems. Further improvements are expected
as the neural network based NLP with transfer learning is an active field of research.

A next step is incorporating structured information to supplement the text processing algorithm. Pretraining the LM on large corpora of aircraft technical documentation or other aviation related literature is another simple step to improve the model. Other NLP tasks such as topic modeling (taxonomy evaluation) and similarity calculation (similarity based retrieval) could be studied.

[Figure 1: horizontal bar charts by contributing factor label. Left panel: F1 scores, %, for the models trained on 4500 and on 10000 records. Right panel: support.]

Fig. 1. Scores F1 by contributing factor label for training datasets with 4500 and 10000 samples. Support based on 2948 validation samples

REFERENCES

[1] GST, "Major Current or Planned Government Aviation Safety Information Collection Programs," p. 60, 2004.
[2] W. Reynard, C. E. Billings, E. S. Cheaney, and R. Hardy, "The Development of the NASA Aviation Safety Reporting System, NASA ASRS (Pub. 34)," NASA, Tech. Rep., 1986.
[3] EC, "Regulation (EU) No 376/2014 of the European Parliament and of the Council of 3 April 2014 on the reporting, analysis and follow-up of occurrences in civil aviation," pp. 18–43, 2014.
[4] CAA.UK, "CAP382: Occurrence Reporting Scheme," 2016. [Online]. Available: https://www.caa.co.uk/Our-work/Make-a-report-or-complaint/MOR/Occurrence-reporting/
[5] FAA, "SDR Reporting." [Online]. Available: https://av-info.faa.gov/sdrx/Default.aspx
[6] EASA, "EASA Annual Safety Review 2018," European Aviation Safety Agency, Safety Intelligence and Performance department, Cologne, Germany, Tech. Rep., 2018.
[7] S. A. Shappell and D. A. Wiegmann, "A Human Error Analysis of General Aviation Controlled Flight Into Terrain Accidents Occurring Between 1990-1998," p. 25, 2003.
[8] S. Shappell, C. Detwiler, K. Holcomb, C. Hackworth, A. Boquet, and D. A. Wiegmann, "Human Error and Commercial Aviation Accidents: An Analysis Using the Human Factors Analysis and Classification System," Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 49, no. 2, pp. 227–242, 2007.
[9] A. Agovic, H. Shan, and A. Banerjee, "Analyzing aviation safety reports: From topic modeling to scalable multi-label classification," in Conference on Intelligent Data Understanding (CIDU), 2010, pp. 83–97.
[10] C. Pimm, C. Raynal, N. Tulechki, E. Hermann, G. Caudy, and L. Tanguy, "Natural Language Processing (NLP) tools for the analysis of incident and accident reports," in International Conference on Human-Computer Interaction in Aerospace (HCI-Aero), Brussels, Belgium, 2012.
[11] N. Tulechki, "Natural language processing of incident and accident reports: application to risk management in civil aviation," Ph.D. dissertation, Université de Toulouse, 2015.
[12] L. Tanguy, N. Tulechki, A. Urieli, E. Hermann, and C. Raynal, "Natural language processing for aviation safety reports: From classification to interactive analysis," Computers in Industry, vol. 78, pp. 80–95, 2016.
[13] S. D. Robinson, W. J. Irwin, T. K. Kelly, and X. O. Wu, "Application of machine learning to mapping primary causal factors in self reported safety narratives," Safety Science, vol. 75, pp. 118–129, 2015.
[14] S. Robinson, "Multi-Label Classification of Contributing Causal Factors in Self-Reported Safety Narratives," Safety, vol. 4, no. 3, p. 30, 2018.
[15] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[16] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '99. New York, New York, USA: ACM Press, 1999, pp. 50–57.
[17] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[18] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," in ICLR Workshop, 2013.
[19] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS'13. USA: Curran Associates Inc., 2013, pp. 3111–3119.
[20] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A Neural Probabilistic Language Model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137–1155, 2003.
[21] G. Tsoumakas and I. Katakis, "Multi-label classification: An overview," International Journal of Data Warehousing and Mining (IJDWM), vol. 3, no. 3, pp. 1–13, 2007.
[22] Y. Lecun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[23] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, "Deep learning applications and challenges in big data analytics," Journal of Big Data, vol. 2, no. 1, p. 1, 2015.
[24] B. McCann, J. Bradbury, C. Xiong, and R. Socher, "Learned in Translation: Contextualized Word Vectors," in NIPS, 2017.
[25] J. Howard and S. Ruder, "Universal Language Model Fine-tuning for Text Classification," in ACL. Association for Computational Linguistics, Melbourne, 2018.
[26] "ASRS Database Online - Aviation Safety Reporting System." [Online]. Available: https://asrs.arc.nasa.gov/search/database.html
[27] "Model evaluation: quantifying the quality of predictions." [Online]. Available: http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification
[28] S. Merity, N. S. Keskar, and R. Socher, "Regularizing and Optimizing LSTM Language Models," CoRR, vol. abs/1708.02182, 2017.
[29] J. Howard and others, "The fastai deep learning library, v0.7.0," 2018. [Online]. Available: https://github.com/fastai/fastai
[30] A. Ergai, T. Cohen, J. Sharp, D. Wiegmann, A. Gramopadhye, and S. Shappell, "Assessment of the Human Factors Analysis and Classification System (HFACS): Intra-rater and inter-rater reliability," Safety Science, 2016.
