How To Do Machine Learning With Small Data

arXiv:2311.07126v1 [cs.LG] 13 Nov 2023

Abstract—Artificial intelligence has experienced technological breakthroughs in science, industry, and everyday life in recent decades. These advancements can be credited to the ever-increasing availability and miniaturization of computational resources, which resulted in exponential data growth. However, because of an insufficient amount of data in some cases, employing machine learning to solve complex tasks is not straightforward or even possible. As a result, machine learning with small data is of rising importance in data science and in applications in several fields. The authors focus on interpreting the general term "small data" and its role in engineering and industrial applications. They give a brief overview of the most important industrial applications of machine learning and small data. Small data is defined in terms of various characteristics compared with big data, and a machine learning formalism is introduced. Five critical challenges of machine learning with small data in industrial applications are presented: unlabeled data, imbalanced data, missing data, insufficient data, and rare events. Based on those definitions, an overview of the considerations in domain representation and data acquisition is given, along with a taxonomy of machine learning approaches in the context of small data.

Index Terms—machine learning, small data, industrial applications, engineering applications

This study was supported by the Ministry of Science, Research and Culture of Brandenburg through the Project Kognitive Materialdiagnostik under Grant 22-F241-03-FhG/007/001. (Corresponding author: Ivan Kraljevski.)
Ivan Kraljevski, Yong Chul Ju, and Constanze Tschöpe are with the Fraunhofer Institute for Ceramic Technologies and Systems, Fraunhofer IKTS, Dresden, Germany (e-mail: [email protected]).
Dmitrij Ivanov and Matthias Wolff are with Brandenburg University of Technology Cottbus–Senftenberg, Cottbus, Germany (e-mail: matthias.wolff@b-tu.de).

I. INTRODUCTION

Machine learning has become the most popular buzzword in business meetings nowadays and is often propagated as the magic new cure for all problems and the carrier of the digital revolution. However, the general term machine learning covers a variety of data-driven methods which are fundamental to "artificial intelligence" (AI) [1], a term coined by John McCarthy [2] and pushed by Alan Turing's question "Can machines think?" towards the modern AI research known today [3], [4]. In different contexts, machine learning (ML) is also known as data mining or predictive analysis [5].

T. M. Mitchell [6] defines machine learning as follows: "A computer program is said to learn from experience E concerning some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E". Here, the classes of tasks T encompass different types of prediction and inference, the source of experience E is the available "a priori" knowledge about the domain, including domain-specific data samples, and P is a measure of the success achieved in a particular task T. A less formal definition is given in [7]: machine learning (ML) "is a sub-field of computer science, but is often also referred to as predictive analytics, or predictive modeling. Its goal and usage is to build new and/or leverage existing algorithms to learn from data, in order to build generalizable models that give accurate predictions, or to find patterns, particularly with new and unseen similar data".

In the last two decades, considerable progress was made in ML science, leveraging the growing computational power [4], [8], [9] and the vast amounts of data increasingly becoming available (big data), see [10]. Consequently, data-driven ML methods are applied to social, economic, industrial, environmental, and even political tasks [11]. A large variety of comprehensive ML publications appeared, e.g., [1], [4], [5] with references therein, related to applications such as search engines, machine translation, online retailers, social networking services, financial modeling, marketing, education, and policy-making. Further detailed understanding of the underlying mechanisms of ML methods was gained, as in statistical learning theory [12], [13], pattern classification [14], [15], and neural and deep learning (DL) [8], [9], [16], [17] in different applications.

Small data: The adoption of big data-driven ML methods is not always feasible, especially when the amount of available data is insufficient for the desired performance due to various limitations, or when data sets are high-dimensional, complicated, or expensive. For example, cases such as medical magnetic resonance therapy scans of new or growing cancer cells, expensive and time-consuming aero-engine test data, limited data, or even a lack of knowledge about unknown datasets shall be covered by the term "small data" as a counterpart to "big data". The term "small data" has been promoted by many authors with different interpretations, e.g., "Small data results from the experimental or intentionally collected data of a human scale where the focus is on causation and understanding rather than prediction" [18]. In [19], the author remarks that "a lone piece of small data is almost never meaningful enough to build a case or create a hypothesis, but blended with other insights and observations ... comes together to create a solution that forms the foundation of a future brand or business". Other interpretations of small data are presented in review publications, e.g., [20], [21], neural network small data approaches [22], and sensor fusion applications [23].

While small data are already widespread in social and financial applications, they are relatively rare in industrial and engineering applications. The availability, quality, and composition of the data significantly influence the performance. ML should be
able to address the main challenges, such as handling high-dimensional problems and data sets, presenting transparent and concrete results, adapting to dynamic environmental conditions, expanding the knowledge by learning from results, utilizing available data without special treatment, and identifying inter- and intra-relations in the processes, correlation, and causality [24], [25].

A. Industrial applications

These encompass all processes directly or indirectly related to manufacturing, or to the maintenance of manufacturing plants, such as in the automotive, aviation, chemical, pharmaceutical, food, and steel industries. Machine learning methods are already widely applied in various industries for process optimization, monitoring, and control applications [26]–[29]. Semiconductor manufacturing [30], rotational machinery [31], materials degradation [32], and engine failure prediction [33] are some examples in predictive maintenance.

B. Engineering applications

These applications cover all fields which apply scientific principles for developing, designing, building, and analyzing technological solutions, such as bio-engineering, e.g., modeling and identification of gene and protein structures [34], DNA research [35], predictive modeling in biomedical engineering [36], drug discovery [37], [38]; in electrical engineering, e.g., autonomous driving [39]; in civil engineering, e.g., structural design optimization and structural health monitoring [40], [41]; in mechanical engineering, e.g., flow field prediction [42], data-driven flow dynamics [43], data-driven computational mechanics [44]; in material science [45]–[47], e.g., prediction of material properties [48], materials design and discovery [49], [50]; and in optimization tasks for complex and expensive black-box problems, e.g., smart data [51], [52].

Despite the rapid growth and availability of big data, small data play an even more critical role in applying ML in industrial and engineering applications. Small data allow a more focused ML design, providing answers to more targeted questions and explainable models. ML with big data is applicable only if the domain of analysis is expected to be stationary, whereas the dynamics of systems often change over time.

Also, in healthcare and the biological sciences, the data are obtained via experimentation, and the formulation of relevant hypotheses to explain the data is the limiting factor for employing approaches based on big data. Therefore, we want to divide the general term "small data" into different categories, which may partly overlap but may serve as guidelines to present the main ideas and ML techniques in the context of small data.

It is important to emphasize that applying machine learning to small data for a specific domain also requires specific knowledge about the problem or the task. We could roughly divide the ML approaches into generalized and specialized ones according to the nature of the available data. A straightforward application of ML frameworks without underlying knowledge may lead to wrong conclusions [53]. That is why we do not claim to present best-practice approaches, but rather to inspire the interested reader to explore their own solutions.

a) Organization of the paper: The paper is organized as follows. In Section II, we first define small data in terms of various characteristics compared with those of big data, and we present a machine learning formalism which is used throughout the paper. Then, we describe five main challenges of small data ML as seen from industrial applications: unlabeled data, imbalanced data, missing data, insufficient data, and rare events. Based on those definitions, in Section III we give an overview of the considerations in domain representation and data acquisition, and a taxonomy of machine learning approaches in the context of small data. In Section IV, we list some of the possible approaches to small data ML in terms of the five categories mentioned above. We close this paper with concluding remarks in Section V.

II. CHALLENGES OF SMALL DATA

A. Definition of Small Data

To explore the general term "small data", we should first describe the characteristics of big data [54], [55] in contrast with those of small data [56]:
• Volume – the sheer quantity of the data; small data can have limited to large volume, whereas big data always have an enormous volume.
• Variety – small data are constrained in terms of their breadth and diversity, while big data draw insights from a wide spectrum of data types.
• Velocity – small data are generated much more slowly and less continually compared to big data, which are produced much faster, in real time.
• Veracity – in big data, the data quality and the value can vary greatly, and the data can be messy, noisy, and contain uncertainty and errors. In contrast, small data are usually produced in a controlled manner.
• Exhaustivity – an entire system is captured (big data), rather than being sampled (small data).
• Resolution and indexicality – big data are fine-grained in resolution and uniquely indexical in identification, while small data range from coarse to fine in resolution and from weak to strong in indexicality.
• Relationality – big data contain common attributes that enable conjoining of different datasets, whereas the relationality of small data is weak.
• Extensionality and scalability – big data can add or change attributes easily and rapidly expand in size, whereas small data are difficult to administer and have limited extensionality and scalability.
• Value – from big data, many insights can be extracted and the data repurposed, whereas small data have limited scope and can rarely be reused for different purposes.
• Variability – the meaning of the data can constantly shift depending on the context in which they are generated; this does not apply to small data, which are quite inflexible regarding their generation.

In the case of industrial and engineering applications, we do not draw a clear line to big data, but rather motivate use cases or applications which exhibit some of the aforementioned
attributes indicating small data. Hence, we suggest scenarios that might include certain redundancies but serve as essential keywords: unlabeled, imbalanced, missing and insufficient data, and rare events.

B. Machine Learning Formalism and Data Sets

We start with introducing symbols and definitions which we will use throughout the rest of this paper. The presented notation is loosely inspired by [57], [58].

The main objective of machine learning is to find a map

  h : X → Y   (1)

from an input domain X to an output domain Y. This map is also called a predictor (classification, regression). It is to suit a particular task and to improve with "experience" (cf. Mitchell's definition of machine learning in the introduction). The objects in the input domain are also known as inputs, data points, or samples, usually in the form of vectors of feature variables (or features). The objects in the output domain are also known as outputs, ground truth, targets, or labels.

Both the input and the output domain can be either numeric or symbolic. We call a domain numeric if it consists of d-dimensional real or complex vectors or vector sequences, i.e.,

  X, Y ⊆ (C^d)^+   (2)

where "+" is the Kleene plus (non-empty sequence of arbitrary length). We call a domain symbolic if it consists of symbolic entities. A symbolic entity can be an atomic symbol, any symbolic structure (strings, graphs, etc.), or a set of the aforementioned. Additionally, symbolic entities may contain (some) numeric values and also weights: weighted graphs, weighted sets, etc.

If at least one of the domains, the input or the output domain, is numeric, we speak of numeric machine learning. If both domains are symbolic, we speak of symbolic machine learning.

For industrial applications, the "experience" to learn from is typically supplied in the form of data sets. Let ε denote the empty or non-existing label. Then a data set can be formally defined as a multi-set

  D := {(x^(1), y^(1)), ..., (x^(N), y^(N))} ⊆ X × (Y ∪ {ε})   (3)

consisting of N := |D| feature–label pairs (x, y) ∈ D called samples. Samples with y = ε are called unlabeled, samples with y ≠ ε are called labeled.

In the following, we assume the label domain to be finite, Y := {c_1, ..., c_K}, with K labels c_1, ..., c_K which we call classes. Then Y induces a (K+1)-partition of the data set:

  D_0 := {(x, y) ∈ D : y = ε}    unlabeled subset,
  D_1 := {(x, y) ∈ D : y = c_1}  labeled with class c_1,
  ...
  D_K := {(x, y) ∈ D : y = c_K}  labeled with class c_K.   (4)

Here, the D_i are pairwise disjoint and fulfill

  D_0 ∪ D_1 ∪ ... ∪ D_K = D.   (5)

We denote by N_i := |D_i| with i ∈ {0, 1, ..., K} the size of the subsets. We additionally define

  D_unlabeled := D_0              unlabeled subset,
  D_labeled := D_1 ∪ ... ∪ D_K    labeled subset,   (7)

and denote by

  N_unlabeled = |D_unlabeled| = N_0   (8)

the number of unlabeled samples and by

  N_labeled = |D_labeled| = Σ_{i=1}^{K} N_i   (9)

the number of labeled samples. We call a data set
  completely labeled, if N_unlabeled = 0,
  unlabeled, if N_labeled = 0, and
  partially labeled, if 0 < N_labeled < N.

C. Partly Labeled and Unlabeled Data

A partially labeled or unlabeled data set is not necessarily small; however, only a small portion or even none of the data samples are labeled. Typically, this kind of data exhibits some small data properties, e.g., limited scope, weak relationality, and inflexible generation.

Partly labeled or unlabeled data frequently occur in industrial applications: most machinery is equipped with all kinds of sensors and their readings can be logged. The entirety of such sensor readings may or may not contain useful information. Without labeling, it is hard to make sense of the data. Moreover, it is usually not possible to take advantage of data collected in other domains, either labeled or unlabeled, as the sensor readings are precise and restricted to the particular task at hand. Manual or semi-automatic analysis requires substantial effort involving organizational, technical, and human resources.

However, as implied by Fig. 1-a, unlabeled data may be used for machine learning, particularly for unsupervised learning.

D. Imbalanced Data

A data set is said to be balanced if all classes are (at least roughly) evenly represented. Formally, let

  α_i := N_i / N_labeled  with i ∈ {1, 2, ..., K}   (10)

be the quota of samples of class i. A data set is balanced if

  ∀i ∈ {1, 2, ..., K} : α_i ≈ 1/K.   (11)

Otherwise, the data set is called imbalanced. Note that the definition is not strict; particularly, we do not require all quotas to be exactly equal.
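The bookkeeping behind Eqs. (7)–(11) — labeled/unlabeled counts and class quotas — can be sketched in a few lines. This is a toy illustration of ours; the tolerance used for "roughly 1/K" is an arbitrary choice:

```python
from collections import Counter

EPS = None  # stands in for the empty label ε

# A toy data set D of feature–label pairs; the x values are arbitrary here
D = [(0.1, "ok"), (0.4, "ok"), (0.9, "faulty"),
     (0.2, EPS), (0.7, "ok"), (0.5, EPS)]

N = len(D)
labeled = [(x, y) for x, y in D if y is not EPS]
unlabeled = [(x, y) for x, y in D if y is EPS]
N_labeled, N_unlabeled = len(labeled), len(unlabeled)

# Class quotas alpha_i = N_i / N_labeled, cf. Eq. (10)
counts = Counter(y for _, y in labeled)
K = len(counts)
alpha = {c: n / N_labeled for c, n in counts.items()}

# Balance check, cf. Eq. (11): every quota roughly equal to 1/K
tol = 0.1
balanced = all(abs(a - 1 / K) <= tol for a in alpha.values())

print(N_labeled, N_unlabeled, alpha, balanced)
```

Here the toy data set is partially labeled (0 < N_labeled < N) and imbalanced, since the quota of the "faulty" class falls well below 1/K = 0.5.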
Fig. 1. Small data examples on the "half moons" synthetic dataset, which presents unique challenges: a) lacks annotations, b) exhibits class imbalances, c) contains missing data points, d) has a limited number of data points, and e) includes unseen or uncommon observations.
the older the car, the more probable it is that the mileage will not be provided by the seller. However, the missingness of the car's mileage can be predicted from its manufacturing year.

Missing completely at random (MCAR) is the case where the probability that data are missing is independent of both observed and unobserved data. There is no pattern, no bias is introduced, and the missing data are a random subset of the data. The analyses can be performed using only observations with complete data, considering them a simple random sample of the entire data. In reality, data seldom exhibit characteristics of MCAR. For instance, if there are occasional missing values in the sensor readings because of a power outage, this would be MCAR, see Fig. 2-b. However, if there is any pattern in the outages such that they are correlated with the sensors' readings, then the data are not MCAR.

Missing not at random (MNAR) refers to data that meet the description of neither MCAR nor MAR. Here, the missingness is specifically related to the missing variable itself. Since none of the standard methods for dealing with missing data can be employed, advanced predictive or statistical modeling of the missing data is necessary to get an unbiased estimate of the parameters. An example is a temperature sensor that fails whenever a condition outside the operational range occurs (too low or too high temperature values, Fig. 1-c), or that failed because of a power supply outage caused by the condition itself.

F. Insufficient Data

Data are called insufficient if their volume is very small and intrinsically limited due to expensive collection [63] or generation, or a very low frequency of production. With insufficient data, machine learning algorithms cannot obtain proper models.

Fig. 1-d gives an example of a dataset with 75 samples per class, where it is evident that it is difficult to determine the optimal boundary between the two classes.

G. Rare Events

Rare events, also known as anomalies and novelties, can be described as rare observations that deviate from normal operations or usual behavior. They can result from data acquisition or data processing errors, natural variation, or foreign influence on the system.

Due to the large amount of "normal" data, detecting rare events or anomalies is usually related to big data and veracity. However, the very few and rare observations contain valuable information of high interest for the intended task. Hence, the anomalies themselves fit the definition of small data as given in Sect. II-A. In industrial and engineering applications, detection or prediction of such rare events is crucial, especially in predictive maintenance and quality assurance.

The authors of [64] classify anomalies into three categories:

Global outliers or point anomalies represent individual observations considered anomalous with respect to the rest of the data, usually due to data pollution by noise and outliers. Point anomalies are not distributed into dense regions or clusters, and the observations lie outside the boundaries of the normal regions.

Contextual or conditional anomalies are observations considered anomalous only in a specific context (temporal or spatial) or under specific conditions. Outside the given context, the observations would be considered normal.

Collective anomalies are a collection of observations that form clusters which are considered anomalous with respect to the entire data, in the sense that there is no model for the cluster (see Fig. 1-e). There are two typical cases: a) the data are anomalies, or b) the data points represent a new, previously unseen class which is normal and should be included in the model.

Anomalies are dynamic. Emerging new types of observations make it difficult to obtain or define labels for such occurrences. In order to formally describe an anomaly detection problem, we first consider a set of two labels

  Y := {c_normal, c_anomaly}.   (17)

The training set for machine learning contains normal samples only:

  D_train := {(x, y = c_normal)}.   (18)

Thus, we can learn a model for the c_normal class only. An anomaly detector is to find out whether an unknown observation x_test is normal or an anomaly. Since we have only one class model, the detection decision y_test must be based on some kind of score s(x) ∈ R indicating how well the normal model fits the observation:

  y_test = c_anomaly if s(x_test) < θ, and c_normal otherwise,   (19)

where θ ∈ R denotes a decision threshold. The unknown observation x_test is considered an anomaly if it scores lower than the defined threshold θ.

III. MACHINE LEARNING WITH SMALL DATA

This section presents a taxonomy of ML approaches and algorithms in the context of small data in industrial applications. We will refer to the definition given by T. M. Mitchell [6] in Section I, where a machine learning problem is defined as "learning from experience E concerning some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Experience E can be defined by the domain and its representation as dataset D, along with the learning paradigm. It could be considered a function of an ML approach and supervised information in the form of data D, expert, and "a priori" knowledge. The type and nature of the data D, along with the task T, will determine a suitable approach for solving a particular problem.

In the context of industrial applications, we take the example of product quality prediction. The task T is to distinguish faulty products from good ones. The performance measure P is detection accuracy, based on training experience E (knowledge about what is good) gained by either learning on data or using "a priori" (domain) knowledge about the physical properties of good specimens.
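The product-quality setting above — learning only from good specimens — is exactly the one-class decision rule of Eqs. (18)–(19). A minimal sketch of ours follows, where the Gaussian model of the normal class and the roughly 3-sigma threshold are illustrative assumptions, not prescribed by the text:

```python
import math
import random

random.seed(1)

# Training data: normal samples only, cf. Eq. (18)
train = [random.gauss(0.0, 1.0) for _ in range(500)]

mu = sum(train) / len(train)
var = sum((x - mu) ** 2 for x in train) / len(train)

def score(x):
    """Log-density under the one-class Gaussian model: high = fits the normal model."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Decision threshold θ: flag anything beyond an offset of 3.0 from the mean
# (roughly 3 sigma here, since the fitted variance is close to 1)
theta = score(mu + 3.0)

def detect(x):
    """Decision rule of Eq. (19): anomaly iff the score falls below θ."""
    return "anomaly" if score(x) < theta else "normal"

print(detect(0.2), detect(8.0))
```

Any monotone goodness-of-fit score works in place of the log-density; only the thresholding against θ is essential to Eq. (19).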
[Fig. 3 flowchart: Data available? → (no) Acquisition possible? → (no) another domain's data; (yes) acquisition and processing → Is it enough? → Completely labeled? → (no) Labeling possible? → data labeling → fully labeled / partially labeled / unlabeled dataset]

Fig. 3. Data collection process.

From here, having examples of only good specimens E to detect bad ones T defines the machine learning approach as anomaly or novelty detection.

A. Domain Representation

The first and most important step in applying any machine learning approach is to ensure domain-representative data (D). Fig. 3 gives an overview of the requirements, equally applied to existing data or data that should be collected.

The most important question is: do the data already exist? If not, is their acquisition possible at all? Data collection might be technically and organizationally plausible, but it was never done before. Also, data could be unavailable because of infeasible and/or expensive acquisition, or available but very limited in terms of quantity and quality.

In the case when the data's existence is assumed, it might be that they were not observed before and could not be captured. Many such examples exist in industrial applications, e.g., in predictive maintenance and fault monitoring.

The most extreme case is that there are no assumptions about the form and properties of the data, and only the task, context, or domain of the application is known. Then data collected in another domain or for another purpose could be employed in reinforcement [65] or transfer learning [66], see Section IV-A3.

Another important aspect is data annotation, where the collected data could differ regarding the labeling coverage as unlabeled, partially, or fully labeled. The availability, the size, and the annotation are the main factors determining the appropriate machine learning approach in the context of small data.

B. Taxonomy of Machine Learning

Fig. 4 presents a taxonomy inspired by similar ones found in different studies and ML frameworks (such as [17], [67]). Its objective is to address the most popular approaches and algorithms relevant in the case of small data in industrial applications.

As already mentioned, the data D and the task T will help to make the right decisions and to select the most appropriate solution. Answering the requirements about the availability, completeness, and annotation of the data determines one of the main paradigms in machine learning: data generation, supervised, semi-supervised, unsupervised, transfer, and reinforcement learning.

Depending on the goal, other factors are also important for the optimal choice: discriminative or generative machine learning, discrete or continuous target variables, balanced or imbalanced data, complexity, and so on.

Discriminative models learn the conditional probability of the target given the observable variable, while generative models compute the joint probability distribution of the targets and the observations (the distribution of the individual classes) [68].

The target variables can be quantitative or categorical. Categorical variables are discrete and have a finite number of distinctive groups. Quantitative variables are numerical, either discrete (finite number of values) or continuous (infinite number of values in a defined range).

In the case when we have incomplete data, where target variable samples or feature values are missing, data generation and consolidation should be applied, although the resulting synthetic or surrogate data could still exhibit the limitations of small data.

When the target variables are unknown, not defined, or not labeled, the appropriate approach depends on the criteria used to exploit the hidden and unknown patterns in the data: clustering by similarities, discovering hidden dependencies for dimensionality reduction, or reconstruction with generative models.

Although it may not seem apparent initially, even small data can be very complex and challenging to model with traditional ML approaches. Complexity can be expressed by a set of metrics (e.g., class ambiguity, boundary complexity, sample sparsity, and feature dimensionality) [69] that quantify the data and indicate how complicated the pattern extraction and classification will be.

IV. APPROACHES FOR SMALL DATA

In the following sections, we will focus on our taxonomy's intersecting points with the small data challenges as described in Section II. Some notable approaches are also graphically presented for better understanding.

The legend of the symbols used is given in Table I. The input data points can be unlabeled or labeled. They can have different attributes, e.g., one a geometrical form, another a grade of brightness. Hidden dependencies and relations can be discovered automatically or by human experts. The data can be transformed or converted into another feature space, and their embeddings can be estimated. The learned model knows about the features and the classification boundary between the classes. The flow of the symbols in the graphs is from left to right and from top to bottom.

A. Approaches to Unlabeled Data

This section discusses different approaches that could be employed to overcome the challenges in machine learning with unlabeled data, as introduced in Section II-C.
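As a minimal illustration of unsupervised learning on unlabeled data (cf. Section II-C), the following sketch of ours groups observations by similarity with plain k-means; the 1-D data and the two "operating modes" are made up for illustration:

```python
import random

random.seed(2)

# Unlabeled 1-D readings drawn from two unknown operating modes
data = [random.gauss(0.0, 0.5) for _ in range(100)] + \
       [random.gauss(5.0, 0.5) for _ in range(100)]
random.shuffle(data)

def kmeans(xs, k=2, iters=20):
    """Plain k-means: alternate point assignment and centroid update."""
    lo, hi = min(xs), max(xs)
    # Spread the initial centers over the data range
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            # Assign each point to its nearest center
            clusters[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

print(kmeans(data))  # the two centers settle near the two modes
```

No labels are used anywhere; the structure is recovered from the similarities alone, which is the clustering case of Fig. 5.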
[Figure fragment — taxonomy boxes: Simulations (ML prior to SIM run, ML after SIM run, ML during SIM run); Data Consolidation (removal; imputation: mean, median, mode; re-label to new class; predict); Generative Models (Generative Adversarial Networks, Recurrent Neural Networks, Variational Autoencoder, SMOTE (M-SMOTE), Deep Belief Networks (DBN))]

Fig. 5. Unsupervised learning: a- reconstruction (generative models), b- similarities (clustering), c- patterns (associations), d- dependencies (dimensionality reduction).

Fig. 6. Semi-supervised learning: a- self training, b- self taught, c- active learning, d- graph methods.
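The self-training loop of Fig. 6-a can be sketched as a toy nearest-centroid version (our illustration, not the paper's implementation): a model fitted on the few labeled samples repeatedly pseudo-labels the unlabeled points it is confident about and is then re-fitted:

```python
import random

random.seed(3)

def make_point(label):
    """Toy 1-D feature: class 0 centered at 0.0, class 1 at 4.0."""
    return random.gauss(4.0 * label, 1.0)

# A few labeled samples and many unlabeled ones (the small-data setting)
labeled = [(make_point(y), y) for y in (0, 1) for _ in range(3)]
unlabeled = [make_point(random.randint(0, 1)) for _ in range(200)]

def centroids(pairs):
    """Fit: one centroid per class from the currently labeled pool."""
    return {c: sum(x for x, y in pairs if y == c) /
               sum(1 for _, y in pairs if y == c) for c in (0, 1)}

# Self-training: pseudo-label confident points, adopt them, re-fit, repeat
for _ in range(5):
    cents = centroids(labeled)
    still_unlabeled = []
    for x in unlabeled:
        d = {c: abs(x - cents[c]) for c in (0, 1)}
        pred = min(d, key=d.get)
        if abs(d[0] - d[1]) > 2.0:        # "confident": large margin between classes
            labeled.append((x, pred))     # adopt the pseudo-label
        else:
            still_unlabeled.append(x)     # stays unlabeled for the next round
    unlabeled = still_unlabeled

print(len(labeled), len(unlabeled))
```

The confidence margin of 2.0 is an arbitrary choice here; in practice it would be a tuned probability or margin threshold, and overly optimistic pseudo-labels are the known failure mode of this scheme.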
it to solve different, yet related, tasks. The learning process is much faster, more accurate, and needs less training data than creating a model from scratch. Other terms often related to transfer learning are knowledge transfer, inductive and semi-inductive transfer, meta-learning, incremental learning, etc.

For a better explanation, we will stick to the terms used in [66], [75], [76]. First, we define a source domain D_s, containing a feature space and the marginal probability distributions of its elements. The corresponding source task T_s contains a label space and predictive objective functions f_s(x), which are trained on the source dataset and the source labels. The feature space contains N_s feature vectors (patterns), which must be learned.

Next, we define a target domain D_t and the corresponding target task T_t, also containing a label space and objective functions f_t(x). Usually, the dimension or size of the feature space in the target domain (N_t) is smaller than in the source domain (1 < N_t ≪ N_s), representing the "real" small data problem.

The main objective of transfer learning is to derive a predictive objective function f_t for the target task T_t using information derived from D_s and T_s, see Fig. 8. Different sub-types of transfer learning can be defined depending on the similarity of source and target domain and task data.

Fig. 8. Transfer learning: use abundant data from another domain to discover features and train a model that can be used for the target domain.

In this paper we can only give a brief overview; we refer to the literature for details. Hence, we only use some common sub-types and point out specific requirements. Categorization with respect to the labels of the data set:
• Inductive transfer learning: requires some labeled data for the target domain D_t,
• Semi-inductive transfer learning: no or little labeled data in the target domain D_t, and
• Unsupervised transfer learning: no labeled data at all in D_t and D_s.

The delimitation of the types of transfer learning differs in the literature, see the detailed overview in [66], [76]. As the last part, it remains to explain what kind of knowledge is transferred and how.

According to [66], four general knowledge-transfer categories are dominant, distinguished by the transfer learning method.

The category instance-based transfer learning is a common method: parts of the source datasets are picked and reused by re-weighting them and feeding them to the "usual" learning process using any classifier for the target domain. The learning success is tested on the target dataset. Thus, this method is only applicable if some target data are available.

The second category is feature-based learning. This method aims to extract meaningful structures between source and target domains to find a common feature space that has predictive qualities for both domains.

The third category is parameter-based learning. This method is based on the underlying assumption that source and target tasks share the same parameters. Hence, the parameters must be encoded.

The fourth category is relational-based learning. This method is based on extracting similar relations in target and source domains [101]. There are only a few studies related to transfer learning for industrial and engineering applications, but some transfer learning cases in aerospace can be found in [102].

B. Approaches to Imbalanced Data

As defined in Section II-D, imbalanced data problems occur when one class has a significantly smaller number of samples than the others, for instance, the minority vs. majority class in binary classification tasks [103]. This can be extended to multi-class classification scenarios [104]. The notion of class imbalance is also called class rarity or skewed data [105].

In many real-world applications [106], the minority class is usually of greater interest, since the useful information belongs to this class, e.g., medical diagnostics [107]–[110], detection of banking fraud [111], network intrusion [112], and oil spills [113].

The core issue of imbalanced learning algorithms is that the minority class is often misclassified because prior information of the majority class has more influence during the training and consequently on the prediction [105].

According to [105], [114], there are mainly three strategies to deal with class imbalance: data-level methods, algorithm-level techniques, and hybrid approaches. Here, we only provide the main ideas on each topic. For a detailed explanation and a complete list of references, we refer to [103], [105], [114]–[117].

1) Data-level Methods: The main idea of these methods is to try to balance out the majority and minority classes by making use of sampling methods [117].

Fig. 9. Data level methods: a- ROS, b- RUS, c- SMOTE.

This includes over-sampling minority classes [118] and under-sampling majority classes, e.g., the synthetic minority over-sampling technique (SMOTE) [119], and the random under-sampling (RUS) and random over-sampling (ROS) methods [120], as presented in Fig. 9. Moreover, several deep learning-based data-level methods have been proposed: to achieve class balance, ROS and RUS have been applied to the minority and
10
majority class, respectively [121]. Also, a dynamic sampling technique based on two-phase learning with ROS and RUS [122] and a convolutional neural network (CNN) has been proposed [123].

2) Algorithm-level Methods: While the aforementioned data-level methods change the training data distribution directly, the primary focus of algorithm-level methods is to modify existing algorithms so that more emphasis is placed on minority classes [124]–[126].

Fig. 10. Algorithm-level methods: a- class weights, b- decision threshold.

This can be achieved by introducing class weights (Fig. 10-a) via a cost matrix [105], e.g., a higher weight for the minority class, which is called cost-sensitive learning [127]. However, it is difficult to find an optimal cost matrix, since this process usually requires domain-specific expert knowledge and is not feasible in the case of a large data set or a large number of features [105].

One of the biggest challenges in cost-sensitive learning is the assignment of an effective cost matrix. The cost matrix can be defined empirically, based on past experience, or a domain expert with knowledge of the problem can define it. Alternatively, the false-negative cost can be set to a fixed value while the false-positive cost is varied, using a validation set to identify the ideal cost matrix. The latter has the advantage of exploring a range of costs, but can be expensive and even impractical if the size of the data set or the number of features is too large.

Another approach is to adjust the decision threshold (Fig. 10-b) at test time in such a way that the bias moves from the majority classes towards the minority classes [128], [129]. Moreover, CNN-based methods have been proposed too, e.g., using a different loss function [130] or moving the decision threshold [121].

3) Hybrid Methods: For imbalanced data problems, hybrid methods refer to approaches that combine data-level and algorithm-level methods [131]. As an example, one can apply cost-sensitive learning after ensemble methods with data sampling [105], [117]. According to [132], the classical hybrid methods can usually be applied to small-scale problems, characterized by a non-extreme imbalance ratio and problem-specific, hand-crafted, low-dimensional features. Recently, CNN-based methods have also been proposed, which include large margin local embedding (LMLE), combining a triple-header hinge loss function and quintuplet sampling [133], and deep over-sampling (DOS), using a micro-cluster loss, k-nearest neighbors (k-NN), and over-sampling of the minority class [134].

C. Approaches to Missing Data

Apart from the types of missingness described in Section II-E, missing data can also be interpreted as missing feature values or class samples. Accordingly, standard techniques for dealing with missing data can be divided into three categories.

The first category is related to consolidating and preparing data for machine learning by removing data with missing values, imputation, and, in the case of missing labels, re-labeling and label prediction [135]. The second category encompasses methods for data generation with generative models based on deep neural networks: generative adversarial networks (GANs) [80], variational autoencoders (VAEs), and Deep Belief Networks (DBNs), including the synthetic minority oversampling technique SMOTE. The third group comprises simulation-driven approaches.

1) Data Consolidation: In this context, missing data can be handled in three different ways. The first and most straightforward approach is to accept the data as it is. The second one is to delete the data with missing values from the dataset. That can be done either by list-wise deletion, where observations are removed when even one feature value is missing, or by pair-wise deletion, where it is attempted to use all available data and omit cases on an analysis-by-analysis basis, e.g., correlational analysis [136]. Removing data reduces the amount of valuable information, since many data analysis techniques, including machine learning, are sensitive to missing data. The third approach is the replacement of the missing values by single or multiple imputation and interpolation. Commonly used imputation techniques are mean, median, mode, and regression imputation, Maximum Likelihood Estimation (MLE), and the expectation–maximization (EM) algorithm. Interpolation creates new values for the missing ones by using the nearest feature value. A comprehensive survey of the current methodology for handling missing data is given in [62], and missing data techniques in pattern recognition are reviewed in [137].

2) Generative Models: In this approach, a generative model is trained from an incomplete dataset. The generative model is then used to generate plausible replacements for the missing values. The models represent the data space, and observations can be sampled by a random variable from the latent space. The deep generative models widely used for missing data generation are generative adversarial networks (GANs), variational autoencoders (VAEs), Deep Belief Networks (DBNs), and Recurrent Neural Networks (RNNs).

a) Generative Adversarial Networks: A GAN [80] is a deep generative model which is capable of creating data with the same statistical properties as the available training data. The generative model produces a sample from the model, and an adversary (a discriminative model) learns to distinguish the sample as belonging to the model or to the data distribution. The basic idea can be explained in the context of a non-cooperative zero-sum game in game theory [138]. The two models compete and improve their methods until the point where the generated samples are indistinguishable from the actual data samples (Fig. 11).
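As a minimal illustration of this adversarial game, the sketch below pits a linear generator against a logistic discriminator on one-dimensional Gaussian data and alternates gradient-descent updates between the two players. The architecture, hyperparameters, data, and hand-derived gradients are illustrative assumptions for this toy setting, not taken from the referenced works.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real data: a 1-D Gaussian (mean 4.0) the generator must imitate.
real = lambda n: rng.normal(4.0, 0.5, n)

# Generator G(z) = a*z + b maps standard-normal noise to samples.
a, b = 1.0, 0.0
# Discriminator D(x) = sigmoid(w*x + c) scores "realness".
w, c = 0.1, 0.0
lr, batch = 0.05, 64

for step in range(2000):
    # --- discriminator step (generator fixed) ---
    x_real = real(batch)
    x_fake = a * rng.normal(size=batch) + b
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    # gradient of -[log D(real) + log(1 - D(fake))] w.r.t. (w, c)
    gw = -np.mean((1 - d_real) * x_real) + np.mean(d_fake * x_fake)
    gc = -np.mean(1 - d_real) + np.mean(d_fake)
    w, c = w - lr * gw, c - lr * gc
    # --- generator step (discriminator fixed) ---
    z = rng.normal(size=batch)
    x_fake = a * z + b
    d_fake = sigmoid(w * x_fake + c)
    # non-saturating generator loss -log D(fake), chained through D
    gx = -(1 - d_fake) * w
    a, b = a - lr * np.mean(gx * z), b - lr * np.mean(gx)

# after training, the generated distribution should be centred near the
# real mean, while the discriminator's weights have moved away from zero
```

The alternating structure mirrors the minimax training described in this section; in practice both players are deep networks and the same loop is run on mini-batches with an optimizer such as Adam.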
Fig. 11. Generative Adversarial Networks: a discriminator decides whether randomly sampled data from the domain space can be considered real or fake.

An illustrative example is a generator trying to create images of counterfeit banknotes while a discriminator tries to recognize whether the images are fake or real. The generator’s role is to create an image from a sample of input noise, e.g., noise from a normal or uniform distribution. The discriminator determines the probability of how close the generated sample is to a real one. This information is then fed back into the generator as guidance to mimic essential features of the authentic images during training.

The network type is not restricted. A straightforward implementation consists of multilayer perceptrons for the generator and discriminator. In many applications, the generator is deconvolutional and the discriminator is a convolutional neural network. The training is performed through back-propagation for both networks, where they play a two-player minimax game to find the saddle point instead of a local optimum. The loss function of a GAN is based on cross-entropy such that the following odds are maximized: the created sample is classified as fake, and the reference one is recognized as real. The training procedure is performed in an alternating way utilizing gradient descent (GD): when the generator’s parameters are fixed, a single iteration of gradient descent is carried out on the discriminator; then, another gradient-descent step is performed on the generator with the parameters of the discriminator fixed. When the optimum is reached, the generated samples are indistinguishable from real data, and the discriminator may then be discarded. Based on game theory [138], the convergence of a GAN can be achieved when the generator and the discriminator reach a so-called Nash equilibrium.

The training of GANs is, however, highly challenging [139] due to instability, non-convergence, mode collapse, or vanishing gradients [140]. However, the learning performs well when the right architecture and hyperparameters are selected. To deal with the vanishing gradient and to improve the stability of GANs, the Wasserstein GAN (WGAN) [141] has been proposed, making use of the Wasserstein metric [142]; another approach has also been proposed in the context of semi-supervised learning [143].

In recent years, the GAN has moved from an exotic technique to the mainstream, and the application area of GANs is extending rapidly. The major areas of application are in applied computer vision, but there are increasing numbers in other areas, e.g., missing data imputation [144], [145], image in-painting [146], image style transfer [147], 3D object generation [148], realistic image synthesis [149], [150], object detection [151], high-resolution image blending [152], visual surface inspection [153], anomaly detection in medical images [154] and industrial time series [155], unsupervised fault detection [156], generating molecules for drug discovery [157], synthetic medical patient records [158], music generation [159], [160], cartography [161], and astrophysics and cosmology [162], [163].

b) Variational Autoencoders (VAE): Autoencoders (AEs) are deep neural networks that are primarily used for learning representative features and for dimensionality reduction. The basic training objective is to learn how to reconstruct the network’s input at the output [17]. It is essentially a supervised learning task, although the data set is not required to be labeled.

The architecture contains a hidden layer with few neurons, called a bottleneck layer, which prevents the network from simply passing the input to the output. The layer effectively compresses the input into a code. The network consists of two parts: an encoder that transforms the input data into the code, and a decoder that uses this representation to reconstruct the input data as well as possible. It can be thought of as a standard feed-forward or convolutional network trained with mini-batch gradient descent and backpropagation, aiming to minimize the reconstruction error. There are variants of autoencoders which try to prevent learning the simple identity function and to improve the ability to capture the salient information: AEs can be regularized, sparse, de-noising, and contractive.

Variational autoencoders are a generative variant of autoencoders [100]. The learning algorithm differs from that of ordinary autoencoders: VAEs employ probabilistic mechanisms for describing an observation in the space of latent variables, where the encoder describes a probability distribution for each latent variable instead of providing a single value for each encoding dimension. By sampling from the latent space, the VAE decoder can generate new data similar to that which was used during training. Compared with GANs, they are easier to train but provide worse results due to the MSE-based loss function. Further information and insights about VAEs and their extensions are presented in [164] and [165]. Variational autoencoders have been successfully used for missing data imputation in a simulated milling circuit dataset [166], traffic estimation [167], sensor data [168], and soft-sensor frameworks in industry [169].

c) Deep Belief Networks: Deep Belief Networks (DBNs) [170], [171] are probabilistic generative models consisting of multiple layers of hidden stochastic binary units with weighted connections.

DBNs are hybrid graphical models and can be defined as a stack of simpler networks such as RBMs (restricted Boltzmann machines) or autoencoders. They can be trained in an unsupervised learning task to discover deep hierarchical representations and reconstruct their inputs. The top layers form an associative memory through undirected symmetric connections. The lower layers have directed acyclic connections to convert the associative memory to observable variables; the lowest-layer states represent the data vector.
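As a concrete sketch of one such building block, the toy example below trains a single Bernoulli RBM with one step of contrastive divergence (CD-1) on two repeating binary patterns; a DBN stacks several such layers. The layer sizes, learning rate, and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

n_vis, n_hid, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

# Toy binary data: two repeating patterns.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 20, dtype=float)

for epoch in range(200):
    for v0 in data:
        # positive phase: sample hidden units given the data
        ph0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(n_hid) < ph0).astype(float)
        # negative phase: one reconstruction step (CD-1)
        pv1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.random(n_vis) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b_h)
        # contrastive divergence update
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b_v += lr * (v0 - v1)
        b_h += lr * (ph0 - ph1)

# deterministic reconstruction of one training pattern via mean-field
probe = np.array([1, 1, 1, 0, 0, 0], dtype=float)
ph = sigmoid(probe @ W + b_h)
recon = sigmoid(ph @ W.T + b_v)
# after training, recon should be high on the first three visible units
# and low on the last three
```

In a DBN, the hidden activations of this trained layer would serve as the visible data for the next RBM in the stack.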
DBNs are trained by efficient layer-wise greedy learning using the Contrastive Divergence (CD) algorithm. The first layer models the data as a visible layer; then the learned representation is used as input data for the second layer. The process iterates for the subsequent layers until the last hidden layer is reached, each time propagating the activations of the learned features upward and learning their higher-level representations. Discriminative fine-tuning of a trained DBN can be done by adding a final layer that converts the learned representations into the desired supervised predictions and by back-propagating error derivatives. The trained DBN can be used directly as a generative model by sampling from the top two hidden layers and then generating a sample from the model by drawing from the visible units in a single pass of forward sampling through the rest of the model. DBNs are used for generating and recognizing images, video, and motion-capture data, and for missing data imputation [172], [173], image restoration [174], and data recovery in sensor networks [175]. They are also successfully applied for classification in many other areas, e.g., fault detection [176], vehicle detection [177], bearing fault diagnosis [178], gearbox fault diagnosis [179], time-series forecasting [180], structural health diagnosis [181], etc.

d) Recurrent Neural Networks: RNNs are artificial neural networks [182] which have at least one fixed activation function unit (recurrent unit) with a hidden state that signifies the past knowledge at a given time step [183], [184].

For the input sequence, an element at a time step and the hidden state information are fed into the recurrent unit, which updates the hidden state and returns an output, effectively memorizing the previous inputs and states. The process continues until the end of the sequence, when the RNN outputs the prediction.

This property provides RNNs with the capability of modeling long-term dependencies. They are particularly useful for handling sequential data, such as audio or video signals, as well as stock market data. RNNs are employed in various tasks, including machine translation and time-series data analysis such as automatic speech recognition [185], natural language processing [186], and video classification [187]. RNNs can also be used as generative models, e.g., for text generation [188], image generation [189], and the generation of complex sequences with long-range structure, such as text or online handwriting [190].

The main issues in training RNNs are the difficulties with gradient-descent optimization [191], [192]: the vanishing or exploding gradient prevents the learning of long data sequences. To overcome the vanishing gradient problems, gating-based architectures have been proposed, such as long short-term memory (LSTM) in [193] and, later, the gated recurrent unit (GRU) in [194]. The LSTM is composed of a cell and input, output, and forget gates. The memory cell keeps values over an arbitrary period, and the gates can control the inflows and outflows of information over time [195]. The GRU has fewer parameters and lacks the output gate compared to LSTMs; it achieves a performance comparable to LSTM on long sequences and even better performance on smaller datasets [196]. Variants of the basic RNN architecture are the bidirectional recurrent neural networks (BRNNs) [197], where the output layer gets input from two directions, the past and the future states, at the same time.

The application area of RNNs is quite extensive. In the form of generative models, they were used, e.g., for handling missing data in sequential series [190], clinical time series [198], and medical datasets [199].

3) Simulation-driven Approaches: Simulation refers, in general, to an approach that uses computer programs to perform calculations based on mathematical models that represent real-world physical systems. It is employed when the mathematical model describing the physical system is so complex that it cannot be solved using traditional analytical methods.

In this context, it is crucial in simulation-driven approaches to understand the properties of the system itself, including its underlying processes and the mathematical model, when making decisions. That makes the difference between simulation-driven approaches and those ML approaches that do not impose such specific requirements and are therefore called data-centric or data-driven approaches.

There are three main categories of simulation-driven approaches:
Equation-based modeling (EBM) [200], [201], which is typically described by ordinary differential equations (ODEs) or partial differential equations (PDEs) and thereby represents physical processes well.
Agent-based modeling (ABM) [202]–[204], which is well-suited for assessing the influence on a system described by the actions and interactions of collective entities, i.e., agents.
Hybrid modeling (HM), which makes use of both EBM and ABM, see e.g., [205]–[207].

Moreover, to integrate simulation models and ML, one may think of three scenarios:
1) applying ML before a simulation run,
2) applying ML after a simulation run, and
3) applying ML within a simulation run.

In case (1), ML algorithms prepare input data for a simulation run [208]. If the simulation models are agent-based models (ABM), the ML approach, e.g., a data-driven decision-making algorithm, must be designed so that the concept of a collection of autonomous decision-making entities (agents) can be applied.

In case (2), the simulation’s output is fed directly into an ML algorithm; an application of this case can be found in autonomous driving. Since training ML models for self-driving cars requires a large amount of data, the models are trained not entirely on real data but also on data generated by simulators, e.g., on real data from one city and on generated data for other cities that have characteristics similar to the real one [209].

Case (3) is very suitable for reinforcement learning (RL) [210], e.g., finding an optimal path within a simulation setup. This means that the feedback from trying possible scenarios based on the simulation can be used directly during the training of the ML models, in contrast to cases (1) and (2).
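As a toy illustration of case (2), the sketch below uses a trivial "simulator" (the closed-form projectile range, standing in for an expensive physics code) to generate a labeled training set and then fits a least-squares model to the simulated output; the feature choice and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_range(v, theta, g=9.81):
    """Toy 'simulator': ideal projectile range for launch speed v and angle theta."""
    return v ** 2 * np.sin(2 * theta) / g

# Case (2): run the simulator to generate training data ...
v = rng.uniform(5.0, 25.0, 500)
theta = rng.uniform(0.1, 1.4, 500)
X = (v ** 2 * np.sin(2 * theta)).reshape(-1, 1)          # physics-informed feature
y = simulate_range(v, theta) + rng.normal(0.0, 0.1, 500)  # noisy simulated "measurements"

# ... then feed the simulation output into an ML model (ordinary least squares).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
# w[0] estimates the coefficient 1/g purely from simulated data
```

In realistic settings the simulator is far more expensive than the learned model, which is exactly why such simulation-generated data is attractive for training.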
Simulation has played a significant role in science and engineering, and recent advances of ML push the boundaries of the field further [211]. Hence, one can find numerous applications in various domains: network traffic control in telecommunication [212], physical systems modeling using ML and the finite element method (FEM) [213], optic flow estimation [214], [215], fluid simulation using a CNN-based generative model [216], reservoir simulation in shale gas production [217], an agent-based operational simulation for unmanned aerial vehicles conducting maritime search-and-rescue missions in aerospace design [218], a simulation-based kernel method for prediction modeling using complex synthetic systems in bioinformatics [219], a DL approach to estimate stress distribution [220], a DL method for Reynolds-averaged Navier-Stokes simulations [221], a DL-based prediction model for cosmological structure formation [222], DL-based quantization simulation [223], an image generation method using simulated and unsupervised learning by an adversarial network [224], and mechanical fault detection using a simulation-trained classifier [225].

D. Approaches to Insufficient Data

In the case of insufficient data, the dataset is complete and fully labeled. However, the boundaries between the target categories are ill-defined due to the small number of observations; in consequence, the ML algorithms will not generalize well for unseen data.

In certain instances, the data exhibits an imbalance, and certain categories may be quite rare, with only a few observations available. Approaches to insufficient data can also apply to cases of imbalanced data (see Section IV-B). In any case of insufficient data, traditional ML approaches will fail to provide reliable and useful models. Possible approaches that could take advantage of limited supervised information and prior knowledge for rapid generalization are N-shot learning (NSL) and transfer learning. When a sufficient number of observations are available

Fig. 12. N-shot learning: using prior knowledge about the data and the N samples per class, a boundary is found which can classify unseen data.

When there is only one example per class, the support set S has N = K observations, and it is known as one-shot learning (OSL); when there are few (less than five), it is known as few-shot learning (FSL) [226]. Zero-shot learning (ZSL) is when the objective is to classify unseen classes without any observation available [227], [228], only by having high-level descriptions [27], [229], [230]. One should note that humans have this type of learning ability [231]; e.g., children can identify various different breeds of dogs even if they have encountered only a few of them. However, traditional machine learning and particularly deep neural network (DNN) based learning algorithms will fail in this problem, since they typically require a larger amount of training data to achieve a certain level of accuracy [8], [16], [232]–[234]. As discussed in [235], the difficulties of OSL and FSL can be explained in terms of empirical risk minimization [236] by considering the interplay between the approximation error and the estimation error, which depends on the sample size, based on statistical learning theory [12], [13]. Hence, several methods have been proposed that exploit prior knowledge to deal with the aforementioned unreliability of the empirical risk minimizer concerning data, model, and algorithm. Here, we only tackle the basic ideas of each aspect; for more details, we refer to [226] and the references therein.

a) Data-level Methods: When it comes to data, the basic idea is to provide more data through prior knowledge to obtain the desired sample complexity and reduce the estimation risk. This can be achieved, e.g., by duplicating training datasets using transformations, by borrowing data from other datasets, or by learning augmentation policies from data, as discussed in [237]; for more details, refer to Section IV-C.

b) Model-based Methods: In the case of model-based methods, the primary purpose is to reduce the sample complexity by restricting the hypothesis space leveraging prior knowledge [235] (Fig. 13).

From a probabilistic perspective, much attention has been paid to generative model-based methods [238]–[240], since prior knowledge plays an important role, although there are discriminative model-based approaches such as [241].

Moreover, many NN-based methods have been proposed in recent years by incorporating various architectures of NNs, which sometimes leverage external memory, e.g., Siamese NNs [242] (Fig. 14), matching networks [243] (Fig. 15), meta-learning [244]–[246] with memory-augmented NNs [247], multi-task learning [248], [249], and embedding learning [250].
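A minimal sketch of the embedding idea behind such methods: with a fixed feature map standing in for a pretrained (e.g., Siamese) network, a one-shot classifier reduces to nearest-neighbor matching against the support set. The embedding function and data below are illustrative assumptions.

```python
import numpy as np

# Fixed "embedding" standing in for a pretrained feature extractor;
# real systems use a learned network (e.g., a Siamese branch).
def embed(x):
    return np.tanh(x)

# Support set: one labeled example per class (K classes, N = K) -> one-shot.
support_x = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
support_y = np.array([0, 1, 2])

def one_shot_predict(query):
    # classify by the nearest support example in embedding space
    d = np.linalg.norm(embed(support_x) - embed(query), axis=1)
    return support_y[np.argmin(d)]
```

A query point close to a given support example (in the embedded space) is assigned that example's class, so no conventional training on the target classes is needed.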
Noise will have a considerable influence on the resulting data model; hence, the handling of noise is crucial for the identification of real anomalies, and different data pre-processing techniques [267] have to be utilized to achieve robust modeling.

Similar to other ML categories, anomaly detection uses the well-known supervised, semi-supervised, and unsupervised approaches.

1) Supervised Learning: In supervised learning, both multi-class and binary classification can also be employed when only the prior information about the normal observations is available. The multi-class classifiers are trained on multiple normal classes; then, an anomalous sample is detected when it is not classified into any of the normal classes using some confidence threshold (Fig. 17).

Fig. 17. Supervised anomaly detection.

If labeled anomalous samples are available beforehand, then it is possible to train a model treating both classes, anomalous and normal, as part of a standard binary classification task. Usually, this case leverages methods from imbalanced data learning (described in Section II-D), since the anomalous class is mostly the minority class.

2) Semi-supervised Learning: In semi-supervised approaches, it is assumed that there are enough labeled observations of the normal data or behavior, and the discriminative boundary can be learned, where any samples outside the boundary are considered anomalous or novel (Fig. 18).

3) Unsupervised Learning: In unsupervised approaches, it is assumed that the normal observations are much more frequent than the anomalous ones (Fig. 19). The common methods assess the dissimilarities in anomalous observations through the utilization of statistical, probabilistic, linear, proximity, and deviation models [273].

Fig. 19. Unsupervised anomaly detection.

An extreme-value analysis (EVA) is applied to identify outliers according to their fit to statistical-probabilistic models, e.g., Gaussian mixture models optimized by expectation-maximization (EM), or to reconstruction errors from dimension-reduction methods (PCA, autoencoders) [273].

Proximity approaches can be applied as clustering (k-means), statistical (Histogram-based Outlier Score, HBOS [274]), and k-nearest neighbor methods (k-NN [275], the Local Outlier Factor (LOF) [276], and its variants). A comprehensive overview of unsupervised approaches is given in [277].

There is a wide range of publications about anomaly detection in different application domains [273]: cyber-security, financial fraud, medical anomalies, and industrial applications. The main application of anomaly detection in industrial cases is state prediction and condition or health monitoring. Bearing-fault diagnostics in rotary machines [278], e.g., wind turbines [279] and gas turbines [280]–[282], turbomachines in the petroleum industry [283], motor bearing fault detection [284], and gearbox fault diagnosis [285], [286] are just a few examples among many others.

The major deep learning techniques that have been applied in machine health monitoring are autoencoders, RBMs, CNNs, RNNs, and their corresponding variants [287]. An overview of the ML approaches emphasizing deep learning methods is given in [288], [289].
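The distance-based idea behind the k-NN family of unsupervised detectors can be sketched in a few lines; the data, the choice of k, and the scoring rule below are illustrative assumptions (LOF additionally normalizes such distances by the local density of each point's neighborhood).

```python
import numpy as np

rng = np.random.default_rng(4)

# Unlabeled data: a dense "normal" cluster plus one far-away point.
normal = rng.normal(0.0, 1.0, size=(200, 2))
data = np.vstack([normal, [[8.0, 8.0]]])

def knn_scores(X, k=5):
    # anomaly score = mean distance to the k nearest neighbors
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, 1:k + 1].mean(axis=1)  # column 0 is the zero self-distance

scores = knn_scores(data)
# the injected far-away point should receive the largest score
```

Points embedded in the dense cluster have small neighbor distances, while the isolated observation stands out, matching the assumption above that normal samples dominate the data.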
V. CONCLUSION
REFERENCES

[1] J. Bughin, E. Hazan, S. Ramaswamy, M. Chui, T. Allas, P. Dahlstrom, N. Henke, and M. Trench, “Artificial intelligence: The next digital frontier?” McKinsey Global Institute, Discussion paper, 2017. [Online]. Available: https://apo.org.au/node/210501
[2] J. Roberts, “Thinking machines: The search for artificial intelligence,” Distillations, Summer 2016.
[3] A. M. Turing, “Computing machinery and intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950.
[4] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, vol. 349, no. 6245, pp. 255–260, 2015.
[5] P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78–87, 2012.
[6] T. M. Mitchell, Machine Learning. McGraw-Hill Education Ltd, 1997.
[7] A. Castrounis, “Machine learning: An in-depth guide - overview, goals, learning types, and algorithms,” 2016, accessed: 2021-03-21. [Online]. Available: https://www.innoarchitech.com/blog/machine-learning-an-in-depth-non-technical-guide
[8] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[9] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608014002135
[10] D. Reinsel, J. Gantz, and J. Rydning, “Data age 2025. The digitization of the world from edge to core,” 11 2018, an IDC White Paper, sponsored by Seagate. US44413318.
[11] J. Hinds, E. J. Williams, and A. N. Joinson, ““It wouldn’t happen to me”: Privacy concerns and perspectives following the Cambridge Analytica scandal,” International Journal of Human-Computer Studies, vol. 143, p. 102498, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1071581920301002
[12] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer New York, 1995.
[13] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., ser. Springer Series in Statistics. Springer, 2009.
[14] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley-Interscience, 2000.
[15] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1097–1105.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, ser. Adaptive Computation and Machine Learning series. MIT Press, 2016.
[18] J. J. Faraway and N. H. Augustin, “When small data beats big data,” Statistics and Probability Letters, vol. 136, no. C, pp. 142–145, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/
[26] Z. Ge, Z. Song, S. X. Ding, and B. Huang, “Data mining and analytics in the process industry: The role of machine learning,” IEEE Access, vol. 5, pp. 20590–20616, 2017.
[27] P. Larrañaga, D. Atienza, J. Diaz-Rozo, A. Ogbechie, C. E. Puerto-Santana, and C. Bielza, Industrial Applications of Machine Learning, ser. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press, 2019.
[28] A. Mayr, D. Kißkalt, M. Meiners, B. Lutz, F. Schäfer, R. Seidel, A. Selmaier, J. Fuchs, M. Metzner, A. Blank, and J. Franke, “Machine learning in production – potentials, challenges and exemplary applications,” Procedia CIRP, vol. 86, pp. 49–54, 2019, 7th CIRP Global Web Conference – Towards shifted production value stream patterns through inference of data, models, and technology (CIRPe 2019). [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2212827120300445
[29] M. Mowbray, M. Vallerio, C. Perez-Galvan, D. Zhang, A. Del Rio Chanona, and F. J. Navarro-Brull, “Industrial data science – a review of machine learning applications for chemical and process industries,” React. Chem. Eng., 2022. [Online]. Available: http://dx.doi.org/10.1039/D1RE00541C
[30] H. Lee, Y. Kim, and C. Kim, “A deep learning model for robust wafer fault monitoring with sensor measurement noise,” IEEE Transactions on Semiconductor Manufacturing, vol. 30, no. 1, pp. 23–31, 2 2017.
[31] R. Liu, B. Yang, E. Zio, and X. Chen, “Artificial intelligence for fault diagnosis of rotating machinery: A review,” Mechanical Systems and Signal Processing, vol. 108, pp. 33–47, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0888327018300748
[32] W. Nash, T. Drummond, and N. Birbilis, “A review of deep learning in the study of materials degradation,” npj Materials Degradation, vol. 2, no. 1, pp. 1–12, 2018.
[33] S. Namuduri, B. N. Narayanan, V. S. P. Davuluru, L. Burton, and S. Bhansali, “Review—deep learning methods for sensor based predictive maintenance and future perspectives for electrochemical sensors,” Journal of The Electrochemical Society, vol. 167, no. 3, p. 037552, Jan 2020. [Online]. Available: https://doi.org/10.1149/1945-7111/ab67a8
[34] K. K. Yang, Z. Wu, and F. H. Arnold, “Machine-learning-guided directed evolution for protein engineering,” Nature Methods, vol. 16, no. 8, pp. 687–694, 2019.
[35] Y. Zhang, S. Qiao, S. Ji, N. Han, D. Liu, and J. Zhou, “Identification of DNA–protein binding sites by bootstrap multiple convolutional neural networks on sequence information,” Engineering Applications of Artificial Intelligence, vol. 79, pp. 58–66, 2019.
[36] T. Shaikhina, D. Lowe, S. Daga, D. Briggs, R. Higgins, and N. Khovanova, “Machine learning for predictive modelling based on small data in biomedical engineering,” IFAC-PapersOnLine, vol. 48, no. 20, pp. 469–474, 2015, 9th IFAC Symposium on Biological and Medical Systems BMS 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405896315020765
[37] H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande, “Low data drug discovery with one-shot learning,” ACS Central Science, vol. 3,
S0167715218300762
no. 4, pp. 283–293, apr 2017.
[19] M. Lindstrom, Small data: The tiny clues that uncover huge trends. [38] J. Vamathevan, D. Clark, P. Czodrowski, I. Dunham, E. Ferran, G. Lee,
St. Martin’s Press, 2016. B. Li, A. Madabhushi, P. Shah, M. Spitzer et al., “Applications of
[20] X. Li, S. Deng, S. Wang, Z. Lv, and L. Wu, “Review of small data machine learning in drug discovery and development,” Nature Reviews
learning methods,” in Proc. IEEE Computer Software and Applications Drug Discovery, vol. 18, no. 6, pp. 463–477, 2019.
Conference, vol. 02, 2018, pp. 106–109. [39] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp,
[21] G.-J. Qi and J. Luo, “Small data challenges in big data era: A survey of P. Goyal, L. D. Jackel, M. P. Monfort, U. Muller, J. Zhang, X. Zhang,
recent progress on unsupervised and semi-supervised methods,” IEEE J. J. Zhao, and K. Zieba, “End to end learning for self-driving cars,”
Transactions on Pattern Analysis and Machine Intelligence, vol. 44, CoRR, vol. abs/1604.07316, 2016.
no. 4, pp. 2168–2187, 2022. [40] H. Salehi and R. Burgueño, “Emerging artificial intelligence methods
[22] M. Olson, A. Wyner, and R. Berk, “Modern neural networks generalize in structural engineering,” Engineering Structures, vol. 171, pp.
on small data sets,” in Proc. Advances in Neural Information Pro- 170–189, 09 2018. [Online]. Available: https://www.sciencedirect.
cessing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, com/science/article/pii/S0141029617335526
N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, [41] L. Hou, H. Chen, G. K. Zhang, and X. Wang, “Deep learning-
Inc., 2018, pp. 3619–3628. based applications for safety management in the aec industry: A
[23] D. Verma, G. Bent, G. de Mel, and C. Simpkin, “Machine learning review,” Applied Sciences, vol. 11, no. 2, 2021. [Online]. Available:
approaches for small data in sensor fusion applications,” in Proc. SPIE, https://www.mdpi.com/2076-3417/11/2/821
vol. 10635. International Society for Optics and Photonics, 2018. [42] R. Swischuk, L. Mainini, B. Peherstorfer, and K. Willcox, “Projection-
[24] T. Wuest, D. Weimer, C. Irgens, and K.-D. Thoben, “Machine learning based model reduction: Formulations for physics-based machine
in manufacturing: advantages, challenges, and applications,” Produc- learning,” Computers & Fluids, vol. 179, pp. 704–717, 2019.
tion & Manufacturing Research, vol. 4, no. 1, pp. 23–45, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/
[25] J. Wang, Y. Ma, L. Zhang, R. X. Gao, and D. Wu, “Deep S0045793018304250
learning for smart manufacturing: Methods and applications,” [43] N. Thuerey, K. Weißenow, L. Prantl, and X. Hu, “Deep learning
Journal of Manufacturing Systems, vol. 48, Part C, pp. 144–156, methods for Reynolds-averaged Navier-Stokes simulations of airfoil
2018, special Issue on Smart Manufacturing. [Online]. Available: flows,” AIAA Journal, vol. 58, no. 1, pp. 25–36, 2020. [Online].
https://www.sciencedirect.com/science/article/pii/S0278612518300037 Available: https://doi.org/10.2514/1.J058291
[44] T. Kirchdoerfer and M. Ortiz, “Data-driven computational mechanics,” Computer Methods in Applied Mechanics and Engineering, vol. 304, pp. 81–101, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0045782516300238
[45] Y. Zhang and C. Ling, “A strategy to apply machine learning to small datasets in materials science,” npj Computational Materials, vol. 4, no. 1, pp. 1–8, 2018.
[46] J. Schmidt, M. R. Marques, S. Botti, and M. A. Marques, “Recent advances and applications of machine learning in solid-state materials science,” npj Computational Materials, vol. 5, no. 1, pp. 1–36, 2019.
[47] G. Pilania, “Machine learning in materials science: From explainable predictions to autonomous design,” Computational Materials Science, vol. 193, p. 110360, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0927025621000859
[48] S. Chibani and F.-X. Coudert, “Machine learning approaches for the prediction of materials properties,” APL Materials, vol. 8, no. 8, p. 080701, 2020. [Online]. Available: https://doi.org/10.1063/5.0018384
[49] J. E. Gubernatis and T. Lookman, “Machine learning in materials design and discovery: Examples from the present and suggestions for the future,” Physical Review Materials, vol. 2, p. 120301, 2018. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevMaterials.2.120301
[50] F. E. Bock, R. C. Aydin, C. J. Cyron, N. Huber, S. R. Kalidindi, and B. Klusemann, “A review of the application of machine learning and data mining approaches in continuum materials mechanics,” Frontiers in Materials, vol. 6, 2019. [Online]. Available: https://www.frontiersin.org/article/10.3389/fmats.2019.00110
[51] J. A. George and J. A. Rodger, Smart Data, ser. Wiley Series in Systems Engineering and Management. John Wiley & Sons, 2010.
[52] F. Iafrate, From Big Data to Smart Data. John Wiley & Sons, 2015.
[53] N. Wagner and J. M. Rondinelli, “Theory-guided machine learning in materials science,” Frontiers in Materials, vol. 3, 2016.
[54] D. Boyd and K. Crawford, “Critical questions for big data,” Information, Communication & Society, vol. 15, no. 5, pp. 662–679, 2012.
[55] M. Hilbert, “Big data for development: A review of promises and challenges,” Development Policy Review, vol. 34, no. 1, pp. 135–174, 2016.
[56] R. Kitchin and G. McArdle, “What makes big data, big data? Exploring the ontological characteristics of 26 datasets,” Big Data & Society, vol. 3, no. 1, p. 2053951716631130, 2016.
[57] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[58] M. Deisenroth, A. Faisal, and C. Ong, Mathematics for Machine Learning. Cambridge University Press, 2020. [Online]. Available: https://books.google.de/books?id=pbONxAEACAAJ
[59] J. Ortigosa-Hernández, I. Inza, and J. A. Lozano, “Measuring the class-imbalance extent of multi-class problems,” Pattern Recognition Letters, vol. 98, pp. 32–38, 2017.
[60] R. A. Dara, M. S. Kamel, and N. Wanas, “Data dependency in multiple classifier systems,” Pattern Recognition, vol. 42, no. 7, pp. 1260–1273, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S003132030800513X
[61] D. B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976.
[62] R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, 3rd ed., ser. Wiley Series in Probability and Statistics. John Wiley & Sons, 2019, vol. 793.
[63] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
[64] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, vol. 41, no. 3, p. 15, 2009.
[65] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: Bradford Books, 2018.
[66] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2021.
[67] K. P. Murphy, Probabilistic Machine Learning: An Introduction. MIT Press, 2021. [Online]. Available: probml.ai
[68] T. Jebara, Machine Learning: Discriminative and Generative, ser. The Springer International Series in Engineering and Computer Science. Springer Science & Business Media, 2012, vol. 755.
[69] M. Basu and T. K. Ho, Eds., Data Complexity in Pattern Recognition, ser. Advanced Information and Knowledge Processing. Springer Science & Business Media, 2006.
[70] A. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré, “Data programming: Creating large training sets, quickly,” 2017.
[71] B. Settles, “Active learning literature survey,” University of Wisconsin–Madison, Computer Sciences Technical Report 1648, 2009. [Online]. Available: http://digital.library.wisc.edu/1793/60660
[72] G. Hinton and T. J. Sejnowski, Unsupervised Learning: Foundations of Neural Computation. The MIT Press, May 1999. [Online]. Available: https://doi.org/10.7551/mitpress/7011.001.0001
[73] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, Mar. 2009.
[74] J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.
[75] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/5288526
[76] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, no. 9, pp. 1–40, 2016.
[77] P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, ser. Proceedings of Machine Learning Research, I. Guyon, G. Dror, V. Lemaire, G. Taylor, and D. Silver, Eds., vol. 27. Bellevue, Washington, USA: PMLR, 02 Jul 2012, pp. 37–49. [Online]. Available: http://proceedings.mlr.press/v27/baldi12a.html
[78] S. Feng, H. Zhou, and H. Dong, “Using deep neural network with small dataset to predict material defects,” Materials & Design, vol. 162, pp. 300–310, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0264127518308682
[79] A.-r. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, vol. 1, 2009, pp. 1–9.
[80] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014.
[81] N. Grira, M. Crucianu, and N. Boujemaa, “Unsupervised and semi-supervised clustering: a brief survey,” A Review of Machine Learning Techniques for Processing Multimedia Content, vol. 1, pp. 9–16, 2004, report of the MUSCLE European Network of Excellence (FP6).
[82] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[83] S. Kotsiantis and D. Kanellopoulos, “Association rules mining: A recent overview,” GESTS International Transactions on Computer Science and Engineering, vol. 32, no. 1, pp. 71–82, 2006. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.6295
[84] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” in ACM SIGMOD Record, vol. 22. ACM Press, 1993, pp. 207–216.
[85] M. J. Zaki, “Scalable algorithms for association mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372–390, 2000. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/846291
[86] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in ACM SIGMOD Record, vol. 29. ACM, 2000, pp. 1–12.
[87] A. Dogan and D. Birant, “Machine learning and data mining in manufacturing,” Expert Systems with Applications, vol. 166, p. 114060, 2020.
[88] T. Bai, Y. Zhang, X. Wang, L. Duan, and J. Wang, “Bearing defect signature analysis based on a SAX-based association rule mining,” in 2016 IEEE International Conference on Prognostics and Health Management (ICPHM). IEEE, Jun. 2016, pp. 1–7.
[89] Z. Yang, W. H. Tang, A. Shintemirov, and Q. H. Wu, “Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 39, no. 6, pp. 597–610, Nov. 2009.
[90] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–22, 1977. [Online]. Available: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1977.tb01600.x
[91] I. T. Jolliffe, Principal Component Analysis. Springer, 2002.
[92] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[93] M. E. Wall, A. Rechtsteiner, and L. M. Rocha, “Singular value decomposition and principal component analysis,” in A Practical Approach to Microarray Data Analysis. Springer, 2003, pp. 91–109.
[94] J. Bekker and J. Davis, “Learning from positive and unlabeled data: A survey,” Machine Learning, vol. 109, no. 4, pp. 719–760, Apr. 2020.
[95] X. Zhu, “Semi-supervised learning literature survey,” Computer Sciences, University of Wisconsin-Madison, Tech. Rep. 1530, 2005. [Online]. Available: https://minds.wisconsin.edu/handle/1793/60444
[96] X. Zhu and A. B. Goldberg, “Introduction to semi-supervised learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–130, 2009.
[97] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in Proc. of the 24th International Conference on Machine Learning. Association for Computing Machinery, 2007, pp. 759–766.
[98] X. Zhu, J. Lafferty, and R. Rosenfeld, “Semi-supervised learning with graphs,” Ph.D. dissertation, Carnegie Mellon University, Language Technologies Institute, School of . . . , 2005.
[99] X. Ning, X. Wang, S. Xu, W. Cai, L. Zhang, L. Yu, and W. Li, “A review of research on co-training,” Concurrency and Computation: Practice and Experience, vol. n/a, no. n/a, p. e6276, Mar. 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.6276
[100] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv e-prints, p. arXiv:1312.6114, Dec. 2013.
[101] Y. Sawada and K. Kozuka, “Transfer learning method using multi-prediction deep Boltzmann machines for a small scale dataset,” in 14th IAPR International Conference on Machine Vision Applications, 2015, pp. 110–113.
[102] A. T. W. Min, R. Sagarna, A. Gupta, Y. Ong, and C. K. Goh, “Knowledge transfer through machine learning in aircraft design,” IEEE Computational Intelligence Magazine, vol. 12, no. 4, pp. 48–60, 2017.
[103] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Systems with Applications, vol. 73, pp. 220–239, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417416307175
[104] S. Wang and X. Yao, “Multiclass imbalance problems: Analysis and potential solutions,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 4, pp. 1119–1130, 2012.
[105] J. M. Johnson and T. M. Khoshgoftaar, “Survey on deep learning with class imbalance,” Journal of Big Data, vol. 6, no. 1, p. 27, 2019.
[106] N. Japkowicz, “The class imbalance problem: Significance and strategies,” in Proc. International Conference on Artificial Intelligence, vol. 56, 2000, pp. 111–117.
[107] L. J. Mena and J. A. Gonzalez, “Machine learning for imbalanced datasets: Application in medical diagnostic,” in Proc. International Florida Artificial Intelligence Research Society Conference, G. Sutcliffe and R. Goebel, Eds. AAAI Press, 2006, pp. 574–579. [Online]. Available: http://www.aaai.org/Library/FLAIRS/2006/flairs06-113.php
[108] D.-C. Li, C.-W. Liu, and S. C. Hu, “A learning method for the class imbalance problem with medical data sets,” Computers in Biology and Medicine, vol. 40, no. 5, pp. 509–518, 2010.
[109] H. Parvin, B. Minaei-Bidgoli, and H. Alizadeh, “Detection of cancer patients using an innovative method for learning at imbalanced datasets,” in Rough Sets and Knowledge Technology, Y. J., R. S., W. G., and S. Z., Eds., 2011, pp. 376–381.
[110] M. Wang, X. Yao, and Y. Chen, “An imbalanced-data processing algorithm for the prediction of heart attack in stroke patients,” IEEE Access, vol. 9, pp. 25394–25404, 2021.
[111] W. Wei, J. Li, L. Cao, Y. Ou, and J. Chen, “Effective detection of sophisticated online banking fraud on extremely imbalanced data,” World Wide Web, vol. 16, no. 4, pp. 449–475, 2013.
[112] D. A. Cieslak, N. V. Chawla, and A. Striegel, “Combating imbalance in network intrusion datasets,” in IEEE International Conference on Granular Computing, 2006, pp. 732–737.
[113] X. Q. Ouyang, Y. P. Chen, and B. H. Wei, “Experimental study on class imbalance problem using an oil spill training data set,” Journal of Advances in Mathematics and Computer Science, vol. 21, no. 5, pp. 1–9, 2017.
[114] B. Krawczyk, “Learning from imbalanced data: open challenges and future directions,” Progress in Artificial Intelligence, vol. 5, pp. 221–232, 2016.
[115] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
[116] G. M. Weiss, “Mining with rarity: A unifying framework,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 7–19, 2004.
[117] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[118] H. Cao, X. Li, D. Y. Woon, and S. Ng, “Integrated oversampling for imbalanced time series classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 12, pp. 2809–2822, 2013.
[119] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, no. 1, pp. 321–357, 2002.
[120] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano, “Experimental perspectives on learning from imbalanced data,” in Proc. International Conference on Machine Learning. Association for Computing Machinery, 2007, pp. 935–942.
[121] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of the class imbalance problem in convolutional neural networks,” Neural Networks, vol. 106, pp. 249–259, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608018302107
[122] H. Lee, M. Park, and J. Kim, “Plankton classification on imbalanced large scale database via convolutional neural networks with transfer learning,” in IEEE International Conference on Image Processing, 2016, pp. 3713–3717.
[123] S. Pouyanfar, Y. Tao, A. Mohan, H. Tian, A. S. Kaseb, K. Gauen, R. Dailey, S. Aghajanzadeh, Y. Lu, S. Chen, and M. Shyu, “Dynamic sampling in convolutional neural networks for imbalanced data classification,” in IEEE Conference on Multimedia Information Processing and Retrieval, 2018, pp. 112–117.
[124] Y. Lin, Y. Lee, and G. Wahba, “Support vector machines for classification in nonstandard situations,” Machine Learning, vol. 46, no. 1, pp. 191–202, 2002.
[125] R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Machine Learning: ECML 2004, ser. Lecture Notes in Artificial Intelligence (LNAI). Springer Berlin Heidelberg, 2004, vol. 3201, pp. 39–50.
[126] G. Wu and E. Y. Chang, “KBA: kernel boundary alignment considering imbalanced data distribution,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 786–795, 2005.
[127] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Learning from Imbalanced Data Sets. Springer, 2018, vol. 11.
[128] J. J. Chen, C.-A. Tsai, H. Moon, H. Ahn, J. J. Young, and C.-H. Chen, “Decision threshold adjustment in class prediction,” SAR and QSAR in Environmental Research, vol. 17, no. 3, pp. 337–352, 2006.
[129] B. Krawczyk and M. Woźniak, “Cost-sensitive neural network with ROC-based moving threshold for imbalanced classification,” in Proc. Intelligent Data Engineering and Automated Learning. Springer, Cham, 2015, pp. 45–52.
[130] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007. [Online]. Available: https://ieeexplore.ieee.org/document/8237586
[131] S. Wang, Z. Li, W. Chao, and Q. Cao, “Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning,” in Proc. International Joint Conference on Neural Networks, 2012, pp. 1–8.
[132] Q. Dong, S. Gong, and X. Zhu, “Imbalanced deep learning by minority class incremental rectification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 6, pp. 1367–1381, 2019.
[133] C. Huang, Y. Li, C. C. Loy, and X. Tang, “Learning deep representation for imbalanced classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5375–5384. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/Huang_Learning_Deep_Representation_CVPR_2016_paper.html
[134] S. Ando and C. Y. Huang, “Deep over-sampling framework for classifying imbalanced data,” in Proc. Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2017, pp. 770–785.
[135] C. K. Enders, Applied Missing Data Analysis, ser. Methodology in the Social Sciences. Guilford Press, 2010.
[136] J. L. Peugh and C. K. Enders, “Missing data in educational research: A review of reporting practices and suggestions for improvement,” Review of Educational Research, vol. 74, no. 4, pp. 525–556, 2004.
[137] P. J. García-Laencina, J.-L. Sancho-Gómez, and A. R. Figueiras-Vidal, “Pattern classification with missing data: a review,” Neural Computing and Applications, vol. 19, no. 2, pp. 263–282, 2010.
[138] R. B. Myerson, Game Theory: Analysis of Conflict. Harvard University Press, 1997.
[139] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” 2017.
[140] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen Netzen,” Diploma thesis, Technical University of Munich, 1991.
[141] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” 2017.
[142] L. N. Vaserstein, “Markov processes over denumerable products of spaces, describing large systems of automata,” Problems of Information Transmission, vol. 5, no. 3, pp. 47–52, 1969. [Online]. Available: http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=ppi&paperid=1811&option_lang=eng
[143] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29. Curran Associates, Inc., 2016, pp. 2234–2242.
[144] J. Yoon, J. Jordon, and M. Van Der Schaar, “GAIN: Missing data imputation using generative adversarial nets,” arXiv preprint arXiv:1806.02920, vol. 80, pp. 5689–5698, 2018.
[145] Y. Luo, X. Cai, Y. Zhang, J. Xu, and X. Yuan, “Multivariate time series imputation with generative adversarial networks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, pp. 1603–1614.
[146] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[147] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
[148] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling,” arXiv preprint arXiv:1610.07584, pp. 82–90, 2016.
[149] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. International Conference on Learning Representations, 2016.
[150] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” in Proc. of the 6th International Conference on Learning Representations. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=Hk99zCeAb
[151] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Perceptual generative adversarial networks for small object detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 1951–1959.
[152] H. Wu, S. Zheng, J. Zhang, and K. Huang, “GP-GAN: Towards realistic high-resolution image blending,” in Proceedings of the 27th ACM International Conference on Multimedia. Association for Computing Machinery, 2019, pp. 2487–2495. [Online]. Available: https://doi.org/10.1145/3343031.3350944
[153] W. Zhai, J. Zhu, Y. Cao, and Z. Wang, “A generative adversarial network based framework for unsupervised visual surface inspection,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Apr. 2018, pp. 1283–1287.
[154] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in Proc. International Conference on Information Processing in Medical Imaging, 2017, pp. 146–157.
[155] W. Jiang, Y. Hong, B. Zhou, X. He, and C. Cheng, “A GAN-based anomaly detection approach for imbalanced industrial time series,” IEEE Access, vol. 7, pp. 143608–143619, 2019.
[156] P. Spyridon and Y. S. Boutalis, “Generative adversarial networks for unsupervised fault detection,” in 2018 European Control Conference (ECC), June 2018, pp. 691–696.
[157] M. Benhenda, “ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity?” 2017.
[158] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Generating multi-label discrete patient records using generative adversarial networks,” in Proceedings of the 2nd Machine Learning for Healthcare Conference, ser. Proceedings of Machine Learning Research, F. Doshi-Velez, J. Fackler, D. Kale, R. Ranganath, B. Wallace, and J. Wiens, Eds., vol. 68. Boston, Massachusetts: PMLR, 18–19 Aug 2017, pp. 286–305.
[159] D. Eck and J. Schmidhuber, “Finding temporal structure in music: blues improvisation with LSTM recurrent networks,” in Proc. IEEE Workshop on Neural Networks for Signal Processing, 2002, pp. 747–756.
[160] L. Yang, S. Chou, and Y. Yang, “MidiNet: A convolutional generative adversarial network for symbolic-domain music generation,” in Proc. International Society for Music Information Retrieval Conference. International Society for Music Information Retrieval, 2017, pp. 324–331.
[161] Y. Kang, S. Gao, and R. E. Roth, “Transferring multiscale map styles using generative adversarial networks,” International Journal of Cartography, vol. 5, no. 2-3, pp. 115–141, 2019.
[162] K. Schawinski, C. Zhang, H. Zhang, L. Fowler, and G. K. Santhanam, “Generative adversarial networks recover features in astrophysical images of galaxies beyond the deconvolution limit,” Monthly Notices of the Royal Astronomical Society: Letters, vol. 467, no. 1, pp. L110–L114, 2017.
[163] M. Mustafa, D. Bard, W. Bhimji, Z. Lukić, R. Al-Rfou, and J. M. Kratochvil, “CosmoGAN: creating high-fidelity weak lensing convergence maps using generative adversarial networks,” Computational Astrophysics and Cosmology, vol. 6, no. 1, p. 1, 2019.
[164] C. Doersch, “Tutorial on variational autoencoders,” ArXiv, vol. abs/1606.05908, 2016.
[165] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019. [Online]. Available: http://dx.doi.org/10.1561/2200000056
[166] J. T. McCoy, S. Kroon, and L. Auret, “Variational autoencoders for missing data imputation with application to a simulated milling circuit,” IFAC-PapersOnLine, vol. 51, no. 21, pp. 141–146, 2018, 5th IFAC Workshop on Mining, Mineral and Metal Processing MMM 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2405896318320949
[167] G. Boquet, J. L. Vicario, A. Morell, and J. Serrano, “Missing data in traffic estimation: A variational autoencoder imputation method,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 2882–2886.
[168] N. Jaques, S. Taylor, A. Sano, and R. Picard, “Multimodal autoencoder: A deep learning approach to filling in missing sensor data and enabling better mood prediction,” in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), Oct. 2017, pp. 202–208.
[169] R. Xie, N. Magbool Jan, K. Hao, L. Chen, and B. Huang, “Supervised variational autoencoders for soft sensor modeling with missing data,” IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2820–2828, 2019.
[170] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006. [Online]. Available: https://science.sciencemag.org/content/313/5786/504
[171] G. E. Hinton, “Deep belief networks,” Scholarpedia, vol. 4, no. 5, p. 5947, 2009.
[172] Z. Chen, S. Liu, K. Jiang, H. Xu, and X. Cheng, “A data imputation method based on deep belief network,” in 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, Oct. 2015, pp. 1238–1243.
[173] X. Zhang, H. Zhang, and G. Gao, “Missing feature reconstruction methods for robust speaker identification,” in 2014 22nd European Signal Processing Conference (EUSIPCO), Sep. 2014, pp. 1482–1486.
[174] T. Nakashika, T. Takiguchi, and Y. Ariki, “High-frequency restoration using deep belief nets for super-resolution,” in 2013 International Conference on Signal-Image Technology Internet-Based Systems, Dec. 2013, pp. 38–42.
[175] J. Du, H. Chen, and W. Zhang, “A deep learning method for data recovery in sensor networks using effective spatio-temporal correlation data,” Sensor Review, vol. 39, no. 2, pp. 208–217, 2019.
[176] P. Tang, K. Peng, K. Zhang, Z. Chen, X. Yang, and L. Li, “A deep belief network-based fault detection method for nonlinear processes,” IFAC-PapersOnLine, vol. 51, no. 24, pp. 9–14, 2018, 10th IFAC Symposium on Fault Detection, Supervision and Safety for Technical Processes SAFEPROCESS 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2405896318322134
[177] H. Wang, Y. Cai, and L. Chen, “A vehicle detection algorithm based on deep belief network,” The Scientific World Journal, vol. 2014, 2014, article ID 647380. [Online]. Available: https://www.hindawi.com/journals/tswj/2014/647380/
20
[178] H. Shao, H. Jiang, H. Zhang, and T. Liang, "Electric locomotive bearing fault diagnosis using a novel convolutional deep belief network," IEEE Transactions on Industrial Electronics, vol. 65, no. 3, pp. 2727–2736, 2017.
[179] Z. Chen, C. Li, and R.-V. Sánchez, "Multi-layer neural network with deep belief network for gearbox fault diagnosis," Journal of Vibroengineering, vol. 17, no. 5, pp. 2379–2392, 2015.
[180] T. Kuremoto, S. Kimura, K. Kobayashi, and M. Obayashi, "Time series forecasting using a deep belief network with restricted Boltzmann machines," Neurocomputing, vol. 137, pp. 47–56, 2014, Advanced Intelligent Computing Theories and Methodologies. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231213007388
[181] P. Tamilselvan, Y. Wang, and P. Wang, "Deep belief network based state classification for structural health diagnosis," in 2012 IEEE Aerospace Conference, March 2012, pp. 1–11.
[182] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, p. 1, 1988.
[183] S. Haykin, Neural Networks and Learning Machines, 3rd ed. Prentice-Hall, 2008.
[184] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A critical review of recurrent neural networks for sequence learning," CoRR, vol. abs/1506.00019, 2015.
[185] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[186] W. Yin, K. Kann, M. Yu, and H. Schütze, "Comparative study of CNN and RNN for natural language processing," CoRR, vol. abs/1702.01923, 2017.
[187] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[188] I. Sutskever, J. Martens, and G. Hinton, "Generating text with recurrent neural networks," in Proc. International Conference on Machine Learning, ser. ICML'11. Omnipress, 2011, pp. 1017–1024.
[189] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 07–09 Jul 2015, pp. 1462–1471. [Online]. Available: http://proceedings.mlr.press/v37/gregor15.html
[190] A. Graves, "Generating sequences with recurrent neural networks," CoRR, vol. abs/1308.0850, 2013.
[191] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, "Advances in optimizing recurrent networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, May 2013, pp. 8624–8628.
[192] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, Eds., vol. 28. PMLR, 2013, pp. III–1310–III–1318. [Online]. Available: http://proceedings.mlr.press/v28/pascanu13.html
[193] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[194] K. Cho, B. v. Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2014, pp. 1724–1734. [Online]. Available: https://www.aclweb.org/anthology/D14-1179
[195] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: a search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.
[196] S. Yang, X. Yu, and Y. Zhou, "LSTM and GRU neural network performance comparison study: Taking Yelp review dataset as an example," in 2020 International Workshop on Electronic Communication and Artificial Intelligence (IWECAI), 2020, pp. 98–101.
[197] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[198] Z. C. Lipton, D. C. Kale, and R. C. Wetzel, "Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series," in Proceedings of the 1st Machine Learning for Healthcare Conference, ser. Proceedings of Machine Learning Research, F. Doshi-Velez, J. Fackler, D. Kale, B. Wallace, and J. Wiens, Eds., vol. 56. Northeastern University, Boston, MA, USA: PMLR, 2016, pp. 253–270. [Online]. Available: http://proceedings.mlr.press/v56/Lipton16.html
[199] J. Yoon, W. R. Zame, and M. van der Schaar, "Estimating missing data in temporal data streams using multi-directional recurrent neural networks," IEEE Transactions on Biomedical Engineering, vol. 66, no. 5, pp. 1477–1490, 2017.
[200] P. A. Fishwick, Simulation Model Design and Execution: Building Digital Worlds. Pearson Education, Facsimile Edition, 1995.
[201] H. V. D. Parunak, R. Savit, and R. L. Riolo, "Agent-based modeling vs. equation-based modeling: A case study and users' guide," in Proc. International Workshop on Multi-Agent Systems and Agent-Based Simulation. Springer, 1998, pp. 10–25.
[202] J. M. Epstein and R. Axtell, Growing Artificial Societies: Social Science from the Bottom Up. MIT Press, 1996.
[203] G. Weiss, Ed., Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. MIT Press, 1999.
[204] E. Bonabeau, "Agent-based modeling: Methods and techniques for simulating human systems," Proceedings of the National Academy of Sciences, vol. 99, no. Supplement 3, pp. 7280–7287, May 2002.
[205] W. G. Wilson, "Resolving discrepancies between deterministic population models and individual-based simulations," American Naturalist, vol. 151, no. 2, pp. 116–134, 1998.
[206] N. A. Cilfone, D. E. Kirschner, and J. J. Linderman, "Strategies for efficient numerical implementation of hybrid multi-scale agent-based models to describe biological systems," Cellular and Molecular Bioengineering, vol. 8, no. 1, pp. 119–136, 2015.
[207] N. Marilleau, C. Lang, and P. Giraudoux, "Coupling agent-based with equation-based models to study spatially explicit megapopulation dynamics," Ecological Modelling, vol. 384, pp. 34–42, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0304380018302163
[208] M. Elbattah and O. Molloy, "ML-aided simulation: A conceptual framework for integrating simulation models with machine learning," in Proc. ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. Association for Computing Machinery, 2018, pp. 33–36.
[209] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proceedings of the 1st Annual Conference on Robot Learning, ser. Proceedings of Machine Learning Research, S. Levine, V. Vanhoucke, and K. Goldberg, Eds., vol. 78. PMLR, 13–15 Nov 2017, pp. 1–16. [Online]. Available: http://proceedings.mlr.press/v78/dosovitskiy17a.html
[210] A. Gosavi, "Reinforcement learning: A tutorial survey and recent advances," INFORMS Journal on Computing, vol. 21, no. 2, pp. 178–192, 2009.
[211] L. von Rueden, S. Mayer, R. Sifa, C. Bauckhage, and J. Garcke, "Combining machine learning and simulation to a hybrid modelling approach: Current and future directions," in Advances in Intelligent Data Analysis XVIII, M. R. Berthold, A. Feelders, and G. Krempl, Eds. Cham: Springer International Publishing, 2020, pp. 548–560.
[212] W. J. Frawley, "The role of simulation in machine learning research," in Proc. Annual Symposium on Simulation, 1989, pp. 119–127. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/748306
[213] O. Kononenko and I. Kononenko, "Machine learning and finite element method for physical systems modeling," CoRR, vol. abs/1801.07337, 2018.
[214] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proc. IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[215] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1647–1655.
[216] B. Kim, V. C. Azevedo, N. Thuerey, T. Kim, M. Gross, and B. Solenthaler, "Deep fluids: A generative network for parameterized fluid simulations," Computer Graphics Forum, vol. 38, no. 2, pp. 59–70, 2019.
[217] A. Kalantari-Dahaghi, S. Mohaghegh, and S. Esmaili, "Coupling numerical simulation and machine learning to model shale gas production at different time resolutions," Journal of Natural Gas Science and Engineering, vol. 25, pp. 380–392, 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1875510015001651
surrogate data," Journal of Neuroscience Methods, vol. 172, no. 2, pp. 312–322, 2008.
[259] J. Theiler, S. Eubank, A. Longtin, B. Galdrikian, and J. D. Farmer, "Testing for nonlinearity in time series: the method of surrogate data," Physica D: Nonlinear Phenomena, vol. 58, no. 1-4, pp. 77–94, 1992.
[260] A. Forrester, Engineering Design via Surrogate Modelling. John Wiley & Sons, 2008.
[261] A. Bhosekar and M. Ierapetritou, "Advances in surrogate based modeling, feasibility analysis, and optimization: A review," Computers & Chemical Engineering, vol. 108, pp. 250–267, 2018.
[262] R. Regis and C. Shoemaker, "A stochastic radial basis function method for the global optimization of expensive functions," INFORMS Journal on Computing, vol. 19, no. 4, pp. 497–509, Nov. 2007.
[263] S. Songqing and W. G. Gary, "Survey of modeling and optimization strategies to solve high-dimensional design problems with computationally-expensive black-box functions," Structural and Multidisciplinary Optimization, vol. 41, no. 2, pp. 219–241, Aug. 2009.
[264] M. Raissi, P. Perdikaris, and G. E. Karniadakis, "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations," Journal of Computational Physics, vol. 378, pp. 686–707, 2019.
[265] S. Das and S. Tesfamariam, "State-of-the-art review of design of experiments for physics-informed deep learning," arXiv preprint arXiv:2202.06416, 2022.
[266] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, "Physics-informed machine learning," Nature Reviews Physics, vol. 3, no. 6, pp. 422–440, 2021.
[267] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques, 3rd ed., ser. The Morgan Kaufmann Series in Data Management Systems. Elsevier, 2012. [Online]. Available: https://www.sciencedirect.com/science/article/pii/B9780123814791000198
[268] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems 14, Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS'99. Cambridge, MA, USA: MIT Press, 1999, pp. 582–588.
[269] D. M. Tax and R. P. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
[270] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
[271] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga, "Outlier detection with autoencoder ensembles," in Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017, pp. 90–98.
[272] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld, "Toward supervised anomaly detection," Journal of Artificial Intelligence Research, vol. 46, pp. 235–262, 2013.
[273] C. C. Aggarwal, "Outlier analysis," in Data Mining. Springer, 2015, pp. 237–263. [Online]. Available: https://link.springer.com/chapter/10.1007%2F978-3-319-14142-8_8
[274] M. Goldstein and A. Dengel, "Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm," KI-2012: Poster and Demo Track, pp. 59–63, 2012.
[275] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Rec., vol. 29, no. 2, pp. 427–438, May 2000. [Online]. Available: http://doi.acm.org/10.1145/335191.335437
[276] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," SIGMOD Record, vol. 29, no. 2, pp. 93–104, 2000.
[277] M. Goldstein and S. Uchida, "A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data," PLOS ONE, vol. 11, no. 4, pp. 1–31, Apr. 2016. [Online]. Available: https://doi.org/10.1371/journal.pone.0152173
[278] M. Sohaib, C.-H. Kim, and J.-M. Kim, "A hybrid feature model and deep-learning-based bearing fault diagnosis," Sensors, vol. 17, no. 12, 2017. [Online]. Available: https://www.mdpi.com/1424-8220/17/12/2876
[279] R. Moghaddass and S. Sheng, "An anomaly detection framework for dynamic systems using a Bayesian hierarchical framework," Applied Energy, vol. 240, pp. 561–582, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306261919303150
[280] M. J. Desforges, P. J. Jacob, and J. Cooper, "Applications of probability density estimation to the detection of abnormal conditions in engineering," in Proceedings of the Institution of Mechanical Engineers, vol. 212, 1998, pp. 687–703.
[281] H. Luo and S. Zhong, "Gas turbine engine gas path anomaly detection using deep learning with Gaussian distribution," in Proc. Prognostics and System Health Management Conference. IEEE, 2017, pp. 1–6.
[282] W. Yan and L. Yu, "On accurate and reliable anomaly detection for gas turbine combustors: A deep learning approach," arXiv preprint arXiv:1908.09238, 2019.
[283] L. Martí, N. Sanchez-Pi, J. M. Molina, and A. C. B. Garcia, "Anomaly detection based on sensor data in petroleum industry applications," Sensors, vol. 15, no. 2, pp. 2774–2797, 2015.
[284] J. Tian, C. Morillo, M. H. Azarian, and M. Pecht, "Motor bearing fault detection using spectral kurtosis-based feature extraction coupled with k-nearest neighbor distance analysis," IEEE Transactions on Industrial Electronics, vol. 63, no. 3, pp. 1793–1803, March 2016.
[285] Z. Li, X. Yan, Z. Tian, C. Yuan, Z. Peng, and L. Li, "Blind vibration component separation and nonlinear feature extraction applied to the nonstationary vibration signals for the gearbox multi-fault diagnosis," Measurement, vol. 46, no. 1, pp. 259–271, 2013. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0263224112002540
[286] T. Hiruta, K. Maki, T. Kato, and Y. Umeda, "Unsupervised learning based diagnosis model for anomaly detection of motor bearing with current data," Procedia CIRP, vol. 98, pp. 336–341, 2021, the 28th CIRP Conference on Life Cycle Engineering, March 10–12, 2021, Jaipur, India. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2212827121001438
[287] R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, and R. X. Gao, "Deep learning and its applications to machine health monitoring," Mechanical Systems and Signal Processing, vol. 115, pp. 213–237, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0888327018303108
[288] S. Zhang, S. Zhang, B. Wang, and T. G. Habetler, "Deep learning algorithms for bearing fault diagnostics—a comprehensive review," IEEE Access, vol. 8, pp. 29 857–29 881, 2020.
[289] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, "Deep learning for anomaly detection," ACM Computing Surveys, vol. 54, no. 2, pp. 1–38, Mar 2021. [Online]. Available: http://dx.doi.org/10.1145/3439950