A Review On Machine Learning Techniques in Biomedical Research
Machine learning (ML) approaches are a collection of algorithms that attempt to extract patterns from data and to associate such patterns with discrete classes of samples in the data—e.g., given a series of features describing persons, a ML model predicts whether a person is diseased or healthy; given features of animals, it predicts whether an animal is treated or control; or it predicts whether molecules have the potential to interact or not, etc. ML approaches can also find such patterns in an agnostic manner, i.e., without having information about the classes. Respectively, those methods are referred to as supervised and unsupervised ML. A third type of ML is reinforcement learning, which attempts to find a sequence of actions that contribute to achieving a specific goal. All of these methods are becoming increasingly popular in biomedical research in quite diverse areas including drug design, stratification of patients, medical image analysis, molecular interactions, prediction of therapy outcomes and many more. We describe several supervised and unsupervised ML techniques, and illustrate a series of prototypical examples using state-of-the-art computational approaches. Given the complexity of reinforcement learning, it is not discussed in detail here; instead, interested readers are referred to excellent reviews on that topic. We focus on concepts rather than procedures, as our goal is to attract the attention of researchers in biomedicine toward the plethora of powerful ML methods and their potential to leverage basic and applied research programs.

Keywords: machine learning, biomedical research, supervised learning, unsupervised learning, reinforcement learning

INTRODUCTION
Machine learning (ML) is a branch of artificial intelligence (AI) that deals with the implementation of computational algorithms that improve performance upon experience; in other words, a ML system learns from data (1, 2). In its classical definition, ML approaches include three types of knowledge acquisition: supervised learning, unsupervised learning and reinforcement learning (3) (Figure 1).

In supervised learning, an algorithm trains a statistical model, which in turn is able to make predictions about an unlabeled instance. During training, a column of data containing the answer (label or target) is used to supervise the learning process (4, 5). For example, given a data set of cancer patients, the label column could contain tumor classes indicating whether the tumor of a patient ended up being benign or malignant. Alternatively, the label column could indicate the number of people affected by an infectious disease in each country of the world. In both cases, the model learns to associate the values of a series of predictor variables, known as "features" in the ML jargon, with the label variable. Once trained, the model can predict the label value in new data based only on the values of those features.
In practice, the available data are split into a training subset used for training the model, and a test subset used for prediction and evaluation. Often, separate evaluation and test datasets are used. In those cases, the evaluation dataset is used for tuning the parameters of the model and for feature selection, while the test dataset is used for an unbiased evaluation. For simplicity, here we will refer to them just as the test dataset. During training, the labels are shown to the model, so that it learns to associate patterns in the data with specific values of the target column. Once trained, the model can be evaluated on the test data subset, and its performance is determined based on metrics such as accuracy. When a model predicts values of the target variable with high accuracy, it is said that the model generalizes well. The description of some popular supervised learning algorithms follows. Most of these algorithms are available for both classification and regression analyses; unless otherwise mentioned, that should be assumed to be the case. Currently, supervised learning is by far the most important approach for biomedical research and for this reason it receives special attention here.

K-Nearest Neighbors
The k-NN algorithm assumes that similar things are in close proximity to each other. Based on Euclidean distances among samples (or other more sophisticated distance metrics), the k-NN algorithm finds, for each new sample, the k labeled samples with the most similar feature values, i.e., those separated from it by the shortest distances (1, 18). The new sample is then assigned to a class through a voting process among those k neighbors, receiving the majority class of its neighborhood (e.g., if most of the nearest neighbors belong to the malignant cancer class, a new sample falling in that neighborhood will also be classified as malignant, even if it is in fact benign). It is therefore possible for samples that lie close to each other to belong to different classes, e.g., samples from benign and malignant tumors may share a neighborhood, and such samples are the ones most likely to be misclassified. Because the end data structure of a k-NN process is a distance matrix rather than an explicit model of the features, it is not possible to assess the individual contribution of each feature to the classification process (19).

Let's illustrate the k-NN algorithm with a real example (a minimal code sketch follows this section). The hepatitis data set (illustrated in Supplementary Figure 1) contains records of 155 patients affected by hepatitis, with measurements of a series of clinical variables aimed at helping clinicians with monitoring disease progression. Despite its small size, this is a popular dataset in ML forums because its structure is well-suited for explanation of ML concepts (20, 21). Of those patients, 32 died and 123 survived. Survival is the label here. The optimal number of neighbors (k) to consider in a k-NN model is determined empirically using the training dataset. As seen in Figure 2A, 1–5 neighbors lead to the highest accuracy (82%). Predicting survival of patients correctly 82% of the time is a very encouraging result, but let's see whether that can be improved.

Decision Tree-Based Approaches
Using the analogy of a tree to describe this algorithm is very convenient. Let's assume that the whole population of samples corresponds to the root of the tree. The interest then is to split such a population into branches and leaves so that after each split, the subset of samples remaining is more homogeneous than its precursor. More formally, the process consists in splitting precursor populations into decision nodes, which represent attributes of the dataset under analysis (7, 22). A simplistic example involves the classification of birds and mammals from a mixed root population including penguins, eagles, whales and wolves. In the first decision node, we could ask "does it have feathers?" This will separate mammals from birds. In the mammals branch, we could create a second decision node by asking "does it live in the sea?" This will separate whales from wolves. Similarly, in the birds' branch we could ask "does it fly?" This will separate penguins from eagles. That is the basic idea.

The choice of attributes as decision nodes is informed by statistics. Although there are many decision tree algorithms, all will calculate statistics on each attribute and will select the one that best purifies the resulting subsets of samples with respect to the target variable (23–25). Let's use the iterative dichotomiser 3 (ID3) algorithm to illustrate the process (26, 27). The process starts by assigning all samples to the root of the tree. At each iteration, the algorithm calculates entropy (H) and information gain (IG) for each attribute that can be used as a decision node. Here, the entropy of a set is a measurement of the randomness of the labels of those instances; the larger the entropy, the less homogeneous the subset of samples comprising a node. Conversely, information gain measures how well a given attribute allows separation of training samples according to the target variable. Therefore, the algorithm aims at maximizing and minimizing IG and H, respectively.

When a decision tree algorithm was applied to the hepatitis dataset, an accuracy of 82.1% was obtained, which is quite similar to the results obtained with the k-NN algorithm above. The decision tree obtained is presented in Supplementary Figure 2A. As seen, it placed albumin in the first node (root), asked whether the content of each sample is smaller or equal to 1.598, and then split samples accordingly. Overfitting occurs when the model learns the training data too well and consequently does not perform well on test data. Intrinsically, decision trees are prone to overfitting, and that reduces their ability to generalize. There are at least two ways to reduce overfitting in decision trees: one is by pruning the tree, i.e., reducing its depth; the other is to use ensembles of trees, which are implemented in different algorithms like random forest or gradient boosting classifiers (1, 3).

Random forest is a collection of decision trees (7, 19). The assumption is that individual trees will overfit the data in different ways and that averaging the results of many trees will reduce overfitting and consequently improve the accuracy of classification. Randomness is injected in two ways. First, the algorithm bootstraps to extract n samples with replacement; each dataset extracted in this way will be the same size as the original dataset, but some samples will be missing while others will be repeated. Second, at each decision node, the algorithm randomly selects a subset of the features and chooses, among them, the one that best splits the samples (3). Gradient boosting works in a similar way, but for the sake of space will not be discussed here. Instead, we encourage reading (28–30).
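To make the k-NN experiment described above concrete, the following is a minimal sketch using Scikit-learn, the library used throughout this article. The file name, column names and imputation of missing values are assumptions for illustration; they are not the exact preprocessing used for Figure 2A.

```python
# Minimal k-NN sketch (scikit-learn). File and column names are illustrative;
# adapt them to however the hepatitis data are stored locally.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("hepatitis.csv")          # hypothetical file with imputed values
X = df.drop(columns=["survival"])          # features (albumin, bilirubin, protime, ...)
y = df["survival"]                         # label: died vs. survived

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Choose k empirically, as in Figure 2A, by scoring the held-out split for each k
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 3))   # mean accuracy for each k
```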
FIGURE 2 | Illustration of supervised learning algorithms. (A) Relationship between number of neighbors (k) and accuracy in the k-NN algorithm when applied to the
hepatitis dataset. (B) Feature importance when the random forest algorithm was applied to the hepatitis dataset. (C) Tri-dimensional scatter plot of values of albumin,
bilirubin and protime in patients included in the hepatitis dataset. (D) Decision surface of the logistic regression model applied to the hepatitis dataset illustrated in a
two dimensional plot including only albumin and bilirubin. (E) Comparison of the theoretical probability distribution of a logistic regression model with the probability
distribution of survival of patients in the hepatitis dataset when only albumin is considered as regressor. (F) Lollipop plot of accuracy achieved during classification of
survival in the hepatitis dataset. k-NN, k-nearest neighbors; SVC, Support vector classifier; LogReg, Logistic regression (R squared); SGDC, Stochastic gradient descent
classifier; DTC, Decision tree classifier; RFC, Random forest classifier; GBC, Gradient boosting classifier; MLPC, Multilayer perceptron classifier.
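For reference, the two quantities that ID3 computes at each candidate decision node, as described above, are conventionally defined as follows, where S is the set of samples at the node, p_c is the fraction of samples in class c, and S_v is the subset of S for which attribute A takes value v:

H(S) = -\sum_{c} p_c \log_2 p_c

IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)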
When a random forest algorithm with 60 individual trees was applied to the hepatitis dataset, the accuracy in classification was 85%, which represents a substantial improvement compared to the individual decision tree described above, or to the k-NN approach (accuracy ∼82.1%). As shown in Supplementary Figure 2B (only considering bilirubin and albumin as predictors), the decision boundary defined by the random forest algorithm is smoother than the one of the single decision tree. A convenient attribute of decision trees, random forest and gradient boosting algorithms is that the individual contribution of each feature to the model can be visualized. In Figure 2B, the individual contribution of each feature to the classification process is presented. Albumin and bilirubin are dominant, followed by protime (which is the time that it takes for the blood of a patient to clot in a prothrombin time test), ascites (build-up of fluid in the peritoneal cavity) and age.
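A minimal sketch of this random forest experiment, reusing the train/test split from the k-NN sketch above, is shown below. The number of trees matches the text; other settings are assumptions.

```python
# Random forest with 60 trees and per-feature contributions (a sketch, not the
# exact configuration used for Figure 2B).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=60, random_state=0)
rf.fit(X_train, y_train)
print("accuracy:", round(rf.score(X_test, y_test), 3))

# Impurity-based importances, the kind of values plotted in Figure 2B
for name, imp in sorted(zip(X.columns, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f"{name:10s} {imp:.3f}")
```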
Indeed, when plotting samples in a 3D space including only albumin, bilirubin and protime, a good separation of patients that died from those that survived is achieved (Figure 2C). Although it appears obvious that the content of bilirubin and albumin is higher and lower, respectively, in patients that died, separation of patients based on only these three features is not perfect, which suggests that other features also contribute to the outcome of the disease. Can we still improve on the great performance exhibited by random forest?

Linear Models
Linear models are among the simplest, and therefore most popular, models in ML (1, 31). Essentially, a linear model represents the weighted sum of the input features plus an intercept or bias term (32). Given their simplicity, they are ideal for more formally explaining the concept of model or hypothesis. Let's consider the following equation:

Equation 1. General equation of linear models for regression.

\hat{y} = w[0] \cdot x[0] + w[1] \cdot x[1] + \dots + w[n] \cdot x[n] + b = b + \sum_{i=1}^{N} w_i x_i

We see that each of the features (X) is weighted in the sum by a value w and the whole line has a bias equal to b. It means that the contribution of each feature to the model may be different. Generally, there will be a difference between the predicted target values and the real ones, in other words a cost associated with the process of mapping. The magnitude of such cost can be estimated by a cost function, using an estimator like the mean squared error (MSE). The problem becomes one of finding the parameters in the model that minimize the cost. This can be achieved using an optimization algorithm called gradient descent (33, 34).

In a two-dimensional space, the equation above defines a line which starts at b and extends upwards or downwards depending on the slope (w). In three dimensions, the output of the function is a plane, and in multidimensional spaces it is a hyperplane. The problem is that the relationship between features and target is often non-linear, and therefore linear models have reduced predictive potential to explain such relationships. Most linear models are used for regression analysis and therefore are not suitable to predict survival in our hepatitis example. There are also linear models for classification of categorical target variables, logistic regression (35) and linear support vector machines (SVM) being the most popular of them. We will discuss classification with logistic regression here; for a discussion on SVM, the reader is encouraged to review (7, 19).

Logistic regression is very popular in biomedical research (36), and is often used to predict whether a set of conditions will result, or not, in disease or death of patients. Logistic regression is a non-linear function that models the probability of belonging to one class or another based on a linear combination of features (36). For a target variable with two outcomes, as in our hepatitis example (death or survival), the logistic regression equation is as follows:

Equation 2. Logistic regression equation for binary target variables.

\hat{y} = \frac{e^{X}}{1 + e^{X}}

where X corresponds to the linear regression equation presented above. The linear regression equation defined by the exponent X gives rise to the logit or logarithm of the odds:

Equation 3. Log odds in logistic regression.

\ln \frac{\hat{y}}{1 - \hat{y}} = b + \sum_{i=1}^{N} w_i x_i

In other words, the linear model defines the natural logarithm of the probability of being in a class divided by the probability of being in the other class. When we applied logistic regression to the hepatitis dataset, an R² of 0.90 (90%) was obtained, which is a pretty satisfactory result. To illustrate its intrinsic linear nature, we plotted the decision surface of this model only for albumin and bilirubin (Figure 2D). The probability distribution of survival based only on albumin, which is the most influential feature, closely mirrored the theoretical probability distribution of logistic regression (Figure 2E). As can be seen, there are no values for albumin concentration below some point, likely because that is the threshold of lethality (Figure 2E). It is important to note that the R squared statistic is not directly comparable to the accuracy of a model; it does not reflect the prediction power of the model; instead, it represents the proportion of the variance explained by the model.

Deep Learning
The approaches discussed so far belong to the classical ML realm. Artificial neural networks (ANN; often referred to as feed-forward neural nets) owe their name to the fact that they aspire to emulate the interconnected system of neurons. ANNs are central to deep learning (37). Although ANNs were initially proposed more than 70 years ago (38), interest in them has recently been revived mainly due to the exponential increase seen in computer power and data for training ANNs. This led to successful implementations that further boost their relevance.

The perceptron is an ANN with a single neuron (Figure 3A), and it is an ideal architecture to explain the foundations of neural networks (39). As seen in Figure 3A, the perceptron receives numeric inputs and calculates a weighted sum (as explained above for linear models of regression). It means that each input value is multiplied by a weight and then all are summed, along with a bias, to finally produce an output. In other words, the weights estimate how large a change in the output is expected to be when the input changes (i.e., the relative contribution of each feature to the output), while the bias allows shifting the activation function by a constant to better fit the model to the data. The reader probably already noticed that such a description is simply a linear model of regression. ANNs differ from classical linear models in what is called an activation function (Figure 3A), and weights are calculated in a different manner. Like biological neurons, an artificial neuron has to decide, using the activation function, whether it gets activated or not, based on the magnitude of the stimulus received.
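Returning to the logistic regression experiment above, the following is a minimal sketch of how such a model can be fitted and inspected with Scikit-learn; the fitted object exposes exactly the quantities that appear in Equations 2 and 3 (weights, bias, and class probabilities). The preprocessing shown is an assumption, not the exact pipeline used here.

```python
# Logistic regression sketch, reusing the X/y split from the earlier sketches.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

proba = logreg.predict_proba(X_test)[:, 1]   # ŷ = e^X / (1 + e^X) for each patient
coefs = logreg.named_steps["logisticregression"].coef_[0]   # the weights w_i
bias = logreg.named_steps["logisticregression"].intercept_[0]   # the bias b
print("intercept (b):", round(bias, 2))
print("first five predicted probabilities:", np.round(proba[:5], 3))
```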
FIGURE 3 | Illustration of artificial neural networks (ANNs). (A) Perceptron, a neural network with a single neuron. Σ represents the weighted sum, f represents the activation function, and b represents the bias term. (B) Deep neural network with three hidden layers. Each interconnected node represents a neuron. (C) Neural network with a single hidden layer. (D) Heat map representation of weights in the multilayer perceptron model applied to the hepatitis dataset.
Thus, the equation of a perceptron is the linear regression equation passed through an activation function.

Equation 4. Equation of a perceptron output.

\hat{y} = f\!\left( b + \sum_{i=1}^{N} w_i x_i \right)

The activation function can be as simple as the implementation of a threshold (i.e., a step function) that acts like a switch, turning the neuron on or off when a threshold value is exceeded, or there can be other non-linear transformations that allow ANNs to learn complex data patterns. The true complexity is based on many layers applying simple functions. Non-linear transformations include the rectified linear function (ReLU; which despite its name is non-linear), the sigmoid function (the same as in logistic regression) and the hyperbolic tangent function (Tanh), among others (3, 40, 41).

Deep learning relies on deep neural networks (1, 19). Unlike perceptrons, a deep neural network contains many neurons, each of them connected to neurons in other layers by edges that represent the weights, and each neuron has an activation function (Figure 3B). The weighted sum, including a bias term and passed through an activation function, produces an output that is associated with an error, i.e., the distance between that output and the expected prediction values. Neural nets adjust weights to reduce the error through algorithms like back-propagation (1, 42), but discussion of such concepts is beyond the scope of this article. If we imagine a neural net with four features as inputs, then going through a single hidden layer with three nodes (Figure 3C), the equation to calculate the output would be the following.

Equation 5. Deep neural network equation using tanh non-linearity.

\hat{y} = v[0] \cdot h[0] + v[1] \cdot h[1] + v[2] \cdot h[2] + b

v: weights between the hidden layer and the output.
h: intermediate values stored in neurons of the hidden layer.
h is calculated as:

h[i] = \tanh\!\left( \sum_{j=0}^{N-1} w[i, j] \cdot x[j] + b[i] \right), \quad i = 0, \dots, H-1

H: number of nodes in the hidden layer.
N: number of features.
w: weights between the inputs and the nodes in the hidden layer.

When we applied a multilayer perceptron to the classification problem on the hepatitis dataset, we achieved an accuracy of 84.6%. A disadvantage associated with neural networks is that interpretation of the model is quite troublesome. In our example, we applied a neural network and calculated and deployed the weights associated with such a neural network in a heat map (Figure 3D); the results are far from clear. A statement that can be made is that features with smaller weights are less important for the model (their influence on the target variable is less significant).

Figure 2F presents the classification results achieved by several algorithms. Logistic regression achieved an R² of 90%, while the accuracy of the other algorithms (k-NN; support vector classifier, SVC; stochastic gradient descent classifier, SGDC; random forest classifier, RFC; and multi-layer perceptron classifier, MLPC) ranged between 82 and 85%.

Another type of neural net is the convolutional neural network (CNN or ConvNet), which is often applied in the field of computer vision to conduct image classification. CNNs were inspired by the organization of the visual cortex of the human brain (43). In their seminal work on cats and monkeys, Hubel and Wiesel (44, 45) determined that individual neurons in the visual cortex were responsible for perceiving only a small portion of the visual field, and that the tiling of many overlapping visual subfields acquired by many neurons creates complex images. As illustrated in Figure 4A, when the brain attempts to perceive the image of a car, the whole image will be the composite of many subfields that observe individual overlapping sections of the car. The authors also discovered a high level of diversity and specialization among neurons of the visual cortex: some of them were dedicated to the perception of simple geometric patterns like lines and arcs, while other higher-level neurons were able to perceive more complex patterns, derived from the combination of lower-level patterns (Figure 4A). For such breakthrough discoveries, Hubel and Wiesel won the Nobel Prize in Physiology or Medicine in 1981.

Although CNNs were conceived in the 1980's (47, 48), they remained in the shadows because of their initial inability to scale up: they needed a lot of images and hence computer resources to perform considerably well. However, it should not be surprising that emulating the primary visual pathway of the human brain has been troublesome; after all, it took nature 500 million years to evolve such a system (49). Recent advances in computer power and the exponential accumulation of images in diverse realms of research and technology revived interest in CNNs (50, 51).

As said above, CNNs function in a highly hierarchical manner. Initially, a CNN layer starts detecting lines and arcs. This information is passed to the next convolutional layer, which detects combinations of edges and corners. Eventually, the deeper layers of the convolutional network are able to detect complex patterns, like faces, cars, cancerous tissues, etc. They are called convolutional because, to transfer information across layers of the network, mathematical convolution is used. Convolution refers to the combination of two functions to produce a third function; or, more simply put, two sources of information are merged into a single function (52).

In a conventional neural net, the neurons of a hidden layer are fully connected to all neurons in the contiguous layer, and finally a fully-connected output layer provides the predictions of the network (53). In a CNN, layers are three dimensional (width, height, and depth), and more importantly, the neurons in a layer are not connected to all neurons in the next layer, but instead to only a small number of them. The output could be a class or a vector of class probabilities. CNN processing starts with feature extraction; for that, a filter or kernel (of size equal to the receptive field) is slid along the full image to create a feature map, which is the sum of convolutions. Finally, a classification procedure is applied (43).
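Before moving to the image example, here is a minimal sketch of a multilayer perceptron of the kind described above for the hepatitis data (the model whose weights are shown in Figure 3D). The hidden-layer size, activation and solver settings are assumptions for illustration, not the exact configuration that yielded 84.6%.

```python
# Multilayer perceptron sketch (scikit-learn), reusing the earlier train/test split.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), activation="tanh",
                  max_iter=2000, random_state=0))
mlp.fit(X_train, y_train)
print("test accuracy:", round(mlp.score(X_test, y_test), 3))

# coefs_[0] holds the input-to-hidden weight matrix w[i, j] of Equation 5
print(mlp.named_steps["mlpclassifier"].coefs_[0].shape)
```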
FIGURE 4 | Application of a convolutional neural network (CNN) to classify cancer tissue. (A) CNNs are emulations of sets of neurons that detect individual
overlapping visual sections called receptive fields. Such neurons detect simple features of the object, such as lines and arcs. Deeper neuronal layers detect more
complex shapes derived from the initial elements and progressively the whole object is resolved. (B) Representative patches from the invasive ductal carcinoma (IDC)
tissue sections described in Cruz-Roa et al. (46) and classified as non-cancerous tissue by a pathologist. (C) Representative patches from the invasive ductal
carcinoma (IDC) tissue sections classified as cancerous tissue by a pathologist. In (B,C), darker regions correspond to nuclei stained with hematoxylin, which appear
denser in (C) probably due to increased cell proliferation in cancerous tissues. (D) Digital reconstruction of tissue sections from individual patches in six different
patients. Red regions represent non-cancerous tissue, while blue regions represent cancerous tissue. (E) Accuracy for training and test data sets obtained when a
CNN was applied to the IDC dataset.
In order to illustrate CNNs, we used breast cancer (invasive ductal carcinoma, IDC) sections that had been classified by a pathologist (46). The authors divided the slides' pictures into 100 × 100 pixel non-overlapping patches, which resulted in more than 270,000 patches. A patch was considered positive if at least 80% of the patch was contained inside the cancerous annotated region, or negative otherwise. Examples of negative and positive patches are presented in Figures 4B,C, respectively. Digital reconstructions of sections from their constituent patches are depicted in Figure 4D. Because the IDC dataset is largely unbalanced, i.e., many negative and few positive patches, we extracted a subsample with an equal number of positive and negative patches (∼13,000 patches in each class), in order to be able to evaluate the performance of our CNN using accuracy as a metric (accuracy is not a reliable metric for unbalanced datasets). Although we ran the model for 50 epochs (iterations), accuracy quickly (around epoch 20) plateaued between 81 and 82% (Figure 4E). Thus, a relatively simple implementation of a CNN yielded satisfactory classification of cancerous and non-cancerous breast tissues.

In addition to conventional (feed-forward) neural nets and CNNs, there exist other types of more complex neural nets not discussed here, including long short-term memory (LSTM) networks (50, 54) and Kohonen's self-organizing maps (SOM) (55), which find applications in, for example, protein folding (54) and hematopoietic differentiation (56), respectively.
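The following is a minimal sketch of a small CNN of the kind applied to the IDC patches above, written with tf.keras. The architecture, optimizer and batch size are assumptions for illustration, not the exact kernel used to produce Figure 4E; `X_img` and `y_img` stand for the patch array (n_patches, 100, 100, 3), scaled to [0, 1], and the binary labels.

```python
# Minimal CNN sketch (tf.keras) for 100 x 100 RGB patches.
import tensorflow as tf
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(100, 100, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # low-level features: edges, arcs
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),   # combinations of edges and corners
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),          # probability of the IDC class
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = cnn.fit(X_img, y_img, validation_split=0.2, epochs=50, batch_size=64)
```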
In summary, supervised ML approaches have great potential in biomedical research because supervision of the model performance with real clinical data provides confidence for making decisions about treatment of patients. The emphasis of this report is just to provide a general overview of how ML approaches work, providing enough details for the reader to gain a good grasp of the underlying methods but without entering into particular details of algorithms or unnecessary mathematical explanations.

Unsupervised Learning
Clustering and Ordination
There are two families of techniques in unsupervised learning: clustering and dimensionality reduction (ordination). Clustering aims at partitioning data into constituents, usually based on distance among samples. The basic idea is that data points in the same cluster have similar properties, and are more different from data points in other clusters. There exist a variety of clustering algorithms—including agglomerative clustering, DBSCAN, KMeans, Birch clustering, Gaussian mixture models, spectral clustering, etc. (1, 57). Their efficiency and reliability depend on the distribution of the dataset under analysis; this means no single algorithm performs best on all tasks, which is why the best model needs to be determined empirically. Also, some algorithms scale better than others; for example, algorithms that compute pairwise similarities among all samples do not work well for large datasets.

Instead of defining each clustering algorithm, we will illustrate some of them with a simulated example. We generated 10,000 random instances (each a pair of values) grouped into three classes, each of them with a normal distribution (Supplementary Figure 3A). To evaluate the performance of clustering methods, the Silhouette score may be used. The Silhouette score is a metric to estimate the robustness of clustering, whose value ranges from −1 to 1, with higher values associated with clusters integrated by similar samples, while low values indicate that clusters contain heterogeneous samples (58, 59). We applied a series of clustering algorithms on our simulated 3-cluster dataset (Supplementary Figures 3B–F), and then colored each point according to the cluster it was assigned to by the clustering algorithm. As seen, all clustering algorithms tested achieved very similar results, faithfully reflecting clustering in the original dataset. KMeans clustering produced a Silhouette score (0.506) that was slightly higher than the other methods: spectral clustering (0.492), Gaussian mixture clustering (0.484), Birch clustering (0.479) and agglomerative clustering (0.471). As said above, no clustering method is the best, but KMeans is often a good starting point.

The number of input variables or features describing an instance is called its dimensionality. Techniques for dimensionality reduction are often used for visualization. Dimensionality reduction techniques use linear algebra, projection methods and autoencoders (see below for a brief discussion on autoencoders). Many ordination techniques were developed in the context of population ecology, where the interest was to know the relationship among groups (e.g., species) in a community (60, 61). Popular ordination approaches include distance-based techniques like principal coordinates analysis (PCoA) and non-metric multidimensional scaling (MDS), eigenvector gradient analysis like principal component analysis (PCA) and correspondence analysis (CA), and manifold learning like autoencoders, isomaps and t-distributed stochastic neighbor embedding (t-SNE) (1, 57, 62).

Because those techniques have been widely used in biomedical research for quite some time (60, 61, 63, 64), they will not be discussed in detail here. However, we will illustrate t-SNE, PCA and MDS with real data. In a nutshell, t-SNE derives a probability distribution in the high-dimensional space using Euclidean distances between objects and a similar distribution in the low-dimensional space, trying to minimize the Kullback-Leibler divergence between the two probability distributions (65–67). The most important parameter of a t-SNE algorithm is the perplexity, which controls the width of the Gaussian kernel used to compute similarities between samples, and hence controls the number of nearest neighbors associated with a specific data point (68). Nevertheless, t-SNE is criticized for not preserving the global structure of the data, which may be critical for some practical applications (69). PCA is a multivariate statistical technique developed at the very beginning of the twentieth century by none other than Pearson (70). The central tenet in PCA is to reduce the dimensionality of multidimensional datasets with interrelated features, to be able to visualize data in a low-dimensional space that contains most of the variance of such a dataset. This is achieved by transforming the original features into uncorrelated, orthogonal, principal components (63, 71). MDS is a representation and dimensionality reduction technique that maps instances into a low-dimensional space in a way that attempts to preserve the relative distances between instances. Samples that are more similar will be represented near to each other, while different samples will be represented far apart (72, 73). Mathematically, it transforms, using eigenvalue decomposition, a dissimilarity matrix (distances between samples) into a coordinate matrix while minimizing a loss function; in other words, trying to preserve the original distances between samples (64).

To provide a more practical illustration of both clustering and ordination techniques, we analyze here, in an agnostic way, single-cell transcriptomics data from (8). The dataset contains gene abundance derived by 10X Chromium technology from 1,291 individual microglia from mice with injury in their spinal cord, as well as naive animals. In order to select the optimal number of clusters defined by the dataset, we used the elbow method with the KMeans clustering algorithm. In this method, the algorithm is fit to a range of cluster numbers (k). For each k, we compute the inertia, which is the sum of squared distances of instances to the closest cluster center; in other words, how far away points within a cluster are located. When plotting the inertia as a function of the number of clusters, we typically see an arm-like shape; we can then use the point of inflection of the curve (elbow) as an indicator of best fit of the model.
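A minimal sketch of the elbow method just described is shown below; `expr` stands for the cells-by-genes expression matrix, which is an assumption here (the range of k and the random seed are likewise illustrative).

```python
# Elbow-method sketch: fit KMeans for a range of k and record the inertia.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(expr)
    inertias.append(km.inertia_)   # sum of squared distances to the closest centroid

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters (k)")
plt.ylabel("inertia")
plt.show()                          # the bend ('elbow') suggests the best k
```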
In our case, the elbow was located at three clusters (Figure 5A). However, because we had the knowledge presented by Plemel et al. in their paper, we knew that two (naive vs. injury), three (two naive vs. one major injury group) or five (two naive vs. three injury groups) clusters were correct forms to partition the dataset. Silhouette coefficients of KMeans clusters suggested partition of data points into two (Figure 5B) or three (Figure 5C) clusters. For graphical representation, we initially chose the t-SNE method. As seen in Figure 5D, t-SNE deployed instances into a somewhat circular distribution; however, the specific shape of the t-SNE plot is highly dependent on the data transformation method used prior to t-SNE and the value of perplexity chosen. We applied the StandardScaler method of Scikit-learn and a perplexity of 30, which is the default value in most t-SNE implementations. Because we knew that the cells represented subpopulations from mice with or without injury in their spinal cord, we initially clustered the data points into two clusters (Figure 5E; orange and green clusters) and subsequently into three clusters (Figure 5F; red, black and blue clusters). Plemel and collaborators reported that cells in lesion 1 exhibited higher expression of the genes Apoe, Spp1, Cxcl2, Lyz2, and Cd74. Accordingly, we conducted differential expression analysis, and found that indeed all those five genes were differentially expressed when the two naive clusters (together) were compared against lesion 1 samples (Figure 5G). Thus, combining KMeans clustering and t-SNE, we were able to recapitulate the results reported by Plemel and collaborators for the major lesion samples vs. the naive ones. To further test the reliability of our clustering, we subjected such classification to supervised ML. We applied gradient boosting classification and could indeed predict labels of the two clusters (Figure 5E) with an accuracy of 98% and labels of the three clusters (Figure 5F) with an accuracy of 94%. For comparison, we also clustered the same data using PCA (Figure 5H) and MDS (Figure 5I) and found that both methods effectively separated the same three clusters, but the separation of clusters was less clear than in the case of t-SNE. When we applied a regularized logarithmic transformation to the data prior to ordination, it substantially improved the resolution of clusters for t-SNE, but had the opposite effect for PCA and MDS (Supplementary Figure 4). We did not explore this in more detail.

Autoencoders
Autoencoders are typically artificial neural networks (ANNs) used for representation learning. Representation learning is a technique that allows a model to learn features essential for accomplishing a specific analytical task. An autoencoder learns to compress data and also has the capability to reconstruct such data to generate a representation that attempts to be as similar as possible to the original data. These two tasks are accomplished by two submodels: the encoder (recognition network) and the decoder (generative network), respectively (74–76). The encoder can be viewed as a filter that selects some of the most relevant features that are sufficient to represent the data in a compressed format (fewer dimensions). Between them is the compressed code that is able to regenerate the data, also known as the latent-space representation (74). However, although the decoder aims at reconstructing the original data, it only uses the information contained in the code. The difference between the original and the regenerated data is called the error, which is estimated by a loss function. Because autoencoders remove noise to generate the compressed representation of the data, they naturally reduce dimensions. There are several types of autoencoders, including undercomplete, stacked, sparse, convolutional, contractive and variational ones, among others. For a detailed description of such approaches, see (19, 75).

For the sake of clarity, we reproduce here an example presented by Géron (19) that helps to understand the encoding process. Consider two vectors of numbers:

[40, 27, 25, 36, 81, 57, 10, 73, 19, 68]
[50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20]

The second vector, despite being longer, is easier to encode than the first one, because it contains a pattern. Every even number is followed by its half (50 by 50/2 = 25), while odd numbers are followed by their triple plus one (25 by 25 × 3 + 1 = 76). Thus, to learn the second sequence of numbers, the encoder only has to deduce these two rules, the first number in the series, and the length of the series of numbers. The first vector would be difficult to compress. Thus, autoencoders work better when elements in a data set contain patterns and poorly when they are independent from each other. Thus, the task of the autoencoder is to detect correlations between input features (19).

The architecture of an autoencoder is similar to the ANNs (multi-layer perceptrons) presented in Figure 3, with two caveats: (i) the number of neurons in the output layer is the same as the number of inputs (features), because the autoencoder tries to regenerate the original data; and (ii) for undercomplete autoencoders, the hidden layers have fewer neurons than the inputs, which forces the autoencoder to select only the most relevant features in a compressed representation (Figure 6A). We applied undercomplete autoencoders to assess whether a better visual representation of the suspected two or three clusters could be obtained for the microglia single cell transcriptomics data. As mentioned above, among the critical parameters of an ANN are the activation and loss functions used, which affect performance of the model in a data-type dependent manner. When trying to separate two clusters, we used the Tanh and GELU activation functions (for hidden layers) and the Poisson NLL loss (pnl) and the Kullback-Leibler divergence (kl_div) loss functions. We tested three configurations, and found each was successful in separating naive from lesion cells (Figures 6B–D). In all cases, the decoder loss was smaller than 10% (loss < 0.1), and the lowest error (loss) was reported for the configuration GELU-kl_div. We then ran a gradient boosting classifier (GBC) to train a model that could classify instances into the appropriate original labels from the compressed representation of data in the latent space of the autoencoder. The resulting GBC model was able to classify the compressed representation of two clusters with an accuracy of 94%. We then tried to separate three clusters in the single cell population, using the same loss functions, combined with the ReLU or GELU activation functions. Three configurations tested provided satisfactory separation of the three expected clusters (Figures 6E–G), but the lowest loss was obtained for the configuration ReLU-pnl.
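Below is a minimal PyTorch sketch of an undercomplete autoencoder of the kind described above, with a 40-dimensional latent space, a GELU hidden activation and the Poisson NLL loss. The layer sizes, optimizer, number of epochs and the tensor `expr_t` (normalized counts, cells × genes) are assumptions for illustration; the exact architecture used for Figure 6 may differ.

```python
# Undercomplete autoencoder sketch (PyTorch) with a 40-dimensional latent space.
import torch
from torch import nn

n_genes = expr_t.shape[1]                 # e.g., 12,138 transcripts
encoder = nn.Sequential(nn.Linear(n_genes, 512), nn.GELU(), nn.Linear(512, 40))
decoder = nn.Sequential(nn.Linear(40, 512), nn.GELU(), nn.Linear(512, n_genes))
model = nn.Sequential(encoder, decoder)

loss_fn = nn.PoissonNLLLoss(log_input=True)   # one of the losses discussed above
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    optim.zero_grad()
    recon = model(expr_t)                 # decoder output, interpreted as log rates
    loss = loss_fn(recon, expr_t)         # reconstruction error
    loss.backward()
    optim.step()

latent = encoder(expr_t).detach()         # 40-feature representation fed to the GBC
```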
FIGURE 5 | Clustering of microglia single cell transcriptomes using tSNE, PCA or MDS. (A) Determination of optimal number of clusters through K-means clustering.
Validation of cluster number (n) with the Silhouette method when two (B) or three (C) clusters were considered. (D) Transcriptome samples clustered with tSNE (n = 2)
and colored with a single color, or with two colors (E). Such clusters correspond to naive and lesion cells. (F) Transcriptome samples clustered with tSNE (n = 3) and
colored with three colors. Such clusters correspond to two naive and one lesion groups of cells. (G) Representative differentially expressed genes between naive and
lesion cells (see E), were in agreement with (8). (H) Transcriptome samples clustered by PCA or with MDS (I).
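For readers who wish to reproduce ordinations like those in Figure 5, the following sketch applies StandardScaler, t-SNE (perplexity 30, as in the text), PCA and MDS to the same matrix; `expr` and `cluster_labels` are assumed inputs, and the plotting details are illustrative.

```python
# Ordination sketch: t-SNE, PCA and MDS embeddings of the same expression matrix.
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE, MDS
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X_std = StandardScaler().fit_transform(expr)

embeddings = {
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std),
    "PCA": PCA(n_components=2).fit_transform(X_std),
    "MDS": MDS(n_components=2, random_state=0).fit_transform(X_std),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, emb) in zip(axes, embeddings.items()):
    ax.scatter(emb[:, 0], emb[:, 1], c=cluster_labels, s=5)  # color by cluster
    ax.set_title(name)
plt.show()
```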
A GBC was able to classify data points in the latent space of the autoencoder with an accuracy of 88%. The autoencoder compressed the 12,138 original features (transcripts) into 40 latent features. Thus, although using an autoencoder to generate a low-dimensional representation of the single-cell RNAseq data slightly reduced the classification accuracy with a GBC model, the visual representation had higher resolution and allowed better discrimination of the clusters. A few instances corresponding to the lesion cluster were spread out away from the corresponding cluster. It is possible that those data points correspond to the small lesion clusters reported by Plemel et al. (8).

Anomaly Detection
In classical statistics, an outlier can be defined as an observation that lies at an unexpected distance from the rest of the observations in a random sample from the same population. Assessment of such observations usually involves graphical methods like scatter or box plots, and outliers were historically identified and removed from datasets. ML also offers methods to detect outliers and refers to them as anomalies (77–79). Essentially, a model is trained with "normal" instances, and learns to identify instances that deviate from such a subset (19, 80). Initially, those samples were removed prior to subsequent analysis, but more recently they are studied as cases that could represent critical stages of a phenomenon under study. Such methods have seen applications in many fields, including biomedical research. Examples are detection of anomalous signals from body sensors, or detection of cancer cells in micrographs of tissue from cancer patients in early stages, and many more (66, 81–83).
FIGURE 6 | Clustering of microglia single cell transcriptomes using autoencoders. (A) Cartoon depicting an autoencoder neural network. When aiming at
discriminating between two clusters, we used Tanh (B,C) and GELU (D) as activation functions for hidden layers, and either Poisson NLL (pnl) (B) or Kullback-Leibler
divergence (kl_div) (C,D) loss functions. When aiming at discriminating between three clusters, either ReLU (E), or sigmoid (F,G) were used as activation functions.
Loss functions are also indicated (E–G).
Finally, we would like to mention a very exciting class of ML frameworks dubbed generative adversarial networks (GANs), where supervised and unsupervised deep learning notions converge. Here, an unsupervised generative model is trained using two neural networks that compete in a zero-sum game (84). Informed by statistics from the training dataset, a generative model learns to create new instances, while a discriminator model attempts to differentiate between instances generated by the contesting model and the real instances from the actual training dataset (50, 75, 84). The two models update each other until the generator model fools the discriminator model half of the time (85). GANs can be applied to different fields in biomedical research, including clinical image processing (through CNNs), prediction of disease outcome, and modeling of cell differentiation from single cell RNAseq data (86–88).

CONCLUDING REMARKS

Machine learning (ML) comprises a vast collection of computational methods that attempt to extract patterns from data, then use those patterns to derive mathematical models that are able to generalize the learned rules on unseen data. In other words, it generates artificial intelligence. In biomedical research, supervised and unsupervised ML techniques have been applied for decades, including regression analyses and clustering (1, 32, 36, 57, 64). However, with the exponential increase in computer power and data availability, ML has gained renewed impetus, especially in the area of deep learning. Application of modern ML approaches extends across many areas in biomedical research (14), ranging from analysis of clinical images, to stratification of patients into the most promising therapeutic interventions, to drug discovery, and robotics, just to mention a few. Our goal was to bring an updated and simplified perspective of ML to non-experts, in the hope that members of the biomedical research community will realize the opportunities that ML offers for basic and applied researchers.

DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found at: https://www.kaggle.com

AUTHOR CONTRIBUTIONS

JJ and RG: design of study, analysis of data, and preparation of manuscript. Both authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS

We thank Laura Fink, Kaggle Kernels Grandmaster, for her suggestions to a previous version of this manuscript. To Laura Fink and Paul Mooney, because snippets of their code, published in Kaggle, were used to analyze the IDC data set. To Jason Plemel (University of Alberta), for providing the microglia single cell transcriptomics raw data.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmed.2021.771607/full#supplementary-material

Supplementary Figure 1 | Graphical description of the hepatitis dataset used to illustrate supervised machine learning algorithms.

Supplementary Figure 2 | (A) Decision tree obtained when the DTC algorithm was applied to the hepatitis dataset. (B) Comparison of decision surfaces of a single decision tree and a random forest applied for classification of survival of the hepatitis dataset (only albumin and bilirubin were used for creation of such plots).

Supplementary Figure 3 | Illustration of clustering methods on simulated data. (A) 10,000 random points clustered into three clusters were generated. Such random points were reorganized into three clusters using agglomerative clustering (B), Birch clustering (C), KMeans clustering (D), spectral clustering (E) and Gaussian mixture clustering (F). The Silhouette score is included in each case.

Supplementary Figure 4 | tSNE (A), PCA (B), and MDS (C) when the microglia single cell transcriptome data was normalized using a regularized logarithmic transformation.
REFERENCES Intelligence, Smart Grid and Smart City Applications (Springer International
Publishing) (2020). doi: 10.1007/978-3-030-24051-6_58
1. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data 8. Plemel JR, Stratton JA, Michaels NJ, Rawji KS, Zhang E, Sinha S, et
Mining, Inference, and Prediction, Second Edition. New York, NY: Springer al. Microglia response following acute demyelination is heterogeneous
Science & Business Media (2009). and limits infiltrating macrophage dispersion. Sci Adv. (2020)
2. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, 6:eaay6324. doi: 10.1126/sciadv.aay6324
and prospects. Science. (2015) 349:255–60. doi: 10.1126/science.aaa 9. Francois-Lavet V, Henderson P, Islam R, Bellemare MG, Pineau J. An
8415 Introduction to Deep Reinforcement Learning. arXiv [cs.LG]. (2018). Available
3. Müller AC, Guido S. Introduction to Machine Learning with Python: A Guide online at: http://arxiv.org/abs/1811.12560
for Data Scientists. Sebastopol, CA: O’Reilly Media, Inc. (2016). 10. Majumder A. Introduction to reinforcement learning. Deep Reinforce. Learn.
4. Ayodele TO. Types of machine learning algorithms. New Adv Mach Learn. Unity. (2021) 2021:1–71. doi: 10.1007/978-1-4842-6503-1_1
(2010) 3:19–48. doi: 10.5772/9385 11. Sutton RS, Barto AG. Reinforcement Learning, Second edition: An
5. Berry MW, Mohamed A, Yap BW. Supervised and Unsupervised Learning for Introduction. Cambridge, MA: MIT Press (2018).
Data Science. Cham: Springer Nature (2019). doi: 10.1007/978-3-030-22475-2 12. Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy Gradient Methods
6. Walker A, Surda P. Unsupervised learning techniques for Reinforcement Learning With Function Approximation. Cambridge,
for the investigation of chronic rhinosinusitis. Ann Otol Massachusetts (1999). Available online at: https://proceedings.neurips.
Rhinol Laryngol. (2019) 128:1170–6. doi: 10.1177/0003489419 cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf (accessed
863822 May 20, 2021).
7. Sindhu Meena K, Suriya S. A survey on supervised and unsupervised 13. Chuang LY, Tsai JH, Yang CH. Operon prediction using particle
learning techniques. In: Proceedings of International Conference on Artificial swarm optimization and reinforcement learning. In: 2010 International
Conference on Technologies and Applications of Artificial Intelligence 39. Rosenblatt F. The perceptron: a probabilistic model for information
(2010). doi: 10.1109/TAAI.2010.65 storage and organization in the brain. Psychol Rev. (1958) 65:386–
14. Mahmud M, Kaiser MS, Hussain A, Vassanelli S. Applications of deep learning 408. doi: 10.1037/h0042519
and reinforcement learning to biological data. IEEE Trans Neural Netw Learn 40. Agostinelli F, Hoffman M, Sadowski P, Baldi P. Learning Activation Functions
Syst. (2018) 29:2063–79. doi: 10.1109/TNNLS.2018.2790388 to Improve Deep Neural Networks. arXiv [cs.NE]. (2014). Available online
15. Graesser L, Keng WL. Foundations of Deep Reinforcement Learning: Theory at: http://arxiv.org/abs/1412.6830
and Practice in Python. Addison-Wesley Professional (2019). 41. Nwankpa C, Ijomah W, Gachagan A, Marshall S. Activation Functions:
16. Petersen BK, Yang J, Grathwohl WS, Cockrell C, Santiago C, An G, et al. Deep Comparison of trends in Practice and Research for Deep Learning. arXiv
reinforcement learning and simulation as a path toward precision medicine. J [cs.LG]. (2018). Available online at: http://arxiv.org/abs/1811.03378
Comput Biol. (2019) 26:597–604. doi: 10.1089/cmb.2018.0168 42. Buturovic LJ, Citkusev LT. Back propagation and forward propagation. In:
17. McCorduck P. Machines Who Think: A Personal Inquiry into the [Proceedings 1992] IJCNN International Joint Conference on Neural Networks.
History and Prospects of Artificial Intelligence. Canada: CRC Press Baltimore, MD (1992). doi: 10.1109/IJCNN.1992.227297
(2004). doi: 10.1201/9780429258985 43. Albawi S, Mohammed TA, Al-Zawi S. Understanding of a convolutional
18. Okfalisa O, Gazalba I, Mustakim, Reza NGI. Comparative analysis neural network. In: 2017 International Conference on Engineering and
of k-nearest neighbor and modified k-nearest neighbor algorithm for Technology (ICET). (2017). doi: 10.1109/ICEngTechnol.2017.8308186
data classification. In: 2017 2nd International conferences on Information 44. Hubel DH, Wiesel TN. Receptive fields of cells in striate cortex of
Technology, Information Systems and Electrical Engineering (ICITISEE). very young. Visually inexperienced kittens. J Neurophysiol. (1963) 26:994–
Yogyakarta (2017). doi: 10.1109/ICITISEE.2017.8285514 1002. doi: 10.1152/jn.1963.26.6.994
19. Géron A. Hands-On Machine Learning with Scikit-Learn, Keras, TensorFlow: 45. Hubel DH, Wiesel TN. Receptive fields and functional
Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, CA: O'Reilly Media, Inc. (2019).
20. Diaconis P, Efron B. Computer-intensive methods in statistics. Sci Am. (1983) 248:116–31. doi: 10.1038/scientificamerican0583-116
21. Cestnik B, Kononenko I, Bratko I. A knowledge-elicitation tool for sophisticated users. In: Proceedings of the 2nd European Conference on European Working Session on Learning EWSL'87. Sigma Press (1987).
22. Baskaya B. Statistical Analysis of Decision Trees. Long Beach, CA: California State University (2011).
23. Kingsford C, Salzberg SL. What are decision trees? Nat Biotechnol. (2008) 26:1011–3. doi: 10.1038/nbt0908-1011
24. Kotsiantis SB. Decision trees: a recent overview. Artif Intell Rev. (2013) 39:261–83. doi: 10.1007/s10462-011-9272-4
25. Dahan H, Cohen S, Rokach L, Maimon O. Proactive Data Mining with Decision Trees. New York, NY: Springer Science & Business Media (2014). doi: 10.1007/978-1-4939-0539-3
26. Slocum M. Decision making using ID3 algorithm. Insight: Rivier Acad J. (2012) 2012:8. Available online at: https://www2.rivier.edu/journal/ROAJ-Fall-2012/J674-Slocum-ID3-Algorithm.pdf (accessed March 13, 2020).
27. Yang S, Guo JZ, Jin JW. An improved ID3 algorithm for medical data classification. Comput Electr Eng. (2018) 65:474–87. doi: 10.1016/j.compeleceng.2017.08.005
28. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot. (2013) 7:21. doi: 10.3389/fnbot.2013.00021
29. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. (2017) 30:3146–54. Available online at: https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf (accessed August 05, 2020).
30. Zhang Z, Zhao Y, Canes A, Steinberg D, Lyashevska O. Predictive analytics with gradient boosting in clinical medicine. Ann Transl Med. (2019) 7:152. doi: 10.21037/atm.2019.03.29
31. Matloff N. Statistical Regression and Classification: From Linear Models to Machine Learning. Boca Raton, FL: CRC Press (2017). doi: 10.1201/9781315119588
32. Montgomery DC, Peck EA, Vining GG. Introduction to Linear Regression Analysis. Hoboken, NJ: John Wiley & Sons (2012).
33. Jacobson SH. Optimal mean squared error analysis of the harmonic gradient estimators. J Optimiz Theory App. (1994) 80:573–90. doi: 10.1007/BF02207781
34. Ruder S. An Overview of Gradient Descent Optimization Algorithms. arXiv [cs.LG]. (2016). Available online at: http://arxiv.org/abs/1609.04747
35. Menard S. Logistic Regression: From Introductory to Advanced Concepts and Applications. Thousand Oaks, CA: SAGE (2010). doi: 10.4135/9781483348964
36. Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression. John Wiley & Sons (2013). doi: 10.1002/9781118548387
37. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. (2015) 521:436–44. doi: 10.1038/nature14539
38. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. (1943) 5:115–33. doi: 10.1007/BF02478259
45. Hubel DH, Wiesel TN. Receptive fields and functional architecture of monkey striate cortex. J Physiol. (1968) 195:215–43. doi: 10.1113/jphysiol.1968.sp008455
46. Cruz-Roa A, Basavanhally A. Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. Med Imaging. (2014) 9041:3872. doi: 10.1117/12.2043872
47. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. (1989) 1:541–51. doi: 10.1162/neco.1989.1.4.541
48. LeCun Y, Bengio Y. Convolutional networks for images, speech, and time series. Handbook Brain Theory Neural Netw. (1995) 3361:1995.
49. Suryanarayana SM, Pérez-Fernández J, Robertson B, Grillner S. The evolutionary origin of visual and somatosensory representation in the vertebrate pallium. Nat Ecol Evol. (2020) 4:639–51. doi: 10.1038/s41559-020-1137-2
50. Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, et al. The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches. arXiv [cs.CV]. (2018). Available online at: http://arxiv.org/abs/1803.01164
51. Ismail Fawaz H, Lucas B, Forestier G, Pelletier C, Schmidt DF, Weber J, et al. InceptionTime: finding AlexNet for time series classification. Data Min Knowl Discov. (2020) 34:1936–62. doi: 10.1007/s10618-020-00710-y
52. Pang Y, Sun M, Jiang X, Li X. Convolution in convolution for network in network. IEEE Trans Neural Netw Learn Syst. (2018) 29:1587–97. doi: 10.1109/TNNLS.2017.2676130
53. Abdi H. A neural network primer. J Biol Syst. (1994) 2:247–81. doi: 10.1142/S0218339094000179
54. Conover M, Staples M, Si D, Sun M, Cao R. AngularQA: protein model quality assessment with LSTM networks. Comput Mathemat Biophys. (2019) 7:1–9. doi: 10.1515/cmb-2019-0001
55. Miljković D. Brief review of self-organizing maps. In: 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). (2017). doi: 10.23919/MIPRO.2017.7973581
56. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. (1999) 96:2907–12. doi: 10.1073/pnas.96.6.2907
57. Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, et al. A review of clustering techniques and developments. Neurocomputing. (2017) 267:664–81. doi: 10.1016/j.neucom.2017.06.053
58. Aranganayagi S, Thangavel K. Clustering categorical data using silhouette coefficient as a relocating measure. In: International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007). Tamil Nadu, India (2007). doi: 10.1109/ICCIMA.2007.328
59. Dinh DT, Fujinami T, Huynh VN. Estimating the optimal number of clusters in categorical data clustering by silhouette coefficient. Commun Comp Inform Sci. (2019) 2019:1–17. doi: 10.1007/978-981-15-1209-4_1
60. Pielou EC. The Interpretation of Ecological Data: A Primer on Classification and Ordination. New York, NY: John Wiley & Sons (1984).
61. Rohlf FJ. The interpretation of ecological data: a primer on classification and ordination. E. C. Pielou. Q Rev Biol. (1985) 60:531. doi: 10.1086/414660
62. Popat SK, Emmanuel M. Review and comparative study of clustering techniques. Int J Comp Sci Inform Technol. (2014) 5:805–12. Available online at: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.433.3348&rep=rep1&type=pdf (accessed March 3, 2021).
63. Grossman GD, Nickerson DM, Freeman MC. Principal component analyses of assemblage structure data: utility of tests based on eigenvalues. Ecology. (1991) 72:341–7. doi: 10.2307/1938927
64. Borg I, Groenen PJF. Modern Multidimensional Scaling: Theory and Applications. New York, NY: Springer Science & Business Media (2005).
65. Peluffo-Ordóñez DH, Lee JA, Verleysen M. Short review of dimensionality reduction methods based on stochastic neighbour embedding. In: Advances in Self-Organizing Maps and Learning Vector Quantization. (Springer International Publishing) (2014). doi: 10.1007/978-3-319-07695-9_6
66. Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding. arXiv [cs.LG]. (2017). Available online at: http://arxiv.org/abs/1712.09005
67. Rogovschi N, Kitazono J, Grozavu N, Omori T, Ozawa S. t-Distributed stochastic neighbor embedding spectral clustering. In: 2017 International Joint Conference on Neural Networks (IJCNN). Anchorage, AK (2017). doi: 10.1109/IJCNN.2017.7966046
68. van der Maaten L. Visualizing Data using t-SNE. (2008). Available online at: https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf?fbclid=IwAR0Bgg1eA5TFmqOZeCQXsIoL6PKrVXUFaskUKtg6yBhVXAFFvZA6yQiYx-M (accessed May 17, 2021).
69. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat Commun. (2019) 10:5416. doi: 10.1038/s41467-019-13056-x
70. Pearson K. LIII. On lines and planes of closest fit to systems of points in space. London, Edinburgh, and Dublin Philosophical Magazine and J Sci. (1901) 2:559–72. doi: 10.1080/14786440109462720
71. Jolliffe IT. Principal Component Analysis. Springer Science & Business Media (2013). doi: 10.1002/9781118445112.stat06472
72. Kruskal JB. Multidimensional Scaling. SAGE (1978). doi: 10.4135/9781412985130
73. Cox MAA, Cox TF. Multidimensional scaling. In: Chen CH, Härdle W, Unwin A, editors. Handbook of Data Visualization. Berlin: Springer Berlin Heidelberg (2008). doi: 10.1007/978-3-540-33037-0_14
74. Baldi P. Autoencoders, unsupervised learning, deep architectures. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, Proceedings of Machine Learning Research. (Bellevue, WA: PMLR) (2012) 37–49.
75. Alpaydin E. Introduction to Machine Learning. Cambridge, MA: MIT Press (2020). doi: 10.7551/mitpress/13811.001.0001
76. Bank D, Koenigstein N, Giryes R. Autoencoders. arXiv [cs.LG] (2020). Available online at: http://arxiv.org/abs/2003.05991
77. Noble CC, Cook DJ. Graph-based anomaly detection. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD '03. (New York, NY: Association for Computing Machinery) (2003) 631–36. doi: 10.1145/956750.956831
78. Song X, Wu M, Jermaine C, Ranka S. Conditional anomaly detection. IEEE Trans Knowl Data Eng. (2007) 19:631–45. doi: 10.1109/TKDE.2007.1009
79. Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv. (2009) 41:1–58.
80. Mehrotra KG, Mohan CK, Huang H. Anomaly Detection Principles and Algorithms. Cham, Switzerland: Springer (2017).
81. Hauskrecht M, Valko M, Kveton B, Visweswaran S, Cooper GF. Evidence-based anomaly detection in clinical domains. AMIA Annu Symp Proc. (2007) 319–23.
82. Antonelli D, Bruno G, Chiusano S. Anomaly detection in medical treatment to discover unusual patient management. IIE Trans Healthc Syst Eng. (2013) 3:69–77. doi: 10.1080/19488300.2013.787564
83. Churová V, Vyškovský R, Maršálová K, Kudláček D, Schwarz D. Anomaly detection algorithm for real-world data and evidence in clinical research: implementation, evaluation, and validation study. JMIR Med Inform. (2021) 9:e27172. doi: 10.2196/27172
84. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Commun ACM. (2020) 63:139–44. doi: 10.1145/3422622
85. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, editors. Advances in Neural Information Processing Systems. (Curran Associates, Inc.) (2014). Available online at: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
86. Bing X, Zhang W, Zheng L, Zhang Y. Medical image super resolution using improved generative adversarial networks. IEEE Access. (2019) 7:145030–8. doi: 10.1109/access.2019.2944862
87. Guan S, Loew M. Using generative adversarial networks and transfer learning for breast cancer detection by convolutional neural networks. In: Medical Imaging 2019: Imaging Informatics for Healthcare, Research, and Applications. San Diego (2019). doi: 10.1117/12.2512671
88. Lan L, You L, Zhang Z, Fan Z, Zhao W, Zeng N, et al. Generative adversarial networks and its applications in biomedical informatics. Front Public Health. (2020) 8:164. doi: 10.3389/fpubh.2020.00164

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2021 Jovel and Greiner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.