A Brief Review on Deep Learning Applications in Genomic Studies
Deep learning is a powerful tool for capturing complex structures within data. It holds
great promise for genomic research due to its capacity to learn complex features from
genomic data. In this paper, we provide a brief review of deep learning techniques and
various applications of deep learning to genomic studies. We also briefly mention current
challenges and future perspectives on using emerging deep learning techniques for
ongoing and future genomic research.
Keywords: neural networks, supervised learning, unsupervised learning, semi-supervised learning, genomics,
genetics
1 INTRODUCTION
Deep learning has achieved great success in many areas such as computer vision and natural
language processing. It leads data-driven science into a new era due to its ability to learn
complex structure from data without human intervention. With its success in many areas, there is
increasing interest in using deep learning in genomic research. Genomic data are sophisticated in
nature and have complex relationships with responses (e.g., disease outcomes). While classical
methods (e.g., linear regression) have commonly been used in genomic data analysis to detect simple
linear effects, deep learning can learn complex features from genomic data, making it a powerful
method for considering nonlinear and interaction effects. In this review paper, we provide a brief
review of a variety of applications of deep learning to genomic research. Deep learning, as a class of
machine learning approaches, can be categorized into supervised learning and unsupervised
learning. We start by introducing key concepts in supervised learning, unsupervised learning and
semi-supervised learning, and then review popular deep learning methods and their applications
in genomic research. Due to the large number of available deep learning methods and limited space,
the review mainly focuses on classic deep learning methods, especially those having the potential to be
applied to genomic data analysis.
2 SUPERVISED, UNSUPERVISED AND SEMI-SUPERVISED LEARNING
2.1 Supervised Learning
Statistically speaking, there are three key elements in supervised learning: 1) a generator of random
vectors X from a fixed unknown distribution P(x), 2) a supervisor (or a teacher) that returns Y for
every X according to a conditional distribution P(y|x), and 3) a class of learning machines
{f(x, θ): θ ∈ Θ}. This concept was introduced by Vapnik (1998). The question is, given
independent and identically distributed (i.i.d.) pairs of data (X_1, Y_1), ..., (X_n, Y_n), often known
as the training data, from the joint distribution P(x, y) = P(y|x)P(x), how to choose from
{f(x, θ): θ ∈ Θ} an f that predicts the supervisor's response Y in the "best" possible way. When Y
is continuous, the learning problem is often known as a regression problem. In a regression, we seek
the best parameter θ that minimizes the quadratic loss function:

θ̂_1 = arg min_{θ ∈ Θ} (1/n) Σ_{i=1}^{n} (Y_i − f(X_i, θ))^2.

When Y is dichotomous, the learning problem is known as a classification or pattern recognition
problem. In a classification problem, a commonly used loss function is the cross-entropy function:

θ̂_2 = arg min_{θ ∈ Θ} −(1/n) Σ_{i=1}^{n} [Y_i log f(X_i, θ) + (1 − Y_i) log(1 − f(X_i, θ))].

When f(x, θ) = x^T θ, θ̂_1 becomes the classical least squares estimator in a linear regression.
Similarly, θ̂_2 is the estimator for the coefficients in a logistic regression if f(x, θ) = (1 + e^{−x^T θ})^{−1}.
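As an illustrative sketch (added for this review, not part of the original methods), both estimators can be obtained with standard numerical tools once f(x, θ) is specified. The minimal NumPy example below, with simulated data and hypothetical variable names, fits the two special cases above:

import numpy as np

# Minimal sketch: empirical risk minimization for f(x, theta) = x^T theta.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)

# Regression: the quadratic loss has the closed-form least squares solution.
y_cont = X @ theta_true + rng.normal(scale=0.5, size=n)
theta_hat1 = np.linalg.lstsq(X, y_cont, rcond=None)[0]

# Classification: minimize the average cross-entropy by gradient descent,
# which recovers the logistic regression coefficient estimates.
p_true = 1.0 / (1.0 + np.exp(-(X @ theta_true)))
y_bin = (rng.uniform(size=n) < p_true).astype(float)
theta_hat2 = np.zeros(d)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ theta_hat2)))
    theta_hat2 -= 0.1 * (X.T @ (p - y_bin) / n)   # gradient of the cross-entropy loss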
FIGURE 1 | Structures of popular machine learning and deep learning models. (A) A perceptron, where a nonlinear activation function (e.g., a hard-threshold function) is applied to the linear combination of inputs and weights to predict the output. (B) A neural network with one hidden layer, which consists of multiple perceptrons. The blue computation units are the hidden units, which are generated by applying a nonlinear activation function (e.g., a ReLU function) to the linear combination of inputs and weights. The output layer, with computation units shown in orange, uses an activation function (e.g., a sigmoid or softmax function) to produce predicted values. (C) A deep neural network with two hidden layers, where computation units in each hidden layer apply a nonlinear activation function to the linear combination of weights and outputs from the previous layer. (D) A convolutional neural network (CNN), where the input is an image with three channels representing red, green and blue. The hidden layers of a CNN comprise two types of layers: convolutional layers and pooling layers. A convolutional layer consists of several filters, which have the same number of channels as the input data. Each filter acts as a sliding window and applies a nonlinear activation to the linear combination of filter entries and the outputs from the previous layer. Such an operation is known as convolution. A pooling layer is used to reduce the size of the representation to accelerate computations, as well as to make detected features more robust. A commonly used pooling layer is called max pooling, where a filter acts as a sliding window and produces the maximum element from that window. After several convolutional and pooling layers, the output is vectorized as the input of a fully connected neural network. (E) A recurrent neural network, where both the input and the output are sequences with the same length. Each input x_t (e.g., a word in a sentence) and the output a_{t−1} from the previous neural network are used to predict ŷ_t and produce an output a_t, which then serves as the input for the next neural network. Typical structures of the neural network used in RNN are the RNN cell and the long short-term memory (LSTM) cell shown in (F,G), respectively.
2.2 Neural Networks
Neural networks are algorithms that try to mimic the function of a human brain. A neural network is a collection of perceptrons. Therefore, another commonly used name for neural networks is multi-layer perceptrons. The basic structure of a perceptron is shown in Figure 1A. In a perceptron (Rosenblatt, 1958), a nonlinear activation function is applied to the linear combination of weights and input features to produce an output. Commonly used nonlinear activation functions in neural networks and in deep learning include:

• Hard-threshold (Heaviside) function: σ(u) = 1 if u ≥ 0, and σ(u) = 0 if u < 0;
• Soft-threshold (logistic) function: σ(u) = (1 + e^{−u})^{−1};
• Hyperbolic tangent function: σ(u) = tanh(u) = (e^u − e^{−u}) / (e^u + e^{−u});
• Normal cumulative distribution function: σ(u) = ∫_{−∞}^{u} (2π)^{−1/2} e^{−t²/2} dt; and
• Rectified linear unit (ReLU) function: σ(u) = max(0, u) (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011).

While linear activation functions can be used in the output layer for regression types of problems, it is important to use nonlinear activation functions in hidden layers. The use of nonlinear activation functions makes it possible for neural networks to capture nonlinear relationships between input data and output data. If linear activation functions were used in hidden layers instead, a neural network would collapse to a linear regression or a logistic regression. Figure 1B shows the general structure of a neural network with one hidden layer. As we can see from the figure, the difference between a perceptron and a neural network with one hidden layer is an additional layer, known as the hidden layer, that lies between the input layer and the output layer. Each hidden unit in the hidden layer is formed in the same way as the processor in a perceptron: it is generated by applying a nonlinear activation function to a linear combination of weights and inputs.

An important characteristic of a neural network is the universal approximation theorem (Hornik et al., 1989). The theorem says that a neural network with one hidden layer can approximate a continuous function defined on a compact set in R^d arbitrarily well as long as the number of hidden units is large enough. Nevertheless, Györfi et al. (2006) and Shen et al. (2019) found that the number of hidden units cannot grow as fast as the sample size if the neural network estimators are to reach statistical consistency. Therefore, there is a gap between theory and applications on this topic, which might be worth further investigation.
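To make the structure in Figure 1B concrete, the sketch below (an illustrative PyTorch example added for this review, with arbitrary layer sizes) builds a network with one hidden layer; replacing the ReLU with an identity map would collapse the model back to a linear regression:

import torch
import torch.nn as nn

d_in, n_hidden = 10, 64
net = nn.Sequential(
    nn.Linear(d_in, n_hidden),   # linear combination of inputs and weights
    nn.ReLU(),                   # nonlinear activation in the hidden layer
    nn.Linear(n_hidden, 1),      # linear output unit for a regression problem
)

x = torch.randn(32, d_in)        # a toy batch of 32 samples
y = torch.randn(32, 1)
loss = nn.MSELoss()(net(x), y)   # quadratic loss from Section 2.1
loss.backward()                  # gradients used by back propagation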
2.3 Deep Neural Networks
A deep neural network is a neural network with more than one hidden layer. Figure 1C gives an example of a deep neural network with two hidden layers. One advantage of deep learning is that it requires a much smaller number of hidden units to learn complex features, while a much larger number of hidden units may be needed for a shallow neural network. An example is learning the XOR (i.e., exclusive OR) function. As shown in Figure 2, if we use a tree-structured deep neural network to learn the function, the number of hidden units required is O(log n). The number of hidden units increases exponentially in a neural network with one hidden layer, as we need to enumerate all 2^n possible configurations of the input bits to learn the XOR function.

Among all the nonlinear activation functions mentioned above, the ReLU activation function is one of the most popularly used functions in deep neural networks. For most of the other nonlinear activation functions, the function value is almost unchanged when the input value is too large or too small. Therefore, when applying the back-propagation algorithm, the gradient is close to zero, which slows down the update of parameters (Rumelhart et al., 1988). ReLU avoids this vanishing gradient problem and is computationally efficient, which makes it ideal for training deep neural networks with many layers.

Besides the well-known fully connected feed-forward neural network, there are two other types of neural networks that have been widely used. One is known as the convolutional neural network (CNN) (LeCun, 1989) and the other is the recurrent neural network (RNN) (Rumelhart et al., 1988). CNN is commonly used for grid-like data structures such as images, while RNN is often used for sequence data such as a DNA sequence. The main feature of CNN is that convolution operations are used in place of matrix multiplications (Goodfellow et al., 2016), and the convolution operation captures spatial information in the data. Figure 1D provides a typical structure of CNN. Hidden layers in CNN usually consist of two parts. One type of hidden layer is the convolutional layer, where several filters having the same number of channels are applied to the output of the previous layer. Each filter acts as a sliding window, and a nonlinear activation function is applied to the linear combination of the weights in the filter and the elements in the "window" that come from the output of the previous layer.
FIGURE 2 | A tree-structured deep neural network representing the XOR (i.e., exclusive or) function on the input data, where each input unit can only take two values, 0 and 1. By using a deep neural network, the depth of the network is O(log n), so we do not need a large number of nodes to approximate the XOR function. However, if we approximate this function using only one hidden layer, then the number of units in this hidden layer can be exponentially large, as we need to enumerate all 2^n possible configurations of the input bits.
The other type of hidden layer in CNN is called the pooling layer. A commonly used pooling layer is known as the max pooling layer, where a filter serves as a sliding window and the maximum element from that window is extracted. There are no parameters that need to be learned in a pooling layer. A pooling layer is used to reduce the size of the representation, which speeds up the computation and makes detected features more robust.

While CNN can be used to capture spatial information in the data, RNN is used to capture the temporal dynamic behavior in the data. Figure 1E provides an example of RNN. In RNN, an input (e.g., a word in a sentence) is combined with the output from the last hidden layer of the previous neural network to serve as the input for the next successive neural network. The structure of commonly used RNN cells is shown in Figure 1F. Two unique features of RNNs are:

1) The input length and the output length can be different. Since the input of RNN is usually a sequence and the output can be a different sequence or a class label, it is likely that the length of the input differs from the length of the output. The structure of RNN is quite flexible, making it feasible to accommodate such scenarios.
2) Parameters are shared across neural networks. In an RNN, the weight matrices are shared across all neural networks, which greatly reduces the number of parameters to be estimated.

One issue of the classical RNN shown in Figure 1E is that it only uses the information earlier in the sequence, which can be addressed by using a bidirectional RNN (Schuster and Paliwal, 1997). Another drawback of a classical RNN is that it can run into a vanishing gradient problem, which makes it difficult to capture long-range dependencies. To address this issue, two modifications of the classical RNN cell were proposed: one is the gated recurrent unit (GRU) (Cho et al., 2014) and the other is the long short-term memory (LSTM) unit (Sak et al., 2014). Figure 1G shows the basic structure of an LSTM unit. The blue computation unit is known as the forget gate, which is used to get rid of previously stored memory values. The orange computation unit is known as the update gate. The updated value of the cell is given by c_t = Γ_f^t * c_{t−1} + Γ_u^t * c̃_t, which is determined by the values from the update gate and the forget gate. Therefore, both the update gate and the forget gate control the update of the cell value.
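The cell update above can be written out directly. The following NumPy sketch of a single LSTM step is illustrative only (random placeholder weights, biases omitted) and is not taken from any of the reviewed implementations:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

d_x, d_h = 8, 16
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_h, d_x + d_h)) for k in ("f", "u", "c", "o")}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    g_f = sigmoid(W["f"] @ z)                   # forget gate
    g_u = sigmoid(W["u"] @ z)                   # update gate
    c_tilde = np.tanh(W["c"] @ z)               # candidate memory value
    c_t = g_f * c_prev + g_u * c_tilde          # cell update c_t = G_f*c_{t-1} + G_u*c~_t
    h_t = sigmoid(W["o"] @ z) * np.tanh(c_t)    # output gate produces the hidden state
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):           # a toy sequence of length 5
    h, c = lstm_step(x_t, h, c)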
With an increasing number of inputs, most learning algorithms need to deal with the over-fitting issue. One can build a sophisticated model on a training dataset with a small training error, but such a model may not have good generalizability. When applying this model to a different testing dataset, the model could be subject to a high generalization (testing) error. Complex models usually have low bias but high variance. Therefore, the over-fitting issue is the same as the bias-variance tradeoff in statistics. Commonly used approaches for addressing the over-fitting issue in deep learning include regularization and dropout (Srivastava et al., 2014). In regularization approaches, a penalty term is added to the loss function to counteract over-fitting: while the model increases its complexity to reduce the discrepancy between the estimated value and the true value, it also increases the penalty. Therefore, minimizing the loss function with the penalty term helps to keep the balance between bias and variance. Dropout is another popularly used approach in neural networks. Figure 3 provides an illustration of the dropout approach. In dropout, we randomly delete hidden units with a certain probability and remove all the in-and-out edges associated with those hidden units. The intuition behind dropout is that, since the "input" hidden units can be randomly dropped out, the "output" hidden units cannot rely on any one of the features. Therefore, the weights have to be shrunk towards zero. As pointed out by Wager et al. (2013), when applied to linear regression, dropout is equivalent to the classical L2-regularization.
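As a small illustration (added here, not from the reviewed papers), inverted dropout can be written in a few lines; the mask is only applied during training, and the activations are rescaled so that their expectation is unchanged:

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.8, training=True):
    # Randomly delete hidden units (and implicitly their in-and-out edges).
    if not training:
        return activations
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob     # rescale to keep the expected value

hidden = rng.normal(size=(4, 10))             # toy activations from one hidden layer
print(dropout(hidden))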
FIGURE 3 | An illustration of dropout regularization. Each hidden unit is randomly deleted with some probability, marked by X in the figure, and the in-and-out edges
associated with those hidden units are also removed.
FIGURE 4 | Structures of popular unsupervised learning methods. (A) Basic structure of an autoencoder. The three layers on the left represent the encoding process, which extracts important features from the input data, and the two layers on the right represent the decoding process, which tries to reproduce the original data. An autoencoder is usually learned by minimizing the discrepancy between the original data and the reproduced data. (B) A deep belief network (DBN) with two hidden layers. A main characteristic of a DBN is that the edges between the top two layers are undirected, while the edges between all other layers are directed, pointing towards the layer that is closest to the data. (C) A deep Boltzmann machine (DBM) is a generative model with a structure similar to a DBN, except that all the connections between layers are undirected. (D) A variational autoencoder (VAE) learns two conditional distributions: the conditional distribution of latent features given the input data, q_φ(z|x), and the conditional distribution of outputs given the latent features, p_θ(x|z), which is the target distribution used to generate new samples. (E) General structure of a generative adversarial network (GAN). A GAN starts with samples from a simple distribution such as random noise, uses a neural network (the generator network) to learn a complex transformation of the samples to create fake outputs, and uses a discriminator network to judge whether the generated outputs are close enough to the real data.
2.4 Unsupervised Learning
In supervised learning, there is a teacher (i.e., labeled responses) supervising the performance of the learning machine through some metric quantifying the discrepancy. In unsupervised learning, however, there are no labeled responses. Instead, we are more interested in data compression by extracting useful information from the input data. The dimension of the extracted features is usually much smaller than the dimension of the original input data. By doing so, we can not only reduce the cost of data storage, but also make the downstream analyses more efficient.

A commonly used unsupervised learning algorithm is Principal Component Analysis (PCA). There is a counterpart of PCA in deep neural networks, known as the autoencoder (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994). Figure 4A provides an illustration of an autoencoder. In an autoencoder, important features are extracted from the original data. To determine whether the extracted features represent the original input, we reconstruct the "original data" from the extracted features and use the difference between the reconstructed data and the original data as a guideline to train the network.
One of the most active research topics in unsupervised learning is generative models. The goal of these models is to learn the model distribution from the data so that we can generate new data from the distribution. Here are some of the most commonly used generative models:

• Boltzmann machines (BM) (Fahlman et al., 1983; Hinton et al., 1984) provide a way to model the joint distribution of a large number of binary random variables.
• A Restricted Boltzmann machine (RBM) (Smolensky, 1986) is a bipartite undirected graph containing one visible layer and one hidden layer. Both layers contain nodes taking binary values, and the model is used to approximate any joint distribution of binary random variables. Therefore, an RBM is often known as a stochastic neural network.
• Deep Belief Networks (DBN) (Hinton, 2009) are generative models that have multiple layers of latent binary variables. As we can see from Figure 4B, the connections between the top two hidden layers are undirected, while the connections between other layers are directed and point towards the layers closer to the visible data.
• Deep Boltzmann machines (DBMs) (Salakhutdinov and Hinton, 2009) are similar to deep belief networks except that all the edges in a DBM are undirected, as shown in Figure 4C.
• Variational autoencoders (VAE) (Kingma and Welling, 2014) are a probabilistic version of autoencoders, which allows us to sample data from the model. The structure of a VAE is shown in Figure 4D. In a VAE, the hidden layers represent some latent factors, denoted by z in Figure 4D, which are used to generate the input data. The goal of VAE is to learn the parameters in two conditional distributions. The first one [i.e., q_φ(z|x) in Figure 4D] is the conditional distribution of the latent factors given the input data, and the other one [i.e., p_θ(x|z) in Figure 4D] is the conditional distribution of the output given the latent factors, which is used to generate new samples.
• Generative adversarial networks (GAN) (Goodfellow et al., 2014) are popular methods that enable us to sample data from complex, high-dimensional training distributions even when there is no direct way to do so. The basic structure of a GAN is shown in Figure 4E. A generator network is used to learn a complex transformation of samples from a simple distribution, such as random noise, and to produce fake outputs. A discriminator network is then applied to the fake outputs and the real data. The goal is to train the networks so that the discriminator network cannot distinguish the fake data from the real data (a minimal training sketch is given after this list).
2.5 Semi-Supervised Learning
As suggested by its name, semi-supervised learning sits between supervised learning and unsupervised learning. In supervised learning, each data point in the training data has a label, which serves as a "teacher" to guide the performance of prediction (Chapelle et al., 2006; Zhu, 2008). In many real-world problems, additional data points without labels may also be available. The goal of semi-supervised learning is to construct a learner by using both the labeled training data and the unlabeled data for improved performance. Although there is no guarantee that prediction performance will be improved by incorporating additional unlabeled data, empirical studies have shown consistent performance gains, compared with their supervised counterparts, from semi-supervised learning methods based on neural networks. Therefore, semi-supervised learning methods using deep neural networks have been widely applied to genomic studies, especially for cell-type classification using single-cell RNA-seq data. We provide a detailed survey on this topic in Section 5.

Van Engelen and Hoos (2020) provide a comprehensive survey on semi-supervised learning, where they taxonomize semi-supervised learning methods into two main categories: inductive methods and transductive methods. The goals of inductive methods are similar to those of supervised learning: a weak learner, mapping from the input space to the output space, is produced. In supervised learning, only labeled data are used, while in semi-supervised learning, both the labeled data and the unlabeled data are used. On the other hand, the goal of transductive methods is solely to predict labels for the unlabeled data points.

As inductive methods share the same goals as supervised learning methods, they can be used with any supervised learner. Different inductive methods use different strategies to incorporate unlabeled data. For example, one can use an autoencoder to extract important features from the unlabeled data and use these features when training on the labeled data; this is known as unsupervised preprocessing in van Engelen and Hoos (2020). One can also train a classifier using the labeled data and create pseudo-labels for the unlabeled data; the classifier is then retrained on the labeled dataset and the pseudo-labeled dataset. Such a method is called a wrapper method according to van Engelen and Hoos (2020). Unlabeled data can also be incorporated by adding additional terms into the loss function, and such inductive methods are called intrinsically semi-supervised methods.
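As an illustration of the wrapper idea (a hedged sketch using scikit-learn and simulated data, not any particular published method), one round of pseudo-labeling can look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(100, 20))          # labeled training data
y_lab = rng.integers(0, 2, size=100)
X_unlab = rng.normal(size=(1000, 20))       # additional unlabeled data

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
proba = clf.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.9         # keep only confident pseudo-labels
X_aug = np.vstack([X_lab, X_unlab[confident]])
y_aug = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)   # retrain on both sets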
FIGURE 5 | The block diagram of some main deep learning methods to be discussed in the rest of the review.
FIGURE 6 | The structure of DanQ. One-hot encoding is applied to the original DNA sequence. A convolutional neural network is used, followed by a bidirectional
recurrent neural network with long short-term memory units. The outputs from the bidirectional recurrent neural network are fed into a neural network to make final
predictions.
Since transductive methods only focus on predicting labels for unlabeled data without training a classifier, almost all transductive methods are graph-based, and they mainly consist of three steps: 1) constructing a graph based on some similarity measure, 2) weighting the edges, and 3) drawing inference on the graph.

Figure 5 provides a block diagram showing the major deep learning methods to be discussed in the following sections.

3 APPLICATIONS OF SUPERVISED DEEP LEARNING TO GENOMIC STUDIES

In recent years, deep learning techniques have been successfully applied to various areas such as computer vision, natural language processing, and autonomous driving. Starting from the seminal studies in 2015, which established the applicability of deep learning to DNA sequence data (Alipanahi et al., 2015; Zhou and Troyanskaya, 2015; Eraslan et al., 2019), there has been increasing interest in using deep learning in genomic studies. As mentioned in Park and Kellis (2015), deep learning holds great
promise for genomic research, since various levels of information and abstraction can be captured by different layers in deep learning.

Fully connected deep neural networks have been used in various genomic studies. For instance, Quang et al. (2014) proposed DANN, a method that makes predictions on the deleteriousness of genetic variants using deep neural networks. Compared with a commonly used algorithm known as combined annotation-dependent depletion (CADD) (Kircher et al., 2014), DANN reduces the relative error rate by 19%. The reason is that CADD uses a linear kernel support vector machine, so only linear representations can be learned from the data. Another area of deep learning application is gene expression inference. D-GEX (Chen et al., 2016) used a deep neural network to predict the expression of target genes from the expression of landmark genes. The relative performance of D-GEX, in terms of the overall error rate, improves 15.33% over linear regression, and D-GEX also achieves lower error than linear regression in a gene-wise comparative analysis.

CNNs are great tools for analyzing data with spatial dependencies. They hold great promise for DNA sequence data, as they can take linkage disequilibrium into account. Primary works on applying CNNs to genomic studies include DeepBind (Alipanahi et al., 2015), DeepSEA (Zhou and Troyanskaya, 2015) and Basset (Kelley et al., 2016). Since a DNA sequence is one-dimensional data, when applying CNNs, one-hot encoding is usually used to deal with the four DNA bases. For example, we can code each DNA base as A = [1,0,0,0], G = [0,1,0,0], C = [0,0,1,0], T = [0,0,0,1], so that a DNA sequence becomes a matrix with four columns and a classical CNN can be applied. If there are any missing values in the DNA coding, one possible solution is to add an additional column, which corresponds to the missing value, to the DNA one-hot encoding matrix. For the purpose of classifying transcription factors, the filters in the first convolutional layers are in effect motif detectors, which are similar to position weight matrices without requiring the entries to be probabilities or log-odds ratios.
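The encoding just described takes only a few lines of code. The snippet below is an illustrative sketch (the function name and the extra fifth column for missing or ambiguous bases are our own conventions, not those of the cited tools):

import numpy as np

CODE = {"A": 0, "G": 1, "C": 2, "T": 3}

def one_hot(seq):
    # One column per base (A, G, C, T) plus a fifth column for missing values.
    mat = np.zeros((len(seq), 5), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        mat[i, CODE.get(base, 4)] = 1.0    # unknown bases go to the "missing" column
    return mat

print(one_hot("ACGTN"))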
Besides CNNs, RNNs have also been applied to genomic studies. Pouladi et al. (2015) used matrix factorization and RNNs to construct a genotype imputation and phenotype sequence prediction system, which attained better performance than long short-term memory and spatial partial least squares models. Boža et al. (2017) proposed DeepNano, an RNN-based approach, which substantially improves the base calling accuracy for MinION sequencing data (Mikheyev and Tin, 2014). A combination of RNN and particle swarm optimization was proposed by Xu et al. (2007) to infer genetic regulatory networks and produced meaningful insights into the nonlinear dynamics of gene expression time series. Recently, ProLanGO (Cao et al., 2017), an RNN-based model, was proposed for the prediction of protein function.

As mentioned by Eraslan et al. (2019), an important area of applying deep learning to genomics is predicting the effect of non-coding regions. 98% of the human genome is non-coding, and 93% of the identified disease-associated variants from over 1,200 genome-wide association studies are located in non-coding regions (Pennisi, 2011). DeepSEA (Zhou and Troyanskaya, 2015) and DanQ (Quang and Xie, 2016) are two important works in this area. DeepSEA is a CNN approach with three convolutional layers and two max pooling layers. The network structure of DanQ, as shown in Figure 6, is similar to that of DeepSEA. However, instead of applying two more convolutional layers and a max pooling layer, DanQ uses a bidirectional long short-term memory RNN after the first convolutional and max pooling layer. The outputs from the LSTM units are then flattened in DanQ, and a dense layer of rectified linear units is applied, followed by a multi-task sigmoid output. Both methods attain great performance in terms of prediction accuracy, while DanQ outperforms DeepSEA and other methods (e.g., logistic regression) across several other metrics.
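To make this hybrid design concrete, the following PyTorch sketch outlines a DanQ-style model (convolution and max pooling, then a bidirectional LSTM and dense layers); it is only a schematic added for this review, and the layer sizes are indicative rather than the exact published configuration:

import torch
import torch.nn as nn

class HybridCNNRNN(nn.Module):
    def __init__(self, n_tasks=919):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 320, kernel_size=26),   # input: one-hot DNA with 4 channels
            nn.ReLU(),
            nn.MaxPool1d(13),
            nn.Dropout(0.2),
        )
        self.rnn = nn.LSTM(320, 320, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(925), nn.ReLU(),
                                  nn.Linear(925, n_tasks), nn.Sigmoid())

    def forward(self, x):                        # x: (batch, 4, sequence length)
        h = self.conv(x).transpose(1, 2)         # to (batch, steps, channels) for the LSTM
        h, _ = self.rnn(h)                       # bidirectional LSTM over the pooled steps
        return self.head(h)                      # flatten, dense ReLU layer, multi-task sigmoid

out = HybridCNNRNN()(torch.randn(2, 4, 1000))    # a toy batch of two 1 kbp sequences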
Another area of using deep learning is genetic association studies. In the past decade, genome-wide association studies (GWAS) have uncovered numerous genetic variants predisposing to human traits and diseases (Consortium, 2007; Scott et al., 2007; Sladek et al., 2007). Nevertheless, most of the identified variants are associated with small effects and account for only a small fraction of heritability (Maher, 2008). Part of the missing heritability can be explained by gene-gene interactions, or epistasis (Manolio et al., 2009). While each genetic variant is associated with a small effect, it can interact with other variants to play an important role in diseases. This has led to many multi-locus interaction studies aiming to understand the joint effects of multiple loci on complex diseases (Cordell, 2009; Gusareva et al., 2014).

Due to the large number of genetic markers in association studies, inference on gene-gene interactions is computationally challenging for most classical statistical methods. Neural networks, on the other hand, can be used to model complex relations between traits and genetic markers without having to enumerate all the possible interactions between genetic markers. Researchers have used neural networks in genetic data analyses, but the results are inconsistent (Lucek and Ott, 1997; Lucek et al., 1998; Saccone et al., 1999; Curtis et al., 2001; Marinov and Weeks, 2001; North et al., 2003). One possible explanation is the existence of multiple local minima in the optimization and the selection of suboptimal neural network structures. Machine learning approaches, such as genetic programming neural networks (GPNN) (Motsinger et al., 2006) and grammatical evolution neural networks (GENN) (Motsinger-Reif et al., 2008), have been developed to address these issues by choosing the best neural network architecture based on a given data set. Motsinger et al. (2006) demonstrated that GENN has higher power than the classical neural network using back-propagation. Furthermore, Motsinger-Reif et al. (2008) showed that the performance of GENN is better than that of GPNN when there exist high-order gene-gene interactions. Besides classical neural networks, Bayesian neural networks (Beam et al., 2014) have also been used to detect gene-gene interactions. Studies showed that Bayesian neural networks are more powerful than other popularly used methods, such as the χ2 test and Bayesian Epistasis Association Mapping (Zhang and Liu, 2007). Recently, Uppu et al. (2016a) and Uppu et al. (2016b) have applied deep neural networks to
detect gene-gene interactions in association studies and achieved promising results.

Deep learning-based survival prediction with gene expression profiles has recently emerged as a new research area. Primary works include SurvivalNet (Yousefi et al., 2017), Cox-nnet (Ching et al., 2018) and SALMON (Huang et al., 2019). These methods all adopt feedforward neural networks with an output of the hazard ratio in the Cox proportional hazards model, and they use the negative log partial likelihood as the cost function for network training. The three methods differ in network design, regularization, and pre-training of gene expression data. Cox-nnet and SALMON are both single-hidden-layer neural networks, whereas SurvivalNet uses a Bayesian optimization technique to determine the network design. SurvivalNet and Cox-nnet adopt dropout (Srivastava et al., 2014) to prevent the neural networks from overfitting, whereas SALMON applies the LASSO penalty (Tibshirani, 1996) to the network weights in the cost function. Rather than using raw gene expression values as network inputs, as SurvivalNet and Cox-nnet do, SALMON performs a gene co-expression module analysis and uses the resulting eigengene matrices of gene co-expression modules as the inputs, which greatly reduces the number of parameters in the neural network. Through an empirical study based on real data, Huang et al. (2019) found that Cox-nnet and SALMON had comparable discriminative abilities and outperformed the elastic-net Cox regression (Simon et al., 2011) and the random survival forest (Ishwaran et al., 2008). No prediction performance comparison has been made between these two methods and SurvivalNet, though. Surprisingly, all these survival learning machines just output a prognostic index, namely the hazard ratio relative to the baseline, for each subject, not a predicted survival curve, although this drawback can be easily fixed by using Breslow's estimator (Breslow, 1974) to generate a baseline hazard function. Also, it is worthwhile to extend deep neural networks to other popular survival models, such as the accelerated failure time model.
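The negative log partial likelihood used by these methods can be written directly as a training loss. The sketch below is an illustrative PyTorch version (ties and Breslow adjustments omitted), not code from any of the three packages:

import torch

def cox_neg_log_partial_likelihood(risk_score, time, event):
    # risk_score: network output (log hazard ratio) per subject, shape (n,)
    # time: observed times; event: 1 = event observed, 0 = censored
    order = torch.argsort(time, descending=True)      # later times first
    risk, evt = risk_score[order], event[order]
    log_risk_set = torch.logcumsumexp(risk, dim=0)    # log-sum over each subject's risk set
    return -torch.sum((risk - log_risk_set) * evt) / evt.sum().clamp(min=1)

scores = torch.randn(8, requires_grad=True)           # toy prognostic indices
loss = cox_neg_log_partial_likelihood(scores, torch.rand(8),
                                      torch.randint(0, 2, (8,)).float())
loss.backward()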
Bellot et al. (2018) provide a comprehensive comparison of the predictive accuracy between deep learning methods (e.g., deep neural networks and CNNs) and classical methods (e.g., linear regression and Bayesian ridge regression). They applied both types of methods to the UK Biobank (www.ukbiobank.ac.uk) with 80,000 training samples and 20,221 testing samples. Using genotype data, they used the different methods to predict five phenotypes: human height, bone heel mineral density (BHMD), body mass index (BMI), systolic blood pressure (SBP), and waist-hip ratio (WHR). They found that the performance of deep learning methods relies on the network architecture. Depending on the trait, deep learning and classical methods may perform differently. For example, for human height, a highly polygenic trait with a predominantly additive genetic basis, there is not much performance difference among all methods. One reason is that, in such a scenario, linear models work well. Through this empirical study, they also demonstrated that CNNs have comparable and slightly better performance than linear methods, except for human height. Since CNNs can capture the spatial correlation of SNPs due to linkage disequilibrium, they suggest that future research is needed on studying the performance of CNNs in genetic prediction. For convenience, we summarize the methods discussed in this section in Table 1.

4 APPLICATIONS OF UNSUPERVISED DEEP LEARNING TO GENOMIC STUDIES

Besides the use of supervised deep learning methods in genomic studies, there are also many applications of unsupervised deep learning methods in genomics. For instance, Scholz et al. (2005) used autoencoders to estimate missing values for metabolite data and gene expression data. Their results showed that autoencoders can better estimate missing values for nonlinearly structured data as compared with linear methods. Similarly, Tan et al. (2016), Tan et al. (2017a) and Tan et al. (2017b) proposed a method called ADAGE, which uses autoencoders to build gene expression signatures consistent with biological pathways. Through analysis of KEGG pathways, ADAGE and the popular gene set enrichment analysis (GSEA) (Subramanian et al., 2005) both detected five pathways. Moreover, ADAGE detected nine pathways that were not significantly enriched in GSEA.

There are also some applications of generative models in genomics. A deep variational autoencoder for single-cell RNA sequencing data (VASC) (Wang and Gu, 2018) was developed to model dropout events and to find nonlinear hierarchical feature representations of the data. By comparing the results on 20 datasets with different numbers of cells included and sequencing protocols used, VASC outperformed other dimension reduction methods such as PCA, t-SNE, ZIFA (Pierson and Yau, 2015) and SIMLR (Wang and Gu, 2018). DeepSequence (Riesselman et al., 2018) used a VAE to predict mutation effects, and the results are significantly better than those of existing methods.

The first application of GAN to genomic studies is due to Ghahramani et al. (2018). They applied GAN to simulate single-cell RNA-seq data. Not only can they provide biologically meaningful interpretation of their model parameters, but the effect of cell state perturbation can be predicted as well. Recently, Gupta and Zou (2019) proposed a generative model known as the feedback GAN (FBGAN) to produce synthetic gene sequences with desired properties. In FBGAN, a function analyzer is used to produce a score for the synthetic gene sequences generated from the generator in GAN, and the real data are gradually replaced by the synthetic gene sequences with the highest scores from the function analyzer. FBGAN has been applied to generate genes coding for antimicrobial peptides, as well as to optimize synthetic genes for the secondary structure of the resulting peptides. The results demonstrated that proteins generated from FBGAN have good biophysical properties. Despite its good properties, applying GAN architectures to produce long and complex sequences is still challenging and worth further investigation. DBNs have also been used in genomic studies. For example, Ghasemi et al. (2018) proposed using DBNs to initialize parameters in a deep neural network for Quantitative Structure Activity Relationship (QSAR) studies. The results of their study showed that prediction performance was improved by using DBNs.
TABLE 1 | Summary of reviewed supervised learning methods for genetic and genomic studies.

DANN (DNN). https://cbcl.ics.uci.edu/public_data/DANN/. "DANN: a deep learning approach for annotating the pathogenicity of genetic variants." Strengths: can be applied to non-coding human variation; can capture nonlinear relationships among features; supports large numbers of samples and features. Limitations: only has better performance compared to shallow-structure models such as CADD and LR; no performance advantage over CADD and LR on a coding-biased dataset.

D-GEX (DNN; Python). https://github.com/uci-cbcl/D-GEX. "Gene expression inference with deep learning." Strengths: can capture intrinsic nonlinear features; outperforms linear regression and KNN; can handle a large number of target genes; performs well even on datasets obtained from different platforms. Limitations: requires GPUs with large memory or multi-GPU techniques to jointly train all target genes when many target genes are included in the model; hyperparameter tuning is necessary to avoid overfitting.

DeepBind (CNN; C). https://github.com/kundajelab/deepbind. "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning." Strengths: the first method to accurately represent and visualize protein target binding motifs; can discover new patterns in sequences without knowing their locations; the model is trained fully automatically without hand-tuning. Limitations: does not capture long-distance dependency or detect structured binding motifs; computationally heavy, and a GPU is required due to the automatic tuning process.

DeepSEA (CNN; online). https://hb.flatironinstitute.org/deepsea. "Predicting effects of noncoding variants with deep learning-based sequence model." Strengths: the first method to identify functional effects of noncoding variants from genomic sequences alone with single-nucleotide sensitivity; the first method to accurately predict transcription factor (TF) binding, DNA accessibility and histone marks from genomic sequences. Limitations: can only handle sequence contexts up to 1 kbp, and the predictive ability of the model could be affected by the lack of flexibility in reading heterogeneous input; improved techniques have since been developed for motif discovery and for assigning importance scores to nucleotides (DeepLIFT).

Basset (CNN; Python). https://github.com/davek44/Basset. "Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks." Strengths: the same units (traits) are shared in fully connected layers across different cells, so the model can easily be applied to new data sets; with its computational efficiency, a single-task new data set can be trained on common computer hardware (CPU) in a few hours. Limitations: the model cannot be directly extended to predict functional activity other than DNase-seq peaks; to predict other functional activity, model training is still required for tuning and full multi-task computation.

DeepNano (RNN; Python). https://github.com/jeammimi/deepnano. "DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads." Strengths: model training and base calling of a read are much faster than Nanonet; DeepNano obtains the computational advantage by introducing a smaller output layer and GRU instead of LSTM, at the price of somewhat worse accuracy. Limitations: only about a 2% improvement in 2D base calling error rate compared to the traditional HMM (hidden Markov model), because the performance of DeepNano is sensitive to falsely split or missing input sequences in 2D base calling tasks.

ProLanGO (RNN). "ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network." Strengths: converts the protein function prediction problem into a language translation problem, making it feasible to apply novel techniques and the latest NMT architectures in the ProLanGO method. Limitations: does not outperform the DeepGO, FANN-GO and PANNZER models; with a three-layer RNN structure in the encoding part, capturing long-term dependencies is challenging for long protein sequences, that is, the relationship between the beginning of the protein sequence and the function prediction in the decoding part is too weak to obtain good training performance.

DanQ (CNN, RNN; Python). https://github.com/uci-cbcl/DanQ. "DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences." Strengths: the first application of a hybrid convolutional and recurrent network predicting function de novo from DNA sequence; with a simple structure of only one convolutional layer and a BLSTM layer, it still captures long-term dependencies better than a pure CNN model; outperforms the DeepSEA model (a pure CNN) while using the same data set, a comparable architecture and fewer free weights. Limitations: the current structure only works for input sizes of 1 kbp; an arbitrary input sequence length and additional BLSTM layers could be extensions of DanQ, so that the model could flexibly incorporate contextual information on both sides of the target bin.

GENN (DNN, GE). http://grammatical-evolution.org/software.html. "Comparison of approaches for machine-learning optimization of neural networks for detecting gene-gene interactions in genetic epidemiology." Strengths: can optimize the inputs, architecture and weights of a neural network and detect disease-risk loci in high-order epistatic models; capable of genome-wide studies using parallel computing. Limitations: lacks comparison with other gene association methods; heavy computational burden even when running on multiple processors.

SurvivalNet (DNN; Python). https://github.com/PathologyDataScience/SurvivalNet. "Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models." Strengths: flexible architecture design and easy to apply; hyperparameters are automatically tuned without technical expertise. Limitations: does not outperform the Cox elastic net in predicting survival using lower-dimensional features; only the dropout regularization technique is applied to reduce overfitting, resulting in a much longer training time.

Cox-nnet (DNN; Python). https://github.com/lanagarmire/cox-nnet. "Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data." Strengths: compared with Cox-PH, more significantly enriched pathways are identified using GSEA under the same significance threshold. Limitations: the architecture is simple, with only one or two hidden layers applied in the model; does not outperform Cox-PH on some of the TCGA data sets.

SALMON (DNN; Python). https://github.com/huangzhii/SALMON/. "SALMON: Survival Analysis Learning with Multi-Omics Neural Networks on breast cancer." Strengths: achieves performance similar to Cox-nnet while maintaining a simple architecture of only 500 weights and far fewer inputs; can include different combinations of multi-omics data as input sources. Limitations: eigengene matrices of gene co-expression modules must be obtained before implementing the neural network.
In terms of using unsupervised learning methods in genetic studies, Montañez et al. (2018) used both stacked autoencoders (SAE) and a deep neural network to classify extremely obese and non-obese individuals. In an SAE, the output from a single-layer autoencoder is used to train a second autoencoder, and the process is repeated multiple times. The output of the final autoencoder is used to pre-train the weights in a deep neural network. Based on a study that selected features from a set of 2,465 SNPs (p-values < 1e-2) and used the extracted features to classify obese samples from normal control samples through a deep neural network, it was found that, although the classification accuracy on the validation and testing sets deteriorated when 50 features were extracted, the AUC was still over 85% and relatively low overfitting occurred in the study.

Directly applying a DBM is not a good option, as in genetic studies the number of SNPs often exceeds the number of individuals. To overcome this issue, Hess et al. (2017) first estimated the relation between SNPs using stagewise regression, where each SNP is regressed on all the other SNPs, and then applied a DBM to the small clusters of correlated SNPs. Such a method is called a partitioned DBM. The results demonstrated that the partitioned DBM can identify almost twice the number of significant SNPs compared with univariate testing, while the type I error can also be controlled.

Recently, Yelmen et al. (2021) demonstrated that GANs and RBMs can be used to generate high-quality artificial genomes; the outcomes are promising, and the generated artificial genomes can inherit genotype-phenotype associations. GWAS usually requires a huge number of samples, and most research data are not publicly available due to privacy issues; the success of generating high-quality artificial genomes therefore provides a great substitute for those private databases.

The advances in spatially resolved transcriptomics (SRT) have enabled gene expression profiling with spatial location information in tissues (Asp et al., 2020). One important step before further analysis in SRT studies is to cluster the spots, and this is accomplished in many recent studies with the help of deep learning.
TABLE 2 | Summary of reviewed unsupervised learning methods for genetic and genomic studies.

ADAGE (denoising autoencoder; Python). https://github.com/greenelab/adage. "ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions." Strengths: can integrate diverse gene expression data; can reveal biologically meaningful signals within datasets. Limitations: not robust, as different ADAGE models can perform equally well.

OUTRIDER (autoencoder; R). https://bioconductor.org/packages/OUTRIDER. "OUTRIDER: a statistical method for detecting aberrantly expressed genes in RNA sequencing data." Strengths: can compute p-values that can be adjusted to control the FDR; model parameters are automatically fitted through optimization. Limitations: cannot include known confounding covariates in the model.

VASC (variational autoencoder; Python). https://github.com/wang-research/VASC. "VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder." Strengths: can re-establish cell dynamics in the reduced subspace, and associated marker genes can be identified. Limitations: computationally intensive.

DeepSequence (variational autoencoder; Python). https://github.com/debbiemarkslab/DeepSequence. "Deep generative models of genetic variation capture the effects of mutations." Strengths: can capture higher-order correlations in biological sequence families. Limitations: no sensitivity analysis on the choices of prior distributions is provided.

scRNAseq-WGAN-GP (generative adversarial networks; Python and R). https://github.com/luslab/arshamg-scrnaseq-wgan. "Generative adversarial networks simulate gene expression and predict perturbations in single cells." Strengths: can interpret internal parameters in a biologically meaningful way. Limitations: the method is only applied to scRNA-seq data on skin.

FBGAN (generative adversarial networks; Python). https://github.com/av1659/fbgan. "Feedback GAN for DNA optimizes protein functions." Strengths: robust to the type of analyzer used, and the analyzer does not need to be differentiable. Limitations: not applicable to producing long and complex sequences, such as whole proteins.

DBM (deep Boltzmann machines; Julia). https://github.com/binderh/BoltzmannMachines.jl. "Partitioned learning of deep Boltzmann machines for SNP data." Strengths: can handle high dimensionality in SNP data using partitioned learning. Limitations: performance can be affected by the choices of model parameters.

AG (generative adversarial networks and restricted Boltzmann machines; Python). https://gitlab.inria.fr/ml_genetics/public/artificial_genomes. "Creating artificial human genomes using generative neural networks." Strengths: can replicate characteristics of the source data such as allele frequency, LD and population structure. Limitations: can only be used to create a dense chunk of genomes rather than a whole-genome sequence.
For example, one of the workflows of SpaCell (Tan et al., 2019) is for cell-type clustering using autoencoders, by integrating pixel-intensity values with gene expression measurements from spots in a tissue. StLearn (Pham et al., 2020) uses a transfer learning deep neural network to extract features from pixel image tiles created from the hematoxylin and eosin-stained microscopy image containing tissue morphology information. A graph convolutional network is applied in SpaGCN (Li et al., 2020) to aggregate gene expression information from neighboring spots, and spatially variable genes are then detected based on the aggregated gene expression. We refer interested readers to Hu et al. (2021) for a review on statistical and machine learning methods for SRT with histology.

There are various other applications of deep learning in genomic studies. We refer interested readers to review papers on this topic (Angermueller et al., 2016; Jones et al., 2017; Min et al., 2017; Ching et al., 2018; Wainberg et al., 2018; Yue and Wang, 2018; Zou et al., 2018; Eraslan et al., 2019). Table 2 provides a brief summary of the reviewed recent unsupervised learning methods having applications to genomic studies.

5 APPLICATIONS OF SEMI-SUPERVISED LEARNING FOR SINGLE-CELL RNA-SEQ DATA

Fast-emerging technology has made it possible to collect global transcriptome profiles at the single-cell level. Through accurate identification of cell types, the formation of complex organs and various cancers could be better understood (Kim et al., 2019). However, using single-cell RNA-seq data to accurately identify cell types remains a challenging task (Stegle et al., 2015). Recently, semi-supervised learning has become a technique popularly used for single-cell RNA-seq data analysis. Table 3 provides a list of the semi-supervised learning methods for single-cell RNA-seq data discussed in this section.

Before fully diving into the semi-supervised learning methods used for cell type annotation, it is worthwhile to mention single-cell Variational Inference (scVI) (Lopez et al., 2018) and its extension to single-cell ANnotation using Variational Inference (scANVI) (Xu et al., 2021). Both methods use variational inference and deep generative models to fully characterize the distribution of single-cell RNA-seq data.
TABLE 3 | Summary of reviewed semi-supervised learning methods for single-cell RNA-seq data.

scANVI (Python). https://github.com/scverse/scvi-tools. "Harmonization and annotation of single-cell transcriptomics data with deep generative models." Strengths: achieves high accuracy when transferring labels from one dataset to another. Limitations: the assumption that the low-dimensional latent space follows a Gaussian mixture model limits the representation ability; clustering performance is not robust due to the Gaussian mixture model assumption.

scReClassify (R). https://github.com/SydneyBioX/scReClassify. "scReClassify: post hoc cell type classification of single-cell RNA-seq data." Strengths: when the initial mislabeling rate is small (<30%), scReClassify has nearly perfect performance in reclassifying mislabeled cell types. Limitations: performance depends on the initial mislabeling rate and on the learning method used as the base classifier; does not take into consideration the sizes of different cell types in a single-cell RNA-seq dataset; does not account for non-randomness in cell type mislabeling caused by the relatedness of cell types located in ambiguous regions.

SCINA (R/Web Server). https://cran.r-project.org/web/packages/SCINA; https://github.com/jcao89757/SCINA; https://lce.biohpc.swmed.edu/scina/. "SCINA: a semi-supervised subtyping algorithm of single cell and bulk sample." Strengths: the first semi-supervised "signature-to-category" cell type classification algorithm for single-cell profiling data. Limitations: only takes signature genes into account; performance is influenced by the size of the data, the total number of cell types in the data and the number of signature genes for every cell type.

scSemiCluster (Python). https://github.com/xuebaliang/scSemiCluster. "Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaption." Strengths: computationally faster than scANVI; false alignment between outlier reference categories and target data does not affect the performance of scSemiCluster too much. Limitations: performance depends on whether the cell types to be annotated appear in the reference dataset or not.

CALLR (R). https://github.com/MathSZhang/CALLR. "CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data." Strengths: stable performance under changes in parameters and in the labeled subset. Limitations: cannot determine the number of cell types automatically; only a Gaussian kernel is used to create the adjacency matrix.

scNym (Python). https://github.com/calico/scnym. "Semi-supervised adversarial neural networks for single-cell classification." Strengths: can learn biologically interpretable features of cell types; can synthesize information from multiple data sources to improve accuracy; robust to hyperparameter selection. Limitations: does not implement a multi-task domain adversary to handle multiple independent variables.
scANVI can also be used to annotate cell types and has been used as a baseline method for recently proposed methods, such as scSemiCluster (Chen et al., 2021) and scNym (Kimmel and Kelley, 2021).

scReClassify, proposed by Kim et al. (2019), uses PCA to perform dimension reduction of the original single-cell RNA-seq data and then applies a semi-supervised learning method to reclassify mislabeled cell types caused by human inspection. When the initial mislabeling rate is small, scReClassify can reclassify those mislabeled cell types to the correct ones. However, there is no gain in performance when the initial mislabeling rate is high. Moreover, scReClassify does not consider the sizes of different cell types in a single-cell RNA-seq dataset, nor the non-randomness in cell type mislabeling due to the relatedness of cell types located in ambiguous regions.
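For readers who want a concrete picture of this recipe, the minimal sketch below combines PCA with a generic self-training wrapper from scikit-learn. It is not the scReClassify implementation; the number of components, the random forest base classifier, and the use of -1 to mark unlabeled cells are illustrative assumptions.

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

def reclassify(expr, labels, mask_uncertain, n_components=20):
    """expr: cells x genes NumPy array; labels: integer cell-type labels;
    mask_uncertain: boolean array marking cells whose labels are distrusted."""
    z = PCA(n_components=n_components).fit_transform(expr)   # low-dimensional embedding of the cells
    y = labels.copy()
    y[mask_uncertain] = -1                                    # -1 marks a cell as unlabeled for self-training
    model = SelfTrainingClassifier(RandomForestClassifier(n_estimators=200))
    model.fit(z, y)
    return model.predict(z)                                   # updated label for every cell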
Zhang et al. (2019) developed the SCINA algorithm for cell type classification in single-cell RNA-seq data. The prior knowledge of signature genes is taken into account in the unsupervised estimation process. SCINA is the first semi-supervised "signature-to-category" cell type classification algorithm for single-cell RNA-seq data. Nonetheless, it only takes signature genes into consideration, and its performance depends on the size of the data, the total number of cell types in the data, and the number of signature genes for every cell type.
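As a toy illustration of the "signature-to-category" idea only (not SCINA's EM-based probabilistic model), one can score each cell against each cell type's signature genes and assign the best-scoring type; the gene names and signatures below are hypothetical.

import pandas as pd

def assign_by_signatures(expr, signatures):
    """expr: cells x genes pandas DataFrame; signatures: {cell_type: [marker genes]}."""
    z = (expr - expr.mean()) / (expr.std() + 1e-8)            # z-score each gene across cells
    scores = pd.DataFrame({
        cell_type: z[[g for g in genes if g in z.columns]].mean(axis=1)
        for cell_type, genes in signatures.items()
    })
    return scores.idxmax(axis=1)                              # best-matching cell type per cell

signatures = {"T cell": ["CD3D", "CD3E"], "B cell": ["CD79A", "MS4A1"]}   # hypothetical signatures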
scSemiCluster (Chen et al., 2021) and CALLR (Wei and Zhang, 2021) are two new cell type annotation methods based on semi-supervised learning. scSemiCluster is computationally faster than scANVI, and its performance is less affected by false alignment between outlier reference categories and target data. Nevertheless, since the reference dataset is used in scSemiCluster to make predictions, its performance depends on whether the cell types to be annotated are contained in the reference dataset. CALLR is an optimization-based method that combines a graph Laplacian matrix constructed from all the cells with sparse logistic regression. While CALLR is robust to changes in parameters and the labeled subset, it cannot determine the number of cell types automatically.
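A minimal sketch of the two ingredients named above, assuming a precomputed low-dimensional embedding of the cells, is given below: a Gaussian-kernel adjacency matrix with its graph Laplacian, plus an L1-penalized logistic regression fitted on the labeled cells. It is not the CALLR optimization itself, and the kernel width and regularization strength are illustrative.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import LogisticRegression

def graph_laplacian(embedding, gamma=1.0):
    """embedding: cells x features (e.g., PCA of expression). Returns L = D - W."""
    w = rbf_kernel(embedding, gamma=gamma)   # Gaussian-kernel cell-cell adjacency
    np.fill_diagonal(w, 0.0)                 # remove self-loops
    d = np.diag(w.sum(axis=1))               # degree matrix
    return d - w

def sparse_logistic_labels(embedding, labels, labeled_idx):
    """Fit an L1-penalized logistic regression on the labeled cells only."""
    clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
    clf.fit(embedding[labeled_idx], labels[labeled_idx])
    return clf.predict(embedding)            # provisional labels for all cells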
scNym (Kimmel and Kelley, 2021) is one of the newest semi-supervised methods for analyzing single-cell RNA-seq data. Instead of relying on a reference dataset to annotate cell types, scNym uses an adversarial network to improve classification performance. Moreover, scNym is robust to hyperparameter selection and can further improve accuracy by learning biologically interpretable features and synthesizing information from multiple data sources. However, the current method does not implement a multi-task domain adversary, which makes it less useful when there are multiple independent variables.
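To make the adversarial idea concrete, the sketch below shows a generic domain-adversarial classifier in PyTorch, not the scNym implementation: a shared encoder feeds a cell-type classifier and a domain discriminator, and a gradient-reversal layer pushes the encoder toward domain-invariant features. All layer sizes are illustrative.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainAdversarialNet(nn.Module):
    def __init__(self, n_genes, n_types, n_domains, hidden=256, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.type_head = nn.Linear(hidden, n_types)        # predicts the cell type
        self.domain_head = nn.Linear(hidden, n_domains)    # predicts the data source

    def forward(self, x):
        z = self.encoder(x)
        type_logits = self.type_head(z)
        domain_logits = self.domain_head(GradReverse.apply(z, self.lam))
        return type_logits, domain_logits

# Training combines a cell-type loss on labeled cells with a domain loss on all cells, e.g.,
# loss = cross_entropy(type_logits[labeled], y[labeled]) + cross_entropy(domain_logits, domain_id)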
6 CONCLUSION AND PERSPECTIVES

With the rapid progress in graphical processing unit (GPU) technology, complex deep learning algorithms can be trained in a rather short time, which has led to the wide use of deep learning in many areas. One advantage of deep learning methods is the convenient and easy access to deep learning platforms such as Keras (https://keras.io/), TensorFlow (https://www.tensorflow.org/) and PyTorch (https://pytorch.org/). Benefiting from these well-developed platforms, researchers can implement deep learning algorithms without knowing the mathematical details behind them, which makes it feasible for researchers to focus more on applying deep learning to their own research fields.
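As an illustration of how little code such platforms require, a small feed-forward classifier can be specified in a few lines with the Keras API; the input dimension and layer sizes below are hypothetical.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5000,)),                       # e.g., 5,000 variants per sample (hypothetical)
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # case/control probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(genotypes, phenotypes, epochs=20, validation_split=0.2)   # hypothetical arrays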
In this paper, we have reviewed some important deep learning developments in genomic studies. Despite the great improvement in prediction performance compared to classical statistical methods (LeCun et al., 2015), there are still many challenging issues in this research field. One challenge of deep learning is the lack of interpretability. In genetic association studies, identifying and interpreting disease-associated genetic markers is of major interest. Nevertheless, deep learning has been considered a black box, which hinders its application in genetic association studies. Shen et al. (2019) and Horel and Giesecke (2020) have developed theories to address this issue, but the applicability of these theories to real data remains a challenging task. To make the results from deep learning interpretable, DeepLIFT (Shrikumar et al., 2017) assigns importance scores to the input for a given response to determine the crucial features. Sundararajan et al. (2017) considered sensitivity and implementation invariance as two fundamental axioms and proposed an integrated gradients method for attributing the prediction of a deep network to its inputs.
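A minimal sketch of the integrated gradients computation is given below, assuming a differentiable PyTorch model; the all-zero baseline, the number of interpolation steps, and the target output index are placeholders.

import torch

def integrated_gradients(model, x, baseline=None, steps=50, target=0):
    """Approximate integrated gradients of one model output with respect to input x."""
    if baseline is None:
        baseline = torch.zeros_like(x)                   # an all-zero reference input
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)        # point on the straight-line path
        point.requires_grad_(True)
        output = model(point.unsqueeze(0))[0, target]    # scalar prediction to attribute
        grad, = torch.autograd.grad(output, point)
        total_grad += grad
    return (x - baseline) * total_grad / steps           # per-feature attribution scores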
On the other hand, the root mean square error (RMSE) and the correlation between predictions and the original data are often used as measurements to compare the performance of different methods. However, such measurements may become obsolete due to the discovery of the double descent phenomenon (Belkin et al., 2019) for deep neural networks. As long as the network has been trained for a sufficiently long period, the training error will keep decreasing to zero, while the testing error will increase first and then decrease again to reach an even smaller testing error. The double descent phenomenon suggests that a deep neural network has the potential to achieve an RMSE close to zero and a correlation close to one if it has been trained for a sufficiently long period. Therefore, new measurements for comparing the performance of different methods need to be proposed in the future.
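For completeness, the two measurements referred to above can be computed directly; y_true and y_pred below stand for observed and predicted values.

import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def pearson_correlation(y_true, y_pred):
    return float(np.corrcoef(np.asarray(y_true), np.asarray(y_pred))[0, 1])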
Besides research on interpreting deep learning models, transfer learning (Pan and Yang, 2009) is another promising research area. Generalizing knowledge learned in one setting (e.g., variants discovered from a Caucasian population) to another setting (e.g., other minority populations) is the major goal of transfer learning. Given the knowledge gained from animal studies, transfer learning can also be used to generalize findings from animal studies to human studies. In addition, natural language processing methods such as BERT (Devlin et al., 2019) showed that by adding only a few more layers to a pre-trained network and fine-tuning the parameters, better prediction performance can be achieved.
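A hedged sketch of this fine-tuning recipe, assuming a previously saved Keras model (the file name and new layer sizes are hypothetical), is given below: the pre-trained layers are frozen, a small task-specific head is added, and only the new layers are retrained.

import tensorflow as tf

base = tf.keras.models.load_model("pretrained_model.keras")   # hypothetical pre-trained network
base.trainable = False                                         # freeze the learned parameters

fine_tuned = tf.keras.Sequential([
    base,                                                      # reused feature extractor
    tf.keras.layers.Dense(32, activation="relu"),              # a few new layers on top
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
fine_tuned.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
# fine_tuned.fit(new_task_inputs, new_task_labels, epochs=10)  # hypothetical arrays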
Given the easy implementation of deep learning algorithms and the flexibility of deep learning models, we believe that deep learning will play an important role in future genomic and genetic research.

AUTHOR CONTRIBUTIONS

XS, CJ, and QL contributed to conception and design of the study. XS and CJ wrote the first draft of the manuscript. YW, CL, and QL wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

FUNDING

This work is supported by NIH 1R01DA043501-01 and NIH 1R01LM012848-01.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fsysb.2022.877717/full#supplementary-material
REFERENCES

Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015). Predicting the Sequence Specificities of DNA- and RNA-Binding Proteins by Deep Learning. Nat. Biotechnol. 33, 831–838. doi:10.1038/nbt.3300
Angermueller, C., Pärnamaa, T., Parts, L., and Stegle, O. (2016). Deep Learning for Computational Biology. Mol. Syst. Biol. 12, 878. doi:10.15252/msb.20156651
Asp, M., Bergenstråhle, J., and Lundeberg, J. (2020). Spatially Resolved Transcriptomes-Next Generation Tools for Tissue Exploration. BioEssays 42 (10), 1900221. doi:10.1002/bies.201900221
Beam, A. L., Motsinger-Reif, A., and Doyle, J. (2014). Bayesian Neural Networks for Detecting Epistasis in Genetic Association Studies. BMC Bioinforma. 15, 368. doi:10.1186/s12859-014-0368-0
Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. Proc. Natl. Acad. Sci. U.S.A. 116 (32), 15849–15854. doi:10.1073/pnas.1903070116
Bellot, P., de los Campos, G., and Pérez-Enciso, M. (2018). Can Deep Learning Improve Genomic Prediction of Complex Human Traits? Genetics 210, 809–819. doi:10.1534/genetics.118.301298
Bourlard, H., and Kamp, Y. (1988). Auto-association by Multilayer Perceptrons and Singular Value Decomposition. Biol. Cybern. 59, 291–294. doi:10.1007/bf00332918
Boža, V., Brejová, B., and Vinař, T. (2017). DeepNano: Deep Recurrent Neural Networks for Base Calling in MinION Nanopore Reads. PLoS One 12, e0178751. doi:10.1371/journal.pone.0178751
Brechtmann, F., Mertes, C., Matusevičiūtė, A., Yépez, V. A., Avsec, Ž., Herzog, M., et al. (2018). OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data. Am. J. Hum. Genet. 103, 907–917. doi:10.1016/j.ajhg.2018.10.025
Breslow, N. (1974). Covariance Analysis of Censored Survival Data. Biometrics, 89–99. doi:10.2307/2529620
Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. (2017). ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 22, 1732. doi:10.3390/molecules22101732
Chapelle, O., Chi, M., and Zien, A. (2006). Semi-supervised Learning. 1st ed. Cambridge: The MIT Press.
Chen, L., He, Q., Zhai, Y., and Deng, M. (2021). Single-cell RNA-Seq Data Semi-supervised Clustering and Annotation via Structural Regularized Domain Adaptation. Bioinformatics 37 (6), 775–784. doi:10.1093/bioinformatics/btaa908
Chen, Y., Li, Y., Narayan, R., Subramanian, A., and Xie, X. (2016). Gene Expression Inference with Deep Learning. Bioinformatics 32, 1832–1839. doi:10.1093/bioinformatics/btw074
Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., et al. (2018). Opportunities and Obstacles for Deep Learning in Biology and Medicine. J. R. Soc. Interface 15, 20170387. doi:10.1098/rsif.2017.0387
Cho, K., Van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., Bengio, Y., et al. (2014). "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation," in Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). ArXiv Prepr. ArXiv14061078. doi:10.3115/v1/d14-1179
Consortium, W. T. C. C. (2007). Genome-wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls. Nature 447, 661–678. doi:10.1038/nature05911
Cordell, H. J. (2009). Detecting Gene-Gene Interactions that Underlie Human Diseases. Nat. Rev. Genet. 10, 392–404. doi:10.1038/nrg2579
Curbelo Montañez, C. A., Fergus, P., Chalmers, C., and Hind, J. (2018). "Analysis of Extremely Obese Individuals Using Deep Learning Stacked Autoencoders and Genome-Wide Genetic Data," in International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (Cham: Springer), 262–272. ArXiv Prepr. ArXiv180406262.
Curtis, D., North, B. V., and Sham, P. C. (2001). Use of an Artificial Neural Network to Detect Association between a Disease and Multiple Marker Genotypes. Ann. Hum. Genet. 65, 95–107. doi:10.1046/j.1469-1809.2001.6510095.x
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of NAACL. ArXiv Prepr. ArXiv181004805.
Eraslan, G., Avsec, Ž., Gagneur, J., and Theis, F. J. (2019). Deep Learning: New Computational Modelling Techniques for Genomics. Nat. Rev. Genet. 20, 389–403. doi:10.1038/s41576-019-0122-6
Fahlman, S. E., Hinton, G. E., and Sejnowski, T. J. (1983). "Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines," in Proceedings of the National Conference on Artificial Intelligence AAAI-83 (AAAI).
Ghahramani, A., Watt, F. M., and Luscombe, N. M. (2018). Generative Adversarial Networks Simulate Gene Expression and Predict Perturbations in Single Cells. BioRxiv, 262501. doi:10.1101/262501
Ghasemi, F., Mehridehnavi, A., Fassihi, A., and Pérez-Sánchez, H. (2018). Deep Neural Network in QSAR Studies Using Deep Belief Network. Appl. Soft Comput. 62, 251–258. doi:10.1016/j.asoc.2017.09.040
Glorot, X., Bordes, A., and Bengio, Y. (2011). "Deep Sparse Rectifier Neural Networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). "Generative Adversarial Nets," in Advances in Neural Information Processing Systems. Editors D. A. Cohn, M. S. Kearns, and S. A. Solla (MIT Press), 2672–2680.
Gupta, A., and Zou, J. (2019). Feedback GAN for DNA Optimizes Protein Functions. Nat. Mach. Intell. 1, 105–111. doi:10.1038/s42256-019-0017-4
Gusareva, E. S., Carrasquillo, M. M., Bellenguez, C., Cuyvers, E., Colon, S., Graff-Radford, N. R., et al. (2014). Genome-wide Association Interaction Analysis for Alzheimer's Disease. Neurobiol. Aging 35, 2436–2443. doi:10.1016/j.neurobiolaging.2014.05.014
Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A Distribution-free Theory of Nonparametric Regression. Springer Science & Business Media.
Hess, M., Lenz, S., Blätte, T. J., Bullinger, L., and Binder, H. (2017). Partitioned Learning of Deep Boltzmann Machines for SNP Data. Bioinformatics 33, 3173–3180. doi:10.1093/bioinformatics/btx408
Hinton, G. (2009). Deep Belief Networks. Scholarpedia 4, 5947. doi:10.4249/scholarpedia.5947
Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984). Boltzmann Machines: Constraint Satisfaction Networks that Learn. Pittsburgh: Carnegie-Mellon University, Department of Computer Science.
Hinton, G. E., and Zemel, R. S. (1994). "Autoencoders, Minimum Description Length and Helmholtz Free Energy," in Advances in Neural Information Processing Systems, 3–10.
Horel, E., and Giesecke, K. (2020). Significance Tests for Neural Networks. J. Mach. Learn. Res. 21 (227), 1–29.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer Feedforward Networks Are Universal Approximators. Neural Netw. 2, 359–366. doi:10.1016/0893-6080(89)90020-8
Hu, J., Schroeder, A., Coleman, K., Chen, C., Auerbach, B. J., and Li, M. (2021). Statistical and Machine Learning Methods for Spatially Resolved Transcriptomics with Histology. Comput. Struct. Biotechnol. J. 19, 3829–3841. doi:10.1016/j.csbj.2021.06.052
Huang, Z., Zhan, X., Xiang, S., Johnson, T. S., Helm, B., Yu, C. Y., et al. (2019). SALMON: Survival Analysis Learning With Multi-Omics Neural Networks on Breast Cancer. Front. Genet. 10, 166.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008). Random Survival Forests. Ann. Appl. Stat. 2 (3), 841–860.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). "What Is the Best Multi-Stage Architecture for Object Recognition?," in 2009 IEEE 12th International Conference on Computer Vision (IEEE), 2146–2153.
Jones, W., Alasoo, K., Fishman, D., and Parts, L. (2017). Computational Biology: Deep Learning. Emerg. Top. Life Sci. 1, 257–274. doi:10.1042/etls20160025
Kelley, D. R., Snoek, J., and Rinn, J. L. (2016). Basset: Learning the Regulatory Code of the Accessible Genome with Deep Convolutional Neural Networks. Genome Res. 26, 990–999. doi:10.1101/gr.200535.115
Kim, T., Lo, K., Geddes, T. A., Kim, H. J., Yang, J. Y. H., and Yang, P. (2019). scReClassify: Post Hoc Cell Type Classification of Single-Cell RNA-Seq Data. BMC Genomics 20 (9), 1–10. doi:10.1186/s12864-019-6305-x
Kimmel, J. C., and Kelley, D. R. (2021). Semisupervised Adversarial Neural Networks for Single-Cell Classification. Genome Res. 31 (10), 1781–1793.
Kingma, D. P., and Welling, M. (2014). "Auto-Encoding Variational Bayes," in Proceedings of the International Conference on Learning Representations (ICLR). ArXiv Prepr. ArXiv13126114.
Kircher, M., Witten, D. M., Jain, P., O'Roak, B. J., Cooper, G. M., and Shendure, J. (2014). A General Framework for Estimating the Relative Pathogenicity of Human Genetic Variants. Nat. Genet. 46, 310–315. doi:10.1038/ng.2892
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep Learning. Nature 521, 436–444. doi:10.1038/nature14539
LeCun, Y. (1989). "Generalization and Network Design Strategies," in Connectionism in Perspective. Editors F. Fogelman-Soulié, L. Steels, R. Pfeifer, and Z. Schreter (Elsevier Science).
LeCun, Y. (1987). Modèles connexionnistes de l'apprentissage (Thèse de Doctorat). Université Paris 6.
Li, M., Hu, J., Li, X., Coleman, K., Schroeder, A., Irwin, D., et al. (2020). Integrating Gene Expression, Spatial Location and Histology to Identify Spatial Domains and Spatially Variable Genes by Graph Convolutional Network. Nat. Methods 18 (11), 1342–1351. doi:10.1038/s41592-021-01255-8
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., and Yosef, N. (2018). Deep Generative Modeling for Single-Cell Transcriptomics. Nat. Methods 15 (12), 1053–1058. doi:10.1038/s41592-018-0229-2
Lucek, P., Hanke, J., Reich, J., Solla, S. A., and Ott, J. (1998). Multi-locus Nonparametric Linkage Analysis of Complex Trait Loci with Neural Networks. Hum. Hered. 48, 275–284. doi:10.1159/000022816
Lucek, P. R., and Ott, J. (1997). Neural Network Analysis of Complex Traits. Genet. Epidemiol. 14, 1101–1106. doi:10.1002/(sici)1098-2272(1997)14:6<1101::aid-gepi90>3.0.co;2-k
Maher, B. (2008). Personal Genomes: The Case of the Missing Heritability. Nature 456, 18–21. doi:10.1038/456018a
Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., et al. (2009). Finding the Missing Heritability of Complex Diseases. Nature 461, 747–753. doi:10.1038/nature08494
Marinov, M., and Weeks, D. E. (2001). The Complexity of Linkage Analysis with Neural Networks. Hum. Hered. 51, 169–176. doi:10.1159/000053338
Mikheyev, A. S., and Tin, M. M. Y. (2014). A First Look at the Oxford Nanopore MinION Sequencer. Mol. Ecol. Resour. 14, 1097–1102. doi:10.1111/1755-0998.12324
Min, S., Lee, B., and Yoon, S. (2017). Deep Learning in Bioinformatics. Brief. Bioinform. 18, 851–869. doi:10.1093/bib/bbw068
Motsinger, A. A., Dudek, S. M., Hahn, L. W., and Ritchie, M. D. (2006). "Comparison of Neural Network Optimization Approaches for Studies of Human Genetics," in Workshops on Applications of Evolutionary Computation (Springer), 103–114. doi:10.1007/11732242_10
Motsinger-Reif, A. A., Dudek, S. M., Hahn, L. W., and Ritchie, M. D. (2008). Comparison of Approaches for Machine-Learning Optimization of Neural Networks for Detecting Gene-Gene Interactions in Genetic Epidemiology. Genet. Epidemiol. 32, 325–340. doi:10.1002/gepi.20307
Nair, V., and Hinton, G. E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines," in Proceedings of the 27th International Conference on Machine Learning (Omnipress, Madison: ICML-10), 807–814.
North, B. V., Curtis, D., Cassell, P. G., Hitman, G. A., and Sham, P. C. (2003). Assessing Optimal Neural Network Architecture for Identifying Disease-Associated Multi-Marker Genotypes Using a Permutation Test, and Application to Calpain 10 Polymorphisms Associated with Diabetes. Ann. Hum. Genet. 67, 348–356. doi:10.1046/j.1469-1809.2003.00030.x
Pan, S. J., and Yang, Q. (2009). A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359.
Park, Y., and Kellis, M. (2015). Deep Learning for Regulatory Genomics. Nat. Biotechnol. 33, 825–826. doi:10.1038/nbt.3313
Pennisi, E. (2011). Disease Risk Links to Gene Regulation. American Association for the Advancement of Science.
Pham, D., Tan, X., Xu, J., Grice, L. F., Lam, P. Y., Raghubar, A., and Nguyen, Q. (2020). stLearn: Integrating Spatial Location, Tissue Morphology and Gene Expression to Find Cell Types, Cell-Cell Interactions and Spatial Trajectories within Undissociated Tissues. BioRxiv. doi:10.1101/2020.05.31.125658
Pierson, E., and Yau, C. (2015). ZIFA: Dimensionality Reduction for Zero-Inflated Single-Cell Gene Expression Analysis. Genome Biol. 16 (1), 1–10.
Pouladi, F., Salehinejad, H., and Gilani, A. M. (2015). "Recurrent Neural Networks for Sequential Phenotype Prediction in Genomics," in 2015 International Conference on Developments of E-Systems Engineering (DeSE) (IEEE), 225–230. doi:10.1109/dese.2015.52
Quang, D., Chen, Y., and Xie, X. (2014). DANN: a Deep Learning Approach for Annotating the Pathogenicity of Genetic Variants. Bioinformatics 31, 761–763. doi:10.1093/bioinformatics/btu703
Quang, D., and Xie, X. (2016). DanQ: a Hybrid Convolutional and Recurrent Deep Neural Network for Quantifying the Function of DNA Sequences. Nucleic Acids Res. 44, e107. doi:10.1093/nar/gkw226
Riesselman, A. J., Ingraham, J. B., and Marks, D. S. (2018). Deep Generative Models of Genetic Variation Capture the Effects of Mutations. Nat. Methods 15, 816–822. doi:10.1038/s41592-018-0138-4
Rosenblatt, F. (1958). The Perceptron: a Probabilistic Model for Information Storage and Organization in the Brain. Psychol. Rev. 65, 386–408. doi:10.1037/h0042519
Rui Xu, R., Wunsch, D. C., II, and Frank, R. L. (2007). Inference of Genetic Regulatory Networks with Recurrent Neural Network Models Using Particle Swarm Optimization. IEEE/ACM Trans. Comput. Biol. Bioinf. 4, 681–692. doi:10.1109/tcbb.2007.1057
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning Representations by Back-Propagating Errors. Cogn. Model. 5, 1.
Saccone, N. L., Downey, T. J., Jr, Meyer, D. J., Neuman, R. J., and Rice, J. P. (1999). Mapping Genotype to Phenotype for Linkage Analysis. Genet. Epidemiol. 17, S703–S708. doi:10.1002/gepi.13701707115
Sak, H., Senior, A., and Beaufays, F. (2014). "Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling," in Fifteenth Annual Conference of the International Speech Communication Association. doi:10.21437/interspeech.2014-80
Salakhutdinov, R., and Hinton, G. (2009). "Deep Boltzmann Machines," in Artificial Intelligence and Statistics, 448–455.
Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., and Selbig, J. (2005). Non-linear PCA: a Missing Data Approach. Bioinformatics 21, 3887–3895. doi:10.1093/bioinformatics/bti634
Schuster, M., and Paliwal, K. K. (1997). Bidirectional Recurrent Neural Networks. IEEE Trans. Signal Process. 45, 2673–2681. doi:10.1109/78.650093
Scott, L. J., Mohlke, K. L., Bonnycastle, L. L., Willer, C. J., Li, Y., Duren, W. L., et al. (2007). A Genome-wide Association Study of Type 2 Diabetes in Finns Detects Multiple Susceptibility Variants. Science 316, 1341–1345. doi:10.1126/science.1142382
Shen, X., Jiang, C., Sakhanenko, L., and Lu, Q. (2019). "Asymptotic Properties of Neural Network Sieve Estimators." ArXiv Prepr. ArXiv190600875.
Shrikumar, A., Greenside, P., and Kundaje, A. (2017). "Learning Important Features through Propagating Activation Differences," in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR.org), 3145–3153.
Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent. J. Stat. Softw. 39 (5), 1.
Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., et al. (2007). A Genome-wide Association Study Identifies Novel Risk Loci for Type 2 Diabetes. Nature 445, 881–885. doi:10.1038/nature05616
Smolensky, P. (1986). "Information Processing in Dynamical Systems: Foundations of Harmony Theory," in Parallel Distributed Processing. Editors D. E. Rumelhart and J. L. McClelland (Cambridge: MIT Press), 1.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1929–1958.
Stegle, O., Teichmann, S. A., and Marioni, J. C. (2015). Computational and Analytical Challenges in Single-Cell Transcriptomics. Nat. Rev. Genet. 16 (3), 133–145.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles. Proc. Natl. Acad. Sci. 102 (43), 15545–15550.
Sundararajan, M., Taly, A., and Yan, Q. (2017). "Axiomatic Attribution for Deep Networks," in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR.org), 3319–3328.
Tan, J., Hammond, J. H., Hogan, D. A., and Greene, C. S. (2016). ADAGE-based Integration of Publicly Available Pseudomonas aeruginosa Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions. MSystems 1, e00025–15. doi:10.1128/mSystems.00025-15
Tan, J., Doing, G., Lewis, K. A., Price, C. E., Chen, K. M., Cady, K. C., et al. (2017a). Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks. Cell Syst. 5, 63–71. doi:10.1016/j.cels.2017.06.003
Tan, J., Huyck, M., Hu, D., Zelaya, R. A., Hogan, D. A., and Greene, C. S. (2017b). ADAGE Signature Analysis: Differential Expression Analysis with Data-Defined Gene Sets. BMC Bioinforma. 18, 512. doi:10.1186/s12859-017-1905-4
Tan, X., Su, A., Tran, M., and Nguyen, Q. (2019). SpaCell: Integrating Tissue Morphology and Spatial Gene Expression to Predict Disease Cells. Bioinformatics 36 (7), 2293–2294. doi:10.1093/bioinformatics/btz914
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. B: Methodol. 58 (1), 267–288.
Uppu, S., Krishna, A., and Gopalan, R. P. (2016a). "Towards Deep Learning in Genome-wide Association Interaction Studies," in PACIS, 20.
Uppu, S., Krishna, A., and Gopalan, R. P. (2016b). A Deep Learning Approach to Detect SNP Interactions. JSW 11, 965–975. doi:10.17706/jsw.11.10.965-975
Van Engelen, J. E., and Hoos, H. H. (2020). A Survey on Semi-supervised Learning. Mach. Learn. 109 (2), 373–440. doi:10.1007/s10994-019-05855-6
Vapnik, V. N. (1998). Statistical Learning Theory. 1st ed. New York: Wiley-Interscience.
Wager, S., Wang, S., and Liang, P. S. (2013). "Dropout Training as Adaptive Regularization," in Advances in Neural Information Processing Systems, 351–359.
Wainberg, M., Merico, D., Delong, A., and Frey, B. J. (2018). Deep Learning in Biomedicine. Nat. Biotechnol. 36, 829–838. doi:10.1038/nbt.4233
Wang, D., and Gu, J. (2018). VASC: Dimension Reduction and Visualization of Single-Cell RNA-Seq Data by Deep Variational Autoencoder. Genomics, Proteomics Bioinforma. 16, 320–331. doi:10.1016/j.gpb.2018.08.003
Wei, Z., and Zhang, S. (2021). CALLR: a Semi-supervised Cell-type Annotation Method for Single-Cell RNA Sequencing Data. Bioinformatics 37 (Suppl_1), i51–i58. doi:10.1093/bioinformatics/btab286
Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M. I., and Yosef, N. (2021). Probabilistic Harmonization and Annotation of Single-Cell Transcriptomics Data with Deep Generative Models. Mol. Syst. Biol. 17 (1), e9620. doi:10.15252/msb.20209620
Yelmen, B., Decelle, A., Ongaro, L., Marnetto, D., Tallec, C., Montinaro, F., et al. (2021). Creating Artificial Human Genomes Using Generative Neural Networks. PLoS Genet. 17 (2), e1009303.
Yousefi, S., Amrollahi, F., Amgad, M., Dong, C., Lewis, J. E., Song, C., et al. (2017). Predicting Clinical Outcomes From Large Scale Cancer Genomic Profiles With Deep Survival Models. Sci. Rep. 7 (1), 1–11.
Yue, T., and Wang, H. (2018). Deep Learning for Genomics: A Concise Overview. ArXiv Prepr. ArXiv180200810.
Zhang, Y., and Liu, J. S. (2007). Bayesian Inference of Epistatic Interactions in Case-Control Studies. Nat. Genet. 39, 1167–1173. doi:10.1038/ng2110
Zhang, Z., Luo, D., Zhong, X., Choi, J. H., Ma, Y., Wang, S., et al. (2019). SCINA: Semi-supervised Analysis of Single Cells In Silico. Genes 10 (7), 531. doi:10.3390/genes10070531
Zhou, J., and Troyanskaya, O. G. (2015). Predicting Effects of Noncoding Variants with Deep Learning-Based Sequence Model. Nat. Methods 12, 931–934. doi:10.1038/nmeth.3547
Zhu, X. (2008). Semi-supervised Learning Literature Survey. Technical Report 1530. University of Wisconsin-Madison.
Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., and Telenti, A. (2018). A Primer on Deep Learning in Genomics. Nat. Genet. 1. doi:10.1038/s41588-018-0295-5

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2022 Shen, Jiang, Wen, Li and Lu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.