
Briefings in Bioinformatics, 22(4), 2021, 1–10

https://doi.org/10.1093/bib/bbaa229
Method Review

A survey on deep learning in DNA/RNA motif mining


Ying He , Zhen Shen, Qinhu Zhang, Siguo Wang and De-Shuang Huang
Corresponding author: De-Shuang Huang, College of Electronics and Information Engineering, Tongji University, 4800 Caoan Rd, Shanghai
201804, China. E-mail: [email protected]

Abstract
DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- or RNA-protein binding sites, which helps to elucidate the mechanisms of gene regulation. For the past few decades, researchers have been designing efficient and accurate algorithms for mining motifs. These algorithms can be roughly divided into two categories: enumeration approaches and probabilistic methods. In recent years, machine learning has made great progress, and deep learning in particular has achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models and hybrid CNN–RNN based models. We introduce the application of deep learning in motif mining in terms of data preprocessing and the features of existing deep learning architectures, and we compare the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that current methods remain relatively simple compared with those in fields such as computer vision, natural language processing (NLP) and computer games. Therefore, a summary of deep learning in motif mining is necessary to help researchers understand this field.

Key words: motif mining; deep learning; protein binding site; recurrent neural networks; convolutional neural network

Introduction

Motif plays a key role in gene-expression regulation at both the transcriptional and posttranscriptional levels. DNA/RNA motifs are involved in many biological processes, including alternative splicing, transcription and translation [1–4]. From the late 1990s to the early 21st century, researchers gradually identified, through biological experiments, a large number of proteins with binding functions and their corresponding binding sites on genome sequences. The binding sites of the same protein are conserved short sequences regarded as motifs, and people initially used such conserved sequences to describe protein binding sites [5–8].

Ying He is pursuing a Ph.D. degree in computer science and technology at Tongji University, China. His research interests include bioinformatics, machine
learning and deep learning.
Zhen Shen is pursuing a Ph.D. degree in computer science and technology at Tongji University, China. His research interests include bioinformatics,
machine learning and deep learning.
Qinhu Zhang received a Ph.D. degree in computer science and technology at Tongji University, China, in 2019. He is currently working at Tongji University
as a post-doctor. His research interests include bioinformatics, machine learning and deep learning.
Siguo Wang is working toward the Ph.D. degree in computer science and technology, Tongji University, China. Her research interests include bioinformatics,
machine learning and deep learning.
De-Shuang Huang is a chaired professor at Tongji University. At present, he is the Director of the Institute of Machine Learning and Systems Biology, Tongji
University. Dr. Huang is currently IAPR Fellow and a senior member of the IEEE. His current research interest includes bioinformatics, pattern recognition
and machine learning.
Submitted: 18 July 2020; Received (in revised form): 19 August 2020
© The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
For commercial re-use, please contact [email protected]


With the deepening of researchers' understanding of motif research, various motif mining algorithms have emerged [9]. Early motif mining methods fall into two principal types: enumeration approaches and probabilistic methods [10].

The first class is based on simple word enumeration. The Yeast Motif Finder (YMF) algorithm, developed by Sinha et al. [11], used a consensus representation to detect short motifs with a small number of degenerate positions in the yeast genome. YMF consists of two main steps: the first step enumerates all motifs in the search space, and the second step calculates the z-score of every motif to find the highest-scoring one. Bailey proposed the discriminative regular expression motif elicitation algorithm, which assesses the significance of motifs using Fisher's exact test [12].

To accelerate word enumeration-based motif mining, special techniques such as suffix trees and parallel processing were used [13]. Motif mining algorithms such as LMMO [14], DirectFS [9], ABC [15], DiscMLA [16], CisFinder [12], Weeder [17], Fmotif [18] and MCES [19] all adopted this idea.

In probabilistic motif mining methods, a probabilistic model that requires only a few parameters is constructed [20]. These methods specify a distribution over bases for each site in the binding region to decide whether a motif is present or not [21], and they usually build this distribution as a position-specific scoring matrix (PSSM/PWM), also called a motif matrix [22]. A PWM is an m by n matrix (m represents the length of a specific protein binding site, and n represents the number of nucleotide types), which indicates the degree of preference of a specific protein for each base at each position of its binding motif [23]. As Figure 1 shows, a PWM can intuitively express the binding preference of a specific protein with few parameters, so if a set of binding site data for a specific protein is given, the parameters of the PWM can be learned from these data. Methods based on the PWM approach include MEME [11], STEME [24], EXTREME [25], AlignACE [26] and BioProspector [27].
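To make the PWM idea concrete, the following sketch (our own illustration, not code from any of the cited tools) builds a position frequency matrix from aligned binding sites, converts it into a log-odds PWM and scores a candidate sequence window, mirroring the procedure summarized in Figure 1. The example sites are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def pwm_from_sites(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned binding sites."""
    length = len(sites[0])
    pfm = np.zeros((4, length))                      # position frequency matrix
    for site in sites:
        for pos, base in enumerate(site):
            pfm[BASES.index(base), pos] += 1
    probs = (pfm + pseudocount) / (len(sites) + 4 * pseudocount)
    return np.log2(probs / background)               # PSSM/PWM on a log2 scale

def score_window(pwm, window):
    """Sum the PWM entries of the bases in a window of matching length."""
    return sum(pwm[BASES.index(b), i] for i, b in enumerate(window))

sites = ["TGACGTCA", "TGACGTAA", "TGATGTCA", "AGACGTCA"]
pwm = pwm_from_sites(sites)
print(score_window(pwm, "TGACGTCA"))   # high score: close to the consensus
print(score_window(pwm, "GCTAGGAT"))   # low score: unrelated sequence
```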
ChIP-seq and high-throughput sequencing have tremendously increased the amount of in vivo data available [28], which makes it possible to study motif mining with deep learning [29]. In bioinformatics, deep learning methods are not yet numerous, but their use is on the rise [30]. Known applications include DNA methylation [31, 32], protein classification [33–35], splicing regulation and gene expression [36–38] and biological image analysis tasks [39–42]. Of particular relevance to our work are applications to motif mining, such as DNA-/RNA-protein binding sites [43], chromatin accessibility [36, 44–46], enhancers [47–49] and DNA shape [50, 51].

DeepBind [43] is the first study to apply deep learning to motif mining. As Figure 2 shows, DeepBind predicts DNA-protein and RNA-protein binding sites with a CNN in a way that machine learning and genomics researchers can easily understand. It treats a genome sequence window like a picture: unlike an image composed of pixels with three color channels (R, G, B), the genomic sequence is a fixed-length window composed of four channels, (A, C, G, T) or (A, C, G, U). The problem of DNA-protein binding site prediction therefore resembles binary image classification.

After this, a series of studies on deep learning in motif mining appeared. Some researchers focused on the impact of various deep learning parameters, such as the number of layers, on motif mining [52]. Others experimented further with deep learning frameworks, adding a long short-term memory (LSTM) layer to DeepBind and obtaining a new model that combines CNN and RNN for motif mining [53]. Besides, there are methods such as iDeepS that combine CNN and RNN to target specific RNA binding proteins (RBPs) [54]. The advantage of the combined RNN-CNN model is that the added RNN layer can capture long-term dependencies between the sequence features learned by the CNN layer, improving prediction accuracy. Other researchers used a pure RNN-based method: the KEGRU method [55] creates an internal state of the network by using a k-mer representation and an embedding layer, and it captures long-term dependencies with a layer of bidirectional gated recurrent units (bi-GRUs). Besides, many researchers have built on these three basic models, for example, Xiaoyong Pan [56], Qinhu Zhang [51, 57], Wenxuan Xu [58], Dailun Wang [59] and Wenbo Yu [60].

Although there are now many deep learning methods in motif mining, they remain relatively primitive and simple compared with deep learning methods in computer vision and NLP, such as those for images [61, 62], video [63] and question answering [64]. Therefore, it is necessary to summarize motif mining with deep learning to help researchers better understand the field. In this paper, we introduce the basic biological background of motif mining, provide insights into the differences between the basic deep learning models (CNN and RNN) and discuss some new trends in the development of deep learning. This article aims to help researchers who lack a background in deep learning or in biology to quickly understand motif mining.

The remainder of this paper is organized as follows: the second section describes the basic biological background, several common databases and the basic knowledge of motifs. The third section describes different deep learning models for DNA/RNA motif mining. Finally, we discuss new developments, challenges and possible future directions in the fourth section.

Basic Knowledge of Motif

In this section, we introduce some basic knowledge of motif mining. Motif mining (or motif discovery) in biological sequences can be defined as the problem of finding a set of short, similar, conserved sequence elements ('motifs') that share a common biological function [65]. Motif mining, for example the discovery of transcription factor binding sites (TFBSs), has been one of the most widely studied problems in bioinformatics because of its high biological and bioinformatic significance [66, 67].

Figure 3 shows how multiple sequences are recognized by the same transcription factor (CREB). Their 'consensus' means that at each position the transcription factor prefers a particular nucleic acid. Since transcription factor binding can tolerate approximate matches, all oligos that differ from the consensus sequence by up to a maximum number of nucleotide substitutions can be considered valid instances of the same TFBS.

Having introduced the basic concept of a motif, we now introduce common databases and data preprocessing methods. The commonly used motif mining databases are the TCGA database [68], the NCBI database [69] and the ENCODE database [70]. Generally speaking, two data preprocessing methods are used, as shown in Figure 4 (bottom left).

The simple method is one-hot encoding, which is often used to indicate the state of a state machine [71].

Figure 1. The process of generating the position frequency matrix (PFM), PSSM and logo of SPI1 [104]. First, a PFM is generated from the number of times each nucleotide appears at each position of the alignment. Then, the PFM is converted into a logarithmic-scale PSSM/PWM. By adding the corresponding nucleotide values of the PSSM, the score of any DNA sequence window with the same length as the matrix can be calculated, and the matrix can be drawn as a sequence logo.

For example, DNA sequences can be encoded with one-hot codes as binary vectors: A = (1,0,0,0), G = (0,1,0,0), C = (0,0,1,0) and T = (0,0,0,1). RNA sequences can be encoded similarly by simply changing T to U. One-hot encoding is easy to design and modify, and illegal states are easy to detect. However, the resulting representation is sparse and context-free.
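A minimal sketch of this encoding (our own illustration; the channel order follows the mapping above):

```python
import numpy as np

def one_hot_encode(seq, alphabet="AGCT"):
    """Encode a DNA sequence as a (4, len(seq)) one-hot matrix."""
    mat = np.zeros((len(alphabet), len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        mat[alphabet.index(base), pos] = 1.0   # raises ValueError on illegal bases
    return mat

x = one_hot_encode("ATCGCGTACGATCCG")
print(x.shape)   # (4, 15): four channels, one column per base
```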
The other method is to tokenize the sequence into k-mers and vectorize them by embedding [44]. For example, we can tokenize the DNA sequence 'ATCGCGTACGATCCG' into different k-mers, as shown in Table 1. The k-mers can then be vectorized using embedding methods widely used in the NLP field [72], such as word2vec [73]. RNA sequences can be represented similarly.
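The sketch below (our own illustration, with an arbitrary embedding size) tokenizes a sequence into overlapping k-mers with a chosen window (stride) and maps each k-mer to a learnable embedding vector, mirroring the scheme in Table 1:

```python
import torch
import torch.nn as nn

def kmer_tokenize(seq, k=4, stride=2):
    """Split a sequence into k-mers taken every `stride` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATCGCGTACGATCCG", k=4, stride=2)
print(tokens)   # ['ATCG', 'CGCG', 'CGTA', 'TACG', 'CGAT', 'ATCC']

# Map each distinct k-mer to an integer id, then to a dense embedding vector.
vocab = {kmer: idx for idx, kmer in enumerate(sorted(set(tokens)))}
ids = torch.tensor([vocab[t] for t in tokens])
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(ids)          # shape: (number of k-mers, 8)
print(vectors.shape)
```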
Deep Learning in Motif Mining

In recent years, deep learning has achieved great success in various application scenarios, which has led researchers to apply it to DNA and RNA motif mining. Next, we introduce these models in detail. There are three main types of deep learning frameworks in motif mining: CNN-based models (Figure 4, left), RNN-based models (Figure 4, center) and hybrid CNN–RNN-based models (Figure 4, right). We summarize several classic deep learning methods in motif mining in Table 2.

DeepBind [43] is the first attempt to use a CNN to predict DNA or RNA motifs from raw DNA or RNA sequences. DeepBind uses a single CNN stage, which consists of one convolutional layer followed by rectification and pooling, and a fully connected network (FCN) at the end that transforms the feature vector into a scalar binding score. It set a precedent for deep learning in motif mining and provides a basic framework for other deep learning methods. It maps each base to four channels, similar to the RGB channels of a color image, and uses one-hot encoding to complete the vectorization. Many subsequent methods build their models on this scheme.
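A minimal sketch of this single-convolutional-layer architecture in PyTorch (our own illustration of the general design; the layer sizes are arbitrary placeholders, not the published DeepBind hyperparameters):

```python
import torch
import torch.nn as nn

class SingleLayerCNN(nn.Module):
    """One-hot input -> convolution -> ReLU -> global max pooling -> FCN score."""
    def __init__(self, num_filters=16, motif_len=12):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=4, out_channels=num_filters,
                              kernel_size=motif_len)   # 4 channels: A, C, G, T
        self.fc = nn.Sequential(
            nn.Linear(num_filters, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):                 # x: (batch, 4, sequence_length)
        h = torch.relu(self.conv(x))      # motif scanning
        h = h.max(dim=2).values           # global max pooling over positions
        return self.fc(h).squeeze(-1)     # scalar binding score per sequence

model = SingleLayerCNN()
scores = model(torch.randn(8, 4, 101))    # batch of 8 one-hot encoded windows
print(scores.shape)                        # torch.Size([8])
```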
DeepSEA [38] is a CNN-based deep learning method that uses three convolutional layers with 320, 480 and 960 kernels, respectively. Higher-level convolutional layers receive input from a larger spatial range and can represent more complex features. DeepSEA adds an FCN layer on top of the third convolutional layer, in which every neuron receives input from all outputs of the previous layer, so that information from the entire sequence can be captured. The convolutional part of the DeepSEA model consists of three convolutional layers and two max-pooling layers in alternating order, through which the motifs are learned.

DeepSNR [74] is also a CNN-based deep learning method. The convolutional part of the DeepSNR model has the same structure as the DeepBind network, but DeepSNR adds a deconvolution network that is a mirrored version of the convolution network; the convolution network reduces the size of the activations and the deconvolution network enlarges them again through combinations of unpooling and deconvolution operations.

Figure 2. The parallel training process of DeepBind [43]. (A) The DeepBind model processes five independent sequences in parallel. The data first pass through the convolutional layer to extract features and then through the pooling layer to condense them. Finally, the features go through the activation function to produce the prediction, which is compared with the target to calculate the loss and update the weights, improving prediction accuracy. (B) The dataset is divided into training, validation and test sets, which are used to calculate the training AUC (area under the curve), validation AUC and test AUC, respectively, in order to select the optimal parameters.

Table 1. Different parameters for k-mers

Length  Window  Tokenized                        Vectorization

3       3       ATC GCG TAC GAT CCG              0321 3412 4532 4214
4       4       ATCG CGTA CGAT                   0123 3412 4532
5       5       ATCGC GTACG ATCCG                4124 5124 2134
4       2       ATCG CGCG CGTA TACG CGAT ATCC    2563 3124 4236 3578 2145
4       3       ATCG GCGT TACG GATC              4252 5134 2136 3451 2411

The DNA sequence 'ATCGCGTACGATCCG' is cut into different k-mers, together with their vectors, when the length is (3, 4, 5, 4, 4) and the window is (3, 4, 5, 2, 3).

Table 2. Deep learning algorithm in DNA motif mining

Model DeepBind DeepSNR DeepSEA Dilated DanQ BiRen KEGRU iDeeps

Architecture CNN CNN CNN CNN CNN + RNN CNN + RNN RNN CNN + RNN
Embedding NO NO NO NO NO NO YES NO
Input One-hot One-hot One-hot One-hot One-hot k-mer k-mer One-hot

It shows the architecture, embedding and input of eight classic deep learning models in motif mining.

Dilated [75] is a deep learning method based on a multilayer dilated CNN. It learns the mapping from a DNA region of the nucleotide sequence to the positions of regulatory markers in that region. Dilated convolutions capture a hierarchical representation of the input over a larger receptive field than standard convolutions, so the model can scale to longer upstream and downstream sequence context.
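The snippet below (our own illustration, not the published Dilated code) shows how stacking dilated 1D convolutions grows the receptive field while the kernel size stays fixed:

```python
import torch
import torch.nn as nn

# Three stacked 1D convolutions with dilation 1, 2 and 4.
# With kernel_size=3, the receptive field grows to 1 + 2*(1+2+4) = 15 positions.
dilated_stack = nn.Sequential(
    nn.Conv1d(4, 32, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
)

x = torch.randn(1, 4, 200)          # one one-hot encoded 200-bp window
print(dilated_stack(x).shape)        # torch.Size([1, 32, 200]); length preserved by padding
```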
DanQ [53] uses a single-layer CNN followed by a bidirectional LSTM (BLSTM). The first layer of the DanQ model scans for the positions of motifs in the sequence through convolutional filtering. The convolutional step of DanQ is much simpler than DeepSEA's: it contains one convolutional layer and one max-pooling layer to learn the motifs. After the max-pooling layer comes the BLSTM layer. Motifs can follow a regulatory grammar determined by physical constraints, which dictate the spatial arrangement and frequency of motif combinations in vivo, a feature associated with tissue-specific functional elements such as enhancers; this is why the LSTM layer is placed after the max-pooling layer. The last two layers of the DanQ model are a dense layer of rectified linear units and a multitask sigmoid output, similar to the DeepSEA model. The advantage of this combined RNN-CNN model is that the added RNN layer can capture long-term dependencies between the sequence features extracted by the CNN layer, improving prediction accuracy.
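A compact sketch of this hybrid CNN-BLSTM design in PyTorch (our own illustration of the general architecture; filter counts, hidden sizes and the number of output tasks are placeholders rather than the published DanQ settings):

```python
import torch
import torch.nn as nn

class HybridCNNBLSTM(nn.Module):
    """Convolutional motif scanner followed by a bidirectional LSTM and dense output."""
    def __init__(self, num_filters=64, lstm_hidden=32, num_tasks=1):
        super().__init__()
        self.conv = nn.Conv1d(4, num_filters, kernel_size=19, padding=9)
        self.pool = nn.MaxPool1d(kernel_size=4)
        self.blstm = nn.LSTM(input_size=num_filters, hidden_size=lstm_hidden,
                             batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 64), nn.ReLU(),
            nn.Linear(64, num_tasks), nn.Sigmoid())

    def forward(self, x):                         # x: (batch, 4, length)
        h = self.pool(torch.relu(self.conv(x)))   # (batch, filters, length / 4)
        h = h.transpose(1, 2)                     # (batch, steps, filters) for the LSTM
        out, _ = self.blstm(h)
        return self.head(out[:, -1, :])           # use the last time step for prediction

model = HybridCNNBLSTM()
print(model(torch.randn(8, 4, 200)).shape)        # torch.Size([8, 1])
```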
BiRen [49] is a hybrid deep learning architecture that combines the sequence encoding and representation ability of a CNN with the strength of a bidirectional recurrent neural network in processing long DNA sequences. BiRen was trained on a limited set of experimentally verified enhancer elements from the VISTA enhancer browser [76], whose enhancer activity has been evaluated in transgenic mice. BiRen can learn regulatory codes directly from genomic sequences, demonstrates excellent recognition accuracy, is robust to noisy data and generalizes to enhancer prediction in other species on the basis of sequence features alone. BiRen enables researchers to gain a deeper understanding of the regulatory codes of enhancer sequences.

KEGRU [55], which uses a layer of GRUs with k-mer embedding, is a pure RNN model without any CNN layer. KEGRU mainly uses the k-mer representation and embedding layer to accomplish the feature extraction that the CNN performs in other models. This structure makes it better at modeling sequence relationships and gives it good performance in motif mining.
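A minimal sketch of such a pure-RNN design (our own illustration; the vocabulary size, embedding size and hidden size are placeholders), in which k-mer ids are embedded and fed to a bidirectional GRU:

```python
import torch
import torch.nn as nn

class KmerBiGRU(nn.Module):
    """k-mer embedding followed by a bidirectional GRU; no convolutional layer."""
    def __init__(self, vocab_size=256, embed_dim=50, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, kmer_ids):                  # kmer_ids: (batch, steps) of integer ids
        emb = self.embed(kmer_ids)                # (batch, steps, embed_dim)
        out, _ = self.bigru(emb)
        return torch.sigmoid(self.fc(out[:, -1])) # binding probability per sequence

model = KmerBiGRU()
ids = torch.randint(0, 256, (8, 60))              # 8 sequences, 60 k-mer tokens each
print(model(ids).shape)                           # torch.Size([8, 1])
```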
iDeepS [54] uses convolutional neural networks (CNNs) and a BLSTM network to simultaneously identify binding sequence and structure motifs from RNA sequences. The CNN module embedded in iDeepS can automatically capture interpretable RBP binding motifs. The BLSTM network allows the iDeepS framework not only to achieve better performance on binding sequences but also to capture structure motifs easily.

Model selection may be the most challenging step in deep learning because the performance of deep learning algorithms is very sensitive to different parameters [77]. deepRAM [78] provides implementations of several existing architectures and their variants: DeepBind (single-layer CNN), DeepBind∗ (multilayer CNN), DeepBind-E∗ (multilayer CNN, k-mer embedding), DanQ (single-layer CNN, bidirectional LSTM), DanQ∗ (multilayer CNN, bidirectional LSTM), Dilated (multilayer dilated CNN), KEGRU (k-mer embedding, single-layer GRU), ECLSTM (k-mer embedding, single-layer CNN and LSTM) and ECBLSTM (k-mer embedding, single-layer CNN and bidirectional LSTM). Its authors conducted extensive experimental comparisons, which gave researchers a deeper understanding of these methods.

Figure 3. A set of binding sites recognized by the same TF (CREB) [65]. The figure shows how multiple sequences are recognized by the same transcription factor (CREB). A 'consensus' (bottom left) is built by counting the frequency of each nucleotide in the sequences and taking the most frequent nucleotide at each position [65]; positions with no obvious preference form a 'degenerate' consensus (K = G or T; M = A or C; N = any nucleotide, according to the IUPAC codes [105]). Besides, the motifs can be converted into an alignment matrix of nucleotide frequencies (top right) by dividing each column by the number of sites used, as well as into a 'sequence logo' (bottom left) [106] showing nucleotide conservation and the corresponding information content.

Before introducing the experimental results of deepRAM [78], we introduce the two groups of datasets used in its experiments. The first group is a DNA dataset comprising 83 ChIP-seq datasets from the ENCODE project [70]. The second group is an RNA dataset comprising 31 CLIP-seq datasets for 19 proteins [79–81].

The deepRAM study [78] conducted a large number of experiments on these two groups of data and carried out an in-depth comparison of the deep learning models described above. The experimental results on these datasets are shown in Figure 5.

Among all the models, ECBLSTM performed best, with a median AUC of 0.930 on the ChIP-seq data and a median AUC of 0.951 on the CLIP-seq data. DeepBind, the simplest model considered here, achieved median AUCs of 0.902 and 0.914 on the two datasets, respectively; it uses one-hot sequence encoding and a single convolutional layer. Comparing the performance of ECBLSTM with DeepBind-E∗ shows that adding an LSTM layer can further improve performance, because LSTM layers are better than CNN layers at capturing long-term dependencies. Compared with the original DeepBind, both DeepBind∗ and DeepBind-E∗ provide improved performance. Comparing DanQ with DanQ∗ further shows that models deeper than a single-layer CNN tend to perform better. These results demonstrate the performance advantages of deeper and more complex networks. Zhang [17] found that a simpler model performs best in this task, which is the opposite of the conclusion drawn from deepRAM's experiments. Based on the experimental results and theoretical analysis, the complexity of the model should be matched to the task and the data: too many parameters can easily cause over-fitting [82], and in general the number of model parameters should not greatly exceed the number of data samples.

Discussion

From traditional motif-finding methods to the latest developments in deep learning, great progress has been made alongside advances in sequencing technology and new algorithms. In the third section, we analyzed the existing models and their variants and found that more complex models tend to perform better when data are sufficient. A recent trend is that the models are becoming more and more complex. For example, researchers are combining existing models with new components such as attention units [83, 84], capsule networks [85], multiscale convolutional gated recurrent unit networks [86], weakly supervised CNNs [87] and multiple-instance learning [88]. However, the existing deep learning models in motif mining are still simple, usually no more than three layers, compared with models in the image field that often exceed 10 layers. Therefore, there is still much room for improvement.

Figure 4. Sequence representation in motif mining [78]. It shows two data preprocessing methods (bottom left) and three architectures: CNN-only (left), RNN-only (center) and hybrid CNN–RNN models (right).

Recently, because adversarial training of neural networks can act as regularization and provide higher performance, this field has developed rapidly, including generative adversarial networks [89] and a series of related work such as Wasserstein GAN [90], MolGAN [91] and NetGAN [92]. In motif mining, GANs may be used to automatically generate negative examples instead of relying on simple random generation or shuffling of the positive sequences. Besides, pretrained models [93] have achieved significant results in the NLP field, from word2vec [73, 94] to, more recently, BERT [95] and GPT [96]. In motif mining, pretraining can be used to enhance the robustness and generalization ability of a model. The great success of AlphaGo [97] triggered an unprecedented change in the Go world and made deep reinforcement learning familiar to the public. In particular, AlphaGo Zero does not require any record of human games and relies only on deep reinforcement learning [98]; after training for just 3 days from scratch, it surpassed the knowledge of Go that humans have accumulated over thousands of years. In motif mining, reinforcement learning may likewise enable the discovery of motifs beyond current human knowledge.

As we enter the era of big data, deep learning has become a very important direction in both academia and industry. In bioinformatics, which has already made great progress with traditional machine learning, deep learning is expected to produce encouraging results [99]. In this review, we conducted a comprehensive survey of the application of deep learning in the field of motif mining. We hope this review will help researchers understand the field and promote the application of motif mining in research.

Of course, we also need to recognize the limitations of deep learning methods and the promising directions of future research. Although deep learning is promising, it is not a panacea. In many applications of motif mining, there are still many potential challenges, including unbalanced or limited data, the interpretation of deep learning results [71] and the choice of an appropriate architecture and hyperparameters. For unbalanced or limited data, common remedies are augmenting the dataset [48] or few-shot learning [100]. For the interpretation of deep learning results, common approaches are interpretability built into the model itself [101] or post hoc interpretation of the predictions [71].

Figure 5. Comparison of nine deep learning models [78]. The figure compares the performance of these models on DNA and RNA motif mining tasks. (A) The AUC distribution of the nine models on 83 ChIP-seq datasets. (B) P-value-annotated heat maps for pairwise comparisons of the nine models on the 83 ChIP-seq datasets. (C) The AUC distribution of the nine models on 31 CLIP-seq datasets. (D) P-value-annotated heat maps for pairwise comparisons of the nine models on the 31 CLIP-seq datasets.

For the choice of an appropriate architecture and hyperparameters, frameworks such as Spearmint [102], Hyperopt [103] and deepRAM [78] allow the hyperparameter space to be explored automatically.
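As one illustration of such automatic exploration, the sketch below uses Hyperopt's fmin/tpe interface to search over a few architecture hyperparameters; the search space and the dummy objective are placeholders, and in practice the objective would train a motif mining model and return, for example, one minus its validation AUC:

```python
from hyperopt import fmin, tpe, hp

# Hypothetical search space for a CNN-based motif model.
space = {
    "num_filters": hp.choice("num_filters", [16, 32, 64, 128]),
    "kernel_size": hp.choice("kernel_size", [9, 15, 19, 25]),
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
}

def objective(params):
    # Placeholder: train the model with `params` and return the validation loss.
    return (params["num_filters"] - 64) ** 2 * 1e-4 + params["learning_rate"]

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)   # indices/values of the best hyperparameters found
```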
Besides, how to make full use of the ability of deep learning and accelerate its training process also needs further research. Therefore, we hope that the issues discussed in this article will be helpful to the success of future deep learning methods in motif mining.

Key Points

• Motif mining (or motif discovery) in biological sequences can be defined as the problem of finding a set of short, similar, conserved sequence elements ('motifs') with common biological functions. Motifs play a key role in gene-expression regulation at both the transcriptional and posttranscriptional levels.
• In recent years, deep learning has achieved great success in various application scenarios, which has led researchers to apply it to DNA and RNA motif mining. There are three main types of deep learning frameworks in motif mining: CNN-based models, RNN-based models and hybrid CNN–RNN-based models.
• Briefly, we also introduce the application of deep learning in the field of motif mining in terms of data preprocessing and the features of existing deep learning architectures, and we compare the differences between the basic deep learning models.

Acknowledgement

This work was supported by the grant of National Key R&D Program of China (Nos. 2018AAA0100100 & 2018YFA0902600) and partly supported by National Natural Science Foundation of China (Grant nos. 61861146002, 61520106006, 61732012, 61932008, 61772370, 61672382, 61702371, 61532008, 61772357, and 61672203), China Postdoctoral Science Foundation (Grant no. 2017M611619), the "BAGUI Scholar" Program and the Scientific & Technological Base and Talent Special Program, GuiKe AD18126015 of the Guangxi Zhuang Autonomous Region of China, and Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), LCNBI and ZJLab.

References

1. Ferre F, Colantoni A, Helmer-Citterich M. Revealing protein–lncRNA interaction. Brief Bioinform 2016;17:106–16.
2. Gerstberger S, Hafner M, Tuschl T. A census of human RNA-binding proteins. Nat Rev Genet 2014;15:829–45.
3. Rajyaguru P, She M, Parker R. Scd6 targets eIF4G to repress translation: RGG motif proteins as a class of eIF4G-binding proteins. Mol Cell 2012;45:244–54.
4. Guo W-L, Huang D-S. An efficient method to transcription factor binding sites imputation via simultaneous completion of multiple matrices with positional consistency. Mol Biosyst 2017;13:1827–37.
5. Stormo GD, Hartzell GW. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci 1989;86:1183–7.
6. Welch W, Ruppert J, Jain AN. Hammerhead: fast, fully automated docking of flexible ligands to protein binding sites. Chem Biol 1996;3:449–62.
7. Neuvirth H, Raz R, Schreiber G. ProMate: a structure based prediction program to identify the location of protein–protein binding sites. J Mol Biol 2004;338:181–99.
8. Bradford JR, Westhead DR. Improved prediction of protein–protein binding sites using a support vector machines approach. Bioinformatics 2005;21:1487–94.
9. Zhu L, Li N, Bao W, et al. Learning regulatory motifs by direct optimization of Fisher Exact Test Score. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2016, pp. 86–91.
10. Hashim FA, Mabrouk MS, Al-Atabany W. Review of different sequence motif finding algorithms. Avicenna J Med Biotechnol 2019;11:130.
11. Sinha S, Tompa M. YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res 2003;31:3586–8.
12. Bailey TL. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 2011;27:1653–9.
13. Pavesi G, Mereghetti P, Mauri G, et al. Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res 2004;32:W199–203.
14. Zhu L, Zhang H-B, Huang D-S. LMMO: a large margin approach for refining regulatory motifs. IEEE/ACM Trans Comput Biol Bioinform 2017;15:913–25.
15. Karaboga D, Aslan S. A discrete artificial bee colony algorithm for detecting transcription factor binding sites in DNA sequences. Genet Mol Res 2016;15:1–11.
16. Zhang H, Zhu L, Huang D. DiscMLA: AUC-based discriminative motif learning. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine. 2015, pp. 250–5.
17. Zhang Y, Wang P, Yan M. An entropy-based position projection algorithm for motif discovery. Biomed Res Int 2016;2016:1–11.
18. Sharov AA, Ko MS. Exhaustive search for over-represented DNA sequence motifs with CisFinder. DNA Res 2009;16:261–73.
19. Jia C, Carson MB, Wang Y, et al. A new exhaustive method and strategy for finding motifs in ChIP-enriched regions. PLoS One 2014;9:e86044.
20. Sinha S. On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics 2006;22:e454–63.
21. Yu Q, Huo H, Chen X, et al. An efficient algorithm for discovering motifs in large DNA data sets. IEEE Trans Nanobioscience 2015;14:535–44.
22. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics 2000;16:16–23.
23. Xia X. Position weight matrix, gibbs sampler, and the associated significance tests in motif characterization and prediction. Forensic Sci 2012;2012:1–15.
24. van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998;281:827–42.
25. Thomas-Chollier M, Herrmann C, Defrance M, et al. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 2012;40:e31.
26. Ma X, Kulkarni A, Zhang Z, et al. A highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using positional information. Nucleic Acids Res 2012;40:e50.
27. Pavesi G, Mauri G, Pesole G. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001;17:S207–14.
28. Myllykangas S, Buenrostro J, Ji HP. Overview of sequencing technology platforms. In: Bioinformatics for High Throughput Sequencing. Berlin: Springer, 2012, 11–25.
29. Zhu L, Guo W-L, Huang D-S, et al. Imputation of ChIP-seq datasets via Low Rank Convex Co-Embedding. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine. 2015, pp. 141–4.
30. Angermueller C, Pärnamaa T, Parts L, et al. Deep learning for computational biology. Mol Syst Biol 2016;12:878.
31. Vidaki A, Ballard D, Aliferi A, et al. DNA methylation-based forensic age prediction using artificial neural networks and next generation sequencing. Forensic Sci Int Genet 2017;28:225–36.
32. Angermueller C, Lee HJ, Reik W, et al. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol 2017;18:1–13.
33. Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One 2015;10:e0141287.
34. Pärnamaa T, Parts L. Accurate classification of protein subcellular localization from high-throughput microscopy images using deep learning. G3: Genes, Genomes, Genet 2017;7:1385–92.
35. Almagro Armenteros JJ, Sønderby CK, Sønderby SK, et al. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics 2017;33:3387–95.
36. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016;26:990–9.
37. Leung MK, Xiong HY, Lee LJ, et al. Deep learning of the tissue-regulated splicing code. Bioinformatics 2014;30:i121–9.
38. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods 2015;12:931–4.
39. Bar Y, Diamant I, Wolf L, et al. Deep learning with non-medical training used for chest pathology identification. In: Medical Imaging 2015: Computer-Aided Diagnosis. Bellingham, WA: International Society for Optics and Photonics, 2015, 94140V.
40. Tron R, Zhou X, Daniilidis K. A survey on rotation optimization in structure from motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016, 77–85.
41. Mahmud M, Kaiser MS, Hussain A, et al. Applications of deep learning and reinforcement learning to biological data. IEEE Trans Neural Netw Learn Syst 2018;29:2063–79.
42. Affonso C, Rossi ALD, Vieira FHA, et al. Deep learning for biological image classification. Expert Syst Appl 2017;85:114–22.
43. Alipanahi B, Delong A, Weirauch MT, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33:831–8.
44. Min X, Zeng W, Chen N, et al. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics 2017;33:i92–101.
45. Nair S, Kim DS, Perricone J, et al. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics 2019;35:i108–16.
46. Liu Q, Xia F, Yin Q, et al. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics 2018;34:732–8.
47. Kleftogiannis D, Kalnis P, Bajic VB. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Res 2015;43:e6.
48. Cohn D, Zuk O, Kaplan T. Enhancer identification using transfer and adversarial deep learning of DNA sequences. BioRxiv 2018;264200.
49. Yang B, Liu F, Ren C, et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics 2017;33:1930–6.
50. Yang J, Ma A, Hoppe AD, et al. Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework. Nucleic Acids Res 2019;47:7809–24.
51. Zhang Q, Shen Z, Huang D-S. Predicting in-vitro transcription factor binding sites using DNA sequence + shape. IEEE/ACM Trans Comput Biol Bioinform 2019.
52. Zhang S, Zhou J, Hu H, et al. A deep learning framework for modeling structural features of RNA-binding protein targets. Nucleic Acids Res 2016;44:e32.
53. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 2016;44:e107.
54. Pan X, Rijnbeek P, Yan J, et al. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics 2018;19:511.
55. Shen Z, Bao W, Huang D-S. Recurrent neural network for predicting transcription factor binding sites. Sci Rep 2018;8:1–10.
56. Pan X, Shen H-B. Predicting RNA–protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics 2018;34:3427–36.
57. Zhang Q, Zhu L, Huang D-S. High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE/ACM Trans Comput Biol Bioinform 2018;16:1184–92.
58. Xu W, Zhu L, Huang D-S. DCDE: an efficient deep convolutional divergence encoding method for human promoter recognition. IEEE Trans Nanobioscience 2019;18:136–45.
59. Wang D, Zhang Q, Yuan C-A, et al. Motif discovery via convolutional networks with K-mer embedding. In: International Conference on Intelligent Computing. Berlin: Springer, 2019, 374–82.
60. Yu W, Yuan C-A, Qin X, et al. Hierarchical attention network for predicting DNA-protein binding sites. In: International Conference on Intelligent Computing. Berlin: Springer, 2019, 366–73.
61. Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning. 2015, 2048–57.
62. Tang P, Wang H, Kwong S. G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition. Neurocomputing 2017;225:188–97.
63. Yao L, Torabi A, Cho K, et al. Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4507–15.
64. Noh H, Hongsuck Seo P, Han B. Image question answering using convolutional neural network with dynamic parameter prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 30–8.
65. Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 2013;14:225–37.
66. Pavesi G, Mauri G, Pesole G. In silico representation and discovery of transcription factor binding sites. Brief Bioinform 2004;5:217–36.
67. Sandve GK, Drabløs F. A survey of motif discovery methods in an integrated framework. Biol Direct 2006;1:1–16.
68. Tomczak K, Czerwińska P, Wiznerowicz M. The cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncol 2015;19:A68.
69. Sherry ST, Ward M-H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11.
70. Consortium EP. The ENCODE (ENCyclopedia of DNA elements) project. Science 2004;306:636–40.
71. Lanchantin J, Singh R, Wang B, et al. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. In: Pacific Symposium on Biocomputing, Vol. 2017. Singapore: World Scientific, 2017, 254–65.
72. Koren S, Walenz BP, Berlin K, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017;27:722–36.
73. Goldberg Y, Levy O. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv:1402.3722. 2014.
74. Salekin S, Zhang JM, Huang Y. A deep learning model for predicting transcription factor binding location at single nucleotide resolution. In: 2017 IEEE EMBS International Conference on Biomedical & Health Informatics. 2017, pp. 57–60.
75. Gupta A, Rush AM. Dilated convolutions for modeling long-distance genomic dependencies. arXiv:1710.01278. 2017.
76. Visel A, Minovitsky S, Dubchak I, et al. VISTA enhancer browser—a database of tissue-specific human enhancers. Nucleic Acids Res 2007;35:D88–92.
77. Lipton ZC, Steinhardt J. Troubling trends in machine learning scholarship. arXiv:1807.03341. 2018.
78. Trabelsi A, Chaabane M, Ben-Hur A. Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics 2019;35:i269–77.
79. Blin K, Dieterich C, Wurmus R, et al. DoRiNA 2.0—upgrading the doRiNA database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res 2015;43:D160–7.
80. iCount. http://icount.biolab.si/.
81. Stražar M, Žitnik M, Zupan B, et al. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics 2016;32:1527–35.
82. Cawley GC, Talbot NL. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 2010;11:2079–107.
83. Hong Z, Zeng X, Wei L, et al. Identifying enhancer–promoter interactions with neural network based on pretrained DNA vectors and attention mechanism. Bioinformatics 2020;36:1037–43.
84. Shen Z, Zhang Q, Kyungsook H, et al. A deep learning model for RNA-protein binding preference prediction based on hierarchical LSTM and attention network. IEEE/ACM Trans Comput Biol Bioinform 2020.
85. Shen Z, Deng S-P, Huang D-S. Capsule network for predicting RNA-protein binding preferences using hybrid feature. IEEE/ACM Trans Comput Biol Bioinform 2019.
86. Shen Z, Deng S-P, Huang D-S. RNA-protein binding sites prediction via multi scale convolutional gated recurrent unit networks. IEEE/ACM Trans Comput Biol Bioinform 2019.
87. Zhang Q, Zhu L, Bao W, et al. Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Trans Comput Biol Bioinform 2018, 2672–80.
88. Zhang Q, Shen Z, Huang D-S. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci Rep 2019;9:1–12.
89. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems. 2014, 2672–80.
90. Arjovsky M, Chintala S, Bottou L. Wasserstein GAN. arXiv:1701.07875. 2017.
91. De Cao N, Kipf T. MolGAN: an implicit generative model for small molecular graphs. arXiv:1805.11973. 2018.
92. Bojchevski A, Shchur O, Zügner D, et al. Netgan: generating graphs via random walks. arXiv:1803.00816. 2018.
93. Mikolov T, Grave E, Bojanowski P, et al. Advances in pre-training distributed word representations. arXiv:1712.09405. 2017.
94. Rong X. word2vec parameter learning explained. arXiv:1411.2738. 2014.
95. Devlin J, Chang M-W, Lee K, et al. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. 2018.
96. Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training. 2018.
97. Silver D, Hassabis D. AlphaGo: mastering the ancient game of go with machine learning. Res Blog 2016;9. https://ai.googleblog.com/2016/01/alphago-mastering-ancient-game-of-go.html.
98. Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of go without human knowledge. Nature 2017;550:354–9.
99. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform 2017;18:851–69.
100. Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems. Long Beach, CA, USA: NIPS Foundation, 2017, 4077–87.
101. Hu H-J, Wang H, Harrison R, et al. Understanding the prediction of transmembrane proteins by support vector machine using association rule mining. In: 2007 IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology. 2007, pp. 418–25.
102. Snoek J, Larochelle H. Spearmint. https://github.com/JasperSnoek/spearmint, 2012.
103. Bergstra J, Yamins D, Cox DD. Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in Science Conference. 2013, p. 20.
104. Worsley-Hunt R, Bernard V, Wasserman WW. Identification of cis-regulatory sequence variations in individual genome sequences. Genome Med 2011;3:65.
105. Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res 1985;13:3021.
106. Crooks GE, Hon G, Chandonia J-M, et al. WebLogo: a sequence logo generator. Genome Res 2004;14:1188–90.
