
Briefings in Bioinformatics, 18(5), 2017, 851–869

doi: 10.1093/bib/bbw068
Advance Access Publication Date: 25 July 2016
Paper

Deep learning in bioinformatics


Seonwoo Min, Byunghan Lee and Sungroh Yoon

Downloaded from https://academic.oup.com/bib/article/18/5/851/2562808 by guest on 02 December 2020


Corresponding author: Sungroh Yoon, Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea.
Tel.: +82-2-880-1401; Fax: +82-2-871-5974; E-mail: [email protected]

Abstract

In the era of big data, transformation of biomedical big data into valuable knowledge has been one of the most important challenges in bioinformatics. Deep learning has advanced rapidly since the early 2000s and now demonstrates state-of-the-art performance in various fields. Accordingly, application of deep learning in bioinformatics to gain insight from data has been emphasized in both academia and industry. Here, we review deep learning in bioinformatics, presenting examples of current research. To provide a useful and comprehensive perspective, we categorize research both by the bioinformatics domain (i.e. omics, biomedical imaging, biomedical signal processing) and deep learning architecture (i.e. deep neural networks, convolutional neural networks, recurrent neural networks, emergent architectures) and present brief descriptions of each study. Additionally, we discuss theoretical and practical issues of deep learning in bioinformatics and suggest future research directions. We believe that this review will provide valuable insights and serve as a starting point for researchers to apply deep learning approaches in their bioinformatics studies.

Key words: deep learning; neural network; machine learning; bioinformatics; omics; biomedical imaging; biomedical signal processing.

Introduction

In the era of 'big data,' transformation of large quantities of data into valuable knowledge has become increasingly important in various domains [1], and bioinformatics is no exception. Significant amounts of biomedical data, including omics, image and signal data, have been accumulated, and the resulting potential for applications in biological and healthcare research has caught the attention of both industry and academia. For instance, IBM developed Watson for Oncology, a platform analyzing patients' medical information and assisting clinicians with treatment options [2, 3]. In addition, Google DeepMind, having achieved great success with AlphaGo in the game of Go, recently launched DeepMind Health to develop effective healthcare technologies [4, 5].

To extract knowledge from big data in bioinformatics, machine learning has been a widely used and successful methodology. Machine learning algorithms use training data to uncover underlying patterns, build models, and make predictions based on the best-fit model. Indeed, some well-known algorithms (i.e. support vector machines, random forests, hidden Markov models, Bayesian networks, Gaussian networks) have been applied in genomics, proteomics, systems biology and numerous other domains [6].

The proper performance of conventional machine learning algorithms relies heavily on data representations called features [7]. However, features are typically designed by human engineers with extensive domain expertise, and identifying which features are more appropriate for the given task remains difficult. Deep learning, a branch of machine learning, has recently emerged based on big data, the power of parallel and distributed computing, and sophisticated algorithms. Deep learning has overcome previous limitations, and academic interest has increased rapidly since the early 2000s (Figure 1). Furthermore, deep learning is responsible for major advances in diverse fields
Seonwoo Min is an M.S./Ph.D. candidate at the Department of Electrical and Computer Engineering, Seoul National University, Korea. His research areas include high-performance bioinformatics, machine learning for biomedical big data, and deep learning.
Byunghan Lee is a Ph.D. candidate at the Department of Electrical and Computer Engineering, Seoul National University, Korea. His research areas include high-performance bioinformatics, machine learning for biomedical big data, and data mining.
Sungroh Yoon is an associate professor at the Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea. He received his Ph.D. and postdoctoral training from Stanford University, Stanford, USA. His research interests include machine learning and deep learning for bioinformatics, and high-performance bioinformatics.
Submitted: 20 March 2016; Received (in revised form): 16 June 2016
© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]


where the artificial intelligence (AI) community has struggled for many years [8]. One of the most important advancements thus far has been in image and speech recognition [9–15], although promising results have been disseminated in natural language processing [16, 17] and language translation [18, 19]. Certainly, bioinformatics can also benefit from deep learning (Figure 2): splice junctions can be discovered from DNA sequences, finger joints can be recognized from X-ray images, lapses can be detected from electroencephalography (EEG) signals, and so on.

Previous reviews have addressed machine learning in bioinformatics [6, 20] and the fundamentals of deep learning [7, 8, 21]. In addition, although recently published reviews by Leung et al. [22], Mamoshina et al. [23], and Greenspan et al. [24] discussed deep learning applications in bioinformatics research, the former two are limited to applications in genomic medicine, and the latter to medical imaging. In this article, we provide a more comprehensive review of deep learning for bioinformatics and research examples categorized by bioinformatics domain (i.e. omics, biomedical imaging, biomedical signal processing) and deep learning architecture (i.e. deep neural networks, convolutional neural networks, recurrent neural networks, emergent architectures). The goal of this article is to provide valuable insight and to serve as a starting point to facilitate the application of deep learning in bioinformatics studies. To the best of our knowledge, we are one of the first groups to review deep learning applications in bioinformatics.

Deep learning: a brief overview

Efforts to create AI systems have a long history. Figure 3 illustrates the relationships and high-level schematics of different disciplines. Early approaches attempted to explicitly program the required knowledge for given tasks; however, these faced difficulties in dealing with complex real-world problems because designing all the detail required for an AI system to accomplish satisfactory results by hand is such a demanding job [7]. Machine learning provided more viable solutions with the capability to improve through experience and data. Although machine learning can extract patterns from data, there are limitations in raw data processing, which is highly dependent on hand-designed features. To advance from hand-designed to data-driven features, representation learning, particularly deep learning, has shown great promise. Representation learning can discover effective features as well as their mappings from data for given tasks. Furthermore, deep learning can learn complex features by combining simpler features learned from data. In other words, with artificial neural networks of multiple non-linear layers, referred to as deep learning architectures, hierarchical representations of data can be discovered with increasing levels of abstraction [25].

Key elements of deep learning

The successes of deep learning are built on a foundation of significant algorithmic details and generally can be understood in two parts: construction and training of deep learning architectures. Deep learning architectures are basically artificial neural networks of multiple non-linear layers, and several types have been proposed according to input data characteristics and research objectives (Table 1). Here, we categorized deep learning architectures into four groups (i.e. deep neural networks (DNNs) [26–30], convolutional neural networks (CNNs) [31–33], recurrent neural networks (RNNs) [34–37], emergent architectures [38–41]) and explained each group in detail (Table 2). Some papers have used 'DNNs' to encompass all deep learning architectures [7, 8]; however, in this review, we use 'DNNs' to refer specifically to the multilayer perceptron (MLP) [26], stacked auto-encoder (SAE) [27, 28] and deep belief networks (DBNs) [29, 30], which use perceptrons [42], auto-encoders (AEs) [43] and restricted Boltzmann machines (RBMs) [44, 45] as the building blocks of neural networks, respectively. CNNs are architectures that have succeeded particularly in image recognition and consist of convolution layers, non-linear layers and pooling layers. RNNs are designed to utilize sequential information of input data with cyclic connections among building blocks like perceptrons, long short-term memory units (LSTMs) [36, 37] or gated recurrent units (GRUs) [19]. In addition, many other emergent deep learning architectures have been suggested, such as deep spatio-temporal neural networks (DST-NNs) [38], multi-dimensional recurrent neural networks (MD-RNNs) [39] and convolutional auto-encoders (CAEs) [40, 41].

The goal of training deep learning architectures is optimization of the weight parameters in each layer, which gradually

Figure 1. Approximate number of published deep learning articles by year. The number of articles is based on the search results on http://www.scopus.com with the two queries: 'Deep learning,' 'Deep learning' AND 'bio*'.
Figure 2. Application of deep learning in bioinformatics research. (A) Overview diagram with input data and research objectives. (B) A research example in the omics
domain. Prediction of splice junctions in DNA sequence data with a deep neural network [94]. (C) A research example in biomedical imaging. Finger joint detection
from X-ray images with a convolutional neural network [145]. (D) A research example in biomedical signal processing. Lapse detection from EEG signal with a recurrent
neural network [178].

Figure 3. Relationships and high-level schematics of artificial intelligence, machine learning, representation learning, and deep learning [7].

combines simpler features into complex features so that the most suitable hierarchical representations can be learned from data. A single cycle of the optimization process is organized as follows [8]. First, given a training dataset, the forward pass sequentially computes the output in each layer and propagates the function signals forward through the network. In the final output layer, an objective loss function measures the error between the inferred outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all weights throughout the neural network [46]. Finally, the weight parameters are updated using optimization algorithms based on stochastic gradient descent (SGD) [47]. Whereas batch gradient descent performs parameter updates for each

complete dataset, SGD provides stochastic approximations by performing the updates for each small set of data examples. Several optimization algorithms stem from SGD. For example, Adagrad [48] and Adam [49] perform SGD while adaptively modifying learning rates based on update frequency and moments of the gradients for each parameter, respectively.

Another core element in the training of deep learning architectures is regularization, which refers to strategies intended to avoid overfitting and thus achieve good generalization performance. For example, weight decay [50], a well-known conventional approach, adds a penalty term to the objective loss function so that weight parameters converge to smaller absolute values. Currently, the most widely used regularization approach is dropout [51]. Dropout randomly removes hidden units from neural networks during training and can be considered an ensemble of possible subnetworks [52]. To enhance the capabilities of dropout, a new activation function, maxout [53], and a variant of dropout for RNNs called rnnDrop [54], have been proposed. Furthermore, recently proposed batch normalization [55] provides a new regularization method through normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters.

Deep learning libraries

To actually implement deep learning algorithms, a great deal of attention to algorithmic details is required. Fortunately, many open source deep learning libraries are available online (Table 3). There are still no clear front-runners, and each library has its own strengths [56]. According to benchmark test results of CNNs, specifically the AlexNet [33] implementation in Bahrampour et al. [57], Python-based Neon [58] shows a great advantage in processing speed. C++-based Caffe [59] and Lua-based Torch [60] offer great advantages in terms of pre-trained models and functional extensionality, respectively. Python-based Theano [61, 62] provides a low-level library to define and optimize mathematical expressions; moreover, numerous higher-level wrappers such as Keras [63], Lasagne [64] and Blocks [65] have been developed on top of Theano to provide more intuitive interfaces. Google recently released the C++-based TensorFlow [66] with a Python interface. This library currently shows limited performance but is undergoing continuous improvement, as heterogeneous distributed computing is now supported. In addition, TensorFlow can also take advantage of Keras, which provides an additional model-level interface.

Table 1. Abbreviations in alphabetical order

AE: Auto-encoder
AI: Artificial intelligence
AUC: Area under the receiver operating characteristic curve
AUC-PR: Area under the precision–recall curve
BRNN: Bidirectional recurrent neural network
CAE: Convolutional auto-encoder
CNN: Convolutional neural network
DBN: Deep belief network
DNN: Deep neural network
DST-NN: Deep spatio-temporal neural network
ECG: Electrocardiography
ECoG: Electrocorticography
EEG: Electroencephalography
EMG: Electromyography
EOG: Electrooculography
GRU: Gated recurrent unit
LSTM: Long short-term memory
MD-RNN: Multi-dimensional recurrent neural network
MLP: Multilayer perceptron
MRI: Magnetic resonance image
PCA: Principal component analysis
PET: Positron emission tomography
PSSM: Position specific scoring matrix
RBM: Restricted Boltzmann machine
ReLU: Rectified linear unit
RNN: Recurrent neural network
SAE: Stacked auto-encoder
SGD: Stochastic gradient descent

Deep neural networks

The basic structure of DNNs consists of an input layer, multiple hidden layers and an output layer (Figure 4). Once input data are given to the DNNs, output values are computed sequentially along the layers of the network. At each layer, the input vector comprising the output values of each unit in the layer below is multiplied by the weight vector for each unit in the current layer to produce the weighted sum. Then, a non-linear function, such as a sigmoid, hyperbolic tangent or rectified linear unit (ReLU) [67], is applied to the weighted sum to compute the output values of the layer. The computation in each layer transforms the representations in the layer below into slightly more abstract representations [8]. Based on the types of layers used in DNNs and the corresponding learning method, DNNs can be classified as MLP, SAE or DBN.

MLP has a similar structure to the usual neural networks but includes more stacked layers. It is trained in a purely supervised manner that uses only labeled data. Since the training method is a process of optimization in high-dimensional parameter space, MLP is typically used when a large number of labeled data are available [25].

SAE and DBN use AEs and RBMs as building blocks of the architectures, respectively. The main difference between these and MLP is that training is executed in two phases: unsupervised pre-training and supervised fine-tuning. First, in unsupervised pre-training (Figure 5), the layers are stacked sequentially and trained in a layer-wise manner as an AE or RBM using unlabeled data. Afterwards, in supervised fine-tuning, an output classifier layer is stacked, and the whole neural network is optimized by retraining with labeled data. Since both SAE and DBN exploit unlabeled data and can help avoid overfitting, researchers are able to obtain fairly regularized results, even when labeled data are insufficient, as is common in the real world [68].

DNNs are renowned for their suitability in analyzing high-dimensional data. Given that bioinformatics data are typically complex and high-dimensional, DNNs hold great promise for bioinformatics research. We believe that DNNs, as hierarchical representation learning methods, can discover previously unknown highly abstract patterns and correlations to provide insight to better understand the nature of the data. However, the capabilities of DNNs have not yet been fully exploited. Although the key characteristic of DNNs is that hierarchical features are learned solely from data, human-designed features have often been given as inputs instead of raw data forms. We expect that the future progress of DNNs in bioinformatics will come from investigations into proper ways to encode raw data and learn suitable features from them.
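A single training cycle as described above (forward pass, objective loss, backward pass via the chain rule, gradient-based update) can be sketched for a one-hidden-layer network. This is an illustrative NumPy sketch only, not code from any reviewed study: the layer sizes, learning rate, toy data and the combination of weight decay with inverted dropout are our own choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data: 64 examples, 20 features.
X = rng.normal(size=(64, 20))
y = (X[:, 0] > 0).astype(float).reshape(-1, 1)

# One hidden layer of 16 sigmoid units; all sizes are arbitrary.
W1 = rng.normal(scale=0.1, size=(20, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1));  b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, decay, keep = 0.5, 1e-4, 0.8  # step size, weight decay, dropout keep-prob
losses = []
for step in range(500):
    # Forward pass: weighted sums followed by non-linearities.
    h = sigmoid(X @ W1 + b1)
    mask = (rng.random(h.shape) < keep) / keep     # inverted dropout [51]
    h_drop = h * mask
    p = sigmoid(h_drop @ W2 + b2)

    # Objective loss (cross-entropy) between outputs and labels.
    losses.append(-np.mean(y * np.log(p + 1e-12)
                           + (1 - y) * np.log(1 - p + 1e-12)))

    # Backward pass: the chain rule backpropagates error signals.
    dz2 = (p - y) / len(X)
    dW2 = h_drop.T @ dz2 + decay * W2              # weight-decay term [50]
    db2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * mask * h * (1 - h)
    dW1 = X.T @ dz1 + decay * W1
    db1 = dz1.sum(axis=0)

    # Gradient-descent update; SGD would cycle over mini-batches instead
    # of the full toy batch used here for brevity.
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad
```

At prediction time dropout is disabled; since the inverted mask already rescales activations during training, evaluation simply uses sigmoid(X @ W1 + b1) directly.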

Table 2. Categorization of deep learning applied research in bioinformatics

Deep neural networks
  Omics: Protein structure [84–87]; Gene expression regulation [93–98]; Protein classification [108]; Anomaly classification [111]
  Biomedical imaging: Anomaly classification [122–124]; Segmentation [133]; Recognition [142, 143]; Brain decoding [149, 150]
  Biomedical signal processing: Brain decoding [158–163]; Anomaly classification [171–175]

Convolutional neural networks
  Omics: Gene expression regulation [99–104]
  Biomedical imaging: Anomaly classification [125–132]; Segmentation [134–140]; Recognition [144–147]
  Biomedical signal processing: Brain decoding [164–167]; Anomaly classification [176]

Recurrent neural networks
  Omics: Protein structure [88–90]; Gene expression regulation [105–107]; Protein classification [109, 110]
  Biomedical signal processing: Brain decoding [168]; Anomaly classification [177, 178]

Emergent architectures
  Omics: Protein structure [91, 92]
  Biomedical imaging: Segmentation [141]
  Biomedical signal processing: Brain decoding [169, 170]
Table 3. Comparison of deep learning libraries

Library | Core | Speed for batch* (ms) | Multi-GPU | Distributed | Strengths [56, 57]
Caffe | C++ | 651.6 | O | X | Pre-trained models supported
Neon | Python | 386.8 | O | X | Speed
TensorFlow | C++ | 962.0 | O | O | Heterogeneous distributed computing
Theano | Python | 733.5 | X | X | Ease of use with higher-level wrappers
Torch | Lua | 506.6 | O | X | Functional extensionality

Notes: Speed for batch* is based on the averaged processing times for AlexNet [33] with a batch size of 256 on a single GPU [57]; Caffe, Neon, Theano and Torch were utilized with cuDNN v.3, while TensorFlow was utilized with cuDNN v.2.

Convolutional neural networks


CNNs are designed to process multiple data types, especially
two-dimensional images, and are directly inspired by the visual
cortex of the brain. In the visual cortex, there is a hierarchy of
two basic cell types: simple cells and complex cells [69]. Simple
cells react to primitive patterns in sub-regions of visual stimuli,
and complex cells synthesize the information from simple cells
to identify more intricate forms. Since the visual cortex is such
a powerful and natural visual processing system, CNNs are
applied to imitate three key ideas: local connectivity, invariance
to location and invariance to local transition [8].
Figure 4. Basic structure of DNNs with input units x, three hidden units h1, h2 and h3 in each layer, and output units y [26]. At each layer, the weighted sum and a non-linear function of its inputs are computed so that hierarchical representations can be obtained.

The basic structure of CNNs consists of convolution layers, non-linear layers and pooling layers (Figure 6). To use highly correlated sub-regions of data, groups of local weighted sums, called feature maps, are obtained at each convolution layer by computing convolutions between local patches and weight vectors called filters. Furthermore, since identical patterns can appear regardless of the location in the data, filters are applied repeatedly across the entire dataset, which also improves training efficiency by reducing the number of parameters to learn. Then non-linear layers increase the non-linear properties of feature maps. At each pooling layer, maximum or average subsampling of non-overlapping regions in feature maps is performed. This non-overlapping subsampling enables CNNs to handle somewhat different but semantically similar features and thus aggregate local features to identify more complex features.

Currently, CNNs are one of the most successful deep learning architectures owing to their outstanding capacity to analyze spatial information. Thanks to their developments in the field of object recognition, we believe the primary research achievements in bioinformatics will come from the biomedical imaging domain. Despite the different data characteristics between normal and biomedical imaging, CNNs will nonetheless offer straightforward applications compared to other domains. Indeed, CNNs also have great potential in omics and biomedical
Figure 5. Unsupervised layer-wise pre-training process in SAE and DBN [29]. First, weight vector W1 is trained between input units x and hidden units h1 in the first hid-
den layer as an RBM or AE. After the W1 is trained, another hidden layer is stacked, and the obtained representations in h1 are used to train W2 between hidden units
h1 and h2 as another RBM or AE. The process is repeated for the desired number of layers.
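The layer-wise pre-training procedure in the Figure 5 caption (train W1 on the inputs, stack a layer, train W2 on the obtained representations, repeat) can be sketched with tied-weight auto-encoders. This is a hypothetical NumPy illustration: the `pretrain_layer` helper, layer sizes, learning rate and toy data are our own, it uses a plain AE rather than an RBM, and it omits the supervised fine-tuning phase.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(data, n_hidden, lr=0.1, steps=200):
    """Train one tied-weight auto-encoder layer by minimizing
    squared reconstruction error with gradient descent."""
    n_in = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_in, n_hidden))
    for _ in range(steps):
        h = sigmoid(data @ W)          # encoder
        recon = h @ W.T                # tied-weight linear decoder
        err = recon - data             # reconstruction error
        dh = (err @ W) * h * (1 - h)   # backprop through the encoder
        dW = (data.T @ dh + err.T @ h) / len(data)
        W -= lr * dW
    return W, sigmoid(data @ W)        # weights and learned representations

X = rng.normal(size=(100, 30))         # unlabeled toy data
W1, h1 = pretrain_layer(X, 20)         # first hidden layer, trained on x
W2, h2 = pretrain_layer(h1, 10)        # second layer, trained on h1
# Supervised fine-tuning would now stack an output classifier layer on h2
# and retrain the whole network with labeled data.
```

The design point this illustrates is that each layer is trained greedily on the representations produced by the layer below, using only unlabeled data.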

Figure 6. Basic structure of CNNs consisting of a convolution layer, a non-linear layer and a pooling layer [32]. The convolution layer of CNNs uses multiple learned filters to obtain multiple feature maps detecting low-level features, and then the pooling layer combines them into higher-level features.
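The convolution and pooling operations just described can be sketched on a one-channel two-dimensional input. This is a minimal NumPy illustration with arbitrary sizes and a random (untrained) filter, not an implementation from any reviewed study.

```python
import numpy as np

rng = np.random.default_rng(2)

def conv2d(image, filt):
    """Valid 2-D convolution: slide the filter over local patches
    and take weighted sums to build one feature map."""
    H, W = image.shape
    fh, fw = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping maximum subsampling of a feature map."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    return fmap[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

image = rng.normal(size=(8, 8))            # toy one-channel input
filt = rng.normal(size=(3, 3))             # one filter (random here; learned in practice)
fmap = np.maximum(conv2d(image, filt), 0)  # convolution followed by a ReLU non-linearity
pooled = max_pool(fmap)                    # 6x6 feature map subsampled to 3x3
```

Because the same filter is reused at every position, the number of parameters is independent of the input size, which is the training-efficiency point made above.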

signal processing. The three key ideas of CNNs can be applied not only in a one-dimensional grid to discover meaningful recurring patterns with small variance, such as genomic sequence motifs, but also in two-dimensional grids, such as interactions within omics data and in time–frequency matrices of biomedical signals. Thus, we believe that the popularity and promise of CNNs in bioinformatics applications will continue in the years ahead.

Recurrent neural networks

RNNs, which are designed to utilize sequential information, have a basic structure with a cyclic connection (Figure 7). Since input data are processed sequentially, recurrent computation is performed in the hidden units where the cyclic connection exists. Therefore, past information is implicitly stored in the hidden units, called state vectors, and output for the current input is computed considering all previous inputs using these state vectors [8]. Since there are many cases where both past and future inputs affect output for the current input (e.g. in speech recognition), bidirectional recurrent neural networks (BRNNs) [70] have also been designed and used widely (Figure 8).

Although RNNs do not seem to be as deep as DNNs or CNNs in terms of the number of layers, they can be regarded as an even deeper structure if unrolled in time (Figure 7). Therefore, for a long time, researchers struggled against vanishing gradient problems while training RNNs, and learning long-term dependencies among data was difficult [35]. Fortunately, substituting the simple perceptron hidden units with more complex units such as LSTMs [36, 37] or GRUs [19], which function as memory cells, significantly helps to prevent the problem. More recently, RNNs have been used successfully in many areas including natural language processing [16, 17] and language translation [18, 19].

Even though RNNs have been explored less than DNNs and CNNs, they still provide very powerful analysis methods for sequential information. Since omics data and biomedical signals are typically sequential and often considered languages of nature, the capabilities of RNNs for mapping a variable-length input sequence to another sequence or a fixed-size prediction are promising for bioinformatics research. With regard to biomedical imaging, RNNs are currently not the first choice of many researchers. Nevertheless, we believe that dissemination of dynamic CT and MRI [71, 72] would lead to the incorporation of RNNs and CNNs and elevate their importance in the long term. Furthermore, we expect that their successes in natural language processing will lead RNNs to be applied in biomedical text analysis [73] and that employing an attention mechanism [74–77] will improve performance and extract more relevant information from bioinformatics data.

Emergent architectures

Emergent architectures refer to deep learning architectures besides DNNs, CNNs and RNNs. In this review, we introduce three emergent architectures (i.e. DST-NNs, MD-RNNs and CAEs) and their applications in bioinformatics.

DST-NNs [38] are designed to learn multi-dimensional output targets through progressive refinement. The basic structure of DST-NNs consists of multi-dimensional hidden layers (Figure 9). The key aspect of the structure, progressive refinement, considers local correlations and is performed via input feature

Figure 7. Basic structure of RNNs with an input unit x, a hidden unit h and an output unit y [8]. A cyclic connection exists so that the computation in the hidden unit receives inputs from the hidden unit at the previous time step and from the input unit at the current time step. The recurrent computation can be expressed more explicitly if the RNNs are unrolled in time. The index of each symbol represents the time step. In this way, ht receives input from xt and h(t−1) and then propagates the computed results to yt and h(t+1).

Figure 9. Basic structure of DST-NNs [38]. The notation h^k_(i,j) represents the hidden unit at the (i, j) coordinate of the kth hidden layer. To conduct the progressive refinement, the neighborhood units of h^k_(i,j) and the input units x are used in the computation of h^(k+1)_(i,j).
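The recurrence described for Figure 7 (ht computed from xt and the previous state, then propagated to yt and the next step) can be sketched as follows. This is an illustrative NumPy sketch with arbitrary sizes and random untrained weights, using simple tanh units rather than LSTMs or GRUs.

```python
import numpy as np

rng = np.random.default_rng(3)

n_in, n_hidden, n_out, T = 4, 8, 2, 5
Wxh = rng.normal(scale=0.1, size=(n_in, n_hidden))      # input-to-hidden weights
Whh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # the cyclic connection
Why = rng.normal(scale=0.1, size=(n_hidden, n_out))     # hidden-to-output weights

xs = rng.normal(size=(T, n_in))   # an input sequence x1..xT
h = np.zeros(n_hidden)            # the state vector, initially empty
ys = []
for t in range(T):
    # h_t depends on the current input x_t and the previous state h_(t-1),
    # so past information is implicitly stored in the state vector.
    h = np.tanh(xs[t] @ Wxh + h @ Whh)
    ys.append(h @ Why)            # output y_t for the current input
ys = np.array(ys)
```

Unrolling this loop over t makes the depth-in-time structure explicit: the same three weight matrices are reused at every time step.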

compositions in each layer: spatial features and temporal features. Spatial features refer to the original inputs for the whole DST-NN and are used identically in every layer. However, temporal features are gradually altered so as to progress to the upper layers. Except for the first layer, to compute each hidden unit in the current layer, only the adjacent hidden units of the same coordinate in the layer below are used, so that local correlations are reflected progressively.

MD-RNNs [39] are designed to apply the capabilities of RNNs to non-sequential multi-dimensional data by treating them as groups of sequential data. For instance, two-dimensional data are treated as groups of horizontal and vertical sequence data. Similar to BRNNs, which use contexts in both directions in one-dimensional data, MD-RNNs use contexts in all possible directions in the multi-dimensional data (Figure 10). In the example of a two-dimensional dataset, four contexts that vary with the order of data processing are reflected in the computation of four hidden units for each position in the hidden layer. The hidden units are connected to a single output layer, and the final results are computed with consideration of all possible contexts.

CAEs [40, 41] are designed to utilize the advantages of both AEs and CNNs so that they can learn good hierarchical representations of data reflecting spatial information and be well regularized by unsupervised training (Figure 11). In the training of AEs, reconstruction error is minimized using an encoder and a decoder, which extract feature vectors from input data and recreate the data from the feature vectors, respectively. In CNNs, convolution and pooling layers can be regarded as a type of encoder. Therefore, the CNN encoder and a decoder consisting of deconvolution and unpooling layers are integrated to form a CAE, which is trained in the same manner as an AE.

Deep learning is a rapidly growing research area, and a plethora of new deep learning architectures is being proposed but awaits wide application in bioinformatics. Newly proposed architectures have different advantages from existing architectures, so we expect them to produce promising results in various research areas. For example, the progressive refinement of DST-NNs fits the dynamic folding process of proteins and can be effectively utilized in protein structure prediction [38]; the capabilities of MD-RNNs are suitable for segmentation of biomedical images, since segmentation requires interpretation of local and global contexts; the unsupervised representation learning with consideration of spatial information in CAEs can provide great advantages in discovering recurring patterns in limited and imbalanced bioinformatics data.

Figure 8. Basic structure of BRNNs unrolled in time [70]. There are two hidden units, h→t and h←t, for each time step. h→t receives input from xt and h→(t−1) to reflect past information; h←t receives input from xt and h←(t+1) to reflect future information. The information from both hidden units is propagated to yt.

Omics

In omics research, genetic information such as genome, transcriptome and proteome data is used to approach problems in bioinformatics. Some of the most common input data in omics are raw biological sequences (i.e. DNA, RNA, amino acid sequences), which have become relatively affordable and easy to obtain with next-generation sequencing technology. In addition, extracted features from sequences, such as position specific scoring matrices (PSSMs) [78], physicochemical properties [79, 80], Atchley factors [81] and one-dimensional structural properties [82, 83], are often used as inputs for deep learning algorithms to alleviate difficulties from complex biological data and improve results. In addition, protein contact maps, which present distances of amino acid pairs in their three-dimensional structure, and microarray gene expression data are also used according to the characteristics of interest. We categorized the topics of interest in omics into four groups (Table 4). One of the most researched problems is protein structure prediction, which aims to predict the secondary structure or contact map of a protein [84–92]. Gene expression regulation [93–107], including
858 | Min et al.



Figure 10. Basic structure of MD-RNNs for two-dimensional data [39]. There are four groups of two-dimensional hidden units, each reflecting different contexts. For ex-
ample, the (i, j) hidden unit in context 1 receives input from the (i–1, j) and (i, j–1) hidden units in context 1 and the (i, j) unit from the input layer so that the upper-left
information is reflected. The hidden units from all four contexts are propagated to compute the (i, j) unit in the output layer.
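The scanning scheme in the Figure 10 caption can be sketched directly: four hidden grids are computed, each swept from a different corner, and every output position combines the (i, j) units of all four contexts. The linear update and the 0.5 neighbour weight below are illustrative assumptions, not the parameters of an actual MD-RNN.

```python
# Toy sketch of the MD-RNN idea for two-dimensional data (Figure 10):
# one hidden grid per scan direction, combined at the output layer.

def mdrnn_contexts(x):
    """x: 2D list of floats; returns one hidden grid per scan direction."""
    h, w = len(x), len(x[0])
    dirs = [(1, 1), (1, -1), (-1, 1), (-1, -1)]  # (row step, col step)
    contexts = []
    for dr, dc in dirs:
        hid = [[0.0] * w for _ in range(h)]
        rows = range(h) if dr == 1 else range(h - 1, -1, -1)
        cols = range(w) if dc == 1 else range(w - 1, -1, -1)
        for i in rows:
            for j in cols:
                # neighbours already processed in this scan order
                prev_i = hid[i - dr][j] if 0 <= i - dr < h else 0.0
                prev_j = hid[i][j - dc] if 0 <= j - dc < w else 0.0
                hid[i][j] = x[i][j] + 0.5 * (prev_i + prev_j)
        contexts.append(hid)
    return contexts

def mdrnn_output(x):
    """Each output unit sees the (i, j) hidden unit of all four contexts."""
    ctxs = mdrnn_contexts(x)
    return [[sum(c[i][j] for c in ctxs) for j in range(len(x[0]))]
            for i in range(len(x))]
```

With this combination step, every output cell depends on the whole input grid, which is the point of using multiple scan directions.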

Figure 11. Basic structure of CAEs consisting of a convolution layer and a pooling layer working as an encoder and a deconvolution layer and an unpooling layer work-
ing as a decoder [41]. The basic idea is similar to the AE, which learns hierarchical representations through reconstructing its input data, but CAE additionally utilizes
spatial information by integrating convolutions.
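The encoder/decoder round trip described in the Figure 11 caption can be reduced to a toy form, assuming fixed 2x2 average pooling as the "encoder" and nearest-neighbour unpooling as the "decoder"; a real CAE learns convolution and deconvolution weights instead.

```python
# Toy CAE-style round trip: pool (encode), unpool (decode), and measure
# reconstruction error, the quantity minimized during AE training.

def avg_pool2x2(x):
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def unpool2x2(z):
    out = []
    for row in z:
        widened = [v for v in row for _ in (0, 1)]  # repeat each column
        out.append(widened)
        out.append(list(widened))                   # repeat each row
    return out

def reconstruction_error(x):
    x_hat = unpool2x2(avg_pool2x2(x))
    return sum((a - b) ** 2
               for ra, rb in zip(x, x_hat) for a, b in zip(ra, rb))

x = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 2.0],
     [0.0, 0.0, 2.0, 2.0]]
# Blocks are constant, so pooling loses nothing and the error is zero.
err = reconstruction_error(x)
```

Inputs whose 2x2 blocks are not constant incur a positive error, which is what a trained encoder/decoder pair would learn to reduce.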

[108–110], including super family or subcellular localization, are also actively investigated. Furthermore, anomaly classification [111] approaches have been used with omics data to detect cancer.

Deep neural networks

DNNs have been widely applied in protein structure prediction [84–87] research. Since complete prediction in three-dimensional space is complex and challenging, several studies have used simpler approaches, such as predicting the secondary structure or torsion angles of a protein. For instance, Heffernan et al. [85] applied SAE to protein amino acid sequences to solve prediction problems for secondary structure, torsion angle and accessible surface area. In another study, Spencer et al. [86] applied DBN to amino acid sequences along with PSSM and Atchley factors to predict protein secondary structure. DNNs have also shown great capabilities in the area of gene expression regulation [93–98]. For example, Lee et al. [94] utilized DBN in splice junction prediction, a major research avenue in understanding gene expression [112], and proposed a new DBN training method called boosted contrastive divergence for imbalanced data and a new regularization term for sparsity of DNA sequences; their work showed not only significantly improved performance but also the ability to detect subtle non-canonical splicing signals. Moreover, Chen et al. [96] applied MLP to both microarray and RNA-seq expression data to infer expression of up to 21 000 target genes from only 1000 landmark genes. In terms of protein classification, Asgari et al. [108] adopted the skip-gram model, a widely known method in natural language processing that can be considered a variant of MLP, and showed that it could effectively learn a distributed representation of biological sequences with general use for many omics applications, including protein family classification. For anomaly classification, Fakoor et al. [111] used

Table 4. Deep learning applied bioinformatics research avenues and input data

Omics
  Input data:
    Sequencing data (DNA-seq, RNA-seq, ChIP-seq, DNase-seq)
    Features from genomic sequence: position specific scoring matrix (PSSM); physicochemical properties (steric parameter, volume); Atchley factors (FAC); 1-dimensional structural properties
    Contact map (distance of amino acid pairs in 3D structure)
    Microarray gene expression
  Research avenues:
    Protein structure prediction [84–92]: 1-dimensional structural properties; contact map; structure model quality assessment
    Gene expression regulation [93–107]: splice junction; genetic variants affecting splicing; sequence specificity
    Protein classification [108–110]: super family; subcellular localization
    Anomaly classification [111]: cancer

Biomedical imaging
  Input data:
    Magnetic resonance image (MRI); radiographic image; positron emission tomography (PET); histopathology image; volumetric electron microscopy image; retinal image; in situ hybridization (ISH) image
  Research avenues:
    Anomaly classification [122–132]: gene expression pattern; cancer; Alzheimer's disease; schizophrenia
    Segmentation [133–141]: cell structure; neuronal structure; vessel map; brain tumor
    Recognition [142–147]: cell nuclei; finger joint; anatomical structure
    Brain decoding [149–150]: behavior

Biomedical signal processing
  Input data:
    ECoG, ECG, EMG, EOG; EEG (raw, wavelet, frequency, differential entropy); extracted features from EEG (normalized decay; peak variation)
  Research avenues:
    Brain decoding [158–170]: behavior; emotion
    Anomaly classification [171–178]: Alzheimer's disease; seizure; sleep stage

principal component analysis (PCA) [113] to reduce the dimensionality of microarray gene expression data and applied SAE to classify various cancers, including acute myeloid leukemia, breast cancer and ovarian cancer.

Convolutional neural networks

Relatively few studies have used CNNs to solve problems involving biological sequences, specifically gene expression regulation problems [99–104]; nevertheless, those have introduced the strong advantages of CNNs, showing their great promise for future research. First, an initial convolution layer can powerfully capture local sequence patterns and can be considered a motif detector for which PSSMs are solely learned from data instead of hard-coded. The depth of CNNs enables learning more complex patterns and can capture longer motifs, integrate cumulative effects of observed motifs, and eventually learn sophisticated regulatory codes [114]. Moreover, CNNs are suited to exploit the benefits of multitask joint learning. By training CNNs to simultaneously predict closely related factors, features with predictive strengths are more efficiently learned and shared across different tasks.

For example, as an early approach, Denas et al. [99] preprocessed ChIP-seq data into a two-dimensional matrix with the rows as transcription factor activity profiles for each gene and exploited a two-dimensional CNN similar to its use in image processing. Recently, more studies have focused on directly using one-dimensional CNNs with biological sequence data. Alipanahi et al. [100] and Kelley et al. [103] proposed CNN-based approaches for transcription factor binding site prediction and 164 cell-specific DNA accessibility multitask prediction, respectively; both groups presented downstream applications for disease-associated genetic variant identification. Furthermore, Zeng et al. [102] performed a systematic exploration of CNN architectures for transcription factor-binding site prediction and showed that the number of convolutional filters is more important than the number of layers for motif-based tasks. Zhou et al. [104] developed a CNN-based algorithmic framework, DeepSEA, that performs multitask joint learning of chromatin factors (i.e. transcription factor binding, DNase I sensitivity, histone-mark profile) and prioritizes expression quantitative trait loci and disease-associated genetic variants based on the predictions.

Recurrent neural networks

RNNs are expected to be an appropriate deep learning architecture because biological sequences have variable lengths, and their sequential information has great importance. Several studies have applied RNNs to protein structure prediction [88–90], gene expression regulation [105–107] and protein classification [109, 110]. In early studies, Baldi et al. [88] used BRNNs with perceptron hidden units in protein secondary structure prediction. Thereafter, the improved performance of LSTM hidden units became widely recognized, so Sønderby et al. [110] applied BRNNs with LSTM hidden units and a one-dimensional convolution layer to learn representations from amino acid sequences and classify the subcellular locations of proteins. Furthermore, Park et al. [105] and Lee et al. [107] exploited RNNs with LSTM hidden units in microRNA identification and target prediction and obtained significantly improved accuracy relative to state-of-the-art approaches, demonstrating the high capacity of RNNs to analyze biological sequences.

Emergent architectures

Emergent architectures have been used in protein structure prediction research [91, 92], specifically in contact map prediction. Di Lena et al. [91] applied DST-NNs using spatial features including protein secondary structure, orientation probability, and alignment probability. Additionally, Baldi et al. [92] applied MD-RNNs to amino acid sequences, correlated profiles, and protein secondary structures.

Biomedical imaging

Biomedical imaging [115] is another actively researched domain with wide application of deep learning in general image-related tasks. Most biomedical images used for clinical treatment of patients—magnetic resonance imaging (MRI) [116, 117], radiographic imaging [118, 119], positron emission tomography (PET) [120] and histopathology imaging [121]—have been used as input data for deep learning algorithms. We categorized the research avenues in biomedical imaging into four groups (Table 4). One of the most researched problems is anomaly classification [122–132] to diagnose diseases such as cancer or schizophrenia. As in general image-related tasks, segmentation [133–141] (i.e. partitioning specific structures such as cellular structures or a brain tumor) and recognition [142–147] (i.e. detection of cell nuclei or a finger joint) are studied frequently in biomedical imaging. Studies of popular high content screening [148], which involves quantifying microscopic images for cell biology, are covered in the former groups [128, 134, 137]. Additionally, cranial MRIs have been used in brain decoding [149, 150] to interpret human behavior or emotion.

Deep neural networks

In terms of biomedical imaging, DNNs have been applied in several research areas, including anomaly classification [122–124], segmentation [133], recognition [142, 143] and brain decoding [149, 150]. Plis et al. [122] classified schizophrenia patients from brain MRIs using DBN, and Xu et al. [142] used SAE to detect cell nuclei from histopathology images. Interestingly, similar to handwritten digit image recognition, Van Gerven et al. [149] classified handwritten digit images with DBN not by analyzing the images themselves but by indirectly analyzing functional MRIs of participants who were looking at the digit images.

Convolutional neural networks

The largest number of studies has been conducted in biomedical imaging, since these avenues are similar to general image-related tasks. In anomaly classification [125–132], Roth et al. [125] applied CNNs to three different CT image datasets to classify sclerotic metastases, lymph nodes and colonic polyps. Additionally, Ciresan et al. [128] used CNNs to detect mitosis in breast cancer histopathology images, a crucial approach for cancer diagnosis and assessment. PET images of esophageal cancer were used by Ypsilantis et al. [129] to predict responses to neoadjuvant chemotherapy. Other applications of CNNs can be found in segmentation [134–140] and recognition [144–147]. For example, Ning et al. [134] studied pixel-wise segmentation patterns of the cell wall, cytoplasm, nuclear membrane, nucleus and outside media using microscopic images, and Havaei et al. [139] proposed a cascaded CNN architecture exploiting both local and global contextual features and performed brain tumor segmentation from MRIs. For recognition, Cho et al. [144] researched anatomical structure recognition among CT images, and Lee et al. [145] proposed a CNN-based finger joint detection system, FingerNet, which is a crucial step for medical examinations of bone age, growth disorders and rheumatoid arthritis [151].

Recurrent neural networks

Traditionally, images are considered data that involve internal correlations or spatial information rather than sequential information. Treating biomedical images as non-sequential data, most studies in biomedical imaging have chosen approaches involving DNNs or CNNs instead of RNNs.

Emergent architectures

Attempts to apply the unique capabilities of RNNs to image data using augmented RNN structures have continued. MD-RNNs [39] have been applied beyond two-dimensional images to three-dimensional images. For example, Stollenga et al. [141] applied MD-RNNs to three-dimensional electron microscopy images and MRIs to segment neuronal structures.

Biomedical signal processing

Biomedical signal processing [115] is a domain where researchers use recorded electrical activity from the human body to solve problems in bioinformatics. Various data from EEG [152], electrocorticography (ECoG) [153], electrocardiography (ECG) [154], electromyography (EMG) [155] and electrooculography (EOG) [156, 157] have been used, with most studies focusing on EEG activity so far. Because recorded signals are usually noisy and include many artifacts, raw signals are often decomposed into wavelet or frequency components before they are used as input in deep learning algorithms. In addition, human-designed features like normalized decay and peak variation are used in some studies to improve the results. We categorized the research avenues in biomedical signal processing into two groups (Table 4): brain decoding [158–170] using EEG signals and anomaly classification [171–178] to diagnose diseases.

Deep neural networks

Since biomedical signals usually contain noise and artifacts, decomposed features are more frequently used than raw signals. In brain decoding [158–163], An et al. [159] applied DBN to

the frequency components of EEG signals to classify left- and right-hand motor imagery skills. Moreover, Jia et al. [161] and Jirayucharoensak et al. [163] used DBN and SAE, respectively, for emotion classification. In anomaly classification [171–175], Huanhuan et al. [171] published one of the few studies applying DBN to ECG signals and classified each beat as either a normal or an abnormal beat. A few studies have used raw EEG signals. Wulsin et al. [172] analyzed individual second-long waveform abnormalities using DBN with both raw EEG signals and extracted features as inputs, whereas Zhao et al. [174] used only raw EEG signals as inputs for DBN to diagnose Alzheimer's disease.

Convolutional neural networks

Raw EEG signals have been analyzed in brain decoding [164–167] and anomaly classification [176] via CNNs, which perform one-dimensional convolutions. For instance, Stober et al. [165] classified the rhythm type and genre of music that participants listened to, and Cecotti et al. [167] classified characters that the participants viewed. Another approach to apply CNNs to biomedical signal processing was reported by Mirowski et al. [176], who extracted features such as phase-locking synchrony and wavelet coherence and coded them as pixel colors to formulate two-dimensional patterns. Then, ordinary two-dimensional CNNs, like those used in biomedical imaging, were used to predict seizures.

Recurrent neural networks

Since biomedical signals represent naturally sequential data, RNNs are an appropriate deep learning architecture to analyze data and are expected to produce promising results. To present some of the studies in brain decoding [168] and anomaly classification [177, 178], Petrosian et al. [177] applied perceptron RNNs to raw EEG signals and corresponding wavelet-decomposed features to predict seizures. In addition, Davidson et al. [178] used LSTM RNNs on EEG log-power spectra features to detect lapses.

Emergent architectures

CAE has been applied in a few brain decoding studies [169, 170]. Wang et al. [169] performed finger flex and extend classifications using raw ECoG signals. In addition, Stober et al. [170] classified musical rhythms that participants listened to with raw EEG signals.

Discussion

Limited and imbalanced data

Considering the necessity of optimizing a tremendous number of weight parameters in neural networks, most deep learning algorithms have assumed sufficient and balanced data. Unfortunately, however, this is usually not true for problems in bioinformatics. Complex and expensive data acquisition processes limit the size of bioinformatics datasets. In addition, such processes often show significantly unequal class distributions, where the number of instances from one class is significantly higher than that from other classes [179]. For example, in clinical or disease-related cases, there is inevitably less data from treatment groups than from the normal (control) group. The former are also rarely disclosed to the public due to privacy restrictions and ethical requirements, creating a further imbalance in available data [180].

A few assessment metrics have been used to clearly observe how limited and imbalanced data might compromise the performance of deep learning [181]. While accuracy often gives misleading results, the F-measure, the harmonic mean of precision and recall, provides more insightful performance scores. To measure performance over different class distributions, the area under the receiver operating characteristic curve (AUC) and the area under the precision–recall curve (AUC-PR) are commonly used. These two measures are strongly correlated, such that a curve dominates in one measure if and only if it dominates in the other. Nevertheless, in contrast with AUC-PR, AUC might present a more optimistic view of performance, since false positive rates in the receiver operating characteristic curve fail to capture large changes in false positives if classes are negatively skewed [182].

Solutions to limited and imbalanced data can be divided into three major groups [181, 183]: data preprocessing, cost-sensitive learning and algorithmic modification. Data preprocessing typically provides a better dataset through sampling or basic feature extraction. Sampling methods balance the distribution of imbalanced data, and several approaches have been proposed, including informed undersampling [184], the synthetic minority oversampling technique [185] and cluster-based sampling [186]. For example, Li et al. [127] and Roth et al. [146] performed enrichment analyses of CT images through spatial deformations such as random shifting and rotation. Although basic feature extraction methods deviate from the concept of deep learning, they are occasionally used to lessen the difficulties of learning from limited and imbalanced data. Research in bioinformatics using human-designed features as input data, such as PSSM from genomic sequences or wavelet energy from EEG signals, can be understood in the same context [86, 92, 172, 176].

Cost-sensitive learning methods define different costs for misclassifying data examples from individual classes to solve the limited and imbalanced data problems. Cost sensitivity can be applied in the objective loss function of neural networks either explicitly or implicitly [187]. For example, we can explicitly replace the objective loss function to reflect class imbalance or implicitly modify the learning rates according to data instance classes during training.

Algorithmic modification methods accommodate learning algorithms to increase their suitability for limited and imbalanced data. A simple and effective approach is the adoption of pre-training. Unsupervised pre-training can be a great help to learn representations for each class and to produce more regularized results [68]. In addition, transfer learning, which consists of pre-training with sufficient data from similar but different domains and fine-tuning with real data, has great advantages [24, 188]. For instance, Lee et al. [107] proposed a microRNA target prediction method, which exploits unsupervised pre-training with an RNN-based AE, and achieved a >25% increase in F-measure compared to the existing alternatives. Bar et al. [132] performed transfer learning using natural images from the ImageNet database [189] as pre-training data and fine-tuned with chest X-ray images to identify chest pathologies and to classify healthy and abnormal images. In addition to pre-training, sophisticated training methods have also been executed. Lee et al. [94] suggested DBN with boosted categorical RBM, and Havaei et al. [139] suggested CNNs with two-phase training, combining ideas of undersampling and pre-training.
able data [180]. black-box: even though it produces outstanding results, we

know very little about how such results are derived internally. In bioinformatics, particularly in biomedical domains, it is not enough to simply produce good outcomes. Since many studies are connected to patients' health, it is crucial to change the black-box into the white-box, providing logical reasoning just as clinicians do for medical treatments.

Transformation of deep learning from the black-box into the white-box is still in the early stages. One of the most widely used approaches is interpretation through visualizing a trained deep learning model. In terms of image input, a deconvolutional network has been proposed to reconstruct and visualize hierarchical representations for a specific input of CNNs [190]. In addition, to visualize a generalized class representative image rather than being dependent on a particular input, gradient ascent optimization in input space through backpropagation-to-input (cf. backpropagation-to-weights) has provided another effective methodology [191, 192]. Regarding genomic sequence input, several approaches have been proposed to infer PSSMs from a trained model and to visualize the corresponding motifs with heat maps or sequence logos. For example, Lee et al. [94] extracted motifs by choosing the most class-discriminative weight vector among those in the first layer of DBN; DeepBind [100] and DeMo [101] extracted motifs from trained CNNs by counting nucleotide frequencies of positive input subsequences with high activation values and by backpropagation-to-input for each feature map, respectively.

Specifically for transcription factor binding site prediction, Alipanahi et al. [100] developed a visualization method, a mutation map, for illustrating the effects of genetic variants on binding scores predicted by CNNs. A mutation map consists of a heat map, which shows how much each mutation alters the binding score, and the input sequence logo, where the height of each base is scaled as the maximum decrease of binding score among all possible mutations. Moreover, Kelley et al. [103] further complemented the mutation map with a line plot to show the maximum increases as well as the maximum decreases of prediction scores. In addition to interpretation through visualization, attention mechanisms [74–77] designed to focus explicitly on salient points and the mathematical rationale behind deep learning [193, 194] are being studied.

Selection of an appropriate deep learning architecture and hyperparameters

Choosing the appropriate deep learning architecture is crucial for proper applications of deep learning. To obtain robust and reliable results, awareness of the capabilities of each deep learning architecture and selection according to those capabilities, in addition to input data characteristics and research objectives, are essential. However, to date, the advantages of each architecture are only roughly understood; for example, DNNs are suitable for analysis of internal correlations in high-dimensional data, CNNs are suitable for analysis of spatial information, and RNNs are suitable for analysis of sequential information [7]. Indeed, a detailed methodology for selecting the most appropriate or 'best fit' deep learning architecture remains a challenge to be studied in the future.

Even once a deep learning architecture is selected, there are many hyperparameters—the number of layers, the number of hidden units, weight initialization values, learning iterations and even the learning rate—for researchers to set, all of which can influence the results remarkably [195]. For many years, hyperparameter tuning was rarely systematic and left up to human machine learning experts. Nevertheless, automation of machine learning research, which aims to automatically optimize hyperparameters, is growing constantly [196]. A few algorithms have been proposed, including sequential model-based global optimization [197], Bayesian optimization with Gaussian process priors [198] and random search approaches [199].

Multimodal deep learning

Multimodal deep learning [200], which exploits information from multiple input sources, is a promising avenue for the future of deep learning research. In particular, bioinformatics is expected to benefit greatly, as it is a field where various types of data can be assimilated naturally [201]. For example, not only are omics data, images, signals, drug responses and electronic medical records available as input data, but X-ray, CT, MRI and PET forms are also available from a single image.

A few bioinformatics studies have already begun to use multimodal deep learning. For example, Suk et al. [124] studied Alzheimer's disease classification using cerebrospinal fluid and brain images in the forms of MRI and PET scans, and Soleymani et al. [168] conducted an emotion detection study with both EEG signal and face image data.

Accelerating deep learning

As more deep learning model parameters and training data become available, better learning performances can be achieved. However, at the same time, this inevitably leads to a drastic increase in training time, emphasizing the necessity for accelerated deep learning [7, 25].

Approaches to accelerating deep learning can be divided into three groups: advanced optimization algorithms, parallel and distributed computing, and specialized hardware. Since the main reason for long training times is that parameter optimization through plain SGD takes too long, several studies have focused on advanced optimization algorithms [202]. To this end, some widely employed algorithms include Adagrad [48], Adam [49], batch normalization [55] and Hessian-free optimization [203]. Parallel and distributed computing can significantly accelerate the time to completion and have enabled many deep learning studies [204–208]. These approaches exploit both scale-up methods, which use a graphics processing unit, and scale-out methods, which use large-scale clusters of machines in a distributed environment. A few deep learning frameworks, including the recently released DeepSpark [209] and TensorFlow [210], provide parallel and distributed computing abilities. Although development of specialized hardware for deep learning is still in its infancy, it will provide major accelerations and become far more important in the long term [211]. Currently, field-programmable gate array-based processors are under development, and neuromorphic chips modeled on the brain are greatly anticipated as promising technologies [212–214].

Future trends of deep learning

Incorporation of traditional deep learning architectures is a promising future trend. For instance, joint networks of CNNs and RNNs integrated with attention models have been applied in image captioning [75], video summarization [215] and image question answering [216]. A few studies toward augmenting the structures of RNNs have been conducted as well. Neural Turing machines [217] and memory networks [218] have adopted addressable external memory in RNNs and shown great results for tasks requiring intricate inferences, such as algorithm learning and complex question answering.

Recently, adversarial examples, which degrade performance with small human-imperceptible perturbations, have received increased attention from the machine learning community [219, 220]. Since adversarial training of neural networks can result in regularization to provide higher performance, we expect additional studies in this area, including those involving adversarial generative networks [221] and manifold regularized networks [222].

In terms of learning methodology, semi-supervised learning and reinforcement learning are also receiving attention. Semi-supervised learning exploits both unlabeled and labeled data, and a few algorithms have been proposed. For example, ladder networks [223] add skip connections to MLP or CNNs and simultaneously minimize the sum of supervised and unsupervised cost functions to denoise representations at every level of the model. Reinforcement learning leverages reward outcome signals resulting from actions rather than correctly labeled data. Since reinforcement learning most closely resembles how humans actually learn, this approach has great promise for artificial general intelligence [224]. Currently, its applications are mainly focused on game playing [4] and robotics [225].

Conclusion

As we enter the major era of big data, deep learning is taking center stage for international academic and business interests. In bioinformatics, where great advances have been made with conventional machine learning, deep learning is anticipated to produce promising results. In this review, we provided an extensive review of bioinformatics research applying deep learning in terms of input data, research objectives and the characteristics of established deep learning architectures. We further discussed limitations of the approach and promising directions of future research.

Although deep learning holds promise, it is not a silver bullet and cannot provide great results in ad hoc bioinformatics applications. There remain many potential challenges, including limited or imbalanced data, interpretation of deep learning results, and selection of an appropriate architecture and hyperparameters. Furthermore, to fully exploit the capabilities of deep learning, multimodality and acceleration of deep learning require further study. Thus, we are confident that prudent preparations regarding the issues discussed herein are key to the

Key Points

• … (deep neural networks, convolutional neural networks, recurrent neural networks, emergent architectures).
• Furthermore, we discuss the theoretical and practical issues plaguing the applications of deep learning in bioinformatics, including imbalanced data, interpretation, hyperparameter optimization, multimodal deep learning, and training acceleration.
• As a comprehensive review of existing works, we believe that this paper will provide valuable insight and serve as a launching point for researchers to apply deep learning approaches in their bioinformatics studies.

Acknowledgements

The authors would like to thank Prof. Russ Altman and Prof. Tsachy Weissman at Stanford University, Prof. Honglak Lee at University of Michigan, Prof. V. Narry Kim and Prof. Daehyun Baek at Seoul National University, and Prof. Young-Han Kim at University of California, San Diego for helpful discussions on applying artificial intelligence and machine learning to bioinformatics.

Funding

This research was supported by the National Research Foundation (NRF) of Korea grants funded by the Korean Government (Ministry of Science, ICT and Future Planning) (Nos. 2014M3C9A3063541 and 2015M3A9A7029735); the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI) funded by the Ministry of Health & Welfare (No. HI15C3224); and the SNU ECE Brain Korea 21+ project in 2016.

References

1. Manyika J, Chui M, Brown B, et al. Big data: the next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute, 2011.
2. Ferrucci D, Brown E, Chu-Carroll J, et al. Building Watson: an overview of the DeepQA project. AI Magazine
success of future deep learning approaches in bioinformatics.
2010;31(3):59–79.
We believe that this review will provide valuable insight and
3. IBM Watson for Oncology. IBM. http://www.ibm.com/smar
serve as a starting point for application of deep learning to ad-
terplanet/us/en/ibmwatson/watson-oncology.html, 2016.
vance bioinformatics in future research.
4. Silver D, Huang A, Maddison CJ, et al. Mastering the game of
Go with deep neural networks and tree search. Nature
Key Points 2016;529(7587):484–9.
5. DeepMind Health. Google DeepMind. https://www.deep
• As a great deal of biomedical data has been accumu- mind.com/health, 2016.
lated, various machine algorithms are now being 6. Larran ~ aga P, Calvo B, Santana R, et al. Machine learning in
widely applied in bioinformatics to extract knowledge bioinformatics. Brief Bioinformatics 2006;7(1):86–112.
from big data. 7. Goodfellow I, Bengio Y, Courville A. Deep Learning. Book in
• Deep learning, which has evolved from the acquisition preparation for MIT Press, 2016.
of big data, the power of parallel and distributed 8. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature
computing and sophisticated training algorithms, has 2015;521(7553):436–44.
facilitated major advances in numerous domains such 9. Farabet C, Couprie C, Najman L, et al. Learning hierarchical
as image recognition, speech recognition and natural features for scene labeling. IEEE Trans Pattern Anal Mach
language processing. Intell, 2013;35(8):1915–29.
• We review deep learning for bioinformatics and pre- 10. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolu-
sent research categorized by bioinformatics domain tions. arXiv Preprint arXiv:1409.4842, 2014.
(i.e. omics, biomedical imaging, biomedical signal pro- 11. Tompson JJ, Jain A, LeCun Y, et al. Joint training of a convolu-
cessing) and deep learning architecture (i.e. deep tional network and a graphical model for human pose
864 | Min et al.

estimation. In: Advances in Neural Information Processing Systems. 2014, 1799–807.
12. Liu N, Han J, Zhang D, et al. Predicting eye fixations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 362–70.
13. Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 2012;29(6):82–97.
14. Sainath TN, Mohamed A-R, Kingsbury B, et al. Deep convolutional neural networks for LVCSR. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013. p. 8614–8. IEEE, New York.
15. Chorowski JK, Bahdanau D, Serdyuk D, et al. Attention-based models for speech recognition. In: Adv Neural Inf Process Syst 2015;577–85.
16. Kiros R, Zhu Y, Salakhutdinov RR, et al. Skip-thought vectors. In: Advances in Neural Information Processing Systems. 2015, p. 3276–84.
17. Li J, Luong M-T, Jurafsky D. A hierarchical neural autoencoder for paragraphs and documents. arXiv Preprint arXiv:1506.01057, 2015.
18. Luong M-T, Pham H, Manning CD. Effective approaches to attention-based neural machine translation. arXiv Preprint arXiv:1508.04025, 2015.
19. Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv Preprint arXiv:1406.1078, 2014.
20. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet 2015;16(6):321–32.
21. Schmidhuber J. Deep learning in neural networks: an overview. Neural Networks 2015;61:85–117.
22. Leung MK, Delong A, Alipanahi B, et al. Machine learning in genomic medicine: a review of computational problems and data sets. Proc IEEE 2016;104:176–97.
23. Mamoshina P, Vieira A, Putin E, et al. Applications of deep learning in biomedicine. Mol Pharm 2016;13:1445–54.
24. Greenspan H, van Ginneken B, Summers RM. Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans Med Imaging 2016;35(5):1153–9.
25. LeCun Y, Ranzato M. Deep learning tutorial. In: Tutorials in International Conference on Machine Learning (ICML'13), 2013. Citeseer.
26. Svozil D, Kvasnicka V, Pospichal J. Introduction to multi-layer feed-forward neural networks. Chemometr Intell Lab Syst 1997;39(1):43–62.
27. Vincent P, Larochelle H, Bengio Y, et al. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, 2008, p. 1096–103. ACM, New York.
28. Vincent P, Larochelle H, Lajoie I, et al. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 2010;11:3371–408.
29. Hinton G, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput 2006;18(7):1527–54.
30. Hinton G, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science 2006;313(5786):504–7.
31. LeCun Y, Boser B, Denker JS, et al. Handwritten digit recognition with a back-propagation network. In: Advances in Neural Information Processing Systems, 1990. Citeseer.
32. Lawrence S, Giles CL, Tsoi AC, et al. Face recognition: a convolutional neural-network approach. IEEE Trans Neural Netw 1997;8(1):98–113.
33. Krizhevsky A, Sutskever I, Hinton G. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012. p. 1097–105.
34. Williams RJ, Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput 1989;1(2):270–80.
35. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 1994;5(2):157–66.
36. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
37. Gers FA, Schmidhuber J, Cummins F. Learning to forget: continual prediction with LSTM. Neural Comput 2000;12(10):2451–71.
38. Lena PD, Nagata K, Baldi PF. Deep spatio-temporal architectures and learning for protein structure prediction. In: Advances in Neural Information Processing Systems, 2012. p. 512–20.
39. Graves A, Schmidhuber J. Offline handwriting recognition with multidimensional recurrent neural networks. In: Advances in Neural Information Processing Systems, 2009. p. 545–52.
40. Hadsell R, Sermanet P, Ben J, et al. Learning long-range vision for autonomous off-road driving. J Field Robot 2009;26(2):120–44.
41. Masci J, Meier U, Cireşan D, et al. Stacked convolutional auto-encoders for hierarchical feature extraction. In: Artificial Neural Networks and Machine Learning – ICANN 2011. Springer, Berlin, Heidelberg, 2011, 52–9.
42. Minsky M, Papert S. Perceptron: an introduction to computational geometry. MIT Press, Cambridge, Expanded Edition 1969;19(88):2.
43. Fukushima K. Cognitron: a self-organizing multilayered neural network. Biol Cybern 1975;20(3–4):121–36.
44. Hinton G, Sejnowski TJ. Learning and relearning in Boltzmann machines. Parallel Distrib Process: Explor Microstruct Cogn 1986;1:282–317.
45. Hinton G. A practical guide to training restricted Boltzmann machines. Momentum 2010;9(1):926.
46. Hecht-Nielsen R. Theory of the backpropagation neural network. In: International Joint Conference on Neural Networks, 1989. IJCNN, 1989. p. 593–605. IEEE, Washington, DC.
47. Bottou L. Stochastic gradient learning in neural networks. Proc Neuro-Nîmes 1991;91(8).
48. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 2011;12:2121–59.
49. Kingma D, Ba J. Adam: a method for stochastic optimization. arXiv Preprint arXiv:1412.6980, 2014.
50. Moody J, Hanson S, Krogh A, et al. A simple weight decay can improve generalization. Adv Neural Inf Process Syst 1995;4:950–7.
51. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15(1):1929–58.
52. Baldi P, Sadowski PJ. Understanding dropout. In: Advances in Neural Information Processing Systems. 2013, 2814–22.
53. Goodfellow IJ, Warde-Farley D, Mirza M, et al. Maxout networks. arXiv Preprint arXiv:1302.4389, 2013.
54. Moon T, Choi H, Lee H, et al. RnnDrop: a novel dropout for RNNs in ASR. In: Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, 2015.

55. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv Preprint arXiv:1502.03167, 2015.
56. Deeplearning4j Development Team. Deeplearning4j: open-source distributed deep learning for the JVM. Apache Software Foundation License 2.0. http://deeplearning4j.org, 2016.
57. Bahrampour S, Ramakrishnan N, Schott L, et al. Comparative study of deep learning software frameworks. arXiv Preprint arXiv:1511.06435, 2015.
58. Nervana Systems. Neon. https://github.com/NervanaSystems/neon, 2016.
59. Jia Y. Caffe: an open source convolutional architecture for fast feature embedding. In: ACM International Conference on Multimedia. ACM, Washington, DC, 2014.
60. Collobert R, Kavukcuoglu K, Farabet C. Torch7: a matlab-like environment for machine learning. In: BigLearn, NIPS Workshop, 2011.
61. Bergstra J, Breuleux O, Bastien F, et al. Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy). 2010, p. 3. Austin, TX.
62. Bastien F, Lamblin P, Pascanu R, et al. Theano: new features and speed improvements. arXiv Preprint arXiv:1211.5590, 2012.
63. Chollet F. Keras: Theano-based Deep Learning library. Code: https://github.com/fchollet. Documentation: http://keras.io, 2015.
64. Dieleman S, Heilman M, Kelly J, et al. Lasagne: First Release, 2015.
65. van Merriënboer B, Bahdanau D, Dumoulin V, et al. Blocks and fuel: frameworks for deep learning. arXiv Preprint arXiv:1506.00619, 2015.
66. Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv Preprint arXiv:1603.04467, 2016.
67. Nair V, Hinton G. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010. p. 807–14.
68. Erhan D, Bengio Y, Courville A, et al. Why does unsupervised pre-training help deep learning? J Mach Learn Res 2010;11:625–60.
69. Hubel DH, Wiesel TN. Receptive fields and functional architecture of monkey striate cortex. J Physiol 1968;195(1):215–43.
70. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process 1997;45(11):2673–81.
71. Cenic A, Nabavi DG, Craen RA, et al. Dynamic CT measurement of cerebral blood flow: a validation study. Am J Neuroradiol 1999;20(1):63–73.
72. Tsao J, Boesiger P, Pruessmann KP. k-t BLAST and k-t SENSE: dynamic MRI with high frame rate exploiting spatiotemporal correlations. Magn Reson Med 2003;50(5):1031–42.
73. Cohen AM, Hersh WR. A survey of current work in biomedical text mining. Brief Bioinformatics 2005;6(1):57–71.
74. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv Preprint arXiv:1409.0473, 2014.
75. Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention. arXiv Preprint arXiv:1502.03044, 2015.
76. Cho K, Courville A, Bengio Y. Describing multimedia content using attention-based encoder–decoder networks. IEEE Trans Multimed 2015;17(11):1875–86.
77. Mnih V, Heess N, Graves A. Recurrent models of visual attention. In: Advances in Neural Information Processing Systems, 2014, p. 2204–12.
78. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292(2):195–202.
79. Ponomarenko JV, Ponomarenko MP, Frolov AS, et al. Conformational and physicochemical DNA features specific for transcription factor binding sites. Bioinformatics 1999;15(7):654–68.
80. Cai Y-D, Lin SL. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta (BBA) – Proteins Proteomics 2003;1648(1):127–33.
81. Atchley WR, Zhao J, Fernandes AD, et al. Solving the protein sequence metric problem. Proc Natl Acad Sci USA 2005;102(18):6395–400.
82. Branden CI. Introduction to protein structure. Garland Science, New York, 1999.
83. Richardson JS. The anatomy and taxonomy of protein structure. Adv Protein Chem 1981;34:167–339.
84. Lyons J, Dehzangi A, Heffernan R, et al. Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network. J Comput Chem 2014;35(28):2040–6.
85. Heffernan R, Paliwal K, Lyons J, et al. Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci Rep 2015;5:11476.
86. Spencer M, Eickholt J, Cheng J. A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM Trans Comput Biol Bioinformat 2015;12(1):103–12.
87. Nguyen SP, Shang Y, Xu D. DL-PRO: a novel deep learning method for protein model quality assessment. In: 2014 International Joint Conference on Neural Networks (IJCNN), 2014, p. 2071–8. IEEE, New York.
88. Baldi P, Brunak S, Frasconi P, et al. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 1999;15(11):937–46.
89. Baldi P, Pollastri G, Andersen CA, et al. Matching protein beta-sheet partners by feedforward and recurrent neural networks. In: Proceedings of the 2000 Conference on Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, 2000. p. 25–36.
90. Sønderby SK, Winther O. Protein secondary structure prediction with long short term memory networks. arXiv Preprint arXiv:1412.7828, 2014.
91. Lena PD, Nagata K, Baldi P. Deep architectures for protein contact map prediction. Bioinformatics 2012;28(19):2449–57.
92. Baldi P, Pollastri G. The principled design of large-scale recursive neural network architectures – dag-rnns and the protein structure prediction problem. J Mach Learn Res 2003;4:575–602.
93. Leung MK, Xiong HY, Lee LJ, et al. Deep learning of the tissue-regulated splicing code. Bioinformatics 2014;30(12):i121–9.
94. Lee T, Yoon S. Boosted categorical restricted Boltzmann machine for computational prediction of splice junctions. In: International Conference on Machine Learning, Lille, France, 2015. p. 2483–92.
95. Zhang S, Zhou J, Hu H, et al. A deep learning framework for modeling structural features of RNA-binding protein targets. Nucleic Acids Res 2015;gkv1025.
96. Chen Y, Li Y, Narayan R, et al. Gene expression inference with deep learning. Bioinformatics 2016;btw074.

97. Li Y, Shi W, Wasserman WW. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. bioRxiv 2016;041616.
98. Liu F, Ren C, Li H, et al. De novo identification of replication-timing domains in the human genome by deep learning. Bioinformatics 2015;btv643.
99. Denas O, Taylor J. Deep modeling of gene expression regulation in an Erythropoiesis model. In: International Conference on Machine Learning Workshop on Representation Learning. Atlanta, Georgia, USA, 2013.
100. Alipanahi B, Delong A, Weirauch MT, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33(8):825–6.
101. Lanchantin J, Singh R, Lin Z, et al. Deep motif: visualizing genomic sequence classifications. arXiv Preprint arXiv:1605.01133, 2016.
102. Zeng H, Edwards MD, Liu G, et al. Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics 2016;32(12):i121–7.
103. Kelley DR, Snoek J, Rinn J. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. bioRxiv 2015;028399.
104. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015;12(10):931–4.
105. Park S, Min S, Choi H-S, et al. deepMiRGene: deep neural network based precursor microRNA prediction. arXiv Preprint arXiv:1605.00017, 2016.
106. Lee B, Lee T, Na B, et al. DNA-level splice junction prediction using deep recurrent neural networks. arXiv Preprint arXiv:1512.05135, 2015.
107. Lee B, Baek J, Park S, et al. deepTarget: end-to-end learning framework for microRNA target prediction using deep recurrent neural networks. arXiv Preprint arXiv:1603.09123, 2016.
108. Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS One 2015;10(11):e0141287.
109. Hochreiter S, Heusel M, Obermayer K. Fast model-based protein homology detection without alignment. Bioinformatics 2007;23(14):1728–36.
110. Sønderby SK, Sønderby CK, Nielsen H, et al. Convolutional LSTM networks for subcellular localization of proteins. arXiv Preprint arXiv:1503.01919, 2015.
111. Fakoor R, Ladhak F, Nazi A, et al. Using deep learning to enhance cancer diagnosis and classification. In: Proceedings of the International Conference on Machine Learning, 2013.
112. Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature 2010;463(7280):457–63.
113. Jolliffe I. Principal component analysis. Wiley Online Library, 2002.
114. Park Y, Kellis M. Deep learning for regulatory genomics. Nat Biotechnol 2015;33(8):825–6.
115. Najarian K, Splinter R. Biomedical Signal and Image Processing. CRC Press, New York, 2005.
116. Edelman RR, Warach S. Magnetic resonance imaging. N Engl J Med 1993;328(10):708–16.
117. Ogawa S, Lee T-M, Kay AR, et al. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. Proc Natl Acad Sci 1990;87(24):9868–72.
118. Hsieh J. Computed tomography: principles, design, artifacts, and recent advances. In: SPIE Bellingham, WA, 2009.
119. Chapman D, Thomlinson W, Johnston R, et al. Diffraction enhanced x-ray imaging. Phys Med Biol 1997;42(11):2015.
120. Bailey DL, Townsend DW, Valk PE, et al. Positron Emission Tomography. Springer, London, 2005.
121. Gurcan MN, Boucheron LE, Can A, et al. Histopathological image analysis: a review. Biomed Eng, IEEE Rev 2009;2:147–71.
122. Plis SM, Hjelm DR, Salakhutdinov R, et al. Deep learning for neuroimaging: a validation study. Front Neurosci 2014;8:229.
123. Hua K-L, Hsu C-H, Hidayati SC, et al. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. Onco Targets Ther 2015;8:2015–22.
124. Suk H-I, Shen D. Deep learning-based feature representation for AD/MCI classification. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Springer, New York, 2013, 583–90.
125. Roth HR, Lu L, Liu J, et al. Improving computer-aided detection using convolutional neural networks and random view aggregation. arXiv Preprint arXiv:1505.03046, 2015.
126. Roth HR, Yao J, Lu L, et al. Detection of sclerotic spine metastases via random aggregation of deep convolutional neural network classifications. In: Recent Advances in Computational Methods and Clinical Applications for Spine Imaging. Springer, Heidelberg, 2015, 3–12.
127. Li Q, Cai W, Wang X, et al. Medical image classification with convolutional neural network. In: 2014 13th International Conference on Control Automation Robotics & Vision (ICARCV), 2014. p. 844–8. IEEE, Singapore.
128. Cireşan DC, Giusti A, Gambardella LM, et al. Mitosis detection in breast cancer histology images with deep neural networks. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Springer, Heidelberg, 2013, 411–8.
129. Ypsilantis P-P, Siddique M, Sohn H-M, et al. Predicting response to neoadjuvant chemotherapy with PET imaging using convolutional neural networks. PloS One 2015;10(9):e0137036.
130. Zeng T, Li R, Mukkamala R, et al. Deep convolutional neural networks for annotating gene expression patterns in the mouse brain. BMC Bioinformatics 2015;16(1):1–10.
131. Cruz-Roa AA, Ovalle JEA, Madabhushi A, et al. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Springer, Heidelberg, 2013, 403–10.
132. Bar Y, Diamant I, Wolf L, et al. Deep learning with non-medical training used for chest pathology identification. In: SPIE Medical Imaging. International Society for Optics and Photonics, 2015, 94140V-V-7.
133. Li Q, Feng B, Xie L, et al. A cross-modality learning approach for vessel segmentation in retinal images. IEEE Trans Med Imaging 2015;35(1):109–18.
134. Ning F, Delhomme D, LeCun Y, et al. Toward automatic phenotyping of developing embryos from videos. IEEE Trans Image Process 2005;14(9):1360–71.
135. Turaga SC, Murray JF, Jain V, et al. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Comput 2010;22(2):511–38.
136. Helmstaedter M, Briggman KL, Turaga SC, et al. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature 2013;500(7461):168–74.
137. Ciresan D, Giusti A, Gambardella LM, et al. Deep neural networks segment neuronal membranes in electron microscopy images. In: Advances in Neural Information Processing Systems. 2012, 2843–51.

138. Prasoon A, Petersen K, Igel C, et al. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Springer, Heidelberg, 2013, 246–53.
139. Havaei M, Davy A, Warde-Farley D, et al. Brain tumor segmentation with deep neural networks. arXiv Preprint arXiv:1505.03540, 2015.
140. Roth HR, Lu L, Farag A, et al. Deeporgan: multi-level deep convolutional networks for automated pancreas segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer, Heidelberg, 2015, 556–64.
141. Stollenga MF, Byeon W, Liwicki M, et al. Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation. arXiv Preprint arXiv:1506.07452, 2015.
142. Xu J, Xiang L, Liu Q, et al. Stacked Sparse Autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans Med Imaging 2015;35(1):119–30.
143. Chen CL, Mahjoubfar A, Tai L-C, et al. Deep learning in label-free cell classification. Sci Rep 2016;6.
144. Cho J, Lee K, Shin E, et al. Medical image deep learning with hospital PACS dataset. arXiv Preprint arXiv:1511.06348, 2015.
145. Lee S, Choi M, Choi H-S, et al. FingerNet: deep learning-based robust finger joint detection from radiographs. In: 2015 IEEE Biomedical Circuits and Systems Conference (BioCAS), 2015. p. 1–4. IEEE, New York.
146. Roth HR, Lee CT, Shin H-C, et al. Anatomy-specific classification of medical images using deep convolutional nets. arXiv Preprint arXiv:1504.04003, 2015.
147. Roth HR, Lu L, Seff A, et al. A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014. Springer, Heidelberg, 2014, 520–7.
148. Kraus OZ, Frey BJ. Computer vision for high content screening. Crit Rev Biochem Mol Biol 2016;51(2):102–9.
149. Gerven MAV, De Lange FP, Heskes T. Neural decoding with hierarchical generative models. Neural Comput 2010;22(12):3127–42.
150. Koyamada S, Shikauchi Y, Nakae K, et al. Deep learning of fMRI big data: a novel approach to subject-transfer decoding. arXiv Preprint arXiv:1502.00093, 2015.
151. Duryea J, Jiang Y, Countryman P, et al. Automated algorithm for the identification of joint space and phalanx margin locations on digitized hand radiographs. Med Phys 1999;26(3):453–61.
152. Niedermeyer E, da Silva FL. Electroencephalography: Basic Principles, Clinical Applications, and Related Fields. Lippincott Williams & Wilkins, New York, 2005.
153. Buzsáki G, Anastassiou CA, Koch C. The origin of extracellular fields and currents – EEG, ECoG, LFP and spikes. Nat Rev Neurosci 2012;13(6):407–20.
154. Marriott HJL, Wagner GS. Practical electrocardiography. Williams & Wilkins, Baltimore, 1988.
155. De Luca CJ. The use of surface electromyography in biomechanics. J Appl Biomech 1997;13:135–63.
156. Young LR, Sheena D. Eye-movement measurement techniques. Am Psychol 1975;30(3):315.
157. Barea R, Boquete L, Mazo M, et al. System for assisted mobility using eye movements based on electrooculography. IEEE Trans Neural Syst Rehabil Eng 2002;10(4):209–18.
158. Freudenburg ZV, Ramsey NF, Wronkeiwicz M, et al. Real-time naive learning of neural correlates in ECoG electrophysiology. Int J Mach Learn Comput 2011.
159. An X, Kuang D, Guo X, et al. A deep learning method for classification of EEG data based on motor imagery. In: Intelligent Computing in Bioinformatics. Springer, Heidelberg, 2014, 203–10.
160. Li K, Li X, Zhang Y, et al. Affective state recognition from EEG with deep belief networks. In: 2013 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2013. p. 305–10. IEEE, New York.
161. Jia X, Li K, Li X, et al. A novel semi-supervised deep learning framework for affective state recognition on EEG signals. In: 2014 IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014. p. 30–7. IEEE, New York.
162. Zheng W-L, Guo H-T, Lu B-L. Revealing critical channels and frequency bands for emotion recognition from EEG with deep belief network. In: 2015 7th International IEEE/EMBS Conference on Neural Engineering (NER), 2015. p. 154–7. IEEE, New York.
163. Jirayucharoensak S, Pan-Ngum S, Israsena P. EEG-based emotion recognition using deep learning network with principal component based covariate shift adaptation. Sci World J 2014;2014; doi:10.1155/2014/627892.
164. Stober S, Cameron DJ, Grahn JA. Classifying EEG recordings of rhythm perception. In: 15th International Society for Music Information Retrieval Conference (ISMIR'14). 2014. p. 649–54.
165. Stober S, Cameron DJ, Grahn JA. Using convolutional neural networks to recognize rhythm. In: Advances in Neural Information Processing Systems. 2014, 1449–57.
166. Cecotti H, Graeser A. Convolutional neural network with embedded Fourier transform for EEG classification. In: 19th International Conference on Pattern Recognition, 2008. ICPR 2008, 2008. p. 1–4. IEEE, New York.
167. Cecotti H, Gräser A. Convolutional neural networks for P300 detection with application to brain-computer interfaces. IEEE Trans Pattern Anal Mach Intell 2011;33(3):433–45.
168. Soleymani M, Asghari-Esfeden S, Pantic M, et al. Continuous emotion detection using EEG signals and facial expressions. In: 2014 IEEE International Conference on Multimedia and Expo (ICME), 2014. p. 1–6. IEEE, New York.
169. Wang Z, Lyu S, Schalk G, et al. Deep feature learning using target priors with applications in ECoG signal decoding for BCI. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. 2013. p. 1785–91. AAAI Press, Palo Alto.
170. Stober S, Sternin A, Owen AM, et al. Deep feature learning for EEG recordings. arXiv Preprint arXiv:1511.04306, 2015.
171. Huanhuan M, Yue Z. Classification of electrocardiogram signals with deep belief networks. In: 2014 IEEE 17th International Conference on Computational Science and Engineering (CSE), 2014. p. 7–12. IEEE, New York.
172. Wulsin D, Gupta J, Mani R, et al. Modeling electroencephalography waveforms with semi-supervised deep belief nets: fast classification and anomaly measurement. J Neural Eng 2011;8(3):036015.
173. Turner J, Page A, Mohsenin T, et al. Deep belief networks used on high resolution multichannel electroencephalography data for seizure detection. In: 2014 AAAI Spring Symposium Series, 2014.
174. Zhao Y, He L. Deep learning in the EEG diagnosis of Alzheimer's disease. In: Computer Vision-ACCV 2014 Workshops. Springer, New York, 2014, 340–53.

175. Längkvist M, Karlsson L, Loutfi A. Sleep stage classification using unsupervised feature learning. Adv Artif Neural Syst 2012;2012:5.
176. Mirowski P, Madhavan D, LeCun Y, et al. Classification of patterns of EEG synchronization for seizure prediction. Clin Neurophysiol 2009;120(11):1927–40.
177. Petrosian A, Prokhorov D, Homan R, et al. Recurrent neural network based prediction of epileptic seizures in intra- and extracranial EEG. Neurocomputing 2000;30(1):201–18.
178. Davidson PR, Jones RD, Peiris MT. EEG-based lapse detection with high temporal resolution. IEEE Trans Biomed Eng 2007;54(5):832–9.
179. Oh S, Lee MS, Zhang B-T. Ensemble learning with active example selection for imbalanced biomedical data classification. IEEE/ACM Trans Comput Biol Bioinformatics (TCBB) 2011;8(2):316–25.
180. Malin BA, El Emam K, O'Keefe CM. Biomedical data privacy: problems, perspectives, and recent advances. J Am Med Inform Assoc 2013;20(1):2–6.
181. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowledge Data Eng 2009;21(9):1263–84.
182. Davis J, Goadrich M. The relationship between Precision–Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. 2006. p. 233–40. ACM, New York.
183. López V, Fernández A, García S, et al. An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inform Sci 2013;250:113–41.
184. Liu X-Y, Wu J, Zhou Z-H. Exploratory undersampling for
197. Hutter F, Hoos HH, Leyton-Brown K. Sequential model-based optimization for general algorithm configuration. In: Learning and Intelligent Optimization. Springer, Berlin, 2011, 507–23.
198. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems. 2012, 2951–9.
199. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res 2012;13(1):281–305.
200. Ngiam J, Khosla A, Kim M, et al. Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011. p. 689–96.
201. Cao Y, Steffey S, He J, et al. Medical image retrieval: a multimodal approach. Cancer Inform 2014;13(Suppl 3):125.
202. Ngiam J, Coates A, Lahiri A, et al. On optimization methods for deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011. p. 265–72.
203. Martens J. Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010. p. 735–42.
204. Raina R, Madhavan A, Ng AY. Large-scale deep unsupervised learning using graphics processors. In: Proceedings of the 26th Annual International Conference on Machine Learning, 2009. p. 873–80. ACM.
205. Ho Q, Cipar J, Cui H, et al. More effective distributed ML via a stale synchronous parallel parameter server. In: Advances in Neural Information Processing Systems. 2013. p. 1223–31.
206. Bengio Y, Schwenk H, Senécal J-S, et al. Neural probabilistic
184. Liu X-Y, Wu J, Zhou Z-H. Exploratory undersampling for 206. Bengio Y, Schwenk H, Senécal J-S, et al. Neural probabilistic
class-imbalance learning. IEEE Trans Syst Man Cybern Part B: language models. Innovations in Machine Learning. Springer,
Cybern 2009;39(2):539–50. Berlin, 2006, 137–86.
185. Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic mi- 207. Li M, Andersen DG, Park JW, et al. Scaling distributed ma-
nority over-sampling technique. J Artif Intell Res 2002;321–57. chine learning with the parameter server. In: 11th USENIX
186. Jo T, Japkowicz N. Class imbalances versus small disjuncts. Symposium on Operating Systems Design and Implementation
ACM Sigkdd Explor Newslett 2004;6(1):40–9. (OSDI 14). 2014. p. 583–98.
187. Kukar M, Kononenko I. Cost-sensitive learning with neural 208. Dean J, Corrado G, Monga R, et al. Large scale distributed
networks. In: ECAI. 1998, 445–9. Citeseer. deep networks. In: Advances in Neural Information Processing
188. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Systems. 2012, 1223–31.
Knowl Data Eng 2010;22(10):1345–59. 209. Kin H, Park J, Jang J, et al. DeepSpark: spark-based deep learn-
189. Deng J, Dong W, Socher R, et al. Imagenet: a large-scale hier- ing supporting asynchronous updates and caffe compatibil-
archical image database. In: CVPR 2009. IEEE Conference on ity. arXiv Preprint arXiv:1602.08191, 2016.
Computer Vision and Pattern Recognition, 2009, 2009. p. 248–55. 210. Abadi M, Agarwal A, Barham P, et al. TensorFlow: Large-
IEEE. scale machine learning on heterogeneous systems, 2015.
190. Zeiler MD, Fergus R. Visualizing and understanding convo- Software available from tensorflow. org, 2015.
lutional networks. Computer Vision–ECCV 2014. Springer, 211. Simonite T. Thinking in Silicon. MIT Technology Review,
2014, 818–33. 2013.
191. Erhan D, Bengio Y, Courville A, et al. Visualizing higher- 212. Ovtcharov K, Ruwase O, Kim J-Y, et al. Accelerating deep
layer features of a deep network. University of Montreal, 2009, convolutional neural networks using specialized hardware.
1341. Microsoft Res Whitepaper 2015;2.
192. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolu- 213. Farabet C, Poulet C, Han JY, et al. Cnp: an fpga-based proces-
tional networks: visualising image classification models and sor for convolutional networks. In: FPL 2009. International
saliency maps. arXiv Preprint arXiv:1312.6034, 2013. Conference on Field Programmable Logic and Applications, 2009,
193. Choromanska A, Henaff M, Mathieu M, et al. The loss surfaces 2009. p. 32–7. IEEE, New York.
of multilayer networks. arXiv Preprint arXiv:1412.0233, 2014. 214. Hof RD. Neuromorphic Chips. MIT Technology Review, 2014.
194. Dauphin YN, Pascanu R, Gulcehre C, et al. Identifying and at- 215. Yao L, Torabi A, Cho K, et al. Describing videos by exploiting
tacking the saddle point problem in high-dimensional non- temporal structure. In: Proceedings of the IEEE International
convex optimization. In: Advances in Neural Information Conference on Computer Vision, 2015. p. 4507–15.
Processing Systems. 2014, 2933–41. 216. Noh H, Seo PH, Han B. Image question answering using con-
195. Bengio Y. Practical recommendations for gradient-based volutional neural network with dynamic parameter predic-
training of deep architectures. In: Neural Networks: Tricks of tion. arXiv preprint arXiv:1511.05756, 2015.
the Trade. Springer, Heidelberg, 2012, 437–78. 217. Graves A, Wayne G, Danihelka I. Neural turning machines.
196. Bergstra J, Bardenet R, Bengio Y, et al. Algorithms for hyper- arXiv Preprint arXiv:1410.5401, 2014.
parameter optimization. In: Advances in Neural Information 218. Weston J, Chopra S, Bordes A. Memory networks. arXiv
Processing Systems. 2011, 2546–54. Preprint arXiv:1410.3916, 2014.
Deep learning in bioinformatics | 869

219. Szegedy C, Zaremba W, Sutskever I, et al. Intriguing prop- 223. Rasmus A, Berglund M, Honkala M, et al. Semi-supervised
erties of neural networks. arXiv Preprint arXiv:1312.6199, learning with ladder networks. In: Advances in Neural
2013. Information Processing Systems. 2015, 3532–40.
220. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harness- 224. Arel I. Deep reinforcement learning as foundation for artifi-
ing adversarial examples. arXiv Preprint arXiv:1412.6572, cial general intelligence. In: Theoretical Foundations of Artificial
2014. General Intelligence. Springer, Berlin, 2012, 89–102.
221. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative ad- 225. Cutler M, How JP. Efficient reinforcement learning for robots using
versarial nets. In: Advances in Neural Information Processing informative simulated priors. In: 2015 IEEE International Conference
Systems. 2014, 2672–80. on Robotics and Automation (ICRA), 2015. p. 2605–12. IEEE, New York.
222. Lee T, Choi M, Yoon S. Manifold regularized deep neural net-
works using adversarial examples. arXiv Preprint arXiv:
1511.06381, 2015.
