DL Protein
doi: 10.1093/bib/bbz156
Advance Access Publication Date:
Review Article
Abstract
The recent emergence of deep learning to characterize complex patterns of protein big data reveals its potential to address
the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a
powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions.
In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data.
The application architectures of these methods include multilayer perceptrons, stacked autoencoders, deep belief networks,
two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex
neural networks and are described from five perspectives: residue-level prediction, sequence-level prediction,
three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and
deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some
practical issues and their future directions are discussed, such as robust deep learning for noisy protein data, architecture
optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous
protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives
on general deep learning techniques for protein data analysis.
Key words: deep learning; protein big data; residue-level prediction; sequence-level prediction; 3D-structure prediction;
interaction prediction; protein mass spectrometry
Introduction
Deep learning has seen success in the fields of vision and speech recognition [1]. Since deep learning approaches can automatically learn the representations of data with multiple levels of abstraction, they impact almost every discipline of science and engineering, including the physical [2], chemical [3], medical [4], and biological sciences [5, 6]. Deep learning plays a particularly important role in knowledge discovery and practical solutions from biological/biomedical big data [7]. Recently, the applications of deep learning in bioinformatics [7], biomedicine [8–10], biomedical data [11, 12], drug discovery [13], and healthcare [14, 15] have been discussed in detail.
Protein data analysis is an important branch of bioinformatics, the computational methods of which have been greatly improved with the rapid growth of sequential and structural protein data and the continual development of deep learning technology. The richness of protein data provides a solid foundation for data-driven hypothesis generation and biological knowledge discovery. Deep learning can automatically extract nonlinear, intrinsic, abstract, and complex patterns from large-scale data
Qiang Shi is a postdoctoral fellow at the School of Software Engineering, Huazhong University of Science and Technology. His main interests cover machine
learning, especially deep learning, protein data analysis, and big data mining.
Weiya Chen is an assistant professor at the School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests
cover bioinformatics, virtual reality, and data visualization.
Siqi Huang is a master’s student of Software Engineering at Huazhong University of Science and Technology, focusing on machine learning and data mining.
Yan Wang is an associate professor at School of Life, University of Science & Technology; her main interests cover protein structure and function prediction
and big data mining.
Zhidong Xue is professor at School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover
bioinformatics, machine learning, and image processing.
Submitted: 16 August 2019; Received (in revised form): 21 October 2019
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2 Shi et al.
without prior knowledge [16] and is suitable for the analysis of large-scale protein data [17]. Therefore, analysis of protein solubility [18], secondary structures [19–23], sequence profiles [5], protein–protein interactions (PPIs) [24, 25], protein threading [26], protein design [27–29], posttranslational modifications [30], function annotation [31–34], and other applications [35–41] has benefited from deep learning.
According to the inputs for deep learning models, these approaches can be categorized by four aspects: sequence, struc-
simultaneous feature reconstruction and classifier training, the shallow features directly extracted from raw data are still needed to represent protein data. This is because protein data should be converted to numerical vectors that the algorithm can recognize directly [40]. It is an important process for machine-learning methods, since effective mathematical expressions can describe the intrinsic correlation with the corresponding structural and functional attributes [194–196]. Considering that MS techniques reflect protein structures [197], only some preprocessing, such
Table 1. Various deep learning approaches based on input data types for protein property prediction
Sequences | Residue-level prediction | DNN | Backbone dihedral angles [42–44]; torsion angle [45]; solvent accessibility [46]; PTM site [47–52]; secondary structure [53–56]; contact prediction [57, 58, 83]; disorder [59]
Sequences | Residue-level prediction | CNN | DNA-binding site [60, 61]; signal peptide [62]; disorder [63–65]; secondary structure [66–72]; dihedral angles [73, 74]; torsion angle [75]; solubility [18, 76]; binding site [36, 77, 78]; residue–residue
Residue-level prediction from protein sequence
Residue-level prediction means that the properties are associated with specific residues, such as secondary structure [53–56, 66–72], disorder [63–65, 102], solvent accessibility [46, 100, 101], protein-ligand sites [61, 84, 85], PTM sites [30, 47, 48, 51, 90, 93, 94, 97, 114], residue contact [57, 79–82], signal peptides [62, 110], backbone angles [42, 73, 74, 99], and so on [39, 41, 78]. These properties are affected by neighbors that are close in the primary sequence or in the 3D structure. However, residues that are neighbors in 3D might be far apart in the primary sequence. These local or nonlocal dependencies are essential to property prediction at the residue level. To model these dependencies and improve predictive performance, DNN, CNN [87], RNN [including long short-term memory (LSTM) and gated recurrent unit (GRU)] [19], and the hybrid of CNN and RNN are utilized for residue-level prediction.
DNN-based approaches
MLP- [48, 52, 55], SAE- [38, 44, 46, 56], and DBN-based [47, 54] approaches have been widely utilized for various tasks in residue-level prediction. MLP-based methods have been applied to predict secondary structures [55], lysine acetylation sites [52], and nitration and nitrosylation sites [48]. SAE-based approaches have been used to predict secondary structure [56], solvent accessibility and contact number [46], and backbone Cα angles and dihedrals [44]. DBN-based methods have been adopted to predict secondary structures [54]
and S-sulfenylation sites [47]. Although these approaches outperform traditional methods, they remain far from practical application. In particular, to further improve predictive performance, iterative deep learning has been introduced for the prediction of secondary structures, local backbone angles, and solvent-accessible surface area, in which previously predicted results are used as input for the next training iteration [38]. Similar to the two-level strategy, the DeepConPred
Deep learning for mining protein data 5
model of Xiong [83] improves long-range residue–residue contact prediction based on a hierarchical strategy. DeepCCon is adopted to predict the probabilities of parallel contact, anti-parallel contact, and no contact. Then the coarse contact predicted by DeepCCon, smoothed PSSM, the natural vector of the intervening sequence, the contact propensity of the residue pair, and coevolutionary information of the residue pair are combined and fed into DeepRCon to predict the final contact map.
few homologs are insufficient for accurate contact-assisted protein folding.
Gao [73] proposed RaptorX-Angle, which employs ResNet to construct a much deeper DCNN to predict backbone dihedral angles from the sequence alone. RaptorX-Angle uses PSSM, position-specific frequency matrix (PSFM), ACC, SA, and SS probabilities as input features and adopts six ResNets with different numbers of layers to extract deep features. The deep features are fed into a logistic regression layer to get the probability of 20
Figure 2. The architecture of deep RNN based on BLSTM for residue-level prediction. Shallow features extracted from protein sequences are combined as input for
deep learning. Then several BLSTM layers are stacked to extract long-range dependencies and achieve more abstract and distinguishable features for classification.
Finally, MLPs or CRFs are used as classifiers for results.
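The pipeline in Figure 2 can be sketched in a few lines of NumPy. This is a minimal, untrained illustration rather than any cited method: the feature dimension (21), hidden size (16), number of classes (3), and all function names are assumptions chosen for the example, random weights stand in for trained parameters, and a per-residue softmax layer replaces the MLP or CRF classifier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(input_dim, hidden_dim, rng):
    """Random (untrained) weights for one LSTM direction; gates stacked as i, f, o, g."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim)),
        "U": rng.normal(0.0, 0.1, (4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }

def lstm_pass(x, params):
    """Run one LSTM direction over x of shape (L, input_dim); returns (L, hidden_dim)."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    outputs = []
    for x_t in x:
        z = params["W"] @ x_t + params["U"] @ h + params["b"]
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell state
        h = sigmoid(o) * np.tanh(c)                   # expose the hidden state
        outputs.append(h)
    return np.stack(outputs)

def blstm_layer(x, fwd, bwd):
    """Concatenate a forward pass and a backward pass at every residue position."""
    h_fwd = lstm_pass(x, fwd)
    h_bwd = lstm_pass(x[::-1], bwd)[::-1]  # read the sequence in reverse, then re-align
    return np.concatenate([h_fwd, h_bwd], axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_residue_classes(features, layers, W_out, b_out):
    """Stacked BLSTM layers followed by a per-residue softmax classifier."""
    h = features
    for fwd, bwd in layers:
        h = blstm_layer(h, fwd, bwd)
    return softmax(h @ W_out + b_out)

rng = np.random.default_rng(0)
L, n_features, hidden, n_classes = 30, 21, 16, 3  # hypothetical sizes
features = rng.normal(size=(L, n_features))       # placeholder for shallow features
layers = [
    (init_lstm(n_features, hidden, rng), init_lstm(n_features, hidden, rng)),
    (init_lstm(2 * hidden, hidden, rng), init_lstm(2 * hidden, hidden, rng)),
]
W_out = rng.normal(0.0, 0.1, (2 * hidden, n_classes))
b_out = np.zeros(n_classes)
probs = predict_residue_classes(features, layers, W_out, b_out)  # one row per residue
```

Because the backward pass reads the sequence from the C-terminus, the representation of each residue depends on residues on both sides of it, which is how stacked BLSTM layers can capture nonlocal dependencies.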
approaches. This is because the glutarylation PTM site is only sensitive to local motif patterns, and others are sensitive to the combination of long-range and local information. The two-stacked BLSTM approach is used to capture nonlocal interactions for SS, SA, backbone angles, and contact numbers [40]. This work highlights the importance of capturing nonlocal interactions to predict one-dimensional structural properties. Similarly, Zhang [100] introduced a three-stacked BLSTM, which is called a stacked deep bidirectional recurrent neural network (SDBRNN), to predict solvent accessibility. Besides the PSSM and physicochemical properties, they used conservation score and protein encodings as inputs. They redesigned BLSTM using three types of merging operators (concat, sum, and weighted sum) and used logistic activation as a predictor. Compared with BLSTM using a single merging operator, SDBRNN can capture more protein features and is more generalizable.
Hybrid of CNN and RNN approaches
The motivation to combine CNN and RNN is that residues are influenced not only by their sequential neighbors but also by structurally adjacent residues. A suitable prediction model can exploit this phenomenon to learn useful local patterns by CNN and then use RNN to learn aggregate features of the entire sequence. A simple idea of hybrid deep learning is the concatenation of CNN and BLSTM models. The framework of this hybrid model, as shown in Figure 3, consists of an input layer of sequences with encoding, the convolutional layer, BLSTM layers, and classifier layers (MLP or CRF). This hybrid model has been used to predict antibody paratope [109] and protein hydroxylation sites [114]. However, this simple model cannot deal with prediction problems that have small samples. To tackle this issue, a complex hybrid model, SignalP 5.0, is constructed for signal peptide predictions [110]. In this
DNCON [58] | DBN based on RBM | Length of the protein; SS and SA; PSSM sums; PSSM sum cosines; Atchley factors; statistical potentials | Dataset: CASP10; Long range: Top L/10: 0.663, Top L/5: 0.615; Medium range: Top L/10: 0.749, Top | http://iris.rnet.missouri.edu/dncon/ | First deep learning model for residue–residue contact prediction
approach, SignalP 5.0 integrates one-dimensional convolutions to obtain learnable nonlinear PSSMs before combining CNN and BLSTM. In addition, SignalP 5.0 adopts transfer learning to improve predictive performance in organism groups with little data (notably Archaea). In transfer learning, a deep learning model pretrained on other taxonomic groups is fine-tuned for Archaea, gram-positive bacteria, gram-negative bacteria, and Eukarya.
In these hybrid models, only the global features from BLSTM are used for prediction, and the local features extracted by CNN are ignored. To simultaneously use local and global features, Shi et al. [112] proposed the DNN-Dom architecture for the boundary prediction of protein domains. In DNN-Dom, the combination of local and global features is fed into parallel balanced random forests for boundary prediction. Furthermore, considering that the traditional convolutional layers ignore features from the feature vector dimension, CNN is replaced by asymmetric convolutional neural networks (ACNNs) for constructing hybrid models. For example, the DeepACLSTM consisting of ACNN and
BLSTM is introduced to predict eight-category SS [106]. In this new method, ACNN is adopted to learn phrase-level features as inputs for BLSTM. Since current approaches aim to solve one problem and cannot simultaneously solve several issues, web servers based on integrated deep learning, called NetSurfP-2.0 and MUFold-SSW [39, 41], have been presented to predict several properties.
Sequence-level property prediction
Sequence-level prediction indicates that the properties to be predicted are determined by the whole protein sequence. Since shallow features are difficult to extract from sequences and are poorly representative for the classification task, traditional algorithms do not perform satisfactorily. In addition, they cannot model the complex relationships between sequences and properties. Therefore, based on its ability to automatically learn representations from data with multiple levels of abstraction, deep learning is utilized for property prediction at the sequence level. Various deep learning approaches, including DNN, CNN, RNN, and hybrids of CNN and RNN, are utilized for fold analysis [116, 125, 126, 132], function prediction [31–34, 117–120, 140], target identification [123, 124], subcellular location [121, 122, 127, 134, 135], remote homology detection [129–131], antimicrobial peptide (AMP) recognition [138, 139], and enzyme EC classification [136, 137].
DNN-based approaches
Inspired by the representation power of deep learning models, DNN-based approaches are introduced to improve predictive performance of sequence-level properties, such as subcellular localization [121], protein function [120], target protein identification [123], and fold recognition [116]. Given the multiple labels and hierarchical structure of the gene ontology (GO) function for proteins, a hierarchical stack of multitask feed-forward deep neural networks, named DEEPred, is proposed for automatic protein function annotation [120]. DEEPred uses a multitask feed-forward network to generate a practical large-scale protein function prediction pipeline. However, DEEPred does not consider an optimized network initialization. Therefore, a new deep model that utilizes a restricted Boltzmann machine (RBM) for network initialization is introduced for fold recognition [116].
The above approaches have achieved state-of-the-art performance using hidden layers to get nonlinear and abstract features. However, these hidden layers cannot be used to describe the probability distribution of raw data. To solve this problem, the deep generative model is proposed for T-cell receptor (TCR) protein sequences [238]. In this model, variational autoencoder (VAE) models parameterized by deep neural networks are fitted to TCR repertoires.
CNN-based approaches
Because deep learning can extract information from unstructured data far more efficiently than human experts, DCNN-based approaches have been proposed for function prediction [31, 34], fold recognition [40], and family prediction [128] from protein sequences. The typical framework of DCNN for protein sequence prediction, as shown in Figure 4, consists of an input layer of protein sequences with shallow features, several convolutional layers, and max pooling, fully connected, and softmax layers.
For more accurate and faster family prediction, Seo [128] proposed DeepFam, which consists of one convolutional layer, a 1-max pooling layer, a fully connected layer, and a softmax layer. Combinations of various hyper-parameters, including the number and length of convolution kernels, the number of perceptrons in the fully connected layer, the coefficient of regularization, dropout rate, learning rate, and batch size, were
tested in experiments. Hou [40] similarly proposed DeepSF to directly classify any protein sequence into one of 1195 known folds. DeepSF adopted 1D DCNN for fold classification, which consists of 10 convolutional layers, 1 max pooling layer, 1 flattening layer, 1 fully connected hidden layer, and an output layer. DeepSF used softmax as a classifier for fold recognition. Kulmanov [31] proposed DeepGO to predict functions from sequences. DeepGO combines the deep features learned from sequences by deep learning with a feature vector extracted from a cross-species PPI network for predictions. These combined features are fed into a hierarchical classifier to make predictions. However, for novel or uncharacterized proteins, there is no additional information from the protein’s interactions. DeepGO has been extended and improved with DeepGOPlus [34], which overcomes its main limitations related to sequence length, missing features, and number of predicted classes.
RNN-based approaches
Li [129] proposed ProDec-BLSTM as a predictor to improve remote homology detection. ProDec-BLSTM can capture both long and short dependencies, as shown in Figure 5. The protein sequence is encoded by one-hot encoding as the input. BLSTM extracts more comprehensive dependence information, which is included in the intermediate hidden units. The values of these hidden units are fed into the time-distributed dense layer, which can reassign the weights of the dependence relationships extracted from different cells. Finally, the outputs of the time-distributed dense layer are concatenated into one feature vector, which is fed into an SVM classifier for decision-making. Thanks to the time-distributed dense layer, the fused features that contain complex dependencies are more discriminative; hence, the ProDec-BLSTM predictor achieves higher performance than various related methods, including kernel-, SVM-, and LSTM-based approaches.
Hybrid of CNN and RNN approaches
To obtain a more comprehensive sequence representation, CNN and RNN are combined for AMP recognition [138], subcellular localization prediction [134], and enzyme EC classification [144]. As shown in Figure 6, there are simple and complex strategies to construct hybrid models. The simple model consists of one CNN and one RNN, while the complex model includes multichannel CNN and BLSTM with an attention mechanism to capture features for protein prediction.
To improve AMP recognition, expert-free features are extracted by a deep learning approach consisting of CNN and RNN [138]. The outputs of the convolutional layer are fed into an LSTM layer, which is a common design in bioinformatics. Armenteros [134] similarly presented DeepLoc to predict subcellular localization. In DeepLoc, the CNN is also followed by RNN. However, there are several differences: (1) DeepLoc uses convolutional filters of different sizes to extract meaningful motif information; (2) bidirectional LSTMs are adopted to capture long-range features in both the forward and backward directions; and (3) attention mechanisms [239] are used to improve prediction. In addition, because of the hierarchical categories of subcellular localization, a hierarchical tree [240] with multiple nodes is developed.
To avoid limitations such as homology requirements, feature design, and feature dimensionality nonuniformity, Li [136] introduced the DEEPre model to improve enzyme EC number prediction. This method uses both the deep features from the hybrid of CNN and RNN and shallow features such as sequence length-independent features for classification. The combination of sequence one-hot encoding, PSSM, SA, and SS is fed into a hybrid model. This is different from DeepLoc and deep learning-based AMP, which use one-hot encoding as inputs.
In summary, various architectures of deep learning for protein function prediction are listed in Table 4. Complex models that combine the deep features from hybrid deep learning architectures and shallow features are usually utilized for sequence-level tasks.
Three-dimensional structural data mining
Central to protein biology is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural–functional relationships. However, performance of these methods depends critically on the choice of protein structural representation. Good
Figure 6. The architecture of a hybrid model for sequence-level prediction. (a) A simple strategy of the hybrid model consisting of one CNN and one LSTM is used for
AMP prediction. (b) A complex hybrid model with multiple channel CNN and BLSTM with an attention mechanism is adopted for subcellular prediction.
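The simple strategy in Figure 6(a) can be illustrated with a short NumPy sketch: one-hot encoding, one convolutional layer that extracts local motif features, a recurrent layer that aggregates them along the sequence, and a logistic output. It is not the implementation of any cited model. The example peptide, layer sizes, and function names are hypothetical, the weights are random rather than trained, and a plain tanh recurrent cell is swapped in for the LSTM to keep the example short.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Encode a protein sequence as an (L, 20) binary matrix."""
    x = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        if aa in AA_INDEX:  # unknown residues stay all-zero
            x[pos, AA_INDEX[aa]] = 1.0
    return x

def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution with ReLU. x: (L, C); kernels: (F, k, C); returns (L-k+1, F)."""
    n_filters, k, _ = kernels.shape
    out_len = x.shape[0] - k + 1
    out = np.empty((out_len, n_filters))
    for t in range(out_len):
        window = x[t:t + k]  # k consecutive residues
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)

def recurrent_summary(h_seq, W_in, W_rec, b):
    """Plain tanh recurrent cell (stand-in for an LSTM); returns the final state."""
    state = np.zeros(W_rec.shape[0])
    for h_t in h_seq:
        state = np.tanh(W_in @ h_t + W_rec @ state + b)
    return state

def predict(sequence, p):
    """CNN for local patterns, then a recurrent layer, then a logistic classifier."""
    local = conv1d_relu(one_hot(sequence), p["kernels"], p["conv_bias"])
    summary = recurrent_summary(local, p["W_in"], p["W_rec"], p["rec_bias"])
    logit = p["w_out"] @ summary + p["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
n_filters, kernel_size, hidden = 8, 5, 12  # hypothetical sizes
params = {
    "kernels": rng.normal(0.0, 0.1, (n_filters, kernel_size, len(AMINO_ACIDS))),
    "conv_bias": np.zeros(n_filters),
    "W_in": rng.normal(0.0, 0.1, (hidden, n_filters)),
    "W_rec": rng.normal(0.0, 0.1, (hidden, hidden)),
    "rec_bias": np.zeros(hidden),
    "w_out": rng.normal(0.0, 0.1, hidden),
    "b_out": 0.0,
}
score = predict("GIGKFLHSAKKFGKAFVGEIMNS", params)  # hypothetical input peptide
```

The complex strategy in Figure 6(b) would extend this sketch with several kernel sizes in parallel, a bidirectional recurrent layer, and attention weights over positions instead of taking only the final state.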
representations efficiently capture the most critical information, while poor representations create a noisy distribution with no underlying patterns. Most current methods rely on features that are manually selected based on knowledge of protein structures. In addition, designing hand-engineered features is labor-intensive, time-consuming, and suboptimal for some tasks. Fortunately, the surfeit of protein structures and the recent success of deep learning algorithms provide an opportunity to develop tools to automatically extract task-specific representations of protein structures. Following the voxel-, projection-, or graph-based representation of protein 3D structures (Figure 7), several deep learning approaches automatically extract features from the protein 3D structure and are applied to predict functions [143], binding pockets [151, 157], ligand-binding pockets [146–150], enzyme classifications [144], amino acid environment similarity analyses [145], model quality assessment [141, 152], and so on [142, 155, 156].
Low-dimensional mapping methods
The idea of the projection-based approach is to reduce the data dimension from 3D to 2D or 1D using geometric and topological relations within the 3D structure and then to employ deep learning methods to extract deep features for prediction. These approaches include distance matrix- and topology-based DL methods. Nguyen [141] proposed a distance matrix-based deep learning model (DL-PRO) for 3D structure quality assessment. DL-PRO first calculates the pairwise distance matrix of the C-α atoms of residues. These distance matrices, and their corresponding labels indicating good or bad models, are fed into
a stacked autoencoder network for training. DL-PRO is a purely geometric method that can extract effective features representing good models. Since DL-PRO only uses a distance matrix and loses some information, the method proposed by [143] combines local shape features with features characterizing the interaction of amino acids to form a multichannel image, which is fed into 2D CNN for function prediction.
To extract geometric and biological complexities of biomolecules and improve predictive performance, Cang [142] proposed TopologyNet to predict protein-ligand binding affinities. The element-specific topological fingerprint (ESTF) that can provide a sufficient and structured low-level representation is computed. Then 1D CNN is used to learn high-level representations. Shallow and deep features are combined to feed into a multitask learning framework for prediction.
Since information is lost when mapping 3D data to lower dimensions, these approaches are expected to combine additional information extracted directly from 3D structures.
Voxel-based methods
Unlike the pixels of 2D images, the atoms of proteins are not located on a regular grid. Voxelizing the protein structure makes it possible to feed the protein directly into the CNN. There are several voxelizing methods to map irregular atomic coordinates to regular representations of 3D grids, including the occupancy grid [144], multiple atom-channel grid [145], and multiple atom-type grid [146, 148, 151].
To avoid less reliable function prediction caused by sequences, Amidi [144] proposed EnzyNet for enzyme classification according to a voxel-based spatial structure. Enzymes are represented as binary volumetric shapes with voxels. A voxel takes the value 1 if the backbone of the enzyme passes through the voxel, and 0 otherwise. Although this occupancy grid can be directly fed into 3D CNN, it ignores the physicochemical properties of atoms. Therefore, because amino acid microenvironments are characterized by 3D spatial distributions of oxygen, carbon, nitrogen, and sulfur atoms of amino acids in a local box, Torng and Altman [145] proposed 3D CNN for residue microenvironment analysis. The voxelizing process includes local box sampling, local box extraction, and local box featurization to produce four channel structures, including oxygen, carbon, nitrogen, and sulfur, which serve as input samples to 3DCNN. This approach can systematically derive rules governing structural–functional relationships motivated by the surfeit of structural data. Torng’s approach [145] utilized four atom channels and ignored other atom types. Jiménez et al. [146] introduced a new 3D CNN, DeepSite, for ligand-binding site prediction with seven atom categories:
hydrophobic, aromatic, hydrogen bond acceptor or donor, positive or negative ionizable, and metallic. Skalic et al. [151] similarly proposed the LigVoxel model to predict ligand chemical properties like occupancy, aromaticity, and donor–acceptor. Additionally, for accurate classification of ligand-binding pockets, DeepDrug3D used 14 atom types to calculate features of a voxel [150].
Graph-based methods
Although voxel-based methods achieve state-of-the-art performance, they ignore intrinsic irregular topology, which directly governs protein properties. To describe 3D topology such as spatial distances and directions between atoms, protein structures are represented as 3D molecular graphs. Based on this representation, a three-dimensional graph convolutional network (3DGCN) is introduced to efficiently deal with these irregular topologies for molecule interpretation [155]. In 3DGCN, a convolutional layer contains two phases. One combines the features from each node and generates the intermediate features. The other collects and sums these intermediate features along neighborhoods and generates higher-level features. Experiments on four datasets in the chemical and biological fields demonstrate that 3DGCN achieves state-of-the-art performance in virtual drug screening, protein-ligand interactions, and protein docking.
In summary, deep architectures used for protein 3D structure analysis are listed in Table 5. Although low-dimensional mapping approaches, voxel-based approaches, and 3D graph-based methods are adopted for protein property prediction, graph-based deep learning that can directly model the intrinsic irregular topology of protein structure may be promising.
Interaction prediction of proteins and other molecules
The interactions between proteins and other molecules [241], such as RNA [168, 169, 178, 242, 243], noncoding RNA (ncRNA) [158–160, 170, 179], other proteins [24, 25, 164–167, 177, 183, 184], and compounds [161–163, 171–173, 180–182], play an important role in many cellular biological processes, such as signal transduction, immune response, cellular organization, protein synthesis, and viral infectivity. Furthermore, protein–compound interactions facilitate network pharmacology and drug discovery. Because interaction predictors must process two inputs simultaneously, deep learning approaches to this task can be classified as early- or late-fusion strategies, as shown in Figure 8. The former consists of representation calculation of two kinds of biomolecules, representation stitching, deep feature extraction, and classification. The latter includes representation calculation, deep feature extraction for two kinds of biomolecules, deep feature fusion, and classification. L. Wang [168] adopted the early-fusion strategy for protein–RNA interaction prediction. The PSSM from the protein sequence and order-preserving transformation (OPT) from the RNA sequence are stitched, and the deep features are extracted by DCNN from these stitched representations. An extreme learning machine (ELM) [244] classifier, which executes quickly while maintaining learning accuracy, predicts the interactions. Similarly, H. Yi [170] used SAE to predict ncRNA–protein interactions, K. Tian [171] adopted DNN based on ELM to boost compound–protein interaction prediction, and T. Sun [24] used SAE for PPIs. Different from early-fusion approaches that first combine shallow representations, Hashemifar [245] presented DPPI to predict PPIs. PSSMs from the interacting sequences are fed into DCNN to detect various patterns. A representation to model paired sequences is generated by a random projection module. PPIs are then predicted by maximizing the log-likelihood of the interaction. Similarly, Lei [168] introduced a multimodal deep polynomial network (MDPN) for PPI prediction. A two-stage DPN extracts high-level and complex features from paired sequences. The first stage feeds multiple protein features into DPN encoding to obtain the high-level feature, while the second stage fuses and learns features by cascading three types of high-level features in the DPN encoding. A regularized extreme learning machine (RELM) [246] predicts PPIs.
Specifically, considering that analyzing interactions requires one to handle rich relation information among elements,
Figure 8. Two strategies for interaction prediction between proteins and other biomolecules. (a) In the early-fusion strategy, the fused shallow features from two
biomolecules are fed into deep neural networks to extract deep features, which are fed into classifiers such as ELM and logistic regression. (b) In the late-fusion
strategy, shallow features are fed into deep learning to achieve deep features, which are combined to feed into classifiers.
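The difference between the two strategies in Figure 8 comes down to where the concatenation happens. The sketch below is a minimal illustration under assumed settings: the feature sizes (20 and 8), the single hidden layer per encoder, the logistic classifier, and all names are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def dense_relu(x, W, b):
    """One fully connected layer with ReLU, standing in for a deep feature extractor."""
    return np.maximum(W @ x + b, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def early_fusion(feat_a, feat_b, p):
    """Stitch the shallow features first, then extract deep features jointly."""
    stitched = np.concatenate([feat_a, feat_b])
    deep = dense_relu(stitched, p["W_joint"], p["b_joint"])
    return sigmoid(p["w_out"] @ deep + p["b_out"])

def late_fusion(feat_a, feat_b, p):
    """Extract deep features per molecule, then fuse them before classification."""
    deep_a = dense_relu(feat_a, p["W_a"], p["b_a"])
    deep_b = dense_relu(feat_b, p["W_b"], p["b_b"])
    fused = np.concatenate([deep_a, deep_b])
    return sigmoid(p["w_out"] @ fused + p["b_out"])

rng = np.random.default_rng(0)
dim_a, dim_b, hidden = 20, 8, 16  # hypothetical feature and layer sizes
feat_a = rng.normal(size=dim_a)   # placeholder shallow features of molecule A
feat_b = rng.normal(size=dim_b)   # placeholder shallow features of molecule B

early_params = {
    "W_joint": rng.normal(0.0, 0.1, (hidden, dim_a + dim_b)),
    "b_joint": np.zeros(hidden),
    "w_out": rng.normal(0.0, 0.1, hidden),
    "b_out": 0.0,
}
late_params = {
    "W_a": rng.normal(0.0, 0.1, (hidden // 2, dim_a)),
    "b_a": np.zeros(hidden // 2),
    "W_b": rng.normal(0.0, 0.1, (hidden // 2, dim_b)),
    "b_b": np.zeros(hidden // 2),
    "w_out": rng.normal(0.0, 0.1, hidden),
    "b_out": 0.0,
}

p_early = early_fusion(feat_a, feat_b, early_params)
p_late = late_fusion(feat_a, feat_b, late_params)
```

In the late-fusion variant each molecule has its own encoder, so the two branches can use different architectures suited to each input type.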
a graph model is used to describe proteins. Graph neural networks can be utilized to learn from graph inputs [247]. This method has been used to predict protein interfaces [177]. The interface prediction problem can be cast as classifying pairs of nodes from two protein graphs. Following the late-fusion strategy, the features from two GNNs are combined for classification. However, this GNN only deals with one relationship between proteins and cannot handle multimodal relationships from protein–protein, drug–protein, and drug–drug interactions. Therefore, a new graph convolutional neural network, Decagon, is introduced for multi-relational link prediction [176]. Decagon first constructs a multimodal graph encoding drug, protein, and side-effect relationships. It operates directly on this graph with a graph convolutional encoder and a tensor factorization decoder. This approach can be classified as a late-fusion strategy. Since the parameters across multiple edge types are shared, Decagon achieves better performance.
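The idea of classifying pairs of nodes from two protein graphs can be sketched as follows. This is a generic illustration, not the method of [177] or Decagon: the mean-aggregation graph convolution, the random residue graphs, the feature sizes, and every function name are assumptions made for the example, and the weights are untrained.

```python
import numpy as np

def random_graph(n_nodes, rng, edge_prob=0.3):
    """Symmetric 0/1 adjacency matrix without self-loops (toy residue graph)."""
    upper = np.triu((rng.random((n_nodes, n_nodes)) < edge_prob).astype(float), k=1)
    return upper + upper.T

def graph_conv(node_feats, adjacency, W_self, W_neigh):
    """One graph convolution: combine each node with the mean of its neighbors."""
    degree = adjacency.sum(axis=1, keepdims=True)
    neigh_mean = (adjacency @ node_feats) / np.maximum(degree, 1.0)
    return np.maximum(node_feats @ W_self + neigh_mean @ W_neigh, 0.0)

def pair_scores(emb_a, emb_b, w, b):
    """Score every (node in A, node in B) pair from concatenated embeddings."""
    n_a, n_b = emb_a.shape[0], emb_b.shape[0]
    pairs = np.concatenate(
        [np.repeat(emb_a, n_b, axis=0), np.tile(emb_b, (n_a, 1))], axis=1
    )
    logits = pairs @ w + b
    return (1.0 / (1.0 + np.exp(-logits))).reshape(n_a, n_b)

rng = np.random.default_rng(0)
n_a, n_b, in_dim, out_dim = 5, 7, 10, 6  # hypothetical sizes
feats_a, adj_a = rng.normal(size=(n_a, in_dim)), random_graph(n_a, rng)
feats_b, adj_b = rng.normal(size=(n_b, in_dim)), random_graph(n_b, rng)

# Shared weights are used for both graphs in this sketch.
W_self = rng.normal(0.0, 0.1, (in_dim, out_dim))
W_neigh = rng.normal(0.0, 0.1, (in_dim, out_dim))
emb_a = graph_conv(feats_a, adj_a, W_self, W_neigh)
emb_b = graph_conv(feats_b, adj_b, W_self, W_neigh)

w_pair = rng.normal(0.0, 0.1, 2 * out_dim)
scores = pair_scores(emb_a, emb_b, w_pair, 0.0)  # scores[i, j]: node i of A with node j of B
```

Each protein is embedded by its own graph pass and the embeddings are fused only at the pair classifier, which matches the late-fusion pattern described above.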
DeepPPI [25] | MLP with late fusion strategy | ACC; dipeptide composition; composition, transition, and distribution; amphiphilic pseudoamino AAC | Accuracy: 92.50%; Precision: 94.38%; Recall: 90.56%; Specificity: 94.49%; Matthews Correlation Coefficient (MCC): 85.08% | http://ailab.ahu.edu.cn:8087/DeepPPI/index.html | It employs deep neural networks to learn effectively the representations of proteins from common protein descriptors
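The metrics reported in this row are all functions of the binary confusion matrix. As a reminder of how they relate, the following sketch computes them from true/false positive and negative counts. The function name `binary_metrics` is hypothetical, and the counts in the usage lines are invented for illustration; they are not the DeepPPI results.

```python
import math

def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),        # also called sensitivity
        "specificity": _ratio(tn, tn + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Illustrative counts only (assumed, not taken from any cited study).
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when interacting and non-interacting pairs are imbalanced.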
In summary, the deep architectures used for PPI prediction are listed in Table 6. Although both early- and late-fusion strategies are used to predict PPIs, the deep models have become increasingly complex. Specifically, to improve prediction performance, late-fusion strategies adopt complex architectures, including GNNs [177], residual recurrent convolutional networks (RCNNs) [184], and combinations of CNN and LSTM [183].

Protein MS data interpretation

MS-based technologies are powerful tools for studying the ensemble of proteins in cells or organs under different circumstances to gain insight into the functionalities of proteins [197]. Because MS spectra contain much noise and ambiguity, computational proteomics remains a challenge [199]. Motivated by the breakthroughs of deep learning on such problems, models such as MLP, SAE, DCNN, BLSTM, and hybrids of CNN and LSTM have been applied to understand MS data.

Exploiting the highly nonlinear modeling capacity of MLPs, the MLP-based method EDGE improves neoantigen identification using tumor human leukocyte antigen (HLA) peptide MS datasets [185]. EDGE can facilitate the development of neoantigen-targeted immunotherapies for cancer patients. However, because an MLP cannot be used for unsupervised dimension reduction, SAE-based approaches have been proposed to compress MS imaging data [186]. These approaches not only project unseen high-dimensional data nonlinearly onto a low-dimensional space but also enhance the stability of the initial parameters used during fine-tuning across different runs. For fully supervised learning, CNN-based approaches are usually adopted for tumor classification and protein inference. IsotopeNet is a specialized architecture for tumor classification by imaging MS [188]. Compared to ResNet, IsotopeNet is sensitive to a large number of peaks. In addition, a DCNN method called DeepPep uses peptide profiles for protein inference, predicting the protein set from a proteomics mixture [189]. Comparison with leading methods shows that DeepPep has the most robust performance across various instruments and datasets.

Motivated by the capability of BLSTM to model the influences of both the N- and C-terminal amino acids at each cleavage position, pDeep, built from a two-layer BLSTM, was introduced to predict the MS/MS spectra of peptides [190]. Although pDeep can predict the spectra of peptides with high accuracy, it cannot sequence peptides. To realize de novo peptide sequencing given an MS/MS spectrum and the peptide mass, DeepNovo was presented, following the then-trending topic of “automatically generating a description for an image” [192]. DeepNovo couples a spectrum-CNN with an LSTM to learn the amino acid sequence patterns of a peptide in association with the features of its spectrum, and it provides a complete end-to-end training and prediction solution. Furthermore, DeepNovo has been extended to DeepNovo-DIA for data-independent acquisition (DIA) MS data [193]. The key idea of the extended model is to learn features of fragment ions and peptide sequences from DIA MS data.

In summary, the deep architectures used for MS interpretation are listed in Table 7. Although these approaches have achieved state-of-the-art performance in neoantigen identification [187], protein inference [189], peptide MS prediction [185], and peptide sequencing [192, 193], other mechanisms [248, 249] are needed to fuse heterogeneous protein data for protein understanding and scientific studies.

EDGE [185] | MLP and rectified linear unit (ReLU) | Peptide MS data | Average positive predictive value (PPV) at 40% recall: 0.54 | - | Benefiting from deep learning, EDGE achieved improved performance
DeepPep [189] | DCNN with each | Peptide pairs | AUC: 0.80 | https://deeppep. | DeepPep uses

Discussion and future trends

Deep learning has achieved state-of-the-art performance in protein data mining, spanning residue-level prediction, sequence-level prediction, 3D structure data mining, interaction prediction, and protein MS interpretation. However, as shown in Table 8, several challenges should be addressed in the future. These include optimal feature analysis in protein big data, robust deep learning for noisy protein data, network architecture optimization for protein data mining, efficient deep learning with limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding.

Shallow feature selection | High-dimensional features degrade the accuracy and efficiency of deep learning. However, traditional feature selection approaches cannot solve the problems of large-scale instances, high-dimensional features, imbalanced classes, and structured data caused by protein big data | 1) Large-scale feature selection; 2) Feature selection for imbalanced data; 3) Feature selection for structured data

Optimal feature analysis in protein big data

Various types of shallow features extracted from proteins have been adopted by deep learning approaches. Protein data are growing not only in the abundance of patterns (data instances or tuples) but also in the dimensionality of features. Irrelevant or redundant features may significantly degrade the accuracy and efficiency of machine learning algorithms. Selecting the optimal feature subset from protein big data has therefore become an urgent task [250, 251].

Due to the properties of protein big data, existing feature selection methods face demanding challenges in several respects, for example, the speed of data processing, imbalanced data, and the handling of structural features. Traditional feature-selection methods face three challenges with respect to big data: (1) existing methods usually require large amounts of learning time, so it is hard for processing speeds to keep up with changes in big data; (2) traditional methods are mainly influenced by instances from the majority classes, and this bias results in selected features that are unsuitable for predicting rare classes; and (3) most algorithms are designed for generic data and completely ignore the intrinsic structures among features. Current techniques, such as distributed computing [252], graphics processing unit (GPU)-accelerated methods [253], cost-sensitive learning [254], and the least absolute shrinkage and selection operator (lasso) [255], can provide solutions to the above issues in feature selection. However, these methods are highly specific, and how to extract valuable information from protein big data is still an open issue. Additionally, from a systems perspective, it is valuable to construct practical tools or systems for feature selection in the context of protein big data.

Robust deep learning for protein noisy data

Databases of protein properties usually contain both labels supported by experimental evidence and labels without direct experimental evidence [256]. The number of predicted labels is much higher than the number of manual labels. To ensure the reliability of deep learning models, existing approaches use only manually annotated samples. However, these methods are not scalable and risk removing crucial examples that may be significant for small datasets. In addition, removing samples with noisy labels works against the need for large-scale data in deep learning approaches.

To guarantee the convergence and high performance of complex deep models, valuable samples with noisy labels are also utilized for model training. A mechanism for dealing with noisy labels must then be introduced to achieve robust deep learning. There are two strategies for this: robust loss functions and modeling latent labels [257]. The former designs a robust loss function to alleviate the effects of noise, while the latter models latent labels to train the classifier and builds a transition that adapts the latent labels to the noisy labels. For instance, in the robust-loss strategy, predicted labels in the cross-entropy loss can be rectified by a label-correction network trained on an extra clean dataset [258]. In the latent-label strategy, a linear adaptation layer can be adopted to model asymmetric label noise, and this layer can be added on top of a DNN [259]. This encourages the network to learn a “pessimistic” noise model that denoises the corrupted labels during learning. However, the above approaches risk misestimating some labels when they attempt to correct noisy labels or reweight the terms of all the data points. Therefore, other approaches that represent the trustworthiness of noisy labels [257] or adopt semi-supervised learning by concealing the labels of the noisy set [260] are needed to achieve robust deep learning for protein data mining with noisy labels.

the hidden layers of the deep model trained on large-scale labeled data as the initial values. Although this method can perform satisfactorily, performance may be poor when the amount of target domain data is small [267]. In practice, the target domain usually has scant labeled data. To solve this issue, recent approaches use DNNs to construct semi-supervised transfer learning methods, with promising results on several benchmarks [268]. However, most of these experiments are based on models trained from scratch.
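The idea of reusing hidden layers trained on a large source dataset as initial values for a target task can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from [267] or [268]: it uses one ReLU hidden layer, keeps that layer frozen, and trains only a fresh logistic output layer by full-batch gradient descent. It is plain supervised fine-tuning, not the semi-supervised variant mentioned above. The source weights, the target data, and all function names are hypothetical.

```python
import copy
import math

def hidden_forward(x, hidden):
    """ReLU hidden layer: h = max(0, W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(hidden["W"], hidden["b"])]

def predict(x, model):
    """Probability of the positive class for one input vector."""
    h = hidden_forward(x, model["hidden"])
    logit = sum(w * hi for w, hi in zip(model["out_w"], h)) + model["out_b"]
    return 1.0 / (1.0 + math.exp(-logit))

def mean_log_loss(data, model):
    """Average binary cross-entropy over (x, y) pairs."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = predict(x, model)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)

def init_from_source(source_model):
    """Copy the source hidden layer as initial values; start a fresh output layer."""
    n_hidden = len(source_model["hidden"]["W"])
    return {"hidden": copy.deepcopy(source_model["hidden"]),
            "out_w": [0.0] * n_hidden, "out_b": 0.0}

def finetune_output_layer(model, data, lr=0.1, epochs=200):
    """Full-batch gradient descent on the output layer; the hidden layer stays frozen."""
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * len(model["out_w"])
        grad_b = 0.0
        for x, y in data:
            h = hidden_forward(x, model["hidden"])
            err = predict(x, model) - y  # derivative of the loss w.r.t. the logit
            grad_w = [g + err * hi for g, hi in zip(grad_w, h)]
            grad_b += err
        model["out_w"] = [w - lr * g / n for w, g in zip(model["out_w"], grad_w)]
        model["out_b"] -= lr * grad_b / n
    return model

# Hypothetical "pretrained" hidden layer and a tiny labeled target set.
source_model = {"hidden": {"W": [[1.0, -1.0], [-1.0, 1.0]], "b": [0.0, 0.0]}}
target_data = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([1.5, 0.5], 1), ([0.5, 1.5], 0)]

target_model = init_from_source(source_model)
loss_before = mean_log_loss(target_data, target_model)
finetune_output_layer(target_model, target_data)
loss_after = mean_log_loss(target_data, target_model)
```

Freezing the hidden layer is one common choice when the target set is very small, since it limits the number of parameters that a few labeled examples must determine; updating all layers with a small learning rate is the usual alternative when more target data are available.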
20. Hou J, Guo Z, Cheng J. DNSS2: improved ab initio protein secondary structure prediction using advanced deep learning architectures. bioRxiv 2019; (2019):639021.
21. Yang Y, et al. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief Bioinf 2016;19(3):482–94.
22. Jiang Q, et al. Protein secondary structure prediction: a survey of the state of the art. J Mol Graph Model 2017;76:379–402.
structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 2017;33(18):2842–9.
41. Fang C, et al. MUFold-SSW: a new web server for predicting protein secondary structures, torsion angles, and turns. Bioinformatics 2019.
42. Gao J, Yang Y, Zhou Y. Predicting the errors of predicted local backbone angles and non-local solvent-accessibilities of proteins by deep neural networks. Bioinformatics 2016;32(24):3768–73.
sequence features. In: IEEE International Conference on Bioinformatics & Biomedicine, 2017.
61. Zhang Q, Zhu L, Huang D-S. High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE/ACM Trans Comput Biol Bioinform 2018.
62. Savojardo C, et al. DeepSig: deep learning improves signal peptide detection in proteins. Bioinformatics 2017;34(10).
63. Wang S, Ma J, Xu J. AUCpreD: proteome-level protein disorder prediction by AUC-maximized deep convolutional
80. Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 2018;34(19):3308–15.
81. Adhikari B, Hou J, Cheng J. DNCON2: improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 2017;34(9):1466–72.
82. Schaarschmidt J, et al. Assessment of contact predictions in CASP12: co-evolution and deep learning coming of age.
101. Heffernan R, et al. Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning. J Comput Chem 2018;39(26):2210–6.
102. Hanson J, et al. Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics 2016;33(5):685–92.
103. Hanson J, et al. Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long
Data Mining and Information Security. Singapore: Springer, 2019, 29–38.
123. Wang Q, et al. A novel framework for the identification of drug target proteins: combining stacked auto-encoders with a biased support vector machine. PLoS One 2017;12(4):e0176486.
124. Mayr A, et al. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 2018;9(24):5441–51.
143. Zacharaki EI. Prediction of protein function using a deep convolutional neural network ensemble. PeerJ 2017;3:1–17.
144. Amidi A, et al. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation. PeerJ 2018;6:e4750.
145. Torng W, Altman RB. 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinf 2017;18(1):302.
164. Wang Y-B, et al. Predicting protein–protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol Biosyst 2017;13(7):1336–44.
165. Patel S, et al. DeepInteract: deep neural network based protein-protein interaction prediction tool. Curr Bioinforma 2017;12(6):551–7.
166. Zhao Z, Gong X. Protein-protein interaction interface residue pair prediction based on deep learning architecture. IEEE/ACM Trans Comput Biol Bioinform 2017;1–1.
184. Chen M, et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 2019;35(14):i305–i314.
185. Bulik-Sullivan B, et al. Deep learning using tumor HLA peptide mass spectrometry datasets improves neoantigen identification. Nat Biotechnol 2019;37(1):55.
186. Thomas SA, et al. Dimensionality reduction of mass spectrometry imaging data using autoencoders. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE,
206. Stefan S, Markus G, Johannes SD. CCMpred–fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics 2014;30(21):3128.
207. Kabakçıoğlu A, et al. Statistical properties of contact vectors. Phys Rev E Stat Nonlinear Soft Matter Phys 2002;65:041904.
208. Kinjo AR, Horimoto K, Nishikawa K. Predicting absolute contact numbers of native protein structure from amino
226. Finn RD, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 2016;44(Database issue):D279–85.
227. Xia K, Wei GW. Persistent homology analysis of protein structure, flexibility, and folding. Int J Numer Methods Biomed Eng 2014;30(8):814–44.
228. Xia K, et al. Persistent homology for the quantitative prediction of fullerene stability. J Comput Chem 2015;36(6):408–22.
248. Kanezaki A, et al. Deep learning for multimodal data fusion. In: Multimodal Scene Understanding. Elsevier, Pittsburgh: Academic Press, 2019, 9–39.
249. Ramachandram D, Taylor GW. Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Process Mag 2017;34(6):96–108.
250. Rong M, Gong D, Gao X. Feature selection and its use in big data: challenges, methods, and trends. IEEE Access 2019;7:19709–25.
271. Dey S, et al. Predicting adverse drug reactions through interpretable deep learning framework. BMC Bioinf 2018;19(21):476.
272. Murdoch WJ, et al. Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592 2019.
273. Papernot N, McDaniel P. Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765 2018.
274. Koh PW, Liang P. Understanding black-box predictions via influence functions. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017, JMLR.org.