A Structure Based Approach For Accurate Prediction of Protein
Interactions Networks
Keywords: Protein Interactions, Protein Structure, 3D Templates, Protein Interaction Network, Protein Binding Sites.
Abstract: Recent advances in genome sequencing technologies have uncovered an overwhelming number of protein-coding genes, resulting in a deluge of proteomics data. Because proteins are involved in almost every biological activity, this rapid uncovering of biological “facts” has brought the field of Systems Biology to the doorstep of considerable theoretical and practical advances. A precise understanding of proteins, especially of their functional associations or interactions, is essential to explain how complex biological processes occur at the molecular level and how these processes are controlled and modified in different disease states. In this paper, we present a novel protein structure based method to precisely predict interactions between putative protein pairs. We also exploit the interspecies relationship of proteins, i.e., sequence homology, which is crucial when little information is available from other sources of biological data. We further enhance our model to account for protein binding sites by linking individual residues in structural templates that bind to other residues. Finally, we evaluate our model by combining different sources of information using Naive Bayes classification. The proposed model provides substantial improvements in accuracy, precision, and recall compared with previous approaches. We report an accuracy of 90% when tested on a protein interaction network of the yeast proteome.
Rehman, H., Zafar, U., Benso, A. and Islam, N.
A Structure based Approach for Accurate Prediction of Protein Interactions Networks.
DOI: 10.5220/0005705002370244
In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 3: BIOINFORMATICS, pages 237-244
ISBN: 978-989-758-170-0
Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms
[Figure 2 diagram: Query 1 and Query 2 are searched with BLAST to find homolog structures S1 and S2 (PDB structure of each homolog); structural neighbor templates SA1 ... SAi and SB1 ... SBj are then retrieved, giving i x j template pairs in total.]
Figure 2: The general scheme of heterogeneous information integration for our PPI prediction algorithm.
the same molecular functional activity.

The input of our algorithm is a pair of proteins (also called query proteins), say P1 and P2 (in our implementation we used UniProt IDs (The UniProt Consortium, 2015)), whose interaction we want to find or predict. In the first step, since these proteins may be sparsely annotated, we need to associate them through homology with other proteins. To capture ortholog based homolog similarity, we run a single-iteration BLAST (Altschul et al., 1990) search for each query protein Pi against the protein NR database with an E-value cutoff of 0.0001. We selected protein structures (namely S1 and S2, also called model structures) that are highly similar to our query, with the additional constraint that matching PDB structures must have at least 90% sequence identity. Note that this similarity is in sequence only, not in structure. The structures obtained through sequence similarity are then queried against PDB (M. Berman et al., 2000) to obtain their structural details (i.e., atomic coordinates, residue information, etc.).

structures that have similar macromolecular complexes. The macromolecular similarities are evaluated using purely geometric criteria of the molecules, without considering sequence similarity; thus it is able to identify even distant homolog structures. We queried VAST+ with default parameters and a threshold of ten templates, i.e., we select the top ten neighboring templates of each model structure. The structural neighbors are named SAi for model template S1 and SBj for model template S2 (where i, j = 1, 2, ..., 10).

Step 3: Formation of Template Pairs

At this stage, we have 10 structural neighbors for each query protein Pi. To check the overall binding potential for interaction of individual template pairs, we pair each structural neighbor SAi with each SBj (where i, j = 1, 2, ..., 10), i.e., SA1 pairs with SB1, SB2, and so on up to SB10; likewise we repeat the pairing for SA2, SA3, and so on up to SA10. This step results in a total of 100 template pairs.
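The pairing in Step 3 amounts to a Cartesian product of the two neighbor lists. The following is a minimal sketch, assuming the structural neighbors are already available as two lists; the function name `make_template_pairs` and the labels `SA1` ... `SB10` are hypothetical placeholders, not identifiers from our implementation.

```python
from itertools import product


def make_template_pairs(neighbors_a, neighbors_b):
    """Return every (SA_i, SB_j) combination as a list of tuples."""
    return list(product(neighbors_a, neighbors_b))


# Hypothetical placeholder labels standing in for real structure identifiers.
neighbors_a = [f"SA{i}" for i in range(1, 11)]
neighbors_b = [f"SB{j}" for j in range(1, 11)]

pairs = make_template_pairs(neighbors_a, neighbors_b)
print(len(pairs))           # 10 x 10 = 100 template pairs
print(pairs[0], pairs[-1])  # first and last pair
```

With ten neighbors on each side, the list contains every SAi paired with every SBj exactly once.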
and $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)}$ be the random variables that capture different aspects of structural association. The conditional probability that $P_i$ and $P_j$ interact, given the random variables $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)}$, is given by:

\[
\begin{aligned}
P(C_{ij}=1 \mid \phi^{(1)}, \ldots, \phi^{(8)})
&= \frac{P(\phi^{(1)}, \ldots, \phi^{(8)} \mid C_{ij}=1)\, P(C_{ij}=1)}{P(\phi^{(1)}, \ldots, \phi^{(8)})} \\
&= \frac{\prod_{k=1}^{8} P(\phi^{(k)} \mid C_{ij}=1)\, P(C_{ij}=1)}{\prod_{k=1}^{8} P(\phi^{(k)} \mid C_{ij}=1)\, P(C_{ij}=1) + \prod_{k=1}^{8} P(\phi^{(k)} \mid C_{ij}=0)\, P(C_{ij}=0)}
\end{aligned}
\tag{6}
\]

where $P(C_{ij}=1)$ is the prior probability that $P_i$ and $P_j$ interact, $P(\phi^{(1)}, \ldots, \phi^{(8)})$ is the probability that $P_i$ and $P_j$ have the features $\phi^{(1)}, \ldots, \phi^{(8)}$, and $P(\phi^{(1)}, \ldots, \phi^{(8)} \mid C_{ij}=1)$ is the probability that $P_i$ and $P_j$ have these features given that they interact. The second equality uses the Naive Bayes assumption that the features are conditionally independent given $C_{ij}$ and expands the denominator by the law of total probability.

All feature values $\phi^{(1)}, \ldots, \phi^{(8)}$ are normalized, and we bin them so that they lie in known ranges, since many machine learning algorithms, especially Bayes classification, produce better results when continuous attributes are made discrete. Finally, we calculate the value of $P(C_{ij}=1 \mid \phi^{(1)}, \ldots, \phi^{(8)})$ for each protein pair $P_i$ and $P_j$.

interaction network. We chose the interaction network of the HSP75_YEAST protein (UniProt ID: P11484) for our experiment and tried to reproduce it using our proposed algorithm. HSP75_YEAST was chosen first because it is involved in heterogeneous molecular activities, and second because its interaction network contains many experimentally validated interactions, which lets us better evaluate the prediction performance of our algorithm. HSP75_YEAST is a fully reviewed protein in the UniProtKB/Swiss-Prot database (a high-quality, manually curated, non-redundant protein sequence database).

The interaction network of HSP75_YEAST in the IntAct database contained 4,449 interactions as of August 2015. Protein interaction network databases contain false positive interactions, which are a bottleneck both in assessing the overall performance of an algorithm and in judging the statistical significance of the experiments conducted. To deal with this limitation, we filtered the interaction network to include only high-confidence interactions, with the criterion that each interaction in the network must be supported by at least two experimental methods. After filtering, our network was reduced to 1,770 interactions.

We call these high-confidence interactions because each interaction is supported and validated by at least two experimental methods. The interaction network contains proteins from the same species (Saccharomyces cerevisiae) as well as from other species, namely Arabidopsis thaliana, Rattus norvegicus, and Dictyostelium discoideum.
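The posterior of Equation (6) over binned features can be illustrated with a small, self-contained sketch. It is not our implementation: the number of bins (`N_BINS = 4`), the add-one smoothing of the likelihood tables, the function names, and the toy training data are all assumptions made for illustration only.

```python
from collections import Counter

N_FEATURES = 8  # phi(1) ... phi(8)
N_BINS = 4      # assumed number of bins; the paper does not state it


def to_bin(value, n_bins=N_BINS):
    """Map a normalized feature value in [0, 1] to a bin index 0..n_bins-1."""
    return min(int(value * n_bins), n_bins - 1)


def fit(samples, labels, n_bins=N_BINS):
    """Estimate P(C) and P(phi(k) = bin | C) from labeled feature vectors.

    Add-one smoothing is an assumption added here so that unseen bins
    do not produce zero probabilities.
    """
    class_counts = Counter(labels)
    prior = {c: class_counts[c] / len(labels) for c in (0, 1)}
    counts = {c: [Counter() for _ in range(N_FEATURES)] for c in (0, 1)}
    for features, c in zip(samples, labels):
        for k, value in enumerate(features):
            counts[c][k][to_bin(value, n_bins)] += 1
    likelihood = {
        c: [
            {b: (counts[c][k][b] + 1) / (class_counts[c] + n_bins)
             for b in range(n_bins)}
            for k in range(N_FEATURES)
        ]
        for c in (0, 1)
    }
    return prior, likelihood


def posterior_interact(features, prior, likelihood, n_bins=N_BINS):
    """Compute P(C = 1 | features) in the product form of Equation (6)."""
    joint = {}
    for c in (0, 1):
        p = prior[c]
        for k, value in enumerate(features):
            p *= likelihood[c][k][to_bin(value, n_bins)]
        joint[c] = p
    return joint[1] / (joint[1] + joint[0])


# Toy data (fabricated for illustration): high feature values for interacting pairs.
samples = [[0.9] * 8, [0.8] * 8, [0.1] * 8, [0.2] * 8]
labels = [1, 1, 0, 0]
prior, likelihood = fit(samples, labels)

p_high = posterior_interact([0.85] * 8, prior, likelihood)
p_low = posterior_interact([0.15] * 8, prior, likelihood)
print(p_high, p_low)  # p_high is close to 1, p_low is close to 0
```

Multiplying eight small probabilities is harmless here, but with many features a log-space sum would be the safer design choice.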
4.1 Performance Evaluation
[Figure: Comparison with Pre-PPI]
4.2 Cross Validation Analysis of Prediction Accuracy