A Structure based Approach for Accurate Prediction of Protein Interactions Networks

Hafeez Ur Rehman1 , Usman Zafar1 , Alfredo Benso2 and Naveed Islam1,3


1 Department of Computer Science, National University of Computer & Emerging Sciences, Hayatabad, Peshawar, Pakistan.
2 Department of Control & Computer Engineering, Politecnico di Torino, I-10129, Torino, Italy.
3 Department of Computer Science, Islamia College University, Peshawar, Pakistan.

Keywords: Protein Interactions, Protein Structure, 3D Templates, Protein Interaction Network, Protein Binding Sites.

Abstract: Recent advances in genome sequencing technologies have uncovered an overwhelming number of protein-coding genes, resulting in a deluge of proteomics data. Since proteins are involved in almost every biological activity, this rapid uncovering of biological “facts” places the field of Systems Biology on the doorstep of considerable theoretical and practical advancements. A precise understanding of proteins, especially their functional associations or interactions, is essential to explain how complex biological processes occur at the molecular level and how these processes are controlled and modified in different disease states. In this paper, we present a novel protein structure based method to predict whether two putative proteins interact. We also use the interspecies relationship of proteins, i.e., sequence homology, which is crucial when limited information is available from other sources of biological data. We further enhance our model to account for protein binding sites by linking individual residues in structural templates that bind to other residues. Finally, we evaluate our model by combining the different sources of information using Naive Bayes classification. The proposed model provides substantial improvements in accuracy, precision, and recall compared with previous approaches. We report an accuracy of 90% when tested on a protein interaction network of the yeast proteome.

Rehman, H., Zafar, U., Benso, A. and Islam, N.
A Structure based Approach for Accurate Prediction of Protein Interactions Networks.
DOI: 10.5220/0005705002370244
In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 3: BIOINFORMATICS, pages 237-244
ISBN: 978-989-758-170-0
Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

1 INTRODUCTION

Proteins are essential macromolecules involved in almost every biological activity. Our knowledge of new proteins is increasing at a rapid pace as next-generation sequencing technologies uncover new genomes. Knowledge of individual proteins alone is not sufficient, since proteins rarely act in isolation. The overall complexity of biological systems at different levels arises primarily from the combinatorial interactions of the proteins in cells. One of the crucial steps in understanding biological cells as engineered systems is to map the networks of DNA-protein, RNA-protein and protein-protein interactions (PPIs) of a species as completely and accurately as possible. Precise knowledge of protein interactions is also a precondition for fulfilling the promise of preventive and personalized medicine, which means more rational development of antibacterial compounds, drugs, and vaccines.

Conventional wet-lab experiments, e.g., Yeast two-Hybrid (Y2H) screening (Ito et al., 2001), Protein-fragment Complementation Assays (PCA) (N. Pelletier et al., 1999), or co-complex interaction maps (obtained by high-throughput Co-affinity Purification followed by Mass Spectrometry (AP/MS) to identify protein-protein (bait) interactions) (Rigaut et al., 1999; A. Shoemaker and R. Panchenko, 2007a), are slow, costly, or prone to noise because of the nature of these experiments. Moreover, the noise that these experiments introduce into protein interaction databases, together with the deluge of protein data produced by next-generation sequencing technologies, motivates the need for accurate computational techniques that can precisely map protein interactions on a genome-wide scale.

Several computational techniques have been proposed that incorporate a wide variety of data, e.g., phylogenetic profiles, sequence homology, and co-expression of genes, to accurately infer genome-wide protein-protein interactions (A. Shoemaker and R. Panchenko, 2007b; Salwinski and Eisenberg, 2003). However, comparative studies suggest that the development of noise-free protein interaction repertoires of different genomes is still in its early stages (Braun et al., 2009; Deane et al., 2002). The most prominent computational methods that produce high-confidence interactions use protein structural information, e.g., (C. Zhang et al., 2010; Wass et al., 2011). Unfortunately, there is a huge gap between the number of known protein sequences and the number of known structures; even for a well-studied organism such as Saccharomyces cerevisiae, structural information is sparse, i.e., fewer than 10% of proteins have a known structure (Zhang et al., 2012). Moreover, the protein complex information of known PPIs is even sparser.

Fortunately, homology models (see Figure 1) as well as known protein complexes (across species) in well-known databases, e.g., PDB (Protein Data Bank) (M. Berman et al., 2000), present the opportunity to relate sequences of unknown structure to known structures using geometrical features of the individual templates. Approaches incorporating this type of information have shown great success; in such cases, protein structures provide multiple clues that associate the geometric features of individual templates. However, these methods are much less successful on proteins with inconsistent homolog templates (i.e., homolog templates whose geometrical features vary widely and therefore reduce the overall accuracy of the prediction schemes).

Figure 1: A hypothetical protein connected to structurally known proteins using protein homology. [Diagram labels: Hypothetical Protein YKL033W-A (Saccharomyces cerevisiae species); homolog species with known structures: HDHD1 Protein (Homo sapiens) and CG15441 Protein (Drosophila melanogaster).]

In this paper we propose and evaluate a novel approach that combines heterogeneous structural information of proteins and determines their potential for interaction in the form of a probability score. The fundamental conceptual innovation of our method is to connect geometrical features of proteins with protein binding sites, which enhances the algorithm's power and its applicability to heterogeneous homolog templates. Our approach relies on scores (features) obtained by combining diverse sources of biological information: sequence similarity, protein homology, protein binding sites, and geometrical features such as the number of interacting residues, the number of surface residues, etc. These scores are combined using a Bayes classifier, and an overall confidence score is calculated that determines the binding potential of two proteins as an interacting pair.

The remaining part of the paper is organized as follows. In Section 2, we give an overview of the closely related approaches used for the prediction of protein-protein interactions and explain why structure-based approaches stand out from other techniques. In Section 3, we introduce a heterogeneous-information-based Bayesian network model that combines different types of information (i.e., sequence similarity, protein homology, protein binding sites, and geometrical features) to predict protein-protein interactions. Section 4 demonstrates the effectiveness of the proposed model when cross-validated on a subset of interacting and non-interacting proteins in the yeast network. We then discuss the results of our scheme and compare it with the most recent related state-of-the-art structure-based schemes. In Section 5 we present the conclusions of our study and possible future considerations.

2 RELATED WORK

Protein-protein interactions are key to most biological processes. These interactions are orchestrated by molecular mechanisms that are not yet clearly understood. Understanding protein-protein interactions would also provide crucial clues about intracellular signalling pathways. Numerous laboratory experiments have been devised, including yeast two-hybrid systems, mass spectrometry, protein microarrays and others. Unfortunately, experimental techniques have not been able to characterize proteins to a great extent, so our knowledge of protein functions and their interactions remains very limited. This gap is being filled by advances in computational methods.

Since protein sequence is the most basic and most easily available type of information about proteins, many early methods used sequence information to study the co-evolution of proteins. One such method, focused on evolutionary information related to structure and function, was proposed by Valencia and Pazos (2003). It constructs and uses evolutionary relationships among proteins to infer PPIs, on the premise that interacting proteins co-evolve. Another approach was a multiple-classifier system based on protein sequences (F. Xia et al., 2010). They utilized two
classifiers: rotation forest and autocorrelation descriptor. This group tested their system on Saccharomyces cerevisiae and Helicobacter pylori data. A sequence-based approach to PPI prediction has been used by another group with slight variations: they predicted PPIs more precisely from sequence alignments of proteins by using a Bayesian classifier (Burger and V. Nimwegen, 2008). A similar and more recent set of techniques uses only sequence information for PPI prediction, e.g., (Shen et al., 2006; You et al., 2015).

Probabilistic models have been adopted by some researchers for the prediction of protein-protein interactions. In one such case, probabilistic analysis predicted nearly 40,000 interactions in humans (R. Rhodes et al., 2005). This probabilistic model combined interaction data, functional annotation data, protein domain data and genome-wide gene expression data. Probabilistic models have also motivated researchers to model more protein-protein interactions. One study used generative probabilistic models with a bi-clique perspective to model the interaction network of Saccharomyces cerevisiae (Schweiger et al., 2011). It concluded that the naïve unmodified DD (duplication-divergence) model is much more effective than the Preferential Attachment model at capturing key aspects of PPI prediction. Another work employed distant conservation of patterns in protein sequences, also called motifs, and their structural relationships in proteins (Espadaler et al., 2005).

The most recent approaches integrate structural and non-structural information into computational models and use machine learning algorithms, e.g., Bayesian classifiers or Support Vector Machines, to infer interactions of putative proteins. One such work (Zhang et al., 2012) uses structural as well as non-structural features with a Bayesian classifier for PPI prediction on a genome-wide scale. The authors presented results for Saccharomyces cerevisiae and reported that structural features outperform non-structural features by a wide margin in terms of statistical performance measures, i.e., precision, recall, accuracy, false positive rate, etc. The major contribution of that work was the use of structural features and the evaluation of their impact on prediction accuracy. Structural information thus plays a key role in deciphering the underlying mechanism of protein interactions. Therefore, in our work we also mainly integrate structural information of proteins, along with protein binding sites, to predict their associations.

3 METHODS

In our work, we integrate heterogeneous biological information associated with two query proteins and determine their strength of interaction by combining this information in the form of scores using Bayesian statistics. The distinctiveness of our technique comes from the fact that potential interaction information, e.g., protein binding sites (which are strongly associated with molecular interaction), can be combined with geometrical features present in the structural templates of two interacting proteins to decide whether they interact. This combination also increases the power of our algorithm to include structural templates that vary in geometry but contain sites that can bind to other proteins. Our approach relies on scores (features) obtained by combining diverse sources of biological information: sequence similarity, protein homology, protein binding sites, and geometrical features such as the number of interacting residues, the number of surface residues, etc.

The prediction of protein interactions is more challenging for proteins that are not well annotated or whose molecular details are limited. To enhance the predictive power of our automated PPI prediction algorithm, we combine very powerful associative sources of information, namely protein homology and sequence similarity, as a baseline to capture the proteins that are most similar. This is particularly important as each type of data typically captures distinct aspects of associative activity. The overall process of our technique for PPI prediction is divided into seven steps (as shown in Figure 2):

Step 1: Selection of Homolog Sequences

To predict interactions for sparsely annotated proteins, the first useful type of information that can associate them is protein homology. Evolutionary relationships between species suggest that homologous (specifically orthologous) proteins of different species, whose functions were established before the speciation event and which share high sequence similarity, are likely to interact for similar functional activities.

Two proteins are said to be homologous if they share a common ancestor. Sequence information is often used to deduce whether proteins are homologous. If two proteins share high sequence similarity, i.e., above 25% (Benso et al., 2013; Mitrofanova et al., 2011; Benso et al., 2012), they are very likely to be homologous, to have similar structures and, in many cases, to be part of
the same molecular functional activity.

The input of our algorithm is a pair of proteins (also called query proteins), say P1 and P2 (in our implementation we used UniProt IDs (The UniProt Consortium, 2015)), whose interaction we want to predict. Since these proteins may be sparsely annotated, we first need to associate them with other proteins through homology. To capture ortholog-based homolog similarity, we run a single-iteration BLAST (Altschul et al., 1990) search for each query protein Pi against the protein NR database with an E-value cutoff of 0.0001. We select protein structures (namely S1 and S2, also called model structures) that are highly similar to our queries, with the additional constraint that matching PDB structures must have at least 90% sequence identity. Note that this similarity is in sequence only, not in structure. The structures obtained through sequence similarity are then queried in PDB (M. Berman et al., 2000) to obtain their structural details (i.e., atomic coordinates, residue information, etc.).

Figure 2: The general scheme of heterogeneous information integration for our PPI prediction algorithm. [Flow diagram of steps 1-7: (1) Query 1 and Query 2 are searched with BLAST against the NR database to find homolog PDB structures S1 and S2; (2) the VAST+ server finds structural neighbor templates of each homolog structure; (3) templates SA1 ... SAi and SB1 ... SBj form i x j template pairs in total; (4) the Multiprot program detects interacting residue pairs in the structure, together with the PDBe binding sites server; (5) models are built by superimposing templates with homologs; (6) scores are calculated on all 100 template pairs; (7) a Bayesian classifier model gives the Yes/No decision.]

Step 2: Finding Structural Neighbors

In the second step, structural representatives of each model structure, i.e., S1 and S2, are obtained by querying each model structure against the VAST+ (Vector Alignment Search Tool Plus) server (Madej et al., 2013). VAST+ is a tool designed by NCBI (National Center for Biotechnology Information) that uses the Molecular Modeling Database (MMDB) of 3-dimensional structures to find structures that have similar macromolecular complexes. The macromolecular similarities are evaluated using purely geometric criteria, without considering sequence similarity, so VAST+ is able to identify even distant homolog structures. We queried VAST+ with default parameters and a threshold of ten templates, i.e., we select the top ten neighboring templates of each model structure. The structural neighbors are named SAi for model template S1 and SBi for model template S2 (where i = 1, 2, ..., 10).

Step 3: Formation of Template Pairs

At this stage, we have 10 structural neighbors for each query protein Pi. To check the overall binding potential of individual template pairs, we pair each structural neighbor SAi with each SBj (where i, j = 1, 2, ..., 10), i.e., SA1 pairs with SB1, SB2, and so on up to SB10; likewise we repeat the pairing for SA2, SA3, and so on up to SA10. This step results in a total of 100 template pairs.

Step 4: Identification of Interacting Residues and Binding Sites

To evaluate the propensity for interaction of individual template pairs, we first identify the number of interacting residues in the template pairs. For this purpose, we use a tool called Multiprot in a protocol known as PRISM (PRotein Interactions by Structural Matching) (Tuncbag et al., 2011; Shatsky et al., 2004). The Multiprot rationale is based on the fact
that globally different protein structures can interact via chains of architecturally similar residues called motifs. Multiprot thus predicts binding residues by using structural similarity as well as the evolutionary conservation of putative binding residues, also called hot spots. For each template pair, Multiprot calculates the number of interacting residues.

To further strengthen the PPI prediction of our technique, we also use PDBeMotif (Golovin and Henrick, 2008). PDBeMotif is a fast and powerful search tool that facilitates the exploration of binding sites of single proteins or classes of proteins, e.g., pepsin, and locates conserved structural features of individual residues both within the same species and across different species. We employ PDBeMotif to locate the residues that are binding sites in our template pairs.

Step 5: Modeling Structural Templates using Homolog Pairs

In this step, we build an interaction model $\mathrm{Mod}_{ij}$ by superposing the template pair $SA_i$ and $SB_j$ over the model templates $S_1$ and $S_2$. Overall, 100 models are built for the (10 x 10) template pairs. Each model $\mathrm{Mod}_{ij}$ is used to calculate four structure-based scores.

Step 6: Calculating Interaction Scores from Interaction Models

From the 100 interaction models prepared in the previous step, we evaluate and combine the associated information to calculate four scores for each interaction model $\mathrm{Mod}_{ij}$. The scores are based on criteria that make use of interacting residues, binding sites, and sequence information. We name our first score $\xi^{(1)}_{\mathrm{Mod}_{ij}}$, where $\mathrm{Mod}_{ij}$ denotes the interaction model for which the score is calculated. $\xi^{(1)}_{\mathrm{Mod}_{ij}}$ is the number of interacting residues in the template (calculated using Multiprot) that are preserved in the homolog models $S_1$ and $S_2$, i.e., both template and model share those residue pairs. Because templates differ in their amino acid sequences, this score captures the strength of the interaction model in terms of the number of interacting residues preserved when compared with the homolog template pair.

The second score, $\xi^{(2)}_{\mathrm{Mod}_{ij}}$, is the fraction of interacting residues preserved, i.e., $\xi^{(1)}_{\mathrm{Mod}_{ij}}$ divided by the average of the total number of residues in the two homolog templates $S_1$ and $S_2$, as shown in equation 1.

$$\xi^{(2)}_{\mathrm{Mod}_{ij}} = \frac{\xi^{(1)}_{\mathrm{Mod}_{ij}}}{\mathrm{Average}(S_1, S_2)} \qquad (1)$$

The third score, $\xi^{(3)}_{\mathrm{Mod}_{ij}}$, is the same as the first score, with the additional check that the interacting residues in both template and model are also shared by the binding sites retrieved using the PDBeMotif service. It is calculated as shown in equation 2.

$$\xi^{(3)}_{\mathrm{Mod}_{ij}} = \left[\, \xi^{(1)}_{\mathrm{Mod}_{ij}} \cap \text{Binding-Sites}(\mathrm{Mod}_{ij}) \,\right] \qquad (2)$$

Lastly, the final score, $\xi^{(4)}_{\mathrm{Mod}_{ij}}$, is calculated from the binding sites shared by the superimposed template and model pairs, as shown in equation 3. $\xi^{(4)}_{\mathrm{Mod}_{ij}}$ is the number of binding sites in the template that align with binding sites in the model.

$$\xi^{(4)}_{\mathrm{Mod}_{ij}} = \text{Binding-Sites}(S_1, S_2) \cap \text{Binding-Sites}(\mathrm{Mod}_{ij}) \qquad (3)$$

Step 7: PPI Prediction using Bayesian Networks

Once all scores are calculated for the hundred interaction models, we combine their effect by taking the mean and standard deviation of the individual scores, as shown in equations 4 and 5.

$$\phi^{(k)} = \frac{\sum_{i=1}^{10} \sum_{j=1}^{10} \xi^{(k)}_{\mathrm{Mod}_{ij}}}{100}, \quad k \in \{1, 2, 3, 4\} \qquad (4)$$

$$\phi^{(l)} = \sqrt{\frac{\sum_{i=1}^{10} \sum_{j=1}^{10} \left( \xi^{(k)}_{\mathrm{Mod}_{ij}} - \phi^{(k)} \right)^{2}}{100}}, \quad k \in \{1, 2, 3, 4\},\ l \in \{5, 6, 7, 8\} \qquad (5)$$

The standard deviation of the scores captures whether the templates that our method finds differ from each other: when the differences among homologs are spread out, the standard deviation will be high.

Lastly, we use Bayesian classification to combine the mean values and the standard deviations of our scores, captured in eight variables $\phi^{(k)}$, where $k = \{1, 2, \ldots, 8\}$. Let $P_i$ and $P_j$ be the query proteins whose interaction we want to predict
and let $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)}$ be the random variables that capture different aspects of structural association. The conditional probability that $P_i$ and $P_j$ interact, given the values of these random variables, is:

$$P(C_{ij} = 1 \mid \phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)}) = \frac{P(\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)} \mid C_{ij} = 1)\, P(C_{ij} = 1)}{P(\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)})}$$

$$= \frac{\prod_{k=1}^{8} P(\phi^{(k)} \mid C_{ij} = 1)\, P(C_{ij} = 1)}{\prod_{k=1}^{8} P(\phi^{(k)} \mid C_{ij} = 1)\, P(C_{ij} = 1) + \prod_{k=1}^{8} P(\phi^{(k)} \mid C_{ij} = 0)\, P(C_{ij} = 0)} \qquad (6)$$

where $P(C_{ij} = 1)$ is the prior probability that $P_i$ and $P_j$ interact, $P(\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)})$ is the probability that $P_i$ and $P_j$ have the features $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)}$, and $P(\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)} \mid C_{ij} = 1)$ is the probability that $P_i$ and $P_j$ have these features given that they interact.

All feature values $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)}$ are normalized, and we bin them so that feature values lie in known ranges, since many machine learning algorithms, especially Bayes classification, produce better results when continuous attributes are made discrete. Finally, we calculate the value of $P(C_{ij} = 1 \mid \phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(8)})$ for each protein pair $P_i$ and $P_j$.

4 EXPERIMENTAL SETUP AND RESULTS

The integration technique described in the Methods section is evaluated on the task of predicting protein-protein interactions for an interaction network of Saccharomyces cerevisiae proteins. We tested our algorithm on a data set of yeast proteins obtained from the IntAct database [Results-01]. The algorithm fuses probabilities derived from diverse data sources including sequence similarity, protein homology, and protein binding sites, used in combination with other geometrical features. A well-known and powerful classification scheme, Bayes classification, was used to combine mutually independent features (scores).

In this work, we present results of our scheme for a portion of the Saccharomyces cerevisiae interaction network. We chose the interaction network of the HSP75_YEAST protein (UniProt ID: P11484) for our experiment and tried to reproduce it using our proposed algorithm. HSP75_YEAST was chosen firstly because it is involved in heterogeneous molecular activities and secondly because its interaction network contains many experimentally validated interactions. For these reasons, this network allows a better evaluation of the prediction performance of our algorithm. HSP75_YEAST is a fully reviewed protein in the UniProtKB/Swiss-Prot database (a high-quality, manually curated, non-redundant protein sequence database).

The HSP75_YEAST interaction network in the IntAct database contained 4,449 interactions as of August 2015. Protein interaction databases contain false-positive interactions, which are a bottleneck for assessing the overall performance of an algorithm and for judging the statistical significance of the experiments conducted. To deal with this limitation, we filtered the interaction network to include only high-confidence interactions, with the criterion that each interaction must be supported and validated by at least two experimental methods. After filtering, the network was reduced to 1770 interactions. The interaction network contains proteins from the same species (Saccharomyces cerevisiae) as well as from other species, namely Arabidopsis thaliana, Rattus norvegicus, and Dictyostelium discoideum.

4.1 Performance Evaluation

We use a cross-validation approach to estimate the prediction potency of our proposed scheme, i.e., for each protein pair $P_i$ and $P_j$ in the interaction data set, we assumed that their interaction was unknown and then attempted to predict it with our algorithm. Lastly, we compared the predicted interactions with the true interaction set. To assess our methodology, we computed the performance measures precision, recall, accuracy and F1, which are estimated using the following formulas, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$

and

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
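As an illustrative sketch only (not the authors' implementation), the Python code below strings together the naive Bayes posterior of equation 6, the 0.5 decision cutoff used in the cross-validation analysis, and the four measures defined above. The function names, the list-based inputs, and the toy numbers are assumptions made for illustration; the per-feature likelihoods are assumed to have been estimated beforehand from binned training data.

```python
from math import prod


def posterior_interact(likelihood_pos, likelihood_neg, prior_pos):
    """Naive Bayes posterior P(C=1 | phi_1..phi_n), in the form of equation 6.

    likelihood_pos[k] is P(phi_k | C=1) and likelihood_neg[k] is P(phi_k | C=0)
    for the observed (binned) value of feature k. These inputs are hypothetical;
    how they are estimated is outside this sketch.
    """
    numerator = prod(likelihood_pos) * prior_pos
    denominator = numerator + prod(likelihood_neg) * (1.0 - prior_pos)
    return numerator / denominator if denominator > 0 else 0.0


def confusion_counts(probabilities, labels, threshold=0.5):
    """Count TP, FP, TN, FN; a pair is predicted positive if p > threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = p > threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy usage with made-up numbers (two features instead of eight).
p = posterior_interact([0.8, 0.5], [0.2, 0.5], prior_pos=0.5)
counts = confusion_counts([0.9, 0.6, 0.4, 0.2, 0.7], [1, 1, 1, 0, 0])
scores = metrics(*counts)
```

The zero-denominator guards are a design choice of this sketch: they return 0.0 instead of raising when a fold contains no positive predictions or no positive labels.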

4.2 Cross Validation Analysis of Prediction Accuracy

For the interaction network described earlier, we first attempted to predict protein-protein interactions by 10-fold cross validation. For each protein pair Pi and Pj, the probability that Pi interacts with Pj is calculated using equation 6. Predicted interactions with a probability estimate greater than 0.5 were considered positive; otherwise we conclude that the proteins do not interact. By applying our algorithm to the high-confidence interaction network retrieved from IntAct, we obtained an overall accuracy of 90%, recall of 95.2%, precision of 94.11% and an F1 score of 94.49%.

4.3 Comparison with other Approaches

In this section, we compare our method with the most widely used group of techniques, represented by the Pre-PPI algorithm proposed by Q. C. Zhang et al. (Zhang et al., 2012), which combines structural and non-structural information to predict protein-protein interactions. In such methodologies, interactions among proteins are predicted by combining structural clues with non-structural clues using a machine learning algorithm such as Support Vector Machines (SVM) or a Bayesian framework, which assigns a probability score to a protein pair of interest as positively or negatively interacting. Bayesian techniques are at the heart of the overwhelming majority of methods currently used to combine heterogeneous sources of information for PPI prediction. Since Pre-PPI (Zhang et al., 2012) uses a Bayesian technique and structural information to predict PPIs, we compare our algorithm against it.

To obtain the most reliable comparison, we use proteins of the same species, i.e., Saccharomyces cerevisiae, and compare results in a 10-fold cross-validation setting. The results in Figure 3 show that our method performs better than the Pre-PPI method (Zhang et al., 2012) across all reported measures, i.e., precision, recall, accuracy and F1. We observed that, for almost the same accuracy, the Pre-PPI method produced more false-positive and false-negative predictions, which resulted in lower precision and recall, respectively. The improved performance of our algorithm can be attributed to the most important functional clue, protein binding sites, which was further improved by combining it with other structural information to precisely model the interaction activity.

Figure 3: Comparison of Accuracy, Recall, Precision, and F1 measure of proposed scheme with Pre-PPI scheme. Values shown in the chart (in percent):

Method          Accuracy  Recall  Precision  F1
Our Method      90        95.2    94.11      94.49
Pre-PPI Method  89        94      92         92

5 CONCLUSIONS

In this work, we presented a novel approach that uses heterogeneous biological information associated with two query proteins and determines their strength of interaction by combining this information in the form of scores using Bayesian statistics. The distinctiveness of our technique comes from the fact that potential interaction information, i.e., protein binding sites, can be combined with other geometrical features present in the structural templates of two interacting proteins to decide whether they interact. This combination also increases the power of our algorithm to include structural templates that vary in geometry but contain sites that can bind to other proteins. The proposed model provides substantial improvements in accuracy, precision, and recall compared with previous approaches. The proposed scheme may additionally be used in combination with non-structural features to enhance prediction confidence.

ACKNOWLEDGEMENTS

We would like to thank Dr. Omar Usman, Assistant Professor at the National University of Computer & Emerging Sciences, for his valuable comments that greatly improved the manuscript.

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

REFERENCES

Shoemaker, B. A. and Panchenko, A. R. (2007a). Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLOS Comput. Biol., 3(3):e42.
Shoemaker, B. A. and Panchenko, A. R. (2007b). Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLOS Comput. Biol., 3(3):e43.
Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215:403–410.
Benso, A., Di Carlo, S., Ur Rehman, H., Politano, G., Savino, A., and Suravajhala, P. (2012). Using gnome wide data for protein function prediction by exploiting gene ontology relationships. IEEE International Conference on Automation Quality and Testing Robotics (AQTR), pages 497–502. IEEE.
Benso, A., Di Carlo, S., Ur Rehman, H., Politano, G., Savino, A., and Suravajhala, P. (2013). A combined approach for genome wide protein function annotation/prediction. Proteome Science, 11(S1):1–12. ISSN: 1477-5956.
Braun, P. et al. (2009). An experimentally derived confidence score for binary protein-protein interactions. Nature Methods, 6:91–97.
Burger, L. and van Nimwegen, E. (2008). Accurate prediction of protein protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol, 4:165.
Zhang, Q. C., Petrey, D., Norel, R., and Honig, B. (2010). Protein interface conservation across structure space. Proc. Natl Acad. Sci. USA, 107:10896–10901.
Deane, C. M., Salwinski, L., Xenarios, I., and Eisenberg, D. (2002). Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol. Cell. Proteomics, 1:349–356.
Espadaler, J., Romero, O., Jackson, R. M., et al. (2005). Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships. Oxford Journals, 21(16):3360–3368.
Xia, J. F., Han, K., and Huang, D. S. (2010). Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor. Protein Pept Lett, 17(1):137–45.
Golovin, A. and Henrick, K. (2008). MSDmotif: exploring protein sites and motifs. BMC Bioinformatics, 9:1–11. Springer-Verlag Berlin Heidelberg.
Ito, T., Chiba, T., Ozawa, R., et al. (2001). A comprehensive analysis of protein protein interactions in Saccharomyces cerevisiae. Proc Natl Acad Sci USA, 98:4569–74.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000). The protein data bank. Nucleic Acids Research, 28:235–242.
Madej, T., Lanczycki, C. J., Zhang, D., Thiessen, P. A., Geer, R. C., Bauer, A. M., and Bryant, S. H. (2013). MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res., 42(D1):D297–D303. [PubMed PMID: 24319143].
Mitrofanova, A., Pavlovic, V., and Mishra, B. (2011). Prediction of protein functions with gene ontology and interspecies protein homology data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(3):775–784.
Pelletier, J. N., Arndt, K., Pluckthun, A., et al. (1999). An in vivo library versus library selection of optimized protein protein interactions. Nat Biotechnol, 17:683–90.
Rhodes, D. R., Tomlins, S. A., and Varambally, S. (2005). Probabilistic model of the human protein-protein interaction network. Nature Biotechnology, 23:951–959.
Rigaut, G., Shevchenko, A., Rutz, B., et al. (1999). A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol, 17:1030–32.
Salwinski, L. and Eisenberg, D. (2003). Computational methods of analysis of protein protein interactions. Curr. Opin. Struct. Biol., 13:377–382.
Schweiger, R., Linial, M., and Linial, N. (2011). Generative probabilistic models for protein-protein interaction network the biclique perspective. Oxford Journals, Volume 27.
Shatsky, M., Nussinov, R., and Wolfson, H. J. (2004). A method for simultaneous alignment of multiple protein structures. PROTEINS: Structure, Function, and Bioinformatics, 56:143–156.
Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., et al. (2006). Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences, 104:4337–4341.
The UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Res., 43:D204–D212.
Tuncbag, N., Gursoy, A., Nussinov, R., and Keskin, O. (2011). Predicting protein-protein interactions on a proteome scale by matching evolutionary and structural similarities at interfaces using PRISM. Nature Protocols, 6(9):1341–1354.
Valencia, A. and Pazos, F. (2003). Prediction of protein-protein interactions from evolutionary information. Methods Biochem Anal, 44:411–26.
Wass, M., Fuentes, G., Pons, C., Pazos, F., and Valencia, A. (2011). Towards the prediction of protein interaction partners using physical docking. Mol. Syst. Biol., 7:469.
You, Z. H., Chan, K. C. C., and Hu, P. (2015). Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS ONE, 10(5).
Zhang, Q. C., Petrey, D., et al. (2012). Structure based prediction of protein-protein interactions on a genome wide scale. Nature, 490(7421):556–60.
