This article can be cited before page numbers have been issued, to do this please use: H. Ur Rehman, I.
Bari, A. Ali and H. Mahmood, Mol. BioSyst., 2017, DOI: 10.1039/C7MB00484B.

This is an Accepted Manuscript, which has been through the Royal Society of Chemistry peer review process and has been accepted for publication.
Molecular BioSystems Accepted Manuscript
Published on 05 October 2017. Downloaded by Freie Universitaet Berlin on 12/10/2017 08:52:32.

A Bayesian Approach for Estimating Protein-Protein Interactions by Integrating Structural and Non-Structural Biological Data†


Hafeez Ur Rehman,∗a Inam Bari,b Anwar Ali,b and Haroon Mahmoodc

Accurate elucidation of genome-wide protein-protein interactions is crucial for understanding the regulatory processes of the cell. High-throughput techniques, such as the yeast two-hybrid (Y2H) assay, co-immunoprecipitation (co-IP), mass spectrometric (MS) protein complex identification, and affinity purification (AP), are generally relied upon to determine protein interactions. Unfortunately, each type of method is inherently subject to a different type of noise and yields false positive interactions. On the other hand, a precise understanding of proteins, especially knowledge of their functional associations, is indispensable for understanding how complex molecular machines function. To solve this problem, computational techniques are generally relied upon to precisely predict protein interactions. In this work, we present a novel method that combines structural and non-structural biological data to precisely predict protein interactions. The conceptual novelty of our approach lies in identifying and precisely associating biological information that provides substantial interaction clues. Our model combines structural and non-structural information using Bayesian statistics to calculate the likelihood of each interaction. The proposed model is tested on Saccharomyces cerevisiae interactions extracted from the DIP and IntAct databases and provides substantial improvements in terms of accuracy, precision, recall and F1 score compared with the most widely used related state-of-the-art techniques.

a Department of Computer Science, FAST National University of Computer & Emerging Sciences, Peshawar, Pakistan. Tel: +92-111-128-128 (Ex: 144); E-mail: [email protected]
b Department of Electrical Engineering, FAST National University of Computer & Emerging Sciences, Peshawar, Pakistan.
c Department of Computer Science, FAST National University of Computer & Emerging Sciences, Lahore, Pakistan.
† Electronic Supplementary Information (ESI) available: [Supplementary Data.pdf file]. See DOI: 10.1039/b000000x/

1 Introduction
Proteins are the most essential macromolecules and are involved in almost every biological activity. Our knowledge of new proteins is increasing at a rapid pace as next-generation sequencing technologies uncover new genomes. Knowledge of proteins alone is not sufficient, since proteins rarely act in isolation. The overall complexity of biological systems at different levels arises primarily from the combinatorial interactions among the proteins in the cell. One of the crucial steps for understanding biological cells as engineered systems is to map the networks of DNA-protein, RNA-protein and protein-protein interactions (PPIs) of a species as completely and accurately as possible. Precise knowledge of protein interactions is also a precondition for fulfilling the promise of preventive as well as personalized medicine, which leads to more rational development of antibacterial compounds, drugs, vaccines, etc.

Conventional wet-lab experiments, e.g., yeast two-hybrid (Y2H) [1] screening, protein-fragment complementation assays (PCA) [2], or "co-complex" interaction maps (obtained by high-throughput co-affinity purification followed by mass spectrometry (AP/MS) to identify protein-protein (bait) interactions) [3,4], are either slow, costly or prone to noise because of the nature of these experiments. Moreover, the inherent noise present in existing protein interaction databases, as well as the overwhelming amount of proteomic data produced by next-generation sequencing technologies, motivates the need for accurate computational techniques that can precisely map the interactions of proteins on a genome-wide scale.

Several computational techniques have been proposed in the past that incorporate a wide variety of data, e.g., phylogenetic profiles, sequence homology, and co-expression of genes, to accurately infer genome-wide protein-protein interactions [5–8]. However, comparative studies suggest that the development of noise-free protein interaction repertoires of different genomes is still in its early stages [9,10]. The most prominent computational methods that produce high-confidence interactions utilize structural information of proteins, e.g., [11,12]. But unfortunately, there is a huge
difference between the number of known protein sequences and the number of corresponding known structures; even for a well-studied organism such as Saccharomyces cerevisiae, the known structural information is sparse, i.e., less than 10% of proteins have a known structure [13]. Moreover, the protein complex information of known PPIs is even sparser.

Fortunately, homology models (see Figure 1) as well as known protein complexes (across species) in well-known databases, e.g., PDB (Protein Data Bank) [14], present the opportunity to relate sequences of unknown structure to known structures using geometrical features of the individual templates. Approaches incorporating this type of information have shown great success [11]; in such cases the protein structures carry multiple clues that associate the geometric features of individual templates. However, these methods exhibit much less success on proteins with inconsistent homolog templates (i.e., homolog templates whose geometrical features vary widely and therefore affect the overall accuracy of the prediction algorithm).

[Figure 1: the hypothetical protein YKL033W-A of Saccharomyces cerevisiae (a protein with unknown structure) linked to homolog proteins with known structures: the HDHD1 protein (Homo sapiens) and the CG15441 protein (Drosophila melanogaster).]
Fig. 1 A hypothetical protein of Saccharomyces cerevisiae connected to structurally known proteins of other species through sequence homology.

1.1 The Elusive Nature of PPIs
The exact modeling of protein-protein interactions is a daunting task because of the number of factors contributing to the formation of these interactions. Fundamentally, protein-protein interactions can be classified as stable or transient, and each type of interaction can further be either strong or weak. Stable protein-protein interactions are those associated with proteins that have been purified as multi-subunit complexes; the subunits of these complexes can be either closely related or diverse. Examples of such multi-subunit complexes are hemoglobin and core RNA polymerase, in which the subunit interactions result in stable complexes.

On the other hand, transient interactions are expected to control the larger part of the biological processes occurring in the cell. As the name suggests, transient associations are temporary in nature and are caused by certain molecular conditions that enable the interaction to occur. Examples of such conditions are conformational changes, phosphorylation and localization to discrete zones of the cell. These interactions can be strong or weak, and fast or slow. While in contact with their interactors, transiently interacting proteins are involved in a wide variety of molecular-level processes, including protein transport, modification, folding, signaling, apoptosis and cell cycling.

At the molecular level, proteins interact with each other through a variety of forces, including van der Waals interactions, hydrophobicity, and substrate binding at particular amino acid positions on each protein. These binding regions can be small binding clefts or extensive surfaces and can be only a few peptides long or span many amino acids, and the strength of the interaction is affected by the number of amino acids participating in the bonding. A typical surface domain that facilitates stable interactions among proteins is the leucine zipper. The leucine zipper consists of alpha-helices that bind to each other in a parallel fashion through the hydrophobic bonding of regularly spaced leucine residues on each alpha-helix. Due to the tight molecular packing, leucine zippers give a stable conformation to the protein complexes formed. However, not all leucine zippers bind identically, because non-leucine amino acids present in the alpha-helices can affect the arrangement of molecules as well as the strength of molecular bonding among proteins.

A well-known example of transient interaction involves the two Src homology (SH) domains (SH2 and SH3), which commonly interact with short amino acid chains and are found in signaling pathways. The SH2 domain recognizes sequences with phosphorylated tyrosine residues, which are always present to start the protein activation process. SH2 domains play a vital part in a number of processes; one example is growth factor receptor signaling, in which downstream proteins with SH2 domains recognize tyrosine residues phosphorylated through ligand-controlled receptor phosphorylation. On the other hand, many types of proteins, e.g., phospholipases, kinases and GTPases, recognize their target proteins using the SH3 domain, which binds proline-rich sequences in the target proteins.

Both SH2 and SH3 domains tend to bind their respective types of motifs; however, the specificity of protein interactions comes from the particular arrangement of neighboring amino acid residues in each motif. The complexity of the molecular mechanics of PPIs described so far makes the precise modeling of interactions an uphill task. For the evaluation of our work we considered both stable and transient interactions (with varying degrees of bonding), retrieved from both the IntAct and DIP databases.

In this paper, we propose and evaluate a novel approach that combines heterogeneous information about proteins and determines their likelihood of interaction in the form of a probability score. The fundamental conceptual innovation of our method is to connect geometrical features of proteins (extracted from structural homolog templates) with protein binding sites, which enhances the algorithm's power as well as its applicability to heterogeneous homolog templates. In addition to the structural information of proteins, we also incorporate non-structural information, namely co-complex similarity, gene ontology similarity and motif similarity. All these types of information are strongly linked with
the functions of proteins and contribute significantly to the accurate identification of potential protein-protein interactions. Our new approach relies on scores (features) obtained by combining both structural and non-structural sources of biological information. These scores are combined using a Bayes classifier, and an overall confidence score is calculated that determines the tendency of two proteins to form an interacting pair.

The remaining part of the paper is organized as follows. In Section 2, we first give an overview of the closely related approaches used for the prediction of protein-protein interactions, with an explanation of why hybrid approaches stand out from other techniques. We then introduce, in Section 3, a heterogeneous information based Bayesian network model that combines different types of structural information (i.e., sequence homology, protein binding sites, geometrical features, etc.) with non-structural information to predict protein-protein interactions. Section 4 demonstrates the effectiveness of the proposed model when applied to the protein interaction network of Saccharomyces cerevisiae proteins extracted from the two most widely used interaction databases, namely the DIP database [15] and the IntAct database [16]. Next, we discuss in detail the effectiveness of our model by evaluating different performance measures. We then compare the performance of our method with the most widely used related state-of-the-art techniques for protein interaction prediction. Lastly, in Section 5 we present the conclusion of our study with possible future directions.

The code (Python/Bash/Java) along with instructions to reproduce this work is available from https://tinyurl.com/ydyq4ez3.

2 Background
Protein-protein interactions are key to almost all biological processes. Our knowledge of PPIs is limited mainly because of our lack of understanding of the underlying mechanisms that govern these interactions. On the other hand, a profound knowledge of protein-protein interactions is important for a deeper understanding of biological processes, since it is through these interactions at the molecular level that proteins cooperate to perform important tasks such as signal transduction, transport of small molecules, and regulation and organization of many different types of cellular processes.

Numerous experimental techniques have been devised in the past to detect protein-protein interactions, such as yeast two-hybrid (Y2H) [1] screening, protein-fragment complementation assays (PCA) [2], or "co-complex" interaction maps (obtained by high-throughput co-affinity purification followed by mass spectrometry (AP/MS) to identify protein-protein (bait) interactions) [3,4]. Unfortunately, experimental techniques have not been able to characterize protein interactions to a great extent. In addition, these methods are prone to different types of noise, resulting in many false positive interactions. Due to these limitations, our knowledge of protein functions as well as their interactions is very limited. This limited contribution from experimental techniques, and the resulting gap in knowledge about protein interactions, is being supplemented by the advancement of computational methods.

An important type of information that computational methods employ to predict PPIs is the amino acid sequence. A protein's amino acid sequence is the most basic as well as the most abundantly available type of information, and it encodes a number of characteristics of a protein. Many early methods focused on the use of sequence information to study the co-evolution of proteins [17–20]. One such method, focused on evolutionary information related to structure and function, was proposed by [17]. This method constructs and utilizes evolutionary relationships among proteins to infer protein-protein interactions.

Another group of methods uses sequence information to construct a multiple-classifier system by means of rotation forest and autocorrelation descriptors [21,22]. They combined rotation forest and the autocorrelation descriptor to enhance the prediction accuracy of PPIs, and reported improvements in accuracy for the interaction networks of Saccharomyces cerevisiae and Helicobacter pylori proteins. Another group of researchers used sequence information from the sequence alignment perspective to predict PPIs [23]. They used a Bayesian classifier to predict PPIs by integrating the mutual alignment scores of potentially interacting proteins. A similar technique utilized sequence information only for PPI prediction [24].

With the availability of complete protein sequences of many model genomes, several in-silico methods were devised that cluster proteins (using potentially supporting interaction information) into networks that provide insights into the functional association of uncharacterized proteins. One such technique was proposed by [25], in which the author uses non-structural features, namely phylogenetic profiles and gene ontology based similarity, to predict the PPI network of Saccharomyces cerevisiae.

Apart from sequence information, some researchers use quite diverse formulations, typically combining and modifying well-understood concepts from the fields of probability, graph theory, graphical models, etc., for the prediction of PPIs. In [26], the authors adopted mathematical probabilistic models for the prediction of protein-protein interactions. The authors accurately predicted about 40,000 interactions for the human interactome [26]. The probabilistic model was based on properties derived from interactome data, gene ontology based scores, protein domain data and genome-wide gene expression data.

Many researchers have shown the use of probabilistic models to model protein-protein interactions. One such work was proposed by [27], in which the authors use generative probabilistic models with the biclique properties of graphs to model the interaction network of Saccharomyces cerevisiae. This work concluded that the naive unmodified DD (duplication–divergence) model is much more effective than the preferential attachment model at capturing key aspects of PPI prediction. Another group of researchers [28] developed a system that provides accurate predictions of protein-protein interactions and ranks them using a probabilistic approach. The study utilizes co-expression information as well as orthology information of proteins that are known to interact. They further use Bayesian classification to combine this information with sub-cellular localization, domain co-occurrence and post-translational modification data to infer PPIs.

Another important type of information that is more relevant
to PPIs is the protein domain information. In the beginning, researchers focused on a single domain-single function association and used domains with this preconception to infer interactions. With the advancement of protein domain knowledge, it became clear that one domain can perform multiple functions and, vice versa, a protein can have multiple domains and thus many different types of interactions. One group of researchers [29] utilized multiple domains of a protein for inferring protein interactions. Due to the success of this pioneering work, newer methods used multiple domains for PPI prediction.

The structural conformation of proteins is also strongly associated with their functions and can provide insights into potential interactions. The researchers in [30] combine protein sequence and structure information to predict PPIs. In this work, the authors utilize distantly conserved patterns in protein sequences and compute their structural relationships in proteins. Another group of researchers employed protein structures for the prediction of PPIs [31]. This work used statistical information of protein structures and built a hypothesis using SVM (Support Vector Machines) to decide whether a pair of proteins interacts or not. The researchers in [32] used geometrical features of a protein structure, e.g., conservation of short sequences in interacting structural interfaces, to infer putative PPIs.

The most recent approaches, which integrate structural and non-structural types of information into computational models and use machine learning algorithms, e.g., a Bayesian classifier or SVM (Support Vector Machines), to infer interactions of putative proteins, have reported extraordinary results. One such recent work is [13], which utilizes structural as well as non-structural types of features and combines them using a Bayesian classifier for the prediction of PPIs on a genome-wide scale. The authors of this study presented their results for Saccharomyces cerevisiae and reported that an integrated model outperforms models based on a single type of information (i.e., either structural or non-structural) by a great margin in terms of statistical performance measures, i.e., precision, recall, accuracy, false positive rate, etc. The major contribution of this work was the use of hybrid features (i.e., both structural and non-structural) and the evaluation of their impact on the prediction accuracy. Thus hybrid information about proteins plays a key role in deciphering the underlying mechanism of protein interactions, because each type of information captures either a local or a global aspect of a protein's activity. Therefore, in our work we also integrate structural and non-structural types of information, in a unique way, for putative PPI prediction.

3 Methods
In our work, we build a technique that integrates structural and non-structural types of information by associating it with the two query proteins. We utilize this information for the query proteins to determine their interaction tendency in the form of a likelihood score using Bayesian statistics. The uniqueness of our technique comes from the fact that potential interaction information, e.g., protein binding sites (which are strongly associated with molecular interaction), can be combined with geometrical features present in the structural templates of two interacting proteins to decide whether they interact or not. In addition, we also check the non-structural association by calculating a similarity score based on interaction-specific information, i.e., gene ontology terms, shared motifs and co-complex similarity. This aggregation also increases the power of our algorithm to include structural templates that vary in geometry but contain sites that can bind to other proteins. Another important aspect of our proposed approach is that the identified interaction information is easily available for a wide range of even uncharacterized proteins, which broadens its applicability.

The prediction of protein interactions is more challenging for proteins that are not well annotated or whose molecular details are limited. Keeping this in view, we combine very powerful and varied associative biological information sources, namely protein sequence and structural homology, as a baseline to capture the proteins that are most similar. The diverse sources of biological information, including sequence similarity, protein homology, protein binding sites, and geometrical features such as the number of interacting residues and the number of surface residues, are combined for the prediction of protein interaction networks. Capturing varied aspects of molecular activity is particularly important, as each type of data typically associates distinct aspects of molecular activity. The overall integration process of our technique for PPI prediction is divided into the following seven steps (as shown in Figure 2):

Step 1: Homolog Structures:
To predict the interactions of sparsely annotated proteins, the first useful type of information that can associate them is protein homology information. Evolutionary relationships between species suggest that homolog (specifically orthologous) proteins of different species, whose functions were established before the speciation event and which share high sequence similarity, are likely to interact for similar functional activities.

Two proteins are said to be homologous if they share a common ancestor. To detect homology, sequence information is often used to deduce whether proteins are homologous or not. If two proteins share high sequence similarity, i.e., above 25% sequence similarity [33–35], they are very likely to be homologous, to have similar structures and, in many cases, to be part of the same molecular functional activity.

The input of our algorithm is a pair of proteins (also called query proteins), say P1 and P2 (in our implementation we used Uniprot IDs [36]), whose interaction information we want to find or predict. In the first step, since these proteins can possibly be sparsely annotated, we need to associate them with other proteins through homology. To capture ortholog-based homolog similarity, we run a single-iteration BLAST [37] search for each query protein Pi against the NR protein database with an E-value cutoff of 0.0001. We selected protein structures (namely, S1 and S2; also called model structures) that are highly similar to our query, with the additional constraint that matching PDB structures should have at least 90% sequence identity. These structures are then queried against PDB [14] to obtain their structural details (i.e., atomic coordinates, residue information, etc.).
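The hit filtering in Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the BLAST hits have already been exported in the standard 12-column tabular format (`-outfmt 6`), and the function name `select_model_structures` and the sample hit identifiers are hypothetical. It keeps only hits that satisfy the E-value cutoff of 0.0001 and the 90% sequence identity constraint described above.

```python
from typing import Iterable, List, Tuple

# Column positions in BLAST tabular output (-outfmt 6):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
SSEQID, PIDENT, EVALUE = 1, 2, 10


def select_model_structures(
    hit_lines: Iterable[str],
    max_evalue: float = 1e-4,     # E-value cutoff stated in Step 1
    min_identity: float = 90.0,   # minimum percent identity stated in Step 1
) -> List[Tuple[str, float, float]]:
    """Return (subject id, percent identity, E-value) for hits passing both cutoffs."""
    selected = []
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        identity = float(fields[PIDENT])
        evalue = float(fields[EVALUE])
        if evalue <= max_evalue and identity >= min_identity:
            selected.append((fields[SSEQID], identity, evalue))
    # Best hits first: highest identity, then lowest E-value.
    selected.sort(key=lambda hit: (-hit[1], hit[2]))
    return selected


# Hypothetical hits for one query protein (identifiers are made up).
example_hits = [
    "P1\tpdb|1ABC|A\t95.2\t210\t10\t0\t1\t210\t1\t210\t1e-120\t400",
    "P1\tpdb|2XYZ|B\t91.0\t205\t18\t1\t3\t207\t2\t206\t3e-90\t310",
    "P1\tpdb|3DEF|A\t45.0\t190\t100\t4\t5\t194\t8\t197\t2e-20\t95",
    "P1\tpdb|4GHI|C\t97.0\t30\t1\t0\t1\t30\t1\t30\t0.5\t25",
]
model_structures = select_model_structures(example_hits)
```

In this example the third hit is rejected for low identity and the fourth for a weak E-value, so only the first two hits remain as candidate model structures. Sorting by identity is a design choice of this sketch; the text does not say how ties between qualifying structures are resolved.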
Page 5 of 11 Molecular BioSystems
View Article Online
DOI: 10.1039/C7MB00484B

1 2 3
Query P1 Templates
NT NT1 NT2 . . . . . . . . NTi MT1
BLAST NT1
MT1 MT2 MTj
Query P 2 ........
MT For i, j Templates
i x j Template Pairs
in Total

Homolog Finds Structural


Input Protein Pairs PDB Neighbor Templates

Molecular BioSystems Accepted Manuscript


Structures of each Homolog
Structure
Published on 05 October 2017. Downloaded by Freie Universitaet Berlin on 12/10/2017 08:52:32.

NR Database VAST+ Server


8 6 5 4
Model by Superimposing Multiprot
Program to
Calculate Templates with Detect
Combine using Bayesian

Interacting
Scores Homologs Residue Pairs in
the Structure
on all 100
Template
Yes/No Decision

Model PDBe Binding


Classifier

Pairs sites server

7
Non Structural Features
Gene
Template s Template s
Ontoglogy
Shared Motif Co-Complex
(GO)
Similarity Similarity
Similarity
Score Score
Score

Fig. 2 The general scheme of heterogeneous information integration for our PPI prediction algorithm. The input of the algorithm is a pair of proteins
(say P1 and P2 ) from the yeast interactome whose interaction we want to predict. The algorithm operates in eight major steps to combine both structural
and non-structural information using Bayesian classifier and results in either positive or negative interaction.

Step 2: Structural Neighbors: pairs.


In the second step, structural representatives of each model struc-
ture i.e. NT and MT, were taken directly by querying each
model structure to VAST+ (Vector Alignment Search Tool Plus) Step 4: Interacting Residues and Binding Sites:
Server 38. VAST+ is a tool designed by NCBI (National Center for Biotechnology Information) that uses the Molecular Modeling Database (MMDB) of 3-dimensional structures to find structures with similar macromolecular complexes. Macromolecular similarity is evaluated using purely geometric criteria, without considering sequence similarity, so the tool can identify even distantly homologous structures. We queried VAST+ with default parameters and a threshold of ten templates, i.e., we select the top ten neighboring templates of each model structure. The structural neighbors are named $NT_i$ for model template NT and $MT_j$ for model template MT (where i, j = 1, 2, ..., 10).

Step 3: Formation of Template Pairs:
At this stage, we have 10 structural neighbors for each query protein $P_i$. To check the overall strength of interaction of individual template pairs, we pair each structural neighbor $NT_i$ with each $MT_j$ (where i, j = 1, 2, ..., 10), i.e., $NT_1$ is paired with $MT_1$, $MT_2$, and so on up to $MT_{10}$, and the pairing is then repeated for $NT_2$, $NT_3$, up to $NT_{10}$. This step results in a total of 100 template pairs.

As a first step, to evaluate the propensity for interaction of individual template pairs, we identify the number of interacting residues in each template pair. For this purpose, we use a tool called Multiprot in a protocol known as PRISM (PRotein Interactions by Structural Matching) 39,40. The Multiprot rationale is based on the fact that globally different protein structures can interact via chains of architecturally similar residues called motifs. Multiprot therefore predicts binding residues by using structural similarity as well as evolutionary conservation of putative binding residues, also called 'hot spots'. For each template pair, Multiprot calculates the number of interacting residues.

To further strengthen the PPI prediction of our technique we also use PDBeMotif 41. PDBeMotif is a fast search tool that facilitates the exploration of binding sites of single proteins or classes of proteins (e.g., Pepsin), and locates the conserved structural features of individual residues both within the same species and across different species. We employ PDBeMotif to locate residues that are binding sites in our template pairs.
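The pairing described in Step 3 is a Cartesian product of the two neighbor lists. A minimal Python sketch follows, assuming each structural neighbor is represented only by a placeholder label; the helper name `build_template_pairs` and the string labels are hypothetical and are not taken from the original implementation.

```python
from itertools import product


def build_template_pairs(nt_neighbors, mt_neighbors):
    """Pair every structural neighbor of NT with every structural neighbor of MT."""
    return list(product(nt_neighbors, mt_neighbors))


# Placeholder labels standing in for the ten VAST+ neighbors of each model template.
nt_neighbors = [f"NT{i}" for i in range(1, 11)]
mt_neighbors = [f"MT{j}" for j in range(1, 11)]

template_pairs = build_template_pairs(nt_neighbors, mt_neighbors)
print(len(template_pairs))  # 100
print(template_pairs[0])    # ('NT1', 'MT1')
```

Each of the 100 tuples corresponds to one template pair that is later scored.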


Step 5: Model of Structural Templates & Homolog Pairs:
In this step, we build an interaction model $Mod_{ij}$ by superposing the template pairs $SA_i$ and $SB_j$ over the model templates $S_1$ and $S_2$. Overall, 100 models are built for the (10 × 10) template pairs. Each model $Mod_{ij}$ is used to calculate four structure-based scores.

Step 6: Structure based Features of Superimposed Interaction Models:
From the 100 interaction models prepared in the previous step, we evaluate and combine the associated information to calculate four scores for each interaction model $Mod_{ij}$. The scores are based on criteria that make use of interacting residues, binding sites and sequence information. We name our first score $\Upsilon^{(1)}_{Mod_{ij}}$ (read as Upsilon 1), where $Mod_{ij}$ denotes the interaction model for which the score is calculated. $\Upsilon^{(1)}_{Mod_{ij}}$ is calculated by taking into account the number of interacting residues in the template (calculated using Multiprot) that are preserved in the homolog models $NT_1$ and $MT_1$, i.e., both template and model share those residue pairs. Because templates vary in their amino acid sequences, this score captures the strength of the interaction model in terms of the number of interacting residues preserved when compared with the homolog template pair.

The second score of our model is called $\Upsilon^{(2)}_{Mod_{ij}}$ and is estimated by taking the total number of interacting residues preserved, i.e., $\Upsilon^{(1)}_{Mod_{ij}}$, divided by the average of the total number of residues in both homolog templates, i.e., $NT_1$ and $MT_1$, as shown in equation 1.

$$\Upsilon^{(2)}_{Mod_{ij}} = \frac{\Upsilon^{(1)}_{Mod_{ij}}}{\mathrm{Average}(NT_1, MT_1)} \qquad (1)$$

The third score $\Upsilon^{(3)}_{Mod_{ij}}$ is the same as the first score, with the additional check that the interacting residues, both in template and model, are also shared by the binding sites retrieved using the PDBeMotif service. It is calculated as shown in equation 2.

$$\Upsilon^{(3)}_{Mod_{ij}} = \left| \Upsilon^{(1)}_{Mod_{ij}} \cap \text{Binding-Sites}(Mod_{ij}) \right| \qquad (2)$$

Lastly, the final score of our technique, $\Upsilon^{(4)}_{Mod_{ij}}$, is calculated by taking the shared binding sites in the superimposed template and model pairs, as shown in equation 3. $\Upsilon^{(4)}_{Mod_{ij}}$ is the number of binding sites in the template that align to binding sites in the model.

$$\Upsilon^{(4)}_{Mod_{ij}} = \left| \text{Binding-Sites}(NT_1, MT_1) \cap \text{Binding-Sites}(Mod_{ij}) \right| \qquad (3)$$

Once all scores are calculated for the hundred interaction models, we combine their effect into one score by taking the mean and standard deviation of the individual scores, as shown in equations 4 and 5.

$$F^{(k)} = \frac{\sum_{i=1}^{10} \sum_{j=1}^{10} \Upsilon^{(k)}_{Mod_{ij}}}{100}, \quad \text{for } k = \{1, 2, 3, 4\} \qquad (4)$$

$$F^{(l)} = \sqrt{\frac{\sum_{i=1}^{10} \sum_{j=1}^{10} \left( \Upsilon^{(k)}_{Mod_{ij}} - F^{(k)} \right)^{2}}{100}}, \quad \text{for } k = \{1, 2, 3, 4\} \text{ and } l = \{5, 6, 7, 8\} \qquad (5)$$

The standard deviation of the scores captures whether the templates that our method finds differ from each other: when the scores of the homologs are spread out, the standard deviation is high.

Step 7: Non Structural Features:
To increase the prediction confidence we also integrate non-structural clues that are strongly related to interactions (as previously used by ref. 13).

Proteins interact at the molecular level to perform a range of functional activities. The functional activities of proteins can therefore be used as a strong clue to predict whether they interact. The first non-structural clue of our algorithm, $F^{(9)}$, is based on the average gene ontology similarity between the homolog structures of each query protein, and is defined as:

$$F^{(9)} = \frac{\sum_{i=1}^{10} \sum_{j=1}^{10} \dfrac{|annot_{NT_i} \cap annot_{MT_j}|}{\min(|annot_{NT_i}|, |annot_{MT_j}|)}}{100} \qquad (6)$$

Another type of non-structural information that can be used to check a protein's functional linkage with another protein is the number of shared conserved motifs. Proteins usually have many conserved motifs with varied evolutionary histories. The number of common motifs conserved in two connected proteins represents a good opportunity to identify strong functional associations between them. We incorporate protein motif information from the widely used PROSITE database 42 by estimating a similarity score, $F^{(10)}$, between the homolog structures of each query protein, in the following way:

$$F^{(10)} = \frac{\sum_{i=1}^{10} \sum_{j=1}^{10} \dfrac{|Motifs_{NT_i} \cap Motifs_{MT_j}|}{\min(|Motifs_{NT_i}|, |Motifs_{MT_j}|)}}{100} \qquad (7)$$

The last type of non-structural similarity clue is the co-complex similarity. It is calculated by searching each homolog structure of each query protein for co-complex existence in the PIBASE database 43. PIBASE provides complex-centric information on proteins by identifying all protein structural interfaces, which are extracted from the widely used Protein Data Bank and PQS structure databases. In addition, it provides both chain-chain (SCOP) and domain-domain (CATH) interfaces. We calculate the co-complex similarity score, $F^{(11)}$, between the homolog structures of each query protein, in the following way:


$$F^{(11)} = \frac{\sum_{i=1}^{10} \sum_{j=1}^{10} \dfrac{|CO\text{-}COMP_{NT_i} \cap CO\text{-}COMP_{MT_j}|}{\min(|CO\text{-}COMP_{NT_i}|, |CO\text{-}COMP_{MT_j}|)}}{100} \qquad (8)$$

Step 8: PPI Prediction using Bayesian Networks:
Lastly, we use Bayesian classification to combine the mean and standard deviation values of our structural and non-structural scores. We chose a Bayesian classifier because of its strong performance on mutually independent feature sets. The scores are combined in a total of eleven variables, namely $F^{(k)}$, where k = {1, 2, ..., 11}. Let $P_i$ and $P_j$ be the query proteins whose interaction we want to predict, and let $F^{(1)}, F^{(2)}, \ldots, F^{(11)}$ be the random variables that capture different aspects of structural and non-structural association. The conditional probability that $P_i$ and $P_j$ interact, given the distribution of the random variables $F^{(1)}, F^{(2)}, \ldots, F^{(11)}$, is given by:

$$P(C_{ij} = 1 \mid F^{(1)}, F^{(2)}, \ldots, F^{(11)}) = \frac{P(F^{(1)}, F^{(2)}, \ldots, F^{(11)} \mid C_{ij} = 1)\, P(C_{ij} = 1)}{P(F^{(1)}, F^{(2)}, \ldots, F^{(11)})}$$

$$= \frac{\prod_{k=1}^{11} P(F^{(k)} \mid C_{ij} = 1)\, P(C_{ij} = 1)}{\prod_{k=1}^{11} P(F^{(k)} \mid C_{ij} = 1)\, P(C_{ij} = 1) + \prod_{k=1}^{11} P(F^{(k)} \mid C_{ij} = 0)\, P(C_{ij} = 0)} \qquad (9)$$

Here $P(C_{ij} = 1)$ is the prior probability that $P_i$ and $P_j$ interact, $P(F^{(1)}, F^{(2)}, \ldots, F^{(11)})$ is the probability that $P_i$ and $P_j$ have the features $F^{(1)}, F^{(2)}, \ldots, F^{(11)}$, and $P(F^{(1)}, F^{(2)}, \ldots, F^{(11)} \mid C_{ij} = 1)$ is the probability that $P_i$ and $P_j$ have the features $F^{(1)}, F^{(2)}, \ldots, F^{(11)}$ given that $P_i$ and $P_j$ interact.

All feature values $F^{(1)}, F^{(2)}, \ldots, F^{(11)}$ are normalized and binned so that they lie in known ranges, since many machine learning algorithms, Bayes classification in particular, produce better results when continuous attributes are discretized. Finally, using equation 9, we calculate the value of $P(C_{ij} = 1 \mid F^{(1)}, F^{(2)}, \ldots, F^{(11)})$ for each protein pair $P_i$ and $P_j$.

4 Results
The proposed algorithm is evaluated on the interaction network of Saccharomyces cerevisiae proteins extracted from two widely used interaction databases, namely the DIP database 15 and the IntAct database 16. For each interaction link (edge) in the network, the algorithm gives a probability derived by combining structural and non-structural data sources. For each arbitrary pair of proteins, say $P_i$ and $P_j$, our algorithm calculates eleven features (8 from structural data and 3 from non-structural data sources). The mutually independent features are combined using a Bayesian classifier, which yields a probability estimate (using equation 9) of whether the two proteins ($P_i$ and $P_j$) interact or not.

4.1 Interaction Supporting Evidences
The Saccharomyces cerevisiae interaction network consists of 24,570 interactions in the IntAct database and 27,240 interactions in the DIP database, as of November 2016. Protein interaction network databases contain false positive interactions, which are a bottleneck in assessing the overall performance of an algorithm and in judging the statistical significance of the experiments conducted. To deal with this limitation, we filtered the interaction network to include only high confidence interactions, with the criterion that each interaction in the network must be supported by at least eight methods (both experimental and electronic annotation). We chose a threshold of 8 because at this value a considerable number of interactions remain in the interactome while optimum accuracy is achieved. We call these high confidence interactions because each is supported and validated by at least 8 methods. The interaction network contains proteins from the same species (Saccharomyces cerevisiae) as well as from other species, namely Arabidopsis thaliana, Homo sapiens, Rattus norvegicus, and Dictyostelium discoideum.

4.2 Evaluation Measures
To evaluate prediction performance we use five-fold cross validation. For each yeast interactome retrieved from the IntAct and DIP databases, we randomly divide the dataset into five partitions (one for testing and the remaining four for training our model) and repeat the process five times. For each protein pair $P_i$ and $P_j$ in the interaction dataset, we assume the interaction of $P_i$ and $P_j$ to be unknown and then try to predict it by means of our algorithm. Lastly, we compare the predicted interactions with the true interaction set. To assess our methodology, we compute standard performance measures, namely precision, recall, accuracy and F1 score, using the following formulas:

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

and

$$F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.
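The four evaluation measures defined above can be computed directly from the confusion counts. The following Python sketch is only an illustration: the function name `classification_metrics` and the example counts are hypothetical, and zero denominators are not handled.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas (not results from the paper).
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
print(metrics)
```

In a five-fold cross validation setting, such a function would typically be applied to the counts of each test partition, or to the counts pooled over all five partitions.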

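The posterior in equation 9 can be illustrated with a small naive Bayes sketch over binned features. Everything below is an assumption made for illustration and not the authors' implementation: the function names, the five equal-width bins, the add-one (Laplace) smoothing, and the toy training data with only two features instead of eleven.

```python
from collections import Counter

N_BINS = 5  # number of equal-width bins; an assumption of this sketch


def to_bin(value, n_bins=N_BINS):
    """Map a normalized feature value in [0, 1] to a bin index in 0..n_bins-1."""
    return min(int(value * n_bins), n_bins - 1)


def fit_naive_bayes(samples, labels, n_bins=N_BINS):
    """Estimate the priors P(C) and the bin counts needed for P(F^(k) | C)."""
    n_features = len(samples[0])
    class_counts = Counter(labels)
    # bin_counts[c][k][b]: number of class-c pairs whose k-th feature falls in bin b
    bin_counts = {c: [Counter() for _ in range(n_features)] for c in (0, 1)}
    for features, c in zip(samples, labels):
        for k, value in enumerate(features):
            bin_counts[c][k][to_bin(value, n_bins)] += 1
    priors = {c: class_counts[c] / len(labels) for c in (0, 1)}
    return priors, class_counts, bin_counts


def posterior_interaction(features, model, n_bins=N_BINS):
    """Evaluate equation 9: P(C = 1 | F) with conditionally independent features."""
    priors, class_counts, bin_counts = model
    joint = {}
    for c in (0, 1):
        p = priors[c]
        for k, value in enumerate(features):
            count = bin_counts[c][k][to_bin(value, n_bins)]
            # Add-one smoothing (assumed here) so unseen bins do not zero the product.
            p *= (count + 1) / (class_counts[c] + n_bins)
        joint[c] = p
    return joint[1] / (joint[0] + joint[1])


# Toy training data: (feature 1, feature 2) per protein pair; label 1 = interacting.
samples = [(0.9, 0.85), (0.82, 0.9), (0.7, 0.95),
           (0.1, 0.25), (0.25, 0.1), (0.15, 0.3)]
labels = [1, 1, 1, 0, 0, 0]
model = fit_naive_bayes(samples, labels)

p_high = posterior_interaction((0.85, 0.9), model)  # about 0.923
p_low = posterior_interaction((0.1, 0.12), model)   # about 0.143
print(p_high >= 0.5, p_low >= 0.5)                  # True False
```

Comparing the posterior with 0.5 mirrors the decision rule used in the cross validation analysis, where pairs with a probability estimate of at least 0.5 are treated as interacting.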

Fig. 3 Graph showing comparison of precision values for Saccharomyces cerevisiae's interactome extracted from DIP and IntAct databases. [Plot not reproduced: precision values (%) versus # of supporting evidences for each type of interaction (1 to 8); series: IntAct Database, DIP Database.]

Fig. 4 Graph showing comparison of accuracy values for Saccharomyces cerevisiae's interactome extracted from DIP and IntAct databases. [Plot not reproduced: prediction accuracy (%) versus # of supporting evidences for each type of interaction (1 to 8); series: IntAct Database, DIP Database.]

4.3 Cross Validation Analysis of Prediction Accuracy
For each interaction network obtained from the IntAct and DIP databases, we predict protein-protein interactions using 5-fold cross validation. For each protein pair $P_i$ and $P_j$ in each interactome, we calculate the probability that protein $P_i$ interacts with $P_j$ using equation 9. Predicted protein interactions with a probability estimate greater than or equal to 0.5 are considered positive interactions, whereas protein pairs with a lower probability estimate are classified as non-interacting pairs. In the following sub-sections we present separate comparisons of prediction accuracy and precision values for the IntAct and DIP databases.

4.3.1 Precision and Accuracy of IntAct and DIP Proteins
To remedy the false positives inherently present in interaction databases, we parse interactions according to their number of supporting evidences. For each yeast interactome from the DIP and IntAct databases, we made eight interactomes in the following way: Interactome1 with at least 1 experimental evidence, Interactome2 with at least 2 experimental evidences, and so on up to Interactome8 with at least 8 experimental evidences. To assess the ability of our model to predict both positive and negative interactions, we use two performance measures, namely accuracy and precision. We report the values of each measure obtained by applying our model to the different parsed interactomes.

The precision and accuracy values for DIP and IntAct database proteins can be found in figure 3 and figure 4, respectively. Figure 3 contains two curves depicting precision values for the different parsed networks. It is evident from the graph that, for both databases, the precision values increase as we increase the number of evidences supporting each interaction present in the network. Our model achieves the best accuracy for the network with eight evidences per interaction. Likewise, in figure 4 we see the same trend for accuracy values for both databases, i.e., accuracy values increase as we increase the number of evidences supporting each interaction present in the network. Looking at the precision and accuracy values in both graphs, we can conclude that IntAct protein interactions are more reliable (less noisy) than DIP database protein interactions.

By applying our algorithm to the high confidence interaction network (consisting of proteins with eight experimental evidences per interaction) retrieved from the IntAct database, we obtained an overall accuracy of 95%, recall of 75.0%, precision of 82.2% and an F1 score of 78.48%. Likewise, when we benchmarked our algorithm on the high confidence interaction network (containing proteins with eight experimental evidences per interaction) obtained from the DIP database, our technique achieved an overall accuracy of 82%, recall of 72%, precision of 74.05% and an F1 score of 73.01%.

4.4 Comparison with other approaches
We compare our technique with the most recently proposed algorithm for protein-protein interaction prediction, namely the PrePPI approach devised by Q. C. Zhang et al. 13. The PrePPI technique combines structural and non-structural information to predict protein-protein interactions. In this technique, the authors use Bayesian reasoning to calculate a likelihood ratio using features extracted from structural and non-structural information. The concepts of Bayesian reasoning are used by a majority of techniques, in a variety of ways, to combine many different types of biological information into a probabilistic estimate for PPI prediction. Since PrePPI uses Bayesian reasoning and has reported promising benchmark results, we chose the PrePPI algorithm for comparison with our method.

We calculate four evaluation measures, namely precision, recall, accuracy and F1 score, to compare our method with the PrePPI method. In figures 5 and 6 we show results separately for Saccharomyces cerevisiae's protein interaction networks extracted from the DIP and IntAct databases, respectively. It is pertinent to mention here that the networks were parsed to retain interactions with at least eight supporting evidences per interaction, so the parsed interactomes contained only high confidence interactions. It can clearly be seen from the results for both databases that our algorithm outperforms PrePPI in all aspects, i.e., precision, recall, accuracy and F1 score. The higher precision of our algorithm is


attributed to an increased number of true positive predictions, while the higher accuracy can mainly be attributed to higher numbers of both true positive and true negative predictions. Another important measure is recall, which quantifies how many of the true interactions present in the dataset were recovered rather than missed by our algorithm, i.e., it incorporates not just true positive predictions but also false negative predictions (interactions missed by our algorithm). Because our algorithm produces quite a small number of false negatives, its recall is much higher than that of PrePPI. Lastly, we combine recall and precision into one measure that quantifies the overall statistical strength of our method, i.e., the F1 score. Since our method outperforms PrePPI in both precision and recall, it ultimately has a better F1 score for both the DIP and IntAct datasets (as can be seen in figures 5 and 6).

Fig. 5 Comparison of Accuracy, Recall, Precision, and F1 score of our method with the PrePPI algorithm for DIP database proteins. [Bar chart not reproduced: relative % values for Precision, Recall, Accuracy and F1 Score; series: Our Method, PrePPI Method.]

Fig. 6 Comparison of Accuracy, Recall, Precision, and F1 score of our method with the PrePPI algorithm for IntAct database proteins. [Bar chart not reproduced: relative % values for Precision, Recall, Accuracy and F1 Score; series: Our Method, PrePPI.]

The improved performance of our algorithm can be credited to a number of factors. Most importantly, protein binding sites are very helpful in identifying the functional sites through which proteins interact. In addition, other types of structural information, e.g., interfacing residues and interacting residues, also contribute to effective modeling of the interaction mechanism. Lastly, the non-structural clues that we use, namely co-complex similarity, gene ontology similarity and motif similarity, are more relevant to function, and since proteins interact to perform functions, these clues are very useful in the accurate identification of proteins' mutual interactions.

5 Discussion
5.1 Gene Ontology based Analysis of Predicted PPIs
A protein's function refers to the molecular, biological or cellular aspects that the protein is involved in, including how it interacts with other molecules (such as substrates, pathogens and other small compounds). Proteins interact to perform a multitude of such functions. An important question is which types of functional interactions can be precisely predicted by our model. For our analysis, we use the widely used Gene Ontology (GO) classification scheme 44, because of its desirable properties, e.g., wide coverage, disjoint categories and a standardized format. We choose the molecular function ontology from the gene ontology, which is a hierarchical set of functions called terms, arranged in a Directed Acyclic Graph (DAG) structure. Gene ontology is suitable for large scale computational studies because of its consistency across species.

Fig. 7 The top 5 experimentally verified GO categories of Saccharomyces cerevisiae's proteins. [Diagram not reproduced: GO molecular function hierarchy with the nodes molecular function; catalytic activity, structural molecule activity, binding; hydrolase activity, transferase activity; heterocyclic compound binding, organic cyclic compound binding; nucleic acid binding; RNA binding, DNA binding.]

We present our analysis for the five most frequent experimentally verified GO annotations of Saccharomyces cerevisiae proteins, namely hydrolase activity (820 proteins), transferase activity (784 proteins), RNA binding (750 proteins), DNA binding (379 proteins), and structural molecule activity (344 proteins). We chose only these five GO terms because of their broad coverage. It is pertinent to mention here that almost 90% of the proteins in yeast fall under the selected terms, i.e., their annotations are special cases of one of them. Each term's detailed annotation in the gene ontology is presented in Figure 7. For each selected GO term, we extract functional sub-modules from the yeast PPI data and apply our technique to the individual modules. The accuracies obtained were 96% for DNA binding, 97% for RNA binding, 84% for hydrolase activity, 88% for transferase activity, and 79% for structural molecule activity. Among all extracted PPI sub-modules, the binding sub-modules, namely DNA binding and RNA binding, outperformed the other sub-modules. This increased performance of our model can be attributed to two factors. First, the quality of the interactions in the sub-module contributes to improved accuracy, as all interactions are either manually curated or experimentally verified. Second, since binding activity is strongly related to protein binding sites, the precise integration of protein binding sites with structural and non-structural information resulted in improved accuracy.


5.2 Identification of protein complexes and annotation of uncharacterized proteins
Accurately inferred PPIs can be used as a powerful tool to understand biological processes. One important application of PPIs is to enhance the understanding of the functions of many uncharacterized proteins. In this study, we adopt a strategy (as in ref. 45) to determine the functions of uncharacterized proteins based on two steps:
1) use accurately inferred PPIs to identify the protein complexes;
2) annotate uncharacterized proteins based on their presence in the identified protein complexes.
For Saccharomyces cerevisiae's protein interaction network, we used the K-means clustering algorithm to identify protein complexes. A total of 570 complexes were identified (with a cut-off of 5 proteins per complex). After retrieving the protein complexes, we calculate the entropy of each functional category (of known proteins) present in each complex. The functions that are consistent within a complex are candidates for annotation transfer. With this, our method annotated 789 uncharacterized proteins of Saccharomyces cerevisiae (for detailed annotations see Table 1 in the Supplementary Data.pdf file).

6 Acknowledgment
We are thankful to Mr. Usman Zafar for contributing to a partial implementation of this method as part of his undergraduate thesis work.

7 Conclusion
In this work, we presented a novel approach that combines structural and non-structural biological data to precisely predict protein interactions. The conceptual novelty of our approach lies in identifying and precisely associating both structural and non-structural information that provide substantial interaction clues. Using structural information, we build a model that uses protein binding sites to link individual residues in structural templates. In addition, the non-structural clues that we integrate, namely co-complex similarity, gene ontology similarity and motif similarity, significantly enhance the prediction accuracy of our method. Our model combines structural and non-structural information using Bayesian statistics to calculate the likelihood of interaction. The proposed model is tested on Saccharomyces cerevisiae's interactions extracted from the DIP and IntAct databases and provides substantial improvements in terms of accuracy, precision, recall and F1 score compared with previous state-of-the-art approaches. The proposed technique can easily be extended to integrate more evidences (both structural and non-structural) to improve the interaction prediction process.

Conflict of interest
There are no conflicts to declare.

References
1 T. Ito, T. Chiba, R. Ozawa et al., Proc Natl Acad Sci USA, 2001, 98, 4569–74.
2 J. N. Pelletier, K. Arndt, A. Pluckthun et al., Nat Biotechnol, 1999, 17, 683–90.
3 G. Rigaut, A. Shevchenko, B. Rutz et al., Nat Biotechnol, 1999, 17, 1030–32.
4 B. A. Shoemaker and A. R. Panchenko, PLOS Comput. Biol., 2007, 3(3), e42.
5 B. A. Shoemaker and A. R. Panchenko, PLOS Comput. Biol., 2007, 3(3), e43.
6 L. Salwinski and D. Eisenberg, Curr. Opin. Struct. Biol., 2003, 13, 377–382.
7 H. Ur Rehman, U. Zafar, A. Benso and N. Islam, BIOINFORMATICS, 2016, pp. 237–244.
8 H. Ur Rehman, N. Azam, J. Yao and A. Benso, PLOS ONE, 2017.
9 P. Braun et al., Nature Methods, 2009, 6, 91–97.
10 C. M. Deane, L. Salwinski, I. Xenarios and D. Eisenberg, Mol. Cell. Proteomics, 2002, 1, 349–356.
11 Q. C. Zhang, D. Petrey, R. Norel and B. Honig, Proc. Natl Acad. Sci. USA, 2010, 107, 10896–10901.
12 M. Wass, G. Fuentes, C. Pons, F. Pazos and A. Valencia, Mol. Syst. Biol., 2011, 7, 469.
13 Q. C. Zhang, D. Petrey et al., Nature, 2012, 490(7421), 556–60.
14 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne, Nucleic Acids Research, 2000, 28, 235–242.
15 L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie and D. Eisenberg, Nucl. Acids Res., 2004, 32, 449–51.
16 S. Orchard et al., Nucl. Acids Res., 2013, 42(D1), D358–D363.
17 A. Valencia and F. Pazos, Methods Biochem Anal, 2003, 44, 411–26.
18 A. Benso, S. Di Carlo, H. Ur Rehman, G. Politano, A. Savino and A. Vasciaveo, 2013, pp. 397–404.
19 H. Ur Rehman, A. Benso, S. Di Carlo, G. Politano, A. Savino and P. Suravajhala, 2012, pp. 497–502.
20 H. Ur Rehman, A. Benso, S. Di Carlo, G. Politano, A. Savino and P. Suravajhala, Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference, 2012, pp. 1–4.
21 J. F. Xia, K. Han and D. S. Huang, Protein Pept Lett, 2010, 17(1), 137–45.
22 L. Wang et al., OncoTarget, 2016.
23 L. Burger and E. V. Nimwegen, Mol Syst Biol, 2008, 4, 165.
24 J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu et al., Proceedings of the National Academy of Sciences, 2006, 104, 4337–4341.
25 J. Sun, J. Xu, Z. Liu, Q. Liu, A. Zhao et al., Bioinformatics, 2005, 21(16), 3409–3415.
26 D. R. Rhodes, S. A. Tomlins and S. Varambally, Nature Biotechnology, 2005, 23, 951–959.
27 R. Schweiger, M. Linial and N. Linial, Oxford Journals, 2011, 27.
28 M. S. Scott and G. J. Barton, BMC Bioinformatics, 2007, 8.
29 X. W. Chen and M. Liu, Bioinformatics, 2005, 21, 4394–4400.


30 J. Espadaler, O. Romero, R. M. Jackson et al., Oxford Journals, 2005, 21(16), 3360–3368.
31 M. Hue, M. Riffle, J. P. Vert and W. S. Noble, BMC Bioinformatics, 2010, 11.
32 A. S. Aytuna, A. Gursoy and O. Keskin, Bioinformatics, 2005, 21(12), 2850–2855.
33 A. Benso, S. Di Carlo, H. Ur Rehman, G. Politano, A. Savino and P. Suravajhala, Proteome Science, 2013, 11, 1–12.
34 A. Mitrofanova, V. Pavlovic and B. Mishra, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2011, 8(3), 775–784.
35 Y. Jiang, T. Ronnen Oron, H. Ur Rehman et al., Genome Biology, 2016, 17.
36 The UniProt Consortium, UniProt: a hub for protein information, Nucleic Acids Res., 2015, 43, D204–D212.
37 S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, Journal of Molecular Biology, 1990, 215, 403–410.
38 T. Madej, C. J. Lanczycki, D. Zhang, P. A. Thiessen, R. C. Geer, A. M. Bauer and S. H. Bryant, Nucleic Acids Res., 2013, 42(D1), D297–D303.
39 N. Tuncbag, A. Gursoy, R. Nussinov and O. Keskin, Nature Protocols, 2011, 6(9), 1341–1354.
40 M. Shatsky, R. Nussinov and H. J. Wolfson, Proteins: Structure, Function, and Bioinformatics, 2004, 56, 143–156.
41 A. Golovin and K. Henrick, BMC Bioinformatics, 2008, 9, 1–11.
42 N. Hulo, A. Bairoch et al., Nucl. Acids Res., 2006, 34, D227–230.
43 F. P. Davis and A. Sali, Bioinformatics, 2005, 21, 1901–1907.
44 Gene Ontology Consortium, Nucleic Acids Research, 2015, 43, D1049–D1056.
45 Y. Han, J. Song et al., Nature Scientific Reports, 2016.
