Knowledge Graphs and Their Applications in Drug
Finlay MacLean
To cite this article: Finlay MacLean (2021): Knowledge graphs and their applications in drug
discovery, Expert Opinion on Drug Discovery, DOI: 10.1080/17460441.2021.1910673
REVIEW
CONTACT Finlay MacLean [email protected] BenevolentAI, 4-8 Maple St, Bloomsbury, London W1T 5HD, United Kingdom of Great Britain and
Northern Ireland.
© 2021 Informa UK Limited, trading as Taylor & Francis Group
2 F. MACLEAN
recommendation, and information retrieval systems. Arguably the most famous of commercial question-answering systems is Watson, developed by IBM to beat human experts at the quiz show Jeopardy [25]. In terms of recommendation systems, Pinterest have famously used a KG of user-likes-pin to recommend new pins to their user-base [26].
In the field of drug discovery, one of the earliest notable attempts to integrate multiple structured biomedical databases was the work of Himmelstein et al., developing Hetionet to prioritize drugs for repurposing [28**] and genes associated with disease [29]. Other KGs include OpenBioLink, principally used to benchmark link prediction models [30], and the work of Womack et al. [31]. Whilst the integration of structured databases has proven its utility, others have derived biological relationships from literature. The Global Network of Biomedical Relationships [32*] screened 24 million research articles to create a disease-gene-chemical KG consisting of 2 million thematically-labeled edges. Biomedical KGs can contain a multitude of multimodal data spanning transcriptomics, proteomics, genomics, phenomics, drug pharmacology, chemistry, and ontological information. The schema of the Drug Repurposing Knowledge Graph [33] exemplifies the heterogeneity of data common in KGs for drug discovery. The majority of large-scale biomedical KGs are based on semantic web technologies, the largest of which is Bio2RDF [34]. One of the defining features of semantic KGs is their extensibility, as demonstrated by projects such as Chem2Bio2RDF, which combined Bio2RDF with a chemogenomic semantic graph [35].
There seems to be no agreed-upon definition of a KG. Some restrict the name to literature-derived graphs. Others go even further, using KG to refer only to graphs that use semantic technologies to represent the data. In this article, we define a KG as any heterogeneous information network, regardless of the technology used and the provenance of the data it represents.
1.4. Graph machine learning on knowledge graphs
Marshall Nirenberg famously stated that science progresses best using simple assays to rapidly generate large data sets [19]. Whilst large-scale and genome-wide screens are certainly the gold standard of systematic drug discovery, their high costs often prohibit their use for all but the most common (and thus profitable) of diseases. Machine learning has demonstrated its potential as a complementary approach: rapidly and inexpensively generating data in unexplored areas of the biological and chemical space. In particular, graph-based machine learning (GML; also referred to as geometric machine learning) methods have shown promise in this field. Representing biological systems as KGs has allowed for the exploitation of graph theory and powerful network science algorithms, drawing new insights from this otherwise siloed data. GML has been applied to systematically screen compounds for new interactions, and to shed light upon unknown areas of the human interactome.
GML uses the topological structure of the network to classify properties of nodes, predict the existence of edges, and detect communities [36*]. It assumes that functionally or structurally similar nodes will be more highly connected within the graph. In a disease-gene association network, comorbidities will share a higher number of associated genes, implying their functional similarity. In contrast, two diseases that have no intersection of associated genes are unlikely to exhibit functional homogeneity. Traditional approaches such as Common Neighbor, Jaccard’s Index, Adamic/Adar Index and Katz compute similarities between nodes based on local neighborhoods [37], but fail to utilize the global neighborhood and topology of the graph. In recent years, embedding-based GML has become the norm. Embedding strategies encode the continuous neighborhood information of a node, graph substructure, or entire graph into a discrete low-dimensional latent vector [38]. To refer back to the previous exemplary disease-gene association network, two comorbid diseases will be encoded with embeddings (vectors) that are mathematically similar in latent vector space, since they are proximal within the network. For a comprehensive survey of graph embeddings, we refer the reader to these reviews [34*] [38].
Embedding strategies have now developed to encode heterogeneous graphs, often referred to as knowledge graph embeddings (KGEs). Aside from representing nodes as latent vectors, a low-dimensional representation of each relationship type is also learned. A myriad of methods have now been employed to generate KG embeddings. A review of these methodologies is outside the remit of this manuscript. We refer the reader to the comprehensive review of Rossi et al. [24] and a repository of KGE models [39].
2. Applications of knowledge graphs
KGs have emerged as an effective method of information representation in drug discovery. Modeling biological systems as graphs has facilitated the use of powerful network-based algorithms, encoding the continuous global or local neighborhood of nodes into discrete latent vectors. The vectors are then used in a downstream machine learning task. Supervised downstream tasks include link prediction (pairwise prediction between two nodes), and node classification (classification of one node). Embeddings may also be used in unsupervised tasks such as community detection (the detection of neighborhoods of nodes via clustering). These methods make the guilt-by-association assumption: that functionally or structurally similar biological entities are likely to have similar properties, have high network proximity, and have a small distance between node embeddings in vector space.
2.1. Drug repurposing: a link prediction task
The overwhelming majority of applications using KGs are framed as link prediction tasks. Our knowledge of biology is incomplete, and the resulting information networks are sparsely populated. Link prediction on incomplete networks aims to systematically complete these networks, in which predicted edges represent biological interactions or associations that are currently unknown, unexplored or yet to be validated. Figure 2 illustrates the training process of a link prediction KG embedding model.
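As a minimal sketch of how an embedding-based link prediction model can score candidate edges, the snippet below uses a TransE-style translational score, in which a triple (head, relation, tail) is plausible when head + relation lies close to tail in vector space. The entity names, the relation name and the hand-set two-dimensional vectors are all hypothetical and exist only for illustration; a real model would learn the embeddings from the KG, and the models discussed in this review may score triples differently.

```python
import math

# Toy 2-D embeddings chosen by hand for illustration (assumption);
# a trained model would learn these vectors from the graph.
ENTITY = {
    "drug_a": (0.0, 0.0),
    "disease_x": (1.0, 0.0),
    "disease_y": (0.0, 3.0),
}
RELATION = {"treats": (1.0, 0.0)}


def transe_score(head, relation, tail):
    """Return the negative Euclidean distance between (head + relation) and tail.

    Scores closer to zero indicate a more plausible triple.
    """
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation, candidates):
    """Order candidate tail entities from most to least plausible."""
    return sorted(
        candidates,
        key=lambda tail: transe_score(head, relation, tail),
        reverse=True,
    )


if __name__ == "__main__":
    for disease in rank_tails("drug_a", "treats", ["disease_y", "disease_x"]):
        print(disease, transe_score("drug_a", "treats", disease))
```

Ranking every candidate tail for a fixed head and relation is one common way to turn such scores into a list of predicted edges for follow-up.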
nor disease associated genes are implicitly provided to the models and thus obfuscate the mechanism of action of the drug.
2.1.2. On-target repurposing and target identification
In contrast to target-agnostic approaches, target-based drug repurposing approaches are attractive alternatives. Diseases can be described as phenotypic manifestations caused by genomic perturbations. These genomic perturbations cause further dysregulation of genes and pathways. The aim of target-based drug discovery is to develop a compound to either directly or indirectly target one of these disease-causing or disease-associated genes. Similarly, the aim of on-target drug repurposing is to identify a preexisting drug to target one of these disease-causing or disease-associated genes. Both methods require one or more targets through which to act. GML methods have been widely applied to prioritize pathogenic genes. Himmelstein et al. applied the previously cited Hetionet KG to predict disease-causing genes for multiple sclerosis [29]. Even embedding strategies trained on relatively small heterogeneous networks have demonstrated their superiority over baseline approaches, as shown by Xu et al., who trained a multipath random walk model on a network of gene-phenotype, protein–protein interactions and phenotypic similarities [47].
KGE approaches have recently been applied to target identification. Pitalla et al. used a relation-weighted RotatE model to predict drug targets for Parkinson’s disease [48]. Notably, the model outperformed OpenTargets, the leading initiative for target identification, which includes genetic, pharmacologic, pathway, and multi-omics data. Similarly, Paliwal et al. developed Rosalind, a tensor factorization-based KG embedding trained on BenevolentAI’s proprietary biomedical knowledge to predict therapeutic targets for rheumatoid arthritis [49]. Rosalind outperformed OpenTargets alongside other GML approaches. Top predicted genes were experimentally validated in an in vitro assay using patient-derived cells. Five genes were determined to be promising for further preclinical research.
The utility of network-based approaches to target identification is well demonstrated by the above (both academia- and industry-driven) projects. The above methods are focused on identifying pathogenic protein-coding genes. Whilst protein drug targets remain the central focus of drug discovery, the role of non-coding RNA (ncRNA) in disease is becoming more apparent [50]. Researchers have started to exploit KGs and GML to predict ncRNA-disease associations. Ji et al. constructed a KG consisting of micro-RNA, circular-RNA, long non-coding RNA (lncRNA), proteins and diseases, and used a matrix factorization embedding model to predict miRNA-disease associations for three common cancers [51]. GML has also been applied to predict lncRNA-disease associations. In a similar project, Zhou et al. built a KG similar to that used by Ji et al., and trained a higher-order preserving matrix factorization model [52]. They validated their model by predicting disease-related lncRNAs for three excess death rate cancers.
2.1.3. Off-target repurposing and drug-target interaction
The one drug, one gene, one disease paradigm of drug discovery has passed. Many diseases are now understood to be multifactorial, caused by the combination of the effects of multiple genes. Most drugs are now estimated to bind to between 10 and 100 targets [48]. Polypharmacology is a promising paradigm in drug discovery, assuming drugs act through multiple genes associated with one or more pathomechanisms. Our knowledge of the pharmacogenomic space is sparsely populated [53], mainly limited to the disease-associated genes of interest, and genes common in safety panels (essential genes and those associated with undesired phenotypes). Genome-wide screens would be of great utility to understand the polypharmacology of existing drugs (off-target drug repurposing), and novel compounds (drug discovery). Due to the cost of experimental screens, many researchers have developed in silico methods to quickly and inexpensively screen for drug-target interactions. Link prediction methods have been used to predict drug-binds-target edges in pharmacological KGs. A random walk-based approach, DTINet, was used to identify the novel inhibitory action of three approved drugs on cyclooxygenase proteins [54]. The pharmaceutical industry is often interested in determining chemicals with little to no available interaction data: the so-called cold-start problem. Lim et al. developed a collaborative filtering approach tailored specifically to chemicals with few interactions [55]. In addition to the necessary pharmacogenomic data, network-based approaches have used genomic, chemical, pharmacological [54,56], side effects [57], diseases, pharmacokinetics, and proteomics [58]. KG embeddings have also been applied, screening approved drugs for off-target interactions with COVID-19 associated genes [33].
Many computational methods have been developed to perform in silico screens. The main advantage of network-based approaches is that they do not require the 3D structure of the protein. Deep learning approaches such as DeepPurpose, using only the primary amino acid sequence of the protein and SMILES string of the molecule, have been shown to be competitive methods of DTI prediction [59]. If the 3D structure of the protein is known, molecular docking studies provide an unparalleled level of information on how a drug interacts with the binding pocket of a protein. Deep learning has also been successfully applied to molecular docking, as exemplified by the commercialization of Atomnet by Atomwise [60]. Until recently, these approaches were limited to proteins with a known 3D structure. A deep learning method, AlphaFold, that uses amino acid sequence as input, has recently demonstrated accuracy comparable to experimental techniques such as X-ray crystallography [61]. This may widen the screening possibilities of structure-based approaches, overshadowing network-based approaches to DTI prediction.
2.1.4. COVID-19: A case study in network-based repurposing
Unlike its serendipitous counterpart, network-based drug repurposing is still waiting to see its first compound reach the market.
True validation that this method is an effective tool in drug discovery will come only once drugs identified are approved for their new indication, and a systematic review of the methodology has been conducted. Arguably, the most mature and substantial efforts to identify repurposable drugs have been focused on finding therapeutics to target the SARS-CoV-2 coronavirus or treat the associated COVID-19 disease. Network-based methods need a network on which to train, in this case capturing data pertaining to SARS-CoV-2. Reese et al. [62] integrated multiple structured databases into their biomedical KG. Next, they integrated datasets pertaining to COVID-19 (Zhou et al. [63], CORD-19 [64]). Ioannidis et al. [33] combined the preexisting KGs [26**], [30*] with additional databases. To predict the likelihood that an approved drug would treat COVID-19, the researchers trained a TransE embedding model on the KG, and then computed the distance scores between approved drugs and COVID-19 and similar coronaviruses, and between drugs and COVID-19-associated genes. Hsieh et al. extended the KG of Ioannidis et al. [33], integrating a SARS-CoV-2-specific graph into the original graph via transfer learning [65]. To discover drugs that can functionally target SARS-CoV-2-associated host genes, protein and drug embeddings were used to predict therapeutic effectiveness.
A multitude of network medicine approaches have been applied to identify existing drugs to palliate or treat COVID-19. The majority of these studies focus on predicting drugs that prevent viral entry (targeting viral genes), viral replicative mechanisms, or suppression of the host inflammatory response (both targeting host genes). One of the most noteworthy efforts was produced by Gysi et al. [66], who integrated host-host, host-viral, and host-drug protein interaction networks, using an ensemble of predictive models to predict 81 potential candidates to treat SARS-CoV-2. Their method successfully predicted that SARS-CoV-2 could manifest in brain tissue and have neurological comorbidities, which have since been validated [67]. Other researchers have similarly used network proximity of drug targets to viral proteins in protein–protein interactomes [63]. BenevolentAI used a proprietary literature-derived KG to identify baricitinib, a drug used to treat rheumatoid arthritis, to treat patients with bilateral COVID-19 pneumonia [68]. Since then, more than 12 clinical trials have been conducted, including by the drug’s proprietor, Eli Lilly. The janus kinase inhibitor has now been granted emergency use authorization (EUA) from the FDA for the treatment of hospitalized patients with COVID-19 [69]. The methodology employed by BenevolentAI is yet to be published, and drug authorization of baricitinib is only temporary. True validation for the use of KGs in drug repurposing will come only after (i) the FDA approval of drugs surfaced through KG and GML methods, and (ii) such methods have been subjected to the scientific method. Nevertheless, the numerous aforementioned studies certainly suggest that the inclusion of KG-based methods into drug discovery workflows could be of great benefit.
2.2. Additional link prediction applications
The utility of the method is reflected in its adoption in industry. Whilst the above applications of both drug repurposing and early drug discovery have been employed in industry, many other applications remain as academic research projects. KGs and GML techniques have been widely used in academia and applied to further pharmacological and multi-omic prediction tasks, including prediction of protein–protein interaction [70–72], polypharmacy side effects [73], disease side effects [74], and drug–drug interactions [75]. An exhaustive summary of all of these is out of the remit of this review. We refer the reader to the review of Su et al. [76].
2.3. Node classification applications
Whilst most applications of KGs in drug discovery are framed as link prediction tasks, node classification has also demonstrated its utility. Node classification describes the process in which a model is trained on features derived from a subset of nodes with a labeled property, and subsequently predicts the likelihood that unlabeled nodes possess this property. To the best of our knowledge, there are few examples of the adoption of these methods by industry or public biological databases; they nevertheless demonstrate the diverse range of ways in which KGs can be applied to drug discovery.
2.3.1. Protein function
Understanding protein function is one of the earliest prerequisites in the drug discovery process. Proteins with similar sequences tend to exhibit similar functions [77]. Also, proteins with similar sequences tend to interact with similar proteins within protein–protein interaction (PPI) networks. Thus, protein nodes with high network proximity tend to share protein function. The seminal paper for the node2vec model [78], a semi-supervised random walk model, showed how node embeddings and a downstream multi-label classifier could be effectively used to predict protein function, using the BioGRID PPI network [79]. Whilst node2vec used only a homogeneous network of PPIs, others have extended their work, including other forms of both graphical and complementary information. DeepGo uses both sequence and PPI networks to generate features for each protein [80]. Nariai et al. demonstrated that the integration of PPI networks, gene expression, protein motif information, gene knockout phenotype data and protein localization information yielded greater performance than homogeneous PPI networks [81]. Proteins do not execute all of their functions at all times, or in all tissues in which they are expressed [82]. Researchers developed OhmNet, using node2vec to generate embeddings based on tissue-specific PPIs, demonstrating improved performance over methods employing tissue-agnostic networks.
Node classification is far from the only computational method for predicting protein function. Protein function is directly related to its 3D structure. A deep learning method, AlphaFold, that uses amino acid sequence as input, has recently demonstrated accuracy comparable to experimental techniques such as X-ray crystallography [61]. Such advances may facilitate the overshadowing of network methods by deep learning methods which learn functions from 3D topology.
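The node classification workflow described in this section (learn from nodes with a known label, then predict the label of unlabeled nodes from their embeddings) can be sketched as follows. This is an illustrative simplification under stated assumptions: the protein identifiers, the two function labels and the two-dimensional embeddings are hypothetical, and a single-label nearest-centroid rule stands in for the multi-label classifiers used in the cited work.

```python
import math
from collections import defaultdict

# Hypothetical node embeddings (assumption); in practice these would come
# from an embedding model trained on a PPI network.
EMBEDDINGS = {
    "p1": (0.0, 0.1),
    "p2": (0.2, 0.0),
    "p3": (1.0, 0.9),
    "p4": (0.9, 1.1),
    "p5": (0.1, 0.2),  # unlabeled
    "p6": (1.1, 1.0),  # unlabeled
}
# Hypothetical function labels for the labeled subset of nodes.
LABELS = {"p1": "kinase", "p2": "kinase", "p3": "transporter", "p4": "transporter"}


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return tuple(sum(v[i] for v in vectors) / count for i in range(len(vectors[0])))


def fit_centroids(embeddings, labels):
    """Compute one centroid per label from the labeled nodes."""
    grouped = defaultdict(list)
    for node, label in labels.items():
        grouped[label].append(embeddings[node])
    return {label: centroid(vectors) for label, vectors in grouped.items()}


def predict(embeddings, centroids, node):
    """Assign the label whose centroid is nearest to the node's embedding."""
    return min(centroids, key=lambda label: math.dist(embeddings[node], centroids[label]))


if __name__ == "__main__":
    centroids = fit_centroids(EMBEDDINGS, LABELS)
    for node in ("p5", "p6"):
        print(node, predict(EMBEDDINGS, centroids, node))
```

The nearest-centroid rule is a direct expression of the guilt-by-association assumption: a node inherits the label of the group it sits closest to in embedding space.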
EXPERT OPINION ON DRUG DISCOVERY 7
other words, nodes which are highly connected are statistically more likely to be connected to other nodes, and thus have a higher prior probability of connection. To clarify, in a drug-treats-disease prediction task, imatinib would be highly predicted to treat diabetes type 1, mainly due to their connectivity, whilst the probability that marizomib treats diffuse intrinsic pontine glioma would be low, due to the rarity of both disease and drug. This issue is especially problematic when the prediction task is focused on predicting edge existence for low-degree nodes, for example, drug repurposing for rare diseases.
Training a model that depends on degree bias is not inherently problematic. If one imagines a PPI graph that accurately represents the biological system, hub genes have more connections, and thus are more likely to be connected to any other protein at random. A simplistic model using degree-derived features would accurately predict the connectivity of this network. The problem lies, however, when the degree distribution diverges from that of the underlying biological network. In this example, nodes that are over-represented in the graph will receive high probabilities, despite a lack of biological evidence that the node is central in the network. Simplistic models that depend on node degree fail to overcome this disparity between network and graph. In contrast, models that use network proximity are considerably more robust, and predictions are less likely to vary when the training graph distribution differs from the biological network. As we often do not know the ground truth, one can simply test a model on an external graph with a different biological distribution (but captures the same information). Researchers [26**] demonstrated how a degree-biased model achieved a near perfect score (AUROC of 0.979) when tested on the same graph distribution that it was trained on (a drug-treats-disease graph based on DrugBank [92]). However, when tested on a separate graph (a drug-treats-disease graph based on DrugCentral [93]), the model achieved a score close to random (AUROC of 0.541). In contrast, their model that utilized network proximity achieved similar scores across both graphs (AUROC of 0.974 and 0.855 on DrugBank and DrugCentral, respectively).
Literature-derived networks often have vastly different degree distributions to that of the underlying network. They rely on the extraction of biological relationships stated in academic manuscripts. The connectivity of nodes in these networks is therefore largely governed by the quantity of research that is conducted in that area, and not by the underlying biology. For example, more than three quarters of protein research focuses on the 10% of proteins that were known before the genome was mapped, even though many proteins have now been linked to disease [94]. There is no discernible biological difference between well-researched and under-researched genes, save their level of research interest and tool availability. Studies have shown that therapeutic opportunities [95] and essentiality [96] of well-researched genes are similar to those in the unstudied ’dark’ genome. Systematic genome-wide screens effectively eliminate research bias, and provide a much more accurate proxy for true distribution of biological networks. Comparisons between networks based on
systematic screens and literature highlight the disparity between literature-derived KGs and the biology they aim to represent [91**]. This example should serve as a warning to researchers or machine learning practitioners performing link prediction on graphs. One must stay vigilant to this bias, as it is often possible to achieve seemingly satisfactory results based solely on the degree of the network, without considering the topology of the network or the similarity of nodes.
3.2. Mitigation strategies
These results highlight the need for effective mitigation strategies to remove the reliance on degree-based features, instead encouraging models to learn network topology to infer the existence of edges. Whilst far from commonplace in literature, researchers have now addressed the problem of degree bias, mitigating the bias at distinct points in the model training process.
Multiple embedding strategies exist to encode node neighborhoods. Researchers have applied a degree penalty to prevent over-representation of high degree nodes using a random walk embedding and skip-gram model [97]. Others have used a Bayesian method, explicitly providing the prior probability alongside the adjacency matrix to the model, preventing the encoding of node degree in the embeddings [98,99]. Many graphs have only positive edges, and negative edges are created by either randomly sampling an edge from all possible non-edges (node pairs without a connection), or by corrupting a node in a positive edge, replacing a node at random. Importantly, it has been shown that, under uniform sampling, positive and negative samples do not have the same degree distribution. By sampling nodes from the global degree distribution, models are forced to differentiate edges by their network proximity, and not their network connectivity, ultimately leading to improved and less biased performance [100,101]. Whilst the above methods try to prevent information of node degree from being provided to the model, some researchers have removed reliance on degree by doing precisely the opposite. By explicitly providing the prior probability alongside network-based features to the model during training, the model relies on the degree-based features and does not learn them. During testing, a uniformly connected network is assumed, and a uniform prior is provided in place of the biased prior, yielding bias-free predictions [26**] [98].
4. Expert opinion
KGs have shown great promise in drug discovery, providing an answer to the pharmaceutical industry’s ’big data’ problem. KGs have opened the doors to the application of graph theory to drug discovery, harnessing powerful network algorithms to systematically ’fill in’ the unknown areas of the genome and draw novel insights into the genes and mechanisms that underpin disease. There has been significant research interest in KGs in both academia and industry, using them principally for target identification and drug repurposing. KGs are principally used for link prediction by employing KG embedding methods. Whilst these methods are continually evolving, they remain relatively immature. Hereinafter, we present the author’s opinion on the current shortcomings of KGs, the areas in which they need to be improved, and evaluate their utility in a drug discovery project.
KGs are fundamentally question answering tools. Questions such as “does drug X treat disease Y?” and “does gene X regulate disease Y?” have demonstrably been answered. However, this does not reflect the granularity and variety of questions asked by researchers and scientists in the drug discovery process. We need to work toward a universal and extensible system that can answer questions such as “given pathway X, which compounds agonize targets assayed in only functional assays with potency <1 mm?”, “given diseases with the shared pathogenic mechanism Y, which targets have failed clinical trials at Phase I or II and why?” and “for disease Z, which targets have ligands in different stages of the development process with publications and/or patents describing these compounds?” Work on this topic is already underway [102]. When KGs can answer these questions, their value will increase immeasurably.
Pathology is fascinatingly complex. This complexity is often not well-represented in KGs. Many research projects use publicly accessible KGs which provide a reductive model of disease (with edges such as drug-binds-gene, gene-associates-disease and drug-treats-disease). These graphs fail to represent either the genetic heterogeneity or the transient nature of disease. Whilst a drug repurposing link prediction model may successfully predict CFTR-associates-chronic_pancreatitis, ivacaftor-binds-CFTR, and ivacaftor-treats-chronic_pancreatitis, these generalizations do not reflect the complexity of the disease, or the prerequisites of ivacaftor to be an effective treatment. In reality, we want to be able to use a KG to approximate the causal reasoning of a team of researchers: “chronic pancreatitis is caused by loss of function of the CFTR gene. Mutations in CFTR cause an imbalance of calcium homeostasis, leading to early protease activation, fibrosis, inflammation and abdominal pain. Ivacaftor is used to treat a subset of cystic fibrosis patients via potentiation and correction of mutant CFTR, which restores the calcium homeostasis in endothelial cells. Patients with similar loss-of-function mutations in the CFTR gene could be treated with Ivacaftor. Whilst CFTR remains the main pathomechanism of chronic pancreatitis, other possible treatments include immunosuppressants,
antifibrotics, protease inhibitors, and analgesics”. A KG that can deliver this level of granularity would be a fundamental asset in any drug discovery company.
KGs are mainly used in conjunction with KG embedding models. These models are based on reasoning-by-association (also called guilt-by-association). This is distinctly different from a causal model of the underlying biological mechanism. The most informative paths of many network embedding-based models are not describing biological paths (e.g. drug-inhibits-gene-causes-disease), but instead are describing similarities between source and target nodes (e.g. drug-resembles-drug-treats-disease and disease-resembles-disease-treats-drug) [26**] [45]. Moreover, most relation inference models do not capture the directionality or trend of the edge [103]. Whilst researchers have developed models that produce inference paths between source and target nodes to approximate the biological path [39,104], such paths are often not well-correlated with the underlying causal biological path. Perhaps we should strive to move away from models that simply associate biological components, and more toward models that accurately describe the underlying biological system. This problem seems endemic in the wider field of artificial intelligence. Gary Marcus and Ernest Davis echoed the problem, stating “we need to stop building computer systems that merely get better and better at detecting statistical patterns . . . and start building computer systems that from the moment of their assembly innately grasp three basic concepts: time, space and causality” [105]. Whilst KG embedding remains an overpopulated area of research, with researchers competing to eke out the smallest increase in model performance, causal network reasoning remains a largely unexplored field. There have been a handful of notable network-based causal reasoning approaches that have been successfully applied to drug discovery [106–114]. We hope to see more causal models, built upon biologically-representative KGs.
In areas such as target identification, link prediction methods have demonstrated their utility in academia- and industry-led projects. Applications such as drug–target interaction, drug–drug interaction, and protein–ncRNA interaction remain academic exercises in graph theory, often surpassed by powerful deep learning approaches with features based solely on the physicochemical structures of the interacting entities (the power of which is exemplified by AlphaFold). We believe GML is best suited to the prediction of abstract entities such as diseases. Modeling physicochemical interactions should be left [...] them mathematically. Whilst we are increasingly creating more and more data pertaining to these systems, we currently cannot sufficiently model them. KGs are undoubtedly a useful framework on which to build such approaches. To be able to develop informative computational models, we must strive toward building KGs which describe the complex dynamic biological systems of the human body, how they are dysregulated in the disease state, and how therapeutics act upon the systems. Whilst the dog days of phenotypic-based drug discovery have not yet passed, the dawn of target-based discovery is certainly upon us. Biologically-representative KGs will be instrumental in the era of systems biology.
Acknowledgments
The author would like to express their gratitude to Delphine Rolando, Rachel Hodos and Dane Corneil. Their expertise in drug discovery, graph machine learning, and knowledge graphs was instrumental in writing this review. Lastly, we thank Daniel Miskell for his insight over the years.
Reviewer disclosures
Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.
Funding
This manuscript was supported by BenevolentAI.
Declaration of interest
F MacLean is a full-time employee of BenevolentAI. The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
ORCID
Finlay MacLean http://orcid.org/0000-0003-2779-179X
References
Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.
1. “Total global pharmaceutical R&D spending 2012–2026”. [cited 2021 Jul 03]. Available from: https://www.statista.com/statistics/309466/
to structure-based approaches. Whilst it was assumed that global-r-and-d-expenditure-forpharmaceuticals
2. “2020 FDA drug approvals,”. [cited 2021 Jul 03]. Available from:
graph embedding methods inferred edge existence via net https://www.nature.com/articles/d41573-021-00002-0
work proximity, it has become evident the overwhelming 3. “Ten years on: measuring the return from pharmaceutical innova
majority of their predictive power comes simply from the tion 2019,”. [cited 2021 Jul 03]. Available from: https://www2.
connectivity of the nodes, and not their local neighborhood. deloitte.com/us/en/pages/life-sciences-andhealth-care/articles/mea
This issue becomes especially problematic when using litera suring-return-from-pharmaceutical-innovation.html
4. Collins FS, Morgan M, Patrinos A. The human genome project:
ture-derived KGs, where link prediction models strive to lessons from large-scale biology. Science. 2003;300(5617):286–290.
approximate the biologically incorrect degree distribution of 5. 1000 G. P. Consortiumet al.. A map of human genome variation
a literature-derived network and not that of the underlying from population-scale sequencing. Nature. 2010;467(7319):1061.
biological system. Mitigation strategies, more appropriate eva 6. Canela-Xandri O, Rawlik K, Tenesa A. An atlas of genetic associa
luation metrics and less biased graphs are desperately needed tions in UK biobank. Nat Genet. 2018;50(11):1593–1599.
7. Leinonen R, Sugawara H, Shumway M, et al. The sequence read
to correct this problem. Network medicine is based on the archive. Nucleic Acids Res. 2010;39(suppl 1):D19–D21.
assumption that we can accurately model the biological sys 8. Leinonen R, Akhtar R, Birney E, et al. The european nucleotide
tems that govern disease; applying graph theory to describe archive. Nucleic Acids Res. 2010;39(suppl 1):D28–D31.
9. Ponten F, Jirstrom K, Uhlen M. The human protein atlas - a tool for pathology. J Pathol. 2008;216(4):387–393.
10. GTEx Consortium. The genotype-tissue expression (gtex) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–660.
11. Stathias V, Turner J, Koleti A, et al. Lincs data portal 2.0: next generation access point for perturbation-response signatures. Nucleic Acids Res. 2020;48(D1):D431–D439.
12. Tomczak K, Czerwinska P, Wiznerowicz M. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemp Oncol. 2015;19(1A):A68.
13. Ghandi M, Huang FW, Jane-Valbuena J, et al. Next-generation characterization of the cancer cell line encyclopedia. Nature. 2019;569(7757):503–508.
14. Tsherniak A, Vazquez F, Montgomery PG, et al. Defining a cancer dependency map. Cell. 2017;170(3):564–576.
15. Chadwick LH. The NIH roadmap epigenomics program data resource. Epigenomics. 2012;4(3):317–324.
16. Kozomara A, Griffiths-Jones S. MiRBase: integrating microrna annotation and deep-sequencing data. Nucleic Acids Res. 2010;39(suppl 1):D152–D157.
17. Volders P-J, Helsens K, Wang X, et al. Lncipedia: a database for annotated human lncrna transcript sequences and structures. Nucleic Acids Res. 2013;41(D1):D246–D251.
18. Cui T, Zhang L, Huang Y, et al. Mndr v2.0: an updated resource of ncrna–disease associations in mammals. Nucleic Acids Res. 2018;46(D1):D371–D374.
19. Earm K, Earm YE. Integrative approach in the era of failing drug discovery and development. Integr Med Res. 2014;3(4):211–216.
20. Rago L, Santoso B. Drug regulation: history, present and future. Drug Benefit Risks. 2008;2:65–77.
21. “Novartis CEO who wanted to bring tech into pharma now explains why it’s so hard”. [cited 2020 Sep 30]. Available from: https://www.forbes.com/sites/davidshaywitz/2019/01/16/novartis-ceo-who-wanted-to-bring-tech-into-pharma-now-explains-why-its-so-hard
22. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The fair guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1–9.
23. Iams WT, Lovly CM. Molecular pathways: clinical applications and future direction of insulin-like growth factor-1 receptor pathway blockade. Clin Cancer Res. 2015;21(19):4270–4277.
24. Rossi A, Firmani D, Matinata A, et al. Knowledge graph embedding for link prediction: a comparative analysis. arXiv Preprint arXiv:2002.00819. 2020.
25. Zou X. A survey on application of knowledge graph. JPhCS. 2020;1487(1):012016.
26. Gao Y, Li Y-F, Lin Y, et al. Deep learning on knowledge graph for recommender system: a survey. arXiv Preprint arXiv:2004.00387. 2020.
27. “Neo4j graph database”. [cited 2021 Sep 12]. Available from: https://neo4j.com
28. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife. 2017;6:e26726.
•• Article of high interest - This seminal paper represents one of the earliest attempts to train a link prediction model on a biomedical knowledge graph, to answer biological questions (in this case drug repurposing). This research area has matured significantly since this manuscript. The knowledge graph they developed is rather small using today’s standards, and research interest has moved away from pathway-based models to embedding-based models, in part due to their scalability. However, all researchers and practitioners working in this space would benefit from understanding the provenance of their work. The authors also highlight the prior probability of connection problem in their manuscript. This, however, is covered in more detail in their more recent work.
29. Himmelstein DS, Baranzini SE. Heterogeneous network edge prediction: a data integration approach to prioritize disease-associated genes. PLoS Comput Biol. 2015;11(7):e1004259.
30. Breit A, Ott S, Agibetov A, et al. OpenBioLink: a benchmarking framework for large-scale biomedical link prediction. arXiv Preprint arXiv:1912.04616. 2019.
31. Womack F, McClelland J, Koslicki D. Leveraging distributed biomedical knowledge sources to discover novel uses for known drugs. bioRxiv. 2019;765305.
32. Percha B, Altman RB. A global network of biomedical relationships derived from text. Bioinformatics. 2018;34(15):2614–2624.
• Article of interest - Despite being not without their shortcomings, literature-derived knowledge graphs are popular methods of rapidly generating biological knowledge graphs. This paper provides an effective method of knowledge graph generation from publicly available data sources. Their derived biological relationships are pleasingly complex, compared to other efforts. Their evaluation of literature-derived relationships against structured databases highlights the disparity between structured and unstructured data sources, and the need for effective edge harmonization methods.
33. Ioannidis VN, Song X, Manchanda S, et al. Drkg-drug repurposing knowledge graph for COVID-19. arXiv. 2020.
34. Belleau F, Nolin M-A, Tourigny N, et al. Bio2rdf: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform. 2008;41(5):706–716.
35. Chen B, Dong X, Jiao D, et al. Chem2bio2rdf: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics. 2010;11(1):255.
36. Yue X, Wang Z, Huang J, et al. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics. 2020;36(4):1241–1251.
• Article of interest - For any researcher wishing to learn the fundamentals of graph embeddings and their applications, this review is a must-read.
37. Gao F, Musial K, Cooper C, et al. Link prediction methods and their accuracy for different social networks and network metrics. Sci Programm. 2015;2015:1–13.
38. Cai H, Zheng VW, Chang K-C-C. A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Trans Knowledge Data Eng. 2018;30(9):1616–1637.
39. Xia X. “Knowledge Graph Embedding Methodologies”. [cited 2020 Jul 03]. Available from: https://github.com/xinguoxia/KGE#methodologies
40. Hodos RA, Kidd BA, Khader S, et al. Computational approaches to drug repurposing and pharmacology. Wiley Interdiscip Rev Syst Biol Med. 2016;8(3):186.
• Article of interest - This review highlights drug repurposing as a promising application of knowledge graphs. Knowledge graphs and associated graph machine learning approaches only constitute a few of the many computational approaches that have been used for drug repurposing. This manuscript provides a comprehensive summary of most other approaches.
41. Talevi A, Bellera CL. Challenges and opportunities with drug repurposing: finding strategies to find alternative uses of therapeutics. Expert Opin Drug Discov. 2020;15(4):397–401.
42. Wang L, Lei Y, Gao Y, et al. Association of finasteride with prostate cancer: a systematic review and meta-analysis. Medicine (Baltimore). 2020;99(15):e19486.
43. Jain P, Jain SK, Jain M. Harnessing drug repurposing for exploration of new diseases: an insight to strategies and case studies. Curr Mol Med. 2020;20. DOI:10.2174/1566524020666200619125404
44. Ganzer CA, Jacobs AR, Iqbal F. Persistent sexual, emotional, and cognitive impairment post-finasteride: a survey of men reporting symptoms. Am J Men’s Health. 2015;9(3):222–228.
45. Poleksic A. Overcoming sparseness of biomedical networks to identify drug repositioning candidates. bioRxiv. 2020.
46. Sosa DN, Derry A, Guo M, et al. A literature-based knowledge graph embedding method for identifying drug repurposing opportunities in rare diseases. bioRxiv. 2019;727925.
47. Xu B, Liu Y, Yu S, et al. A network embedding model for pathogenic genes prediction by multi-path random walking on heterogeneous network. BMC Med Genomics. 2019;12(10):188.
48. Gaudelet T, Day B, Jamasb AR, et al. Utilising graph machine learning within drug discovery and development. arXiv Preprint arXiv:2012.05716. 2020.
49. Paliwal S, De Giorgio A, Neil D, et al. Preclinical validation of therapeutic targets predicted by tensor factorization on heterogeneous graphs. Sci Rep. 2020;10(1):1–19.
50. Amaral PP, Dinger ME, Mattick JS. Non-coding rnas in homeostasis, disease and stress responses: an evolutionary perspective. Brief Funct Genomics. 2013;12(3):254–278.
51. Ji B-Y, You Z-H, Cheng L, et al. Predicting mirna-disease association from heterogeneous information network with grarep embedding model. Sci Rep. 2020;10(1):1–12.
52. Zhou J-R, You Z-H, Cheng L, et al. Prediction of lncrna–disease associations via an embedding learning hope in heterogeneous information networks. Mol Ther Nucleic Acids. 2020;23:277–285.
53. Zheng Y, Peng H, Zhang X, et al. Old drug repositioning and new drug discovery through similarity learning from drug-target joint feature spaces. BMC Bioinformatics. 2019;20(23):605.
54. Luo Y, Zhao X, Zhou J, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun. 2017;8(1):1–13.
55. Lim H, Gray P, Xie L, et al. Improved genome-scale multi-target virtual screening via a novel collaborative filtering approach to cold-start problem. Sci Rep. 2016;6(1):1–11.
56. Ba-Alawi W, Soufan O, Essack M, et al. Daspfind: new efficient method to predict drug–target interactions. J Cheminform. 2016;8(1):15.
57. Mizutani S, Pauwels E, Stoven V, et al. Relating drug–protein interaction network with drug side effects. Bioinformatics. 2012;28(18):i522–i528.
58. Wan F, Hong L, Xiao A, et al. Neodti: neural integration of neighbor information from a heterogeneous network for discovering new drug–target interactions. Bioinformatics. 2019;35(1):104–111.
59. Huang K, Fu T, Xiao C, et al. Deeppurpose: a deep learning based drug repurposing toolkit. arXiv Preprint arXiv:2004.08919. 2020.
60. Wallach I, Dzamba M, Heifets A. Atomnet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv Preprint arXiv:1510.02855. 2015.
61. Senior AW, Evans R, Jumper J, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710.
62. Reese JT, Unni DR, Callahan TJ, et al. KG-COVID-19: a framework to produce customized knowledge graphs for covid-19 response. Patterns. 2020;2(1):100155.
63. Zhou Y, Hou Y, Shen J, et al. Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov 2. Cell Discov. 2020;6(1):1–18.
64. Wang LL, Lo K, Chandrasekhar Y, et al. Cord-19: the covid-19 open research dataset. ArXiv. 2020.
65. Hsieh K, Wang Y, Chen L, et al. Drug repurposing for covid-19 using graph neural network with genetic, mechanistic, and epidemiological validation. arXiv Preprint arXiv:2009.10931. 2020.
66. Gysi DM, Valle ID, Zitnik M, et al. Network medicine framework for identifying drug repurposing opportunities for covid-19. arXiv Preprint arXiv:2004.07229. 2020.
67. Gasmi A, Tippairote T, Mujawdiya PK, et al. Neurological involvements of sars-cov2 infection. Mol Neurobiol. 202
68. Stebbing J, Phelan A, Griffin I, et al. Covid-19: combining antiviral and anti-inflammatory treatments. Lancet Infect Dis. 2020;20(4):400–402.
69. “Baricitinib receives emergency use authorization from the FDA for the treatment of hospitalized patients with COVID-19”. [cited 2021 Jan 02]. Available from: https://investor.lilly.com/news-releases/news-release-details/baricitinib-receives-emergency-use-authorization-fda-treatment
70. Kuchaiev O, Rasajski M, Higham DJ, et al. Geometric de-noising of protein-protein interaction networks. PLoS Comput Biol. 2009;5(8):e1000454.
71. Xiao Z, Deng Y. Graph embedding-based novel protein interaction prediction via higher-order graph convolutional network. PloS One. 2020;15(9):e0238915.
72. Yang F, Fan K, Song D, et al. Graph-based prediction of protein-protein interactions with attributed signed graph embedding. BMC Bioinformatics. 2020;21(1):1–16.
73. Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;34(13):i457–i466.
74. Lim H, Poleksic A, Xie L. Exploring landscape of drug-target-pathway-side effect associations. AMIA Summits Translat Sci Proceed. 2018:132–141.
75. Zhang W, Chen Y, Liu F, et al. Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinformatics. 2017;18(1):18.
76. Su C, Tong J, Zhu Y, et al. Network embedding in biomedical data science. Brief Bioinform. 2020;21(1):182–197.
77. Sangar V, Blankenberg DJ, Altman N, et al. Quantitative sequence-function relationships in proteins based on gene ontology. BMC Bioinformatics. 2007;8(1):294.
78. Grover A, Leskovec J. node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining; 2016; USA. p. 855–864.
79. Stark C, Breitkreutz B-J, Reguly T, et al. Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(suppl 1):D535–D539.
80. Kulmanov M, Khan MA, Hoehndorf R. Deepgo: predicting protein functions from sequence and interactions using a deep ontology aware classifier. Bioinformatics. 2018;34(4):660–668.
81. Nariai N, Kolaczyk ED, Kasif S. Probabilistic protein function prediction from heterogeneous genome-wide data. Plos One. 2007;2(3):e337.
82. Makrodimitris S, Van Ham RC, Reinders MJ. Automatic gene function prediction in the 2020’s. Genes (Basel). 2020;11(11):1264.
83. Goymer P. Why do we need hubs? Nat Rev Genet. 2008;9(9):651.
84. Chen S-J, Liao D-L, Chen C-H, et al. Construction and analysis of protein-protein interaction network of heroin use disorder. Sci Rep. 2019;9(1):1–9.
85. Dai W, Chang Q, Peng W, et al. Network embedding the protein–protein interaction network for human essential genes identification. Genes (Basel). 2020;11(2):153.
86. Lefranc F, Tabanca N, Kiss R. Assessing the anticancer effects associated with food products and/or nutraceuticals using in vitro and in vivo preclinical development-related pharmacological tests. In: Seminars in cancer biology. Vol. 46. Elsevier; 2017. p. 14–32.
87. Veselkov K, Gonzalez G, Aljifri S, et al. Hyperfoods: machine intelligent mapping of cancer-beating molecules in foods. Sci Rep. 2019;9(1):1–12.
88. Du J, Jia P, Dai Y, et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics. 2019;20(1):7–15.
89. Goh K-I, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA. 2007;104(21):8685–8690.
90. Cantini L, Medico E, Fortunato S, et al. Detection of gene communities in multi-networks reveals cancer drivers. Sci Rep. 2015;5(1):17386.
91. Zietz M, Himmelstein DS, Kloster K, et al. The probability of edge existence due to node degree: a baseline for network-based predictions. Manubot, Tech Rep. 2020.
•• Article of high interest - This paper provides the most comprehensive analysis of the problems arising from i) the degree imbalance in graphs with long-tailed distributions, and ii) the disparity between literature-derived biological networks and those derived from systematic screens. Of particular interest is the authors’ closed form approximation of the prior probability of connection. This allows researchers and industry
professionals to differentiate between predictions based on network connectivity and proximity at almost no computational cost.
92. Wishart DS, Knox C, Guo AC, et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36(suppl 1):D901–D906.
93. Avram S, Bologa CG, Holmes J, et al. DrugCentral 2021 supports drug discovery and repositioning. Nucleic Acids Res. 2021;49(D1):D1160–D1169.
94. Edwards AM, Isserlin R, Bader GD, et al. Too many roads not taken. Nature. 2011;470(7333):163–165.
95. Oprea TI, Bologa CG, Brunak S, et al. Unexplored therapeutic opportunities in the human genome. Nat Rev Drug Discov. 2018;17(5):317.
96. Hutchison CA, Chuang R-Y, Noskov VN, et al. Design and synthesis of a minimal bacterial genome. Science. 2016;351(6280):6280.
97. Feng R, Yang Y, Hu W, et al. Representation learning for scale-free networks. arXiv Preprint arXiv:1711.10755. 2017.
98. Kang B, Lijffijt J, Bie TD. Conditional network embeddings. arXiv Preprint arXiv:1805.07544. 2018.
99. Buyl M, De Bie T. Debayes: a bayesian method for debiasing network embeddings. arXiv Preprint arXiv:2002.11442. 2020.
100. Lerer A, Wu L, Shen J, et al. Pytorch-biggraph: a large-scale graph embedding system. arXiv Preprint arXiv:1903.12287. 2019.
101. Zheng D, Song X, Ma C, et al. Dgl-ke: training knowledge graph embeddings at scale. arXiv Preprint arXiv:2004.08532. 2020.
102. Hamilton WL, Bajaj P, Zitnik M, et al. Embedding logical queries on knowledge graphs. arXiv Preprint arXiv:1806.01445. 2018.
103. Lee B, Zhang S, Poleksic A, et al. Heterogeneous multi-layered network model for omics data integration and analysis. Front Genet. 2020;10:1381.
104. Lin XV, Socher R, Xiong C. Multi-hop knowledge graph reasoning with reward shaping. arXiv Preprint arXiv:1808.10568. 2018.
105. Bishop JM. Artificial intelligence is stupid and causal reasoning won’t fix it. arXiv Preprint arXiv:2008.07371. 2020.
106. Liu A, Trairatphisan P, Gjerga E, et al. From expression footprints to causal pathways: contextualizing large signaling networks with carnival. NPJ Syst Biol Appl. 2019;5(1):1–10.
107. Rivas-Barragan D, Mubeen S, Guim-Bernat F, et al. Drug2ways: reasoning over causal paths in biological networks for drug discovery. bioRxiv. 2020.
108. Vidal M, Cusick ME, Barabasi A-L. Interactome networks and human disease. Cell. 2011;144(6):986–998.
109. Broido AD, Clauset A. Scale-free networks are rare. Nat Commun. 2019;10(1):1–10.
110. Dorogovtsev S, Mendes J, Samukhin A. Generic scale of the “scale-free” growing networks. arXiv Preprint Cond-mat/0011115. 2000.
111. Rohani N, Eslahchi C. Drug-drug interaction predicting by neural network using integrated similarity. Sci Rep. 2019;9(1):1–11.
112. Wouters OJ, McKee M, Luyten J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. Jama. 2020;323(9):844–853.
113. Mohs RC, Greig NH. Drug discovery and development: role of basic biological research. Alzheimers Dementia. 2017;3(4):651–657.
114. Xue S, Lu J, Zhang G. Cross-domain network representations. Pattern Recogn. 2019;94:135–148.